The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification

51 Pages Posted: 31 Mar 2020

Andrea Ferrario

Dep. Management, Technology, and Economics ETH Zurich; Mobiliar Lab for Analytics at ETH

Mara Naegelin

Mobiliar Lab for Analytics at ETH; Dep. of Management, Technology, and Economics ETH Zurich

Date Written: March 1, 2020

Abstract

In this tutorial we introduce three approaches to preprocessing text data with Natural Language Processing (NLP) and performing text document classification using machine learning. The first approach is based on 'bag-of-' models, the second on word embeddings, and the third introduces the two most popular Recurrent Neural Networks (RNNs), i.e. the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures. We apply all approaches to a case study in which we classify movie reviews using Python and TensorFlow 2.0. The results of the case study show that extreme gradient boosting algorithms outperform adaptive boosting and random forests on bag-of-words and word embedding models, as well as LSTM and GRU RNNs, but at a steep computational cost. Finally, we provide the reader with comments on NLP applications for the insurance industry.

Keywords: natural language processing, bag-of-words models, word embeddings, machine learning, recurrent neural networks, deep learning, Python, TensorFlow 2.0, Keras

JEL Classification: C45, C51, C52, G22

Suggested Citation

Ferrario, Andrea and Naegelin, Mara, The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification (March 1, 2020). Available at SSRN: https://ssrn.com/abstract=3547887 or http://dx.doi.org/10.2139/ssrn.3547887

Andrea Ferrario

Dep. Management, Technology, and Economics ETH Zurich

Zurich
Switzerland

Mobiliar Lab for Analytics at ETH

Zürich, 8092
Switzerland

Mara Naegelin (Contact Author)

Mobiliar Lab for Analytics at ETH

Zürich, 8092
Switzerland

Dep. of Management, Technology, and Economics ETH Zurich

ETH-Zentrum
Zurich, CH-8092
Switzerland
