Analyzing Textual Information at Scale

34 Pages Posted: 17 Sep 2019

Lin William Cong

Cornell University

Tengyuan Liang

University of Chicago - Booth School of Business

Xiao Zhang

University of Chicago - Booth School of Business

Date Written: August 1, 2019

Abstract

We survey recent advances in textual analysis for the social sciences. Count-based economic models, structured statistical tools, and plain-vanilla machine learning methods each have merits and limitations. Capturing complex linguistic structures in a data-driven way, while ensuring computational scalability and economic interpretability, requires a general framework for analyzing large-scale text-based data. We discuss recent attempts to combine the strengths of neural network language models, such as word embeddings, with those of generative statistical models, such as topic models. We also describe typical sources of text, applications of these methodologies to issues in finance and economics, and promising future directions.

Keywords: Big Data, Machine Learning, Text-based Analysis, Topic Models, Unstructured Data, Word Embedding

Suggested Citation

Cong, Lin and Liang, Tengyuan and Zhang, Xiao, Analyzing Textual Information at Scale (August 1, 2019). Available at SSRN: https://ssrn.com/abstract=3449822

Lin Cong (Contact Author)

Cornell University

Ithaca, NY 14853
United States

HOME PAGE: http://www.linwilliamcong.org

Tengyuan Liang

University of Chicago - Booth School of Business

Xiao Zhang

University of Chicago - Booth School of Business

5807 S. Woodlawn Avenue
Chicago, IL 60637
United States
