Textual Factors: A Scalable, Interpretable, and Data-driven Approach to Analyzing Unstructured Information

65 Pages. Posted: 4 Jan 2019. Last revised: 4 Nov 2019

Lin William Cong

Cornell University - Samuel Curtis Johnson Graduate School of Management

Tengyuan Liang

University of Chicago - Booth School of Business

Xiao Zhang

University of Chicago - Booth School of Business

Date Written: September 1, 2019

Abstract

We introduce a general framework for analyzing large-scale text-based data, combining the strengths of neural-network language processing and generative statistical modeling. Our methodology generates textual factors by (i) representing texts using vector word embedding, (ii) clustering words using locality-sensitive hashing, and (iii) identifying spanning vector clusters through topic modeling. Our data-driven approach captures complex linguistic structures while ensuring computational scalability and economic interpretability. We also discuss applications of textual factors in (i) prediction and inference, (ii) interpreting (non-text-based) models and variables, and (iii) constructing new text-based metrics and explanatory variables, with illustrations using topics in finance and economics such as macroeconomic forecasting and factor asset pricing.
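The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings and document-term counts are random toy data, the random-hyperplane hash is one common form of locality-sensitive hashing, and a rank-one SVD within each cluster stands in for the topic-modeling step. The names `lsh_clusters` and `textual_factor` are hypothetical.

```python
import numpy as np

def lsh_clusters(embeddings, n_planes, rng):
    """Bucket word vectors by the sign pattern of random hyperplane projections."""
    planes = rng.standard_normal((embeddings.shape[1], n_planes))
    bits = (embeddings @ planes) > 0              # (n_words, n_planes) booleans
    keys = bits @ (1 << np.arange(n_planes))      # pack each sign pattern into an integer
    clusters = {}
    for word_idx, key in enumerate(keys):
        clusters.setdefault(int(key), []).append(word_idx)
    return list(clusters.values())

def textual_factor(doc_term, word_ids):
    """Leading singular direction of the counts restricted to one word cluster.

    Assumption: a rank-one SVD is used here as a simple stand-in for
    the topic-modeling step described in the abstract.
    """
    sub = doc_term[:, word_ids].astype(float)
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    loading = vt[0]
    if loading.sum() < 0:                         # fix the arbitrary sign of the SVD
        loading = -loading
    exposure = sub @ loading                      # one factor value per document
    return loading, exposure

rng = np.random.default_rng(0)
n_docs, n_words, dim = 5, 8, 4
embeddings = rng.standard_normal((n_words, dim))  # toy stand-in for word embeddings
doc_term = rng.poisson(2.0, size=(n_docs, n_words))  # toy document-term counts

clusters = lsh_clusters(embeddings, n_planes=2, rng=rng)
factors = [(ids, *textual_factor(doc_term, ids)) for ids in clusters]

for ids, loading, exposure in factors:
    print("cluster words:", ids, "document exposures:", np.round(exposure, 2))
```

Each cluster yields one word-weight vector and one per-document series; the per-document series is the kind of quantity that could then enter a forecasting or asset-pricing regression as an explanatory variable.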

Keywords: Big Data, Factor Models, Machine Learning, Text Analytics, Natural Language Processing, Topic Models, Alternative Data

JEL Classification: C55, C80, G10

Suggested Citation

Cong, Lin and Liang, Tengyuan and Zhang, Xiao, Textual Factors: A Scalable, Interpretable, and Data-driven Approach to Analyzing Unstructured Information (September 1, 2019). Available at SSRN: https://ssrn.com/abstract=3307057 or http://dx.doi.org/10.2139/ssrn.3307057

Lin Cong (Contact Author)

Cornell University - Samuel Curtis Johnson Graduate School of Management

Ithaca, NY 14853
United States

HOME PAGE: http://www.linwilliamcong.com/

Tengyuan Liang

University of Chicago - Booth School of Business

Xiao Zhang

University of Chicago - Booth School of Business

5807 S. Woodlawn Avenue
Chicago, IL 60637
United States