Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It

45 Pages; Posted: 7 Oct 2016; Last revised: 15 Jul 2017

Matthew James Denny

Pennsylvania State University

Arthur Spirling

New York University

Date Written: July 10, 2017

Abstract

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by characterizing the variability that changes in preprocessing choices may induce when analyzing a particular dataset. By making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts. We make easy-to-use software available for this purpose.

Keywords: text-as-data, preprocessing, forking paths

Suggested Citation

Denny, Matthew James and Spirling, Arthur, Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It (July 10, 2017). Available at SSRN: https://ssrn.com/abstract=2849145 or http://dx.doi.org/10.2139/ssrn.2849145

Matthew James Denny (Contact Author)

Pennsylvania State University

Arthur Spirling

New York University

19 West 4th Street
New York, NY 10012
United States

HOME PAGE: https://www.nyu.edu/projects/spirling/

Paper statistics

Downloads: 672
Rank: 30,900
Abstract Views: 2,923