Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology

34 Pages. Posted: 13 Aug 2009. Last revised: 25 Aug 2009

Justin Grimmer

Harvard University - Faculty of Arts and Sciences

Gary King

Harvard University

Date Written: 2009

Abstract

Many people attempt to discover useful information by reading large quantities of unstructured text, but because of known human limitations even experts are ill-suited to succeed at this task. This difficulty has inspired the creation of numerous automated cluster analysis methods to aid discovery. We address two problems that plague this literature. First, the optimal use of any one of these methods requires that it be applied only to a specific substantive area, but the best area for each method is rarely discussed and usually unknowable ex ante. We tackle this problem with mathematical, statistical, and visualization tools that define a search space built from the solutions to all previously proposed cluster analysis methods (and any qualitative approaches one has time to include) and enable a user to explore it and quickly identify useful information. Second, in part because of the nature of unsupervised learning problems, cluster analysis methods are not routinely evaluated in ways that make them vulnerable to being proven suboptimal or less than useful in specific data types. We therefore propose new experimental designs for evaluating these methods. With such evaluation designs, we demonstrate that our computer-assisted approach facilitates more efficient and insightful discovery of useful information than either expert human coders using qualitative or quantitative approaches or existing automated methods. We (will) make available an easy-to-use software package that implements all our suggestions.
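As a rough illustration of what a "search space built from the solutions" of many clustering methods could look like, the sketch below computes a pairwise distance between clusterings of the same documents. It assumes variation of information as the distance; the abstract does not name a metric, so that choice, the function names, and the toy labels are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from math import log


def variation_of_information(a, b):
    """Variation of information (in nats) between two clusterings.

    Each clustering is a list of cluster labels, one per document,
    with both lists in the same document order.
    """
    if len(a) != len(b):
        raise ValueError("clusterings must label the same documents")
    n = len(a)
    count_a = Counter(a)          # cluster sizes in clustering a
    count_b = Counter(b)          # cluster sizes in clustering b
    joint = Counter(zip(a, b))    # sizes of the pairwise intersections
    vi = 0.0
    for (label_a, label_b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[label_a] / n
        p_b = count_b[label_b] / n
        # Summing this term over all cells gives H(a|b) + H(b|a).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


def pairwise_distances(clusterings):
    """Distance between every pair of named clusterings (hypothetical helper)."""
    names = list(clusterings)
    return {
        (r, c): variation_of_information(clusterings[r], clusterings[c])
        for r in names
        for c in names
    }


# Toy labels for six documents from three hypothetical methods.
solutions = {
    "method_a": [0, 0, 0, 1, 1, 1],
    "method_b": [1, 1, 1, 0, 0, 0],  # same partition as method_a, relabelled
    "method_c": [0, 0, 1, 1, 2, 2],
}
distances = pairwise_distances(solutions)
```

A distance matrix of this kind could then be passed to a dimension-reduction step so that similar clusterings appear near each other in a two-dimensional display that a user can explore. That step is omitted here.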

Keywords: unsupervised learning, discovery, content analysis

Suggested Citation

Grimmer, Justin and King, Gary, Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology (2009). APSA 2009 Toronto Meeting Paper, Available at SSRN: https://ssrn.com/abstract=1450070

Justin Grimmer

Harvard University - Faculty of Arts and Sciences

1875 Cambridge Street
Cambridge, MA 02138
United States
617-710-6803 (Phone)

Gary King (Contact Author)

Harvard University

1737 Cambridge St.
Institute for Quantitative Social Science
Cambridge, MA 02138
United States
617-500-7570 (Phone)

HOME PAGE: http://gking.harvard.edu
