Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining

Information Systems Research, Forthcoming

66 Pages. Posted: 28 Apr 2017. Last revised: 1 May 2017.

Mochen Yang

University of Minnesota - Twin Cities - Carlson School of Management

Gediminas Adomavicius

University of Minnesota - Twin Cities - Carlson School of Management

Gordon Burtch

University of Minnesota - Twin Cities - Carlson School of Management

Yuqing Ching Ren

Carlson School of Management

Date Written: April 28, 2017

Abstract

The application of predictive data mining techniques in Information Systems research has grown in recent years, likely due to their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are added into subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error grows more complex, or as the number of covariates in the econometric model increases. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX by simulations and subsequent application of the methods to econometric estimations employing variables mined from three real world datasets related to travel, social networking, and crowdfunding campaign websites.
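As an illustration of the attenuation bias and the SIMEX idea described in the abstract, here is a minimal sketch on synthetic data. It is not the authors' implementation. The function names (`ols_slope`, `simex_slope`), the data-generating process, the grid of lambda values, and the quadratic extrapolant are all assumptions chosen for illustration. The sketch assumes classical additive measurement error with a known standard deviation `sigma_u`, which in practice would be estimated from the data mining model's held-out performance.

```python
import numpy as np


def ols_slope(x, y):
    # Slope of a simple least-squares regression of y on x.
    return np.polyfit(x, y, 1)[0]


def simex_slope(w, y, sigma_u, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0),
                n_rep=50, seed=0):
    # Simulation step: for each lambda, add extra noise with variance
    # lambda * sigma_u**2 to the error-prone regressor w, re-estimate the
    # slope, and average over n_rep replications.
    rng = np.random.default_rng(seed)
    mean_slopes = []
    for lam in lambdas:
        slopes = [
            ols_slope(w + rng.normal(0.0, np.sqrt(lam) * sigma_u, w.size), y)
            for _ in range(n_rep)
        ]
        mean_slopes.append(np.mean(slopes))
    # Extrapolation step: fit a quadratic in lambda (an assumed extrapolant)
    # and evaluate it at lambda = -1, the hypothetical error-free case.
    coefs = np.polyfit(lambdas, mean_slopes, 2)
    return np.polyval(coefs, -1.0)


# Synthetic example (all values hypothetical): true slope is 2.0.
rng = np.random.default_rng(42)
n = 20_000
sigma_u = 0.5                            # assumed known error std. dev.
x = rng.normal(0.0, 1.0, n)              # true (unobserved) regressor
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
w = x + rng.normal(0.0, sigma_u, n)      # mined variable with error

naive = ols_slope(w, y)                  # biased toward zero
corrected = simex_slope(w, y, sigma_u)   # SIMEX-corrected estimate

print(f"naive slope: {naive:.3f}")
print(f"SIMEX slope: {corrected:.3f}")
```

Under these assumptions the naive slope is attenuated by the factor var(x) / (var(x) + sigma_u^2) = 1 / 1.25, so it concentrates near 1.6 rather than 2.0. The quadratic extrapolant removes most of the bias but not all of it, which is a known property of this choice of extrapolation function.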

Keywords: data mining, econometrics, measurement error, misclassification, statistical inference

Suggested Citation

Yang, Mochen and Adomavicius, Gediminas and Burtch, Gordon and Ren, Yuqing Ching, Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining (April 28, 2017). Information Systems Research, Forthcoming. Available at SSRN: https://ssrn.com/abstract=2960258

Mochen Yang (Contact Author)

University of Minnesota - Twin Cities - Carlson School of Management

19th Avenue South
Minneapolis, MN 55455
United States

Gediminas Adomavicius

University of Minnesota - Twin Cities - Carlson School of Management

19th Avenue South
Minneapolis, MN 55455
United States

Gordon Burtch

University of Minnesota - Twin Cities - Carlson School of Management

19th Avenue South
Minneapolis, MN 55455
United States

Yuqing Ching Ren

Carlson School of Management

420 Delaware St. SE
Minneapolis, MN 55455
United States
612-625-5242 (Phone)

HOME PAGE: http://www.csom.umn.edu/Page2075.aspx?type=staff&eid=38674251
