Linear Probability Models (LPM) and Big Data: The Good, the Bad, and the Ugly
45 Pages. Posted: 13 Nov 2013. Last revised: 12 Oct 2016.
Date Written: October 11, 2016
Linear regression is among the most popular statistical models in social science research. Linear probability models (LPMs) - linear regression models applied to a binary outcome - are used across many disciplines. Surprisingly, LPMs are rare in the IS literature, where logit and probit models are typically used for binary outcomes. LPMs have been examined with respect to specific aspects, but a thorough evaluation of their practical pros and cons for different research goals under different scenarios is missing. We perform an extensive simulation study to evaluate the advantages and dangers of LPMs, especially in the realm of Big Data that now affects IS research. We evaluate LPMs for three common uses of binary outcome models: inference and estimation, prediction and classification, and selection bias. We compare their performance to logit and probit under different sample sizes, error distributions, and other conditions. We find that coefficient directions, statistical significance, and marginal effects yield results similar to logit and probit. Although LPM coefficients are biased, they are consistent for the true parameters up to a multiplicative scalar, and the bias can be corrected by assuming an error distribution. For classification and selection bias, the LPM is on par with logit and probit in terms of class separation and ranking, and is a viable alternative in selection models. It falls short when the predicted probabilities are directly of interest, because LPM predictions can fall outside the unit interval. We illustrate some of these results by modeling price in online auctions, using data from eBay.
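Three of the abstract's claims - that LPM fitted values can leave the unit interval, that LPM slopes track logit slopes up to a multiplicative scalar, and that the two models rank observations almost identically - can be sketched on simulated data. This is a minimal illustration under an assumed logistic data-generating process, not the paper's actual simulation design:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Design matrix with intercept and two standard-normal covariates
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0, -2.0])  # illustrative parameters
p = 1.0 / (1.0 + np.exp(-X @ beta_true))  # logistic DGP
y = rng.binomial(1, p)

# LPM: ordinary least squares on the binary outcome
beta_lpm, *_ = np.linalg.lstsq(X, y, rcond=None)
p_lpm = X @ beta_lpm

# Logit fitted by Newton-Raphson (equivalent to IRLS)
beta_logit = np.zeros(3)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta_logit))
    grad = X.T @ (y - mu)
    hess = X.T @ (X * (mu * (1 - mu))[:, None])
    beta_logit += np.linalg.solve(hess, grad)

# 1) Some LPM fitted "probabilities" fall outside [0, 1]
share_outside = np.mean((p_lpm < 0) | (p_lpm > 1))
print("share of LPM predictions outside [0,1]:", share_outside)

# 2) Slope coefficients agree up to a multiplicative scalar:
#    the two ratios below should be similar positive numbers
print("LPM/logit slope ratios:", beta_lpm[1:] / beta_logit[1:])

# 3) Near-identical rankings: Spearman correlation of fitted scores
p_logit = 1.0 / (1.0 + np.exp(-X @ beta_logit))
r_lpm = np.argsort(np.argsort(p_lpm))
r_logit = np.argsort(np.argsort(p_logit))
rho = np.corrcoef(r_lpm, r_logit)[0, 1]
print("rank correlation of fitted scores:", rho)
```

Because the LPM linear index is approximately proportional to the logit linear index, any classification rule based on ranking or thresholding fitted scores behaves nearly identically under both models, which is the sense in which the abstract calls the LPM "on par" for classification.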
Keywords: linear regression, linear probability model, binary outcome, selection bias, estimation, inference, prediction, big data, logit, probit