Linear Probability Models (LPM) and Big Data: The Good, the Bad, and the Ugly
45 Pages. Posted: 13 Nov 2013. Last revised: 12 Oct 2016.
Date Written: October 11, 2016
Linear regression is among the most popular statistical models in social science research. Linear probability models (LPMs) - linear regression models applied to a binary outcome - are used across many disciplines. Surprisingly, LPMs are rare in the IS literature, where logit and probit models are typically used for binary outcomes. LPMs have been examined with respect to specific aspects, but a thorough evaluation of their practical pros and cons for different research goals under different scenarios is missing. We perform an extensive simulation study to evaluate the advantages and dangers of LPMs, especially in the realm of Big Data that now affects IS research. We evaluate LPMs for three common uses of binary outcome models: inference and estimation, prediction and classification, and selection bias. We compare their performance to logit and probit under different sample sizes, error distributions, and other conditions. We find that coefficient directions, statistical significance, and marginal effects yield results similar to logit and probit. Although LPM coefficients are biased, they are consistent for the true parameters up to a multiplicative scalar, and the bias can be corrected by assuming an error distribution. For classification and selection bias, the LPM is on par with logit and probit in terms of class separation and ranking, and is a viable alternative in selection models. It falls short when the predicted probabilities are directly of interest, because LPM predictions can fall outside the unit interval. We illustrate some of these results by modeling price in online auctions, using data from eBay.
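Three of the abstract's claims - that LPM fitted values can leave the unit interval, that LPM slopes track logit slopes up to a multiplicative scalar, and that the two models rank observations almost identically - can be sketched on simulated data. This is a minimal illustration under an assumed logistic data-generating process, not the paper's actual simulation design:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Design matrix with intercept and two standard-normal covariates
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0, -2.0])  # illustrative parameters
p = 1.0 / (1.0 + np.exp(-X @ beta_true))  # logistic DGP
y = rng.binomial(1, p)

# LPM: ordinary least squares on the binary outcome
beta_lpm, *_ = np.linalg.lstsq(X, y, rcond=None)
p_lpm = X @ beta_lpm

# Logit fitted by Newton-Raphson (equivalent to IRLS)
beta_logit = np.zeros(3)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta_logit))
    grad = X.T @ (y - mu)
    hess = X.T @ (X * (mu * (1 - mu))[:, None])
    beta_logit += np.linalg.solve(hess, grad)

# 1) Some LPM fitted "probabilities" fall outside [0, 1]
share_outside = np.mean((p_lpm < 0) | (p_lpm > 1))
print("share of LPM predictions outside [0,1]:", share_outside)

# 2) Slope coefficients agree up to a multiplicative scalar:
#    the two ratios below should be similar positive numbers
print("LPM/logit slope ratios:", beta_lpm[1:] / beta_logit[1:])

# 3) Near-identical rankings: Spearman correlation of fitted scores
p_logit = 1.0 / (1.0 + np.exp(-X @ beta_logit))
r_lpm = np.argsort(np.argsort(p_lpm))
r_logit = np.argsort(np.argsort(p_logit))
rho = np.corrcoef(r_lpm, r_logit)[0, 1]
print("rank correlation of fitted scores:", rho)
```

Because the LPM linear index is approximately proportional to the logit linear index, any classification rule based on ranking or thresholding fitted scores behaves nearly identically under both models, which is the sense in which the abstract calls the LPM "on par" for classification.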
Keywords: linear regression, linear probability model, binary outcome, selection bias, estimation, inference, prediction, big data, logit, probit