Machine Learning for Instrumental Variable Regression: From Bias to Resilience
45 Pages Posted: 8 Jan 2025
Date Written: November 04, 2024
Abstract
The application of machine learning (ML) in causal inference has attracted significant attention from researchers. A particular focus lies in the integration of ML into two-stage least squares (2SLS), a cornerstone methodology for causal inference. While ML can improve the efficiency of 2SLS by reducing prediction error in the first stage, a major hurdle arises due to the concept of forbidden regression. Specifically, a nonlinear first stage is commonly deemed forbidden because the potential lack of orthogonality between the prediction and prediction error may lead to inconsistent estimates. To provide generalizable insights into the applicability of ML in the first stage of 2SLS, this paper decomposes the bias of a generalized 2SLS estimator into an observable bias and an unobservable bias, without specifying the functional form of the first stage or assuming the proposed instrument to be truly exogenous. Analytical results and extensive simulations show that while a linear prediction can ensure a zero observable bias, it may result in a substantial unobservable bias, especially when the instrument is weak or not strictly exogenous. Conversely, with constrained or orthogonalized ML predictions, it is possible, and even guaranteed under certain conditions, to reduce the unobservable bias without introducing an observable bias. By deriving the expression of bias under minimal assumptions, this paper identifies the sufficient and practically necessary condition for the consistency of ML-augmented 2SLS and offers valuable and previously unexplored insights into its resilience to imperfect instruments, establishing crucial theoretical foundations for the integration of ML into instrumental variable regression.
Keywords: machine learning, causal inference, 2SLS, bias decomposition, endogeneity decomposition, imperfect instruments
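The orthogonality issue described in the abstract can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data-generating process, the helper names (`ols_fit`, `second_stage_slope`, `pred_error_cov`), the use of a deliberately shrunken nonlinear function as a stand-in for a regularized ML prediction, and the reading of "orthogonalized" as regressing the endogenous variable on the ML prediction. The sample covariance between the prediction and the prediction error is used here as a rough proxy for the orthogonality condition; the paper's formal definition of observable bias may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
beta = 2.0  # true causal effect in this hypothetical setup

# Hypothetical data-generating process (an assumption, not from the paper).
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = np.sin(z) + z + u + rng.normal(size=n)   # endogenous regressor
y = beta * x + u + rng.normal(size=n)        # outcome


def ols_fit(features, target):
    """Fitted values from OLS of target on features plus an intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


def second_stage_slope(x_hat, outcome):
    """Slope from OLS of outcome on x_hat (with intercept)."""
    centered = x_hat - x_hat.mean()
    return float(centered @ (outcome - outcome.mean()) / (centered @ centered))


def pred_error_cov(x_hat, regressor):
    """Sample covariance between the prediction and the prediction error."""
    return float(np.mean((x_hat - x_hat.mean()) * (regressor - x_hat)))


# (a) Conventional linear first stage.
x_lin = ols_fit(z, x)

# (b) Stand-in for a regularized ML prediction: the right shape, but shrunk
#     toward zero, so it is not orthogonal to its own prediction error.
x_ml = 0.5 * (np.sin(z) + z)

# (c) Orthogonalized prediction: regress x on the ML prediction and keep the
#     fitted values (equivalent to using x_ml as the instrument).
x_orth = ols_fit(x_ml, x)

for name, x_hat in [("linear", x_lin), ("raw ML", x_ml), ("orthogonalized", x_orth)]:
    print(
        f"{name:>15}: cov(pred, error) = {pred_error_cov(x_hat, x):+.4f}, "
        f"second-stage slope = {second_stage_slope(x_hat, y):.3f}"
    )
```

In this sketch the linear and orthogonalized predictions are uncorrelated with their prediction errors by construction, and their second-stage slopes land near the true value of 2. Plugging in the shrunken prediction directly leaves a nonzero covariance and pushes the slope far from 2. The instrument here is strong and exogenous, so the sketch says nothing about the weak or imperfect instrument cases the paper analyzes.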
