64 Pages Posted: 31 Dec 2019
Date Written: December 2019
Machine learning (ML) is mostly a predictive enterprise, while the questions of interest to labor economists are mostly causal. In pursuit of causal effects, however, ML may be useful for automated selection of ordinary least squares (OLS) control variables. We illustrate the utility of ML for regression-based causal inference by using lasso to select control variables for estimates of effects of college characteristics on wages. ML also seems relevant for an instrumental variables (IV) first stage, since the bias of two-stage least squares can be said to be due to over-fitting. Our investigation shows, however, that while ML-based instrument selection can improve on conventional 2SLS estimates, split-sample IV, jackknife IV, and LIML estimators do better. In some scenarios, the performance of ML-augmented IV estimators is degraded by pretest bias. In others, nonlinear ML for covariate control creates artificial exclusion restrictions that generate spurious findings. ML does better at choosing control variables for models identified by conditional independence assumptions than at choosing instrumental variables for models identified by exclusion restrictions.
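As a rough illustration of lasso-based control selection for an OLS causal estimate, the sketch below uses a "post-double-selection" scheme: lasso picks controls that predict the outcome, lasso picks controls that predict the treatment, and OLS is then run on the treatment plus the union of both sets. The abstract does not say which selection procedure the paper uses, so the scheme, the simulated data, the penalty value `lam`, and all function names here are assumptions made for illustration only.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via cyclic coordinate descent.
    Minimizes (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()  # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]            # drop column j's contribution
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]            # add the updated contribution back
    return b

def standardize(A):
    return (A - A.mean(axis=0)) / A.std(axis=0)

def post_double_selection(y, d, W, lam):
    """Select controls with two lasso fits, then run OLS of y on d
    and the union of the selected controls."""
    Wz = standardize(W)
    sel_y = np.flatnonzero(lasso_cd(Wz, y - y.mean(), lam))  # predicts outcome
    sel_d = np.flatnonzero(lasso_cd(Wz, d - d.mean(), lam))  # predicts treatment
    keep = np.union1d(sel_y, sel_d)
    Z = np.column_stack([np.ones(len(y)), d, W[:, keep]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[1], keep  # coefficient on d, indices of kept controls

# Simulated data (hypothetical): 30 candidate controls, 3 of which matter.
rng = np.random.default_rng(0)
n, p = 500, 30
W = rng.normal(size=(n, p))
d = W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)            # treatment
y = 1.0 * d + 2.0 * W[:, 0] + W[:, 2] + rng.normal(size=n)  # true effect = 1.0

effect, keep = post_double_selection(y, d, W, lam=0.2)
print("estimated effect of d:", round(float(effect), 3))
print("selected controls:", keep.tolist())
```

Taking the union of the two selected sets, rather than only the controls that predict the outcome, is meant to guard against dropping a confounder that is strongly related to the treatment but only weakly related to the outcome. In applied work the penalty would normally be set by a data-driven rule or cross-validation rather than fixed by hand as it is here.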
Institutional subscribers to the NBER working paper series and residents of developing countries may download this paper without additional charge at www.nber.org.
This is a National Bureau of Economic Research Paper. NBER charges a fee of $5.00 for this paper.