Correlation or Causation?: The Sorry State of Inference in Empirical Modeling
14 Pages. Posted: 16 Mar 2017
Date Written: August 12, 2016
For decades, statistical methods, many based upon the “general linear model,” have been used for estimation and hypothesis testing in the social and natural sciences, in medicine, and in the private sector. These tools have become increasingly sophisticated and are often paired with powerful open-source data analysis software. We now regularly see mathematical and statistical output combined with data visualizations that are truly mind-boggling and, once in a while, thought-provoking. But an increasing number of papers and studies appear to have little statistical validity, and in them the line between causation and correlation is often non-existent. This is a danger sign not only for science and medicine but also for companies that unwittingly rely on such results for forecasting and business strategy. Could it be true that researchers and analysts who learn ever more powerful analytical methods lack even a basic understanding of the limitations of these methods? The purpose of this short, non-technical paper, which relies heavily upon examples, is to shed some light on the underlying statistical issues. The ideas here are certainly not original to us and have been raised for a number of years across multiple disciplines.
Keywords: Causation, Correlation, Overfitting, p-Hacking, Endogeneity, Identification, Simultaneity, Measurement Error, Data Generation Processes, Inconsistency, Bias