The Limits of Data Mining: A Thought Experiment
14 Pages Posted: 16 Nov 2018
Date Written: October 25, 2018
Suppose that asset pricing factors are just data-mined noise. How much data mining is required to produce the more than 300 factors documented by academics? This short paper shows that, if 10,000 academics generate 1 factor every minute, it takes 15 million years of full-time data mining. This absurd conclusion comes from rigorously pursuing the data mining theory and applying it to data. To fit the fat right tail of published t-stats, a pure data mining model implies that the probability of publishing t-stats < 6.0 is ridiculously small, and thus it takes a ridiculous amount of mining to publish a single t-stat. These results show that data mining alone cannot explain the zoo of asset pricing factors.
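To give a sense of the scale involved, the headline numbers can be turned into a rough count of mined factors. The sketch below is only back-of-the-envelope arithmetic, not the paper's model: it reads the rate as 1 factor per minute per academic, and the definition of "full-time" (40 hours a week, 50 weeks a year) is an assumption made here, as are the names `MINUTES_PER_WORK_YEAR` and `total_attempts`.

```python
# Back-of-the-envelope arithmetic for the thought experiment.
# Figures taken from the abstract:
ACADEMICS = 10_000          # number of academics mining data
FACTORS_PER_MINUTE = 1      # read here as per academic (an interpretation)
YEARS = 15_000_000          # years of full-time data mining
PUBLISHED_FACTORS = 300     # "more than 300" documented factors

# Assumption (not from the paper): full-time = 40 h/week for 50 weeks/year.
MINUTES_PER_WORK_YEAR = 40 * 50 * 60


def total_attempts(academics=ACADEMICS,
                   per_minute=FACTORS_PER_MINUTE,
                   minutes_per_year=MINUTES_PER_WORK_YEAR,
                   years=YEARS):
    """Total number of candidate factors mined under the stated assumptions."""
    return academics * per_minute * minutes_per_year * years


attempts = total_attempts()
# Implied share of mined factors that end up published, if all 300
# published factors came out of this mining effort.
implied_publication_rate = PUBLISHED_FACTORS / attempts

print(f"mined factors:            {attempts:.2e}")
print(f"implied publication rate: {implied_publication_rate:.2e}")
```

Under these assumptions the effort amounts to roughly 1.8 × 10^16 mined factors, so only about 1 in 60 trillion would be published. A different definition of "full-time" changes the figures by a small constant factor, not the order of magnitude.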
Keywords: Stock return anomalies, publication bias, data mining, multiple testing, p-hacking
JEL Classification: G10, G12