The Limits of Data Mining: A Thought Experiment

14 Pages Posted: 16 Nov 2018

Date Written: October 25, 2018

Abstract

Suppose that asset pricing factors are just data-mined noise. How much data mining is required to produce the more than 300 factors documented by academics? This short paper shows that, if 10,000 academics each generate one factor every minute, it takes 15 million years of full-time data mining. This absurd conclusion comes from rigorously pursuing the data mining theory and applying it to data. To fit the fat right tail of published t-stats, a pure data mining model implies that the probability of publishing t-stats < 6.0 is ridiculously small, and thus it takes a ridiculous amount of mining to publish a single t-stat. These results show that data mining alone cannot explain the zoo of asset pricing factors.
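
To give a sense of the scale behind the headline numbers, the following is a back-of-envelope sketch, not the paper's model. It converts "10,000 academics, one factor per minute, 15 million years" into a total count of mined factors and divides the roughly 300 published factors by that count. Two things here are assumptions and do not come from the abstract: "full-time" is read as mining around the clock (525,600 minutes per year), and 300 is used as a round lower bound for "more than 300". The function names are illustrative only.

```python
# Back-of-envelope arithmetic for the abstract's thought experiment.
# ASSUMPTION: "full-time" means 24 hours a day, 365 days a year.
# A 40-hour work week would shrink the total by a factor of roughly four.
MINUTES_PER_YEAR_24_7 = 365 * 24 * 60  # 525,600 minutes


def total_factor_draws(academics, years, minutes_per_year, factors_per_minute=1):
    """Total number of candidate factors mined across all academics."""
    return academics * factors_per_minute * minutes_per_year * years


def implied_publication_rate(published_factors, draws):
    """Crude ratio of published factors to mined factors.

    This is NOT the paper's estimated publication probability; it is only
    the ratio implied by the headline numbers under the assumptions above.
    """
    return published_factors / draws


draws = total_factor_draws(
    academics=10_000,
    years=15_000_000,
    minutes_per_year=MINUTES_PER_YEAR_24_7,
)
rate = implied_publication_rate(published_factors=300, draws=draws)

print(f"Mined factors: {draws:.3e}")
print(f"Published per mined factor: {rate:.3e}")
```

Under these assumptions the mining effort amounts to tens of quadrillions of candidate factors for a few hundred publications, which is the sense in which the abstract calls the pure data mining story absurd.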

Keywords: Stock return anomalies, publication bias, data mining, multiple testing, p-hacking

JEL Classification: G10, G12

Suggested Citation

Chen, Andrew Y., The Limits of Data Mining: A Thought Experiment (October 25, 2018). Available at SSRN: https://ssrn.com/abstract=3272572 or http://dx.doi.org/10.2139/ssrn.3272572

Andrew Y. Chen (Contact Author)

Federal Reserve Board

20th and C Streets, NW
Washington, DC 20551
United States
202-973-6941 (Phone)

HOME PAGE: http://sites.google.com/site/chenandrewy/

