p-Hacking and False Discovery in A/B Testing
44 Pages. Posted: 18 Jul 2018
Date Written: June 28, 2018
We investigate whether online A/B experimenters "p-hack" by stopping their experiments based on the p-value of the treatment effect. Our data contains 2,101 commercial experiments in which experimenters can track the magnitude and significance level of the effect every day of the experiment. We use a regression discontinuity design to detect p-hacking, which we measure as the causal effect of reaching a particular p-value on stopping behavior.
Experimenters indeed p-hack, especially for positive effects. Specifically, about 57% of experimenters p-hack when the experiment reaches 90% confidence. Furthermore, approximately 70% of the effects are truly null, and p-hacking increases the false discovery rate (FDR) from 33% to 42% among experiments p-hacked at 90% confidence. Assuming that false discoveries cause experimenters to stop exploring for more effective treatments, we estimate the expected cost of a false discovery to be a loss of 1.95% in lift, which corresponds to the 76th percentile of observed lifts.
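To illustrate the mechanism, the following minimal sketch simulates experiments with no true effect and compares a fixed-horizon test with one that stops the first time a daily check crosses the significance threshold. It is not the paper's regression discontinuity method and does not reproduce its estimates. The 20-day horizon, the equal daily sample sizes, the two-sided z-test at alpha = 0.10 (taken here to correspond to "90% confidence"), and the power of 0.5 are all assumptions made for illustration, as are the function names.

```python
import math
import random
from statistics import NormalDist


def fdr(pi0, alpha, power):
    """Expected share of significant results that are false discoveries.

    pi0   : share of experiments whose true effect is null
    alpha : false-positive rate among true nulls
    power : detection rate among true effects (hypothetical input)
    """
    false_pos = pi0 * alpha
    true_pos = (1.0 - pi0) * power
    return false_pos / (false_pos + true_pos)


def false_positive_rate(peek, n_experiments=5000, days=20, alpha=0.10, seed=0):
    """Simulate null experiments and return the share declared significant.

    Each day contributes an independent standard-normal increment, so the
    cumulative z-statistic after `day` days is the running sum divided by
    sqrt(day). This assumes equal sample sizes every day.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        z = 0.0
        significant = False
        for day in range(1, days + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(day)
            # Optional stopping: stop at the first significant daily check.
            if peek and abs(z) > z_crit:
                significant = True
                break
        if not peek:
            # Fixed horizon: test once, on the final day only.
            significant = abs(z) > z_crit
        hits += significant
    return hits / n_experiments


if __name__ == "__main__":
    fixed = false_positive_rate(peek=False)
    peeking = false_positive_rate(peek=True)
    print(f"false-positive rate, fixed horizon: {fixed:.3f}")
    print(f"false-positive rate, daily peeking: {peeking:.3f}")
    # pi0 = 0.7 follows the abstract; power = 0.5 is a hypothetical value.
    print(f"FDR, fixed horizon: {fdr(0.7, fixed, 0.5):.3f}")
    print(f"FDR, daily peeking: {fdr(0.7, peeking, 0.5):.3f}")
```

In this sketch every simulated experimenter stops at the first significant check, whereas the abstract reports that only a share of experimenters do so, and power is held fixed even though stopping early would change it as well. The output therefore indicates only the direction of the effect: optional stopping raises the false-positive rate among true nulls, which in turn raises the FDR.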
Keywords: A/B testing, p-hacking, false discoveries, false positives, experimentation
JEL Classification: C12, C90, C93, M21, M31