p-Hacking in A/B Testing
46 Pages Posted: 18 Jul 2018 Last revised: 21 Jul 2022
Date Written: December 11, 2018
Abstract
We investigate to what extent online A/B experimenters “p-hack” by stopping their experiments based on the p-value of the treatment effect, and how such behavior impacts the value of the experimental results. Our data contains 2,101 commercial experiments in which experimenters can track the magnitude and significance level of the effect every day of the experiment. We use a regression discontinuity design to detect the causal effect of reaching a particular p-value on stopping behavior.
Experimenters indeed p-hack, at times. Specifically, about 73% of experimenters stop the experiment just when a positive effect reaches 90% confidence. Also, approximately 75% of the effects are truly null. Improper optional stopping increases the false discovery rate (FDR) from 33% to 40% among experiments p-hacked at 90% confidence. Assuming that false discoveries cause experimenters to stop exploring for more effective treatments, we estimate the expected cost of a false discovery to be a loss of 1.95% in lift, which corresponds to the 76th percentile of observed lifts.
Keywords: A/B testing, p-hacking, false discoveries, false positives, experimentation
JEL Classification: C12, C90, C93, M21, M31
Suggested Citation: Suggested Citation