Stata, Fast and Slow: Why Running Many Small Regressions in a Large Dataset Takes So Long; and What to Do About It
19 Pages. Posted: 12 Apr 2014
Date Written: April 11, 2014
Stata is fast, often very fast. However, when performing regressions on small sub-samples within a large host dataset (more than 1 million observations), performance can deteriorate by many orders of magnitude. For example, an OLS regression on a sub-sample of 100 consecutive observations takes 3.6 seconds in a host dataset with 1 billion observations, but only 3.8 milliseconds in a host dataset with 1000 observations. The difference in performance is due to the mechanism that regress uses to mark estimation samples. This performance deterioration has practical implications in finance research, where many variables of interest are themselves estimated via millions of individual OLS regressions within large panel datasets. I suggest an approach that circumvents this issue by using a simple Mata implementation of regress, which I call fastreg. As a test, I estimate daily Fama and French 3-factor betas for individual stocks in the CRSP database from 1923 to 2013 using a 250-day rolling window. In this setting, fastreg is approximately 367 times faster than regress. The code for the fastreg ado is included in the Appendix and is open-source licensed under the GNU GPL.
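The cost asymmetry described in the abstract can be illustrated with a small Python sketch. It is not the fastreg code and does not reproduce Stata's internals. It assumes, purely for illustration, that marking an estimation sample requires one pass over every observation in the host dataset, whereas a direct range read touches only the window. The function names (`fit_with_marker`, `fit_with_range`) and the synthetic data are hypothetical.

```python
def ols_slope_intercept(xs, ys):
    """Closed-form OLS of y on a single regressor x with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_with_marker(x, y, start, stop):
    """Assumed 'marker' approach: build one flag per host row, then filter.

    The work per regression grows with the size of the host dataset,
    even though only (stop - start) rows enter the estimate.
    """
    touse = [start <= i < stop for i in range(len(x))]
    xs = [v for v, keep in zip(x, touse) if keep]
    ys = [v for v, keep in zip(y, touse) if keep]
    return ols_slope_intercept(xs, ys)


def fit_with_range(x, y, start, stop):
    """Range approach: read only the contiguous window of observations."""
    return ols_slope_intercept(x[start:stop], y[start:stop])


# Hypothetical host dataset in which y = 2 + 3x holds exactly.
N = 200_000
x = [float(i % 97) for i in range(N)]
y = [2.0 + 3.0 * v for v in x]

# Same 100-observation window, two ways of selecting it.
marker_fit = fit_with_marker(x, y, 1000, 1100)
range_fit = fit_with_range(x, y, 1000, 1100)
```

Both calls return the same coefficients; only the amount of data touched per regression differs. When such a fit is repeated millions of times in a rolling window, the per-fit pass over the full host dataset would dominate total run time under the stated assumption.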
Keywords: Stata, statistical computing, large datasets, rolling window regressions, factor beta estimation
JEL Classification: C55, C58, C80, C87