Combining Prediction Intervals in the M4 Competition
16 Pages · Posted: 14 Feb 2019 · Last revised: 10 Apr 2019
Date Written: February 5, 2019
The 2018 M4 Forecasting Competition was the first M-Competition to elicit prediction intervals in addition to point estimates. We take a closer look at the twenty valid interval submissions, examining the prediction intervals' calibration and accuracy and evaluating their performance over different forecast horizons. Overall, the submissions fail to estimate uncertainty properly. Importantly, we investigate the benefits of interval combination using six recently proposed heuristics that can be applied before the quantities' realizations become known. Our results suggest that interval aggregation improves both calibration and accuracy. While averaging interval endpoints retains its practical appeal, being simple to implement and performing quite well on large data sets, the median and the interior trimmed average prove to be robust aggregators for the prediction interval submissions across all 100,000 time series.
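To illustrate three of the aggregation heuristics mentioned above, the sketch below combines the endpoints of several prediction intervals by simple averaging, by taking the median, and by an interior trimmed average. This is a hypothetical illustration, not the paper's exact procedure: the function name, the `trim_frac` parameter, and the convention used for interior trimming (dropping the innermost endpoints, i.e., the largest lower bounds and the smallest upper bounds, to widen the combined interval and counteract overconfidence) are assumptions.

```python
import numpy as np

def combine_intervals(lowers, uppers, trim_frac=0.2):
    """Combine k prediction intervals [lowers[i], uppers[i]] with three
    heuristics: endpoint averaging, the median, and an interior trimmed
    average (an assumed convention, not the paper's exact definition)."""
    lowers = np.sort(np.asarray(lowers, dtype=float))
    uppers = np.sort(np.asarray(uppers, dtype=float))
    k = len(lowers)
    # Number of endpoints to trim; interior trimming drops the innermost
    # endpoints, which tends to widen the combined interval.
    t = int(trim_frac * k)
    trimmed_lo = lowers[: k - t].mean()  # drop the t largest lower bounds
    trimmed_hi = uppers[t:].mean()       # drop the t smallest upper bounds
    return {
        "average": (lowers.mean(), uppers.mean()),
        "median": (np.median(lowers), np.median(uppers)),
        "interior_trim": (trimmed_lo, trimmed_hi),
    }
```

Note that the interior trimmed interval is always at least as wide as the simple average of endpoints, which is one way an aggregator can correct for the overconfidence the paper documents in the individual submissions.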
Keywords: hit rates, interval combination methods, calibration, mean scaled interval score, interior trimming, overconfidence