Significance Testing Against the Random Model for Scoring Models on Top K Predictions
17 Pages. Posted: 9 Oct 2008
Date Written: 2005
Performance at top k predictions, where instances are ranked by a (learned) scoring model, has been used as an evaluation metric in machine learning for various reasons, such as where the entire corpus is unknown (e.g., the web) or where the results are to be used by a person with limited time or resources (e.g., ranking financial news stories where the investor only has time to look at relatively few stories per day). This evaluation metric is primarily used to report whether the performance of a given method is significantly better than that of other (baseline) methods. It has not, however, been used to show whether the result is significant when compared to the simplest of baselines: the random model. If no model outperforms the random model at a given confidence level, then the results may not be worth reporting. This paper introduces a technique for analyzing the expected performance of the top k predictions from the random model, given k and a p-value, on an evaluation dataset D. The technique is based on the realization that the number of positives seen in the top k predictions follows a hypergeometric distribution, which has well-defined statistical density functions. As this distribution is discrete, we show that parametric estimates based on a binomial distribution are almost always in complete agreement with the discrete distribution and that, where they differ, an interpolation of the discrete bounds gets very close to the parametric estimates. The technique is demonstrated on results from three previously published works, where it clearly shows that even though performance is greatly increased (sometimes by over 100%) with respect to the expected performance of the random model (at p = 0.5), these results are not always as significant (at p = 0.1) as the impressive qualitative improvements might suggest.
The technique is used to show, given k, both how many positive instances are needed to achieve a specific significance threshold and how significant a given top k performance is. When used in a more global setting, the technique is able to identify the crossover points, with respect to k, at which a method becomes significant for a given p. Lastly, the technique is used to generate a complete confidence curve, which shows the general trend over all k and visually shows where a method is significantly better than the random model.
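As an illustration only, the hypergeometric view described in the abstract can be sketched in a few lines of Python. This is not the paper's implementation. The function names are hypothetical, and so are the symbols N (the size of the evaluation dataset D) and P (the number of positives in D). The sketch assumes the random model ranks instances uniformly at random, so the number of positives in the top k is a draw without replacement.

```python
from math import comb


def hypergeom_pmf(x, N, P, k):
    """Probability of exactly x positives in the top k under the random model.

    N: total instances in D (assumed symbol), P: positives in D (assumed symbol).
    """
    # Outside the feasible support the probability is zero.
    if x < max(0, k - (N - P)) or x > min(k, P):
        return 0.0
    return comb(P, x) * comb(N - P, k - x) / comb(N, k)


def p_value(x, N, P, k):
    """Upper-tail probability: the random model gets at least x positives in the top k."""
    return sum(hypergeom_pmf(i, N, P, k) for i in range(x, min(k, P) + 1))


def binomial_p_value(x, N, P, k):
    """Binomial (with-replacement) approximation of the same upper tail."""
    rate = P / N
    return sum(comb(k, i) * rate**i * (1 - rate) ** (k - i) for i in range(x, k + 1))


def min_positives(N, P, k, alpha):
    """Smallest count of positives in the top k whose p-value is at most alpha.

    Returns None if no attainable count reaches the threshold.
    """
    for x in range(0, min(k, P) + 1):
        if p_value(x, N, P, k) <= alpha:
            return x
    return None


if __name__ == "__main__":
    # Hypothetical dataset: 1000 instances, 100 positives, top 10 predictions.
    N, P, k = 1000, 100, 10
    for x in range(0, 5):
        print(x, round(p_value(x, N, P, k), 4), round(binomial_p_value(x, N, P, k), 4))
    print("positives needed at alpha = 0.1:", min_positives(N, P, k, 0.1))
```

Calling `min_positives` over a range of k values would give one simple way to trace a threshold curve of the kind the abstract describes. How the paper interpolates between the discrete bounds is not specified here, so the sketch does not attempt it.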