Improving Low-Probability Judgments
72 Pages
Posted: 10 Jan 2025
Date Written: November 17, 2024
Abstract
High-stakes debates often pivot on clashing estimates of outcomes that one side sees as so improbable as not to deserve policy prioritization. These debates are especially intractable when they focus on rare events ranging from disasters (e.g., existential risks from artificial intelligence, nuclear war, or bioengineered pandemics) to surprising successes (e.g., once inconceivable scientific discoveries). The research literature offers grounds for suspecting that the micro-probability judgments flowing into such debates are both unreliable and biased. This article covers experimental manipulations that improve the accuracy of low-probability judgments by shifting from the standard linear elicitation scale and Brier scoring rule to nonlinear (logarithmic) elicitation scales and logarithmic scoring rules. These methodological changes produced accuracy improvements of approximately d = 0.2 to 0.5 for individual accuracy scores. Improvements in aggregate accuracy varied more widely by aggregation function (mean vs. median) and accuracy scoring rule, between parity (d = 0) and a large advantage for nonlinear over linear scales (d = 0.68). Judgments obtained via the linear scale and text box elicitations systematically overestimated the true values. The new scales allowed forecasters to provide precise judgments at the low end of the probability scale, and logarithmic scoring rules penalize large errors harshly, incentivizing judges to avoid 0% and provide precise non-zero probabilities. An indirect elicitation protocol we developed, successive menus, yielded mixed results: it improved aggregate accuracy and individual calibration at the cost of increasing outlier judgments and reducing retention. Base-rate anchors provided context but no measurable accuracy benefits. These results point to next steps for improving probability judgments of rare events.
The most promising next steps include a) using subject-specific base-rate anchors, b) developing training programs specific to low-probability events, c) developing more robust and usable indirect elicitation protocols, and d) assessing all of these methods in a longitudinal forecasting tournament featuring many forecasting questions focused on rare events.
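To illustrate why the choice of scoring rule matters at the low end of the probability scale, the following minimal sketch compares the Brier and logarithmic scores for a rare event that does occur. The function names and the three example forecasts are hypothetical and are not taken from the study. The sketch also assumes the unclamped form of the logarithmic rule, in which a 0% forecast for an event that occurs receives an infinite penalty; the study's exact scoring implementation may differ.

```python
import math


def brier_score(p, outcome):
    """Quadratic (Brier) score for a binary event; lower is better."""
    return (p - outcome) ** 2


def log_score(p, outcome):
    """Negative log of the probability assigned to the realized outcome.

    Lower is better. Assumption: no clamping, so assigning probability 0
    to the outcome that occurs yields an infinite penalty.
    """
    p_realized = p if outcome == 1 else 1.0 - p
    if p_realized <= 0.0:
        return math.inf
    return -math.log(p_realized)


# Hypothetical forecasts for a rare event that does occur (outcome = 1).
for p in (0.0, 0.0001, 0.01):
    print(f"p={p:<7} Brier={brier_score(p, 1):.4f}  log={log_score(p, 1):.3f}")
```

Under these assumptions, the Brier score barely separates the three forecasts (all fall between roughly 0.98 and 1.0), whereas the logarithmic score separates them sharply and treats a flat 0% as unboundedly bad. This is consistent with the incentive to avoid 0% and to report precise non-zero probabilities.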