Lookahead Bias in Pretrained Language Models

30 Pages. Posted: 11 Apr 2024

Date Written: May 27, 2024


Empirical analysis that uses outputs from pretrained language models can be subject to a new form of temporal lookahead bias. This bias arises when a language model’s pretraining data contains information about the future, which then leaks into analysis that should use only information from the past. Lookahead bias is a form of information leakage that can cause otherwise-standard empirical strategies built on language model outputs to overestimate predictive performance. In this paper, we develop tests for lookahead bias based on the assumption that some events are unpredictable given a prespecified information set. Using these tests, we find evidence of lookahead bias in two applications of language models to social science: predicting risk factors from corporate earnings calls and predicting election winners from candidate biographies. We additionally find issues with prompting-based approaches to counteract this bias. The issues we raise can be addressed by using models whose pretraining data is free of survivorship bias and contains only language produced prior to the analysis period of interest.
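
The testing idea in the abstract can be illustrated with a small sketch. This is not the paper's actual procedure, which the abstract does not specify; it is one simple way to operationalize the stated assumption. If a binary outcome is assumed to be unpredictable given the information set, a model restricted to that information should not beat the chance rate, so accuracy significantly above chance is consistent with leakage from pretraining data. The function names, the exact one-sided binomial test, and the default chance rate of 0.5 are all assumptions made for illustration.

```python
from math import comb


def binomial_tail_pvalue(successes: int, trials: int, chance: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 0.0 <= chance <= 1.0:
        raise ValueError("chance must be a probability")
    # Sum the exact binomial probabilities over the upper tail.
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def lookahead_pvalue(predictions, outcomes, chance: float = 0.5) -> float:
    """One-sided p-value for the null that predictions are no better than chance.

    Hypothetical helper: `outcomes` are realized events that are ASSUMED to be
    unpredictable from the information set, and `predictions` are the model's
    guesses made from that information set alone.
    """
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    hits = sum(p == o for p, o in zip(predictions, outcomes))
    return binomial_tail_pvalue(hits, len(outcomes), chance)


if __name__ == "__main__":
    # Toy data for illustration only, not results from the paper.
    predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
    print(f"p-value: {lookahead_pvalue(predictions, outcomes):.4f}")
```

A small p-value here would mean that the model guesses supposedly unpredictable outcomes more often than chance allows. The conclusion is only as strong as the unpredictability assumption and the chosen chance rate.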

Keywords: lookahead bias, language models, information leakage

JEL Classification: B4, C5, G1

Suggested Citation

Sarkar, Suproteem and Vafa, Keyon, Lookahead Bias in Pretrained Language Models (March 12, 2024). Available at SSRN: https://ssrn.com/abstract=4754678 or http://dx.doi.org/10.2139/ssrn.4754678

Suproteem Sarkar (Contact Author)

Harvard University

1875 Cambridge Street
Cambridge, MA 02138
United States

Keyon Vafa

Harvard University

1875 Cambridge Street
Cambridge, MA 02138
United States
