Lookahead Bias in Pretrained Language Models
32 Pages · Posted: 11 Apr 2024 · Last revised: 18 Oct 2024
Date Written: June 28, 2024
Abstract
Empirical analysis that uses outputs from pretrained language models can be subject to a form of temporal lookahead bias. This bias arises when a language model's pretraining data contains information about the future, which then leaks into analysis that should only use information from the past. In this paper, we develop direct tests for lookahead bias based on the assumption that some events are unpredictable given a prespecified information set. Using these tests, we find evidence of lookahead bias in two applications of language models to social science: predicting risk factors from corporate earnings calls and predicting election winners from candidate biographies. We additionally discuss the limitations of prompting-based approaches to counteract this bias. The issues we raise can be addressed by using models whose pretraining data is free of survivorship bias and contains only language produced prior to the analysis period of interest.
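The abstract's test idea can be illustrated with a minimal sketch: if a set of events is unpredictable given the prespecified information set, a model without lookahead should guess their outcomes at chance, so accuracy significantly above chance is evidence of leakage. The function names and the sample counts below are hypothetical, not from the paper; the sketch uses an exact one-sided binomial test against a 50% baseline.

```python
# Hedged sketch of a direct lookahead-bias test (hypothetical names/numbers).
# Premise: for events that are unpredictable given a prespecified information
# set (e.g., essentially coin-flip election races), a model with no future
# leakage should be correct about 50% of the time.
from math import comb

def binomial_tail_p(k: int, n: int, p: float = 0.5) -> float:
    """One-sided p-value: probability of observing >= k successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def lookahead_bias_test(correct: int, total: int, alpha: float = 0.05) -> bool:
    """True if accuracy on unpredictable events is significantly above chance,
    which would be evidence of lookahead bias."""
    return binomial_tail_p(correct, total) < alpha

# Hypothetical example: the model names the winner in 70 of 100
# unpredictable races, versus 52 of 100 in a second scenario.
print(lookahead_bias_test(70, 100))  # True: well above the 50% baseline
print(lookahead_bias_test(52, 100))  # False: consistent with chance
```

The binomial test is one simple choice of statistic; any test of the null "accuracy equals chance" on a prespecified set of unpredictable events follows the same logic.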
Keywords: lookahead bias, language models, information leakage
JEL Classification: B4, C5, G1