Lookahead Bias in Pretrained Language Models

32 Pages · Posted: 11 Apr 2024 · Last revised: 18 Oct 2024

Date Written: June 28, 2024

Abstract

Empirical analysis that uses outputs from pretrained language models can be subject to a form of temporal lookahead bias. This bias arises when a language model's pretraining data contains information about the future, which then leaks into analysis that should only use information from the past. In this paper, we develop direct tests for lookahead bias, based on the assumption that some events are unpredictable given a prespecified information set. Using these tests, we find evidence of lookahead bias in two applications of language models to social science: predicting risk factors from corporate earnings calls and predicting election winners from candidate biographies. We additionally discuss the limitations of prompting-based approaches to counteract this bias. The issues we raise can be addressed by using models whose pretraining data is free of survivorship bias and contains only language produced prior to the analysis period of interest.
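The testing logic described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's code: if an event is genuinely unpredictable from a prespecified information set (e.g., a close election winner given only pre-election biographies), a model without leakage should classify it no better than chance. Accuracy significantly above chance on such events suggests the model's pretraining data contained post-event information. The event counts below are made up for illustration.

```python
# Hedged sketch of a direct lookahead-bias test: a one-sided binomial test
# of whether model accuracy on "unpredictable" events exceeds chance.
# All numbers here are hypothetical, not results from the paper.
from math import comb

def binomial_p_value(correct: int, total: int, chance: float = 0.5) -> float:
    """One-sided p-value: probability of observing >= `correct` successes
    out of `total` trials when each succeeds with probability `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical result: the model labels 60 of 80 unpredictable events correctly.
p = binomial_p_value(60, 80)
print(f"one-sided p-value: {p:.2e}")  # a small p-value is evidence of leakage
```

Under the null of no leakage, accuracy on these events should be statistically indistinguishable from the chance rate, so a small p-value is evidence that future information leaked from pretraining.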

Keywords: lookahead bias, language models, information leakage

JEL Classification: B4, C5, G1

Suggested Citation

Sarkar, Suproteem and Vafa, Keyon, Lookahead Bias in Pretrained Language Models (June 28, 2024). Available at SSRN: https://ssrn.com/abstract=4754678 or http://dx.doi.org/10.2139/ssrn.4754678

Suproteem Sarkar (Contact Author)

Harvard University

1875 Cambridge Street
Cambridge, MA 02138
United States

Keyon Vafa

Harvard University

1875 Cambridge Street
Cambridge, MA 02138
United States
