Recovering Missing Firm Characteristics with Attention-Based Machine Learning
65 Pages. Posted: 10 Jan 2022. Last revised: 23 Jan 2023
Date Written: January 23, 2023
Firm characteristics are used extensively in empirical research in accounting, economics, finance, and related fields. These characteristics are, however, frequently unobservable to the researcher, with intricate patterns as to why and when they are missing. In the past, researchers dropped firm-month observations with missing information. This approach quickly becomes infeasible as the number of characteristics grows, yet a large set of characteristics is required to assess their informational content simultaneously. A second approach that has emerged in response is to impute the cross-sectional mean, which discards important variation over time and in the cross-section. Our study is devoted to the recovery of these missing entries, drawing on the informational content of other, observed characteristics, their past observations, and information from the cross-section of other firms. We adapt state-of-the-art advances from natural language processing to financial data and train a flexible large-scale machine learning model in a self-supervised environment. To train the model, we consider several masking types that account for empirically observed patterns of missingness. Using the uncovered latent structure governing firm characteristics, we show that our model beats competing methods, as well as several approaches tailored to the imputation of financial data. Based on the completed dataset, we show that average returns to many characteristic-sorted long-short portfolios are likely lower than previously thought. In general, the return distribution of firms with missing characteristics differs significantly from that of firms with all information available, highlighting the importance of adequately imputing missing values.
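The two baseline treatments of missing entries described above, and the idea of masking observed entries to create a self-supervised target, can be illustrated with a minimal sketch. This is not the paper's model or its masking scheme: the toy panel, the function names, and the uniform random masking are all assumptions made for illustration, whereas the paper uses several masking types designed to mimic empirically observed missingness.

```python
import numpy as np

def drop_incomplete(panel):
    """Baseline 1: keep only firms (rows) with every characteristic observed."""
    return panel[~np.isnan(panel).any(axis=1)]

def impute_cross_sectional_mean(panel):
    """Baseline 2: fill each gap with that characteristic's mean across firms."""
    col_means = np.nanmean(panel, axis=0)
    return np.where(np.isnan(panel), col_means, panel)

def mask_observed(panel, frac, rng):
    """Hide a random share of observed entries (an assumed, uniform masking rule).

    Returns the masked copy and a boolean array marking the hidden entries,
    whose true values can then serve as reconstruction targets.
    """
    observed = ~np.isnan(panel)
    hide = observed & (rng.random(panel.shape) < frac)
    masked = panel.copy()
    masked[hide] = np.nan
    return masked, hide

# Hypothetical one-month cross-section: 4 firms x 3 characteristics.
panel = np.array([
    [0.1, 0.5, np.nan],
    [0.3, np.nan, 0.2],
    [0.2, 0.4, 0.6],
    [np.nan, 0.9, 0.4],
])

complete_cases = drop_incomplete(panel)        # only one firm survives
filled = impute_cross_sectional_mean(panel)    # every firm in a column gets the same fill value

rng = np.random.default_rng(0)
masked, hide = mask_observed(panel, frac=0.25, rng=rng)
```

Even in this toy panel, dropping incomplete rows leaves a single firm, and mean imputation assigns every firm with a gap in a given characteristic the same value. Both limitations are the ones the abstract attributes to these approaches.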
Keywords: Machine Learning, Missing Data, Big Data, Risk Factors
JEL Classification: G10, G12, G14, C1, C55