How to Use Lexical Density of Company Filings
10 pages. Posted: 13 Sep 2021
Date Written: September 10, 2021
This paper analyzes the application of natural language processing (NLP) to 10-K and 10-Q company reports. Using the Brain Language Metrics on Company Filings (BLMCF) dataset, which tracks numerous language metrics for 10-K and 10-Q reports, we analyze three lexical metrics: lexical richness, lexical density, and specific density.
In simple terms, lexical richness measures how many unique words the author uses. The idea is that the more varied the author's vocabulary, the more complex the text. Lexical density measures the structure and complexity of human communication in a text; a high lexical density indicates a large share of information-carrying words. Lastly, specific density measures how dense the report's language is from a financial point of view, in other words, how many finance-related words are used in the text.
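The three metrics described above can be illustrated with a minimal Python sketch. It is not the BLMCF methodology, whose exact definitions are not given here. It assumes simple ratio definitions (unique words over total words, non-function words over total words, finance words over total words), and the word lists `FUNCTION_WORDS` and `FINANCE_WORDS` are tiny hypothetical stand-ins for real dictionaries.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only;
# they are NOT the dictionaries used by the BLMCF dataset.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "and", "or", "in", "on", "to", "is", "are",
    "was", "were", "by", "for", "with", "that", "this", "it", "as", "at",
    "from", "be", "our", "we",
}
FINANCE_WORDS = {
    "revenue", "debt", "equity", "liabilities", "assets", "cash",
    "earnings", "dividend", "interest", "income",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def language_metrics(text):
    """Return assumed ratio-based versions of the three metrics."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {"lexical_richness": 0.0, "lexical_density": 0.0,
                "specific_density": 0.0}
    unique = len(set(tokens))
    content = sum(1 for t in tokens if t not in FUNCTION_WORDS)
    finance = sum(1 for t in tokens if t in FINANCE_WORDS)
    return {
        "lexical_richness": unique / total,    # share of unique words
        "lexical_density": content / total,    # share of content words
        "specific_density": finance / total,   # share of finance words
    }


sample = "The company reported revenue and the company reduced debt"
print(language_metrics(sample))
```

A production version would more likely identify content words with a part-of-speech tagger and use a curated financial dictionary. The fixed word lists here only keep the sketch self-contained.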
Overall, this type of alternative data yields interesting results. Lexical richness produced the weakest results among our strategies when applied to an investment universe of 500 stocks, but its performance improved significantly when we expanded the universe to 3,000 stocks. Moreover, the strategies based on lexical density and specific density improved the Sharpe ratio even further.
In the last section, we combine the two metrics (lexical density and specific density) into one strategy. Applying both metrics to the 500-stock investment universe produces a Sharpe ratio of 0.688.
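For readers unfamiliar with the performance measure quoted above, the following sketch shows one common way to compute an annualized Sharpe ratio from a series of periodic strategy returns. The function name, the zero risk-free rate, and the monthly frequency (`periods_per_year=12`) are assumptions made for illustration. The abstract does not state how the reported figure was computed.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=12):
    """Annualized Sharpe ratio of periodic returns.

    risk_free_rate is per period; periods_per_year=12 assumes monthly
    returns. Both defaults are illustrative assumptions.
    """
    excess = [r - risk_free_rate for r in returns]
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


# Made-up monthly returns, used only to show the call.
example_returns = [0.01, 0.03, -0.01, 0.01]
print(sharpe_ratio(example_returns))
```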
Keywords: Alternative data, Artificial Intelligence, Natural language processing, 10-K & 10-Q reports, lexical richness, lexical density