Learning Word Embeddings from 10-K Filings for Financial NLP Tasks
10 Pages. Posted: 14 Nov 2019. Last revised: 5 Jun 2020
Date Written: September 5, 2019
In this paper, we generate word embeddings learned from a corpus of 10-K filings submitted by U.S. corporations to the SEC from 1993 to 2018, using a word2vec model implemented in PyTorch. Word embeddings learned from general corpora such as Google News and Wikipedia are readily available online for researchers to use in their models, but embeddings learned from 10-K filings are not publicly available. We publish the word embeddings learned from 10-K filings on GitHub so that other researchers can use them in NLP tasks on 10-K filings or other financial documents, such as document classification, document similarity, sentiment analysis, and readability scoring. We show that these learned embeddings can differentiate between the types of sentiment words in the widely used Loughran-McDonald word lists, and we report average similarity scores between those lists. We also present an application in which the learned embeddings are used to quantitatively track changes in 10-K documents over time.
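As a rough illustration of what an average similarity score between two word lists could look like, the sketch below computes the mean pairwise cosine similarity between two groups of words. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `average_cosine_similarity`, the 3-dimensional toy vectors, and the example words are all hypothetical placeholders, and the paper's actual embeddings and the Loughran-McDonald lists would be loaded in their place.

```python
import numpy as np


def average_cosine_similarity(embeddings, words_a, words_b):
    """Mean pairwise cosine similarity between two word lists.

    `embeddings` maps a word to its vector. Words missing from the
    mapping are skipped. (Hypothetical helper, not from the paper.)
    """
    a = np.array([embeddings[w] for w in words_a if w in embeddings], dtype=float)
    b = np.array([embeddings[w] for w in words_b if w in embeddings], dtype=float)
    if a.size == 0 or b.size == 0:
        raise ValueError("each word list needs at least one word with an embedding")
    # Normalize rows to unit length so dot products equal cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # a @ b.T holds every pairwise cosine similarity; average over all pairs.
    return float((a @ b.T).mean())


# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "loss": [1.0, 0.0, 0.0],
    "decline": [0.9, 0.1, 0.0],
    "gain": [0.0, 1.0, 0.0],
    "improve": [0.1, 0.9, 0.0],
}
negative_words = ["loss", "decline"]   # placeholder "negative" list
positive_words = ["gain", "improve"]   # placeholder "positive" list

within_negative = average_cosine_similarity(toy_embeddings, negative_words, negative_words)
across_lists = average_cosine_similarity(toy_embeddings, negative_words, positive_words)
print(f"within negative: {within_negative:.3f}, negative vs. positive: {across_lists:.3f}")
```

With embeddings that separate sentiment categories, the within-list average would be expected to exceed the cross-list average, as it does for these toy vectors. Note that the within-list score here includes each word's similarity with itself; excluding those self-pairs is an equally reasonable design choice.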
Keywords: 10-K, Word Embeddings, Word2Vec, Skip-Gram, Natural Language Processing (NLP), Machine Learning, Deep Learning, Neural Networks, PyTorch, t-SNE, Cosine Similarity, Amazon AWS, Quantitative Finance, Alternative Data, Trading Signals
JEL Classification: G1, G2, C45