Learning Word Embeddings from 10-K Filings Using PyTorch
9 Pages · Posted: 14 Nov 2019
Date Written: September 5, 2019
With the rise of alternative data in the search for trading signals, Natural Language Processing (NLP) on financial documents has gained significant importance in recent years. Word embeddings learned from a text corpus are one of the most important inputs to various NLP models, especially Deep Learning based models. In this paper, we generate word embeddings from the corpus of 10-K filings submitted by U.S. corporations to the SEC from 1993 to 2018, using a word2vec model implemented in PyTorch. Word embeddings learned from general corpora such as Google News articles and Wikipedia are readily available online for researchers to use in their models, but embeddings learned from 10-K filings are not publicly available. Using word embeddings learned from general text for NLP tasks on financial documents may not yield accurate results, as it has been shown that embeddings learned from domain-specific text yield better, more accurate results than general-purpose embeddings. We aim to publish the word embeddings learned from 10-K filings online so that other researchers can use them in NLP tasks such as document classification, document similarity, sentiment analysis, and readability indexing on 10-K filings or other financial documents.
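To illustrate the approach the abstract describes, the following is a minimal sketch of a skip-gram word2vec model in PyTorch. The toy corpus, window size, embedding dimension, and training setup are illustrative assumptions; the paper's actual preprocessing, vocabulary, and hyperparameters are not specified here.

```python
# Minimal skip-gram word2vec sketch in PyTorch.
# NOTE: the corpus, window size, and hyperparameters below are hypothetical
# stand-ins, not the configuration used in the paper.
import torch
import torch.nn as nn

# Toy corpus standing in for tokenized 10-K text (hypothetical).
corpus = "net income increased due to higher revenue and lower costs".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

# Build (center, context) training pairs with a context window of 2.
window = 2
pairs = []
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word2idx[w], word2idx[corpus[j]]))

embed_dim = 16

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)  # center-word vectors
        self.out = nn.Linear(dim, vocab_size)          # scores context words

    def forward(self, center):
        return self.out(self.in_embed(center))         # logits over vocabulary

model = SkipGram(len(vocab), embed_dim)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for _ in range(50):  # a few full-batch epochs on the toy data
    opt.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    opt.step()

# The input embedding matrix holds the learned word vectors.
embeddings = model.in_embed.weight.detach()
```

Once trained, rows of `embeddings` can be compared with cosine similarity (as the keywords suggest) to find semantically related terms in the filings vocabulary.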
Keywords: 10-K, Word Embeddings, Word2Vec, Skip-Gram, Natural Language Processing (NLP), Machine Learning, Deep Learning, Neural Networks, PyTorch, t-SNE, Cosine Similarity, Amazon AWS, Quantitative Finance, Alternative Data, Trading Signals
JEL Classification: G1, G2, C45