Learning Word Embeddings from 10-K Filings for Financial NLP Tasks

10 Pages. Posted: 14 Nov 2019. Last revised: 5 Jun 2020.

Saurabh Sehrawat

Stony Brook University - Department of Applied Mathematics & Statistics

Date Written: September 5, 2019

Abstract

In this paper, we generate word embeddings learned from a corpus of 10-K filings submitted by U.S. corporations to the SEC from 1993 to 2018, using a word2vec model implemented in PyTorch. Word embeddings learned from general corpora, such as articles from Google News and Wikipedia, are readily available online for researchers to use in their models, but embeddings learned from 10-K filings are not publicly available. We publish the word embeddings learned from 10-K filings on GitHub so that other researchers can use them in NLP tasks on 10-K filings or other financial documents, such as document classification, document similarity, sentiment analysis, and readability indices. We show that these learned embeddings can differentiate between the different types of sentiment words in the widely used Loughran-McDonald word lists, and we generate average similarity scores between those types. We also present an application in which the learned embeddings are used to quantitatively track changes in 10-K documents.
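As an illustration of the two applications mentioned in the abstract, the following is a minimal sketch and not the paper's implementation. It assumes that embeddings are available as a plain dictionary from word to vector, that "average similarity" means the mean pairwise cosine similarity between two word lists, and that a document is represented by the average of its word vectors. The function names and the toy three-dimensional vectors are hypothetical and chosen only for demonstration.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(words_a, words_b, embeddings):
    """Mean pairwise cosine similarity between two word lists.

    Out-of-vocabulary words and identical word pairs are skipped
    (an assumption made for this sketch).
    """
    scores = [
        cosine(embeddings[a], embeddings[b])
        for a, b in product(words_a, words_b)
        if a in embeddings and b in embeddings and a != b
    ]
    if not scores:
        raise ValueError("no comparable word pairs found")
    return sum(scores) / len(scores)


def document_vector(tokens, embeddings):
    """Represent a document as the average of its in-vocabulary word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no in-vocabulary tokens found")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def document_change(tokens_old, tokens_new, embeddings):
    """Cosine distance between two document vectors; 0 means no change."""
    old_vec = document_vector(tokens_old, embeddings)
    new_vec = document_vector(tokens_new, embeddings)
    return 1.0 - cosine(old_vec, new_vec)


# Hypothetical toy embeddings; real vectors would come from the trained model.
toy_embeddings = {
    "loss": [0.9, 0.1, 0.0],
    "impairment": [0.8, 0.2, 0.1],
    "gain": [0.1, 0.9, 0.0],
    "improve": [0.2, 0.8, 0.1],
}
negative_words = ["loss", "impairment"]
positive_words = ["gain", "improve"]

within_negative = average_similarity(negative_words, negative_words, toy_embeddings)
across_lists = average_similarity(negative_words, positive_words, toy_embeddings)
change = document_change(["loss", "impairment"], ["loss", "gain"], toy_embeddings)

print(f"within negative list: {within_negative:.3f}")
print(f"negative vs positive: {across_lists:.3f}")
print(f"document change:      {change:.3f}")
```

With these toy vectors, words within the same sentiment list score as more similar to each other than words across lists, which is the kind of separation the abstract describes. In practice the dictionary would be loaded from the published embeddings, and the year-over-year change score could be computed for consecutive filings from the same company.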

Keywords: 10-K, Word Embeddings, Word2Vec, Skip-Gram, Natural Language Processing (NLP), Machine Learning, Deep Learning, Neural Networks, PyTorch, t-SNE, Cosine Similarity, Amazon AWS, Quantitative Finance, Alternative Data, Trading Signals

JEL Classification: G1, G2, C45

Suggested Citation

Sehrawat, Saurabh, Learning Word Embeddings from 10-K Filings for Financial NLP Tasks (September 5, 2019). Available at SSRN: https://ssrn.com/abstract=3480902 or http://dx.doi.org/10.2139/ssrn.3480902

Saurabh Sehrawat (Contact Author)

Stony Brook University - Department of Applied Mathematics & Statistics

Stony Brook University
Stony Brook, NY 11794
United States
