Learning Word Embeddings from 10-K Filings for Financial NLP Tasks

10 Pages. Posted: 14 Nov 2019. Last revised: 5 Jun 2020

Saurabh Sehrawat

Stony Brook University - Department of Applied Mathematics & Statistics

Date Written: September 5, 2019

Abstract

In this paper, we generate word embeddings learned from a corpus of 10-K filings submitted by U.S. corporations to the SEC from 1993 to 2018, using a word2vec model implemented in PyTorch. Word embeddings learned from general corpora such as Google News and Wikipedia articles are readily available online for researchers to use in their models, but embeddings learned from 10-K filings are not publicly available. We publish the word embeddings learned from 10-K filings on GitHub so that other researchers can use them in NLP tasks on 10-K filings or other financial documents, such as document classification, document similarity, sentiment analysis, and readability scoring. We show that these learned embeddings can differentiate between the types of sentiment words in the widely used Loughran-McDonald word lists, and we compute average similarity scores between those lists. We also present an application in which the learned embeddings are used to quantitatively track changes in 10-K documents over time.
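The two applications mentioned in the abstract, average similarity between sentiment word lists and tracking changes between filings, can be illustrated with a minimal sketch. This is not the author's code: the function names, the three-dimensional toy vectors, and the choice of averaged word vectors with cosine distance as a change score are all assumptions made for illustration. Real embeddings would be loaded from the published files and would have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_similarity(words_a, words_b, embeddings):
    """Mean pairwise cosine similarity between two word lists.

    Out-of-vocabulary words and identical word pairs are skipped.
    """
    scores = [
        cosine(embeddings[a], embeddings[b])
        for a in words_a if a in embeddings
        for b in words_b if b in embeddings
        if a != b
    ]
    return sum(scores) / len(scores) if scores else float("nan")


def document_vector(tokens, embeddings):
    """Average the embeddings of all in-vocabulary tokens in a document."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no in-vocabulary tokens in document")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def change_score(tokens_old, tokens_new, embeddings):
    """Cosine distance between two documents' averaged embeddings (assumed metric)."""
    old_vec = document_vector(tokens_old, embeddings)
    new_vec = document_vector(tokens_new, embeddings)
    return 1.0 - cosine(old_vec, new_vec)


# Toy vectors, invented for illustration only (not taken from the paper).
toy_embeddings = {
    "loss": [0.9, 0.1, 0.0],
    "impairment": [0.8, 0.2, 0.1],
    "gain": [-0.7, 0.3, 0.1],
    "improve": [-0.8, 0.2, 0.0],
}
negative_words = ["loss", "impairment"]
positive_words = ["gain", "improve"]

within_negative = average_similarity(negative_words, negative_words, toy_embeddings)
across_lists = average_similarity(negative_words, positive_words, toy_embeddings)
filing_change = change_score(["loss", "impairment"], ["loss", "gain"], toy_embeddings)
unchanged = change_score(["loss", "gain"], ["loss", "gain"], toy_embeddings)
```

With these toy vectors, words from the same list score as more similar to each other than to words from the opposite list, and an unchanged filing receives a change score of approximately zero.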

Keywords: 10-K, Word Embeddings, Word2Vec, Skip-Gram, Natural Language Processing (NLP), Machine Learning, Deep Learning, Neural Networks, PyTorch, t-SNE, Cosine Similarity, Amazon AWS, Quantitative Finance, Alternative Data, Trading Signals

JEL Classification: G1, G2, C45

Suggested Citation

Sehrawat, Saurabh, Learning Word Embeddings from 10-K Filings for Financial NLP Tasks (September 5, 2019). Available at SSRN: https://ssrn.com/abstract=3480902 or http://dx.doi.org/10.2139/ssrn.3480902

Saurabh Sehrawat (Contact Author)

Stony Brook University - Department of Applied Mathematics & Statistics

Stony Brook University
Stony Brook, NY 11794
United States
