Exploring Machine Learning Techniques for Text-Based Industry Classification
27 Pages Posted: 27 Jul 2020
Date Written: June 29, 2020
This project aims to develop an effective machine learning approach to text-based industry classification. We explore various word embedding schemes and clustering algorithms for this task. BERT, word2vec, doc2vec, and latent semantic indexing are used for word embedding, while greedy cosine-similarity, k-means, Gaussian mixture model, and deep embedding for clustering serve as the clustering algorithms. We present our results for companies listed in the US and Chinese markets.
Keywords: Text-based industry classification, BERT, word2vec, doc2vec, latent semantic indexing, cosine similarity, k-means, Gaussian mixture model, deep embedding for clustering
JEL Classification: C38, C45
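
The embed-then-cluster pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain bag-of-words term counts stand in for the learned embeddings (BERT, word2vec, doc2vec, latent semantic indexing), and the greedy rule used here (join the first cluster whose seed document is similar enough, otherwise start a new cluster) is only one assumed reading of "greedy cosine-similarity". The company descriptions, the function names, and the 0.3 threshold are all hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts; an assumed stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(vectors, threshold=0.3):
    """Assign each vector to the first cluster whose seed is similar enough.

    Assumed greedy rule: compare against each cluster's seed (its first
    member) in order; if no seed reaches the threshold, open a new cluster.
    """
    seeds = []
    labels = []
    for vec in vectors:
        for k, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(k)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Hypothetical company descriptions, for illustration only.
descriptions = [
    "commercial bank offering loans and deposits",
    "regional bank providing loans and mortgages",
    "software company selling cloud software",
    "cloud software and data analytics provider",
]
labels = greedy_cluster([tf_vector(d) for d in descriptions])
print(labels)
```

In this toy setup the two bank descriptions share one cluster label and the two software descriptions share another. Swapping `tf_vector` for a real embedding model, or `greedy_cluster` for k-means or a Gaussian mixture model, would change only the corresponding step of the pipeline.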