Exploring Machine Learning Techniques for Text-Based Industry Classification

27 Pages Posted: 27 Jul 2020

Haocheng Gao

affiliation not provided to SSRN

Junjie He

affiliation not provided to SSRN

Kan Chen

RMI, National University of Singapore

Date Written: June 29, 2020

Abstract

This project aims to develop an effective machine learning method for text-based industry classification. We explore combinations of text embedding schemes and clustering algorithms for this task. BERT, word2vec, doc2vec, and latent semantic indexing are used for embedding, while greedy cosine-similarity clustering, k-means, Gaussian mixture models, and deep embedding for clustering serve as the clustering algorithms. We present our results for companies listed in the US and Chinese markets.
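To make the embed-then-cluster idea concrete, the following is a minimal sketch and not the authors' pipeline. It assumes plain TF-IDF vectors as a simple stand-in for the embeddings named above, and a basic k-means as the clustering step. The company descriptions, the function names (`tfidf_vectors`, `kmeans`), and the farthest-point initialization are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return L2-normalized TF-IDF vectors (lists of floats) for the texts."""
    docs = [Counter(t.lower().split()) for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Smoothed inverse document frequency: rarer words get more weight.
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
           for w in vocab}
    vectors = []
    for d in docs:
        v = [d[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Lloyd's k-means with a deterministic farthest-point initialization
    (an assumption made here so the sketch is reproducible)."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(v, centers[j]))
                  for v in vectors]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical business descriptions: three banks, three software firms.
descriptions = [
    "commercial bank offering loans deposits mortgage lending",
    "regional bank taking deposits issuing loans credit cards",
    "retail bank mortgage lending deposits loans",
    "enterprise software company selling cloud computing platforms",
    "cloud software developer building data analytics platforms",
    "software firm providing cloud data storage computing",
]
labels = kmeans(tfidf_vectors(descriptions), k=2)
print(labels)
```

On this toy input the bank descriptions and the software descriptions land in separate clusters. Swapping `tfidf_vectors` for a learned embedding, or `kmeans` for another clustering algorithm, would follow the same two-stage shape.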

Keywords: Text-based industry classification, BERT, word2vec, doc2vec, latent semantic indexing, cosine similarity, k-means, Gaussian mixture model, deep embedding for clustering

JEL Classification: C38, C45

Suggested Citation

Gao, Haocheng and He, Junjie and Chen, Kan, Exploring Machine Learning Techniques for Text-Based Industry Classification (June 29, 2020). Available at SSRN: https://ssrn.com/abstract=3640205 or http://dx.doi.org/10.2139/ssrn.3640205

Kan Chen (Contact Author)

RMI, National University of Singapore

21 Heng Mui Keng Terrace
Level 4
Singapore, 119613
Singapore
