Text Mining Using N-Grams

Schonlau, M., Guenther, N., Sucholutsky, I. Text mining using n-gram variables. The Stata Journal. Dec 2017, 17(4), 866-881.

14 Pages. Posted: 7 Apr 2016. Last revised: 17 Jun 2020.


Matthias Schonlau

University of Waterloo

Nick Guenther

University of Waterloo

Date Written: April 5, 2016

Abstract

Text mining is the art of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the Stata command ngram, which implements the most common approach to text mining, the "bag of words". An n-gram is a contiguous sequence of n words in a text. Broadly speaking, ngram creates hundreds or thousands of variables, each recording how often the corresponding n-gram occurs in a given text. This is more useful than it sounds. The ngram command is illustrated by categorizing text answers to two open-ended questions.
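To make the idea of n-gram variables concrete, here is a minimal sketch in Python rather than Stata. It is not the ngram command or its implementation: the function names `ngrams` and `ngram_counts`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration. It only shows how each distinct n-gram can become one count variable per text.

```python
from collections import Counter


def ngrams(text, n):
    """Return the contiguous n-word sequences in text.

    Assumption: words are lowercased and split on whitespace only.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(texts, max_n=2):
    """Build one count column per distinct n-gram (n = 1..max_n).

    Returns the sorted list of n-grams (the "variables") and one row of
    counts per text. Hypothetical helper, not part of any Stata package.
    """
    per_text = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(t, n))
        for t in texts
    ]
    vocabulary = sorted(set().union(*per_text))
    # A Counter returns 0 for n-grams that a text does not contain.
    rows = [[counts[g] for g in vocabulary] for counts in per_text]
    return vocabulary, rows


texts = ["the cat sat", "the cat ate the fish"]
vocabulary, rows = ngram_counts(texts)
for text, row in zip(texts, rows):
    print(text, dict(zip(vocabulary, row)))
```

Even these two short texts yield ten variables once unigrams and bigrams are combined, which suggests why real free-text answers quickly produce hundreds or thousands of columns.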

Keywords: Stata, n-gram, text mining

Suggested Citation

Schonlau, Matthias and Guenther, Nick, Text Mining Using N-Grams (April 5, 2016). Schonlau, M., Guenther, N., Sucholutsky, I. Text mining using n-gram variables. The Stata Journal. Dec 2017, 17(4), 866-881. Available at SSRN: https://ssrn.com/abstract=2759033 or http://dx.doi.org/10.2139/ssrn.2759033

Matthias Schonlau (Contact Author)

University of Waterloo

Waterloo, Ontario
Canada

Nick Guenther

University of Waterloo

Waterloo, Ontario N2L 3G1
Canada
