Looking Under the Hood of Stochastic Machine Learning Algorithms for Parts of Speech Tagging

Diesner, Jana; Carley, Kathleen M.

doi:10.2139/ssrn.2726830

35 Pages Posted: 3 Feb 2016

Jana Diesner

University of Illinois at Urbana-Champaign

Kathleen M. Carley

Carnegie Mellon University; Carnegie Mellon University - H. John Heinz III School of Public Policy and Management; Institute for Software Research - Carnegie Mellon University

Date Written: July 1, 2008

Abstract

A variety of Natural Language Processing and Information Extraction tasks, such as question answering and named entity recognition, can benefit from precise knowledge about a word’s syntactic category or Part of Speech (POS) (Church, 1988; Rabiner, 1989; Stolz, Tannenbaum, & Carstensen, 1965). POS taggers are widely used to assign a single best POS to every word in text data, with stochastic approaches achieving accuracy rates of 96% to 97% (Jurafsky & Martin, 2000). Building a POS tagger requires a series of design decisions, some of which significantly affect the accuracy and other performance aspects of the resulting engine. However, the documentation of POS taggers often leaves these decisions implicit. In this paper we provide an overview of some of these decisions and empirically determine their impact on POS tagging accuracy. The resulting insights should be valuable to anyone who wants to design, implement, modify, fine-tune, integrate, or responsibly use a POS tagger. We drew on the results presented herein when building a POS tagger and integrating it into AutoMap, a tool that facilitates relation extraction from texts, both as a stand-alone feature and as an auxiliary feature for other tasks.
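To make the idea of assigning "a single best POS to every word" concrete, the following is a minimal sketch of Viterbi decoding over a bigram Hidden Markov Model, the two techniques named in the keywords. It is an illustration under stated assumptions and not the tagger described in the paper: the function name `viterbi`, the `floor` smoothing constant, the three-tag set, and all probabilities in the toy model are invented for this example.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    Probabilities are looked up in plain dicts; missing entries are treated
    as `floor` (a crude stand-in for smoothing, assumed for this sketch).
    """
    if not words:
        return []

    def lp(p):
        # Work in log space to avoid underflow on long sentences.
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best path that ends in tag t at word i
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model: every number below is made up for illustration.
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},
    "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
    "VB": {"DT": 0.50, "NN": 0.40, "VB": 0.10},
}
EMIT = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VB": {"dog": 0.05, "barks": 0.6},
}

print(viterbi("the dog barks".split(), TAGS, START, TRANS, EMIT))
# → ['DT', 'NN', 'VB']
```

Even this small sketch exposes choices of the kind the abstract calls implicit design decisions, for example how unseen words are scored (here a fixed `floor`) and whether the model conditions on one previous tag or more.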

Keywords: Part of Speech Tagging, Hidden Markov Models, Viterbi Algorithm, AutoMap

Suggested Citation:

Diesner, Jana and Carley, Kathleen M., Looking Under the Hood of Stochastic Machine Learning Algorithms for Parts of Speech Tagging (July 1, 2008). Available at SSRN: https://ssrn.com/abstract=2726830 or http://dx.doi.org/10.2139/ssrn.2726830