Extracting Information from Large Digital Corpora - A Case Study in Quantitative Methods in Linguistics

Primenjena lingvistika, No. 10, pp. 103–113, 2009

11 Pages Posted: 16 Aug 2013

Nikola Dobric

Alpen-Adria-University Klagenfurt - Institut für Anglistik und Amerikanistik

Date Written: 2009

Abstract

Empirical research into language should not only be based on concrete data; its results also have to be verified as scientifically relevant. In all scientific fields, including linguistics, such verification is performed by applying a selection of statistical tests to examine the significance, distribution and variation of the obtained results. The first part of the paper presents the procedures needed to extract linguistic data from corpora. The corpus used serves as an illustration and was compiled from electronic editions of Dostoyevsky’s Crime and Punishment in Russian, English, Serbian and German. The paper then presents the most common quantitative methods used in language analysis, each illustrated by examples from the small study conducted here, in order to make the complicated statistical calculations understandable and readily usable. Finally, the importance of statistics in modern language research is emphasized.

Keywords: corpus, statistics, distribution, variation, significance

Suggested Citation

Dobric, Nikola, Extracting Information from Large Digital Corpora - A Case Study in Quantitative Methods in Linguistics (2009). Primenjena lingvistika, No. 10, pp. 103–113, 2009. Available at SSRN: https://ssrn.com/abstract=2309983

Nikola Dobric (Contact Author)

Alpen-Adria-University Klagenfurt - Institut für Anglistik und Amerikanistik

Universitätsstrasse 65-67
Klagenfurt, Carinthia 9020
Austria

HOME PAGE: http://www.uni-klu.ac.at/iaa/inhalt/2512.htm
