Extracting Information from Large Digital Corpora - A Case Study in Quantitative Methods in Linguistics
Primenjena lingvistika, No. 10, pp. 103–113, 2009
11 Pages Posted: 16 Aug 2013
Date Written: 2009
Empirical research into language should not only be based on concrete data; its results must also be verified as scientifically relevant. In all scientific areas, including linguistics, such verification is performed by applying a selection of statistical tests to examine the significance, distribution and variation of the research results. The first part of the paper presents the procedures required when extracting linguistic data from corpora. The corpus used is illustrative and was created from an electronic edition of Dostoyevsky’s Crime and Punishment (in Russian, English, Serbian and German). The paper then presents the most common quantitative methods used in language analysis, each illustrated by examples from the small study in this paper, in order to make the complicated statistical calculations understandable and readily usable. Finally, the importance of linguistic statistics in modern language research is emphasized.
Keywords: corpus, statistics, distribution, variation, significance