A Statistical Analysis of Text: Embeddings, Properties and Time-Series Modeling

Posted: 11 Feb 2021

Ioannis Chalkiadakis

Heriot-Watt University - Department of Computer Science

Gareth Peters

University College London - Department of Statistical Science; University of California Santa Barbara; University of Oxford - Oxford-Man Institute of Quantitative Finance; London School of Economics & Political Science (LSE) - Systemic Risk Centre; University of New South Wales (UNSW) - Faculty of Science; Macquarie University - Department of Actuarial Studies and Business Analytics

Mike J. Chantler

Heriot-Watt University - Department of Computer Science

Ioannis Konstas

Heriot-Watt University - Department of Computer Science

Date Written: December 8, 2020

Abstract

Natural Language Processing (NLP) originated with linguists and the computational linguistics community and has evolved into a multi-disciplinary field that draws on mathematics, statistics and, more recently, data science and machine learning. This expansion has been accompanied by a shift in the popularity of the applied methodologies, driven by the success of the data- and computation-intensive 'black-box' models developed mainly by the industry behind chatbots and conversational agents. These approaches, however, tend to focus on model performance on end tasks and on the size of the available datasets, and favour increasing model complexity rather than understanding the statistical structure in the data and how best to extract it.

In this research, we take a step back from these approaches and return to the fundamental questions of what constitutes structure in text, how we can extract it efficiently, and how we can eventually leverage it to perform modeling and statistical analysis with a clear interpretation of the outcomes.

We begin by constructing text stochastic embeddings that preserve the key features of natural language: temporal dependence, semantics, and the laws of grammar and syntax. We then construct a sequence of statistical process summaries that we use to study the resulting text time-series embeddings with regard to long memory and its multifractal extension, stationarity, and behaviour at the extremes. To deepen our understanding of the behaviour of these time-series and text properties, we then ask whether the processes realised by the embeddings are statistically different in some formal manner. For this purpose, we apply a specialised inference procedure that allows one to test, using a finite number of samples from two processes, whether they contain common information, i.e., whether they are the same process.
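The long-memory analysis mentioned above is commonly summarised by a Hurst exponent. As an illustration only (this is not the paper's implementation, and the window choices are arbitrary), a rescaled-range (R/S) estimate can be sketched as follows: the exponent is the slope of log(R/S) against log(window size), with values near 0.5 indicating no memory and values approaching 1 indicating strong persistence.

```python
import numpy as np

def hurst_rs(increments, min_window=10, n_scales=10):
    """Estimate the Hurst exponent of an increment series via
    rescaled-range (R/S) analysis: slope of log(R/S) vs log(window)."""
    x = np.asarray(increments, dtype=float)
    N = len(x)
    windows = np.unique(np.geomspace(min_window, N // 2, n_scales).astype(int))
    rs = []
    for n in windows:
        ratios = []
        for start in range(0, N - n + 1, n):      # non-overlapping windows
            seg = x[start:start + n]
            dev = np.cumsum(seg - seg.mean())     # cumulative deviation
            r = dev.max() - dev.min()             # range of the deviation
            s = seg.std()                         # scale of the segment
            if s > 0:
                ratios.append(r / s)
        rs.append(np.mean(ratios))
    slope, _ = np.polyfit(np.log(windows), np.log(rs), 1)
    return slope

rng = np.random.default_rng(42)
noise = rng.standard_normal(4000)
h_noise = hurst_rs(noise)           # close to 0.5: no memory
h_walk = hurst_rs(np.cumsum(noise)) # strongly persistent input: much larger
```

In practice such an estimate would be computed on a text-derived series, e.g. per-window word counts from one of the embeddings, and finite-sample bias would be corrected before drawing conclusions.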

Our main contributions are: 1) a common framework of $N$-ary relations used to construct all our target embeddings: i) a time-series representation of the well-known Bag-of-Words (BoW) model to capture the frequency characteristics of text processing units, ii) a tree-valued time-series to capture the grammatical structure of sentences, iii) a graph-valued time-series to capture the syntactic structure of sentences, and iv) a combination of BoW and syntax to capture information in word co-occurrence frequencies restricted to particular syntactic structures; 2) a novel application of Brownian bridges in the context of NLP; 3) an extension of Item-Response and log-linear models for contingency tables of word counts to a stochastic formulation that captures the specificities of text identified by our analysis. Specifically, we employ a Multiple-Output Gaussian Process in the intensity of a Poisson regression model. The structure of the Gaussian Process covariance accounts for the temporal dependence in word selection, and formulating the covariance with multiple kernels allows us to measure the contribution of each of the embeddings we have constructed, and hence draw conclusions on their representational power.
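The generative structure of such a model can be sketched in a few lines: a latent log-intensity drawn from a Gaussian Process whose covariance is a sum of kernels (one per embedding channel), driving Poisson-distributed word counts. This is a minimal single-output illustration under assumed squared-exponential kernels and made-up lengthscales, not the paper's Multiple-Output formulation; the per-kernel variance terms play the role of the channel contributions discussed above.

```python
import numpy as np

def rbf_kernel(t, lengthscale, variance):
    """Squared-exponential kernel evaluated on a 1-D time index."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)

# One kernel per (hypothetical) embedding channel; a sum of valid kernels
# is again a valid covariance, and each variance parameter measures how
# much that channel contributes to the latent intensity.
K = (rbf_kernel(t, lengthscale=5.0, variance=0.3)    # e.g. BoW frequency channel
     + rbf_kernel(t, lengthscale=20.0, variance=0.2) # e.g. syntax channel
     + 1e-6 * np.eye(len(t)))                        # jitter for numerical stability

# Latent log-intensity f ~ GP(0, K); observed counts y ~ Poisson(exp(f)).
f = rng.multivariate_normal(np.zeros(len(t)), K)
y = rng.poisson(np.exp(f))
```

Fitting the variance parameters of each kernel to observed word counts (by maximising the Poisson likelihood) would then quantify the representational power of each embedding.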

Keywords: natural language, text processing, long memory, persistence, multifractal time-series, Brownian bridge, Multiple-Output Gaussian Processes, item-response models, contingency tables

Suggested Citation

Chalkiadakis, Ioannis and Peters, Gareth and Chantler, Michael John and Konstas, Ioannis, A Statistical Analysis of Text: Embeddings, Properties and Time-Series Modeling (December 8, 2020). Available at SSRN: https://ssrn.com/abstract=3742085

Ioannis Chalkiadakis (Contact Author)

Heriot-Watt University - Department of Computer Science ( email )

Edinburgh
United Kingdom

Gareth Peters

University College London - Department of Statistical Science ( email )

1-19 Torrington Place
London, WC1 7HB
United Kingdom

University of California Santa Barbara ( email )

Santa Barbara, CA 93106
United States

University of Oxford - Oxford-Man Institute of Quantitative Finance ( email )

University of Oxford Eagle House
Walton Well Road
Oxford, OX2 6ED
United Kingdom

London School of Economics & Political Science (LSE) - Systemic Risk Centre ( email )

Houghton St
London
United Kingdom

University of New South Wales (UNSW) - Faculty of Science ( email )

Australia

Macquarie University - Department of Actuarial Studies and Business Analytics ( email )

Australia

Michael John Chantler

Heriot-Watt University - Department of Computer Science

Edinburgh
United Kingdom

Ioannis Konstas

Heriot-Watt University - Department of Computer Science ( email )

Edinburgh
United Kingdom
