A Stochastic Gaussian Process-Driven Model for Text Data
Posted: 11 Feb 2021 Last revised: 16 Aug 2023
Date Written: December 8, 2020
Abstract
Natural Language Processing (NLP) has become a multi-disciplinary field that draws on mathematics, statistics, and, more recently, data science and machine learning. Dominant approaches include the data- and computation-intensive 'black-box' models developed mainly by the industry behind chatbots and conversational agents. These approaches, however, tend to focus on end-task performance and the size of the available datasets, and favour increasing model complexity over understanding the statistical structure in the data and the optimal way to extract it.
In this research, we step back from these approaches and return to the fundamental questions of what constitutes structure in text, how we can extract it efficiently, and how we can ultimately leverage it to perform modelling and statistical analysis with a clear interpretation of the outcomes.
Building on our previous work on stochastic text embeddings that preserve the key features of natural language (temporal dependence, semantics, and the laws of grammar and syntax), we significantly extend models for text data that are prevalent in the political and social sciences.
Specifically, we extend item-response and log-linear models for contingency tables of word counts into a stochastic formulation that captures the specificities of text identified by our text time-series analysis. We employ a Multiple-Output Gaussian Process as the stochastic driver, whose covariance structure accounts for the temporal dependence in word selection and sentence structure. The way we formulate this covariance allows us to measure the contribution of each of the text embeddings we have constructed and, hence, to draw conclusions about their representational power.
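The abstract does not specify how the multiple-output covariance is constructed, so as a hedged illustration only: a common way to build a Multiple-Output Gaussian Process whose covariance attributes variance to each output is the intrinsic coregionalization model, K = B ⊗ K_time, where the diagonal of B gives each output's (here, each embedding's) marginal contribution. All names and parameters below are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def rbf_kernel(t, lengthscale=1.0):
    """Squared-exponential kernel matrix over a 1-D array of time points.

    Encodes temporal dependence: nearby time points are highly correlated.
    """
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def lmc_covariance(t, W, lengthscale=1.0, jitter=1e-6):
    """Intrinsic coregionalization covariance K = B kron K_time.

    W is a (n_outputs, rank) mixing matrix; B = W W^T couples the outputs
    (illustrative stand-in for the paper's embedding-level covariance).
    A small jitter term keeps the matrix numerically positive definite.
    """
    B = W @ W.T                      # (n_outputs, n_outputs) output coupling
    K_time = rbf_kernel(t, lengthscale)
    K = np.kron(B, K_time)           # full multi-output covariance
    return K + jitter * np.eye(K.shape[0])

# Toy setup: 3 outputs (e.g. 3 embedding series) observed at 5 time points.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 5)
W = rng.standard_normal((3, 2))
K = lmc_covariance(t, W)

# diag(B) measures each output's marginal variance -- a simple proxy for
# "contribution of each embedding" in this illustrative construction.
contributions = np.diag(W @ W.T)

# One joint draw of all 3 outputs over the 5 time points (stacked as 15-vector).
sample = rng.multivariate_normal(np.zeros(K.shape[0]), K)
```

Note the design choice: because K factorises as a Kronecker product, the per-output contributions separate cleanly from the shared temporal kernel, which is what makes the attribution readable.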
We illustrate our methodology by analysing the discourse of US presidential speeches over the past 50 years.
Keywords: natural language, text processing, long memory, persistence, multifractal time-series, Brownian bridge, Multiple-Output Gaussian Processes, item-response models, contingency tables