A Stochastic Gaussian Process-driven model for Text Data

Posted: 11 Feb 2021 Last revised: 16 Aug 2023

Ioannis Chalkiadakis

Institut des Systèmes Complexes de Paris Île-de-France / CNRS - UAR 3611

Gareth Peters

University of California, Santa Barbara

Date Written: December 8, 2020

Abstract

Natural Language Processing (NLP) has become a multi-disciplinary field that draws on mathematics, statistics and, more recently, data science and machine learning. Dominant approaches include data- and computation-intensive 'black-box' models, developed mainly by the industry behind chatbots and conversational agents. These approaches, however, tend to focus on end-task performance and dataset size, and favour increasing model complexity over understanding the statistical structure in the data and how best to extract it.

In this research, we take a step back from these approaches and return to the fundamental question of what constitutes structure in text, how we can extract it in an efficient way, and how we can eventually leverage it to perform modelling and statistical analysis with a clear interpretation of the outcomes.

Building on our previous work on stochastic text embeddings that preserve the key features of natural language (temporal dependence, semantics, laws of grammar and syntax), we significantly extend models for text data that are prevalent in the political and social sciences.

Specifically, we extend Item-Response and log-linear models for contingency tables of word counts into a stochastic formulation that captures the text-specific features identified in our text time-series analysis. We employ a Multiple-Output Gaussian Process as a stochastic driver, whose covariance structure accounts for the temporal dependence in word selection and sentence structure. Our formulation of the covariance lets us measure the contribution of each text embedding we have constructed and, hence, draw conclusions about its representational power.
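
To make the idea of a Gaussian-process driver for word counts concrete, the following is a minimal sketch and not the model of the paper, whose covariance is not specified in this abstract. It assumes a squared-exponential kernel per embedding, a weighted sum of those kernels, a coregionalisation matrix coupling the outputs, and a Poisson log-linear likelihood. All names (`rbf_kernel`, `combined_covariance`, `weights`, `coreg`) and all numerical values are hypothetical.

```python
import numpy as np


def rbf_kernel(x, lengthscale=1.0):
    """Squared-exponential kernel matrix for a 1-D sequence x (assumed kernel)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def combined_covariance(embeddings, weights, coreg):
    """Weighted sum of per-embedding kernels, coupled across outputs.

    The Kronecker product with `coreg` is one standard way to build a
    multiple-output GP covariance; it is an assumption made for this sketch.
    """
    k_shared = sum(w * rbf_kernel(e) for w, e in zip(weights, embeddings))
    return np.kron(coreg, k_shared)


rng = np.random.default_rng(0)
n_steps, n_outputs = 20, 3

# Two synthetic embedding time series standing in for real text embeddings.
embeddings = [rng.standard_normal(n_steps).cumsum() for _ in range(2)]
weights = np.array([0.7, 0.3])  # hypothetical per-embedding weights

# Random positive semi-definite matrix coupling the output dimensions.
a = rng.standard_normal((n_outputs, n_outputs))
coreg = a @ a.T

K = combined_covariance(embeddings, weights, coreg)

# Draw the latent driver through an eigendecomposition, clipping tiny
# negative eigenvalues that arise from round-off.
evals, evecs = np.linalg.eigh(K)
z = rng.standard_normal(K.shape[0])
f = (evecs @ (np.sqrt(np.clip(evals, 0.0, None)) * z)).reshape(n_outputs, n_steps)

# Log-linear link: Poisson word counts whose log-rate is driven by the GP.
counts = rng.poisson(np.exp(1.0 + f))

# Share of the covariance attributed to each embedding under this weighting.
contribution = weights / weights.sum()
```

In this sketch the normalised weights play the role of a per-embedding contribution measure; in practice they would be estimated from data rather than fixed by hand.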

We illustrate our methodology by analysing the discourse of US presidential speeches over the past 50 years.

Keywords: natural language, text processing, long memory, persistence, multifractal time-series, Brownian bridge, Multiple-Output Gaussian Processes, item-response models, contingency tables

Suggested Citation

Chalkiadakis, Ioannis and Peters, Gareth, A Stochastic Gaussian Process-driven model for Text Data (December 8, 2020). Available at SSRN: https://ssrn.com/abstract=3742085

Ioannis Chalkiadakis (Contact Author)

Institut des Systèmes Complexes de Paris Île-de-France / CNRS - UAR 3611

113 Rue Nationale
Paris, 75013
France

HOME PAGE: http://www.iscpif.fr/

Gareth Peters

University of California, Santa Barbara

Santa Barbara, CA 93106
United States
