A Statistical Analysis of Text: Embeddings, Properties and Time-Series Modeling
Posted: 11 Feb 2021 Last revised: 17 Aug 2022
Date Written: December 8, 2020
Natural Language Processing (NLP) originated with linguists and the computational linguistics community and has evolved into a multi-disciplinary field that draws on mathematics, statistics, and, more recently, data science and machine learning. Alongside this expansion, the popularity of the applied methodologies has shifted, driven by the success of data- and computation-intensive 'black-box' models developed mainly by the industry behind chatbots and conversational agents. These approaches, however, tend to focus on model performance on end tasks and on the size of the available datasets, and favour increasing model complexity over understanding the statistical structure in the data and the optimal way to extract it.
In this research, we take a step back from these approaches and return to the fundamental questions of what constitutes structure in text, how we can extract it efficiently, and how we can eventually leverage it to perform modeling and statistical analysis with a clear interpretation of the outcomes.
We begin by constructing stochastic text embeddings that preserve the key features of natural language: temporal dependence, semantics, and the laws of grammar and syntax. We then construct a sequence of statistical process summaries that we use to study the resulting text time-series embeddings with regard to long memory and its multifractal extension, stationarity, and behaviour at the extremes. To deepen our understanding of the time-series behaviour and text properties, we then ask whether the processes realised by the embeddings differ in a statistically formal sense. For this purpose, we apply a specialised inference procedure that tests, using a finite number of samples from two processes, whether they contain common information, i.e., whether they are the same process.
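To make the pipeline concrete, the following is a minimal NumPy sketch of two of the steps described above: turning a token stream into a count-valued (Bag-of-Words-style) time series, and probing that series for long memory via a rescaled-range (R/S) estimate of the Hurst exponent. The windowing scheme, window size, and R/S estimator are illustrative assumptions, not the paper's exact construction or inference procedure.

```python
import numpy as np

def bow_time_series(tokens, vocab, window=50):
    """Slide a fixed-size window over the token stream and count
    occurrences of each vocabulary word, giving a count-valued
    time series with one dimension per word."""
    index = {w: i for i, w in enumerate(vocab)}
    n_windows = len(tokens) // window
    series = np.zeros((n_windows, len(vocab)), dtype=int)
    for t in range(n_windows):
        for tok in tokens[t * window:(t + 1) * window]:
            if tok in index:
                series[t, index[tok]] += 1
    return series

def hurst_rs(x):
    """Rescaled-range (R/S) estimate of the Hurst exponent H.
    H > 0.5 suggests persistence (long memory) in the series."""
    x = np.asarray(x, dtype=float)
    sizes = [n for n in (8, 16, 32, 64, 128) if n <= len(x) // 2]
    log_n, log_rs = [], []
    for n in sizes:
        rs_vals = []
        for start in range(0, len(x) - n + 1, n):
            seg = x[start:start + n]
            dev = np.cumsum(seg - seg.mean())
            r = dev.max() - dev.min()      # range of cumulative deviations
            s = seg.std()                  # segment standard deviation
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            log_n.append(np.log(n))
            log_rs.append(np.log(np.mean(rs_vals)))
    # The slope of log(R/S) against log(n) estimates H.
    return np.polyfit(log_n, log_rs, 1)[0]
```

In this simplified setting, the Hurst estimate of a word's count series indicates whether its usage clusters over the document (persistence) or fluctuates like noise; the paper's analysis additionally covers multifractality, stationarity, and extremes.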
Our main contributions are: 1) a common framework of $N$-ary relations used to construct all our target embeddings: i) a time-series representation of the well-known Bag-of-Words (BoW) model to capture the frequency characteristics of text processing units, ii) a tree-valued time-series to capture the grammatical structure of sentences, iii) a graph-valued time-series to capture the syntactic structure of sentences, and iv) a combination of BoW and syntax to capture the information in word co-occurrence frequencies restricted to particular syntactic structures; 2) a novel application of Brownian bridges in the context of NLP; and 3) an extension of Item-Response and log-linear models for contingency tables of word counts to a stochastic formulation that captures the specificities of text identified by our analysis. Specifically, we place a Multiple-Output Gaussian Process on the intensity of a Poisson regression model. The structure of the Gaussian Process covariance accounts for the temporal dependence in word selection, and formulating the covariance with multiple kernels lets us measure the contribution of each of the constructed embeddings, and hence draw conclusions about their representational power.
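The third contribution can be sketched in its simplest form: a latent Gaussian Process with an additive (multi-kernel) covariance modulating a Poisson intensity, so that the weight on each kernel reflects the contribution of one embedding channel. The sketch below is a single-output simplification in NumPy (the paper uses a Multiple-Output GP), and the RBF kernels, lengthscales, and weights are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def additive_covariance(t, kernels, weights):
    """Weighted sum of kernels: each fitted weight can be read as the
    contribution of one embedding channel to the covariance."""
    K = sum(w * k(t) for k, w in zip(kernels, weights))
    return K + 1e-6 * np.eye(len(t))  # jitter for numerical stability

def sample_counts(t, K, rng):
    """Draw a latent GP path f ~ N(0, K) and Poisson word counts with
    intensity exp(f), i.e. a log-Gaussian-Cox-style count model."""
    f = rng.multivariate_normal(np.zeros(len(t)), K)
    return rng.poisson(np.exp(f))

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 100)
kernels = [lambda x: rbf_kernel(x, lengthscale=2.0),   # slow temporal drift
           lambda x: rbf_kernel(x, lengthscale=0.3)]   # fast local variation
K = additive_covariance(t, kernels, weights=[0.6, 0.4])
counts = sample_counts(t, K, rng)
```

In a full treatment, the kernel weights would be learned from the observed contingency-table counts, and comparing the fitted weights across kernels built from the BoW, tree, and graph embeddings would quantify each embedding's representational power.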
Keywords: natural language, text processing, long memory, persistence, multifractal time-series, Brownian bridge, Multiple-Output Gaussian Processes, item-response models, contingency tables