High-Cardinality Categorical Covariates in Network Regressions
Japanese Journal of Statistics and Data Science. Open Access: https://link.springer.com/article/10.1007/s42081-024-00243-4
Posted: 24 Aug 2023 Last revised: 13 Mar 2024
Date Written: March 10, 2024
Abstract
High-cardinality (nominal) categorical covariates are challenging in regression modeling because they lead to high-dimensional models. For example, in generalized linear models (GLMs), categorical covariates can be implemented by dummy coding, which results in high-dimensional regression parameters when the covariates have high cardinality. It is difficult to find the correct structure of interactions in high-cardinality covariates, and such high-dimensional models are prone to overfitting. Various regularization strategies can be applied to prevent overfitting. In neural network regressions, a popular way of dealing with categorical covariates is entity embedding, and overfitting is typically controlled by early stopping. In the case of high-cardinality categorical covariates, this often leads to very early stopping, resulting in a poor predictive model. Building on Avanzi, Taylor, Wang and Wong (arXiv 2023), we introduce new versions of random effects entity embedding of categorical covariates. In particular, when the categorical covariates have a hierarchical structure, we propose a recurrent neural network architecture and a Transformer architecture, respectively, for random effects entity embedding; these give very accurate regression models.
Keywords: categorical covariates, categorical features, nominal features, high-cardinality features, entity embedding, embedding layer, random effects model, neural network, recurrent neural network, attention layer, transformer, regularization, ridge regularization, variational inference
JEL Classification: G22