High-Cardinality Categorical Covariates in Network Regressions

Japanese Journal of Statistics and Data Science. Open Access: https://link.springer.com/article/10.1007/s42081-024-00243-4

Posted: 24 Aug 2023 Last revised: 13 Mar 2024

Ronald Richman

insureAI; University of the Witwatersrand

Mario V. Wuthrich

RiskLab, ETH Zurich

Date Written: March 10, 2024

Abstract

High-cardinality (nominal) categorical covariates are challenging in regression modeling because they lead to high-dimensional models. For example, in generalized linear models (GLMs), categorical covariates can be implemented by dummy coding, which results in high-dimensional regression parameters for high-cardinality categorical covariates. It is difficult to find the correct structure of interactions in high-cardinality covariates, and such high-dimensional models are prone to overfitting. Various regularization strategies can be applied to prevent overfitting. In neural network regressions, a popular way of dealing with categorical covariates is entity embedding, and overfitting is typically handled by early stopping. In the case of high-cardinality categorical covariates, this often leads to very early stopping, resulting in a poor predictive model. Building on Avanzi, Taylor, Wang and Wong (arXiv 2023), we introduce new versions of random effects entity embedding of categorical covariates. In particular, when the categorical covariates have a hierarchical structure, we propose a recurrent neural network architecture and a Transformer architecture, respectively, for random effects entity embedding that give us very accurate regression models.
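
To make the contrast in the abstract concrete, the following minimal sketch compares dummy coding of a single high-cardinality covariate with an entity embedding lookup, and adds a ridge penalty on the embedding weights as one example of the regularization strategies mentioned. It is an illustration only and not the authors' implementation: the function names (dummy_code, make_embedding, embed, ridge_penalty), the number of levels (1000), the embedding dimension (2) and the initialization scale are all assumptions chosen for the example.

```python
import random


def dummy_code(level: int, n_levels: int) -> list[float]:
    """Dummy coding with level 0 as the reference level.

    A covariate with K levels becomes K - 1 indicator columns, so a GLM
    needs K - 1 regression parameters for this covariate alone.
    """
    return [1.0 if level == j else 0.0 for j in range(1, n_levels)]


def make_embedding(n_levels: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Hypothetical embedding table: one trainable vector of length `dim` per level."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_levels)]


def embed(level: int, table: list[list[float]]) -> list[float]:
    """Entity embedding is a table lookup: level -> low-dimensional vector."""
    return table[level]


def ridge_penalty(table: list[list[float]], lam: float) -> float:
    """Ridge (squared L2) penalty on all embedding weights, scaled by lam."""
    return lam * sum(w * w for row in table for w in row)


n_levels = 1000  # assumed cardinality of the categorical covariate
emb_dim = 2      # assumed embedding dimension

x_dummy = dummy_code(417, n_levels)        # input to a GLM: 999 columns
table = make_embedding(n_levels, emb_dim)
x_embed = embed(417, table)                # input to the network: 2 values

print(len(x_dummy), len(x_embed))
print(ridge_penalty(table, lam=0.01))
```

The embedding table still holds n_levels * emb_dim trainable weights, so a low-dimensional representation does not by itself remove the risk of overfitting; the penalty term shows one place where regularization can enter the training loss.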

Keywords: categorical covariates, categorical features, nominal features, high-cardinality features, entity embedding, embedding layer, random effects model, neural network, recurrent neural network, attention layer, transformer, regularization, ridge regularization, variational inference

JEL Classification: G22

Suggested Citation

Richman, Ronald and Wuthrich, Mario V., High-Cardinality Categorical Covariates in Network Regressions (March 10, 2024). Japanese Journal of Statistics and Data Science. Open Access: https://link.springer.com/article/10.1007/s42081-024-00243-4, Available at SSRN: https://ssrn.com/abstract=4549049 or http://dx.doi.org/10.2139/ssrn.4549049

Ronald Richman

insureAI

30 Melrose Blvd
Melrose Arch
Johannesburg, Gauteng 2192
South Africa

University of the Witwatersrand

Mario V. Wuthrich (Contact Author)

RiskLab, ETH Zurich

Department of Mathematics
Ramistrasse 101
Zurich, 8092
Switzerland