Data Driven Dimensionality Reduction to Improve Modeling Performance
28 Pages Posted: 25 Jun 2023
Date Written: June 20, 2023
In a number of applications, data may be anonymized, obfuscated, or highly noisy. In such cases, it is difficult to use domain knowledge or low-dimensional visualizations to engineer features for tasks such as machine learning. Instead, we explore dimensionality reduction (DR) as a data-driven approach for engineering these low-dimensional representations. Through a careful examination of available feature selection and feature extraction techniques, we propose a new class of methods named feature clustering. These methods can use different forms of clustering to help evaluate the relative importance of features, and they take on properties different from those of well-known DR algorithms. To evaluate these algorithms, we develop a parallel computing framework that optimizes their hyperparameters on a sample of application datasets. This framework harnesses parallel computing power to examine a large number of parameter combinations, and it enables hyperparameter tuning and model tuning based purely on observed performance. The optimization framework provides a mechanism for users to control computational cost and is able to examine many parameter choices in seconds. On a set of building energy data where the key features are known from domain knowledge, the optimized DR algorithms indeed identify the expected main drivers of building electricity usage: outdoor temperature and solar radiance. This shows that the automated optimization procedure is able to find known features. In terms of modeling accuracy, a distance correlation-based feature clustering method outperforms other DR algorithms, including the well-known KPCA, LLE, and UMAP, on two different tests.
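To make the idea of distance correlation-based feature clustering concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that features are grouped greedily whenever their sample distance correlation with a cluster's first member reaches a threshold, and that this first member serves as the cluster's representative. The function names, the greedy grouping rule, and the `threshold` parameter are all illustrative assumptions.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered matrix of pairwise absolute differences of a 1-D sample."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    a = _centered_distances(np.asarray(x, dtype=float))
    b = _centered_distances(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()                      # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()     # product of squared distance variances
    if dvar2 <= 0.0:                            # a constant feature carries no signal
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar2)))


def cluster_features(X, threshold=0.8):
    """Greedy grouping (an assumption of this sketch, not the paper's rule).

    A feature joins the first cluster whose representative it matches with
    distance correlation >= threshold; otherwise it starts a new cluster.
    Returns per-feature cluster labels and the representative column indices.
    """
    X = np.asarray(X, dtype=float)
    representatives = []
    labels = np.empty(X.shape[1], dtype=int)
    for j in range(X.shape[1]):
        for c, r in enumerate(representatives):
            if distance_correlation(X[:, j], X[:, r]) >= threshold:
                labels[j] = c
                break
        else:
            labels[j] = len(representatives)
            representatives.append(j)
    return labels, representatives


if __name__ == "__main__":
    # Synthetic data (hypothetical): column 1 nearly duplicates column 0,
    # column 2 is independent noise.
    rng = np.random.default_rng(0)
    t = rng.normal(size=200)
    X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200), rng.normal(size=200)])
    labels, reps = cluster_features(X, threshold=0.8)
    print(labels, reps)
```

Keeping only the representative columns, `X[:, reps]`, gives a reduced feature set. In a setup like the one described above, the threshold would be one of the hyperparameters tuned on observed modeling performance rather than fixed in advance.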
Keywords: Dimensionality reduction, mean-decreased accuracy, feature selection, hyperparameter optimization
JEL Classification: G0, G1, G2, G15, G24, E44