Data Driven Dimensionality Reduction to Improve Modeling Performance

28 Pages Posted: 25 Jun 2023

Joshua Chung

University of California, Berkeley; University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

Marcos López de Prado

Cornell University - Operations Research & Industrial Engineering; Abu Dhabi Investment Authority; True Positive Technologies

Horst Simon

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

Kesheng Wu

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

Date Written: June 20, 2023

Abstract

In a number of applications, data may be anonymized, obfuscated, or highly noisy. In such cases, it is difficult to use domain knowledge or low-dimensional visualizations to engineer features for tasks such as machine learning. Instead, we explore dimensionality reduction (DR) as a data-driven approach for engineering these low-dimensional representations. Through a careful examination of available feature selection and feature extraction techniques, we propose a new class of methods named feature clustering. These new methods can use different forms of clustering to help evaluate the relative importance of features, and they take on properties different from those of the well-known DR algorithms. To evaluate these algorithms, we develop a parallel computing framework that optimizes their hyperparameters on a sample of application datasets. This framework harnesses parallel computing power to examine a large number of parameter combinations and enables hyperparameter tuning and model tuning based purely on observed performance. The optimization framework provides a mechanism for users to control computational cost and is able to examine many parameter choices in seconds. On a set of building energy data where the key features are known from domain knowledge, the optimized DR algorithms indeed identify the expected main drivers of building electricity usage: outdoor temperature and solar radiance. This shows that the automated optimization procedure is able to find known features. In terms of modeling accuracy, a distance correlation-based feature clustering method outperforms other DR algorithms, including the well-known KPCA, LLE, and UMAP, on two different tests.
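
The abstract does not spell out how the distance correlation-based feature clustering works, so the following is only a minimal sketch of one plausible reading: compute the sample distance correlation between every pair of features, convert it to a dissimilarity (1 minus the correlation), and group features with average-linkage hierarchical clustering. The function names, the choice of linkage, the number of clusters, and the synthetic data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def _double_centered(x):
    """Pairwise |x_i - x_j| matrix of a 1-D sample, double-centered."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays of equal length."""
    a = _double_centered(np.asarray(x, dtype=float))
    b = _double_centered(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()                      # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def cluster_features(X, n_clusters):
    """Hypothetical helper: group the columns of X by distance correlation.

    Returns one integer cluster label per feature (column).
    """
    n_features = X.shape[1]
    dist = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            dcor = distance_correlation(X[:, i], X[:, j])
            # Strongly dependent features get a small dissimilarity.
            dist[i, j] = dist[j, i] = max(0.0, 1.0 - dcor)
    # Assumption: average linkage on the condensed dissimilarity matrix.
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic illustration (not the building energy data from the paper):
# features 0 and 1 depend on one latent driver, features 2 and 3 on another.
rng = np.random.default_rng(0)
u = rng.normal(size=200)
v = rng.normal(size=200)
X = np.column_stack([
    u,
    2.0 * u + 0.05 * rng.normal(size=200),   # linear function of u
    v,
    np.exp(v) + 0.05 * rng.normal(size=200),  # nonlinear function of v
])
labels = cluster_features(X, n_clusters=2)
print(labels)
```

In a sketch like this, each cluster could then be reduced to a single representative feature, and the number of clusters would be one of the hyperparameters that an optimization framework of the kind described above tunes against observed model performance.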

Keywords: Dimensionality reduction, mean-decreased accuracy, feature selection, hyperparameter optimization

JEL Classification: G0, G1, G2, G15, G24, E44

Suggested Citation

Chung, Joshua and López de Prado, Marcos and Simon, Horst and Wu, Kesheng, Data Driven Dimensionality Reduction to Improve Modeling Performance (June 20, 2023). Available at SSRN: https://ssrn.com/abstract=4485887 or http://dx.doi.org/10.2139/ssrn.4485887

Joshua Chung

University of California, Berkeley

310 Barrows Hall
Berkeley, CA 94720
United States

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

1 Cyclotron Road
Berkeley, CA 94720
United States

Marcos López de Prado (Contact Author)

Cornell University - Operations Research & Industrial Engineering

237 Rhodes Hall
Ithaca, NY 14853
United States

HOME PAGE: http://www.orie.cornell.edu

Abu Dhabi Investment Authority

211 Corniche Road
Abu Dhabi, Abu Dhabi PO Box 3600
United Arab Emirates

HOME PAGE: http://www.adia.ae

True Positive Technologies

NY
United States

HOME PAGE: http://www.truepositive.com

Horst Simon

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

1 Cyclotron Road
Berkeley, CA 94720
United States

Kesheng Wu

University of California, Berkeley - Lawrence Berkeley National Laboratory (Berkeley Lab)

1 Cyclotron Road
Berkeley, CA 94720
United States
