Matrix-Factorization-Based Dimensionality Reduction in the Predictive Modeling Process: A Design Science Perspective
41 Pages Posted: 30 Sep 2016
Date Written: September 2016
Dimensionality Reduction (DR) is frequently employed in the predictive modeling process with the goal of improving the generalization performance of models. This paper takes a design science perspective on DR. We treat it as an important business analytics artifact and investigate its utility in the context of binary classification, with the goal of understanding its proper use and thus improving predictive modeling research and practice. Despite DR's popularity, we show that many published studies fail to undertake the necessary comparison to establish that it actually improves performance. We then conduct an experimental comparison between binary classification with and without matrix-factorization-based DR as a preprocessing step on the features. In particular, we investigate DR in the context of supervised complexity control. These experiments utilize three classifiers and three matrix-factorization-based DR techniques, and measure performance on a total of 26 classification tasks. We find that DR is generally not beneficial for binary classification. Specifically, the more difficult the problem, the more DR is able to improve performance (but it diminishes easier problems' performance). However, this relationship depends on complexity control: DR's benefit is actually eliminated completely when state-of-the-art methods are used for complexity control. The wide variety of experimental conditions allows us to dig more deeply into when and why the different forms of complexity control are useful. We find that L2-regularized logistic regression models trained on the original feature set have the best performance in general. The relative benefit provided by DR is increased when using a classifier that incorporates feature selection; unfortunately, the performance of these models, even with DR, is lower in general.
We compare three matrix-factorization-based DR algorithms and find that none does better than using the full feature set, but of the three, SVD has the best performance. The results in this paper should be broadly useful for researchers and industry practitioners who work in applied data science. In particular, they emphasize the design science principle that adding design elements to the predictive modeling process should be done with attention to whether they add value.
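
The kind of comparison described above, the same classifier trained with and without a matrix-factorization-based DR step, can be illustrated with a minimal sketch. This is not the paper's experimental code. The function names (`fit_logreg_l2`, `svd_reduce`, `compare`), the gradient-descent solver, the 70/30 split, the regularization strength, and the synthetic data are all assumptions made for illustration. The SVD projection is learned from the training rows only, without centering.

```python
import numpy as np

def fit_logreg_l2(X, y, lam=1.0, lr=0.1, n_iter=500):
    """L2-regularized logistic regression fit by batch gradient descent.
    Objective: mean log loss + (lam / (2 * n)) * ||w||^2 (intercept unpenalized)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / n + lam * w / n)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(w, b, X, y):
    """Fraction of rows whose predicted label matches y (labels are 0/1)."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

def svd_reduce(X_train, X_test, k):
    """Project both sets onto the top-k right singular vectors of X_train."""
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    V_k = Vt[:k].T                               # shape (n_features, k)
    return X_train @ V_k, X_test @ V_k

def compare(X, y, k, seed=0):
    """Return (test accuracy on full features, test accuracy after SVD to k dims)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    split = int(0.7 * len(y))                    # assumed 70/30 holdout split
    tr, te = idx[:split], idx[split:]

    w, b = fit_logreg_l2(X[tr], y[tr])
    acc_full = accuracy(w, b, X[te], y[te])

    Z_tr, Z_te = svd_reduce(X[tr], X[te], k)
    w, b = fit_logreg_l2(Z_tr, y[tr])
    acc_dr = accuracy(w, b, Z_te, y[te])
    return acc_full, acc_dr

# Synthetic binary task (an assumption; not one of the paper's 26 tasks).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X @ rng.normal(size=20) + 0.5 * rng.normal(size=400) > 0).astype(float)
acc_full, acc_dr = compare(X, y, k=5)
print(f"full features: {acc_full:.3f}   SVD (k=5): {acc_dr:.3f}")
```

A single holdout split on one synthetic task says nothing about the paper's findings; it only shows the shape of the with/without comparison that the abstract argues every study using DR should report.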