Some Experiments Comparing Logistic Regression and Random Forests Using Synthetic Data and the Interaction Miner Algorithm

33 Pages. Posted: 6 Jun 2011. Last revised: 6 Sep 2011.

Date Written: June 5, 2011

Abstract

This paper uses synthetic datasets to classify the conditions in which random forests may outperform more traditional techniques such as logistic regression. We explore the theoretical implications of these experimental findings and work towards building a theory-based approach to data mining. In the course of these experiments, we take the simulations where random forests dominate, add dimensionality to the data, and run logistic regression on the additional attributes using the I* interaction miner algorithm outlined in Sharma (2011). Using the I* procedure with an adequate number of interaction terms, logistic regression can be made to match the performance of random forests on the synthetic data sets where random forests dominate (Sharma, 2011). This suggests that the interaction miner algorithm, together with some minimal sufficient number of interactions and transformations, allows logistic regression to match ensemble performance. It also implies that, without a certain amount of dimensionality in the data, the interaction miner and logistic regression do not benefit from the interactions. Breiman (2001) and other work show that random forests thrive on dimensionality; that said, in our experience with various data sets, adding artificial dimensionality does not help forests. There appears to be some minimum, or necessary and sufficient, amount of dimensionality beyond which no more information can be extracted from the data. The good news is that dimensionality can be created using the icreater function, which automatically adds Tukey’s re-expressions to the data (log, negative reciprocal, and sqrt).
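
As a rough illustration of the kind of feature expansion the abstract describes, the following Python sketch adds Tukey's three re-expressions (log, negative reciprocal, and square root) to each column and then forms pairwise product terms. It is not the icreater function or the I* algorithm from Sharma (2011), whose details are not given here. The function names `add_reexpressions` and `add_pairwise_interactions`, the column naming scheme, the toy data, and the choice to skip columns containing non-positive values are all assumptions made for the example.

```python
import math
from itertools import combinations


def add_reexpressions(columns):
    """Return a copy of `columns` with log, negative reciprocal, and sqrt
    versions of every strictly positive column appended.

    Hypothetical helper: a stand-in for the idea behind icreater, not its code.
    """
    out = dict(columns)
    for name, values in columns.items():
        if min(values) <= 0:
            # Assumption: log and reciprocal are undefined here, so the column
            # is left as-is instead of being shifted by some guessed constant.
            continue
        out[f"log_{name}"] = [math.log(v) for v in values]
        out[f"negrecip_{name}"] = [-1.0 / v for v in values]
        out[f"sqrt_{name}"] = [math.sqrt(v) for v in values]
    return out


def add_pairwise_interactions(columns):
    """Return a copy of `columns` with the product of every pair of columns."""
    out = dict(columns)
    for a, b in combinations(columns, 2):
        out[f"{a}_x_{b}"] = [u * v for u, v in zip(columns[a], columns[b])]
    return out


# Toy data (made up for illustration): two positive attributes, three rows.
data = {"x1": [1.0, 4.0, 9.0], "x2": [2.0, 2.0, 0.5]}
expanded = add_pairwise_interactions(add_reexpressions(data))
print(len(data), "->", len(expanded), "columns")
```

The expanded columns could then be passed to a variable selection step and a logistic regression fit. Because the number of product terms grows quadratically with the number of columns, some form of selection is needed in practice, which is the role the abstract assigns to the interaction miner.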

Keywords: interaction mining, simulation, synthetic data sets, logistic regression vs. random forest, exponential probability distributions, I*, interaction miner, variable selection

Suggested Citation

Sharma, Dhruv, Some Experiments Comparing Logistic Regression and Random Forests Using Synthetic Data and the Interaction Miner Algorithm (June 5, 2011). Available at SSRN: https://ssrn.com/abstract=1858424 or http://dx.doi.org/10.2139/ssrn.1858424

Dhruv Sharma (Contact Author)

Independent

2023 N. Cleveland St.
Arlington, VA 22201
United States

HOME PAGE: http://theinterdisciplinarian.com/
