Optimal Response Modeling: Comparison of Logistic Regression, Random Forest and I* Algorithm
45 Pages Posted: 29 Aug 2011
Date Written: August 28, 2011
Optimal response modeling is studied using logistic regression, random forests, and the I* algorithm for building tuned regressions.
Transforming categorical variables into either WOE (weight of evidence) or probability of response, coupled with equal frequency binning of size 10, results in improved models. In addition, the Laplace correction and m-smoothing help with WOE and probability. I* performed best with probability or WOE adjusted with the Laplace correction and equal frequency binning, reaching an AUC of .8, 15% better than the winning benchmark of .71.
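As an illustration of the transformations named above, the following is a minimal Python sketch, not the paper's implementation. It assumes one common formulation of WOE (log ratio of a category's share of responders to its share of non-responders), add-alpha Laplace smoothing with alpha = 1, and reads "binning of size 10" as 10 bins. The function names laplace_woe, laplace_probability and equal_frequency_bins are hypothetical.

```python
import math
from collections import Counter


def laplace_woe(categories, responses, alpha=1.0):
    """Map each category to a Laplace-smoothed weight of evidence.

    ASSUMPTION: WOE = ln(share of responders / share of non-responders),
    with alpha added to every count so empty cells do not give log(0).
    """
    pos = Counter(c for c, y in zip(categories, responses) if y == 1)
    neg = Counter(c for c, y in zip(categories, responses) if y == 0)
    levels = sorted(set(categories))
    k = len(levels)
    pos_total = sum(pos.values())
    neg_total = sum(neg.values())
    woe = {}
    for c in levels:
        p = (pos[c] + alpha) / (pos_total + alpha * k)
        q = (neg[c] + alpha) / (neg_total + alpha * k)
        woe[c] = math.log(p / q)
    return woe


def laplace_probability(categories, responses, alpha=1.0):
    """Map each category to a smoothed response rate (binary target)."""
    pos = Counter(c for c, y in zip(categories, responses) if y == 1)
    n = Counter(categories)
    return {c: (pos[c] + alpha) / (n[c] + 2 * alpha) for c in n}


def equal_frequency_bins(values, n_bins=10):
    """Assign each value a bin index in 0..n_bins-1 by rank.

    Bins hold roughly equal counts; tied values may land in adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins
```

In practice the mappings would be estimated on training data only and then applied to held-out data, so that the encoded variables do not leak the target.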
Adding probability or WOE adjusted by the Laplace correction and equal frequency binning to I* has many advantages: these transformations address some biases in random forests, make modeling more robust, and reduce run time. They are well-known modeling techniques that seasoned credit scoring modelers have used for decades. The data show that tuning logistic regression using random forest variable importance results in an optimal predictive model even when the data contain no interaction effects. The I* algorithm is enhanced with a 0-way interaction option to tune logistic regression without interaction effects. This is a surprising and important result for automated optimal logistic regression model building.
Important enhancements to random forests are also suggested: changing mtry from the square root of the number of variables to log base 10 * square root of the number of variables, to address the high probability of drawing noise variables as the variable space increases; pre-processing categorical variables with either WOE or empirical Bayes probability estimates; and binning numeric fields using equal frequency binning, to address the biases in random forest variable selection raised by Strobl.
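The suggested mtry change can be sketched as follows. The abstract's wording is ambiguous, so this sketch assumes one reading: mtry = log10(p) * sqrt(p), where p is the number of variables. The paper itself may intend a different expression. The function names default_mtry and scaled_mtry are hypothetical.

```python
import math


def default_mtry(n_variables):
    """Common classification default: floor of sqrt(p), at least 1."""
    return max(1, math.isqrt(n_variables))


def scaled_mtry(n_variables):
    """ASSUMED reading of the abstract: log10(p) * sqrt(p).

    The result is rounded and clamped to the range 1..p.
    """
    p = n_variables
    value = math.log10(p) * math.sqrt(p)
    return max(1, min(p, round(value)))
```

Under this reading, 100 variables give an mtry of 20 instead of 10, so each split considers more candidates as the variable space grows. For fewer than 10 variables the scaled value falls below the default.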
Keywords: kdd, random forest, logistic, naive bayes, discretization, binning, I*, logistic regression, donation response, bias, variable importance, enhancing random forest, WOE, empirical bayes