Improving Logistic Regression/Credit Scorecards Using Random Forests: Applications with Credit Card and Home Equity Datasets
16 Pages Posted: 3 Apr 2011 Last revised: 6 Sep 2011
Date Written: May 2, 2010
This paper takes an approach to credit scoring based on the premise that credit scoring models should be built on affordability data (income, assets, free cash flows, cash flow proxies, etc.) and that these variables should have meaningful interactions with other scorecard attributes. The approach consists of using random forests to identify the variables with the most predictive power, which are then permuted with other variables to form interaction terms and build improved scorecards using logistic regression.
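The workflow described above can be sketched roughly as follows. This is a minimal illustration on synthetic data using scikit-learn, not the paper's own implementation: the dataset, the helper names `top_features_by_forest` and `add_interactions`, the choice of `k=3`, and the use of impurity-based importances are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def top_features_by_forest(X, y, k=3, seed=0):
    """Rank columns by random forest importance and return the top k indices.

    Hypothetical helper: impurity-based importance is one of several
    possible importance measures and is assumed here for simplicity.
    """
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [int(i) for i in order[:k]]


def add_interactions(X, top):
    """Append pairwise products between each top column and every other column.

    Hypothetical helper: returns the augmented matrix and a list of index
    tuples describing each column (single index = original variable,
    pair = interaction term). Pairs of two top columns are added once.
    """
    cols = [X]
    names = [(j,) for j in range(X.shape[1])]
    for i in top:
        for j in range(X.shape[1]):
            if j == i or (j in top and j < i):
                continue  # skip self-products and duplicate top-top pairs
            cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
            names.append((i, j))
    return np.hstack(cols), names


# Synthetic stand-in for a scorecard dataset (assumption: 8 numeric attributes).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

top = top_features_by_forest(X, y, k=3)
X_aug, names = add_interactions(X, top)

# Logistic regression scorecard on the original plus interaction columns.
scorecard = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scorecard.fit(X_aug, y)
probabilities = scorecard.predict_proba(X_aug)[:, 1]
```

In practice the interaction candidates would be screened on a holdout sample before entering the scorecard, since adding every product term invites overfitting.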
The problem with most credit scorecards is that they use logistic regression, which does not account for multicollinearity, so variable importance cannot be gauged reliably from p-values during model construction. As a result, traditional credit scorers have steered clear of interaction terms and exploratory analysis with logistic regression. Random forest variable selection overcomes that shortcoming and also provides a good benchmark for estimating the flat maximum, the asymptote of predictive power constraining the model. In addition, conditional inference trees were used alongside recursive partitioning to identify segments of the data. Conditional inference trees identified 20 segments using age, payment date, and other variables that traditional recursive partitioning could not detect, and they also outperformed traditional recursive partitioning.
In addition to showing random forests to be powerful tools and models superior to logistic regression on a large credit card dataset, the paper concludes by comparing random forest performance against logistic regression on a widely used home equity dataset. Out-of-the-box random forests outperform logistic regression as well as a recent optimized generalized additive neural network based logistic generalized regression model proposed by Wallinga. More research is needed on an automated approach to optimally extracting predictive power from random forests and tuning logistic regression scorecards.
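A benchmark comparison of this kind could be set up along the following lines. This is only a sketch under stated assumptions: the data are synthetic rather than the credit card or home equity datasets, holdout AUC is assumed as the performance measure, and both models use default-style settings, so the resulting numbers say nothing about the paper's findings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with a real scorecard dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# "Out of the box" models: no tuning beyond a fixed seed and iteration cap.
forest = RandomForestClassifier(n_estimators=300, random_state=1)
logit = LogisticRegression(max_iter=1000)

forest.fit(X_train, y_train)
logit.fit(X_train, y_train)

# Holdout AUC for each model; the forest's AUC can serve as a rough
# benchmark for how much predictive power the scorecard leaves unused.
auc_forest = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
auc_logit = roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1])
auc_gap = auc_forest - auc_logit

print(f"random forest AUC: {auc_forest:.3f}")
print(f"logistic regression AUC: {auc_logit:.3f}")
print(f"gap: {auc_gap:+.3f}")
```

A large positive gap would suggest the scorecard is missing nonlinearities or interactions that the forest captures; a gap near zero would suggest the scorecard is already close to the flat maximum for that data.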
Keywords: optimal credit scoring, random forests, logistic regression, mortgage, credit risk, credit cards, KDD