De-Biased Random Forest Variable Selection
24 Pages. Posted: 22 Dec 2011. Last revised: 18 Feb 2013
Date Written: December 22, 2011
This paper proposes a new way to de-bias random forest variable selection using a clean random forest algorithm. Strobl et al. (2007) have shown that random forest variable importance is biased towards variables with many levels or categories, variables measured on different scales, and correlated variables, which can inflate some variable importance measures. The proposed algorithm builds random forests without each variable in turn and keeps a variable when dropping it degrades overall random forest performance. The algorithm is simple and straightforward, and its complexity and speed are a function of the number of salient variables. It runs more efficiently than the permutation test algorithm and is an alternative method for addressing known biases. The paper concludes with some normative guidance on how to use random forest variable importance.
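The drop-one-variable idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `drop_one_selection` and `make_rf_scorer`, the tolerance `tol`, the single pass over the variables, and the use of scikit-learn's out-of-bag accuracy as the performance measure are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


def drop_one_selection(
    features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    tol: float = 0.0,
) -> Dict[str, float]:
    """Return {feature: performance drop} for features whose removal hurts.

    score_fn(subset) is assumed to fit a random forest on `subset` and
    return a higher-is-better score (for example, out-of-bag accuracy).
    """
    if len(features) < 2:
        raise ValueError("need at least two features to drop one")

    # Reference performance with every variable included.
    baseline = score_fn(list(features))

    kept: Dict[str, float] = {}
    for name in features:
        # Refit without this one variable and measure the degradation.
        reduced = [f for f in features if f != name]
        drop = baseline - score_fn(reduced)
        if drop > tol:  # assumption: keep only if the drop exceeds tol
            kept[name] = drop
    return kept


def make_rf_scorer(X, y, n_estimators: int = 500, seed: int = 0):
    """Build a score_fn backed by a scikit-learn random forest.

    X is assumed to be a pandas DataFrame whose columns are the features;
    the returned function scores a column subset by out-of-bag accuracy.
    """
    # Imported here so the selection loop above has no hard dependency.
    from sklearn.ensemble import RandomForestClassifier

    def score(subset: List[str]) -> float:
        rf = RandomForestClassifier(
            n_estimators=n_estimators, oob_score=True, random_state=seed
        )
        rf.fit(X[subset], y)
        return rf.oob_score_

    return score
```

Under these assumptions, a call such as `drop_one_selection(list(X.columns), make_rf_scorer(X, y), tol=0.01)` would return the variables whose removal lowers out-of-bag accuracy by more than one percentage point. Because out-of-bag estimates are noisy, a tolerance of exactly zero is likely to keep some irrelevant variables; how the paper sets its threshold is not stated in the abstract.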
Keywords: random forest, variable importance, interaction effects, logistic regression, predictive modeling, biases