RFCC: Random Forest Consensus Clustering for Regression and Classification
18 Pages Posted: 19 Mar 2021
Date Written: March 19, 2021
Abstract
Random forests are invariant and robust estimators that can fit complex interactions between input data of different types and binary, categorical, or continuous outcome variables, including those with multiple dimensions. In addition to these desirable properties, random forests impose a structure on the observations from which researchers and data analysts can infer clusters or groups of interest. These clusters not only give structure to the data at hand but can also be used to elucidate new patterns, define subgroups for further analysis, derive prototypical observations, identify outlier observations, catch mislabeled data, and evaluate the performance of the estimation model in more detail.
We present a novel clustering algorithm called Random Forest Consensus Clustering and implement it in the Scikit-Learn / SciPy data science ecosystem. This algorithm differs from prior approaches by making use of the entire tree structure. Observations become proximate if they follow similar decision paths across trees of a random forest. We illustrate why this approach improves the resolution and robustness of clustering and why it is especially suited to hierarchical approaches.
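The path-based notion of proximity described above can be illustrated with a minimal sketch. It is not the paper's implementation: the synthetic data, the forest hyperparameters, the Jaccard distance between path-membership vectors, and average linkage are all assumptions chosen for illustration. It relies only on scikit-learn's `decision_path` and SciPy's hierarchical clustering routines.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used here only for illustration (assumption).
X, y = make_blobs(n_samples=60, centers=3, random_state=0)

# Fit a small forest; the hyperparameters are arbitrary (assumption).
forest = RandomForestClassifier(
    n_estimators=25, max_depth=4, random_state=0
).fit(X, y)

# decision_path returns a sparse indicator matrix of shape
# (n_samples, total number of nodes over all trees): entry (i, j) is
# nonzero if observation i passes through node j.
indicator, n_nodes_ptr = forest.decision_path(X)
paths = indicator.toarray().astype(bool)

# Two observations are close when they share many nodes along their
# decision paths. Jaccard distance is one possible choice (assumption).
dist = pdist(paths, metric="jaccard")

# Hierarchical clustering on the path-based distances, cut so that at
# most three clusters remain.
tree = linkage(dist, method="average")
labels = fcluster(tree, t=3, criterion="maxclust")
```

Because the distance uses every node on a path rather than only the terminal leaf, observations that separate late in a tree remain closer than those that separate at the root, which is the intuition behind using the entire tree structure.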
Keywords: random forest, clustering, networks, decision tree, consensus clustering
JEL Classification: C1, C01, C10, C19