RFCC: Random Forest Consensus Clustering for Regression and Classification

18 Pages. Posted: 19 Mar 2021

Ingo Marquart

ESMT Berlin

Ebru Koca Marquart

affiliation not provided to SSRN

Date Written: March 19, 2021

Abstract

Random forests are invariant and robust estimators that can fit complex interactions between input data of different types and binary, categorical, or continuous outcome variables, including those with multiple dimensions. In addition to these desirable properties, random forests impose a structure on the observations from which researchers and data analysts can infer clusters or groups of interest. These clusters not only give structure to the data at hand but can also be used to elucidate new patterns, define subgroups for further analysis, derive prototypical observations, identify outlier observations, catch mislabeled data, and evaluate the performance of the estimation model in more detail.

We present a novel clustering algorithm called Random Forest Consensus Clustering and implement it in the Scikit-Learn / SciPy data science ecosystem. This algorithm differs from prior approaches by making use of the entire tree structure: observations become proximate if they follow similar decision paths across the trees of a random forest. We illustrate why this approach improves the resolution and robustness of clustering and why it is especially suited to hierarchical approaches.
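The idea of path-based proximity can be illustrated with a minimal Python sketch built on Scikit-Learn and SciPy. This is not the RFCC implementation described in the paper; it only shows one plausible way to turn shared decision paths into a hierarchical clustering. The helper name `path_distance_clusters`, the Jaccard distance between visited-node sets, the average-linkage criterion, the forest settings, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def path_distance_clusters(X, y, n_clusters=3, n_trees=50, seed=0):
    """Cluster observations by how similar their decision paths are.

    Hypothetical helper for illustration only (not the paper's API).
    """
    forest = RandomForestClassifier(
        n_estimators=n_trees, max_depth=4, random_state=seed
    )
    forest.fit(X, y)

    # decision_path returns a sparse indicator matrix with one row per
    # observation and one column per node, concatenated over all trees.
    indicator, _ = forest.decision_path(X)
    paths = indicator.toarray().astype(bool)

    # Assumption: use the Jaccard distance between the sets of visited
    # nodes, so observations sharing long path prefixes are close.
    distances = pdist(paths, metric="jaccard")

    # Assumption: average-linkage agglomerative clustering, cut so that
    # at most n_clusters flat clusters remain.
    tree = linkage(distances, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic data, used only to exercise the sketch.
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, random_state=0
)
labels = path_distance_clusters(X, y, n_clusters=3)
print(np.unique(labels, return_counts=True))
```

Because the distance is computed over every node an observation visits, and not only over the terminal leaf, two observations that separate deep in a tree remain closer than two that separate at the root. That is the intuition behind using the entire tree structure.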

Keywords: random forest, clustering, networks, decision tree, consensus clustering

JEL Classification: C1, C01, C10, C19

Suggested Citation

Marquart, Ingo and Koca Marquart, Ebru, RFCC: Random Forest Consensus Clustering for Regression and Classification (March 19, 2021). Available at SSRN: https://ssrn.com/abstract=3807828 or http://dx.doi.org/10.2139/ssrn.3807828

Ingo Marquart (Contact Author)

ESMT Berlin

Schlossplatz 1
Berlin, Berlin 10178
Germany
