A Modified Gower Distance-Based Clustering Analysis for Mixed-Type Data
36 pages. Posted: 30 Mar 2024
Abstract
Objective: Traditional clustering techniques are usually restricted to either continuous or categorical variables. This study introduces a clustering technique designed specifically for datasets that combine continuous and categorical variables. Compared with other mixed-type methods, it aims to provide better clustering compatibility, adaptability, and interpretability.

Methods: We propose a modified Gower distance that incorporates feature importance as weights. The distance adjustment scheme keeps continuous and categorical features at the same level of divergence, so that neither type dominates the distance. The metric also accounts for how important each feature is to the clustering process, which gives a more precise representation of the clusters. The resulting algorithm (DAFI) was evaluated on both simulated and real-world datasets, with varying proportions of important features contributing to the clustering, and was compared with other clustering techniques.

Results: Clustering accuracy on the simulated datasets was measured with the adjusted Rand index, and cluster cohesion and separation on the real-world datasets were measured with the silhouette index. On both measures, the proposed technique was robust and consistently outperformed the baseline methods across the experimental scenarios.

Conclusion: DAFI outperformed classic clustering baselines on both simulated and real-world datasets. We envisage that DAFI will provide an effective solution for future mixed-type clustering.
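To make the idea in the Methods paragraph concrete, the following is a minimal sketch of a generic weighted Gower distance between two records. It is not the DAFI algorithm itself: the abstract does not give the adjustment scheme or how the importance weights are obtained, so the function name `weighted_gower`, its parameters, and the example weights are all hypothetical. Continuous features contribute a range-normalized absolute difference, categorical features contribute a 0/1 mismatch, and each contribution is scaled by an assumed per-feature weight.

```python
def weighted_gower(num_a, num_b, cat_a, cat_b, num_ranges, w_num, w_cat):
    """Generic weighted Gower distance between two mixed-type records.

    Illustrative only: the weights stand in for feature importances,
    whose actual derivation is not described in the abstract.
    """
    # Continuous part: |a - b| scaled by the feature's observed range,
    # so each term lies in [0, 1] before weighting.
    num_part = sum(
        w * abs(a - b) / r
        for a, b, r, w in zip(num_a, num_b, num_ranges, w_num)
    )
    # Categorical part: 1 if the categories differ, 0 if they match.
    cat_part = sum(
        w * (a != b)
        for a, b, w in zip(cat_a, cat_b, w_cat)
    )
    # Normalize by the total weight so the result stays in [0, 1].
    return (num_part + cat_part) / (sum(w_num) + sum(w_cat))


# Two records with two continuous and two categorical features.
d = weighted_gower(
    num_a=[1.0, 10.0], num_b=[3.0, 10.0],
    cat_a=["red", "yes"], cat_b=["blue", "yes"],
    num_ranges=[4.0, 20.0],   # assumed max - min of each continuous feature
    w_num=[2.0, 1.0],         # hypothetical importance weights
    w_cat=[1.0, 1.0],
)
print(d)  # 0.4
```

With all weights set to 1 this reduces to the ordinary Gower distance. A full pairwise matrix built this way could be passed to any clustering routine that accepts precomputed distances.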
Keywords: clustering, distance measure, feature importance, mixed-type data