A Modified Gower Distance-Based Clustering Analysis for Mixed-Type Data

36 Pages. Posted: 30 Mar 2024

Pinyan Liu

National University of Singapore (NUS) - Duke-NUS Medical School

Han Yuan

National University of Singapore (NUS) - Duke-NUS Medical School

Nan Liu

Duke-National University of Singapore Medical School - Centre for Quantitative Medicine

Marco Aurélio Peres

National University of Singapore (NUS) - Duke-NUS Medical School

Abstract

Objective: Traditional clustering techniques are usually restricted to either continuous or categorical variables. This study introduces a clustering technique designed for datasets that combine continuous and categorical variables. Compared with other mixed-type methods, it aims to provide better clustering compatibility, adaptability, and interpretability.

Methods: We propose a modified Gower distance that incorporates feature importance as weights. The distance adjustment scheme keeps continuous and categorical features at the same divergence level. The metric also accounts for the importance of each feature to the clustering process, which offers a more precise representation of clusters. The resulting algorithm (DAFI) was evaluated on both simulated and real-world datasets, with different proportions of important features contributing to the clustering, and its usefulness is demonstrated through comparisons with other clustering techniques.

Results: According to the adjusted Rand index, which measures clustering accuracy in the simulation studies, and the silhouette index, which measures cluster cohesion and separation in the real-world datasets, the proposed technique is a robust strategy that consistently outperforms baseline methods across the experimental scenarios.

Conclusion: DAFI outperformed classic clustering baselines on both simulated and real-world datasets. We envisage that DAFI provides an effective solution for future mixed-type clustering.
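The abstract does not give the exact form of the modified distance, so the following is only a minimal sketch of the general idea it describes: a Gower-style distance in which each feature carries a weight. It uses the standard Gower components (range-normalised absolute difference for continuous features, simple mismatch for categorical ones) and a weighted average. The function name `weighted_gower` and the weight vectors `w_num` and `w_cat` are hypothetical, and how DAFI derives feature importance or balances the two feature types is not reproduced here.

```python
import numpy as np

def weighted_gower(num, cat, w_num, w_cat):
    """Pairwise weighted Gower distance (illustrative, not the paper's DAFI).

    num   : (n, p) array of continuous features
    cat   : (n, q) array of categorical features encoded as integer codes
    w_num : (p,) non-negative weights for continuous features (assumed given)
    w_cat : (q,) non-negative weights for categorical features (assumed given)
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    w_num = np.asarray(w_num, dtype=float)
    w_cat = np.asarray(w_cat, dtype=float)

    # Continuous part: absolute difference scaled by each feature's range,
    # so every per-feature distance lies in [0, 1].
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant feature: avoid division by zero
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges

    # Categorical part: 0 if the categories match, 1 otherwise.
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)

    # Weighted average over all features.
    total = (d_num * w_num).sum(axis=2) + (d_cat * w_cat).sum(axis=2)
    return total / (w_num.sum() + w_cat.sum())

# Toy data: one continuous and one categorical feature, three observations.
num = [[0.0], [5.0], [10.0]]
cat = [[0], [0], [1]]
D = weighted_gower(num, cat, w_num=[2.0], w_cat=[1.0])
print(D)
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to any clustering routine that accepts a precomputed dissimilarity matrix. Setting all weights to 1 recovers the unweighted Gower distance.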

Keywords: clustering, distance measure, feature importance, mixed-type data

Suggested Citation

Liu, Pinyan and Yuan, Han and Liu, Nan and Peres, Marco Aurélio, A Modified Gower Distance-Based Clustering Analysis for Mixed-Type Data. Available at SSRN: https://ssrn.com/abstract=4779022 or http://dx.doi.org/10.2139/ssrn.4779022

Pinyan Liu (Contact Author)

National University of Singapore (NUS) - Duke-NUS Medical School

Singapore

Han Yuan

National University of Singapore (NUS) - Duke-NUS Medical School

Singapore

Nan Liu

Duke-National University of Singapore Medical School - Centre for Quantitative Medicine

8 College Rd.
Singapore, 169857
Singapore
+65 6601 6503 (Phone)

Marco Aurélio Peres

National University of Singapore (NUS) - Duke-NUS Medical School

Singapore
