Estimating the number of clusters in multivariate data by various fittings of the L-curve

Second Author's Department

Mathematics & Actuarial Science Department

https://doi.org/10.1007/s40314-024-02839-8

All Authors

Rida Moustafa, Ali S. Hadi

Document Type

Research Article

Publication Title

Computational and Applied Mathematics

Publication Date

2-1-2025

doi

10.1007/s40314-024-02839-8

Abstract

The goal of this paper is to estimate the true but unknown number of clusters K in multivariate data. The contributions are twofold. The first is to narrow the search space for the estimate k̂ to 1 ≤ k̂ ≤ K_max; we propose a new method for finding K_max, which is better than the existing ones. The second is to propose three indices for computing k̂ within this range: the R-Index, the FB Index, and the CSum Index. All three indices are based on the L-curve (the plot of W_k vs. k), where W_k is the total within-cluster similarity (withinness), for values of k in the above range. We give the rationale for each method. We investigate the performance of these three indices and compare them with six of the most commonly used indices, using both real benchmark datasets and challenging synthetic datasets of varying sample sizes (n = 200 to n = 3600) and varying numbers of true clusters, from K = 2 to K = 36. We use both the hierarchical and the k-means clustering algorithms, but the approach can also be used with other clustering methods. The three indices are shown to outperform the existing ones. An additional advantage of our indices is computational cost: they are shown to take much less time to compute than the existing ones.
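
As a rough illustration of the L-curve the abstract refers to, the sketch below computes W_k for k = 1, ..., K_max with a plain k-means. It is not the authors' code and does not reproduce the R-Index, FB Index, or CSum Index, whose definitions are not given in this record. Several choices are assumptions made only for the example: W_k is taken to be the within-cluster sum of squared Euclidean distances, the centers are initialized with a deterministic farthest-first rule, and the names `sq_dist`, `farthest_first`, `kmeans`, `withinness`, and `l_curve` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch): start at
    the first point, then repeatedly add the point farthest from the centers
    chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, n_iter=100):
    """Plain Lloyd iterations; returns (centers, labels)."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


def withinness(points, centers, labels):
    """W_k, assumed here to be the within-cluster sum of squared distances."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def l_curve(points, k_max):
    """Return {k: W_k} for k = 1, ..., k_max."""
    curve = {}
    for k in range(1, k_max + 1):
        centers, labels = kmeans(points, k)
        curve[k] = withinness(points, centers, labels)
    return curve


# Toy data: three well-separated 3x3 grids of points (K = 3).
offsets = [(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)]
blobs = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for dx, dy in offsets
]
curve = l_curve(blobs, 4)
for k, w in curve.items():
    print(k, round(w, 3))
```

On this toy data W_k drops sharply until k reaches 3 and then flattens, which gives the plot its L shape; an index based on the L-curve is a rule for locating that bend.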

Comments

Article. Record derived from SCOPUS.
