
Abstract

Some clustering algorithms, such as DBSCAN, do not assign every data point to a cluster; they label some points as noise and exclude them from all clusters. The literature offers no dedicated validity measures for such noise-aware clusterings, and applying the standard measures blindly (which seems to happen in the literature) yields misleading results. We revise top-performing, established validity measures to cope with the results of this kind of clustering algorithm and demonstrate that such clusterings may require an additional type of validity check, one that assesses not only cluster validity (separation and compactness) but also the validity of the distinction between noise and cluster instances. Additionally, we propose a balanced score that captures both types of validity to give a holistic validity score. All proposed measures are evaluated on artificial data, mimicking the experiments of the extensive review [Arbelaitz O, Gurrutxaga I, Muguerza J et al. 2013]. The encouraging results demonstrate that the noise-aware extensions of the Silhouette coefficient and the Score function are least influenced by the noise level.
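To make the problem concrete, the following is a minimal sketch of two naive ways to apply the standard Silhouette coefficient to a noise-aware clustering. It is not one of the measures proposed in the article. The noise label `-1` (the common DBSCAN convention), the toy points, and the helper name `silhouette` are assumptions made purely for illustration.

```python
from math import dist  # requires Python 3.8+

NOISE = -1  # assumed noise label, following the common DBSCAN convention


def silhouette(points, labels):
    """Mean Silhouette coefficient over all given points (plain textbook form)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two compact clusters plus three scattered noise points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20), (20, 5), (-8, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, NOISE, NOISE, NOISE]

# Naive option 1: treat the noise label as if it were an ordinary cluster.
blind = silhouette(points, labels)

# Naive option 2: drop the noise points and score only the clustered points.
kept = [(p, lab) for p, lab in zip(points, labels) if lab != NOISE]
clean = silhouette([p for p, _ in kept], [lab for _, lab in kept])

noise_fraction = labels.count(NOISE) / len(labels)
print(f"blind={blind:.3f} clean={clean:.3f} noise_fraction={noise_fraction:.2f}")
```

On this toy data the first option is pulled down by the scattered noise points, which do not form a compact group. The second option ignores the noise entirely, so it cannot tell whether the split between noise and cluster instances was sensible. Neither number alone covers both aspects, which is the gap the abstract describes.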

Keywords

cluster validity, noise clustering, internal cluster validity


References

1. von Luxburg, Williamson, Guyon. Clustering: Science or art? In: Proceedings ICML workshop on unsup. transfer learning, 2012.
2. Jeon, Kuo, Aupetit, et al. Classes are not clusters: Improving label-based evaluation of dimensionality reduction. arXiv, 2023.
3. Färber, Günnemann, Kriegel, et al. On using class-labels in evaluation of clusterings. In: MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings held in conjunction with KDD, 2010.
4. Arbelaitz, Gurrutxaga, Muguerza, et al. An extensive comparative study of cluster validity indices. Pattern Recognit 2013.
5. Ester, Kriegel, Sander, et al. Density-based spatial clustering of applications with noise. In: Int. Conf. knowledge discovery and data mining, 1996.
6. Davé. Characterization and detection of noise in clustering. Pattern Recognit Lett 1991.
7. Ertöz, Steinbach, Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the 2003 SIAM international conference on data mining 2003, pp.47–58. SIAM.
8. Asakly, Blecher, Brennan, et al. Set partition asymptotics and a conjecture of Gould and Quaintance. J Math Anal Appl 2014.
9. Topchy, Jain, Punch. Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 2005.
10. Topchy, Law MHC, Jain, et al. Analysis of consensus partition in cluster ensemble. In: ICDM, 2004.
11. Faceli, Sakata, de Souto MCP, et al. Partitions selection strategy for set of clustering solutions. Neurocomputing 2010.
12. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987.
13. Davies, Bouldin. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1979.
14. Bezdek, Pal. Some new indexes of cluster validity. IEEE Trans Syst Man Cybern 1998.
15. Saitta, Raphael, Smith IFC. A bounded index for cluster validity. In: IAPR Int. Conf. on Machine Learning and Data Mining in Pattern Rec, 2007.
16. Chen, Banitaan, Maleki, et al. Pedestrian group detection with k-means and DBSCAN clustering methods. In: 2022 IEEE eIT, 2022.
17. Kertanah, Nurmayanti, Aini, et al. Comparison of algorithms k-means and DBSCAN for clustering student cognitive learning outcomes in physics subject. Kappa J 2023.
18. Ogbuabor, Ugwoke. Clustering algorithm for a healthcare dataset using silhouette score value. AIRCC’s Int Journ of Comp Sc and Inf Tech 2018.
19. Zhao, Chen, Xie, et al. A novel silhouettes cluster internal evaluation index based on granular-ball. In: 2023 8th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), 2023, pp.92–97. DOI: 10.1109/ICCCBDA56900.2023.10154684.
20. Hopkins, Skellam. A new method for determining the type of distribution of plant individuals. Ann Bot 1954.
21. Ros, Riad, Guillaume. PDBI: A partitioning Davies-Bouldin index for clustering evaluation. Neurocomputing 2023; 528: 178–199.

Cluster validity for noise aware clusterings
