Efficient estimation of the number of clusters for high-dimension data

Abstract

The exponential growth of digital image data has created a need for efficient content management and retrieval tools. Currently, there is a lack of tools for processing collected unlabeled data in a schematic manner. K-means is one of the most widely used clustering methods and has been applied in a variety of fields, including image sorting. Although it is a useful tool for image management, K-means is heavily influenced by its initialization, most importantly the need to know the number of clusters a priori. Several methods have been proposed for identifying the correct number of clusters for K-means, one of them being the variance ratio criterion (VRC). Despite its popularity, the VRC method has two important shortcomings: it yields good results only when the data dimensionality is low, and it does not scale well to a high number of clusters, which makes it very difficult to use in computer vision applications. We propose an extension to the VRC method that works for a larger number of clusters and for high-dimensional data sets, and is therefore suitable for image data sets.
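For orientation, the VRC (also known as the Calinski-Harabasz index) is conventionally defined as the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k for k clusters and n points). The following is a minimal sketch of that standard index in plain Python, not of the extension proposed in this article; the function name and the toy data are illustrative assumptions.

```python
from collections import defaultdict


def variance_ratio_criterion(points, labels):
    """Standard VRC (Calinski-Harabasz index) for a hard clustering.

    points: list of equal-length numeric sequences
    labels: list of cluster ids, one per point
    (Hypothetical helper written for illustration only.)
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by their cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("VRC is only defined for 2 <= k < n")

    def centroid(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # size-weighted squared distance of centroids to the overall mean
    within = 0.0   # squared distance of points to their own centroid
    for rows in clusters.values():
        center = centroid(rows)
        between += len(rows) * sq_dist(center, overall)
        within += sum(sq_dist(row, center) for row in rows)

    if within == 0.0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))


# Toy data (assumed): two tight pairs of points far apart along the x axis.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_split = [0, 0, 1, 1]  # separates the two pairs
poor_split = [0, 1, 0, 1]  # mixes them

print(variance_ratio_criterion(toy_points, good_split))  # 50.0
print(variance_ratio_criterion(toy_points, poor_split))  # 0.08
```

In the usual procedure, K-means is run for each candidate number of clusters and the value with the highest VRC is selected; the higher score of the first labeling above reflects that preference for compact, well-separated clusters.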

Keywords

Clustering, number of clusters, initializations, unsupervised learning schema, computer vision, variance ratio criterion

