Empirical evaluation of five algorithms for the initialization phase of the k-Means algorithm

Abstract

Clustering is a recurring problem in a wide variety of research areas, such as pattern recognition, machine learning, data mining and statistics. It can be described in simple terms as follows: given a set of data (observations, objects, points, etc.), group similar data into clusters (groups). A clustering of a given data set is then a set of clusters in which elements belonging to the same cluster are similar to each other and elements belonging to distinct clusters are dissimilar. Clustering algorithms are unsupervised algorithms and, among the many available in the literature, k-Means, which uses a random initialization process, can be considered one of the most popular and successful. The performance of k-Means, however, is highly dependent on a ‘good’ initialization of the $k$ cluster centers (centroids), as well as on the value assigned to the number ($k$) of clusters the final clustering should have. This paper reports experiments with five initialization algorithms available in the literature, namely Method1, k-Means++, CCIA, Maedeh and Suresh, and SPSS, to empirically evaluate their contribution to improving k-Means performance.
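To make the role of the initialization phase concrete, the following minimal sketch contrasts a uniformly random choice of initial centroids with k-Means++ seeding, followed by standard Lloyd iterations. It is an illustration only, not the experimental code of this paper: the function names (`sq_dist`, `random_init`, `kmeanspp_init`, `kmeans`), the toy data and the use of the sum of squared errors (SSE) as a quality measure are all assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def random_init(points, k, rng):
    """Baseline: pick k data points uniformly at random, without replacement."""
    return rng.sample(points, k)


def kmeanspp_init(points, k, rng):
    """k-Means++ seeding: the first center is uniform at random; each further
    center is drawn with probability proportional to its squared distance
    from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # All points coincide with a chosen center: fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, centers, max_iter=100):
    """Lloyd iterations from the given centers; returns (centers, sse)."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(xs) / len(members) for xs in zip(*members))
            if members else c
            for members, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: three Gaussian blobs in the plane.
    blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
            for cx, cy in blobs for _ in range(30)]
    for name, init in (("random", random_init), ("k-means++", kmeanspp_init)):
        _, sse = kmeans(data, init(data, 3, rng))
        print(name, round(sse, 2))
```

Running both initializations over several random seeds and comparing the resulting SSE values gives a rough feel for how strongly the final clustering can depend on the starting centroids. A lower SSE indicates more compact clusters, but it is only one of many possible validity measures.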

Keywords

Unsupervised learning, k-Means, initialization algorithms

