Selection of K in K-means clustering

Abstract

The K-means algorithm is a popular data-clustering algorithm. However, one of its drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm for different data sets.

Keywords

clustering, cluster number selection
