A co-evolutionary framework for adaptive multidimensional data clustering

Abstract

Data clustering refers to constructing groups of objects that are highly correlated, based on some similarity measure. It is a very popular technique for intelligent knowledge discovery. A challenge that arises in automatic data clustering, though, is the high dimensionality of the data, since each object can be described by several relevant features. Thus, we often need to assign each feature a relative weight that indicates its importance during the clustering process. In the absence of domain knowledge about the nature of the data, assigning such weights becomes a challenging task. Dynamic adjustment of feature weights in an unsupervised manner is an attractive solution to this problem. In this paper, we propose a co-evolutionary algorithm for the dynamic adjustment of feature weights during data clustering. Two populations are evolved simultaneously to optimize both the clusters and their associated feature weights. In addition, the number of clusters is learnt and optimized during the evolutionary process. Extensive experimental results on several datasets from the UCI Machine Learning Repository indicate the efficacy of the proposed approach. The algorithm outperforms both a non-adaptive version, in which feature weights are not considered, and K-means clustering with a fixed number of clusters.
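The abstract does not specify the encoding, the genetic operators, or the fitness function, so the following is only a minimal sketch of the general idea of co-evolving two populations, one of centroid sets (with a variable number of clusters) and one of feature-weight vectors. Every name here (`weighted_distance`, `fitness`, `coevolve`, and so on), the separation-to-compactness fitness, and the mutation-only operators are illustrative assumptions, not the authors' method.

```python
import math
import random

def weighted_distance(x, y, weights):
    """Euclidean distance with one non-negative weight per feature."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

def assign(points, centroids, weights):
    """Index of the nearest centroid for each point under the weighted distance."""
    return [min(range(len(centroids)),
                key=lambda k: weighted_distance(p, centroids[k], weights))
            for p in points]

def fitness(points, centroids, weights):
    """Hypothetical validity score (higher is better): smallest centroid
    separation divided by the mean point-to-centroid distance."""
    if len(centroids) < 2:
        return 0.0
    labels = assign(points, centroids, weights)
    within = sum(weighted_distance(p, centroids[l], weights)
                 for p, l in zip(points, labels)) / len(points)
    between = min(weighted_distance(a, b, weights)
                  for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return between / (within + 1e-12)

def mutate_weights(weights, rng):
    """Perturb the weights, keep them positive, and renormalize to sum to 1."""
    raw = [max(1e-6, w + rng.gauss(0, 0.1)) for w in weights]
    total = sum(raw)
    return [w / total for w in raw]

def mutate_centroids(centroids, points, rng, k_max):
    """Add a cluster, remove a cluster, or jitter one centroid coordinate."""
    child = [list(c) for c in centroids]
    move = rng.random()
    if move < 0.2 and len(child) < k_max:
        child.append(list(rng.choice(points)))
    elif move < 0.4 and len(child) > 2:
        child.pop(rng.randrange(len(child)))
    else:
        c = rng.choice(child)
        c[rng.randrange(len(c))] += rng.gauss(0, 0.1)
    return child

def coevolve(points, generations=50, pop_size=10, k_max=5, seed=0):
    """Alternately rank each population against the other's current best."""
    rng = random.Random(seed)
    dim = len(points[0])
    cluster_pop = [[list(rng.choice(points)) for _ in range(rng.randint(2, k_max))]
                   for _ in range(pop_size)]
    weight_pop = [mutate_weights([1.0 / dim] * dim, rng) for _ in range(pop_size)]
    best_c, best_w = cluster_pop[0], weight_pop[0]
    half = pop_size // 2
    for _ in range(generations):
        # Clusterings are scored with the best weight vector as collaborator.
        cluster_pop.sort(key=lambda c: fitness(points, c, best_w), reverse=True)
        best_c = cluster_pop[0]
        # Weight vectors are scored with the best clustering as collaborator.
        weight_pop.sort(key=lambda w: fitness(points, best_c, w), reverse=True)
        best_w = weight_pop[0]
        # Keep the top half of each population and refill with mutated copies.
        cluster_pop = cluster_pop[:half] + [
            mutate_centroids(rng.choice(cluster_pop[:half]), points, rng, k_max)
            for _ in range(pop_size - half)]
        weight_pop = weight_pop[:half] + [
            mutate_weights(rng.choice(weight_pop[:half]), rng)
            for _ in range(pop_size - half)]
    return best_c, best_w
```

Calling `coevolve(points)` on a list of equal-length numeric feature vectors returns a centroid set and a normalized weight vector. Pairing each individual with the other population's current best is one common collaboration scheme in cooperative co-evolution; the paper may use a different one.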

Keywords

Data clustering; co-evolutionary algorithm; genetic algorithm; multidimensional clustering; feature selection
