Proper Global Shared Preference Detection Based on Golden Section and Genetic Algorithm for Affinity Propagation Clustering

Abstract

Affinity propagation (AP) clustering is a well-known, effective clustering algorithm that outperforms other traditional clustering algorithms. However, the quality of its clustering results depends considerably on sensitive parameters, namely, the preferences and the damping factor. We therefore propose a feasible procedure based on the golden section (GS) and the genetic algorithm (GA). This procedure, called the “GS/GA-AP” algorithm, detects a proper global shared preference and, with it, a suitable number of clusters. By default, the global shared preference is set to the GS value between the minimum and maximum of the similarities; if the resulting clustering is unsatisfactory, the preference is instead selected by GA, which makes the result more robust. Experiments on one simulated data set and eight benchmark data sets verify the effectiveness of the proposed algorithm. The results indicate that GS/GA-AP clearly outperforms the original AP clustering algorithm.

1. Introduction

Mining knowledge from massive data sets generated or captured by sensors, smartphones, and other monitoring devices has been a challenging problem in recent years, and techniques to address it continue to emerge. Supervised learning techniques, such as classification, regression, and semantic web construction [1, 2], are effective in classifying, regressing, and extracting events and interval links from large-scale data sets [3, 4], particularly from certain heterogeneous data sets, but may be inappropriate for processing data sets without prior knowledge. Therefore, unsupervised learning techniques, such as clustering [5], are suitable for extracting inner connections within a data set and for accumulating prior knowledge for further processing.

Cluster analysis (or clustering) is the process of separating a set of abstract or physical objects into several classes of similar objects [6]. A cluster is a subset of the original object set, in which the objects are similar to one another and dissimilar to the objects in other clusters. Elementary cluster analysis methods are classified into four categories, namely, partitioning, hierarchical, density-based, and grid-based methods [6].

A partitioning approach is a process of separating an object set into several clusters, each of which contains at least one object. A hierarchical approach captures several hierarchical decompositions of an object set and consists of agglomerative and divisive methods that follow the bottom-to-top or top-to-bottom decomposition directions. The main idea of density-based approaches is to continue the growth of given clusters as long as the density of data points in the neighborhood exceeds the given threshold. A grid-based approach formulates a grid structure, which quantizes the object space into a finite number of cells, and cluster operations work on the grid structure. Numerous traditional and classical methods have been widely used in various computer science fields, and novel methods have been continuously emerging from research institutes and engineering projects.

Cluster analysis helps determine an appropriate or meaningful data partition on a given scientific data set and identify an exemplar configuration that exemplifies the data points; it aims to solve a fundamental problem in machine learning, that is, how each data point is characterized by the cluster or exemplar to which it belongs. The contribution of the aforementioned elementary clustering methods and other clustering methods, such as gene clustering [7], is focused on detecting an appropriate configuration, in which the enhanced updating rules or the selected parameters are customized or provided by experiments. These upgraded methods have been widely used in many computer science areas, such as artificial intelligence, pattern recognition, and gene sequencing [8, 9].

As a novel clustering method, affinity propagation (AP), which was devised by Frey and Dueck and published in Science in 2007 [10, 11], aims to identify an appropriate exemplar configuration and clustering assignments of data points in finite rounds of iterative calculation for responsibilities and availabilities, and the obtained square error of the clustering result of AP is considerably less than that of k-means. AP performs efficiently in many computer science and interdisciplinary fields, such as in clustering face images, detecting genes, and identifying key sentences and air-travel routes [10]; it has attracted the attention of researchers because of its excellent performance in recent years.

The genetic algorithm (GA) [13] tends to find a solution for combinatorial optimisation issues in the solution space. GA, which is derived from biological evolution in the natural environment that occurs over thousands of years, creates a simulation of solution evolution and elimination, in which a series of encoded candidate solutions is updated and an appropriate solution emerges after finite rounds of reproduction-crossover-mutation cycle. Despite its local optimum shortcoming, GA is popular and widely used because of its heuristic search function, rapid convergence, easy parallel implementation, and effective solution. GA has been developed well because of its effectiveness and contribution to the transition periods of many research areas.

AP is sensitive to two parameters, namely, the damping factor and the preferences, which affect the number of iterations and the number of clusters. The damping factor, denoted by λ, is a common means of controlling convergence and stabilizing the assignment of data points. A high damping factor restrains the updating of responsibilities and availabilities in the iterative calculation, whereas a low damping factor helps prevent defective exemplar configurations and clustering assignments of data points. The damping factor is typically set to $0.5$ or $0.9$ on the basis of expert experience [10].

Preferences are the control knobs that govern the number of clusters: a preference is the confidence level required for a data point to become a definite exemplar that represents a cluster. If a small preference value is assigned to a data point, the point is severely penalized for becoming an exemplar and has to surrender to another candidate exemplar, which leads to fewer clusters. By contrast, more clusters emerge if each data point holds a high preference value; the data points are then more confident in dominating clusters, which may fragment the data into numerous small, disordered subsets.

Each data point can be endowed with its own preference value and thus its own confidence level for dominating a cluster. Without prior knowledge, all the preferences can be set to the same value, which gives every point an equal opportunity to become an exemplar. In general, preferences can remain constant during the iterative calculation of AP to achieve a stable exemplar configuration, whereas breaking away from a wrong exemplar configuration is achieved via the dynamic adjustment of preferences. For convenient preference selection, all the preferences therefore commonly share one value that remains unchanged during the entire clustering process; this value is called the “global shared preference” [12]. We focus on selecting a global shared preference. Thus, unless stated otherwise, the singular form “preference” is used in this study as a synonym of “global shared preference,” whereas the plural form “preferences” without the modifier “global shared” denotes the vector formed by the individual preferences of the data points.

The quality of the AP clustering result can be improved by formalizing the selection of the global shared preference. An improved result helps reveal the inner connections of the original data set and accumulate prior knowledge about it, which has motivated us to explore how the preference governs the cluster number and the quality of AP and to design a feasible method for proper preference selection. In this study, we propose a feasible procedure called the “GS/GA-AP” algorithm for the proper detection of a global shared preference. In the GS-AP phase, the preference is set to the GS value between the minimum and maximum of the similarities for AP, and a fitness function monitors the quality of the clustering result. If the clustering result is unsatisfactory compared with that of the original AP, which uses the median as the preference, we use GA to detect a proper preference for AP, so that the proper preference and the clustering result are obtained automatically.

The rest of this paper is organized as follows. Section 2 introduces recent research regarding AP, the preference selection for AP, and GA. Section 3 presents the common AP clustering algorithm, the origin of GS, and the fundamental GA. Section 4 describes the GS/GA-AP algorithm. Section 5 provides a brief explanation of the effectiveness of GS/GA-AP. Section 6 presents the evaluation experiments and the results of the original AP and GS/GA-AP algorithms, and Section 7 concludes the study.

2. Related Work

Some work preceded the publication of AP by Frey and Dueck in Science in 2007. In 2006, Frey and Dueck [14] described a prototype of AP based on a continuous sum-product probabilistic model and demonstrated its effectiveness in clustering image patches for image segmentation and in learning mixtures of gene expression models from microarray data. In 2007, Dueck and Frey [15] applied a translation-invariant nonmetric similarity to AP, which achieved a much lower reconstruction error and classification error rate on the Olivetti face data set. In 2008, Dueck et al. [16] modified AP and applied it to selecting a subset of yeast genes that act as a drug-response footprint and to clustering a subset of vaccine sequences that provide maximum epitope coverage for an HIV genome population. These studies [10, 14–16] demonstrate that AP performs well and has been widely used.

Subsequent researchers began to improve the algorithm. In 2009, Givoni and Frey [17] presented a much simpler model of AP, based on a quite different graphical model that allows easier derivation of the message updating process for extensions and modifications. In 2010, Zhang et al. [18] proposed an algorithm called K-AP, which obtains a result with K clusters directly by introducing a constraint into the message passing process. In the same year, Tang et al. [19] designed PoissonAPS, which combines AP with the Poisson model to overcome the limitations of AP and to cluster SAGE data automatically without user-specified parameters.

In 2012, Givoni et al. [20] extended AP in a principled way to solve the hierarchical clustering problem that was applied in a number of domains such as biology. In 2013, Wang et al. [21] proposed multiexemplar AP algorithm to provide a solution to multitopic and multiexemplar clustering problem, which was a great improvement for AP clustering algorithm. References [17–19] focus on accelerating the convergence speed of AP or applying AP to a hierarchical situation and [20, 21] serve interdisciplinary research, all of which propose the improvement of AP for clustering more properly or being used extensively.

Several papers on preference selection in AP have been published. Wang et al. [22] presented an adaptive preference selection method based on gradient descent to obtain a suitable clustering result. In 2010, He et al. [23] proposed another adaptive selection method that first determines the range of preferences and then searches this range for an appropriate value. In 2012, Su et al. [24] proposed an adaptive AP algorithm for semisupervised hyperspectral band selection and used the bisection method for preference selection. In 2014, Chen et al. [25] proposed a new approach based on stability, using the NMI measure to calculate the stability of AP. The methods in [22–25] detect preferences adaptively, by gradient descent, by searching the preference range, or by bisection, and all of them yield feasible rather than optimal preferences. In practice, users generally prefer the optimal clustering result induced by the optimal preferences, which inspired us to build our model by drawing on the experience of [26, 27].

In recent years, some papers have addressed preference detection by combining AP with optimization algorithms. Wang et al. [28, 29] applied the particle swarm optimization (PSO) algorithm to find the optimal preferences and clustering solution for AP. Zhong et al. [30] also proposed an approach that uses PSO to detect the optimal preferences and the optimal cluster number in AP. You et al. [31] presented a similar approach to find the preferences. The methods in [28–31] are designed to detect the optimal clustering results and have inspired other research. We therefore propose GS/GA-AP to reduce the time complexity with a certain probability and to address the choice of the optimal preference.

Proposed by Holland [13] in 1975, the genetic algorithm (GA) simulates the survival and evolution of creatures in the natural environment. Individuals that are better adapted to the specified environment have more opportunities to survive, reproduce, and evolve, whereas the others are eliminated over several iterations before the algorithm terminates. Recently, GA has been widely used to search for optimal or feasible solutions to NP-hard combinatorial optimization problems, such as handling time-dependent demand and variable holding cost in a soft computing optimization based on two-warehouse inventory models [32], motif discovery in molecular biology research [33], joint tactical air requests [34], describing linear relationships for biclustering [35], and data clustering [36, 37].

The parallel genetic algorithm (PGA) was derived from the original GA by exploiting its parallelization property [38], which shows that GA is capable of processing large-scale data. An effective PGA model that automatically parallelizes GA as a feasible implementation was presented in [39]. Following extensive research on parallelization, PGA combined with MapReduce [40] has been widely applied in parallel computing for big data [41] and in the automatic generation of test suites [42].

3. Background

3.1. Affinity Propagation

Affinity propagation (AP) [10–12] is a clustering algorithm that uses max-sum belief propagation (max-product in the log domain) on a factor graph [43] to identify an appropriate exemplar configuration for a data set. Given a data set $X = \{x_{1}, x_{2}, \dots, x_{n}\}$, AP seeks a valid configuration of exemplar labels $c = [c_{1}, c_{2}, \dots, c_{n}]$, where $c_{i}$ is the exemplar of data point $x_{i}$ $(i = 1, 2, \dots, n)$. A similarity function $\mathrm{Sim}(c)$ is constructed by summing the similarities between each data point $x_{i}$ and its exemplar $c_{i}$, and the exemplar configuration $c = [c_{1}, c_{2}, \dots, c_{n}]$ that maximises $\mathrm{Sim}(c)$ is the optimal one:

\begin{matrix} \mathrm{Sim}(c) = \sum_{i = 1}^{n} s(x_{i}, c_{i}), \end{matrix}

(1)

where $c_{i}$ is the exemplar of data point $x_{i}$.

The algorithm aims to identify the exemplar configuration that maximises the similarity function $\mathrm{Sim}(c)$. A constraint is attached to guarantee exemplar consistency and to eliminate incorrect exemplar configurations, so $\mathrm{Sim}(c)$ and a penalty term together form the objective function $S(c)$, which is used to configure the exemplar labels and assign the data points effectively. In practice, the similarity function is optimised by minimising a distance energy function $E(c)$, because the negative of a distance metric between two data points is a suitable similarity measure between them; that is, $s(x_{i}, c_{i})$ can be defined as $-\mathrm{dist}(x_{i}, c_{i})$:

\begin{matrix} S(c) = \mathrm{Sim}(c) + \sum_{k = 1}^{n} δ_{k}(c) = \sum_{i = 1}^{n} s(x_{i}, c_{i}) + \sum_{k = 1}^{n} δ_{k}(c), \end{matrix}

(2)

where $\mathrm{Sim}(c)$ can be calculated from $E(c)$ as

\begin{matrix} \mathrm{Sim}(c) ≜ -E(c) = -\sum_{i = 1}^{n} \mathrm{dist}(x_{i}, c_{i}) \end{matrix}

(3)

and $δ_{k}(c)$ is defined by

\begin{matrix} δ_{k}(c) = \begin{cases} -\infty, & \text{if } c_{k} \neq x_{k} \text{ but } \exists i \neq k : c_{i} = x_{k}, \\ 0, & \text{otherwise}. \end{cases} \end{matrix}

(4)

Here, $δ_{k}(c)$ is a penalty term that eliminates incorrect exemplar configurations and data point assignments. $δ_{k}(c)$ takes the value $-\infty$ when some data point $x_{i}$ chooses $x_{k}$ as its exemplar but $x_{k}$ is not itself labelled as an exemplar; in that case the value of the objective function falls to $-\infty$ and the configuration is invalid.
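As an illustration, the objective in (2) can be evaluated for a candidate configuration in a few lines of NumPy. The sketch below is not taken from the original AP implementation: the function name `objective`, the use of integer indices for exemplars ($c_{i} = k$ meaning that $x_{k}$ is the exemplar of $x_{i}$), and the convention of storing the preferences on the diagonal of the similarity matrix are assumptions made for this example.

```python
import numpy as np

def objective(S, c):
    """Evaluate S(c) from Eq. (2) for exemplar indices c (hypothetical helper).

    S is an n x n similarity matrix whose diagonal holds the preferences
    (an assumption of this sketch); c[i] is the index of the exemplar of point i.
    """
    S = np.asarray(S, dtype=float)
    c = np.asarray(c)
    # Penalty of Eq. (4): every point chosen as an exemplar must be its own exemplar.
    chosen = np.unique(c)
    if np.any(c[chosen] != chosen):
        return -np.inf
    # Sim(c) of Eq. (1): sum of similarities between each point and its exemplar.
    return float(S[np.arange(len(c)), c].sum())

# Three points on a line, similarity = negative distance, preference = -2.
X = np.array([0.0, 1.0, 5.0])
S = -np.abs(X[:, None] - X[None, :])
np.fill_diagonal(S, -2.0)
print(objective(S, [0, 0, 2]))   # valid configuration: -2 - 1 - 2 = -5.0
print(objective(S, [1, 0, 2]))   # point 0 picks 1, but 1 is not an exemplar: -inf
```

The second call shows how the penalty term rules out an inconsistent configuration regardless of how large the summed similarity would otherwise be.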

The function $S (c)$ is decomposed using a factor graph for maximising itself effectively. A visualization of the factor graph for function $S (c)$ is exhibited in Figure 1(a): the square nodes represent the function nodes, while the circle nodes represent the variable nodes. Edges exist only if the variable node is an input of the function node, which means the function nodes representing $s (x_{i}, c_{i})$ have one edge connected with the variable node $c_{i}$ , while the function nodes representing $δ_{k} (c)$ have n edges connected with all of these variable nodes.

Figure 1

(a) Factor graph for AP [11]. (b) The direction of passing message [11].

Max-sum algorithm is applied in the iterative calculating process in order to recognise an appropriate exemplar configuration. There are two kinds of message passing via edges on the factor graph in this algorithm: one is called $ρ_{i \to k} (c_{i})$ sent from variable node $c_{i}$ to function node $δ_{k}$ ; another is called $α_{i \leftarrow k} (c_{i})$ sent from function node $δ_{k}$ to variable node $c_{i}$ . The direction of passing message is shown in Figure 1(b).

$ρ_{i \to k}(c_{i})$ and $α_{i \leftarrow k}(c_{i})$ are updated by the two rules in (5), which follow the updating rules of the max-sum algorithm:

\begin{matrix} ρ_{i \to k}(c_{i}) = s(x_{i}, c_{i}) + \sum_{k^{'} : k^{'} \neq k} α_{i \leftarrow k^{'}}(c_{i}), \\ α_{i \leftarrow k}(c_{i}) = \max_{c_{j_{1}}, \dots, c_{j_{i - 1}}, c_{j_{i + 1}}, \dots, c_{j_{n}}} [δ_{k}(c_{j_{1}}, \dots, c_{j_{i - 1}}, c_{i}, c_{j_{i + 1}}, \dots, c_{j_{n}}) + \sum_{i^{'} : i^{'} \neq i} ρ_{i^{'} \to k}(c_{j_{i^{'}}})] . \end{matrix}

(5)

The exemplar configuration and the data point assignments are identified when the iterative calculation of $ρ_{i \to k}(c_{i})$ and $α_{i \leftarrow k}(c_{i})$ converges; the time complexity of this calculation is $O(n^{n})$ [12]. An elegant simplification of $ρ_{i \to k}(c_{i})$ and $α_{i \leftarrow k}(c_{i})$ presented in [11, 12] reduces the time complexity to $O(n^{3})$ using the following rules:

\begin{matrix} r(i, k) = ρ_{i \to k}(c_{i} = x_{k}) = s(x_{i}, x_{k}) - \max_{j : j \neq k} [s(x_{i}, x_{j}) + a(i, j)], \end{matrix}

(6)

\begin{matrix} a(i, k) = α_{i \leftarrow k}(c_{i} = x_{k}) = \begin{cases} \sum_{i^{'} : i^{'} \neq k} \max(0, r(i^{'}, k)), & \text{for } k = i, \\ \min [0, r(k, k) + \sum_{i^{'} : i^{'} \notin \{i, k\}} \max(0, r(i^{'}, k))], & \text{for } k \neq i . \end{cases} \end{matrix}

(7)

Here, $ρ_{i \to k}(c_{i})$ is called the “responsibility” $r(i, k)$, which reflects how well suited data point k is to serve as the exemplar of data point i, and $α_{i \leftarrow k}(c_{i})$ is called the “availability” $a(i, k)$, which reflects how appropriate it would be for data point i to choose data point k as its exemplar.

Finally, after convergence, the exemplar labels $c = [c_{1}, c_{2}, \dots, c_{n}]$ are computed as

\begin{matrix} c_{i} = \arg\max_{x_{k}} [a(i, k) + r(i, k)] . \end{matrix}

(8)

The details of AP are described in Algorithm 1.

Algorithm 1: AP clustering algorithm (AP) [10, 12].

Input:

$X = \{x_{1}, x_{2}, \dots, x_{n}\}$ : A data set.

p: The global shared preference.

$max\_iter$ : The maximum number of iterations.

$con\_iter$ : The number of iterations required for convergence.

$con\_threshold$ : The convergence threshold.

λ : The damping factor.

Output:

$Labels = [Clu(x_{1}), \dots, Clu(x_{n})]$ : The clustering assignment labels, where $Clu(x_{i})$ indicates the cluster to which $x_{i}$ belongs.

(1) AP.Initial( $max\_iter$ , $con\_iter$ , $con\_threshold$ , λ)

(2) $s(i, k) \leftarrow$ Similarity( $x_{i}$ , $x_{k}$ ) for all data points in X

(3) $S \leftarrow {[s(i, k)]}_{n \times n}$

(4) if p is given then

(5) $s(i, i) \leftarrow p$ for all $i ⩽ n$

(6) else

(7) $s(i, i) \leftarrow$ median( S ) for all $i ⩽ n$

(8) end if

(9) Responsibility $r(i, k) \leftarrow 0$ for all $i, k ⩽ n$

(10) Responsibility matrix $R \leftarrow {[r(i, k)]}_{n \times n}$

(11) Availability $a(i, k) \leftarrow 0$ for all $i, k ⩽ n$

(12) Availability matrix $A \leftarrow {[a(i, k)]}_{n \times n}$

(13) Iteration number $t \leftarrow 0$

(14) repeat

(15) update $r(i, k)$ using (6)

(16) $R \leftarrow {[r(i, k)]}_{n \times n}$

(17) update $a(i, k)$ using (7)

(18) $A \leftarrow {[a(i, k)]}_{n \times n}$

(19) $R_{t} \leftarrow (1 - λ) \times R_{t} + λ \times R_{t - 1}$

(20) $A_{t} \leftarrow (1 - λ) \times A_{t} + λ \times A_{t - 1}$

(21) $t \leftarrow t + 1$

(22) until R and A stay constant for $con\_iter$ iterations, or change by less than $con\_threshold$ , or $t ⩾ max\_iter$

(23) Compute the exemplar labels $c = [c_{1}, c_{2}, \dots, c_{n}]$ using (8)

(24) Compute $Labels = [Clu(x_{1}), \dots, Clu(x_{n})]$ using $Clu(x_{i}) = Clu(c_{i})$

3.2. Golden Section

The golden section, also known as the golden ratio, comes from the partition of a line segment. Suppose that a line segment one unit long is divided into two unequal parts, where the longer part has length x and the shorter part has length $1 - x$. The partition exhibits substantial harmony and aesthetic appeal when the ratio of the shorter part to the longer part equals the ratio of the longer part to the whole segment. This proportion is described by

\begin{matrix} \frac{x}{1} = \frac{1 - x}{x} . \end{matrix}

(9)

Since x should not be zero, the equation is converted to the following without the denominator:

\begin{matrix} x^{2} + x - 1 = 0 . \end{matrix}

(10)

The positive root of this equation, $x = (\sqrt{5} - 1) / 2 \approx 0.618$, is the well-known golden section ratio.
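The root can be checked numerically. The short sketch below, which is only an illustration, computes the positive root of (10) and confirms that it satisfies the proportion in (9); the variable names are chosen for this example.

```python
import math

# Positive root of x^2 + x - 1 = 0, Eq. (10).
x = (math.sqrt(5) - 1) / 2

# Proportion of Eq. (9): longer/whole equals shorter/longer.
lhs = x / 1
rhs = (1 - x) / x

print(round(x, 6), math.isclose(lhs, rhs))   # prints: 0.618034 True
```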

3.3. Genetic Algorithm

The genetic algorithm (GA) is an intelligent optimization algorithm whose theoretical foundation was proposed by Holland [13] in 1975. Throughout the procedure, GA maintains and reproduces the individuals that are fitter for the environment and eliminates those that are less fit. Because superior individuals reproduce, their favourable genes are inherited by the next generation, which ensures that the whole population evolves and becomes better adapted to the environment set by GA.

GA starts by randomly producing an initial population of individuals as possible solutions. Each individual is encoded as a chromosome, either a binary string or a real number. A fitness function is specified to calculate the fitness of each individual, and this fitness determines the survival probability of the individual: the greater the fitness, the larger the probability that the individual survives and reproduces into the next generation. GA uses three phases to search for an optimal or satisfactory solution: reproduction, crossover, and mutation. In the reproduction phase, the fitness of each individual is calculated, and each individual reproduces into the next generation according to its fitness. In the crossover phase, pairs of surviving individuals exchange fragments of their genes to generate new individuals. In the mutation phase, each individual has a small probability of altering its genes by changing an existing gene to one of its alleles. The reproduction-crossover-mutation cycle is repeated until a stable or satisfactory solution emerges; these are the basic steps of a simple GA. GA [13] is described in Algorithm 2.

Algorithm 2: Genetic algorithm (GA) [13].

Input:

F: Fitness function for GA.

$c\_rate$ : Crossover rate for GA.

$m\_rate$ : Mutation rate for GA.

$max\_iter$ : The maximum number of iterations.

Output:

$solution$ : The feasible solution.

(1) GA.Initial(F, $c\_rate$ , $m\_rate$ , $max\_iter$ )

(2) $\{c\_sol_{i}\}_{n} \leftarrow$ GA.GenCandidateSolution()

(3) $\{c\_sol_{i}\}_{n} \leftarrow$ GA.Encoder( $\{c\_sol_{i}\}_{n}$ )

(4) Iteration number $iter \leftarrow 0$

(5) repeat

(6) $f_{i} \leftarrow$ F( $c\_sol_{i}$ ) for all $i ⩽ n$

(7) $\{c\_sol_{i}\}_{n} \leftarrow$ GA.Reproduce( $\{c\_sol_{i}\}_{n}$ , $\{f_{i} / \sum f_{i}\}_{n}$ )

(8) $c\_sol_{i}$ , $c\_sol_{k} \leftarrow$ GA.Crossover( $c\_sol_{i}$ , $c\_sol_{k}$ , $c\_rate$ ) for all $i, k ⩽ n$ and $i \neq k$

(9) $c\_sol_{i} \leftarrow$ GA.Mutate( $c\_sol_{i}$ , $m\_rate$ )

(10) $iter \leftarrow iter + 1$

(11) until $\{c\_sol_{i}\}_{n}$ stays constant or the iteration number $iter =$ GA.GetMax_iter()

(12) Output the $c\_sol_{i}$ with the greatest $f_{i}$ as $solution$
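A simple GA of this kind can be sketched in plain Python as follows. The sketch is illustrative only and makes several assumptions that Algorithm 2 leaves open: candidates are scalars in an interval `[lo, hi]` encoded as fixed-length binary strings, reproduction uses roulette-wheel selection with fitness values shifted to be positive, crossover is single-point, and the best individual seen so far is remembered. All names and default values are hypothetical.

```python
import random

def genetic_algorithm(fitness, lo, hi, pop_size=20, n_bits=16,
                      c_rate=0.8, m_rate=0.02, max_iter=60, seed=0):
    """Maximise fitness(x) for x in [lo, hi] with a simple binary-coded GA."""
    rng = random.Random(seed)

    def decode(bits):
        # Map a bit list to a real number in [lo, hi].
        value = int("".join(map(str, bits)), 2)
        return lo + (hi - lo) * value / (2 ** n_bits - 1)

    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = list(max(pop, key=lambda b: fitness(decode(b))))
    for _ in range(max_iter):
        fits = [fitness(decode(b)) for b in pop]
        # Reproduction: roulette wheel; shift so that every weight is positive.
        low = min(fits)
        weights = [f - low + 1e-12 for f in fits]
        pop = [list(b) for b in rng.choices(pop, weights=weights, k=pop_size)]
        # Crossover: single-point exchange between consecutive pairs.
        for a, b in zip(pop[0::2], pop[1::2]):
            if rng.random() < c_rate:
                cut = rng.randrange(1, n_bits)
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Mutation: flip each bit with a small probability.
        for b in pop:
            for j in range(n_bits):
                if rng.random() < m_rate:
                    b[j] ^= 1
        # Remember the best individual seen so far (an addition of this sketch).
        cand = max(pop, key=lambda b: fitness(decode(b)))
        if fitness(decode(cand)) > fitness(decode(best)):
            best = list(cand)
    return decode(best)

# Usage: maximise a concave function whose peak is at x = 2.
x_best = genetic_algorithm(lambda x: -(x - 2.0) ** 2, -5.0, 5.0)
print(x_best)
```

Keeping the best individual is a design choice that guards against losing a good candidate to crossover or mutation; a GA that follows Algorithm 2 literally would instead report the fittest member of the final population.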

4. GS/GA-AP Algorithm

We describe GS/GA-AP algorithm in this section. The fitness function for GA is presented in Section 4.1, which is employed to evaluate the fitness and satisfaction of clustering results led by median preference, golden section preference, and the preference from GA. We further give the details and description of GS/GA-AP algorithm in Section 4.2, where how the golden section preference, the preference from GA, and the proper exemplar configuration are generated is explained.

4.1. Fitness Function

Several relative validity indices can be used to evaluate the effectiveness of clustering results, such as the variance ratio criterion (VRC), the Davies-Bouldin (DB) index, and the silhouette coefficient [44]. The silhouette coefficient [45] is a common index that reflects the compactness within a cluster and the separation between clusters simultaneously, so it evaluates the quality of a clustering algorithm by taking both properties into consideration. Therefore, we select the silhouette coefficient as the fitness function for GA in GS/GA-AP.

Suppose that a data set $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is given and that X is partitioned into a set of k clusters $\{C_{1}, C_{2}, \dots, C_{k}\}$. For $x \in C_{i}$, we first calculate the average distance between x and all the other data points in $C_{i}$, denoted by $a(x)$. Similarly, for each cluster other than $C_{i}$ we calculate the average distance between x and the data points in that cluster, and we denote the minimum of these averages by $b(x)$. The formulae of $a(x)$ and $b(x)$ are as follows:

\begin{matrix} a(x) = \frac{\sum_{x^{'} \in C_{i}, x^{'} \neq x} \mathrm{dist}(x, x^{'})}{|C_{i}| - 1}, \\ b(x) = \min_{C_{j} : 1 ⩽ j ⩽ k, j \neq i} \{\frac{\sum_{x^{'} \in C_{j}} \mathrm{dist}(x, x^{'})}{|C_{j}|}\} . \end{matrix}

(11)

And the silhouette coefficient of x is given by

\begin{matrix} s (x) = \frac{b (x) - a (x)}{\max \{a (x), b (x)\}} . \end{matrix}

(12)

The silhouette coefficient ranges from −1 to 1. $a(x)$ reflects the compactness of the cluster to which data point x belongs: the smaller $a(x)$ is, the more compact the cluster is. $b(x)$ reflects the separation between the cluster that contains x and the other clusters: the larger $b(x)$ is, the better the clusters are separated. The silhouette coefficient thus captures both the compactness within a cluster and the separation between different clusters. The closer the silhouette coefficient of x is to 1, the more compact the cluster containing x is and the further it lies from the other clusters.

To evaluate the quality of a clustering, we calculate the average of the silhouette coefficients of all the data points in the given data set; this average expresses the quality of the clustering result. Because of this property, the average silhouette coefficient is chosen as the fitness function of GA.
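The average silhouette coefficient defined by (11) and (12) can be computed directly from a pairwise distance matrix. The following NumPy sketch is illustrative; the function name `mean_silhouette` is hypothetical, and two conventions are assumptions of this example rather than statements of the text: a data point in a singleton cluster receives a coefficient of 0, and a point for which $\max\{a(x), b(x)\} = 0$ also receives 0.

```python
import numpy as np

def mean_silhouette(D, labels):
    """Average silhouette coefficient from an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = np.zeros(len(labels))
    for i, li in enumerate(labels):
        same = labels == li
        n_same = same.sum()
        if n_same == 1:
            continue  # assumed convention: s(x) = 0 for a singleton cluster
        # a(x): D[i, i] = 0, so x itself adds nothing to the sum.
        a = D[i, same].sum() / (n_same - 1)
        # b(x): smallest average distance to the points of any other cluster.
        b = min(D[i, labels == lj].mean() for lj in clusters if lj != li)
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()

# Usage: four points on a line, first with a natural and then a poor partition.
X = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(X[:, None] - X[None, :])
print(mean_silhouette(D, [0, 0, 1, 1]))   # close to 1: compact, well separated
print(mean_silhouette(D, [0, 1, 0, 1]))   # negative: points sit in the wrong cluster
```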

4.2. Description of GS/GA-AP Algorithm

The GS/GA-AP algorithm is a feasible procedure that we propose to detect the proper global shared preference and exemplar configuration, in order to handle the unsatisfactory results caused by using the median of the similarities as the preference. GS/GA-AP consists of two parts, GS-AP and GA-AP. The median of the similarities and the golden section preference are each used as the input preference for the original AP clustering algorithm, and the fitness of the two clustering results is evaluated with the given fitness function. If the golden section preference improves the clustering result to an acceptable value of the evaluation index, the result is a feasible solution and the algorithm terminates; this part is the GS-AP procedure. Otherwise, the preference and the clustering solution are obtained by a negative feedback learning process in which AP takes a preference trained over a finite number of GA reproduction-crossover-mutation cycles, and the evaluation of the clustering result determines whether GA is activated again to generate a new preference for AP; this part is the GA-AP procedure.

GS/GA-AP is described in Algorithm 3.

Algorithm 3: GS/GA-AP.

Input:

$X = \{x_{1}, x_{2}, \dots, x_{n} \}$ : A data set.

$\mathit{max\_iter}$ : The maximum of iteration number.

$\mathit{con\_iter}$ : The convergence number of iteration.

$\mathit{con\_threshold}$ : The threshold of convergence.

λ : The damping factor.

F: The evaluation function and fitness function for GA.

$\mathit{crossover\_rate}$ : Crossover rate for GA.

$\mathit{mutation\_rate}$ : Mutation rate for GA.

$\mathit{max\_iter\_ga}$ : The maximum number of iteration for GA.

Output:

$\mathit{labels} = [\mathit{Cluster}(x_{1}), \dots, \mathit{Cluster}(x_{n})]$ : The clustering labels where $\mathit{Cluster}(x_{i})$ represents which cluster $x_{i}$ belongs to.

$\mathit{opti\_preference}$ : The proper global shared preference.

(1) AP.Initial( $\mathit{max\_iter}$ , $\mathit{con\_iter}$ , $\mathit{con\_threshold}$ , λ)

(2) GA.Initial(F, $\mathit{crossover\_rate}$ , $\mathit{mutation\_rate}$ , $\mathit{max\_iter\_ga}$ )

(3) $s(i, k) \leftarrow$ Similarity( $x_{i}$ , $x_{k}$ ) for all data points in X

(4) $S_{n \times n} \leftarrow {[s(i, k)]}_{n \times n}$

(5) $\mathit{median\_p} \leftarrow$ Median( $S_{n \times n}$ )

(6) $\mathit{gs\_p} \leftarrow (\sqrt{5} - 1) / 2 \times$ (Min( $S_{n \times n}$ ) + Max( $S_{n \times n}$ ))

(7) $\mathit{res\_original} \leftarrow$ AP.SetPreference( $\mathit{median\_p}$ ).Activate()

(8) $\mathit{res\_gs} \leftarrow$ AP.SetPreference( $\mathit{gs\_p}$ ).Activate()

(9) if $F(\mathit{res\_original}) \ll F(\mathit{res\_gs})$ then

(10) Output( $\mathit{gs\_p}$ , $\mathit{res\_gs}$ )

(11) Algorithm terminated.

(12) else

(13) GA.SearchRange(Min( $S_{n \times n}$ ), Max( $S_{n \times n}$ ))

(14) $p \leftarrow$ Random(Min( $S_{n \times n}$ ), Max( $S_{n \times n}$ ))

(15) $\mathit{res} \leftarrow$ AP.SetPreference($p$).Activate()

(16) $\mathit{iter\_num} \leftarrow 0$

(17) repeat

(18) $\mathit{iter\_num} \leftarrow \mathit{iter\_num} + 1$

(19) $\mathit{fitness} \leftarrow$ GA.F( $\mathit{res}$ )

(20) $p \leftarrow$ GA.Reproduce( $\mathit{fitness}$ , $p$)

(21) $p \leftarrow$ GA.Crossover($p$)

(22) $p \leftarrow$ GA.Mutate($p$)

(23) $\mathit{res} \leftarrow$ AP.SetPreference($p$).Activate()

(24) until $p$ stays constant for $\mathit{con\_iter}$ times or slightly changed less than $\mathit{con\_threshold}$ or $\mathit{iter\_num} =$ GA.GetMaxIter()

(25) Output($p$, $\mathit{res}$ )

(26) Algorithm terminated.

(27) end if
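Steps (3) to (12) of Algorithm 3 can be sketched in Python with scikit-learn as follows. This is a minimal illustration, not the authors' implementation: the helper names `run_ap`, `fitness`, and `gs_phase` are hypothetical, the silhouette coefficient is assumed as F, and the `margin` parameter is an assumed stand-in for the unquantified "≪" test in step (9). The golden section preference follows step (6) as written.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances

GOLDEN = (np.sqrt(5.0) - 1.0) / 2.0  # golden section ratio, about 0.618


def run_ap(S, preference, damping=0.9):
    """Run AP on a precomputed similarity matrix and return the labels."""
    ap = AffinityPropagation(affinity="precomputed", preference=preference,
                             damping=damping, random_state=0)
    return ap.fit(S).labels_


def fitness(X, labels):
    """Silhouette coefficient as F; -1.0 when it is undefined."""
    k = len(set(labels.tolist()))
    if k < 2 or k >= len(labels):
        return -1.0
    return float(silhouette_score(X, labels))


def gs_phase(X, margin=0.1):
    """Compare the median and golden section preferences (margin is hypothetical)."""
    S = -euclidean_distances(X, squared=True)        # steps (3)-(4)
    median_p = float(np.median(S))                   # step (5)
    gs_p = float(GOLDEN * (S.min() + S.max()))       # step (6), as written
    f_original = fitness(X, run_ap(S, median_p))     # step (7)
    f_gs = fitness(X, run_ap(S, gs_p))               # step (8)
    accept_gs = (f_gs - f_original) > margin         # step (9)
    return {"median_p": median_p, "gs_p": gs_p,
            "f_original": f_original, "f_gs": f_gs, "accept_gs": accept_gs}


result = gs_phase(load_iris().data)
print(result)
```

When `accept_gs` is false, the procedure would continue with the GA phase of steps (13) to (25).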

5. Explanation for GS/GA-AP Effectiveness

This section gives a brief, informal explanation, rather than a rigorous proof, of why GS/GA-AP is effective at detecting a proper preference. Section 5.1 restates how a global shared preference governs the cluster number in AP clustering. Section 5.2 argues that GS-AP provides a more appropriate preference than the median or minimum of similarities, and that GA-AP exploits the heuristic nature of GA to accelerate the search for a proper preference for AP.

5.1. Perspective Association between Global Shared Preference and Cluster Numbers

A global shared preference acts as a control knob that governs the cluster number, and as a measure of “confidence” in how suitable each data point is to serve as the exemplar of a cluster. A greater preference encourages more candidate exemplars to gather data points and form clusters, whereas a smaller preference causes some candidate exemplars to yield to other clusters so that fewer exemplars dominate. The cluster number is not directly proportional to the preference, possibly because AP selects exemplars with the max-sum algorithm for inference in a probabilistic graph model, which weakens the numerical precision of the messages passed in the model and focuses only on the suitability of each data point as an exemplar. A brief illustration is presented in [12], and we give an intuitive explanation here. To verify this possibility, we design an experiment on a data set downloaded from the Frey Lab website [46]; its distribution and meta information are shown in Figure 2(a) and Table 1.

Table 1

The meta information of the toy problem data set.

Items	Values
Number of data points	25
Cluster number	3
Dimensionality	2

Figure 2

(a) A toy problem data set downloaded from the Frey Lab website [46]. (b) The clustering result of the toy problem.

An appropriate clustering result containing 3 clusters is shown in Figure 2(b). The exemplars are marked with unique colours, and the data points are linked to their exemplars in the same colour. The original data distribution in Figure 2(a) and the appropriate clustering result in Figure 2(b) are treated as a standard benchmark for testing the effectiveness of the original AP and GS/GA-AP.

AP is run on this benchmark data set with global shared preferences ranging from −100 to −0.1 and with the minus Euclidean distance as the similarity metric. The line chart in Figure 3 shows the resulting association between preference and cluster number and illustrates that the cluster number is not a linear function of the preference. As the preference grows, a wide range of global shared preferences leads to a steady clustering result, until greater preferences cause the cluster number to oscillate and rise rapidly. A long interval of preferences, which contains the minimum of similarities, leads to the proper clustering configuration shown in Figure 2(b), and a clustering configuration is judged valid if the preference for AP falls into this interval. According to expert experience, the minimum of similarities is used as the preference less often than the median because it always leads to fewer clusters; in this experiment, however, the minimum appears to be a better choice than the median of similarities, which falls into the oscillation range.

Figure 3

A line diagram showing the association between global shared preferences and cluster numbers.
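A sweep of this kind can be sketched as follows. This is an illustrative reconstruction, not the original experiment: the 25-point toy data set is replaced by a synthetic stand-in from `make_blobs` (an assumption), and the helper name `sweep_preferences` is hypothetical. The similarity is the negative Euclidean distance and the preference range is −100 to −0.1, as in the text.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances


def sweep_preferences(X, preferences, damping=0.9):
    """Return the number of exemplars AP finds for each global shared preference."""
    S = -euclidean_distances(X)  # negative (unsquared) Euclidean distance
    counts = []
    for p in preferences:
        ap = AffinityPropagation(affinity="precomputed", preference=float(p),
                                 damping=damping, random_state=0).fit(S)
        counts.append(len(ap.cluster_centers_indices_))
    return counts


# Synthetic stand-in for the 25-point, 3-cluster toy problem (assumption).
X, _ = make_blobs(n_samples=25, centers=3, n_features=2, random_state=0)
preferences = np.linspace(-100.0, -0.1, 20)
for p, k in zip(preferences, sweep_preferences(X, preferences)):
    print(f"preference={p:8.2f}  clusters={k}")
```

Plotting the printed pairs gives a chart of the same kind as Figure 3; the exact shape depends on the data set and on whether AP converges at each preference.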

5.2. Effectiveness of GS/GA-AP

This study proposes a novel choice of preference: the golden section value between the maximum and minimum of similarities, which avoids the shortcomings of both the minimum and the median of similarities. The minimum tends to tie the whole data set together, whereas the median may split a data set into scraps. Because the golden section preference falls between the minimum and the median, as shown in Figure 4, it leads to a more appropriate cluster number than either.

Figure 4

A line diagram showing the association between global shared preferences and cluster numbers, where the golden section preference is marked.

When AP is set with the golden section preference, which falls into the optimal preference interval, the cluster number is $3$, equal to the proper cluster number. This is evidence that AP with the golden section preference does not obtain a worse clustering result than AP with the minimum or the median of similarities. Further evidence for the superiority of the golden section preference is presented in Section 6.

The silhouette coefficient, which is affected by the cluster number, reflects the quality of the clusters; its line chart is shown in Figure 5. The chart shows similarly that AP with the golden section preference obtains clusters of no worse quality than AP with the common minimum or median of similarities: its silhouette coefficient is $0.437849$, equal to that obtained with the minimum of similarities.

Figure 5

A line diagram showing the association between global shared preferences and silhouette coefficients, where the golden section point is marked using a green point.

GA searches the interval between the minimum and maximum of similarities as the preference space and returns a proper preference, using the silhouette coefficient of the clustering result as fitness, so as to optimise the data point assignments of AP. GA chooses a proper preference despite negligible local optima, and its heuristic nature keeps the search time tolerable. Figure 6 shows that the GA-based preference leads to an appropriate assignment of data points, which is further evidence of the effectiveness of GA-AP.

Figure 6

A line diagram showing the association between global shared preferences and cluster numbers, where the golden section preference and GA-based preference are marked using green and red points.
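The GA phase can be sketched with a small hand-rolled, real-coded genetic algorithm. This is only an illustration under stated assumptions: it does not use Pyevolve, the function names `ap_fitness` and `ga_search` are hypothetical, and the population size, arithmetic crossover, uniform-reset mutation, and elitism are assumptions that the text does not specify. The search range, ranking selection, silhouette fitness, and the 80% and 2% rates follow the description in this paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances


def ap_fitness(X, S, preference, damping=0.9):
    """Silhouette coefficient of the AP result for one preference (-1.0 if undefined)."""
    labels = AffinityPropagation(affinity="precomputed", preference=float(preference),
                                 damping=damping, random_state=0).fit(S).labels_
    k = len(set(labels.tolist()))
    if k < 2 or k >= len(labels):
        return -1.0
    return float(silhouette_score(X, labels))


def ga_search(X, pop_size=8, generations=5, crossover_rate=0.8,
              mutation_rate=0.02, seed=0):
    """Search [min(S), max(S)] for the preference with the best silhouette."""
    rng = np.random.default_rng(seed)
    S = -euclidean_distances(X, squared=True)
    lo, hi = float(S.min()), float(S.max())
    pop = rng.uniform(lo, hi, size=pop_size)
    best_p, best_f = None, -np.inf
    for _ in range(generations):
        fit = np.array([ap_fitness(X, S, p) for p in pop])
        i = int(fit.argmax())
        if fit[i] > best_f:
            best_p, best_f = float(pop[i]), float(fit[i])
        # Ranking selection: selection probability grows linearly with rank.
        ranks = fit.argsort().argsort() + 1
        parents = rng.choice(pop, size=pop_size, p=ranks / ranks.sum())
        # Arithmetic crossover between consecutive parents.
        children = parents.copy()
        for a in range(0, pop_size - 1, 2):
            if rng.random() < crossover_rate:
                w = rng.random()
                children[a] = w * parents[a] + (1 - w) * parents[a + 1]
                children[a + 1] = w * parents[a + 1] + (1 - w) * parents[a]
        # Mutation: resample a gene uniformly inside the search range.
        mask = rng.random(pop_size) < mutation_rate
        children[mask] = rng.uniform(lo, hi, size=int(mask.sum()))
        children[0] = best_p  # elitism (an added assumption)
        pop = children
    return best_p, best_f


X, _ = make_blobs(n_samples=60, centers=[(2, 2), (-2, -2)], random_state=0)
best_p, best_f = ga_search(X)
print(f"GA-based preference: {best_p:.3f}, silhouette: {best_f:.3f}")
```

The loop stops after a fixed number of generations for brevity; the stopping rule of Algorithm 3, which also checks whether the preference has stopped changing, could be added in the same place.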

The silhouette coefficient line chart is shown in Figure 7, where the golden section preference and the proper GA-based preference are marked with green and red points. The silhouette coefficients at these two preferences are not smaller than that at the minimum preference, which means that the GA-based preference is capable of obtaining a proper clustering result.

Figure 7

A line diagram showing the association between global shared preferences and silhouette coefficient, where the golden section preference and GA-based preference are marked using green and red points.

An interesting observation is that the proper GA-based preference is much closer to the golden section preference than to the minimum of similarities, and slightly greater than the golden section preference. We speculate that the GA-based preference, which lies between the GS preference and the median of similarities, yields a better balance between compactness and separation than the suboptimal GS preference. The proximity of the GA-based preference to the golden section preference is not a coincidence; it supports the view that the golden section preference is close to the proper one. Figures 6 and 7 further illustrate the validity of GA-AP, and together these observations can be treated as evidence for the effectiveness of GS/GA-AP.

6. Experiments and Results

GS/GA-AP is implemented with scikit-learn [47], an open-source machine learning package implemented in Python, and Pyevolve [48], a Python genetic algorithm framework. Comparison experiments validate the effectiveness of GS/GA-AP on a randomly generated test data set and on standard benchmark data sets from the UCI machine learning repository [49]. For brevity, we use the following abbreviations unless stated otherwise: “original AP” denotes AP with the median of similarities as the default preference, GS-AP denotes AP with the golden section preference, and GA-AP denotes AP optimised by GA. The key default parameters for AP and GA are as follows: the similarity measure is the minus squared Euclidean distance, and the damping factor is $0.9$. GA encodes solutions as real numbers and searches for proper global shared preferences in the interval between the minimum and maximum of the similarity matrix. The default selector, mutation rate, crossover rate, and fitness function are ranking selection, $2\%$, $80\%$, and the silhouette coefficient, respectively.

6.1. Randomly Generated Data Set Test

A two-dimensional random data set containing two hundred data points is generated with the function make_blobs from scikit-learn [47]. The function is given $(2, 2)$ and $(-2, -2)$ as the two cluster centres and generates random blobs of data points around them. Data points are compact within each cluster, and the data set is clearly separated into two clusters by an obvious gap. The distribution of the data set is shown in Figure 8(a), and the clustering results of the original AP, GS-AP, and GA-AP are shown in Figures 8(b), 8(c), and 8(d), all of which are visualized with Matplotlib [50].

Figure 8

(a) Test data set. (b) The clustering result of original AP. (c) The clustering result of GS-AP. (d) The clustering result of GA-AP.
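A data set of this form, and a run of the original AP on it, can be sketched as follows. The value of `cluster_std` and the random seeds are assumptions, since the text does not report them, so the cluster number printed here need not match Figure 8(b). In scikit-learn, the default `affinity="euclidean"` uses the negative squared Euclidean distance and an unspecified preference defaults to the median of the input similarities, which corresponds to the original AP setting.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two hundred 2-D points around the centres (2, 2) and (-2, -2).
# cluster_std=0.5 is an assumed value chosen to give compact, well-separated blobs.
X, y_true = make_blobs(n_samples=200, centers=[(2, 2), (-2, -2)],
                       cluster_std=0.5, random_state=0)

# Original AP: median preference (scikit-learn default) and damping factor 0.9.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
n_clusters = len(ap.cluster_centers_indices_)

print("clusters found by AP with the median preference:", n_clusters)
print("silhouette of the generating two-blob partition:",
      round(silhouette_score(X, y_true), 3))
```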

According to Figures 8(b), 8(c), and 8(d), the cluster numbers of the original AP, GS-AP, and GA-AP are $5$, $4$, and $2$, respectively. GA-AP therefore achieves the proper cluster assignment: the GA-AP phase properly detects the data point distribution, the connection within each class, and the separation between classes. GS-AP shows a defect in that two nearby subsets that should have been merged into one class remain separate. Nevertheless, the cluster number of GS-AP is closer to the actual cluster number of the original data set than that of the original AP, which splits one class into three subsets. The cluster numbers thus directly reflect the performance of the original AP, GS-AP, and GA-AP, and GS/GA-AP outperforms the original AP on the randomly generated data set.

Table 2 lists the cluster numbers (Cluster_number) and silhouette coefficients (Silhouette_coefficient), which further verify the effectiveness of the algorithms. The silhouette coefficient reflects the compactness within each cluster and the separation among clusters and serves as an internal index of clustering quality. The cluster number indicates the distribution of data points and serves as an external property. GA-AP exceeds GS-AP and the original AP in average silhouette coefficient, which approaches the proper value in theory, and its cluster number equals the actual number of clusters; both results validate the effectiveness of GS/GA-AP.

Table 2

The experiment result of AP and GS/GA-AP.

Algorithm	Cluster_number	Silhouette_coefficient
Original AP	5	0.45
GS-AP	4	0.436
GA-AP	2	0.937
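To make the silhouette values in Table 2 concrete, the coefficient can be computed by hand from its standard definition [45]: for each point, $a$ is the mean distance to the other points of its own cluster, $b$ is the smallest mean distance to the points of any other cluster, and the score is $(b - a) / \max(a, b)$. The sketch below is a plain illustration with a hypothetical helper name, `silhouette_by_hand`, and a tiny made-up data set; it is compared against `silhouette_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import euclidean_distances


def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = euclidean_distances(X)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = D[i, same].sum() / (n_same - 1)  # D[i, i] is 0, so self is excluded
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return float(np.mean(scores))


# Tiny made-up example: two pairs of points and one isolated point.
X_demo = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
labels_demo = np.array([0, 0, 1, 1, 2])
print(silhouette_by_hand(X_demo, labels_demo))
print(silhouette_score(X_demo, labels_demo))
```

Values close to 1 indicate compact, well-separated clusters, which is why this coefficient is used as the fitness function for GA.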

6.2. Standard Benchmark Data Set Test

A second experiment evaluates the GS/GA-AP algorithm on eight real-world data sets from the UCI machine learning repository [49], used as standard benchmark data sets, to provide evidence of its performance and effectiveness. Table 3 lists the data set names (Data sets), instance numbers (#Instances), feature numbers (#Features), and cluster numbers (#Clusters) of these data sets after data preprocessing, which includes filling missing values, normalizing, and calculating the minus squared Euclidean distances.

Table 3

Information of the standard benchmark data sets from UCI.

Data sets	Iris	Wine	Glass	Wholesale	SPECTF	Ionosphere	Ecoli	Synthetic-control
#Instances	150	178	214	440	80	351	336	600
#Features	4	13	9	6	44	34	8	60
#Clusters	3	3	6	3	2	2	8	6

Table 3 shows the meta information of these data sets as standard references. The actual cluster number reflects the inner distribution of and connections between data points and is a vital reference for evaluating the original AP and GS/GA-AP. The time complexity of clustering and of the similarity calculation, which is influenced by the instance and feature numbers, is not discussed in this study. Note that the Ionosphere data set was collected by a system in Goose Bay, Labrador, so the experiment on it may indicate the effectiveness of the algorithm in sensor applications.

The implementations of GS/GA-AP and the original AP are initialized with the aforementioned default parameters. The experiment results of GS-AP, GA-AP, and the original AP are shown in Table 4, including the global shared preferences (p), cluster numbers (#clu), and silhouette coefficients (sil).
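The preprocessing described for the benchmark data sets can be sketched as follows. The concrete choices are assumptions, because the text names the steps but not the methods: mean imputation stands in for filling missing values and min-max scaling stands in for normalizing. The helper name `preprocess` is hypothetical, and the Wine data set is loaded from scikit-learn rather than from the UCI repository.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import MinMaxScaler


def preprocess(X_raw):
    """Fill missing values, normalize, and build the similarity matrix."""
    X = SimpleImputer(strategy="mean").fit_transform(X_raw)  # assumed imputation
    X = MinMaxScaler().fit_transform(X)                      # assumed normalization
    S = -euclidean_distances(X, squared=True)                # minus squared Euclidean
    return X, S


X, S = preprocess(load_wine().data)
print("instances, features:", X.shape)
print("min / median / max similarity:",
      float(S.min()), float(np.median(S)), float(S.max()))
```

The minimum, median, and maximum printed at the end are the quantities from which the original and golden section preferences are derived.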

Table 4

The result of original AP, GS-AP and GA-AP clustering.

Data sets	#ac_clu	Original AP			GS-AP			GA-AP
Data sets	#ac_clu	p	#clu	sil	p	#clu	sil	p	#clu	sil
Iris	3	−0.220	7	0.478	−1.035	3	0.765	−1.642	3	0.765
Wine	3	−0.567	13	0.220	−1.486	7	0.220	−2.386	4	0.295
Glass	6	−0.223	17	0.592	−1.579	6	0.603	−0.195	3	0.744
Wholesale	3	−0.039	37	0.324	−1.766	4	0.485	−1.065	5	0.505
SPECTF	2	−0.516	16	0.067	−4.336	2	0.108	−3.241	4	0.132
Ionosphere	2	−17.196	41	0.273	−58.713	8	0.190	−92.472	5	0.355
Ecoli	8	−0.340	17	0.298	−1.202	7	0.404	−0.638	11	0.421
Synthetic-control	6	−0.299	42	0.067	−1.383	11	0.105	−12.431	10	0.375

Table 4 shows the global shared preferences (p), cluster numbers (#clu), and silhouette coefficients (sil) of the original AP, GS-AP, and GA-AP, together with the actual cluster numbers of all data sets (#ac_clu). GS-AP and GA-AP yield cluster numbers closer to the actual ones than the original AP on every data set, and GA-AP attains the highest silhouette coefficient on every data set; GS-AP matches or exceeds the silhouette coefficient of the original AP on all data sets except Ionosphere. The proper global shared preferences appear at the golden section value of the similarity interval or at a smaller value, which demonstrates the effectiveness of GS/GA-AP. The cluster numbers and silhouette coefficients are charted in Figures 9 and 10 with Highcharts [51] to further analyse the clustering behaviour of the original AP and GS/GA-AP.

Figure 9

The cluster number in standard benchmark data sets.

Figure 10

Silhouette coefficient.

The actual cluster numbers can be treated as references that represent the real distribution of data points in the original data sets, although some data sets are better suited to training and classification than to clustering. Clustering algorithms may identify different distributions and divide data sets in distinctive ways, which is why the actual cluster numbers are regarded as references rather than as an evaluation index in the experiment. The mapping relationship between preference and cluster number for the original AP has been presented in Section 5, where we explain why the authors of AP gave only a simple intuition [12] and chose the median as the default [10]. In fact, the instability of cluster numbers was already reported in [12], where a greater preference value may cause the cluster number to oscillate and produce more scraps of subsets. The preference value of GS-AP is substantially smaller than the median of similarities and effectively helps control the generation of scraps. GA-AP, with GA optimization, succeeds in finding a proper preference for AP and approaches the most appropriate clustering solution, and is robust with the default input. GS/GA-AP outperforms the original AP in terms of cluster numbers.

The silhouette coefficient reflects the compactness within a cluster and the separation between different clusters and, by its definition, evaluates the quality of a clustering by considering both together. GA-AP tends to discover an appropriate separation that maximises the silhouette coefficient, whereas GS-AP is weaker in this respect, as it yields lower silhouette coefficients than the original AP in some cases. The improvement in silhouette coefficient depends on the data set: higher improvements come from data sets on which the original AP already attains a high silhouette coefficient rather than a low one, since all these AP-based algorithms perform well on the same data sets. Overall, GS/GA-AP achieves a substantial improvement over the original AP and improves the clustering results. For three of the eight data sets, GS-AP alone already reaches a satisfactory evaluation index and a feasible solution, so GS/GA-AP can, with a certain probability, save time by skipping the GA phase.

7. Conclusion

We propose a new feasible procedure called the GS/GA-AP algorithm, which uses the GS value in place of the default median preference and then, if the cluster configuration generated by the GS value is not satisfactory, deploys a genetic algorithm to detect a proper preference for an appropriate clustering solution. The original AP and GS/GA-AP are executed on one simulation data set and eight standard benchmark data sets to evaluate the effectiveness of GS/GA-AP against the original AP. The experiment results show that GS/GA-AP captures substantially more proper and effective cluster configurations than the original AP on the nine data sets, which means that GS/GA-AP heuristically improves the original AP clustering algorithm.

GS/GA-AP concentrates on offline data processing: it separates the data set into several clusters autonomously, discovers the inner connections of the data set, in which data points are similar within a cluster and dissimilar between clusters, and characterizes each data point by its exemplar, so that the clustering result carries more knowledge for further processing. AP takes global similarity optimization into account to identify the exemplar configuration, and the golden section preference replaces the default median preference to enhance the clustering result. GA then conducts a heuristic search of the preference for AP if the clustering result induced by the golden section preference is not satisfactory or not evidently improved. Together, these steps lead to more proper clustering results than the original AP clustering algorithm.

However, the runtime of GS/GA-AP remains an open problem that has not been discussed: the time complexity of the original AP is $O(n^{3})$, so it runs much slower than other clustering methods based on greedy algorithms. In future work, a new approach is needed to reduce the time complexity of GS/GA-AP; distributed systems such as Hadoop or Spark can be considered for real-time processing of online data, so that convergence is accelerated while the same accuracy and effectiveness are guaranteed.

Footnotes

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research is sponsored by National Natural Science Foundation of China (nos. 61171014, 61371185, 61401029, 61472044, 61472403, and 61571049), the Fundamental Research Funds for the Central Universities (nos. 2014KJJCB32, 2013NT57), China Postdoctoral Science Foundation (2016M591109), and the BNU Graduate Students' Platform for Innovation & Entrepreneurship Training Program (no. 1601121E1) and by SRF for ROCS, SEM.

References

1. Sun, Yan, Zhang, Xia, Wang, Bie, Tian. Organizing and querying the big sensing data with event-linked network in the internet of things. International Journal of Distributed Sensor Networks, vol. 2014, Article ID 218521, 11 pages, 2014. DOI: 10.1155/2014/218521. Scopus: 2-s2.0-84928035926.

2. Sun, Bie, Zhang. Semantic relation computing theory and its application. Journal of Network and Computer Applications, vol. 59, pp. 219-229, 2016. DOI: 10.1016/j.jnca.2014.09.017.

3. Sun, Yan, Bie, Zhou. Constructing the web of events from raw data in the web of things. Mobile Information Systems, vol. 10, no. 1, pp. 105-125, 2014. DOI: 10.3233/mis-130173. Scopus: 2-s2.0-84892944449.

4. Sun, Jara A. J. An extensible and active semantic model of information organizing for the Internet of Things. Personal and Ubiquitous Computing, vol. 18, no. 8, pp. 1821-1833, 2014. DOI: 10.1007/s00779-014-0786-z. Scopus: 2-s2.0-84921066508.

5. Guo, Wang, Cai. Real time clustering of sensory data in wireless sensor networks. Proceedings of the IEEE 28th International Performance Computing and Communications Conference (IPCCC '09), Scottsdale, Ariz, USA, December 2009, pp. 33-40. DOI: 10.1109/pccc.2009.5403841. Scopus: 2-s2.0-77951158417.

6. Han, Kamber, Pei. Cluster analysis. In Data Mining: Concepts and Techniques, Elsevier, New York, NY, USA, 2006, pp. 443-444.

7. Cai, Shi, Salavatipour M. R., Goebel, Lin. Using gene clustering to identify discriminatory genes with higher classification accuracy. Proceedings of the 6th IEEE Symposium on BioInformatics and BioEngineering (BIBE '06), Arlington, Va, USA, October 2006, pp. 235-242. DOI: 10.1109/bibe.2006.253340. Scopus: 2-s2.0-34547444899.

8. Cai, Heydari, Lin. Clustering binary oligonucleotide fingerprint vectors for DNA clone classification analysis. Journal of Combinatorial Optimization, vol. 9, no. 2, pp. 199-211, 2005. DOI: 10.1007/s10878-005-6857-3. MR2138132. Scopus: 2-s2.0-24144473587.

9. Cai, Goebel, Salavatipour M. R., Shi, Lin. Selecting genes with dissimilar discrimination strength for sample class prediction. Proceedings of the 5th Asia-Pacific Bioinformatics Conference (APBC '07), World Scientific, Hong Kong, January 2007, pp. 81-90.

10. Frey B. J., Dueck. Clustering by passing messages between data points. Science, vol. 315, no. 5814, pp. 972-976, 2007. DOI: 10.1126/science.1136800. MR2292174. Scopus: 2-s2.0-33847172327.

11. Frey B. J., Dueck. Supporting online material for clustering by passing messages between data points. Science, vol. 315, pp. 972-976, 2007.

12. Dueck. Affinity propagation: clustering data by passing messages [Ph.D. thesis]. Graduate Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada, 2009.

13. Holland J. H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. The University of Michigan Press, Ann Arbor, Mich, USA, 1975. MR0441393.

14. Frey B. J., Dueck. Mixture modeling by affinity propagation. Neural Information Processing Systems, vol. 18, pp. 379-386, 2006.

15. Dueck, Frey B. J. Non-metric affinity propagation for unsupervised image categorization. Proceedings of the IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007, pp. 1-8. DOI: 10.1109/iccv.2007.4408853. Scopus: 2-s2.0-50649119439.

16. Dueck, Frey B. J., Jojic. Constructing treatment portfolios using affinity propagation. In Vingron, Wong (eds.), Research in Computational Molecular Biology, Lecture Notes in Computer Science, vol. 4955, Springer, New York, NY, USA, 2008, pp. 360-371. DOI: 10.1007/978-3-540-78839-3_31.

17. Givoni I. E., Frey B. J. A binary variable model for affinity propagation. Neural Computation, vol. 21, no. 6, pp. 1589-1600, 2009. DOI: 10.1162/neco.2009.05-08-785. MR2527796. ZBL1183.68476. Scopus: 2-s2.0-67651015658.

18. Zhang, Wang, Nørvåg, Sebag. K-AP: generating specified K clusters by efficient affinity propagation. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), IEEE, Sydney, Australia, December 2010, pp. 1187-1192. DOI: 10.1109/icdm.2010.107. Scopus: 2-s2.0-79951743719.

19. Tang, Zhu, Yang. A Poisson-based adaptive affinity propagation clustering for SAGE data. Computational Biology and Chemistry, vol. 34, no. 1, pp. 63-70, 2010. DOI: 10.1016/j.compbiolchem.2009.11.001. MR2593208. Scopus: 2-s2.0-75149195253.

20. Givoni, Chung, Frey B. J. Hierarchical affinity propagation. http://arxiv.org/abs/1202.3722

21. Wang C.-D., Lai J.-H., Suen C. Y., Zhu J.-Y. Multi-exemplar affinity propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2223-2237, 2013. DOI: 10.1109/TPAMI.2013.28. Scopus: 2-s2.0-84880876970.

22. Wang, Zhang, Guo. Adaptive affinity propagation clustering. Acta Automatica Sinica, vol. 33, no. 12, pp. 1242-1246, 2007.

23. Chen, Wang, Bai, Meng. An adaptive affinity propagation document clustering. Proceedings of the 7th International Conference on Informatics and Systems (INFOS '10), Cairo, Egypt, March 2010, pp. 1-7. Scopus: 2-s2.0-77953159047.

24. Sheng, Liu. Adaptive affinity propagation with spectral angle mapper for semi-supervised hyperspectral band selection. Applied Optics, vol. 51, no. 14, pp. 2656-2663, 2012. DOI: 10.1364/AO.51.002656. Scopus: 2-s2.0-84861373812.

25. Chen D.-W., Sheng J.-Q., Chen J.-J., Wang C.-D. Stability-based preference selection in affinity propagation. Neural Computing and Applications, vol. 25, no. 7-8, pp. 1809-1822, 2014. DOI: 10.1007/s00521-014-1671-4. Scopus: 2-s2.0-84921068418.

26. Miao, Wang, Lin. Optimized recognition with few instances based on semantic distance. The Visual Computer, vol. 31, no. 4, pp. 367-375, 2015. DOI: 10.1007/s00371-014-0931-8. Scopus: 2-s2.0-84924852370.

27. Miao, Wang, Chen, Zhou. Image completion with multi-image based on entropy reduction. Neurocomputing, vol. 159, no. 1, pp. 157-171, 2015. DOI: 10.1016/j.neucom.2014.12.088. Scopus: 2-s2.0-84933279576.

28. Wang X.-H., Qin, Zhang X.-P. Automatically affinity propagation clustering using particle swarm. Journal of Computers, vol. 5, no. 11, pp. 1731-1738, 2010. DOI: 10.4304/jcp.5.11.1731-1738. Scopus: 2-s2.0-78651561170.

29. Wang X.-H., Zhang X.-P., Zhuang C.-X., Chen Z.-N., Qin. Automatically determining the number of affinity propagation clustering using particle swarm. Proceedings of the 5th IEEE Conference on Industrial Electronics and Applications (ICIEA '10), Taichung, Taiwan, June 2010, pp. 1526-1530. DOI: 10.1109/iciea.2010.5514680. Scopus: 2-s2.0-77956024858.

30. Zhong, Zheng, Shen, Zhou. Search the optimal preference of affinity propagation algorithm. Proceedings of the 5th International Conference on Intelligent Computation Technology and Automation (ICICTA '12), IEEE, Zhangjiajie, China, January 2012, pp. 304-307. DOI: 10.1109/icicta.2012.83. Scopus: 2-s2.0-84863243799.

31. You, Zhao, Qian. Energy consumption monitoring of the steam pipe network based on affinity propagation clustering. Proceedings of the 10th World Congress on Intelligent Control and Automation (WCICA '12), IEEE, Beijing, China, July 2012, pp. 3364-3368. DOI: 10.1109/wcica.2012.6358455. Scopus: 2-s2.0-84872296289.

32. Yadav A. Singh, Gupta, Garg, Swami. A soft computing optimization based two ware-house inventory model for deteriorating items with shortages using genetic algorithm. International Journal of Computer Applications, vol. 126, no. 13, pp. 7-16, 2015. DOI: 10.5120/ijca2015905886.

33. Gutierrez J. B., Frith, Nakai. A genetic algorithm for motif finding based on statistical significance. In Ortuño, Rojas (eds.), Bioinformatics and Biomedical Engineering, Lecture Notes in Computer Science, vol. 9043, Springer, New York, NY, USA, 2015, pp. 438-449. DOI: 10.1007/978-3-319-16483-0_43.

34. Noble, Kurata, Asharif M. R. Joint tactical air request processing via genetic algorithm. Proceedings of the 1st Asian Conference on Defence Technology (ACDT '15), Hua Hin, Thailand, April 2015, pp. 88-91. DOI: 10.1109/acdt.2015.7111590. Scopus: 2-s2.0-84938152869.

35. Liew A. W.-C. Genetic algorithm based detection of general linear biclusters. Proceedings of the 13th International Conference on Machine Learning and Cybernetics (ICMLC '14), IEEE, Lanzhou, China, July 2014, pp. 550-555. DOI: 10.1109/icmlc.2014.7009667. Scopus: 2-s2.0-84921478365.

36. Jain A. K., Murty M. N., Flynn P. J. Data clustering: a review. ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999. DOI: 10.1145/331499.331504. Scopus: 2-s2.0-84893405732.

37. Maulik, Bandyopadhyay. Genetic algorithm-based clustering technique. Pattern Recognition, vol. 33, no. 9, pp. 1455-1465, 2000. DOI: 10.1016/S0031-3203(99)00137-5. Scopus: 2-s2.0-0033715579.

38. Pettey C. B., Leuze M. R., Grefenstette J. J. A parallel genetic algorithm. Proceedings of the 2nd International Conference on Genetic Algorithms and Their Application, Cambridge, Mass, USA, July 1987, pp. 155-161.

39. Jin, Vecchiola, Buyya. MRPGA: an extension of MapReduce for parallelizing Genetic Algorithms. Proceedings of the 4th IEEE International Conference on eScience (eScience '08), Indianapolis, Ind, USA, December 2008, pp. 214-221. DOI: 10.1109/escience.2008.78. Scopus: 2-s2.0-62749166510.

40. Dean, Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008. DOI: 10.1145/1327452.1327492. Scopus: 2-s2.0-37549003336.

41. Verma, Llorà, Goldberg D. E., Campbell R. H. Scaling genetic algorithms using MapReduce. Proceedings of the 9th International Conference on Intelligent Systems Design and Applications (ISDA '09), Pisa, Italy, December 2009, pp. 13-18. DOI: 10.1109/isda.2009.181. Scopus: 2-s2.0-77949580645.

42. Di Geronimo, Ferrucci, Murolo, Sarro. A parallel genetic algorithm based on hadoop MapReduce for the automatic generation of junit test suites. Proceedings of the 5th IEEE International Conference on Software Testing, Verification and Validation (ICST '12), IEEE, Montreal, Canada, April 2012, pp. 785-793. DOI: 10.1109/icst.2012.177. Scopus: 2-s2.0-84862319393.

43. Kschischang F. R., Frey B. J., Loeliger H.-A. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498-519, 2001. DOI: 10.1109/18.910572. MR1820474. Scopus: 2-s2.0-0035246564.

44. Hruschka E. R., Campello R. J. G. B., Freitas A. A., De Carvalho A. C. P. L. F. A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 39, no. 2, pp. 133-155, 2009. DOI: 10.1109/tsmcc.2008.2007252. Scopus: 2-s2.0-63049111403.

45. Rousseeuw P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987. DOI: 10.1016/0377-0427(87)90125-7. Scopus: 2-s2.0-0023453329.

46. Frey lab, probabilistic and statistical inference group, http://www.psi.toronto.edu/affinitypropagation/webapp/

47. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, Duchesnay. Scikit-learn: machine learning in python. Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

48. Pyevolve, http://pyevolve.sourceforge.net/index.html

49. Lichman. UCI Machine Learning Repository. University of California, School of Information and Computer Sciences, Irvine, Calif, USA, 2013. http://archive.ics.uci.edu/ml

50. Hunter J. D. Matplotlib: a 2D graphics environment. Computing in Science and Engineering, vol. 9, no. 3, pp. 90-95, 2007. DOI: 10.1109/mcse.2007.55. Scopus: 2-s2.0-34247493236.

51. Highcharts, http://www.highcharts.com/