Abstract
The analysis of fuzzy(overlapping) community structure in complex networks is an important problem in data mining of network data sets. However, due to the exist of random factors and error edges in real networks, how to measure the significance of community structure efficiently is a crucial question. In this paper, we present a novel statistical framework comparing the significance of fuzzy community structure across various optimization models. Different from the universal approaches, we calculate the similarity between a given node and its leader and employ the distribution of link tightness to derive the significance score, instead of a direct comparison to a randomized model. Based on the distribution of community tightness, a new “p-value” form significance measure is proposed for community structure analysis. Specially, the well-known approaches and their corresponding quality functions are unified to a novel general formulation, which facilitate providing a detail comparison across them. To determine the position of leaders and their corresponding followers, an efficient algorithm is proposed based on the spectral theory. Finally, we apply the significance analysis to some famous benchmark networks and the good performance verified the effectiveness and efficiency of our framework.
Keywords
Introduction
A common feature observed in real networks is the presence of community structures [1–5, 18], i.e. subgraphs which are densely connected to each other while less connected to the subgraphs outside. In many scenarios, nodes in a network can belong to more than one community, called fuzzy(overlapping) communities [2–9]. In order to estimate how much a decomposition of a network which is found by a community detection algorithm is meaningful, we need a quality measure. Consequently, for a particular measure, the community detection algorithms can be ranked. To this end, various measures have been proposed in the literature, so far. The most prevalent measure which has been used extensively in the literature is due to Newman and Girvan [18]. This measure, called modularity, quantifies how much the density of the edges inside identified communities differs from the expected edge density in an equivalent network with similar number of vertices and edges but randomized edge placement, which is taken as the null model for statistical tests. Considering the modularity measure, the community detection problem is transformed to the modularity maximization problem. Modularity function can naturally extended to fuzzy form, which used to detect overlapping communities.
Recently, some optimization algorithms based on Potts models which used to detect community structure have attracted attention. Communities correspond to Potts model spin states, and the associated system energy indicates the quality of a candidate partition. It models an inhomogeneous ferromagnetic system where each node is viewed as a labeled spin in the network. Let A be the adjacency matrix of graph G and let σ
i
denote the label of the community that node i belongs to. Furthermore, the Kronecker Delta function is defined by δ (σ
i
, σ
j
) =1 if σ
i
= σ
j
and δ (σ
i
, σ
j
) =0, otherwise. Having the community membership labels σ, Reichardt and Bornholdt (RB) [16] proposed a generalized Hamiltonian as the core energy function,
Label propagation is another famous algorithm for community detection [27]. Briefly, the algorithm starts with randomly assigning a community label to each node. Then, each node updates its label by replacing it by the label most used by its neighbors. The other well-known optimization approaches used in community detection problem are Simulated Annealing (SA) [25], extermal optimization (DA) [13], expectation maximization [20], Bayesian inference [17], and variational Bayes [15]. For a comprehensive and comparative review on this topic we refer the reader to [4].
Although a lot of optimization method and their functions are proposed, How to determine the hidden properties of a given community [9] effectively remain unclearly answered. To answers these crucial questions, in this paper, we present a novel statistical framework comparing the significance of soft community structure across various optimization methods. Different from the universal approaches, we calculate the similarity of a given node to its leader and employ the distribution of link tightness to derive the significance score, instead of a direct comparison to a randomized model. A small example is shown in Fig. 1a, which illustrates that tighter the following nodes link to its leader, more significant the community is. Based on the distribution of community tightness, a new “p-value” form significance measure is proposed for community structure analysis. Specially, the well-known approaches and their corresponding quality functions are unified to a novel general fuzzy formulation, to provide a detail comparison across them. Then, we can choose the most suitable form of the function by set the parameters properly. To determine the position of leaders and their corresponding followers, an efficient fuzzy detection algorithm is proposed based on the spectral theory. Finally, we apply the significance analysis to some famous benchmark networks and the good performance verified the effectiveness and efficiency of our framework. The detailed procedure can be observed in Fig. 1b.
Leader-driven algorithms [11, 12] constitute a special case of seed-centric approaches. These methods show that, in many real world, especial the social networks, nodes of a network are usually classified into two categories: leaders and followers. For example, considering the famous Karate network [28], nodes 1 and 33 are two significant leaders and corresponding communities are built around them (see Fig. 2a). If two leaders are removed, these communities will be split up, as they link to most followers and keep the community together. Since community are consequence of information spreading, a given community can be defined as the area in which a leader has most influence. So, one can uncover the community partition by finding all natural leaders and their corresponding followers on which they influence. We believe if followers are more tightly linked to the leader, or leader spreads more influence on their followers, this community are more significant or robust. When we use a given optimization method to evolve the community configure, the significance of communities also evolves correspondingly, which shown in Fig. 2b.
The fuzzy community detection algorithm based on leader position
In this study, the relative positions of leader and corresponding followers are crucial to analyze the significance situation. In order to obtain the leader of corresponding community, we extract the candidate fuzzy membership by minimizing the following objective function
Suppose K is the upper bound of number of clusters and A = (a ij ) n×n is the adjacent matrix of a network, then, the detailed algorithm is shown in Algorithm 1 and stated straightforwardly as follows:
Step 1: for a given K
(i) Calculate the diagonal matrix D = (d ii ), where d ii = ∑ k a ik .
(ii) Computing the top K eigenvectors based on generalized eigensystem Ax = tDx, and then establish the eigenvector matrix E K = [e1, e2, . . . , e K ].
Step 2: for each number of communities 2 ≤ k ≤ K:
(i) Establish the matrix E K = [e2, e3, . . . , e K ] from the matrix E K .
(ii) Normalize the rows of E K to unit length using Euclidean distance norm.
(iii) Cluster the row vectors of E K using any community detection method by minimizing Equation.(2) to obtain a membership matrix X k and corresponding leaders.
Step 3: Maximizing the modular function: Pick the optimal number of communities k and the corresponding partition X k that maximizes Q (X k ).
In step 1, given the adjacent matrix A = (a ij ) n×n and a diagonal matrix D = (d ii ), d ii = ∑ k a ik , two matrices D-1/2AD-1/2 and D-1A are used. This is motivated by Ref. [7], which uses the top K eigenvectors of the generalized eigensystem Ax = tDx instead of the K eigenvectors of the adjacent matrix. It shows that after normalizing the rows using Euclidean norm, their eigenvectors are mathematically identical and emphasize that this is a numerically more stable method. In step 2, we choose the initial the starting centers to be as orthogonal as possible which already used in k-means clustering method [6, 26]. This way of choosing centers(leaders) does not cost additional time complexity, and also improve the quality of the partition, thus at the same time reduces the need for restarting the random initialization process. In step 3, the Q function measures the quality of a given community structure organization of a network and can be used to automatically select the optimal number of communities k according to the maximum Q value [26], we will discuss the multiple optimization methods and their corresponding Q function in detail in the following section.
For a given number of communities K
Calculate the top K eigenvector matrix E K = [e1, e2, . . . , e K ] and initiate the community membership X (0) = E K .
Update the position of center and corresponding community membership matrix X to minimize the Equation.(2)
Select the optimal number of communities K and corresponding community membership according to the maximum of Q defined in Equation. (3)
For many community detection algorithms, the target function Q is critical. Here, Q can be tried to be optimized has the following general fuzzy form:
(1) Hofman and Wiggins [15]
(5) Modularity [18]
(6) Label propagation [27]
It is essential to establish a detail framework analyzing the significance of community structure, since real networks own specific characteristics [5, 29]. In this section, we discuss these important characteristics and give a detailed introduction of the framework.
Next, using the maximum entropy principle, the statistical unbiased distribution fulfilling constraint can be obtained using the maximum entropy principle:
However, we can’t determine the scoring parameter η easily. Here, the tightness function of Equation.(15) can be simplified as:
If the network is large enough, according to the mean field theory, s i = s (i|z) owns an approximate Gaussian-distribution with variance M, . The distribution of the tightness S can be calculated straightforwardly using the derivation. Specifically, we need to compute the following quality function:
Given a specifical community, we can calculated the significance score F using the probability that the community tightness S, p (S), larger than or equal to S,
In this section, we will test the validity of our framework on some famous benchmark network and real networks.
Moreover, by comparing five algorithms, we find in Fig. 3a that the 〈F〉 values corresponding to Hofman & Wiggins method is largest, and the Label propagation method is the lowest. This may because Label propagation method emphasize the simplicity of calculation too much while ignoring the accuracy of results. Furthermore, the 〈F〉 values between Modularity optimization method and Label propagation method are similar when 〈k out 〉 becomes lower. This result is similar as Ref. [22], which verifies the inner correlation between these two methods. These observations are no evidence of overall superiority of one method over another, but an example of how to compare the significance and use the different partitioning algorithms on a given network.
Furthermore, when 〈k out 〉 increases, the topology becomes fuzzier and the sizes of communities will become more and more smaller correspondingly. At the same time, as the width parameter μ increases, the significance will favor tighter communities with fewer elements. We test the Hofman & Wiggins method and Label propagation method in Fig. 3b, the value of 〈F〉 corresponding to μ = 0.3 will be larger than μ = 0.1 for all two examples. As a conclusion, we argue that when the corresponding 〈F〉 is smaller than 0.3 on average (〈k out 〉≈4), it is not safe to say there exists significant community structure for a given network.
Discussion
In this paper, we present a novel framework comparing the significance of fuzzy community structure revealed by multiple optimization functions. Based on the distribution of community tightness, a new “p-value” form significance measure is proposed for analysis. As part of the future work, it is necessary to take a deeper look into how different similarity measures impact the results of this method. Additionally, this framework can be easily extended to a weighted and directed form, which only needs to modify the formation of the quality function Q. As a conclusion, this method shows a great performance and deserves more attention from us.
Acknowledgments
We are grateful to the anonymous reviewers for their valuable suggestions which are very helpful for improving the manuscript. The authors are separately supported by NSFC grants 71401194, 91324203, 11131009 and “121” Youth Development Fund of CUFE grants QBJ1410.
