Abstract
With the development and popularization of the Internet, research on complex networks has grown rapidly. Studies show that complex networks exhibit community structure, and finding this structure helps to extract useful information from a network, so community detection has become a research hotspot in recent years. Two notable problems remain in detecting communities. Firstly, detection accuracy is often not very high; secondly, the usual assessment criteria are not very effective when the real communities are unknown. This paper proposes an algorithm for community detection based on hierarchical clustering (CDHC Algorithm). CDHC Algorithm first creates initial communities from global central nodes, then expands the initial communities layer by layer according to the link strength between nodes and communities, and finally merges very small communities into large ones. This paper also proposes the concept of extensive modularity, which overcomes some weaknesses of modularity and better evaluates the effectiveness of community detection algorithms. We verify the advantage of extensive modularity through experiments and compare CDHC Algorithm with several representative community detection algorithms on frequently used datasets, so as to verify its effectiveness and advantages.
1. Introduction
Complex networks are networks with large numbers of nodes and connections, such as the Internet, citation networks, and social networks [1–3]. With the Internet's development and popularization, research on complex networks has grown in recent years. Studies show that complex networks exhibit community structure [4]: connections within a community are dense, while connections between different communities are sparse [5]. A community generally consists of nodes with similar properties, so finding communities helps to mine useful information from complex networks, for example, identifying a group of people with common interests. How to detect communities in complex networks has become a research hotspot in recent years.
Newman and Girvan [6] put forward the concept of modularity to measure the quality of a community partition. The modularity Q is a number less than 1; the bigger Q is, the better the partition is considered to be. Modularity has become the most widely used assessment criterion for community detection when the real communities are unknown, for two reasons: firstly, it is very convenient to compute; secondly, in most cases a larger modularity appears to indicate a better result. Many algorithms [7–9] detect communities by maximizing modularity. Since it is impossible to evaluate the modularity of every possible partition, such algorithms usually adopt a greedy strategy to find a partition that approximately maximizes modularity.
As modularity has become more popular, its effectiveness has been called into question. Fortunato and Barthélemy [10] showed that modularity-based algorithms cannot detect communities smaller than an inherent scale depending on the number of edges in the network, which Fortunato called the resolution limit of modularity. We find by experiment that a larger modularity does not necessarily mean a better partition. Maximizing modularity tends to detect more communities than really exist, because the upper bound of modularity grows with the number of detected communities. To overcome this problem, this paper proposes the concept of extensive modularity on the basis of modularity. The upper bound of extensive modularity is independent of the number of detected communities, and we find that the partition maximizing extensive modularity is better than the one maximizing modularity.
Many community detection algorithms now exist, each with advantages and disadvantages. Algorithms based on graph theory appeared early; their main idea is to divide the network into k parts of almost equal size such that connections within a part are dense and connections between parts are sparse. Kernighan–Lin Algorithm [11] and spectral bisection based on the eigenvalues of the Laplacian [12] are representative algorithms based on graph theory. Generally, the number of communities is hard to know in advance, but graph-theoretic algorithms require it as input, so their accuracy is often low. Algorithms based on hierarchical clustering are now common. This kind of algorithm decomposes the dataset hierarchically according to a certain rule until some stopping condition is satisfied, and such algorithms can be divided into two classes: agglomerative algorithms and divisive algorithms. GN Algorithm [6] is a representative divisive algorithm, while CNM Algorithm [13] and Newman Fast Algorithm [14] are representative agglomerative algorithms. GN and CNM can usually obtain large modularity values, but their accuracy is not necessarily high. LCV Algorithm [15] detects communities by finding local central vertices, and EACH Algorithm [16] detects communities by calculating the edge antitriangle centrality with an isolated-vertex handling strategy; both are hierarchical clustering algorithms. Other kinds of algorithms also exist.
The rest of this paper is organized as follows: Section 2 introduces some definitions, including central nodes, nodes' similarity, link strength between a node and a community, modularity, and extensive modularity; Section 3 describes the detailed steps of the proposed algorithm; Section 4 presents the experimental results; the last section concludes the paper.
2. Definitions for Community Detection
The Notation of Main Symbols list gives the meanings of the main symbols used throughout this paper.
2.1. Central Node
Central nodes include global central nodes and local central nodes. In a network, the k nodes with maximal degree are called its global central nodes, where the value of k can be chosen according to our needs. If a node's degree is bigger than the degrees of all its adjacent nodes, we call it a local central node.
Global central nodes play an important role in the network and are likely to be important nodes of communities, so they help to find communities.
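As a concrete illustration, the two definitions above can be sketched in Python; the edge-list input format and function names are ours, not the paper's:

```python
from collections import defaultdict

def central_nodes(edges, k):
    """Return the k global central nodes (largest degree) and all local
    central nodes (degree strictly greater than every neighbour's)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    deg = {v: len(ns) for v, ns in adj.items()}
    # the k nodes with maximal degree are the global central nodes
    global_central = sorted(deg, key=deg.get, reverse=True)[:k]
    # a node whose degree exceeds that of all its neighbours is local central
    local_central = [v for v in adj if all(deg[v] > deg[u] for u in adj[v])]
    return global_central, local_central
```

For a star graph, the hub is both the single global central node (for k = 1) and the only local central node.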
2.2. Nodes’ Similarity
We define the similarity of node a and node b as
Paper [16] defines the similarity of node a and node b as

Network 1.

Network 2.
2.3. Link Strength between a Node and a Community
We define the link strength between node i and community
2.4. Modularity Q
The modularity Q is defined, following Newman and Girvan [6], as

Q = Σ_c [ l_c / m − ( d_c / 2m )² ],

where the sum runs over communities, m is the number of edges in the network, l_c is the number of edges inside community c, and d_c is the total degree of the nodes in community c.
Q is a number less than 1; the bigger Q is, the better the partition is considered to be. Q is generally between 0.3 and 0.7 for networks with good community structure, and it rarely exceeds 0.7 in real situations.
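Since the paper's display equation is not reproduced here, the sketch below computes Q in the standard Newman–Girvan form [6], which we take to be the definition intended; the data format and names are ours:

```python
def modularity(edges, community):
    """Newman-Girvan modularity: Q = sum_c (l_c/m - (d_c/(2m))**2),
    where l_c is the number of edges inside community c, d_c the total
    degree of its nodes, and m the number of edges in the network.
    `community` maps each node to its community label."""
    m = len(edges)
    inner = {}   # l_c per community
    degree = {}  # d_c per community
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            inner[ca] = inner.get(ca, 0) + 1
    return sum(inner.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)
```

Putting the whole network into one community always yields Q = 0, which is why Q rewards partitions whose internal edge fraction exceeds the random expectation.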
2.5. Extensive Modularity
We define extensive modularity
Through (6), we can know that
3. Algorithm Description
A city has one or several centers and expands layer by layer around them. The closer a layer is to the center, the denser its connections are; the outer layers connect mainly to themselves and to the inner layers. Inspired by this hierarchical structure of cities, we consider that a community should have one or several global central nodes and should be expanded layer by layer around them. The nodes in layer p are connected mainly to nodes in layer p and layer p − 1.
We suppose the network is connected; that is, any node can be reached from any other node through some sequence of edges.
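This connectivity assumption can be checked with a breadth-first search from any node; the following minimal sketch (function name and input format are ours) does exactly that:

```python
from collections import defaultdict, deque

def is_connected(edges):
    """Return True if every node of the edge list can be reached from
    any other node, using breadth-first search from an arbitrary node."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v] - seen:
            seen.add(u)
            queue.append(u)
    return len(seen) == len(adj)
```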
3.1. Initialize Communities
The process of initializing communities can be divided into three steps, which are sorting nodes, choosing global central nodes, and merging global central nodes into communities.
Step 1.
Sort all the nodes by degree in descending order.
Step 2.
Choose k nodes with maximal degree as global central nodes of the network. The value of k is determined by the following equations:
The idea is to choose a small number of global central nodes, so as to reduce the amount of calculation, while ensuring that the global central nodes include the important nodes of each community. If only a small number of nodes have large degree and the degrees of most nodes are small, we generally have
Step 3.
Initialize the first community: assign the node with maximal degree to it, and mark that node as the first community's central node. For each node v of the remaining
Algorithm 1 shows the process of initializing communities.
sort nodes by degree in descending order;
choose k global central nodes;
initialize community C[1]; nc = 1; // nc: number of communities
assign the node with maximal degree to C[1] and mark it as C[1]'s central node;
for each remaining global central node v:
    calculate sim(v, central node of C[i]) for each community C[i];
    mark the maximal sim and its pos;
    if the maximal sim is below the threshold η:
        nc++; initialize C[nc]; assign v to C[nc] and mark it as C[nc]'s central node;
    else:
        assign v to C[max_pos];
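Under one reading of Algorithm 1, the process can be sketched as follows. Note that the similarity function `sim` and the exact threshold rule are assumptions on our part, since the defining equations of Section 2.2 are not reproduced above; `sim` is therefore passed in as a parameter:

```python
def init_communities(adj, k, eta, sim):
    """Sketch of Algorithm 1. `adj` maps each node to its neighbour set,
    `k` is the number of global central nodes, `eta` the similarity
    threshold, and `sim(a, b)` the paper's node-similarity function.
    Each community is a list whose first member is its central node."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    centrals = order[:k]                 # global central nodes
    communities = [[centrals[0]]]        # first community and its central node
    for v in centrals[1:]:
        # compare v with the central node (first member) of each community
        scores = [sim(v, c[0]) for c in communities]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= eta:
            communities[best].append(v)  # similar enough: join it
        else:
            communities.append([v])      # otherwise start a new community
    return communities
```

As a toy example, a Jaccard-style similarity on neighbour sets (our placeholder, not the paper's definition) groups the central nodes of two disjoint triangles into separate communities.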
3.2. Expand Communities
After initializing communities, we need to expand them. The process of expanding communities includes marking nodes' levels and calculating link strength.
Step 1.
Mark all the global central nodes of the network as the first level. If a node v is connected to a node of the first level and its level is not yet marked, mark its level as two. In general, if a node v is connected to a node of level p and its level is not yet marked, mark its level as p + 1.
Step 2.
Nodes of the first level have already been assigned to communities. For each node v of level two, we calculate the link strength between v and each community C, choose the community maximizing the link strength, and assign v to it. By analogy, for each node v of level p, we repeat the same operation until the nodes of every level have been assigned to communities.
Algorithm 2 shows the process of expanding communities.
// mark levels by breadth-first search from the global central nodes
for each node i popped from the stack:
    for each unmarked neighbour j of i:
        lev[j] = lev[i] + 1; add j to stack;
sort nodes by level in ascending order;
for each unassigned node v in this order:
    calculate link[v][C[i]] for each community C[i];
    mark the maximal link and its pos;
    assign v to C[max_pos];
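The two steps above might be sketched as follows. The link-strength function `link` is passed in as a parameter, since its defining equation (Section 2.3) is not reproduced above; the test below substitutes a simple edge count as a placeholder:

```python
from collections import deque

def expand_communities(adj, communities, link):
    """Sketch of Algorithm 2. Marks levels by BFS from the initial
    communities (their members are level 1), then assigns the remaining
    nodes, in ascending order of level, to the community maximizing
    the link strength link(v, members)."""
    member_of, level = {}, {}
    queue = deque()
    for i, comm in enumerate(communities):
        for v in comm:
            member_of[v], level[v] = i, 1
            queue.append(v)
    while queue:                          # breadth-first level marking
        v = queue.popleft()
        for u in adj[v]:
            if u not in level:
                level[u] = level[v] + 1
                queue.append(u)
    for v in sorted(level, key=level.get):  # ascending level order
        if v in member_of:
            continue
        scores = [link(v, comm) for comm in communities]
        best = max(range(len(scores)), key=scores.__getitem__)
        member_of[v] = best
        communities[best].append(v)       # community grows as we assign
    return communities
```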
3.3. Merge Small Communities
After the previous two processes, every node has been assigned to a community. Some communities may be very small, and we need to merge them into the other communities. This process includes determining small communities and calculating numbers of common nodes.
Step 1.
If a community's size is smaller than t, we call it a small community; otherwise, we call it a large community. The value of t is determined by the following equation:
Step 2.
When merging communities, we do not merge a whole community into another one, because not all of its nodes are strongly connected to the same community. Instead, we merge small communities node by node into large communities. For each node v in a small community, calculate the number of common nodes between v's adjacent nodes and each large community; if v's adjacent nodes have the most common nodes with community C, reassign v to C.
Algorithm 3 shows the process of merging small communities.
determine small communities and large communities;
for each node v in small communities:
    calculate the number of common nodes between adj(v) and each large community C[i];
    mark the pos of the largest number;
    reassign v to C[max_pos];
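This step can be sketched directly, since it only counts common nodes between a node's neighbourhood and each large community; the input format and names are ours:

```python
def merge_small(adj, communities, t):
    """Sketch of Algorithm 3. Communities with fewer than t nodes are
    dissolved node by node: each of their nodes moves to the large
    community sharing the most nodes with its neighbourhood adj(v)."""
    large = [c for c in communities if len(c) >= t]
    small = [c for c in communities if len(c) < t]
    for comm in small:
        for v in comm:
            # number of common nodes of adj(v) and each large community
            overlap = [len(adj[v] & set(c)) for c in large]
            best = max(range(len(overlap)), key=overlap.__getitem__)
            large[best].append(v)
    return large
```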
3.4. Choose the Best Result
Given a network, we detect its communities through the previous three processes. In the process of initializing communities, we use a similarity threshold η, and different thresholds lead to different results. The smaller η is, the more likely two global central nodes are assigned to the same community, so there tend to be fewer communities. We need to choose an appropriate value of η so as to detect the real number of communities and obtain the best result.
For the threshold η, we take ten evenly spaced values from 0.1 to 1.0 and repeat the previous three processes for each value, obtaining ten partitions of the network. Generally, the real communities are unknown, so we must choose the best result through an assessment criterion. Modularity is currently the most widely used criterion; however, we find that the partition maximizing extensive modularity is usually better than the one maximizing modularity and conforms better to the real situation. So we choose the partition with the maximal value of extensive modularity as the best result.
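The selection loop can be sketched as follows; here `detect` stands for the three processes run at a given threshold and `score` for the extensive-modularity function, both treated as black boxes since their details are given elsewhere in the paper:

```python
def best_partition(detect, score):
    """Sketch of Section 3.4: run the three processes for ten evenly
    spaced thresholds eta in 0.1..1.0 and keep the partition with the
    largest score (extensive modularity in the paper)."""
    best_part, best_score = None, float("-inf")
    for i in range(1, 11):
        eta = i / 10
        part = detect(eta)          # initialize, expand, merge
        s = score(part)
        if s > best_score:
            best_part, best_score = part, s
    return best_part, best_score
```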
Now we analyze the time complexity of CDHC. The first process costs
4. Experiments
Firstly, we choose three of the most widely used datasets: Zachary Karate Club [17], Dolphin Network [18], and American College Football [19]. Secondly, we introduce some assessment criteria used when the real communities are known. Thirdly, we verify the advantage of extensive modularity on the three datasets. Lastly, we choose some classical or recent algorithms and compare them with CDHC on the three datasets.
4.1. Datasets
4.1.1. Zachary Karate Club
It is a social network of friendships between 34 members of a karate club at a US university in the 1970s. A factional dispute led to a formal separation of the club into two organizations of nearly equal size.
4.1.2. Dolphin Network
It is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand; the dolphins form two groups.
4.1.3. American College Football
It is a network of American football games between Division IA colleges during regular season fall 2000. It contains 115 teams, which are divided into 12 conferences, each containing 8 to 12 teams.
Table 1 is a summary of the three datasets’ basic information, including NOC (number of communities), number of nodes, and number of edges.
Dataset information.
4.2. Assessment Criteria
Given a network G,
4.2.1. Precision [20]
Precision is defined as
4.2.2. Recall [20]
Recall is defined as
4.2.3. F-Measure [21]
F-measure is defined as
4.2.4. NMI [22] (Normalized Mutual Information)
NMI is defined as
Based on our discussion of the four criteria, we take F-measure and NMI to evaluate the results of community detection.
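As the paper's equation for NMI is not reproduced above, the sketch below uses a widely used normalization, 2·I(A;B)/(H(A)+H(B)); the paper's exact form may differ. Partitions are given as per-node label lists:

```python
from collections import Counter
from math import log

def nmi(part_a, part_b):
    """Normalized mutual information between two partitions of the same
    node set, in the 2*I(A;B)/(H(A)+H(B)) form. NMI is 1 when the
    partitions are identical (up to label renaming) and tends to 0 when
    they are independent."""
    n = len(part_a)
    assert len(part_b) == n
    ca, cb = Counter(part_a), Counter(part_b)
    joint = Counter(zip(part_a, part_b))
    mi = sum(nij / n * log(nij * n / (ca[a] * cb[b]))
             for (a, b), nij in joint.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha + hb == 0:          # both partitions trivial
        return 1.0
    return 2 * mi / (ha + hb)
```

Because NMI is label-invariant, swapping community labels between the two partitions does not change the score.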
4.3. Experimental Results on Extensive Modularity
For the threshold η, we take ten evenly spaced values from 0.1 to 1.0 and obtain ten different partitions on each of the datasets Zachary, Dolphin, and Football. For each partition, we report NOC (number of communities), modularity Q, extensive modularity,
Table 2 shows the ten experimental results under different thresholds η on the dataset Zachary. Judged by modularity, the best result would be the case when the threshold is 0.6, where modularity reaches its maximal value of 0.373. In this case, however, the detected communities differ greatly from the real communities: the detected NOC is 3 while the real NOC is 2, and the values of F-measure and NMI are not very high. Judged by extensive modularity, the best result is the case when the threshold is 0.2, where extensive modularity reaches its maximal value of 0.744. In this case, F-measure and NMI are both 1, which means the detected communities are exactly the same as the real communities.
Experimental results under different threshold η on the dataset Zachary.
Table 3 shows the ten experimental results under different thresholds η on the dataset Dolphin. Judged by modularity, the best result would be the case when the threshold is 0.3, where modularity reaches its maximal value of 0.477. In this case, the detected communities again differ greatly from the real communities: the detected NOC is 4 while the real NOC is 2, and the values of F-measure and NMI are very low. Judged by extensive modularity, the best result is the case when the threshold is 0.2, where extensive modularity reaches its maximal value of 0.770. In this case, F-measure and NMI are very close to 1, which means the detected communities are very close to the real communities. Although maximizing extensive modularity does not find the very best result here, it comes very close and is far better than maximizing modularity.
Experimental results under different threshold η on the dataset Dolphin.
Table 4 shows the ten experimental results under different thresholds η on the dataset Football. Judged by either modularity or extensive modularity, the best result is the case when the threshold is 0.5. In this case, the detected NOC is 11, which is very close to the real NOC of 12. Maximizing modularity or extensive modularity does not find the very best result here, but the gap is only slight.
Experimental results under different threshold η on the dataset Football.
From the discussion of the experimental results on the three datasets, the partition maximizing extensive modularity is usually the best result or very close to it, while the partition maximizing modularity is usually not very good. We therefore consider extensive modularity a better criterion than modularity.
4.4. Experimental Results on Comparison of Algorithms
As discussed above, extensive modularity is a better criterion than modularity, so we choose the partition maximizing extensive modularity as the final result. We have chosen four algorithms for comparison: GN Algorithm [6], CNM Algorithm [13], EACH Algorithm [16], and LCV Algorithm [15].
Table 5 shows the experimental results of the five algorithms on the dataset Zachary. GN and CNM obtain large modularity values, but they detect more communities than really exist, and their NMI values are very low. EACH, LCV, and CDHC detect the real communities exactly, so they are far better than GN and CNM on Zachary.
Experimental results of algorithms on the dataset Zachary.
Table 6 shows the experimental results of the five algorithms on the dataset Dolphin. GN, CNM, and EACH obtain large modularity values, but they detect more communities than really exist, and their NMI values are very low. LCV and CDHC obtain large extensive modularity values, and the communities they detect are very close to the real communities. LCV and CDHC are far better than GN, CNM, and EACH on Dolphin; furthermore, CDHC is slightly better than LCV.
Experimental results of algorithms on the dataset Dolphin.
Table 7 shows the experimental results of the five algorithms on the dataset Football. The modularity values of the five algorithms differ little. The NOC detected by GN and CNM is far less than the real NOC of 12. GN obtains the largest extensive modularity, but its NMI is relatively low; CDHC obtains the second largest extensive modularity, and its NMI is very close to the best. On the whole, EACH and CDHC perform better on Football.
Experimental results of algorithms on the dataset Football.
From the experimental results on the three datasets, we find that GN and CNM usually obtain large modularity, yet the communities they detect usually differ greatly from the real ones; a large NOC makes it easy to obtain a large modularity. CDHC usually obtains large extensive modularity, and the communities it detects are very close to the real communities. On the whole, CDHC is far better than GN and CNM and slightly better than EACH and LCV.
Figures 3, 4, and 5 show the communities that CDHC has detected on Zachary, Dolphin, and Football. We can see that the community structure is evident.

Communities detected on Zachary.

Communities detected on Dolphin.

Communities detected on Football.
5. Conclusion
This paper proposes the concept of extensive modularity on the basis of modularity. The upper bound of extensive modularity is independent of the number of detected communities, so it overcomes the weakness that maximizing modularity tends to detect more communities than really exist. We verify through experiments that extensive modularity is a better criterion than modularity when the real communities are unknown.
This paper also proposes CDHC Algorithm, which includes four processes: initializing communities, expanding communities, merging small communities, and choosing the best result. We compared CDHC with four algorithms on three datasets (Zachary, Dolphin, and Football) and found that CDHC is far better than GN and CNM and slightly better than LCV and EACH. According to F-measure and NMI, the communities detected by CDHC are very close to the real communities.
Notation of Main Symbols
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The work is supported by National Natural Science Foundation of China (no. 61402028) and Science Foundation of Shenzhen City in China (JCYJ20140509150917445).
