Uncovering Research Topics of Academic Communities of Scientific Collaboration Network

Abstract

Uncovering the research topics of academic communities in a scientific collaboration network (SCN) helps improve applications such as recommendation and retrieval in knowledge-based service systems. Previous research mainly focuses on measuring network characteristics and tracking community evolution; how to uncover the research topics of each community remains largely understudied. This paper proposes a nonjoint approach consisting of three simple steps: (1) detect overlapping academic communities in the SCN with the clique percolation method, (2) discover the underlying topics and research interests of each researcher with the author-topic (AT) model, and (3) label each community with the top N most frequent collaborative topics among its members. Extensive experimental results on the NIPS (Neural Information Processing Systems) dataset show that this simple procedure is feasible and efficient.

1. Introduction

Social network (SN) analysis is regarded as a powerful tool for finding the social links and network structure of actors [1–9]. A scientific collaboration network (SCN) is a kind of complex SN of researchers in which a link between two researchers is established if they coauthored one or more scientific papers [10, 11]. It is therefore also called a coauthorship network [10, 12]. Previous studies on SCN [6, 9, 11–16] can be roughly separated into two stages: (1) the first stage mainly focused on how to construct the network and how to measure network characteristics with metrics such as degree distribution, clustering coefficient, and average path length [10, 14, 17]; (2) the second stage paid more attention to network structure analysis, community evolution, and so on [1, 2, 8, 13, 18, 19].

Most real-world networks contain groups in which nodes are more densely connected to each other than to the rest of the network [13]. The sets of such nodes are usually called communities, clusters, cohesive groups, or modules [13, 20]. Like other real-world networks, SNs include many communities based on common location, interests, and occupation, and the SCN, as one kind of SN, is no exception [2, 7, 13, 19, 20]. Depending on whether a node is allowed to be a member of more than one community, communities can be divided into two types: overlapping and nonoverlapping. Most real-world networks are characterized by well-defined statistics of overlapping and nested communities [13].

Although the real motivations for a link in the SCN are still not well understood, two researchers who coauthored papers usually share one or more real-world relationships (such as colleagues, advisor-advisee, classmates, project partners, or friends). Moreover, in order to follow frontier research or borrow ideas from other fields [18], an active researcher might be involved in multiple fields. Intuitively, it is unreasonable to restrict a researcher to only one community. We therefore expect overlapping communities to exist in the SCN as well.

Detecting communities in SNs is increasingly important in modern applications, ranging from bioinformatics and enterprise organization management to bibliometrics [21]. Many approaches have been proposed to detect communities in SNs [22, 23], such as traditional clustering-based methods like k-means, divisive algorithms based on hierarchical clustering, modularity-based algorithms, spectral algorithms, dynamic algorithms, statistical inference-based methods, multiresolution methods, methods for finding overlapping communities, and other miscellaneous methods [23]. However, most existing methods discover only separated sets in networks and ignore the overlapping phenomenon [13, 23].

Identifying network structure is the first step toward understanding how network function and topology affect each other. In a knowledge-based service system, users may be interested not only in the link structure of a network but also in the reason why its members form a community. However, most present methods focus merely on detecting the structure or monitoring the evolution of communities. There is little literature on uncovering the research topics of academic communities or on providing specific information for finding a group of researchers with similar interests. To the best of our knowledge, only Ichise et al. [18] have put forward a method to detect academic communities with topic identification. On closer examination, that method obtains the communities with a word assignment technique and is therefore limited by its reliance on the keywords.

To overcome these problems, this paper proposes a nonjoint approach that integrates a community detection method and the author-topic (AT) model. Specifically, it consists of three simple steps: (1) detect overlapping academic communities in the SCN with the clique percolation method, (2) discover the underlying topics and research interests of each researcher with the AT model, and (3) label each community with the top N most frequent collaborative topics among its members, where the common topics of two researchers are regarded as their collaborative topics.

The remainder of the paper is organized as follows. Section 2 provides related works on community detection models and topic models. Section 3 illustrates the analysis framework in the study and then introduces each unit of the framework. Section 4 describes and discusses experimental results. Finally, the conclusion is made.

2. Related Works

2.1. Community Detection Methods

Community detection is the organization of the nodes of a network into subsets such that nodes within a subset are more densely connected to each other than to nodes in other subsets. From a graph-theoretic perspective, given a graph $G (A, E)$ with a set A of nodes and a set E of edges, community detection classifies the node set A into multiple subsets $C = \{c_{i}\}_{i = 1}^{| C |}$ with $c_{i} \subseteq A$ , such that the nodes belonging to a subset $c_{i}$ are all closely related [24]. Here $| \cdot |$ denotes the number of elements in a set.

Because the number of communities underlying a network is typically unknown in advance and the sizes or densities of communities are often uneven, finding community structure automatically is not trivial. Several community detection approaches have been developed and employed with varying levels of success [25], including the hierarchical clustering algorithm, the Girvan-Newman algorithm [26], the modularity maximization algorithm [27], and clique-based methods [13]. Of these, only clique-based methods can deal with the overlapping phenomenon.

Hierarchical clustering is a simple algorithm that uses a similarity metric between node pairs to group similar nodes into communities. The Girvan-Newman algorithm identifies edges that lie between communities and removes them, leaving behind only the communities themselves. Although Girvan-Newman is available in a number of standard software packages, its time complexity is $\mathcal{O} ({| E |}^{2} \times | A |)$ , making it impractical for networks with more than a few thousand nodes [26]. The modularity maximization algorithm defines a benefit function to measure the quality of a particular division of a network into communities [27] and can be used for large-scale networks. However, because it relies on approximate optimization, it often fails to detect clusters smaller than some scale that depends on the size of the network.

Clique-based methods build up the communities from the cliques in a network [28]. By clique, we mean a complete subgraph of the network that is not part of a larger complete subgraph. The general procedure is first to find the cliques, then to unite the cliques that contain at least a minimum number of nodes into a subgraph of the original network, and finally to define the communities as the components (disconnected parts) of that subgraph [22, 23]. An alternative is to use k-cliques, which are complete subgraphs with k nodes, to construct a line graph known as the clique graph [20, 28]. The clique graph is a hypergraph of the original graph: its nodes are the k-cliques, and its edges record the overlap of the cliques in the original graph. The difference between k-cliques and cliques is that k-cliques may be subsets of larger complete subgraphs. A typical approach based on k-cliques is the clique percolation method (CPM) [13, 29], which defines communities as percolation clusters of k-cliques. The CPM algorithm runs in time $\mathcal{O} (α n^{β \ln | A |})$ , where α and β are constants [30].

2.2. Topic Models

Topic models are a family of statistical models for discovering a mixture of “components” in a collection of documents [31]. In these models, each topic is modeled as a probability distribution over the words in the vocabulary of the corpus, and each document in the corpus is modeled as a mixture of topics given by a multinomial distribution over the topics [32]. An early topic model, probabilistic latent semantic indexing (pLSI), was proposed by Hofmann [33]. While Hofmann's work is a useful step toward probabilistic modeling of text, it is incomplete in that it provides no probabilistic model at the level of documents. To overcome this problem, Blei and his coworkers developed the latent Dirichlet allocation (LDA) model [34]. LDA is similar to pLSI, except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this assumption usually yields more reasonable mixtures of topics in a document. Subsequent topic models, such as the author-topic (AT) model [35], the topic over time (ToT) model [36], the author-topic over time (AToT) model [31, 32, 37], and the conference-author-relation topic (CART) model [38], are generally extensions of LDA.

As a famous topic model, LDA [34] is a generative probabilistic model for collections of discrete data such as text corpora [39]. LDA model is based upon the idea that the probability distribution over words in a document can be expressed as a mixture of topics. It means that each document may be viewed as a mixture of various topics. LDA model can be viewed as a generative process. A document can be generated in following three steps: (1) to sample a mixture proportion from a Dirichlet distribution, (2) to sample a topic index according to the mixture proportion for each word in the document, and (3) to sample a word token from a multinomial distribution over words specific to the sampled topic.
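The three generative steps above can be illustrated with a minimal Python sketch. It is not the inference procedure used in this paper; the toy topics, the symmetric prior value, and the function names are hypothetical and chosen only for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document following the three LDA steps."""
    num_topics = len(topic_word)
    # Step 1: sample the topic proportions of the document from a Dirichlet prior.
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # Step 2: sample a topic index for this word position.
        z = rng.choices(range(num_topics), weights=theta)[0]
        # Step 3: sample a word from the word distribution of that topic.
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical toy topics (not taken from the NIPS results).
toy_topics = [
    {"kernel": 0.5, "margin": 0.3, "vector": 0.2},
    {"speech": 0.6, "speaker": 0.3, "acoustic": 0.1},
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=0.5, length=8, rng=rng)
```

Here `theta` plays the role of the per-document mixture proportion and `doc` is the list of sampled word tokens.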

AT is also a generative model; it extends LDA to include authorship information [40, 41]. It provides a relatively simple probabilistic model for exploring the relationships between authors, documents, topics, and words. In the model, each author is represented by a multinomial distribution over topics and each topic is represented by a multinomial distribution over words. The words in a document with multiple authors are assumed to result from a mixture of each author's topic mixture. The topic-word and author-topic distributions are then learned from the text corpus. Compared with LDA, AT can yield more salient topics and more reasonable patterns of researchers' interests [40]. The AT model has proved to be an effective way to uncover the research interests of each researcher [40, 41].

3. Method

The analysis framework of proposed approach is illustrated in Figure 1. The framework is composed of four parts: to preprocess data, to detect communities in scientific collaboration network, to discover collaborative topics between authors, and to uncover topics of academic communities. We describe each part in detail in the following subsections.

Figure 1

Framework of the method.

3.1. To Preprocess Data

In this part, words and authors are extracted from the collected papers. First, word terms are extracted to build the vocabulary, and all stop-words are eliminated. Then the term frequency and inverse document frequency of each word in the vocabulary are computed for the subsequent author-topic model. Next, author names are extracted, a name disambiguation algorithm is applied to ambiguous names (such as one author with multiple names or multiple authors with the same name), and all author names are normalized to a standard form and assigned unique ID numbers.

3.2. To Detect Communities in SCN

From a “topological” point of view, networks can be divided into four categories: undirected binary networks, directed binary networks, weighted directed networks, and weighted undirected networks [10, 14]. In this part, an SCN is first created as an undirected binary network, in which each node represents an author and each edge represents the coauthorship between the two linked authors. Specifically, if two authors coauthored at least one paper, an edge with unit weight is created. In other words, no matter how many papers two authors coauthored, there is only one edge between them. For example, if $a_{1}$ , $a_{2}$ , and $a_{3}$ coauthored one paper and $a_{1}$ and $a_{2}$ coauthored another paper, three edges are created in the constructed network, namely, $e_{12}$ , $e_{13}$ , and $e_{23}$ .
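The construction rule can be sketched in a few lines of Python using only the standard library. The function name `build_scn` and the representation of a paper as a list of author labels are assumptions made for illustration.

```python
from itertools import combinations

def build_scn(papers):
    """Return the node set and the undirected, unweighted edge set of an SCN."""
    nodes, edges = set(), set()
    for authors in papers:
        nodes.update(authors)
        # Each pair of coauthors gets one edge. A frozenset makes the edge
        # undirected, and the enclosing set ignores repeated collaborations.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add(frozenset((a, b)))
    return nodes, edges

# The example from the text: one paper by a1, a2, a3 and another by a1, a2.
papers = [["a1", "a2", "a3"], ["a1", "a2"]]
nodes, edges = build_scn(papers)
```

For this input, `edges` contains exactly the three pairs corresponding to $e_{12}$ , $e_{13}$ , and $e_{23}$ , even though $a_{1}$ and $a_{2}$ coauthored two papers.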

Then, cliques are extracted from the constructed SCN and communities are detected with the k-clique-community detection algorithm [13]. The community definition in the algorithm is based on the observation that a typical node in a community is linked to many other nodes, yet not necessarily to all other nodes. A k-clique-community is the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques, where adjacency means sharing $k - 1$ nodes. When $k = 2$ , the k-clique communities are equivalent to the connected subgraphs, which are also called components in complex network analysis.
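The definition of a k-clique-community can be made concrete with a brute-force Python sketch for small graphs. It enumerates every k-clique directly and merges adjacent ones with a union-find structure; it is an illustration of the definition under these assumptions, not the optimized algorithm of [13] or the NetworkX implementation, and the toy graph is hypothetical.

```python
from itertools import combinations

def k_clique_communities(nodes, edges, k):
    """Brute-force clique percolation; suitable only for small graphs."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Enumerate all k-cliques, i.e., complete subgraphs on k nodes.
    cliques = [frozenset(c) for c in combinations(sorted(nodes), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Union-find over cliques: two k-cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # A community is the union of all cliques in one percolation cluster.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted((frozenset(g) for g in groups.values()), key=sorted)

# Hypothetical toy graph: two triangles that share node 3.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
communities_k3 = k_clique_communities(toy_nodes, toy_edges, 3)
communities_k2 = k_clique_communities(toy_nodes, toy_edges, 2)
```

With $k = 3$ the two triangles share only one node, so they form two communities that overlap at node 3; with $k = 2$ the whole connected graph becomes a single community.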

3.3. To Discover Collaborative Topics between Authors

Here, the AT model is used to uncover the research interests of each author. The graphical model representation of the AT model is shown in Figure 2. The following notation is used in this study. Let P and W be the sets of papers and unique words in the corpus, respectively. For each $m \in \{1, 2, \dots, | P |\}$ , $W_{m}$ and $a_{m}$ denote all the word tokens and the set of authors of paper m, respectively, and $| W_{m} |$ is the length of paper m. $ϑ_{a}$ is the multinomial distribution over topics specific to author a, and $φ_{l}$ is the multinomial distribution over words specific to topic l. For more detailed descriptions of the AT model, we refer readers to [35, 40, 41].

Figure 2

The graphical model representation of the author-topic model.

In this work, the collapsed Gibbs sampling algorithm, which runs over three periods (initialization, burn-in, and sampling) with L iterations in total, is used for inference on $\{ϑ_{a}\}_{a = 1}^{| A |}$ and $\{φ_{l}\}_{l = 1}^{| T |}$ , where T is the set of topics, since it provides a simple method for obtaining parameter estimates under Dirichlet priors. The time complexity of the AT model is as follows:

$$\mathcal{O} \left( L \times \sum_{m = 1}^{| P |} \left( | W_{m} | \times | a_{m} | \right) \right) . \tag{1}$$

In our paper, each topic is represented with the top 10 words most likely to be generated conditioned on the topic, and the research interest of each author is represented with the top 10 most likely topics. From the results of the AT model, we can build the relation matrix between authors and topics. Each element of the matrix is the association probability between an author and a topic. With the matrix, it is easy to get the collaborative topics between any two authors with coauthorship. Formally, for each $i \in \{1, 2, \dots, | A |\}$ , let $T_{i} = \{t_{i 1}, t_{i 2}, \dots, t_{i 10}\}$ be the research topics of author i. Then the collaborative topics $T_{i j}$ between the authors i and j are defined as the intersection of $T_{i}$ and $T_{j}$ ; that is, $T_{i j} = T_{i} \cap T_{j}$ . For example, if $T_{i} = \{t_{1}, t_{3}, t_{4}, t_{5}, t_{10}, t_{12}, t_{14}, t_{15}, t_{16}, t_{17}\}$ and $T_{j} = \{t_{2}, t_{4}, t_{5}, t_{6}, t_{7}, t_{8}, t_{9}, t_{11}, t_{19}, t_{20}\}$ , then the collaborative topics between the two authors are $\{t_{4}, t_{5}\}$ .
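As a minimal sketch of this step, the following Python snippet selects the most likely topics of an author from one row of the author-topic matrix and intersects the topic sets of two authors. Representing a matrix row as a dictionary from topic ID to probability, and the helper names, are assumptions for illustration.

```python
def top_topics(topic_probs, n=10):
    """Return the IDs of the n most probable topics of one author."""
    ranked = sorted(topic_probs, key=topic_probs.get, reverse=True)
    return set(ranked[:n])

def collaborative_topics(topics_i, topics_j):
    """Collaborative topics T_ij are the intersection of T_i and T_j."""
    return set(topics_i) & set(topics_j)

# The example from the text, with topics written as integer indices.
T_i = {1, 3, 4, 5, 10, 12, 14, 15, 16, 17}
T_j = {2, 4, 5, 6, 7, 8, 9, 11, 19, 20}
T_ij = collaborative_topics(T_i, T_j)
```

For the two sets above, `T_ij` contains only topics 4 and 5.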

3.4. To Uncover Topics of Academic Communities

In Section 3.2, we have detected communities in SCN. In Section 3.3, we have got topics of authors and we have also obtained the collaborative topics between any two collaborated authors in detected communities. Here, we will integrate both results to uncover the topics of academic communities by ranking topics and selecting the most frequently collaborated ones.

For an SCN, $G = (A, E)$ , we denote the set of nodes as $A = \{a_{1}, a_{2}, \dots, a_{| A |}\}$ , the set of edges as $E = \{e_{i j} \mid a_{i} \in A, a_{j} \in A, \text{co-authored} (a_{i}, a_{j})\}$ , the set of topics found as $T = \{t_{1}, t_{2}, \dots, t_{| T |}\}$ , and the set of k-clique communities detected as $C = \{c_{1}, c_{2}, \dots, c_{| C |}\}$ . For each community $c_{m}$ , we can define a subgraph of G, $G_{m} = (A_{m}, E_{m})$ , where $A_{m} = \{a_{i} \mid a_{i} \in c_{m}\}$ and $E_{m} = \{e_{i j} \mid e_{i j} \in E, a_{i} \in c_{m}, a_{j} \in c_{m}\}$ .

In Section 3.3, we have obtained collaborative topics for each edge in $E_{m}$ ; therefore we can compute the collaborative frequency of all topics in the community $c_{m}$ through counting topics by edges according to

$$f (t_{l}) = \sum_{e_{i j} \in E_{m}} δ (t_{l} \in T_{i j}), \quad l = 1, 2, \dots, | T |, \tag{2}$$

where the indicator function $δ (x) = 1$ if x is true and 0 otherwise.

Once we have obtained the collaborative frequencies of all topics in community $c_{m}$ , we rank the topics in descending order of frequency and select the top N topics as the research topics of the community.

To illustrate the process clearly, let us take a simple example in Figure 3 with a community consisting of four nodes. Given the author topics $T_{1} = \{t_{1}, t_{2}, t_{10}, t_{11}, t_{12}, t_{13}, t_{14}, t_{15}, t_{16}, t_{17}\}$ , $T_{2} = \{t_{1}, t_{2}, t_{3}, t_{4}, t_{18}, t_{19}, t_{20}, t_{21}, t_{22}, t_{23}\}$ , $T_{3} = \{t_{1}, t_{4}, t_{5}, t_{24}, t_{25}, t_{26}, t_{27}, t_{28}, t_{29}, t_{30}\}$ , and $T_{4} = \{t_{3}, t_{4}, t_{5}, t_{31}, t_{32}, t_{33}, t_{34}, t_{35}, t_{36}, t_{37}\}$ , the collaborative topics between authors are $T_{12} = \{t_{1}, t_{2}\}$ , $T_{13} = \{t_{1}\}$ , $T_{23} = \{t_{1}, t_{4}\}$ , $T_{24} = \{t_{3}, t_{4}\}$ , and $T_{34} = \{t_{4}, t_{5}\}$ .

Figure 3

An example of community.

Using (2), we can easily get the frequencies of all topics:

$$f (t_{1}) = 3, \quad f (t_{2}) = 1, \quad f (t_{3}) = 1, \quad f (t_{4}) = 3, \quad f (t_{5}) = 1 . \tag{3}$$

Finally, if we rank the frequencies and select the top 2 topics as the research topics of the community, the result is $\{t_{1}, t_{4}\}$ .
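The worked example can be reproduced with a short Python sketch of the counting rule in (2). The edge list is inferred from the author pairs for which collaborative topics are listed in the example, and the tie-breaking by topic index in `top_n` is an assumption, since the method only specifies ranking by frequency at this point.

```python
from collections import Counter

# Author topics of the four-node example, written as integer topic indices.
author_topics = {
    1: {1, 2, 10, 11, 12, 13, 14, 15, 16, 17},
    2: {1, 2, 3, 4, 18, 19, 20, 21, 22, 23},
    3: {1, 4, 5, 24, 25, 26, 27, 28, 29, 30},
    4: {3, 4, 5, 31, 32, 33, 34, 35, 36, 37},
}
# Edges of the community, inferred from the listed pairs T_12, T_13, T_23, T_24, T_34.
community_edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

def topic_frequencies(edges, topics):
    """Count, for each topic, the edges whose endpoints share that topic."""
    freq = Counter()
    for i, j in edges:
        freq.update(topics[i] & topics[j])
    return freq

def top_n(freq, n):
    """Rank by descending frequency; ties are broken by topic index (an assumption)."""
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:n]]

freq = topic_frequencies(community_edges, author_topics)
community_topics = top_n(freq, 2)
```

The resulting counts agree with (3), and `community_topics` contains topics 1 and 4.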

4. Experimental Results and Discussion

4.1. Data

The NIPS proceedings dataset, which consists of the full text of the 13 years of proceedings of the Neural Information Processing Systems (NIPS) conferences from 1987 to 1999 (http://www.cs.toronto.edu/~roweis/data.html), is used to evaluate the performance of the proposed framework. The dataset contains 1,740 research papers and 2,037 unique authors. Because all the author names have already been processed and normalized, we do not need to run a name disambiguation algorithm in the preprocessing step. Based on coauthorship, we count the collaborative number of each pair of coauthors, that is, the number of papers they coauthored. The distribution of the number of author pairs over collaborative numbers is shown in Table 1. The maximum collaborative number is 9, corresponding to the author pair Smola_A (ID: 1475) and Scholkopf_B (ID: 1504).

Table 1

Distribution of the number of author pairs over collaborative numbers in NIPS dataset.

Collaborative numbers	Number of author pairs
1	2707
2	299
3	93
4	18
5	3
6	9
9	1
Total	3130

In addition to downcasing and removing stop-words and numbers, we also remove the words that appear fewer than five times in the corpus. After the preprocessing, the dataset contains 13,649 unique words and 2,301,375 word tokens in total. In our experiments with the AT model, the number of topics is fixed at 100, the symmetric Dirichlet priors α and β are set at 0.5 and 0.1, and Gibbs sampling is run for $L = 2000$ iterations.
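A minimal Python sketch of these preprocessing steps is given below. The tiny stop-word list, the tokenization by alphabetic runs, and the function name are assumptions for illustration; the actual stop-word list and tokenizer used in the experiments are not specified here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "of", "a", "and", "in", "is", "to"}

def preprocess(documents, min_count=5, stop_words=STOP_WORDS):
    """Lowercase, drop numbers and stop-words, then drop rare words."""
    tokenized = []
    for text in documents:
        # Keeping only alphabetic runs removes numbers and punctuation.
        tokens = re.findall(r"[a-z]+", text.lower())
        tokenized.append([t for t in tokens if t not in stop_words])
    counts = Counter(t for doc in tokenized for t in doc)
    # Keep only words that appear at least min_count times in the corpus.
    vocab = {w for w, c in counts.items() if c >= min_count}
    cleaned = [[t for t in doc if t in vocab] for doc in tokenized]
    return cleaned, vocab

toy_docs = ["The kernel of the SVM 1999"] * 5 + ["A rare word"]
cleaned, vocab = preprocess(toy_docs)
```

In this toy corpus only `kernel` and `svm` reach the threshold of five occurrences, so the last document becomes empty after filtering.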

4.2. Scientific Collaboration Network

Based on coauthorship, we construct the scientific collaboration network containing 1897 nodes and 3130 edges. That is to say, there are 140 authors who did not collaborate with any other authors. The constructed network graph is shown in Figure 4. From Figure 4, we find that the NIPS network is composed of a larger subgraph (in the center of the picture) and many smaller subgraphs.

Figure 4

NIPS scientific collaboration network.

4.3. Component Analysis

Using the component analysis approach [42] on the network, we found 235 components in total. That is to say, the network contains 235 separated subgraphs. The numbers of author nodes in the top 10 components are 1061, 37, 27, 22, 19, 15, 11, 10, 10, and 9, respectively. We select the largest one (1061 authors) as the object of analysis in the following experiments. Figure 5 shows the graph of the largest component.
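Component analysis amounts to finding the connected components of the network and sorting them by size. The following standard-library sketch uses breadth-first search on a hypothetical toy graph; it stands in for, and is not, the NetworkX routine used in the experiments.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components as node sets, largest first."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            u = queue.popleft()
            for v in adj[u] - seen:
                seen.add(v)
                component.add(v)
                queue.append(v)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy network with three components.
toy_nodes = [1, 2, 3, 4, 5, 6]
toy_edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(toy_nodes, toy_edges)
largest = components[0]
```

The first element of the returned list plays the role of the largest component selected for the subsequent analysis.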

Figure 5

The largest subgraph of NIPS network.

4.4. Cliques

The k-clique-community detection algorithm in the NetworkX tools [42] is used to discover all cliques in the network. The sizes and numbers of cliques in the network and in the largest component are shown in Table 2. The size of the largest clique in the network is 10, which means that it contains 10 authors, while the size of the largest clique in the largest component is 9.

Table 2

Sizes and numbers of cliques.

	Size of cliques
	2	3	4	5	6	7	8	9	10
Numbers in network	508	348	137	42	26	6	2	3	1
Numbers in the largest component	316	244	94	32	21	4	1	2	0

4.5. Community Analysis

The detected communities depend on the value of the parameter k, where k refers to the size of the cliques. Typically, the value of k is between 3 and 6 [13]. Increasing k makes the communities smaller and more disintegrated but, at the same time, more cohesive [13]. Different values of k give rather different results and thus provide flexibility when offering research community services to users of a knowledge-based system.

The cliques found in Section 4.4 are used to detect communities. The communities detected under different values of k are presented as follows.

(a) $k = 2$ . If $k = 2$ , communities detected will be the largest component of the network shown in Figure 5 (note that we just use the largest component to do the community analysis).

(b) $k = 3$ . Setting $k = 3$ , we obtain 47 communities and 22 overlapping nodes. Due to space constraints, we just list the author IDs of all overlapping nodes: 18, 42, 53, 63, 77, 117, 156, 205, 206, 370, 383, 390, 459, 578, 673, 697, 733, 811, 943, 1039, 1212, and 1276. The corresponding results for all the 47 communities are available from the authors upon request.

(c) $k = 4$ . Setting $k = 4$ , we obtain 18 communities, shown in Table 3, and 6 overlapping nodes: 77, 156, 383, 726, 1475, and 1504, which are highlighted in Table 3. Their names are provided in Table 6. Compared with the overlapping nodes for $k = 3$ , authors 77, 156, and 383 remain overlapping nodes, whereas the other 19 overlapping nodes for $k = 3$ no longer overlap when $k = 4$ ; these three authors may therefore play a more important role in the interaction between communities, or they may have more diverse research interests.

Table 3

Detected communities ( $k = 4$ ). Overlapping nodes are bold.

CNo	Members	Size
1	37, 42, 94, 95, 96, 194, 235, 236, 237, 238, 239, 240, 404, 432, 611, 726, 728, 780, 921, 922, 1017, 1077, 1078	23
2	197, 726, 729, 1473, 1474, 1475, 1504, 1697, 1826, 1827, 1828, 1830, 1952	13
3	63, 156, 376, 430, 555, 719, 961, 1554	8
4	276, 370, 510, 577, 681, 682, 683, 912, 916	9
5	77, 79, 1094, 1219, 1220, 1221, 1552	7
6	156, 418, 1298, 1299, 1556, 1717, 1718, 1719, 1776, 1777, 1778	11
7	300, 880, 989, 990, 1119, 1253	6
8	383, 748, 749, 750, 861	5
9	383, 390, 691, 692, 913, 914	6
10	117, 630, 798, 811, 1475, 1504	6
11	156, 381, 986, 1351, 1395, 1396, 1397, 2009	8
12	116, 784, 785, 786, 933	5
13	77, 78, 313, 460, 461, 462, 536, 942, 944	9
14	40, 205, 399, 507, 508, 686, 687, 688, 689, 690	10
15	41, 44, 970, 971, 972, 973, 1048	7
16	53, 148, 149, 150, 401, 704, 1068	7
17	805, 918, 919, 920, 1239	5
18	2, 3, 179, 374, 642, 643, 945	7

CNo is the abbreviation of community number.

(d) $k = 5$ . Setting $k = 5$ , we obtain 6 communities, shown in Table 4, where the IDs of overlapping nodes are again highlighted. Only one overlapping node is left, the node with author ID 156, which is a member of three communities, namely, communities 4, 5, and 6. In our experiment, this author (Sejnowski_T) is the only node that overlaps for all of $k = 3$ , 4, and 5. He may have more diverse research interests, which would allow him to play a “bridge” role between multiple communities.

Table 4

Detected communities ( $k = 5$ ). Overlapping nodes are bold.

CNo	Members	Size
1	37, 42, 94, 96, 194, 235, 236, 237, 238, 239, 240, 404, 432, 611, 726, 780, 1017	17
2	276, 370, 510, 577, 681, 682, 683, 912, 916	9
3	1475, 1504, 1697, 1826, 1827, 1828, 1830, 1952	8
4	156, 376, 430, 555, 719, 961, 1554	7
5	156, 418, 1298, 1299, 1556, 1717, 1718, 1719	8
6	156, 381, 986, 1351, 1395, 1396, 1397, 2009	8

CNo is the abbreviation of community number.
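Overlapping nodes can be read off the membership lists directly. As a minimal sketch, the following Python snippet copies the six communities of Table 4 and reports every author who appears in more than one of them; the function and variable names are chosen for illustration.

```python
from collections import defaultdict

# Community members for k = 5, copied from Table 4 (keys are CNo values).
communities_k5 = {
    1: [37, 42, 94, 96, 194, 235, 236, 237, 238, 239, 240,
        404, 432, 611, 726, 780, 1017],
    2: [276, 370, 510, 577, 681, 682, 683, 912, 916],
    3: [1475, 1504, 1697, 1826, 1827, 1828, 1830, 1952],
    4: [156, 376, 430, 555, 719, 961, 1554],
    5: [156, 418, 1298, 1299, 1556, 1717, 1718, 1719],
    6: [156, 381, 986, 1351, 1395, 1396, 1397, 2009],
}

def overlapping_nodes(communities):
    """Map each author in more than one community to those community numbers."""
    membership = defaultdict(list)
    for cno, members in sorted(communities.items()):
        for author in set(members):
            membership[author].append(cno)
    return {a: cnos for a, cnos in membership.items() if len(cnos) > 1}

overlaps = overlapping_nodes(communities_k5)
```

For the data of Table 4, the only entry in `overlaps` is author 156, with communities 4, 5, and 6.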

The subgraphs of the 6 communities $(k = 5)$ in NIPS network are shown in Figure 6. In Figure 6, nodes in the same community are painted with the same color and shape except the overlapping node. The overlapping node is painted with red color and circle shape. From Figures 6(a)–6(c), one can see that communities 1 and 3 are connected but they do not have overlapping nodes, community 2 is separated from other communities, and communities 4, 5, and 6 are connected by overlapping node 156.

Figure 6

Subgraph of discovered communities $(k = 5)$ .

4.6. Topics and Collaborative Topics between Authors

By running the AT model on the NIPS dataset, we obtain 100 topics and assign each topic an ID number ranging from 0 to 99. In Table 5, we list some typical topics and the top 10 hot words of each topic with the corresponding probabilities. The listed topics cover several representative NIPS domains, including support vector machine (SVM) and kernel methods, neural network, speech recognition, image and vision, EM and mixture model, and independent component analysis (ICA).

Table 5

Some typical research topics in NIPS.

Topic ID and name	Hot words and probabilities
15 Image and vision	Image	0.09447	System	0.01731
	Images	0.05248	Figure	0.01392
	Feature	0.02905	Pixels	0.01356
	Features	0.02246	Vision	0.01305
	Pixel	0.01761	Scale	0.01215

24 Visual position	Sejnowski	0.02215	Information	0.01027
	Visual	0.02123	Local	0.00924
	Basis	0.01916	Representations	0.00884
	Figure	0.01194	Representation	0.00884
	Position	0.01050	Song	0.00878

26 Speech recognition	Speech	0.06291	Speaker	0.01774
	Recognition	0.04250	Training	0.01395
	System	0.02914	Word	0.01321
	HMM	0.02458	Continuous	0.01158
	Context	0.02144	Acoustic	0.01099

31 Neural network and Boltzmann machine	Hinton	0.01972	Visible	0.00739
	Hidden	0.00874	Weights	0.00696
	Features	0.00797	Distribution	0.00692
	Dayan	0.00785	Single	0.00685
	Recognition	0.00781	Energy	0.00681

33 EEG	Time	0.02030	Brain	0.01217
	EEG	0.02013	Data	0.01083
	Sound	0.01650	Location	0.01053
	Localization	0.01346	Activity	0.00983
	Auditory	0.01287	Components	0.00960

43 Pattern recognition and distance transformation	Distance	0.02642	Transformations	0.00994
	Tangent	0.02514	Set	0.00994
	Pattern	0.01317	Recognition	0.00981
	Patterns	0.01246	Cun	0.00959
	Rate	0.01012	Vectors	0.00932

52 Outlier and noise characteristics	Case	0.02556	Values	0.01432
	Number	0.01751	Results	0.01370
	Large	0.01641	Simple	0.01341
	Random	0.01606	Small	0.01335
	Order	0.01455	General	0.01100

74 Neural network	Network	0.24201	Figure	0.01040
	Neural	0.17782	Artificial	0.00852
	Networks	0.15251	Work	0.00761
	Systems	0.01321	Shown	0.00760
	Paper	0.01106	Information	0.00669

77 SVM and kernel method	Kernel	0.03324	Margin	0.01433
	Support	0.02276	Data	0.01253
	Vector	0.02172	Space	0.01178
	SVM	0.01474	Solution	0.01048
	Set	0.01433	Regression	0.01036

80 Network input, output, and architecture	Input	0.10359	Weights	0.03965
	Output	0.08380	Training	0.03359
	Layer	0.05942	Net	0.03012
	Hidden	0.05890	Architecture	0.02245
	Network	0.04064	Inputs	0.02208

89 Independent component analysis	Information	0.02653	Matrix	0.01698
	Independent	0.02629	Blind	0.01519
	Source	0.01889	Component	0.01384
	Separation	0.01830	Natural	0.01336
	Sources	0.01718	ICA	0.01333

92 EM and mixture model	Data	0.05575	Distribution	0.02129
	Probability	0.03424	Log	0.02112
	Likelihood	0.02770	EM	0.02030
	Mixture	0.02502	Parameters	0.01948
	Density	0.02359	Gaussian	0.01715

97 Performance evaluation	Training	0.06981	Results	0.02668
	Set	0.06098	Number	0.02396
	Data	0.05007	Error	0.01849
	Performance	0.04155	Table	0.01663
	Test	0.03330	Problem	0.01387

Table 6

Author topics of overlapping nodes ( $k = 4$ ).

AID	Author	Topic IDs and probabilities
77	Koch_C	19 (0.18351), 53 (0.09956), 66 (0.08539), 50 (0.06150), 67 (0.06092), 84 (0.05648), 52 (0.04997), 29 (0.02542), 27 (0.02443), 13 (0.02352)
156	Sejnowski_T	24 (0.15232), 27 (0.06548), 66 (0.05343), 12 (0.04476), 33 (0.04087), 89 (0.03909), 52 (0.03130), 92 (0.03060), 15 (0.02934), 67 (0.02900)
383	Kawato_M	58 (0.47440), 96 (0.07180), 12 (0.05519), 74 (0.05178), 91 (0.02970), 26 (0.02424), 56 (0.02287), 80 (0.02082), 52 (0.01764), 42 (0.01240)
726	Vapnik_V	77 (0.26370), 97 (0.09446), 44 (0.06308), 98 (0.05889), 43 (0.05638), 42 (0.04341), 52 (0.03902), 70 (0.03525), 21 (0.02877), 92 (0.02667)
1475	Smola_A	77 (0.33208), 98 (0.10307), 44 (0.08248), 97 (0.05777), 52 (0.05617), 43 (0.05022), 59 (0.03558), 90 (0.02437), 42 (0.02414), 12 (0.01682)
1504	Scholkopf_B	77 (0.45924), 98 (0.05840), 52 (0.05196), 97 (0.05168), 44 (0.04020), 43 (0.03431), 90 (0.02675), 59 (0.02423), 42 (0.01919), 12 (0.01863)

AID is the abbreviation of author ID.

After running the AT model on the NIPS dataset, we also obtain the research topics of each author. We select the 10 most likely topics as the research topics of each author, and each topic has a probability value indicating how strongly the author is related to the topic. Table 6 shows the research topics of the authors corresponding to the overlapping nodes in the communities for $k = 4$ . In the column “Topic IDs and probabilities” of Table 6, the decimal value in parentheses after each topic is the probability specific to the author. For example, “19 (0.18351)” denotes that the probability of topic 19 specific to author Koch_C is 0.18351.

From the authors' research topics, we can easily obtain the collaborative topics between two collaborating authors by finding their common topics. The AT model also gives the probabilities between authors and topics. If two authors collaborated on a topic, we use the smaller of their two probabilities to express how strongly the pair is related to the topic. For example, Vapnik_V (726) and Smola_A (1475) collaborated on topic 77, and their probabilities for this topic are 0.26370 and 0.33208, respectively, so the probability that they collaborated on the topic is 0.26370. In Table 3, we have listed all authors in community 2 $(k = 4)$ . This community contains three overlapping nodes: 726, 1475, and 1504. In Table 6, we have listed the hot topics of these three authors. In this way, we can obtain a symmetric matrix of collaborative topics between them. In each cell except the diagonal cells, the collaborative topics are the intersection of the topics of the two corresponding authors. The collaborative topics between them are reported in Table 7.

Table 7

Collaborative topics between authors 726, 1475, and 1504.

Author ID	726	1475
1475	77 (0.26370), 44 (0.06308), 98 (0.05889), 97 (0.05777), 43 (0.05022), 52 (0.03902), 42 (0.02414)	
1504	77 (0.26370), 98 (0.05840), 97 (0.05168), 44 (0.04020), 52 (0.03902), 43 (0.03431), 42 (0.01919)	77 (0.33208), 98 (0.05840), 52 (0.05196), 97 (0.05168), 44 (0.04020), 43 (0.03431), 90 (0.02437), 59 (0.02423), 42 (0.01919), 12 (0.01682)
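
The rule described above (intersect the two authors' topics and keep the smaller probability) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation; the function name `collaborative_topics` and the dictionary layout are assumptions, while the probabilities are copied from the rows of authors 1475 and 1504 in Table 6.

```python
def collaborative_topics(topics_a, topics_b):
    """Return the common topics of two authors, scored by the smaller probability."""
    shared = {
        tid: min(p, topics_b[tid])
        for tid, p in topics_a.items()
        if tid in topics_b
    }
    # Sort by descending collaborative probability, as in Table 7.
    return sorted(shared.items(), key=lambda item: item[1], reverse=True)


# Top-10 topics of Smola_A (1475) and Scholkopf_B (1504), copied from Table 6.
smola = {77: 0.33208, 98: 0.10307, 44: 0.08248, 97: 0.05777, 52: 0.05617,
         43: 0.05022, 59: 0.03558, 90: 0.02437, 42: 0.02414, 12: 0.01682}
scholkopf = {77: 0.45924, 98: 0.05840, 52: 0.05196, 97: 0.05168, 44: 0.04020,
             43: 0.03431, 90: 0.02675, 59: 0.02423, 42: 0.01919, 12: 0.01863}

pair_topics = collaborative_topics(smola, scholkopf)
for tid, prob in pair_topics:
    print(f"{tid} ({prob:.5f})")
```

With these inputs, the printed list should match the cell for authors 1475 and 1504 in Table 7.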

4.7. Community Topics

In this subsection, we integrate the results from Sections 4.5 and 4.6 to uncover the research topics of communities. Our main idea is to find the most frequent and most probable collaborative topics in each community.

Specifically, for each community, we count all collaborative topics over all edges of its subgraph using (1) and then rank them in descending order of collaborative frequency; if two topics have the same frequency, the topic with the larger probability is ranked first. In the counting process, we take the minimal collaborative probability as the probability of each topic. The result for community 1 ( $k = 5$ ; see Table 4) is shown in Table 8.

Table 8

Collaborative topics of community 1 in Table 4.

TID	Collaborative topic name	Freq.	Probability
43	Pattern recognition and distance transformation	76	0.08673
97	Performance evaluation	72	0.04684
52	Outlier and noise characteristics	68	0.05536
80	Network input, output, and architecture	65	0.02443
86	Character recognition	57	0.04598
74	Neural network	47	0.02801
81	Neuronic network	41	0.04362
70	Supervised learning	24	0.02891
15	Image and vision	18	0.02528
42	Optimization algorithms	14	0.03183
77	SVM and kernel method	11	0.03183
19	Analog circuit	9	0.03957
6	Parallel processing	9	0.03528
12	Rule learning	9	0.02367
75	Network unit and connection weight	4	0.02446
44	Linear and nonlinear programming	3	0.02641
64	Routing	2	0.05357
92	EM and mixture model	1	0.02667

TID is the abbreviation of topic ID. Freq. is the abbreviation of frequency.
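
The counting and ranking step can be illustrated with the following Python sketch. It is a simplified reading of the procedure, not the authors' code: the function name `rank_community_topics` is hypothetical, the toy edge data are invented for illustration and are not taken from the NIPS dataset, and taking the minimum probability over all edges is our assumption about how "minimal collaborative probability" is applied.

```python
from collections import defaultdict


def rank_community_topics(edge_topics):
    """Rank topics of one community.

    edge_topics: one dict {topic_id: collaborative probability} per coauthor edge.
    Returns a list of (topic_id, frequency, probability) tuples.
    """
    freq = defaultdict(int)
    prob = {}
    for topics in edge_topics:
        for tid, p in topics.items():
            freq[tid] += 1
            # Assumption: keep the smallest probability seen over all edges.
            prob[tid] = min(p, prob.get(tid, p))
    # Higher frequency first; ties are broken by the larger probability.
    return sorted(((tid, freq[tid], prob[tid]) for tid in freq),
                  key=lambda row: (-row[1], -row[2]))


# Hypothetical three-edge community, for illustration only (not NIPS data).
toy_edges = [
    {43: 0.09, 97: 0.05, 52: 0.06},
    {43: 0.08, 97: 0.04},
    {52: 0.07, 64: 0.05},
]
ranking = rank_community_topics(toy_edges)
for tid, f, p in ranking:
    print(tid, f, p)
```

In the toy data, topics 43, 52, and 97 each occur on two edges, so their order is decided by probability alone, which mirrors the tie-breaking seen for topics 19, 6, and 12 in Table 8.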

With the ranked collaborative topics of a community, we can select either the most outstanding topics or the top N topics as its research topics. In this paper, we use the top N topics to represent the research interests of every detected community. Here, we set $N = 3$ and list the topics of all communities for $k = 5$ in Table 9. For each community, Table 9 gives the topic ID and collaborative frequency of each of its top three topics.

Table 9

Community topics ( $k = 5$ and $N = 3$ ).

CNo	Top 1 TID	Top 1 Freq.	Top 2 TID	Top 2 Freq.	Top 3 TID	Top 3 Freq.
1	43	76	97	72	52	68
2	26	28	97	28	80	20
3	77	25	97	25	52	25
4	92	14	31	12	24	9
5	24	25	89	25	33	25
6	24	24	15	24	89	14

CNo is the abbreviation of community number. TID is the abbreviation of topic ID. Freq. is the abbreviation of frequency.
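
As a small check of the top N selection, the sketch below applies the ranking rule to the rows of Table 8 and keeps the first three topics. The helper name `top_n_topics` is hypothetical; the data are copied from Table 8.

```python
# Rows of Table 8 for community 1 (k = 5): (topic ID, frequency, probability).
table8 = [
    (43, 76, 0.08673), (97, 72, 0.04684), (52, 68, 0.05536),
    (80, 65, 0.02443), (86, 57, 0.04598), (74, 47, 0.02801),
    (81, 41, 0.04362), (70, 24, 0.02891), (15, 18, 0.02528),
    (42, 14, 0.03183), (77, 11, 0.03183), (19, 9, 0.03957),
    (6, 9, 0.03528), (12, 9, 0.02367), (75, 4, 0.02446),
    (44, 3, 0.02641), (64, 2, 0.05357), (92, 1, 0.02667),
]


def top_n_topics(rows, n=3):
    """Return (topic ID, frequency) of the n best-ranked topics."""
    # Sort by descending frequency, then by descending probability.
    ranked = sorted(rows, key=lambda r: (-r[1], -r[2]))
    return [(tid, freq) for tid, freq, _ in ranked[:n]]


community1_label = top_n_topics(table8, n=3)
print(community1_label)
```

The selected topics (43, 97, and 52) agree with the row for community 1 in Table 9.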

According to the results, we can infer the research interests of each community. The main interests appear to be “pattern recognition” and “outlier detection” for community 1; “speech recognition” for community 2; “SVM & kernel method” for community 3; “EM & mixture model” and “neural network and Boltzmann machine” for community 4; “independent component analysis” and EEC for community 5; and “image and vision” and “independent component analysis” for community 6.

To investigate the effectiveness of the proposed method, we check all the papers coauthored by the authors of each detected community in Table 4. We found that, for most communities, the uncovered topics are closely related to the topics of the papers written by the community's authors. For example, the authors in community 3 $(k = 5)$ coauthored 16 papers, and the proposed method suggests that the topic of the community is “support vector machine and kernel method.” The titles of these papers are listed in Appendix A. The titles of 8 papers contain “support vector,” and the titles of 4 papers contain “kernel.” Although the titles of the remaining papers point to other topics, this supports the conclusion that the research interests of the community are mainly related to “SVM and kernel method.”

Finally, we examine the function of the overlapping nodes. In our experiments with $k = 5$, there is one overlapping node, 156 (author name: Sejnowski_T), shared by communities 4, 5, and 6. Interestingly, the uncovered topics of all three communities contain topic 24 (visual position), and the topics of both communities 5 and 6 contain topic 89 (independent component analysis). To find out whether this overlapping node plays a bridge role between the three communities, we check all 43 papers of author 156. The titles of his papers are provided in Appendix B. We found that the research interests of Sejnowski_T cover almost all uncovered topics of the three communities. Judging from the titles, more than 12 papers cover topic 24, about 6 papers cover topic 89, and about 3 papers cover topic 92. Thus, we have reason to believe that author 156 has multiple research interests and therefore plays a bridge role between the three communities.
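
The topic overlap between the three communities that share node 156 can be verified directly from Table 9. The following sketch uses plain set operations; the variable names are ours, and only the top three topic IDs of communities 4, 5, and 6 are taken from Table 9.

```python
from itertools import combinations

# Top-3 topic IDs of communities 4, 5, and 6, copied from Table 9 (k = 5, N = 3).
community_topics = {4: {92, 31, 24}, 5: {24, 89, 33}, 6: {24, 15, 89}}

# Topics common to all three communities that share overlapping node 156.
shared_by_all = set.intersection(*community_topics.values())

# Topics shared by each pair of communities.
pairwise = {
    (a, b): community_topics[a] & community_topics[b]
    for a, b in combinations(sorted(community_topics), 2)
}

print(sorted(shared_by_all))
for pair, topics in pairwise.items():
    print(pair, sorted(topics))
```

Topic 24 is the only topic common to all three communities, and communities 5 and 6 additionally share topic 89, which is consistent with the observation above.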

5. Conclusion

In this work, we propose a method for uncovering the research topics of communities in a scientific collaboration network. The method integrates community detection based on the k-clique-community algorithm with the author-topic model. The k-clique-community algorithm detects overlapping communities in the scientific collaboration network, while the AT model discovers the topics and the topics of each author. We use the common topics of coauthors as their collaborative topics. Finally, we count all collaborative topics and select the most frequent ones among the authors as the research topics of each community. Experimental results on the NIPS dataset show that our method is feasible and efficient.

In a knowledge-based system, information about communities and their research topics is useful. It helps users locate an academic community of interest quickly and then find interesting topics, researchers, and papers through the topics and coauthorship of the community's authors. It can also improve the effectiveness and user experience of academic recommendation systems and provide researchers in a community with the information they really need. Therefore, the more interesting open problems are how to use the method and its results in a knowledge-based system and how the parameter k of the k-clique-community algorithm affects user selection in practical applications.

Some challenging problems remain for future studies. One is to develop an algorithm that obtains the collaborative topics between authors directly by extending the AT model. Another is to analyze the topic evolution of communities and the functions of overlapping nodes in the evolution process.

Appendices

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the project “Key Technologies Research on Data Mining from the Multiple Electric Vehicle Information Sources,” sponsored by the Key Technologies R&D Program of the Chinese 12th Five-Year Plan (2011–2015) under Grant no. 2013BAG06B01, and by the project “Scientific Collaboration Network Analysis Based on Content and Linkage Data,” sponsored by the ISTIC Preresearch Foundation under Grant no. YY201221.
