Abstract
The name ambiguity problem affects the accuracy of web search, document retrieval, and information fusion. Much work has been done on name disambiguation for publications and papers; in this paper, we instead address the problem for National Natural Science Foundation of China (NSFC) fund data. We propose a probabilistic Markov random fields framework for NSFC fund name disambiguation, define an objective function, and use a parameter learning algorithm to obtain suitable parameters. Experimental results indicate that our approach significantly outperforms several traditional clustering methods.
Introduction
Different people may share the same name in the real world. This is a critical problem in many applications, such as web search, information integration, and paper retrieval.
To understand this problem, consider an example drawn from National Natural Science Foundation of China (NSFC) fund project information: from 1955 to 2015, 39 NSFC projects were held by 18 different persons named Ming Chen. These funds may share the same subject, the same keywords, or the same participants. In this paper, we try to disambiguate the principal investigators using the information of the funds.
The problem of name disambiguation has been investigated by many institutions and researchers. For example, Giles et al. 1 proposed an unsupervised learning approach using K-way spectral clustering for author name disambiguation. Wang et al. 2 proposed a bias classification method for finding atomic clusters for the name disambiguation problem. Tian et al. 3 tried to cluster the graph by calculating node similarity. However, these methods are based only on content similarity or relationship similarity between nodes. They may perform well in some domains, but in our problem there are multiple types of relationships, and different relationships may have different importance.
Thus, two questions arise here: (1) how to combine content features and relationship features? (2) How to evaluate the contribution of different types of relationships?
In this paper, we formalize the name disambiguation problem using Markov random fields,1,4 where the input data consist of both attribute information and relationship information. We propose a two-step algorithm for parameter learning: assignment of funds and updating of the parameters. The approach achieves better performance than other traditional clustering algorithms because it takes advantage of the interrelationships between funds.
We evaluated the proposed method on a dataset collected from NSFC, which contains 580 funds of 20 different person names. Our results show that our approach can significantly improve the performance of name disambiguation compared with K-means, hierarchical agglomerative clustering (HAC), and spectral clustering. Moreover, we also examine the contribution of different features to the final results.
Our contributions in this paper include: (1) a probabilistic Markov random fields framework for the challenge of name disambiguation, (2) an approach to parameter estimation, and (3) an empirical verification of the proposed approach.
The rest of the paper is organized as follows: The next section reviews related work. The subsequent section presents an overview of our approach. “Definitions” section describes the feature definitions; “Conditional random field (CRF)” and “Parameter learning” sections present our method of using Markov random fields for name disambiguation and the parameter learning algorithm. “Experiments” section presents experiments and results. Finally, we conclude the paper in the final section.
Related work
Name disambiguation methods can be divided into three categories: supervised approaches, unsupervised approaches, and constraint-based approaches. 4 These approaches target different domains, for example disambiguation on coauthors, 5 citations,6–9 email information, 10 web pages,11,12 search-engine-driven methods,13,14 and so on.
Han et al. 15 proposed two supervised learning approaches (a generative model and a discriminative model) for name disambiguation in author citations. The first uses the naive Bayes probability model: they assume each author’s citation data are generated by a naive Bayes model, use past citation data to estimate the model parameters, and then use the model to calculate the probability of each name entry. The second uses support vector machines with a vector space representation of citations, classifying each citation to the closest author class by training a classifier for each author class. However, one drawback of supervised approaches is scalability: it is impractical to train a model for each author in a large digital library.
For the unsupervised approaches, topic models or clustering algorithms are employed to find different partitions, which are then assigned to different persons. Wang et al. 2 proposed a bias classification method to find atomic clusters for the publication name disambiguation problem, integrating the atomic clusters into two traditional clustering methods in two steps. In the first step, they find publications with strong similarities by utilizing a biased AdaBoost classifier. In the second step, they take the atomic clusters as input to a traditional clustering algorithm and obtain the final result. Song et al. 16 presented an efficient two-stage approach to the name disambiguation problem: in the first stage, they proposed two hierarchical Bayesian text models, namely probabilistic latent semantic analysis and latent Dirichlet allocation. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a HAC method.
Some name disambiguation approaches primarily utilize specific features such as citations, coauthors, titles, and content. Schulz et al. 8 presented an algorithm for large-scale author name disambiguation based on common coauthors, self-citations, shared references, and citations, followed by a two-step agglomerative clustering for name disambiguation.
Some other approaches have been proposed based on graph topological structure, where the disambiguation module clusters people by calculating person similarities as edge weights between person nodes. For example, Giles et al. 1 and Tian et al. 3 tried to cluster the graph by calculating node similarity. On and Lee 13 attempted to partition a large-scale graph into K clusters based on both graph topological structure and attribute similarities by constructing an attribute-augmented graph. In the augmented graph, they proposed a neighborhood random walk model to estimate pairwise vertex closeness through <attribute, value> pairs. The paper demonstrates that attribute similarity increases the closeness of pairwise vertices under their distance measure. However, how to optimally balance the contributions of different attributes remains a tricky problem.
In the existing approaches, the data usually contain only one kind of relationship between homogeneous nodes. In our problem setting, however, there are multiple different relationships and features between nodes, and the different relationship types may carry different weights for the name disambiguation problem. How to automatically determine the contributions of the different relationships within the model is still a challenging problem.
Overview of our approach
In this section, we first give an overview of our framework and then introduce the details of name disambiguation approach.
After investigation and analysis, we found that one difficulty of name disambiguation is that the data are very sparse in the feature space. Some funds are located far away from the target centroid, and it is difficult to achieve good performance with traditional clustering approaches. We also found that there are multiple different relationships and features between the nodes. For example, two funds under the same PI name may share a CoPartner and actually belong to the same PI, even though their content similarity is low.
Figure 1 shows a simplified graph model example. Each node in the upper part denotes a fund, each edge denotes a relationship between two funds, and the edge label represents the type of the relationship. The bottom nodes denote the observation variables. The similarity between two nodes can be calculated by content-based similarity measurements such as cosine similarity, the Pearson correlation coefficient, and the Jaccard coefficient. However, content similarity alone cannot achieve good performance, and the different types of relationships may contribute differently to name disambiguation. Based on this observation, we propose a method based on Markov random fields to address these challenges.
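For illustration, a minimal sketch of the content-based similarity measurements mentioned above; the example texts and keyword sets are hypothetical, not from the NSFC data:

```python
from collections import Counter
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a, b):
    """Jaccard coefficient between two sets (e.g. keyword sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical fund titles and keyword sets
t1 = Counter("graph clustering for name disambiguation".split())
t2 = Counter("name disambiguation with graph models".split())
print(round(cosine_sim(t1, t2), 3))                      # prints 0.6
print(round(jaccard_sim({"graph", "crf"}, {"graph", "topic"}), 3))
```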

Name disambiguation model.
We found that funds with similar content may belong to the same person, and funds with strong relationships may also share the same label (the same PI). Based on this assumption, we integrate both content information and structure information into the Markov random fields. Solving the Markov random fields involves both estimating the weights of the feature functions and assigning funds to different persons.
Our work is mainly aimed at disambiguating NSFC fund principal investigators, so that projects held under the same principal investigator name can be correctly attributed to the real principal investigator. We collected a dataset containing 580 funds of 20 different person names in total. Our experimental results show that our method significantly improves the performance of name disambiguation over three traditional clustering algorithms.
Definitions
Features
In the dataset, each fund is associated with five attributes: year, subject, CoPartner, keywords, and title, as shown in Table 1. All of the feature data can be accessed from the NSFC website. To describe the relationships between funds, we define four types of features between the funding data: CoPartner, CoKeyword, CoSubject, and SimilarTitle. We use these features to describe the similarities between funds.
Attributes of each fund
Let F = {f1, f2, …, fn} denote the set of funds that share the same PI name.
For each fund fi ∈ F, we extract its attribute values (year, subject, CoPartner, keywords, and title) and compute the four relationship features defined above between fi and every other fund.
Relationships between projects.
For name disambiguation, funds and relationships are transformed into an undirected graph, where each node represents a fund and each edge represents a relationship. Attributes of the funds are attached to the corresponding nodes as a feature vector. For the title feature vector, we use occurrences of the words (excluding stop words) as vector values. We define the fund relation undirected graph as follows:
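A minimal sketch of constructing such a relation graph, assuming a simplified fund record schema; the fields and relation tests below are illustrative, not the exact NSFC attributes:

```python
# Hypothetical fund records: id -> attributes
funds = {
    1: {"partners": {"Li Wei"}, "keywords": {"graph", "clustering"}, "subject": "CS"},
    2: {"partners": {"Li Wei"}, "keywords": {"clustering", "CRF"}, "subject": "CS"},
    3: {"partners": {"Zhang San"}, "keywords": {"biology"}, "subject": "Bio"},
}

def build_graph(funds):
    """Return adjacency: {fund_id: [(neighbor_id, relation_type), ...]}."""
    graph = {fid: [] for fid in funds}
    ids = sorted(funds)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            rels = []
            if funds[u]["partners"] & funds[v]["partners"]:
                rels.append("CoPartner")
            if funds[u]["keywords"] & funds[v]["keywords"]:
                rels.append("CoKeyword")
            if funds[u]["subject"] == funds[v]["subject"]:
                rels.append("CoSubject")
            # undirected graph: add each typed edge in both directions
            for r in rels:
                graph[u].append((v, r))
                graph[v].append((u, r))
    return graph

g = build_graph(funds)
print(g[1])  # typed edges incident to fund 1
```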
Conditional random field (CRF)
CRFs are a class of statistical modeling methods often applied in pattern recognition and machine learning for structured prediction. A CRF models the conditional probability of the output variables given the observation variables. Let the observation sequence be X = {x1, x2, …, xn} and the corresponding label sequence be Y = {y1, y2, …, yn}.
For our problem of NSFC fund principal investigator name disambiguation, we define Y as the cluster labels: each yi denotes the cluster (real person) to which fund xi is assigned.
Within the graph, the hidden nodes y are linked by edges following a predefined graph structure. A CRF distribution over the cliques can be written as

P(Y|X) = (1/Z(X)) exp( Σ_c Σ_k λ_k f_k(y_c, X) )    (1)

where f_k is a feature function defined on clique c, λ_k is its weight, and Z(X) is the normalization (partition) function.
Disambiguation objective function
Suppose there are K persons {p1, p2, …, pK} sharing the same name. The disambiguation task is to assign each fund to the correct person by maximizing the conditional probability of the assignment.
By substituting equation (1) into equation (3), we obtain
The transition feature function captures the dependency between the labels of two funds connected by a relationship edge; each relationship type has its own weight.
The status feature function measures the compatibility between a fund's content and its assigned cluster label.
Here, both feature functions are defined over the cliques of the graph described above.
By inserting equations (5) and (6) into equation (4), we obtain
Here we combine the weights of the transition feature functions and the status feature functions into a single parameter vector.
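To make the combined weighting concrete, here is a hedged sketch of how a score for one assignment could mix status (content) terms and transition (relationship) terms; all weights, similarities, and edges below are made-up illustrative values, not the paper's learned parameters:

```python
def assignment_score(labels, content_sim, edges, weights):
    """Score an assignment of funds to clusters.

    labels: {fund_id: cluster}            -- candidate assignment
    content_sim: {(fund_id, cluster): s}  -- status (content) compatibility
    edges: [(u, v, rel_type)]             -- typed relationship edges
    weights: {'state': w, rel_type: w}    -- one weight per feature type
    """
    # status terms: content fit of each fund to its assigned cluster
    score = sum(weights["state"] * content_sim[(f, c)]
                for f, c in labels.items())
    # transition terms: reward linked funds that share a label,
    # scaled by the weight of the relationship type
    for u, v, rel in edges:
        if labels[u] == labels[v]:
            score += weights[rel]
    return score

labels = {1: "A", 2: "A", 3: "B"}
content_sim = {(1, "A"): 0.9, (2, "A"): 0.7, (3, "B"): 0.8}
edges = [(1, 2, "CoKeyword"), (2, 3, "CoSubject")]
weights = {"state": 1.0, "CoKeyword": 0.5, "CoSubject": 0.1}
print(assignment_score(labels, content_sim, edges, weights))  # 2.4 + 0.5
```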
Parameter learning
Parameter learning mainly aims to obtain appropriate values for the weights of the feature functions.
The algorithm for parameter learning primarily consists of two parts: project assignment and parameter update. More specifically, we first randomly select a parameter setting and then iterate the following steps.

Input: training dataset
Output: parameters
1. Initialization: randomly initialize the parameters and cluster centroids.
2. Repeat until convergence:
2.1 assign each project to the closest cluster centroid;
2.2 update every cluster centroid;
2.3 update the weights of the feature functions.
For initialization, we randomly assign the values of the parameters.
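The assign-then-update iteration can be sketched as follows. This toy one-dimensional version replaces the learned feature functions with a plain distance, so it only illustrates the control flow of the two-step loop, not the actual update rules:

```python
def learn(points, k, iters=20):
    """Alternate hard assignment and centroid update until stable."""
    centroids = points[:k]        # naive initialization from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign each project to the closest centroid.
        new_labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                      for p in points]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Step 2: update every centroid (the parameter update of the
        # feature-function weights would also happen in this step).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

labels, centroids = learn([1.0, 1.2, 5.0, 5.4], 2)
print(labels, centroids)
```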
However, directly maximizing the log-likelihood objective function is extremely time consuming: the partition term ln Z in equation (7) must be recomputed in every iteration. To make parameter learning tractable, we instead maximize the pseudo-likelihood of the training data. 17
This approximates the conditional likelihood by a product of local conditional distributions, each conditioned on the immediate neighbors (Markov blanket) of the corresponding variable.
The partition function of each local conditional distribution only requires summing over the candidate labels of a single variable, which is tractable.
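A sketch of the local conditional used by the pseudo-likelihood, where the local partition function sums over only one node's candidate labels given its Markov blanket; the edge weights and labels below are illustrative:

```python
from math import exp

def local_prob(node, label, labels, neighbors, num_labels):
    """P(y_node = label | labels of node's neighbors), normalized locally.

    neighbors: {node: [(neighbor_id, edge_weight), ...]} -- Markov blanket
    """
    def potential(l):
        # reward agreeing with each neighbor, scaled by the edge weight
        return exp(sum(w for n, w in neighbors[node] if labels[n] == l))
    # local partition function: a sum over one node's labels only
    z_local = sum(potential(l) for l in range(num_labels))
    return potential(label) / z_local

labels = {0: 0, 1: 0, 2: 1}
neighbors = {0: [(1, 1.0), (2, 0.5)]}
p = local_prob(0, 0, labels, neighbors, 2)
print(round(p, 3))  # exp(1.0) / (exp(1.0) + exp(0.5)) ≈ 0.622
```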
Then we calculate the gradients of the pseudo-log-likelihood with respect to the parameters.
Finally, we update the parameters by gradient ascent until convergence.
Experiments
Datasets
To evaluate our method, we collected a dataset from the NSFC website. The dataset includes 20 real person names with their 580 funds. Some names correspond to only a few real persons; for example, “Yuliang Li” represents only three different real persons. Others correspond to many; for “Wei Liu,” there are 55 different real persons. Tables 3 and 4 show the details of the dataset.
Datasets.
Details of two principal names.
In the dataset, each fund is labeled with a number indicating the actual person. We determined the label numbers from the principal investigators’ homepages and other public data. From Tables 3 and 4, we can see that the number of real persons per name is unbalanced. For example, there are 16 funds held under the name “Zhou Qi,” of which 12 are hosted by Prof. Zhou Qi from the Institute of Zoology, Chinese Academy of Sciences; the other four funds are hosted by four different people, respectively.
Experimental design
In the experiments, we use pairwise measurements to evaluate our method and compare it with the K-means algorithm, hierarchical clustering, and spectral clustering. The measures are adapted for evaluating name disambiguation by counting the pairs of funds assigned the same label. We define the measurements as follows:

Pairwise Precision = (# pairs correctly predicted to be in the same cluster) / (# total pairs predicted to be in the same cluster)
Pairwise Recall = (# pairs correctly predicted to be in the same cluster) / (# total pairs actually in the same cluster)
Pairwise F1 = 2 × Precision × Recall / (Precision + Recall)
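The pairwise measures can be computed as in the following sketch (the label vectors are toy examples):

```python
from itertools import combinations

def pairwise_prf(pred, truth):
    """Pairwise precision, recall, and F1 over two label vectors."""
    idx = range(len(pred))
    pred_pairs = {(i, j) for i, j in combinations(idx, 2) if pred[i] == pred[j]}
    true_pairs = {(i, j) for i, j in combinations(idx, 2) if truth[i] == truth[j]}
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 funds: prediction splits them 2+2, ground truth splits them 3+1
print(pairwise_prf([0, 0, 1, 1], [0, 0, 0, 1]))
```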
We considered several traditional clustering methods as baselines, including K-means, HAC, and spectral clustering. In these methods we combine the title, subject, keywords, and CoPartner features. Specifically, for the title, we split it into a bag of words without stop words and generate a feature for each word. For the subject, we transform it into a one-hot vector feature. For CoPartner and keywords, we split the lists into person names and individual keywords, and define a one-hot vector feature for each keyword and name. For K-means, we partition the n fund vectors into k (≤ n) sets S = {C1, C2, …, Ck} so as to minimize the within-cluster sum of squares. For HAC, each fund starts in its own cluster, pairs of clusters are merged moving up the hierarchy, and we use Euclidean distance as the metric between clusters. Spectral clustering partitions the nodes of a graph into K clusters using the relationship matrix; note that its relation values are binary, with no weights.
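A sketch of the baseline feature construction, turning categorical fund attributes into one-hot vectors; the records and field names below are illustrative, not the actual NSFC schema:

```python
def one_hot_features(records, fields):
    """Build one-hot vectors over all (field, value) pairs seen in records."""
    vocab = sorted({(f, v) for r in records for f in fields for v in r[f]})
    index = {fv: i for i, fv in enumerate(vocab)}
    vectors = []
    for r in records:
        vec = [0] * len(vocab)
        for f in fields:
            for v in r[f]:
                vec[index[(f, v)]] = 1  # mark each attribute value present
        vectors.append(vec)
    return vectors, vocab

# Hypothetical fund records
records = [
    {"keywords": {"graph"}, "partners": {"Li Wei"}},
    {"keywords": {"graph", "CRF"}, "partners": set()},
]
vecs, vocab = one_hot_features(records, ["keywords", "partners"])
print(vecs)
```

The resulting vectors can then be fed to any off-the-shelf K-means or HAC implementation.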
Results
We conducted disambiguation experiments for the funds related to each of the principal names in the dataset. Table 5 shows the final results. We can see that our method (bold values in Table 5) outperforms the other methods for name disambiguation (+12.2% over K-means, +8% over HAC, and +8.9% over spectral clustering by average F1 score). Figure 2 shows the performance comparison between the three baseline methods and our method. It can be seen that the improvements by our approach are significant.
Results of name disambiguation (percent).
HAC: hierarchical agglomerative clustering.

Performance of using three baseline methods and our method. HAC: hierarchical agglomerative clustering.
The K-means and HAC methods have two disadvantages: they cannot make use of the relationships between funds, and they rely on a fixed distance measure. Spectral clustering considers the relationships between nodes and uses the similarity matrix to represent the relationship information; however, its relation values are binary, with no weights.
Our method directly models relevance as dependencies between assignment results and uses an unsupervised algorithm to learn the similarity function between funds.
Feature contribution analysis
We also studied the contribution of the different features to name disambiguation by testing each feature individually. Figure 3 shows the performance of using different features. The CoKeyword feature yields the best performance in terms of Precision, Recall, and F1 measure. The average performance of the CoPartner feature is better than both the SimTitle and CoSubject features. The recall of SimTitle is similar to that of CoSubject, but its precision and F1 measure are better.

Contribution of different features.
After that, we rank the features by their performance and add them one by one into our method: first CoKeyword, followed by CoPartner, SimTitle, and CoSubject. At each step, we evaluate the change in Precision, Recall, and F1 score. Figure 4 shows the average Precision, Recall, and F1 score of our method with the different feature combinations.

Performances of different features.
Conclusion
In this paper, we have addressed the problem of NSFC fund name disambiguation by proposing a probabilistic CRF model. We defined an objective function and used a parameter learning algorithm to obtain suitable parameters. We applied our approach to a real dataset from NSFC. Experimental results indicate that our approach significantly outperforms K-means, HAC, and SA-Cluster.20 Experiments also show that the CoKeyword features are the most important for our name disambiguation problem.
In future work, we would like to study how topic models such as LDA can improve our name disambiguation, and we will use more attribute information to improve the final results. Moreover, we are also interested in methods to estimate the number of clusters K, because existing methods cannot estimate it accurately.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the project of International Engineering Science and Technology Big Data Platform Construction and Strategy Research (15692110200) of Shanghai Computer Software Technology Development Center.
