Abstract
The name ambiguity problem affects the accuracy of web search, document retrieval, and information fusion. Much work has been done on name disambiguation for publications and papers; in this paper, we instead address the problem for National Natural Science Foundation of China (NSFC) fund data. We propose a probabilistic Markov random fields framework for NSFC fund name disambiguation, define an objective function, and use a parameter learning algorithm to obtain suitable parameters. Experimental results indicate that our approach significantly outperforms several traditional clustering methods.
Introduction
Different people may share the same name in the real world. This is a critical problem in many applications, such as web search, information integration, and paper retrieval.
To understand this problem, consider an example drawn from National Natural Science Foundation of China (NSFC) fund project information: from 1955 to 2015, 39 NSFC projects were held by 18 different persons named Ming Chen. These funds may share the same subject, the same keywords, or the same participants. In this paper, we try to disambiguate the principal investigators using the information of the funds.
The problem of name disambiguation has been investigated by many institutions and researchers. For example, Giles et al. 1 proposed an unsupervised learning approach using K-way spectral clustering for author name disambiguation. Wang et al. 2 proposed a bias classification method for finding atomic clusters for the name disambiguation problem. Tian et al. 3 tried to cluster the graph by calculating node similarity. However, these methods are based only on content similarity or relationship similarity between nodes. They may perform well in some domains, but in our problem there are multiple types of relationships, and different relationships may have different importance.
Thus, two questions arise here: (1) how to combine content features and relationship features? (2) How to evaluate the contribution of different types of relationships?
In this paper, we formalize the name disambiguation problem using Markov random fields,1,4 where the input data consist of both attribute information and relationship information. We propose a two-step algorithm for parameter learning: assignment of funds and updating of the parameters. The approach achieves better performance than other traditional clustering algorithms because it takes advantage of the interrelationships between funds.
We evaluated the proposed method on a dataset collected from NSFC, which contains 580 funds of 20 different person names. Our results show that our approach can significantly improve the performance of name disambiguation compared with K-means, hierarchical agglomerative clustering (HAC), and spectral clustering. Moreover, we also examine the contribution of different features to the final results.
Our contributions in this paper include: (1) a probabilistic Markov random fields framework for the challenge of name disambiguation, (2) an approach to parameter estimation, and (3) an empirical verification of the proposed approach.
The rest of the paper is organized as follows: The next section reviews related work. The subsequent section presents an overview of our approach. “Definitions” section describes the feature definitions; “Conditional random field (CRF)” and “Parameter learning” sections present our method of using Markov random fields for name disambiguation and the parameter learning algorithm. “Experiments” section presents experiments and results. Finally, we conclude the paper in the final section.
Related work
Name disambiguation methods can be divided into three categories: supervised approaches, unsupervised approaches, and constraint-based approaches. 4 These approaches target different domains, for example disambiguation on coauthors, 5 citations,6–9 email information, 10 web pages,11,12 search-engine-driven methods,13,14 and so on.
Han et al. 15 proposed two supervised learning approaches (a generative model and a discriminative model) for name disambiguation in author citations. The first uses the naive Bayes probability model: they assume each author’s citation data are generated by a naive Bayes model, use past citation data to estimate the model parameters, and then use the model to calculate the probability of each name entry. The second uses support vector machines with a vector space representation of citations, classifying each citation to the closest author class by training a classifier for each author class. However, one drawback of supervised approaches is scalability: it is impractical to train a model for each author in a large digital library.
For the unsupervised approaches, topic models or clustering algorithms are employed to find different partitions, which are then assigned to different persons. Wang et al. 2 proposed a bias classification method to find atomic clusters for the publication name disambiguation problem, integrating the atomic clusters into two traditional clustering methods in two steps. In the first step, they find publications with strong similarities by utilizing a biased AdaBoost classifier. In the second step, they take the atomic clusters as input to a traditional clustering algorithm and obtain the final result. Song et al. 16 presented an efficient two-stage approach to the name disambiguation problem: in the first stage, they proposed two hierarchical Bayesian text models, namely probabilistic latent semantic analysis and latent Dirichlet allocation. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a HAC method.
Some name disambiguation approaches primarily utilize specific features such as citations, coauthors, titles, and content. Schulz et al. 8 presented an algorithm for large-scale author name disambiguation based on common coauthors, self-citations, shared references, and citations, followed by a two-step agglomerative clustering for name disambiguation.
Some other approaches have been proposed based on graph topological structure, where the disambiguation module clusters people by calculating person similarities as edge weights between person nodes. For example, Giles et al. 1 and Tian et al. 3 tried to cluster the graph by calculating node similarity. On and Lee 13 attempted to partition a large-scale graph into K clusters based on both graph topological structure and attribute similarities by constructing an attribute-augmented graph. In the augmented graph, they proposed a neighborhood random walk model to estimate pairwise vertex closeness through <attribute, value> pairs. The paper demonstrates that attribute similarity increases the closeness of pairwise vertices under their distance measure. However, how to optimally balance the contributions of different attributes remains a tricky problem.
In the existing approaches, the data usually contain only one kind of relationship between homogeneous nodes. In our problem setting, however, there are multiple different relationships and features between nodes, and the different relationship types may carry different weights for the name disambiguation problem. How to automatically determine the contributions of the different relationships within the model is still a challenging problem.
Overview of our approach
In this section, we first give an overview of our framework and then introduce the details of name disambiguation approach.
After investigation and analysis, we found that one difficulty of name disambiguation is that the data are very sparse in the feature space. Some funds are located far away from the target centroid, and it is difficult to achieve good performance with traditional clustering approaches. We also found that there are multiple different relationships and features between the nodes. For example, two funds under the same PI name may share a CoPartner and actually belong to the same PI, even though their content similarity is low.
Figure 1 shows a simplified graph model example. Each node in the upper part denotes a fund, each edge denotes a relationship between two funds, and the edge label represents the type of the relationship. The bottom nodes denote the observation variables. The similarity between two nodes can be calculated by content-based similarity measurements such as cosine similarity, the Pearson correlation coefficient, and the Jaccard coefficient. However, content similarity alone cannot achieve good performance, and the different types of relationships may contribute differently to name disambiguation. Based on this observation, we propose a method based on Markov random fields to address these challenges.
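For illustration, a minimal sketch of the content-based similarity measurements mentioned above; the example texts and keyword sets are hypothetical, not from the NSFC data:

```python
from collections import Counter
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard_sim(a, b):
    """Jaccard coefficient between two sets (e.g. keyword sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical fund titles and keyword sets
t1 = Counter("graph clustering for name disambiguation".split())
t2 = Counter("name disambiguation with graph models".split())
print(round(cosine_sim(t1, t2), 3))                      # prints 0.6
print(round(jaccard_sim({"graph", "crf"}, {"graph", "topic"}), 3))
```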

Name disambiguation model.
We found that funds with similar content may belong to the same person, and funds with strong relationships may also share the same label (the same PI). Based on this assumption, we integrate both content information and structure information into the Markov random fields. Solving the Markov random fields involves both estimating the weights of the feature functions and assigning funds to different persons.
Our work is mainly aimed at disambiguating NSFC fund principal investigators, so that projects held under the same principal investigator name can be correctly attributed to the real principal investigator. We collected a dataset containing 580 funds of 20 different person names in total. Our experimental results show that our method significantly improves the performance of name disambiguation over three traditional clustering algorithms.
Definitions
Features
In the dataset, each fund is associated with five attributes: year, subject, CoPartner, keywords, and title, as shown in Table 1. All of the feature data can be accessed from the NSFC website. To describe the relationships between funds, we define four types of features between the funding data: CoPartner, CoKeyword, CoSubject, and SimilarTitle. We use these features to describe the similarities between funds.
Attributes of each fund
Let F = {f1, f2, …, fn} denote the set of funds that share the same PI name.
For each fund fi ∈ F, we extract its attribute values (year, subject, CoPartner, keywords, and title) and compute the four relationship features defined above between fi and every other fund.
Relationships between projects.
For name disambiguation, funds and relationships are transformed into an undirected graph, where each node represents a fund and each edge represents a relationship. Attributes of the funds are attached to the corresponding nodes as a feature vector. For the title feature vector, we use occurrences of the words (excluding stop words) as vector values. We define the fund relation undirected graph as follows:
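A minimal sketch of constructing such a relation graph, assuming a simplified fund record schema; the fields and relation tests below are illustrative, not the exact NSFC attributes:

```python
# Hypothetical fund records: id -> attributes
funds = {
    1: {"partners": {"Li Wei"}, "keywords": {"graph", "clustering"}, "subject": "CS"},
    2: {"partners": {"Li Wei"}, "keywords": {"clustering", "CRF"}, "subject": "CS"},
    3: {"partners": {"Zhang San"}, "keywords": {"biology"}, "subject": "Bio"},
}

def build_graph(funds):
    """Return adjacency: {fund_id: [(neighbor_id, relation_type), ...]}."""
    graph = {fid: [] for fid in funds}
    ids = sorted(funds)
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            rels = []
            if funds[u]["partners"] & funds[v]["partners"]:
                rels.append("CoPartner")
            if funds[u]["keywords"] & funds[v]["keywords"]:
                rels.append("CoKeyword")
            if funds[u]["subject"] == funds[v]["subject"]:
                rels.append("CoSubject")
            # undirected graph: add each typed edge in both directions
            for r in rels:
                graph[u].append((v, r))
                graph[v].append((u, r))
    return graph

g = build_graph(funds)
print(g[1])  # typed edges incident to fund 1
```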
Conditional random field (CRF)
CRFs are a class of statistical modeling methods often applied in pattern recognition and machine learning for structured prediction. A CRF models the conditional probability of the output variables given the observation variables. Let the observation sequence be X = {x1, x2, …, xn} and the corresponding label sequence be Y = {y1, y2, …, yn}.
For our problem of NSFC fund principal investigator name disambiguation, we define Y as the cluster labels: each yi denotes the cluster (real person) to which fund xi is assigned.
Within the graph, the hidden nodes y are linked by edges following a predefined graph structure. A CRF distribution over the cliques can be written as

P(Y|X) = (1/Z(X)) exp( Σ_c Σ_k λ_k f_k(y_c, X) )    (1)

where f_k is a feature function defined on clique c, λ_k is its weight, and Z(X) is the normalization (partition) function.
Disambiguation objective function
Suppose there are K persons {p1, p2, …, pK} sharing the same name. The disambiguation task is to assign each fund to the correct person by maximizing the conditional probability of the assignment.
By substituting equation (1) into equation (3), we obtain
The transition feature function captures the dependency between the labels of two funds connected by a relationship edge; each relationship type has its own weight.
The status feature function measures the compatibility between a fund's content and its assigned cluster label.
Here, both feature functions are defined over the cliques of the graph described above.
By inserting equations (5) and (6) into equation (4), we obtain
Here we combine the weights of the transition feature functions and the status feature functions into a single parameter vector.
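To make the combined weighting concrete, here is a hedged sketch of how a score for one assignment could mix status (content) terms and transition (relationship) terms; all weights, similarities, and edges below are made-up illustrative values, not the paper's learned parameters:

```python
def assignment_score(labels, content_sim, edges, weights):
    """Score an assignment of funds to clusters.

    labels: {fund_id: cluster}            -- candidate assignment
    content_sim: {(fund_id, cluster): s}  -- status (content) compatibility
    edges: [(u, v, rel_type)]             -- typed relationship edges
    weights: {'state': w, rel_type: w}    -- one weight per feature type
    """
    # status terms: content fit of each fund to its assigned cluster
    score = sum(weights["state"] * content_sim[(f, c)]
                for f, c in labels.items())
    # transition terms: reward linked funds that share a label,
    # scaled by the weight of the relationship type
    for u, v, rel in edges:
        if labels[u] == labels[v]:
            score += weights[rel]
    return score

labels = {1: "A", 2: "A", 3: "B"}
content_sim = {(1, "A"): 0.9, (2, "A"): 0.7, (3, "B"): 0.8}
edges = [(1, 2, "CoKeyword"), (2, 3, "CoSubject")]
weights = {"state": 1.0, "CoKeyword": 0.5, "CoSubject": 0.1}
print(assignment_score(labels, content_sim, edges, weights))  # 2.4 + 0.5
```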
Parameter learning
Parameter learning mainly aims to obtain appropriate values for the weights of the feature functions.
The algorithm for parameter learning primarily consists of two parts: project assignment and parameter update. More specifically, we first randomly select a parameter setting and then iterate the following steps.

Input: training dataset
Output: parameters
1. Initialization: randomly initialize the parameters and cluster centroids.
2. Repeat until convergence:
2.1 assign each project to the closest cluster centroid;
2.2 update every cluster centroid;
2.3 update the weights of the feature functions.
For initialization, we randomly assign the values of the parameters.
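The assign-then-update iteration can be sketched as follows. This toy one-dimensional version replaces the learned feature functions with a plain distance, so it only illustrates the control flow of the two-step loop, not the actual update rules:

```python
def learn(points, k, iters=20):
    """Alternate hard assignment and centroid update until stable."""
    centroids = points[:k]        # naive initialization from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign each project to the closest centroid.
        new_labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                      for p in points]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Step 2: update every centroid (the parameter update of the
        # feature-function weights would also happen in this step).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

labels, centroids = learn([1.0, 1.2, 5.0, 5.4], 2)
print(labels, centroids)
```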
However, directly maximizing the log-likelihood objective function is extremely time consuming: the partition term ln Z in equation (7) must be recomputed in every iteration. To make parameter learning tractable, we instead maximize the pseudo-likelihood of the training data. 17
This approximates the conditional likelihood by a product of local conditional distributions, each conditioned on the immediate neighbors (Markov blanket) of the corresponding variable.
The partition function of each local conditional distribution only requires summing over the candidate labels of a single variable, which is tractable.
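A sketch of the local conditional used by the pseudo-likelihood, where the local partition function sums over only one node's candidate labels given its Markov blanket; the edge weights and labels below are illustrative:

```python
from math import exp

def local_prob(node, label, labels, neighbors, num_labels):
    """P(y_node = label | labels of node's neighbors), normalized locally.

    neighbors: {node: [(neighbor_id, edge_weight), ...]} -- Markov blanket
    """
    def potential(l):
        # reward agreeing with each neighbor, scaled by the edge weight
        return exp(sum(w for n, w in neighbors[node] if labels[n] == l))
    # local partition function: a sum over one node's labels only
    z_local = sum(potential(l) for l in range(num_labels))
    return potential(label) / z_local

labels = {0: 0, 1: 0, 2: 1}
neighbors = {0: [(1, 1.0), (2, 0.5)]}
p = local_prob(0, 0, labels, neighbors, 2)
print(round(p, 3))  # exp(1.0) / (exp(1.0) + exp(0.5)) ≈ 0.622
```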
Then we calculate the gradients of the pseudo-log-likelihood with respect to the parameters.
Finally, we update the parameters by gradient ascent until convergence.
Experiments
Datasets
To evaluate our method, we collected a dataset from the NSFC website. The dataset includes 20 real person names with their 580 funds. Some names correspond to only a few real persons; for example, “Yuliang Li” represents only three different real persons. Others correspond to many; for “Wei Liu,” there are 55 different real persons. Tables 3 and 4 show the details of the dataset.
Datasets.
Details of two principal names.
In the dataset, each fund is labeled with a number indicating the actual person. We determined the label numbers from the principal investigators’ homepages and other public data. From Tables 3 and 4, we can see that the number of real persons per name is unbalanced. For example, there are 16 funds held under the name “Zhou Qi,” of which 12 are hosted by Prof. Zhou Qi from the Institute of Zoology, Chinese Academy of Sciences; the other four funds are hosted by four different people, respectively.
Experimental design
In the experiments, we use pairwise measurements to evaluate our method and compare it with the K-means algorithm, hierarchical clustering, and spectral clustering. The measures are adapted for evaluating name disambiguation by counting the pairs of funds assigned the same label. We define the measurements as follows:

Pairwise Precision = (# pairs correctly predicted to be in the same cluster) / (# total pairs predicted to be in the same cluster)
Pairwise Recall = (# pairs correctly predicted to be in the same cluster) / (# total pairs actually in the same cluster)
Pairwise F1 = 2 × Precision × Recall / (Precision + Recall)
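The pairwise measures can be computed as in the following sketch (the label vectors are toy examples):

```python
from itertools import combinations

def pairwise_prf(pred, truth):
    """Pairwise precision, recall, and F1 over two label vectors."""
    idx = range(len(pred))
    pred_pairs = {(i, j) for i, j in combinations(idx, 2) if pred[i] == pred[j]}
    true_pairs = {(i, j) for i, j in combinations(idx, 2) if truth[i] == truth[j]}
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 funds: prediction splits them 2+2, ground truth splits them 3+1
print(pairwise_prf([0, 0, 1, 1], [0, 0, 0, 1]))
```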
We considered several traditional clustering methods as baselines, including K-means, HAC, and spectral clustering. In these methods we combine the title, subject, keywords, and CoPartner features. Specifically, for the title, we split it into a bag of words without stop words and generate a feature for each word. For the subject, we transform it into a one-hot vector feature. For CoPartner and keywords, we split the lists into person names and individual keywords, and define a one-hot vector feature for each keyword and name. For K-means, we partition the n fund vectors into k (≤ n) sets S = {C1, C2, …, Ck} so as to minimize the within-cluster sum of squares. For HAC, each fund starts in its own cluster, pairs of clusters are merged moving up the hierarchy, and we use Euclidean distance as the metric between clusters. Spectral clustering partitions the nodes of a graph into K clusters using the relationship matrix; note that its relation values are binary, with no weights.
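A sketch of the baseline feature construction, turning categorical fund attributes into one-hot vectors; the records and field names below are illustrative, not the actual NSFC schema:

```python
def one_hot_features(records, fields):
    """Build one-hot vectors over all (field, value) pairs seen in records."""
    vocab = sorted({(f, v) for r in records for f in fields for v in r[f]})
    index = {fv: i for i, fv in enumerate(vocab)}
    vectors = []
    for r in records:
        vec = [0] * len(vocab)
        for f in fields:
            for v in r[f]:
                vec[index[(f, v)]] = 1  # mark each attribute value present
        vectors.append(vec)
    return vectors, vocab

# Hypothetical fund records
records = [
    {"keywords": {"graph"}, "partners": {"Li Wei"}},
    {"keywords": {"graph", "CRF"}, "partners": set()},
]
vecs, vocab = one_hot_features(records, ["keywords", "partners"])
print(vecs)
```

The resulting vectors can then be fed to any off-the-shelf K-means or HAC implementation.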
Results
We conducted disambiguation experiments for the funds related to each of the principal names in the dataset. Table 5 shows the final results. We can see that our method (bold values in Table 5) outperforms the other methods for name disambiguation (+12.2% over K-means, +8% over HAC, and +8.9% over spectral clustering by average F1 score). Figure 2 shows the performance comparison between the three baseline methods and our method. It can be seen that the improvements by our approach are significant.
Results of name disambiguation (percent).
HAC: hierarchical agglomerative clustering.

Performance of using three baseline methods and our method. HAC: hierarchical agglomerative clustering.
The K-means and HAC methods have two disadvantages: they cannot make use of the relationships between funds, and they rely on a fixed distance measure. Spectral clustering considers the relationships between nodes and uses the similarity matrix to represent the relationship information; however, its relation values are binary, with no weights.
Our method directly models relevance as dependencies between assignment results and uses an unsupervised algorithm to learn the similarity function between funds.
Feature contribution analysis
We also studied the contribution of the different features to name disambiguation by testing each feature individually. Figure 3 shows the performance of using different features. The CoKeyword feature yields the best performance in terms of Precision, Recall, and F1 measure. The average performance of the CoPartner feature is better than both the SimTitle and CoSubject features. The recall of SimTitle is similar to that of CoSubject, but its precision and F1 measure are better.

Contribution of different features.
After that, we rank the features by their performance and add them one by one into our method: first CoKeyword, followed by CoPartner, SimTitle, and CoSubject. At each step, we evaluate the change in Precision, Recall, and F1 score. Figure 4 shows the average Precision, Recall, and F1 score of our method with the different feature combinations.

Performances of different features.
Conclusion
In this paper, we have addressed the problem of NSFC fund name disambiguation by proposing a probabilistic CRF model. We defined an objective function and used a parameter learning algorithm to obtain suitable parameters. We applied our approach to a real dataset from NSFC. Experimental results indicate that our approach significantly outperforms K-means, HAC, and SA-Cluster.20 Experiments also show that the CoKeyword features are the most important for our name disambiguation problem.
In future work, we would like to study how topic models such as LDA can improve our name disambiguation, and we will use more attribute information to improve the final results. Moreover, we are also interested in methods to estimate the number of clusters K, because existing methods cannot estimate it accurately.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the project of International Engineering Science and Technology Big Data Platform Construction and Strategy Research (15692110200) of Shanghai Computer Software Technology Development Center.
