Person number estimation in large corpora

Abstract

In this paper we present various methods for estimating the K-number, the number of distinct entities carrying the same name in a corpus, together with an analysis of their characteristics and their impact on the person cross-document coreference (PCDC) task. Two important classes of such methods are distinguished: corpus-based and external-resource-based. The experiments reported here show that estimating the K-number plays an important role in PCDC, from understanding the complexity of the task to improving the overall accuracy of coreference.

Keywords

Person Cross-Document Coreference; Cluster Number Estimation; Domain Knowledge; Corpus-based Methods
