Remixing entity linking evaluation datasets for focused benchmarking

Abstract

In recent years, named entity linking (NEL) tools were primarily developed as general-purpose approaches, whereas today numerous tools focus on specific domains, e.g. the mapping of persons and organizations only, or the annotation of locations or events in microposts. However, the available benchmark datasets necessary for the evaluation of NEL tools do not reflect this trend towards specialization. We have analyzed the evaluation process applied in the NEL benchmarking framework GERBIL [in: Proceedings of the 24th International Conference on World Wide Web (WWW’15), International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2015, pp. 1133–1143; Semantic Web 9(5) (2018), 605–625] and all its benchmark datasets. Based on these insights, we have extended the GERBIL framework to enable a more fine-grained evaluation and in-depth analysis of the available benchmark datasets with respect to different emphases. This paper presents the implementation of an adaptive filter for arbitrary entities and customized benchmark creation as well as the automated determination of typical NEL benchmark dataset properties, such as the extent of content-related ambiguity and diversity. These properties are integrated on different levels, which also makes it possible to tailor customized new datasets out of the existing ones by remixing documents based on desired emphases. Besides a new system library to enrich provided NIF [in: International Semantic Web Conference (ISWC’13), Lecture Notes in Computer Science, Vol. 8219, Springer, Berlin, Heidelberg, 2013, pp. 98–113] datasets with statistical information, the paper presents best practices for dataset remixing and an in-depth analysis of the performance of entity linking systems on special focus datasets.

Keywords

Entity Linking, GERBIL, evaluation, benchmark

1. Introduction

Named entity linking (NEL) is the task of interconnecting natural language text fragments with entities in formal knowledge bases, e.g. to help subsequent processing tools cope with the ambiguities of natural language. NEL has evolved into a fundamental requirement for a range of applications, such as (web) search engines, e.g. by mapping the content of search queries to a knowledge graph [32] or by improving search rankings [39]. By linking textual content to formal knowledge bases, exploratory search systems as well as content-based recommender systems greatly benefit from the underlying graph structures by leveraging semantic similarity and relatedness measures [35]. Likewise, social media and web monitoring systems benefit from NEL, e.g. by identifying persons or companies in social media content as subjects of observation or tracking. A general survey on current NEL systems has been provided in [16,31].

While the number of application scenarios for NEL is increasing, the number of different NEL approaches is growing as well, ranging from simple string matching techniques to complex optimization based on machine learning [26]. Most NEL approaches make use of a general solution strategy; however, there is a rising trend towards specialized solutions. In [43] the authors demonstrate an approach focused on medical literature, while [8] examine heritage texts with NEL. Other approaches are focused on specific entity types, e.g. [7], which is applied to the domain of art. Another interesting solution is [1], which can be utilized to build domain-specific NEL tools. The approach of [41] extracts semantic information from mixed media types like scientific videos. This ongoing fragmentation of task types complicates the application of generic benchmarking frameworks for NEL optimization and comparison such as GERBIL [30,37] or NERD [27,28].

With GERBIL, a NEL tool optimized for the detection of person names only might be rather difficult to compare to other NEL tools with a more general focus or specialized for another topic, because the benchmark datasets provided with GERBIL are annotated with all types of entities including organizations, events, etc. When these generally typed benchmarks are used, the overall results achieved with GERBIL are hard to compare, since the assumed person-only NEL system would wrongly be punished with false negatives caused by non-person annotations contained in the benchmarks. The only valid way to achieve an objective evaluation would be to manually filter a dataset to only contain persons and upload it to GERBIL for the desired experiment. However, such experiments are not reproducible, because it is neither clear nor standardized how the applied filtering was carried out, nor is the newly created filtered dataset always publicly available for further experiments. Moreover, it is not desirable to manage a plethora of different versions of filtered datasets. As of now, GERBIL deploys 19 annotation systems and more than 20 datasets, and these numbers are subject to constant change. For a detailed overview of the systems and datasets provided by GERBIL we refer to the official version.¹

¹ http://aksw.org/Projects/GERBIL.html

Besides the problem described above, the GERBIL framework faces further challenges in light of the recent development of new NEL approaches. For instance, it is highly desirable to be able to quantify the ‘difficulty’ of the NEL problems presented in the different evaluation datasets, e.g. the average degree of ambiguity, the completeness of annotations, etc.

A first attempt to cope with this problem was made in [12] by manually compiling the KORE50² corpus with the goal of capturing hard-to-disambiguate mentions of entities. Another problem arises with the quality of annotations as described in [15] and [38], including e.g. annotation redundancy, inter-annotation agreement, topicality with respect to the evolving knowledge bases, mention boundaries, as well as nested annotations. In particular, completeness and coverage of annotations are essential measures to assess those annotation tasks (A2KB, cf. [37]) where the entity mention detection also contributes to the overall results.

² https://datahub.io/de/dataset/kore-50-nif-ner-corpus

Since no ‘all-in-one’ perfect dataset has emerged in the past, which covers all the aspects sufficiently well, it would be beneficial to measure and provide dataset characteristics on the document level to subsequently allow a recompilation of documents across different datasets according to predefined criteria into a customized corpus. For example, for the already mentioned person-only annotation system these measures would help to specifically select only those documents, which exhibit a significant number of person annotations providing a predefined level of ‘difficulty’. Remixing evaluation datasets on the document level leads to a better and more application specific focus of NEL tool evaluation while simultaneously ensuring reproducibility.

We have already introduced an extension of the GERBIL framework enabling a more fine-grained evaluation and in-depth analysis of the deployed benchmark datasets according to different emphases [40]. To achieve this, an adaptive filter for arbitrary entities has been introduced together with a system to automatically measure benchmark dataset properties. The implementation, including a result visualization, is integrated in the publicly available GERBIL framework.

In this paper, the work presented in [40] is brought up to date, consolidated, and extended with the following contributions:

new additional dataset measures,

a stand-alone library to enable customized remixing of datasets,

a vocabulary to enrich NIF-based datasets with additional statistical information,

a reorganization of a subset of the available datasets to enable benchmarking according to the different dataset properties, and

an in-depth analysis of the performance of different systems on the reorganized datasets.

The paper is structured as follows: after this introductory section, measures to characterize NEL datasets are introduced in Section 2. Section 3 explains the GERBIL integration as well as the stand-alone library in detail, while Section 4 elaborates on the most interesting properties of the datasets we have determined so far and presents more insights into the systems' performance on the reorganized and focused datasets. Finally, Section 5 concludes the paper with a summary of the presented work and an outlook on ongoing and future research.

2. Measuring NEL dataset characteristics

NEL datasets have already been analyzed to a great extent. We review these analyses to identify their potential shortcomings and to introduce characteristics and measures that allow for more differentiated analyses. In [15] the basic characteristics of 9 NEL datasets were introduced, including the number of documents, number of mentions, entity types, and number of NIL annotations. In [34] a more detailed view on the distribution of entity types was given, including mapping coverage, entity candidate count, maximum recall, as well as entity popularity. The overlap among datasets was investigated in [38], which also introduced the new measures confusability, prominence, and dominance as indicators for ambiguity, popularity, and difficulty.

In this paper, a subset of the proposed characteristics, among others, has been integrated into the GERBIL benchmarking system. Compared to previous work, where either a theoretical only or an experimental only treatment of the problem was presented, this paper contributes a ready-to-use implementation by extending the GERBIL source code³ and also provides a publicly available online service.⁴

³ https://github.com/santifa/gerbil/

⁴ http://gerbil.s16a.org/

Besides the implementation of filtering the benchmark datasets according to the desired characteristics, the tool instantly updates and visualizes the per annotation system results including statistical summaries. The integration into GERBIL enables a standardized, consistent, extensible as well as reproducible way to analyze and measure dataset characteristics for NEL.

Building on that, we also provide a stand-alone library⁵ that computes the proposed metrics directly on NIF datasets. Without loss of generality, the following explanations refer to the annotation (A2KB) as well as disambiguation (D2KB) tasks of the GERBIL framework. D2KB is the task of disambiguating a given entity mention against the knowledge base. With A2KB, entity mentions first have to be localized in the given input text before the subsequent disambiguation task is performed. Hence, for most implementations D2KB can be seen as a subtask of A2KB.

⁵ https://github.com/santifa/hfts

Before introducing the dataset characteristics one by one the terminology is presented.

A dataset D is a set of documents $d \in D$ . We define a document as the tuple $d = (d_{t}, d_{a})$ where $d_{t}$ is the document text and $| d_{t} |$ is the number of words within the text of the document d. $d_{a}$ is a set of annotations belonging to the document d and $| d_{a} |$ is the number of annotations for the document d.

An annotation $a \in d_{a}$ is defined as the tuple $a = (s, e, i, l)$, where s is the surface form of a, which can be located in the document text $d_{t}$ by its character index i, indicating the beginning of the annotation, and the text length l, indicating the number of characters the annotation spans to the right of index i. The corresponding linked entity is denoted by e.

Furthermore, we define E as the infinite set of entities and S as the infinite set of surface forms such that they are supersets of all other sets of the form $E^{x}$ and $S^{x}$ . Moreover, we define $E^{D}$ as the set of entities within the dataset D and $S^{D}$ as the set of surface forms within D.

In the Appendix of this paper a complete listing of the mathematical notation is given for overview purposes.

The hereafter defined measures might refer to different levels: dataset level, document level, and annotation (or entity) level. Table 1 contains an overview on which measure is considered at a specific level.

Table 1

Overview of the introduced measures and the corresponding levels of reference (ds stands for dataset level, doc for document level, and an for annotation level)

Measure	Level
Not annotated	ds
Density	ds, doc
Prominence	ds, doc, an
Maximum recall	ds
Likelihood of confusion	ds, doc, an
Dominance	ds
Types	ds, doc, an

Some of the introduced measures distinguish between micro and macro measurements [4]. The macro measurement averages the results of the individual documents. Regardless of document length, all documents have the same influence on the aggregated result. In contrast, the micro measurement takes the results of each document into account as if they belonged to one single document, which consequently increases the influence of larger documents.

The formal definition is provided for both measurements for density, likelihood of confusion, dominance, and maximum recall. All other definitions are provided as macro measurement if not stated otherwise.

2.1. Number of annotations

In general, the number of annotations is a measure to estimate the size of the disambiguation context. The average number of annotations for a dataset $na : D \to \mathbb{R}$ is defined as:

$$na(D) = \frac{\sum_{d \in D} |d_{a}|}{|D|}. \tag{1}$$

2.2. Not annotated documents

Some of the available benchmark datasets even contain documents without any annotations at all. Documents without annotations might lead to an increase of false positives in the evaluation results and thereby might cause a loss of precision. The fraction of not annotated documents for a dataset $nad : D \to [0, 1]$ is defined as:

$$nad(D) = \frac{|\{ d \in D : |d_{a}| = 0 \}|}{|D|}. \tag{2}$$

Empty documents might be a problem for the annotation task (A2KB), but not for the disambiguation only task (D2KB), where empty document annotations are simply omitted in the processing.
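Eqs. (1) and (2) can be illustrated with a few lines of Python. This is a minimal sketch under simplifying assumptions, not the implementation of the GERBIL extension: a document is reduced to the list of its linked entities, and the function names and the toy dataset are hypothetical.

```python
from typing import Sequence


def average_annotations(dataset: Sequence[Sequence[str]]) -> float:
    """Eq. (1): mean number of annotations per document."""
    return sum(len(doc) for doc in dataset) / len(dataset)


def not_annotated_fraction(dataset: Sequence[Sequence[str]]) -> float:
    """Eq. (2): share of documents that carry no annotation at all."""
    empty = sum(1 for doc in dataset if len(doc) == 0)
    return empty / len(dataset)


# Toy dataset (assumption): four documents, two of them without annotations.
toy_dataset = [
    ["dbr:Berlin", "dbr:Germany"],
    [],
    ["dbr:Bruce_Willis"],
    [],
]

print(average_annotations(toy_dataset))     # 0.75
print(not_annotated_fraction(toy_dataset))  # 0.5
```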

2.3. Missing annotations (density)

Similar to not annotated documents, missing annotations in an otherwise annotated document might lead to a problem with the A2KB task. Annotation systems potentially identify these missing annotations, which are not confirmed in the available ground truth and thus are counted as false positives. It is not possible to determine the specific number of missing annotations without conducting an objective manual assessment of the entire ground truth data, which requires major effort. However, we propose to estimate this number by measuring an annotation density value, which is the ratio of the number of annotations to the document text length. The $density : D \to [0, 1]$ is defined as:

$$\begin{aligned} density_{micro}(D) &= \frac{\sum_{d \in D} \frac{|d_{a}|}{|d_{t}|}}{|D|}, \\ density_{macro}(D) &= \frac{\sum_{d \in D} |d_{a}|}{\sum_{d \in D} |d_{t}|}. \end{aligned} \tag{3}$$

If an annotation is spanning more than one word, it is only counted as one annotation.
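The two aggregation variants of Eq. (3) differ only in where the division happens. The following Python sketch makes this explicit. It assumes that a document is given as a pair of word count $| d_{t} |$ and annotation count $| d_{a} |$, and the function names are hypothetical. The comments state which variant corresponds to which label in Eq. (3).

```python
from typing import Sequence, Tuple

# A document is reduced to (number of words, number of annotations).
Doc = Tuple[int, int]


def density_per_document_average(dataset: Sequence[Doc]) -> float:
    """First line of Eq. (3): average of the per-document ratios."""
    return sum(annos / words for words, annos in dataset) / len(dataset)


def density_pooled(dataset: Sequence[Doc]) -> float:
    """Second line of Eq. (3): all annotations over all words."""
    total_annos = sum(annos for _, annos in dataset)
    total_words = sum(words for words, _ in dataset)
    return total_annos / total_words


# Toy dataset (assumption): a short, densely annotated document and a
# long, sparsely annotated one.
toy_dataset = [(10, 2), (100, 5)]

print(density_per_document_average(toy_dataset))
print(density_pooled(toy_dataset))
```

In the toy dataset the short document dominates the first variant, whereas the long document dominates the pooled variant, so the two values differ noticeably.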

2.4. Prominence (popularity)

The assumption of [38] is that an evaluation against a corpus with a tendency to focus strongly on prominent or popular entities may cause bias. Hence, NEL systems preferring popular entities potentially exhibit an increase in performance. To verify this, we have implemented two different measures on the annotation level. Similar to [38], the prominence is estimated as the PageRank [22] of entities, based on their underlying link graph in the knowledge base. Additionally, we also take into account Hubs and Authorities (HITS) values as a complementary popularity-related score. PageRank as well as HITS values were obtained from [25].

To classify annotations, documents, and datasets according to different levels of prominence of entities, the set of entities was partitioned as follows. PageRank (respectively HITS) follows a power-law distribution (cf. Section 4.2.1), meaning that only a few entities exhibit a high PageRank and the majority of entities a lower PageRank (long tail), cf. Fig. 1. Highly prominent entities are then defined as the upper 10% of the PageRank values. The subsequent 45% (i.e. 10%–55%) define medium prominence and the lower 45% (i.e. 55%–100%) low prominence.

It is important to mention that this partitioning is relative to the dataset: entities that fall into the middle or lower segment of a dataset with a strong bias towards head entities might fall into the higher segment of a dataset with a more even distribution. Thus, when working with multiple datasets, a global partitioning including all values of all entities is preferred.

Fig. 1.

Example partitioning for the PageRank.

For an arbitrary scoring algorithm P we can define the set of entities within a specific interval $a, b \in [0, 1]$ with $E_{a, b}^{D} : (P) \to E$ as:

$$E_{a, b}^{D}(P) = \{ e \in E^{D} : a \leq P(e) \leq b \}. \tag{4}$$

The resulting set contains all entities of a dataset whose score lies within the given interval limits. A disadvantage of this approach is that entities which do not have a score assigned are not part of any of the resulting sets. Similarly, the prominence can be determined using the HITS values or any other ranking score.
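The 10% / 45% / 45% segmentation described above can be sketched as follows. The sketch assumes that the segments are formed by rank, i.e. entities are sorted by descending score and cut at 10% and 55% of the ranked list. Whether the actual implementation cuts by rank or by score value is not specified here, and the function name and toy scores are hypothetical. Entities without a score are simply absent from the input, which mirrors the disadvantage mentioned above.

```python
from typing import Dict, List


def partition_by_prominence(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Split scored entities into high / medium / low prominence by rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    # Integer arithmetic avoids floating point surprises at the cut points.
    high_end = n * 10 // 100
    medium_end = n * 55 // 100
    return {
        "high": ranked[:high_end],
        "medium": ranked[high_end:medium_end],
        "low": ranked[medium_end:],
    }


# Toy scores (assumption): 20 entities with a long-tailed score distribution.
toy_scores = {f"e{i}": 1.0 / (i + 1) for i in range(20)}
segments = partition_by_prominence(toy_scores)

print(segments["high"])         # ['e0', 'e1']
print(len(segments["medium"]))  # 9
print(len(segments["low"]))     # 9
```

For a global partitioning across several datasets, the scores of all entities of all datasets would be merged into one dictionary before calling the function.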

2.5. Likelihood of confusion (level of ambiguity)

Since a surface form might denote multiple meanings, and an entity might in turn be represented by different surface forms, the likelihood of confusion is a measure for the level of ambiguity of one surface form or entity. It was first proposed in [38] for surface forms. The authors pointed out that the true likelihood of confusion is always unknown because an exhaustive collection of all named entities is missing.

Fig. 2.

The likelihood of confusion for a surface form is determined by the total number of possible entities known to some annotating system and a dataset, $E^{D} \cup W_{E}$.

The likelihood of confusion needs some considerations beforehand. It can be determined for both sides of an annotation $a = (s, e, i, l)$: for a surface form s and its possible links to some entities E, and for an entity e and its possible corresponding surface forms S.

We define a dictionary of an annotating system by $W_{E}$ which is a mapping $W_{E} : S \to E$ .

As shown in Fig. 2, the text ‘…Bruce…’ (lower box) has an annotation with ‘Bruce’ as surface form s. This surface form can be linked to different entities, i.e. they are homonyms, thus exhibiting the same spelling but different meanings. As shown in the figure, an entity can belong to the dataset, or it can be unknown to the dataset but known to the annotating system. The entity can also be unknown to both.

For the other side we define a dictionary of the annotating system $W_{S}$ which is a mapping $W_{S} : E \to S$ .

Figure 3 shows the other side, where the text annotation has dbr:Bruce_Willis as an entity. This entity can be linked to multiple possible surface forms, which are synonyms. Again, a surface form can be known to both the dataset and the annotating system, or unknown to one of them or both.

Fig. 3.

The likelihood of confusion for an entity mention is the number of possible related surface forms shown in light blue.

As already mentioned, a surface form s or an entity e can be placed within four possible locations:

Unknown to dictionary and dataset: $e \notin E^{D} \cup W_{E}$ or $s \notin S^{D} \cup W_{S}$.

Only known to the dataset: $e \in E^{D} \setminus W_{E}$ or $s \in S^{D} \setminus W_{S}$.

Only known to the dictionary: $e \in W_{E} \setminus E^{D}$ or $s \in W_{S} \setminus S^{D}$.

Known to dictionary and dataset: $e \in E^{D} \cap W_{E}$ or $s \in S^{D} \cap W_{S}$.

The example annotation system dictionaries $W_{E}$ and $W_{S}$ used for the experiments have been compiled from DBpedia entities’ labels, redirect labels, disambiguation labels, and foaf:names, if available.

For a dataset and a dictionary, the average likelihood of confusion is determined for surface forms ${lc}^{sf} : (D, W) \to \mathbb{R}^{+}$ with:

$$\begin{aligned} lc_{micro}^{sf}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} |W_{E}(s) \cup E^{D}(s)|}{|d_{a}|}}{|D|}, \\ lc_{macro}^{sf}(D, W) &= \frac{\sum_{s \in S^{D}} |W_{E}(s) \cup E^{D}(s)|}{|S^{D}|}. \end{aligned} \tag{5}$$

The intuition is, the more entities exist per surface form, the larger is the likelihood of confusion ${lc}^{sf}$ .

The average likelihood of confusion for entities ${lc}^{e} : (D, W) \to \mathbb{R}^{+}$ is:

$$\begin{aligned} lc_{micro}^{e}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} |W_{S}(e) \cup S^{D}(e)|}{|d_{a}|}}{|D|}, \\ lc_{macro}^{e}(D, W) &= \frac{\sum_{e \in E^{D}} |W_{S}(e) \cup S^{D}(e)|}{|E^{D}|}. \end{aligned} \tag{6}$$

Here the intuition is, the more surface forms exist per entity, the larger is the likelihood of confusion ${lc}^{e}$ .

Again, an annotation within a dataset contains a surface form and an entity. For each side (surface form or entity) the likelihood of confusion is determined by counting the elements belonging to this particular side.

The measures should roughly indicate the difficulty distribution of a dataset.
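The surface form side of the measure, Eq. (5), can be sketched in Python as follows. The sketch rests on several assumptions: an annotation is reduced to a pair (surface form, entity), the dictionary $W_{E}$ is a plain Python dict from surface forms to sets of entities, and documents without annotations are skipped in the per-document variant, since Eq. (5) is undefined for $| d_{a} | = 0$. All names and the toy data are hypothetical. The entity side, Eq. (6), is obtained by swapping the roles of surface forms and entities.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Annotation = Tuple[str, str]  # (surface form, entity)


def entities_per_surface_form(docs: List[List[Annotation]]) -> Dict[str, Set[str]]:
    """E^D(s): all entities a surface form is linked to within the dataset."""
    e_d: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        for surface_form, entity in doc:
            e_d[surface_form].add(entity)
    return e_d


def candidates(s: str, w_e: Dict[str, Set[str]], e_d: Dict[str, Set[str]]) -> int:
    """|W_E(s) ∪ E^D(s)|: candidates known to the dictionary or the dataset."""
    return len(w_e.get(s, set()) | e_d[s])


def lc_sf_over_surface_forms(docs, w_e) -> float:
    """Second line of Eq. (5): average over the distinct surface forms S^D."""
    e_d = entities_per_surface_form(docs)
    return sum(candidates(s, w_e, e_d) for s in e_d) / len(e_d)


def lc_sf_per_document(docs, w_e) -> float:
    """First line of Eq. (5): average of per-document averages."""
    e_d = entities_per_surface_form(docs)
    annotated = [doc for doc in docs if doc]  # skip empty documents (assumption)
    per_doc = [
        sum(candidates(s, w_e, e_d) for s, _ in doc) / len(doc)
        for doc in annotated
    ]
    return sum(per_doc) / len(per_doc)


# Toy data (assumption): 'Bruce' is linked to two entities in the dataset,
# one of which is missing from the dictionary.
toy_docs = [
    [("Bruce", "dbr:Bruce_Willis"), ("Paris", "dbr:Paris")],
    [("Bruce", "dbr:Bruce_Lee")],
]
toy_w_e = {
    "Bruce": {"dbr:Bruce_Willis", "dbr:Bruce_Springsteen"},
    "Paris": {"dbr:Paris", "dbr:Paris_Hilton"},
}

print(lc_sf_over_surface_forms(toy_docs, toy_w_e))  # 2.5
print(lc_sf_per_document(toy_docs, toy_w_e))        # 2.75
```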

2.5.1. Dominance (level of diversity)

In [38] the dominance was introduced as a measure of how commonly a specific surface form is really meant for an entity with respect to other possible surface forms. A low dominance in a dataset leads to a low variance for an automated disambiguation system and to possible over-fitting. Similar to the likelihood of confusion, the true dominance remains unknown. Again, in addition to the work presented in [38] we estimate dominance for both sides of an annotation $a = (s, e, i, l)$ : for the entities as well as surface forms. For an entire dataset and a dictionary, the average dominance is also determined in both directions.

For example, let there be 4 different surface forms for the entity dbr:Angelina_Jolie in the dataset, while the dictionary provides 10 surface forms overall. This results in a 40% dominance of the entity dbr:Angelina_Jolie in the considered dataset. The dominance of an entity determines how many different surface forms of this entity are used in the dataset (synonyms).

As an example for the other side, for the given surface form ‘Anna’ the dictionary provides 10 different entities, while the dataset only uses 2 entities for different mentions of the surface form ‘Anna’, which results in a 20% dominance of ‘Anna’ for the dataset under consideration. The dominance of a surface form determines how many different entities are used with this surface form in the dataset (homonyms). It indicates the variance or flexibility of the used vocabulary and expresses the dependency on context. Dominance indicates the expressiveness of the used dataset: a more extensive dataset exhibits more diversity. The dominance of a dataset is closely related to the likelihood of confusion, since it describes the overlap between the dataset and the dictionary.

The average dominance for a dataset D is determined for all entities $E^{D}$ with ${dom}^{e} : (D, W) \to \mathbb{R}^{+}$ and for surface forms $S^{D}$ with ${dom}^{sf} : (D, W) \to \mathbb{R}^{+}$:

$$\begin{aligned} dom_{micro}^{sf}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} \frac{|E^{d}(s)|}{|W_{E}(s)|}}{|d_{a}|}}{|D|}, \\ dom_{macro}^{sf}(D, W) &= \frac{\sum_{s \in S^{D}} \frac{|E^{D}(s)|}{|W_{E}(s)|}}{|S^{D}|}, \end{aligned} \tag{7}$$

$$\begin{aligned} dom_{micro}^{e}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} \frac{|S^{d}(e)|}{|W_{S}(e)|}}{|d_{a}|}}{|D|}, \\ dom_{macro}^{e}(D, W) &= \frac{\sum_{e \in E^{D}} \frac{|S^{D}(e)|}{|W_{S}(e)|}}{|E^{D}|}. \end{aligned} \tag{8}$$

Since the actual dominance is unknown and the completeness of the applied dictionaries cannot be guaranteed, computed values above the nominal threshold of 1.0 are possible. Such values indicate an incomplete dictionary, i.e. there are more patterns used in the dataset than the applied dictionary contains. The subsequently described maximum recall takes care of this aspect.
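The two worked examples and the case of values above 1.0 can be reproduced with a small Python sketch. It computes the per-item ratio that is averaged in Eqs. (7) and (8). The function name and the placeholder labels are hypothetical, and the sketch assumes that the item has at least one entry in the dictionary.

```python
from typing import Set


def dominance(used_in_dataset: Set[str], known_to_dictionary: Set[str]) -> float:
    """Ratio of items used in the dataset to items known to the dictionary.

    For an entity, the items are its surface forms; for a surface form,
    the items are its entities.
    """
    return len(used_in_dataset) / len(known_to_dictionary)


# Entity side: 4 of 10 known surface forms of dbr:Angelina_Jolie are used.
known_forms = {f"form_{i}" for i in range(10)}  # placeholder labels
used_forms = {f"form_{i}" for i in range(4)}
print(dominance(used_forms, known_forms))  # 0.4

# Surface form side: 2 of 10 known entities for 'Anna' are used.
known_entities = {f"dbr:Anna_{i}" for i in range(10)}  # placeholder IRIs
used_entities = {"dbr:Anna_0", "dbr:Anna_1"}
print(dominance(used_entities, known_entities))  # 0.2

# Incomplete dictionary: the dataset uses 3 forms, the dictionary knows 2.
print(dominance({"a", "b", "c"}, {"a", "b"}))  # 1.5
```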

2.5.2. Maximum recall

Most NEL approaches apply dictionaries to look up possible entity candidates matching a given surface form. If the dictionary does not contain an appropriate mapping for the surface form, the annotation system is unable to identify a possible entity candidate at all.

As Fig. 3 shows and as already mentioned before some parts of the dataset might not be contained within the dictionary. Surface forms not in the intersection are unlikely to be found by entity linking since the annotation systems are using dictionaries to look up potential relations. Therefore, an incomplete dictionary limits the performance of an NEL system since an unknown surface form will lead to a loss in precision. So the maximum recall can be seen as an artificial limit of a dataset.

To estimate the coverage of a mapping dictionary, the maximum recall measurement was introduced by [34].

For a dictionary $W_{S}$ and a dataset, the maximum recall is defined as $mr : (D, W) \to [0, 1]$ :

$$\begin{aligned} mr_{micro}(D, W) &= \frac{\sum_{d \in D} \left( 1 - \frac{|S^{d} \setminus W_{S}|}{|S^{d}|} \right)}{|D|}, \\ mr_{macro}(D, W) &= 1 - \frac{|S^{D} \setminus W_{S}|}{|S^{D}|}. \end{aligned} \tag{9}$$
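Eq. (9) can be sketched in Python as follows. The sketch assumes that each document is reduced to the set of its surface forms $S^{d}$ and that the dictionary is represented by the set of all surface forms it knows. Documents without surface forms are skipped in the per-document variant, because the ratio is undefined for them. The function names and the toy data are hypothetical.

```python
from typing import List, Set


def max_recall_pooled(docs: List[Set[str]], known_forms: Set[str]) -> float:
    """Second line of Eq. (9): coverage of the pooled surface forms S^D."""
    all_forms = set().union(*docs)
    return 1 - len(all_forms - known_forms) / len(all_forms)


def max_recall_per_document(docs: List[Set[str]], known_forms: Set[str]) -> float:
    """First line of Eq. (9): average of the per-document coverage."""
    annotated = [forms for forms in docs if forms]  # skip empty documents
    per_doc = [1 - len(forms - known_forms) / len(forms) for forms in annotated]
    return sum(per_doc) / len(per_doc)


# Toy data (assumption): 'Xyzzy' is the only surface form missing from
# the dictionary.
toy_docs = [{"Bruce", "Paris"}, {"Bruce", "Xyzzy"}]
toy_known = {"Bruce", "Paris", "Anna"}

print(max_recall_pooled(toy_docs, toy_known))
print(max_recall_per_document(toy_docs, toy_known))  # 0.75
```

A document whose surface forms are all covered by the dictionary reaches a value of 1.0, which is the criterion used when selecting documents with a high possible maximum recall.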

2.5.3. Types

Since some NEL approaches might be focused on a specific domain or handle some entity categories in a different way, a filter has been implemented to distinguish dataset entities by their type. Besides the focus of NEL approaches, it is also stated in [38] that entity types may differ in how difficult they are to disambiguate: person names (especially first names) might be more ambiguous, whereas country names are more or less unique. A type filter for some type T, with $E^{T}$ denoting the set of all entities of type T, is defined as $E^{D} : (T) \to E$ :

$$E^{D}(T) = \{ e \in E^{D} : e \in E^{T} \}. \tag{10}$$

Following these theoretical considerations, the extensions of the GERBIL framework and how the determined characteristics are exploited will be described in the subsequent sections.

3. Implementation

This section describes the implementation of the GERBIL extension and the stand-alone library. Furthermore, the vocabulary to integrate the calculated statistics into the NIF annotation model is explained in detail.

3.1. Extending GERBIL

Two new components have been implemented to extend the GERBIL framework: one component to filter and isolate subsets of the available datasets, and a second component to calculate aggregated statistics about the data (sub-)sets according to the newly introduced measures. It is important to mention that these filters and calculations can also be applied to newly uploaded datasets. Thus, the system can also be used to gain insights about any arbitrary ‘non-official’ datasets not yet part of the GERBIL framework. The implemented filter-cascade is of a generic type and can be adjusted via customized SPARQL queries. For example, to filter a dataset to only contain entities of type foaf:Person the following filter configuration has to be applied:

    name=Filter Persons
    service=http://dbpedia.org/sparql
    query=select distinct ?v where { values ?v {##} . ?v rdf:type foaf:Person . }
    chunk=50

Fig. 4.

Overview of the filter-cascade.

The name designates the filter in the GUI, service denotes an arbitrary SPARQL-endpoint, but also a local file encoded in RDF/Turtle can be specified to serve as the base RDF query dataset. The query is a SPARQL query that returns a list of entities to be kept in the filtered dataset. The ## placeholder will be replaced with the specific entities of the dataset. To avoid the size limits for SPARQL queries, the chunk parameter can be specified to split the query automatically in several parts for the execution. Any number of filters can be specified to be included in the analysis. With the flexibility of configuring SPARQL-queries, filters of any complexity or depth can be specified.
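The interplay of the ## placeholder and the chunk parameter can be illustrated with a short Python sketch. It is not the code of the GERBIL extension: the function name is hypothetical, and it is an assumption that the entity IRIs are wrapped in angle brackets and separated by spaces when they replace the placeholder.

```python
from typing import List

# Query template taken from the filter configuration above.
TEMPLATE = (
    "select distinct ?v where { values ?v {##} . "
    "?v rdf:type foaf:Person . }"
)


def build_chunked_queries(template: str, entity_iris: List[str], chunk: int) -> List[str]:
    """Create one query per chunk of entities, replacing the ## placeholder."""
    queries = []
    for start in range(0, len(entity_iris), chunk):
        part = entity_iris[start:start + chunk]
        values = " ".join(f"<{iri}>" for iri in part)
        queries.append(template.replace("##", values))
    return queries


# Toy input (assumption): 120 placeholder entity IRIs from a dataset.
toy_iris = [f"http://dbpedia.org/resource/E{i}" for i in range(120)]
toy_queries = build_chunked_queries(TEMPLATE, toy_iris, chunk=50)

print(len(toy_queries))  # 3
```

With chunk=50, the 120 entities are distributed over three queries with 50, 50, and 20 entities. The union of the three result lists then forms the filtered entity set.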

To partition the datasets according to entity prominence (popularity) we have additionally implemented a filter to segment the datasets in three subsets containing the top 10%, 10% to 55%, and 55% to 100% of the entities. This segmentation is applied to PageRank as well as HITS values separately.

Figure 4 shows a general overview of the filter cascade. The annotations produced by GERBIL are first cleaned of invalid IRIs. If the filter result for these annotations is already cached, it is returned. Otherwise, the set is chunked and passed to the defined filter.

Buttons have been added as new control elements to the A2KB, C2KB, and D2KB overview pages in GERBIL (cf. Fig. 5). The user now is able to choose between the classic view ‘no-filter’, the persons, places, organisations filter views, the PageRank/HITS top 10%, 10–55%, and 55–100% filter views, a comparison view, or a statistical overview. All implemented measures are visualized in GERBIL using HighCharts.⁶

⁶ http://www.highcharts.com/

The existing charts have also been replaced by the new chart API, since GERBIL was limited to a single chart type. The comparison view enables the user to view two filters at the same time as well as the average for all annotation systems on a specific filter. The overview shows several statistics for all datasets, e.g. the total number of types per filter, the density, and the likelihood of confusion on average and in total. A subset of these statistics is shown and discussed in Section 4. The extended source code is publicly available on GitHub.⁷ In addition, an online version of the system is available.⁸

⁷ https://github.com/santifa/gerbil/

⁸ http://gerbil.s16a.org/

Before discussing the dataset statistics as a result of the new GERBIL extension, the following section introduces the stand-alone-library for statistics calculation as well as the new vocabulary.

Fig. 5.

New dataset filters for A2KB experiments in the GERBIL user interface.

3.2. Library and vocabulary for dataset statistics

Following the considerations mentioned in the previous sections, the proposed measurements can also be calculated independently of GERBIL with a separate stand-alone library. The library consumes a NIF encoded input file, calculates the proposed statistics, and extends the NIF file with the newly determined information. Comprehensive documentation as well as the library source code is provided on GitHub.⁹

⁹ https://github.com/santifa/hfts

To serialize the calculated statistics generated by the GERBIL extension as well as by the library, a vocabulary has been defined with three layers to be integrated into the NIF model.

Table 2

Overview of the introduced properties and the corresponding measurements (ds stands for dataset level, doc for document level, and an for annotation level)

Measure	Property	Level
Not annotated	notAnnotated	ds
Density	microDensity	ds
	macroDensity	ds
	density	doc
Prominence	hits	an
	pagerank	an
Maximum recall	microMaxRecall	ds
	macroMaxRecall	ds
	maxRecall	doc
Likelihood of confusion	microAmbiguityEntities	ds
	macroAmbiguityEntities	ds
	ambiguityEntities	doc
	ambiguityEntity	an
	microAmbiguitySurfaceForms	ds
	macroAmbiguitySurfaceForms	ds
	ambiguitySurfaceForms	doc
	ambiguitySurfaceForm	an
Dominance	diversityEntities	ds
	diversitySurfaceForms	ds

The first layer refers to an entity mention, i.e. an annotation (e.g. a NIF phrase), with its corresponding text fragment. The second layer addresses the document (e.g. a NIF context) that provides the text in which the entity mentions are embedded. A third layer groups documents together to form a dataset. We introduce the hfts:Dataset class, which holds the documents via the hfts:referenceDocuments property. On the dataset level, 13 properties have been introduced, which hold the measurements for not annotated documents, density, maximum recall, dominance, and likelihood of confusion. Some of them come in a micro as well as a macro flavour, while others are only computed once.

On the document level 6 new properties have been introduced to cover density, likelihood of confusion, and maximum recall. The likelihood of confusion, prominence, and the types are also assigned on the entity mention level.
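The three layers can be pictured as a small nested data model. The following Python sketch is only an illustration and not the library's actual implementation: the class and method names are hypothetical, words are assumed to be whitespace-separated tokens, and the micro and macro flavours are assumed to follow the usual convention (macro averages the per-document values, micro pools all counts before dividing).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Annotation level (an): one entity mention, e.g. a NIF phrase."""
    surface_form: str
    entity_uri: str


@dataclass
class Document:
    """Document level (doc): the text (e.g. a NIF context) and its mentions."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def word_count(self) -> int:
        # Assumption: words are whitespace-separated tokens.
        return len(self.text.split())

    def density(self) -> float:
        # Number of annotations relative to the document length in words.
        words = self.word_count()
        return len(self.annotations) / words if words else 0.0


@dataclass
class Dataset:
    """Dataset level (ds): a group of documents."""
    documents: List[Document] = field(default_factory=list)

    def macro_density(self) -> float:
        # Macro flavour (assumed): mean of the per-document densities.
        if not self.documents:
            return 0.0
        return sum(d.density() for d in self.documents) / len(self.documents)

    def micro_density(self) -> float:
        # Micro flavour (assumed): pool all annotations and words first.
        words = sum(d.word_count() for d in self.documents)
        mentions = sum(len(d.annotations) for d in self.documents)
        return mentions / words if words else 0.0
```

With one short, densely annotated document and one longer, sparsely annotated document, the two flavours differ, which is why both are reported on the dataset level.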

Table 2 gives an overview of the introduced properties and their corresponding levels. Figure 6 shows an excerpt of the extended KORE50 dataset for the new dataset class, with the new dataset statistics expressed as RDF properties under the hfts: prefix. Figure 7 presents an example for the document level (nif:Context). In addition to the existing NIF data, the statistics have been serialized with the newly introduced hfts: properties. The entire definition and further documentation of the vocabulary are available on GitHub.¹⁰

¹⁰

hfts:< https://raw.githubusercontent.com/santifa/hfts/master/ont/hfts.ttl# >

Fig. 6.

An example of the new statistics properties on dataset level extending the KORE50 dataset.

Next, the possibility of remixing customized benchmark datasets will be explained including several examples.

Fig. 7.

An example of the new statistics properties on document level extending the KORE50 dataset.

3.3. Remixing customized datasets

The basic idea of remixing NEL benchmark datasets is to tailor new customized datasets from the existing ones by selecting documents based on desired emphases. This enables the compilation of focused benchmark datasets for NEL. For remixing, we propose to store all analysed datasets in a single RDF triple store, which allows quick access to the dataset documents via the SPARQL query language. In particular, SPARQL CONSTRUCT queries can be applied to select exactly those triples from the document annotations that meet a particular criterion, e.g. popular persons, a high possible maximum recall, places that are difficult to disambiguate, or any other criterion that can be expressed via SPARQL filter rules.

Fig. 8.

Basic query that selects only documents with a maximum recall ⩾1.0.

For this purpose, we introduce the basic query shown in Fig. 8. A CONSTRUCT statement creates RDF triples from the document annotations that meet the filter requirement maximumRecall ⩾ 1.0. This basic query operates on the entire RDF graph, so it may be useful to limit the number of documents returned by the query. For this task, a subquery can be applied, as shown in the second example in Fig. 9.
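The selection logic of these two queries (a threshold on the maximum recall and an optional limit on the number of documents) can be mirrored outside a triple store. The following Python sketch is a simplified in-memory stand-in, not the SPARQL queries of Figs. 8 and 9; the function name and the dictionary input are hypothetical.

```python
from typing import Dict, List, Optional


def select_by_max_recall(
    max_recall: Dict[str, float],
    threshold: float = 1.0,
    limit: Optional[int] = None,
) -> List[str]:
    """Return the ids of documents whose maximum recall is >= threshold.

    `max_recall` maps a (hypothetical) document id to its maximum recall.
    """
    # Sorting makes the selection reproducible. A SPARQL LIMIT without an
    # ORDER BY clause gives no such guarantee about which documents remain.
    selected = sorted(doc for doc, value in max_recall.items() if value >= threshold)
    return selected if limit is None else selected[:limit]
```

For example, `select_by_max_recall(recalls, threshold=1.0, limit=100)` would keep at most 100 documents in which every annotated entity can, in principle, be found.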

Another example is presented in Fig. 10. The SPARQL subselect chooses only documents that contain persons and aggregates their number. Subsequently, the CONSTRUCT statement selects documents that contain more than 4 persons with a maximum recall of at least 0.8.
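The two-step structure described here (aggregate first, then filter) can be sketched in Python as follows. This is an illustrative in-memory analogue of the query, not the query itself; the function name, the `"Person"` type label, and the input shapes are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def documents_with_many_persons(
    typed_annotations: Iterable[Tuple[str, str]],
    max_recall: Dict[str, float],
    more_than: int = 4,
    min_recall: float = 0.8,
) -> List[str]:
    """Select documents with more than `more_than` person annotations
    and a maximum recall of at least `min_recall`.

    `typed_annotations` yields (document id, entity type) pairs.
    """
    # Step 1 (the subselect): count the person annotations per document.
    persons = Counter(doc for doc, etype in typed_annotations if etype == "Person")
    # Step 2 (the outer filter): apply both thresholds.
    return sorted(
        doc
        for doc, count in persons.items()
        if count > more_than and max_recall.get(doc, 0.0) >= min_recall
    )
```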

Fig. 9.

This query additionally limits the number of selected documents.

To underline that any kind of filter can be applied, Fig. 11 shows a more specific example that uses a federated query to select only those documents from the RDF graph that contain persons born before 1970. To achieve this, the official DBpedia SPARQL endpoint is queried for additional information that is not present within the given benchmark datasets. More SPARQL examples can be found on GitHub.¹¹

¹¹

https://github.com/santifa/hfts/blob/master/Remix.md

Fig. 10.

Extract documents with a maximum recall of at least 0.8 and at least 4 persons.

Fig. 11.

A SPARQL query that selects documents containing persons born before 1970 via additional data queried from the DBpedia SPARQL endpoint.

When authoring arbitrary queries, two aspects should be considered. First, many values of the proposed measurements are given as absolute values and are not always equally distributed across the datasets, documents, and annotations. Hence, it is necessary to investigate the boundary values and the value distribution before choosing a specific threshold. Normalizing and harmonizing the statistics adequately is a subject of future work. Second, the proposed query examples operate on the document level. Therefore, if an annotation meets a requirement, the entire document together with all its annotations (which might not meet the requirement) is added to the result. Of course, queries can also be structured to return only the filtered annotations, but this might lead to a missing-annotation scenario, which in turn might result in a drop of recall for the A2KB task.
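The difference between the two selection granularities can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions: documents are a plain dictionary from a document id to a list of (entity, type) tuples, and both function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Annotation = Tuple[str, str]  # (entity URI, entity type)
Documents = Dict[str, List[Annotation]]


def select_document_level(docs: Documents, keep: Callable[[Annotation], bool]) -> Documents:
    """Keep whole documents, with all their annotations, if any annotation matches."""
    return {doc: list(anns) for doc, anns in docs.items() if any(keep(a) for a in anns)}


def select_annotation_level(docs: Documents, keep: Callable[[Annotation], bool]) -> Documents:
    """Keep only the matching annotations; documents without a match are dropped."""
    result: Documents = {}
    for doc, anns in docs.items():
        matching = [a for a in anns if keep(a)]
        if matching:
            result[doc] = matching
    return result
```

In the document-level variant a person filter still returns the places annotated in the same document, whereas the annotation-level variant removes them and thereby leaves mentions in the text without a gold annotation.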

Finally, the newly created dataset can be uploaded to the GERBIL platform for a precisely tailored evaluation experiment.

4. Statistics and results

This section presents the results of applying the proposed measures to the GERBIL datasets. Furthermore, it gives an in-depth overview of how to use the new library to partition the benchmark datasets according to different criteria and to analyze the systems' performance in much greater detail.

4.1. GERBIL datasets

The following datasets have been analyzed according to the characteristics introduced in Section 2: WES2015 [39], OKE2015 [21], DBpedia Spotlight [17], KORE50 [12], MSNBC [5], IITB [14], RSS500 [29], Micropost2014 [2], Reuters128 [29], ACE2004 [19], AQUAINT [18], and NEWS-100 [29]. In this section, only the most significant results are presented. A complete listing of the achieved results is available online.¹²

¹²
http://gerbil.s16a.org/

Figure 12 shows the percentage of documents in the GERBIL datasets that are not annotated. Overall, 5 datasets contain empty documents, and 3 of them show a significant share (i.e. >30%) of empty documents. For A2KB tasks, these datasets might lead to an increased false positive rate and thus might lower the potentially achievable precision of an annotation system. Therefore, empty documents might be excluded from evaluation datasets to enable a sound evaluation. However, it should be noted that these un-annotated documents are not necessarily mistakes; they may simply not contain any entities.
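As a minimal sketch, the share of un-annotated documents and the exclusion of such documents can be computed as follows. The function names and the input (a mapping from a document id to its number of annotations) are hypothetical and only serve the illustration.

```python
from typing import Dict, List


def not_annotated_share(annotation_counts: Dict[str, int]) -> float:
    """Fraction of documents that carry no annotation at all."""
    if not annotation_counts:
        return 0.0
    empty = sum(1 for count in annotation_counts.values() if count == 0)
    return empty / len(annotation_counts)


def drop_empty_documents(annotation_counts: Dict[str, int]) -> List[str]:
    """Ids of the documents that remain after excluding empty ones."""
    return sorted(doc for doc, count in annotation_counts.items() if count > 0)
```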

Fig. 12.

Percentage of documents without annotations in the GERBIL datasets.

Fig. 13.

Annotation density as the relative number of annotations with respect to document length in words.

Figure 13 shows the annotation density of the GERBIL datasets as the relative number of annotations with respect to document length in words. This serves as an estimate for potentially missing annotations, e.g. in the IITB dataset 27.8% of all terms are annotated. If a dataset is annotated rather sparsely (low values), it is likely that the A2KB task will result in a loss of precision, because the sparser the annotations, the higher the likelihood of missing annotations (as shown in Section 4.2.7). Especially for NEL tools based on machine learning, it should be considered whether a sparsely annotated dataset is appropriate for the training task. Of course, this strongly depends on the respective application. Nevertheless, it is arguable whether sparseness is problematic for A2KB, because all annotation systems face the same problem and the achieved results might therefore still be comparable.
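Why missing gold annotations cost precision can be seen in a small worked example. The Python sketch below uses a deliberately simplified matching (plain set membership of hypothetical annotation ids), which is an assumption and not GERBIL's actual matching procedure.

```python
from typing import Set, Tuple


def precision_recall_f1(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a set of gold and predicted annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The same system output is scored against a complete and a sparse gold standard.
predicted = {"a", "b", "c"}
complete_gold = {"a", "b", "c", "d"}
sparse_gold = {"a", "b"}  # "c" and "d" were never annotated

p_complete, r_complete, _ = precision_recall_f1(complete_gold, predicted)
p_sparse, r_sparse, _ = precision_recall_f1(sparse_gold, predicted)
```

Against the sparse gold standard the correct annotation "c" is counted as a false positive, so precision falls from 1.0 to 2/3 although the system output is unchanged.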

Table 3

Percentage of entities by entity type and entity popularity per dataset

Table 3 shows the distribution of entity types and entity prominence per dataset. A green (bold) label indicates the highest value and a red (italic) label the lowest value in each category. Since not all entities can be linked with a type or affiliated with a ranking, the values for each partition do not necessarily sum up to 100%. For each dataset the percentage of entities per category is denoted, e.g. of all the entities in the KORE50 dataset 47.1% are persons and 6.9% are places. In [34] it was demonstrated that there is a significant number of untyped entities in the DBpedia Spotlight and the KORE50 datasets. Therefore, an extra row for unspecified entities has been added to the table. The News-100 dataset exhibits the most unspecified entities because it is a German dataset and mostly contains annotations referring to the German DBpedia, whereas the analysis was based on the English DBpedia. The first partition (rows 1–4) can be considered an indicator of how specialized a dataset is. Thus, for the evaluation of an annotation system with a focus on persons, the KORE50 dataset with 45.1% of person annotations might be better suited than the IITB dataset with only 2.4% of person annotations. The second and third partitions (PageRank and HITS) show the entities categorized according to their popularity. It can be observed that many datasets are slightly unbalanced towards popular entities. A well-balanced dataset should exhibit a relation of 10%, 45%, 45% among the three subset categories.

Fig. 14.

Average number of surface forms (SF) per entity (blue, left) and average number of entities per surface form (red/hatched, right) indicating the likelihood of confusion for each dataset.

Fig. 15.

Average dominance for surface forms (blue) and entities (red/hatched) per dataset.

Figure 14 shows the average likelihood of confusion when disambiguating an entity or a surface form for several datasets. The blue bar (left) indicates the average number of surface forms that can be assigned to an entity, i.e. it refers to surface forms per entity (synonyms). The red/hatched bar (right) shows the average number of entities that can be assigned to a surface form, i.e. it refers to entities per surface form (homonyms). The figure clearly shows that KORE50 uses surface forms with a high number of potential entity candidates, i.e. it contains a large number of homonyms. Since this dataset is focused on persons, it is not surprising that surface forms representing first names, such as ‘Chris’ or ‘Steve’, can be associated with a large number of corresponding entity candidates. KORE50 was compiled with the aim of capturing hard-to-disambiguate mentions of entities, which is confirmed by these observations. ACE2004 exhibits the highest average number of surface forms per entity (35), i.e. it contains many synonyms.
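Both directions of the likelihood of confusion can be derived from a dictionary of (surface form, entity) pairs. The following Python sketch illustrates the idea; the function names and the toy dictionary layout are assumptions, and averaging over annotations (rather than over distinct entities or surface forms) is one possible reading of the measure.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (surface form, entity URI)


def build_index(dictionary: Iterable[Pair]):
    """Index the dictionary in both directions."""
    entities_per_sf: DefaultDict[str, Set[str]] = defaultdict(set)
    sfs_per_entity: DefaultDict[str, Set[str]] = defaultdict(set)
    for surface_form, entity in dictionary:
        entities_per_sf[surface_form].add(entity)
        sfs_per_entity[entity].add(surface_form)
    return entities_per_sf, sfs_per_entity


def average_confusion(annotations: List[Pair], dictionary: Iterable[Pair]) -> Tuple[float, float]:
    """Return (avg. entities per surface form, avg. surface forms per entity).

    The first value reflects homonymy, the second synonymy. Items that are
    missing from the dictionary contribute a count of zero (an assumption).
    """
    if not annotations:
        return 0.0, 0.0
    entities_per_sf, sfs_per_entity = build_index(dictionary)
    homonymy = sum(len(entities_per_sf.get(sf, ())) for sf, _ in annotations)
    synonymy = sum(len(sfs_per_entity.get(entity, ())) for _, entity in annotations)
    return homonymy / len(annotations), synonymy / len(annotations)
```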

In Section 4.2.2 a correlation analysis between the likelihoods of confusion for entities and surface forms with precision and recall is presented.

Figure 15 shows the average dominance of entities and surface forms in percent. The red/hatched bars show the average dominance of entities. The dominance of an entity expresses the relation between an entity’s surface forms used in the dataset with respect to all its existing surface forms in the dictionary. Referring to Fig. 15, the KORE50 dataset uses only 9% of the surface forms that are provided in the dictionary. This indicates also how well the dataset’s surface forms are covered by the dictionary’s surface forms.

On the other hand, the blue bars show the average dominance of surface forms. The dominance of a surface form expresses the relation of how many entities are using this surface form in the considered dataset and the overall number of entities in the dictionary using this surface form.

Referring to Fig. 15, the KORE50 dataset, in which many persons are annotated, uses only 7% of the possible entities for the contained surface forms. On average, entities are represented in the WES2015 dataset with 21% of their surface forms.

Since the datasets with a high likelihood of confusion have a low dominance, it is arguable that these two measures express opposing tendencies. For example, the KORE50 dataset has a high likelihood of confusion for surface forms, with 446 entities for one surface form on average. For a high dominance, each surface form would thus have to be represented by more than 400 entities within this dataset. Such a high dominance also means that a high coverage of surface forms (dominance of entities) or entities (dominance of surface forms) is present. For example, in the WES2015 dataset, which is focused on blog posts on rather specific topics, many rare entities (i.e. entities with a low popularity) with many different notations are used, resulting in a likelihood of confusion of 15 surface forms per entity on average. The average dominance of entities is quite high at 21%, since the likelihood of confusion is low and topic-specific blog posts often vary the surface forms for an entity to make the text more lively. This is commonly known from articles or essays, where the author usually tries to minimize frequent repetitions of a surface form by varying the surface form for the entity under consideration, in order to avoid monotony and to make the article more interesting to read. It might be concluded that a high dominance covers the diversity of natural language more precisely and therefore could be considered a means to prevent overfitting.

The News-100 dataset shows an anomaly in the dominance of entities, which is larger than 100 %. The reason for that is that the dataset contains a large number of entities from the German DBpedia. For these entities a surface form cannot be found in the dictionary (which was generated from the English DBpedia). That means, there are more surface forms present in the dataset than in the dictionary, which results in a dominance value larger than 100 %.
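The dominance ratios discussed above, including the case of values above 100%, can be sketched in a few lines of Python. This is a simplified illustration: the function name and input layout are hypothetical, and keys that do not occur in the dictionary at all are skipped because their ratio is undefined (an assumption of this sketch).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (surface form, entity URI)


def dominance(dataset: Iterable[Pair], dictionary: Iterable[Pair], of: str = "entity") -> Dict[str, float]:
    """Ratio of items used in the dataset to items known in the dictionary.

    of="entity":       surface forms used for an entity / its dictionary surface forms
    of="surface_form": entities used for a surface form / its dictionary entities
    """
    if of not in ("entity", "surface_form"):
        raise ValueError("of must be 'entity' or 'surface_form'")
    key, value = (1, 0) if of == "entity" else (0, 1)
    used, known = defaultdict(set), defaultdict(set)
    for pair in dataset:
        used[pair[key]].add(pair[value])
    for pair in dictionary:
        known[pair[key]].add(pair[value])
    # Keys without any dictionary entry are skipped (assumption, see lead-in).
    return {k: len(v) / len(known[k]) for k, v in used.items() if k in known}
```

If a dataset uses surface forms for an entity that the dictionary does not list, the numerator can exceed the denominator, which yields a dominance above 1.0 as observed for News-100.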

This section has introduced and discussed the results of the statistical dataset analysis. Based on the information embedded in the NIF dataset files, a customized reorganisation of datasets can be accomplished as explained in the following section.

4.2. Insights from remixing datasets

To gain more insight into the interplay between the annotation systems' performance and the introduced dataset characteristics, this section describes how the datasets are reorganized to determine each system’s performance with a focus on a given measure.

The approach is to first combine the datasets into one large dataset and then divide it into partitions. Each partition contains only those annotations or documents that lie in a specified interval of values of one of the proposed measures. For this purpose and to insert the statistical data into the NIF document the proposed library has been applied. Subsequently, the entire dataset was stored in an RDF triple store. With the SPARQL queries proposed in the previous sections, each partition was constructed and stored in a separate NIF document, which was submitted to the official GERBIL service to acquire the results.

For the conducted experiments the following public datasets ‘shipped’ with GERBIL have been used: DBpedia Spotlight, KORE50, Reuters128, RSS500, ACE2004, IITB, MSNBC, News100, AQUAINT. Other datasets were either not publicly available or not in the NIF format.

Since the official GERBIL service was used to conduct the experiments, the annotation systems it provides are included in the experiments. Unfortunately, not all systems returned consistent results, due to too many errors or insufficient availability. However, whenever a system provided sufficient results, it was included in the analysis.

The following annotation systems provided by GERBIL have been used: AGDISTIS [36], AIDA [13], Babelfy [20], DBpedia Spotlight [17], Dexter [3], Entityclassifier.eu [6], FOX [33], Kea [42], WAT [23] and PBOH [9].

The measures used in the subsequent experiments are those currently supported by the library (i.e. likelihood of confusion, HITS, PageRank, density, and number of annotations). In general, both A2KB and D2KB experiments might be applied. For likelihood of confusion, HITS, and PageRank only D2KB is provided, because these are characteristics of the annotations. The number of annotations, as an estimate for the size of the disambiguation context, is used with both A2KB and D2KB tasks, whereas density, as a characteristic of documents, is used with A2KB only. All data as well as the achieved results can be found online.¹³

¹³
https://github.com/santifa/hfts/blob/master/Results.md

Fig. 16.

Distribution of values (linear scale).

4.2.1. Value distribution and partitioning

Figure 16 presents the distribution of the data values over all datasets. In total, the combined dataset contains 16,821 annotations in 1,043 documents. The figure shows a distribution chart for each measure. On the charts, the x-axis shows the number of annotations (for confusions, HITS, PageRank) or documents (for density and number of annotations). The y-axis shows the absolute values of the measures. Each of the charts approximates a power-law distribution, i.e. only a few items exhibit large values and many items smaller values. For HITS and PageRank only 14,372 items are available, because for 2,449 entities no HITS or PageRank value could be determined.

Fig. 17.

Distribution of values (log scale).

We have decided to apply a decile partitioning, which seems a reasonable choice to indicate low, medium, and large as well as boundary values. When partitioning on the raw item values, an uneven distribution of items over the partitions occurs because of the power law, i.e. the first partition would contain a disproportionately large number of items and the last partition only very few. To achieve a more even distribution, a logarithmic scaling of the values is applied, as shown in Fig. 17. The red horizontal dashed lines indicate the partition boundaries. Table 4 shows for each measure the threshold values (thr) for the partition boundaries as well as the number of items per partition (qty). For HITS and PageRank an additional partition was introduced to also include the items without a value (unspec.). Each threshold is meant as the upper boundary of the partition; thus, the lower boundary is the threshold of the previous partition. The color coding in the background of the cells will be explained later.
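One way to realize such a partitioning is to split the range between the smallest and the largest value into ten equally wide intervals on a logarithmic scale. The Python sketch below follows this reading, which is an assumption about the procedure rather than the exact method used for Table 4; the function names are hypothetical, and non-positive values are ignored because their logarithm is undefined.

```python
import math
from typing import List, Sequence


def log_partition_thresholds(values: Sequence[float], partitions: int = 10) -> List[float]:
    """Upper boundaries of `partitions` equal-width bins in log10 space."""
    logs = [math.log10(v) for v in values if v > 0]
    if not logs:
        raise ValueError("at least one positive value is required")
    low, high = min(logs), max(logs)
    width = (high - low) / partitions
    return [10 ** (low + width * (i + 1)) for i in range(partitions)]


def assign_partition(value: float, thresholds: Sequence[float]) -> int:
    """Index of the first partition whose upper boundary covers the value."""
    for index, upper in enumerate(thresholds):
        if value <= upper:
            return index
    # Values beyond the last boundary (e.g. due to rounding) go to the last bin.
    return len(thresholds) - 1
```

Each returned threshold acts as the upper boundary of its partition, so the lower boundary of a partition is the threshold of the previous one.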

Table 4

Partitioning thresholds (log-based) and annotation/document quantities (this table is best viewed in color)

4.2.2. Likelihood of confusion of surface forms

Figure 18 shows the experimental results of each system for the likelihood of confusion of surface forms. Each graph shows the partitions (x-axis), as well as the determined $F_{1}$-measure ($f_{1}$), precision (p), and recall (r) for each partition. In the background the relative sizes of the partitions are indicated with boxes (see Table 4 for specific values).

Fig. 18.

Likelihood of confusion for surface forms (D2KB).

The likelihood of confusion for surface forms describes the number of entities mapping to one particular surface form. For an annotation in the dataset, a confusion of 30 signifies that 30 possible entities for that surface form exist (homonymy).

The leftmost partition (0) contains the lower values, i.e. annotations whose surface forms have fewer entities mapping to them and therefore a lower likelihood of confusion. Typical examples are surface forms mentioning full names, e.g. ‘Britney Spears’, ‘Northwest Airlines’, or ‘JavaScript’. The rightmost partition (9) contains the larger values. It is expected that the annotations in the right partitions are more difficult to disambiguate, since they exhibit a larger likelihood of confusion. The first partition contains almost half of all values, indicating that for almost half of the annotations only one entity maps to the surface form. For the second to sixth partition a reasonably even distribution is given. Considering Table 4, only 40 items are in the rightmost partition. These include the names Allen, Bill, Bob, Carlos, David, Davis, Eric, Jan, John, Johnson, Jones, Karl, Kim, Lee, Martin, Mary, Miller, Paul, Robert, Ryan, Steve, Taylor, and Thomas.

This experiment was conducted as a disambiguation task (D2KB).¹⁴

¹⁴

http://gerbil.aksw.org/gerbil/experiment?id=201712060006

However, the Entityclassifier.eu system did not provide results for partitions 7, 8, and 9 (set to zero). WAT and PBOH created too many errors and have been excluded in this experiment.

Interpreting the figures in general, the presented graphs show a trend from the upper left to the lower right, meaning that the systems' performance decreases with growing likelihood of confusion. Many systems, except AIDA and Babelfy, fail on surface forms to which more than ca. 1,700 entities map (8th partition and above). Entityclassifier.eu, Dexter, and FOX show a very strong focus on precision at the expense of recall, which can also be seen in the further experiments.

It can be concluded that the fewer entities map to a particular surface form, the easier the disambiguation task seems to be. For surface forms with more than 1,700 potential entity candidates the reliability of the disambiguation might drop dramatically.

Fig. 19.

Likelihood of confusion for entities (D2KB).

4.2.3. Likelihood of confusion of entities

Figure 19 shows the experimental results of each system for the likelihood of confusion of entities. The graphs are presented in the same way as for the previous measure. The likelihood of confusion for entities describes how many surface forms the entity of an annotation maps to. For an annotation, a confusion of 30 means that 29 surface forms besides the one within the annotation share the same entity.

The leftmost partition (0) contains the lower values, i.e. annotations with entities mapping to only one surface form. The rightmost partition (9) contains annotations with entities mapping to more than 361 surface forms, e.g. dbp:United_States. The number of items across the partitions is more evenly distributed than for the previous measure.

This experiment was conducted as a disambiguation task (D2KB).¹⁵

¹⁵
http://gerbil.aksw.org/gerbil/experiment?id=201712050002

All participating systems except WAT and PBOH returned valid results, although Entityclassifier.eu returned several faulty results.

In general, there is an upward trend, i.e. the more surface forms are available for an entity, the better the results. However, almost all systems have in common that the performance is markedly lower on the first partition (0) than on the second partition (1). A closer look at the partition data revealed that a large share of the entities in partition 0 are resources originating from Wikipedia redirect and disambiguation pages (e.g. dbp:Diesel, dbp:Thermoelectricity). Typically, these resources only map to a single surface form, which is why they occur in partition 0. Presumably, the systems do not annotate redirect and disambiguation resources, since they prefer to use the main resource and not resources directing to it. Some datasets show a drop at partition 7, but the partition data does not show obvious anomalies. Since we can only access the performance values provided by the GERBIL experiments, and not the actual results of the annotation systems, it is currently impossible to investigate this further.

Overall, it can be concluded that the more surface forms an entity maps to, the better the systems perform. Furthermore, datasets containing a larger number of redirect and disambiguation resources can bias the systems' performance. Future work will repeat this analysis without this bias to gain insight into how well the systems really perform on the first partition.

Fig. 20.

Results for PageRank (D2KB).

4.2.4. PageRank

Figure 20 shows the systems' performance with respect to popularity as estimated via PageRank values. Here, an additional partition (partition 0) is included on the left of the graphs, showing the results for the 2,449 annotations for which no PageRank was given. For all other partitions, the PageRank values increase from left to right; thus, popular entities can be found on the right-hand side. The distribution of values across the partitions is reasonably even.

The experiments were conducted as D2KB tasks.¹⁶

¹⁶
http://gerbil.aksw.org/gerbil/experiment?id=201712060001

With the exception of Entityclassifier.eu and FOX, all systems returned error-free results. At the time these experiments were executed, the WAT system was also available. PBOH was not available.

In the graphs a general upward trend can be observed, i.e. popular entities are disambiguated better than unpopular entities. However, with the exception of AIDA and Babelfy, all systems struggle with extremely popular entities (partition 10). A look at the data revealed that the 146 annotations in this partition refer to only the 4 entities dbp:Germany, dbp:United_States, dbp:Americas and dbp:Animal. Part of the effect might come from the confusion of dbp:United_States and dbp:Americas. Therefore, partition 10 might not be sufficiently representative. The entities with the largest PageRanks (e.g. from partition 8) mostly refer to countries and popular locations as well as to the entity dbp:Insect.

In conclusion, a positive correlation (>0.7) between the PageRank values and the systems performances can be observed. It seems likely that popular entities are used much more frequently, while being described via many varying surface forms.

Fig. 21.

Results for HITS (D2KB).

4.2.5. HITS

Similarly to PageRank, HITS values were not provided for all entities, thus partition 0 contains the annotations with unspecified values (see Fig. 21). For the other partitions the HITS values are increasing from left to right. According to Table 4, partition 2 contains only very few annotations (19). The other partitions contain a more representative number of items.

Again, the experiments were conducted as D2KB tasks.¹⁷

¹⁷
http://gerbil.aksw.org/gerbil/experiment?id=201712060011

However, Entityclassifier.eu, WAT, and PBOH produced too many faulty results and had to be excluded from the evaluation.

The HITS analysis reveals that for very low values (partition 1) and higher values (partition 6 and upwards) the systems provide better results than for the medium values (partitions 2–5). There is a weak correlation between HITS and the systems' performance (>0.4). This could be interpreted as follows: with increasing partition number there are more entities with higher popularity, which might lead to better disambiguation results.

Fig. 22.

Results for number of annotations (D2KB).

Fig. 23.

Results for number of annotations (A2KB).

4.2.6. Number of annotations

Figures 22 and 23 show the results for the number of annotations measure. This measure is not to be interpreted as a quality of the annotations but of the documents. Table 4 shows that more than half (595) of the 1,043 documents contain exactly 3 annotations, indicated by partition 1. Only 20 documents contain fewer annotations (partition 0). The number of annotations also corresponds to the size of the ‘disambiguation context’.

For this measure both experiment types, D2KB¹⁸ (Fig. 22) and A2KB¹⁹ (Fig. 23), were conducted. For the A2KB task, the AGDISTIS system was not available, because it is only capable of D2KB tasks. During the period of the D2KB experiments the PBOH system was also available. Entityclassifier.eu produced several errors, but overall the results seem to be valid. WAT was not available.

¹⁸
http://gerbil.aksw.org/gerbil/experiment?id=201711280011

¹⁹
http://gerbil.aksw.org/gerbil/experiment?id=201711280030

In Fig. 22 (D2KB) it can be observed that some systems are not robust against growing context size, e.g. AGDISTIS, AIDA, Entityclassifier.eu, and FOX. The other systems exhibit a more or less constant behaviour. The annotation tasks (A2KB) presented in Fig. 23 confirm this observation. Almost every system increases precision with growing context sizes, but at the expense of recall. This drifting apart occurs between the 4th and 6th partition (16 to 50 annotations per document). KEA seems to strongly benefit from increasing context sizes, while FOX benefits from smaller context sizes.

Fig. 24.

Results for density (A2KB).

4.2.7. Density

The results for the density measure are presented in Fig. 24. Density also is a quality of the documents and not of their annotations. Low density (left hand partitions) signifies that a longer document has only a few annotations. High density (right hand partitions) on the other hand signifies that a document contains many annotations relative to its length.

For density the experiments were conducted as A2KB tasks.²⁰

²⁰
http://gerbil.aksw.org/gerbil/experiment?id=201712050010

All participating systems except PBOH provided valid results.

From the presented graphs it can be observed that on low-density documents the systems achieve high recall but comparably low precision. Dense documents, on the other hand, are annotated with higher precision but lower recall. While Babelfy performs more or less evenly across the partitions, KEA seems to also maintain recall with denser documents. The break-even point between precision and recall is located between the 4th and 6th partition (density between 0.055 and 0.133).

The density only estimates the number of missing annotations, and the correlation between this metric and precision and recall supports this to some extent. However, it is important to also take into account the reasons for sparsity. Sparsity in the annotations can also stem from a specific combination of a knowledge base and documents. Very domain-specific documents with little coverage in the knowledge base will often be sparsely annotated, even if the annotation is complete with respect to the knowledge base. This limits the utility of the metric. It would be interesting to assess whether the density can be put in relation to the dominance of entities and surface forms in order to reduce domain and knowledge base dependencies.

Table 5

Micro-f₁ results of D2KB systems for different remixed datasets

4.2.8. General results

Table 5 shows the achieved micro-f₁ results of the systems for the D2KB task. The top row indicates the original GERBIL results²¹ (No Filter). Top results are indicated in green (bold) and the lowest results in red (italic). Each row shows the results for the dataset filtered according to a specific criterion. The second column shows the number of remaining annotations in the dataset after filtering. The penultimate column shows the average over the systems, and the last column the Pearson correlation of the current row with the first row. Unfortunately, the WAT system did not produce usable results and had to be excluded.

²¹
http://gerbil.aksw.org/gerbil/experiment?id=201711230013
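The correlation between a filtered row and the baseline row can be reproduced with the standard Pearson formula. The Python sketch below is a plain implementation of that formula; the function name is hypothetical, and any score vectors passed to it in an example are illustrative placeholders, not values from Table 5.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / (spread_x * spread_y)
```

Each vector would hold one micro-f₁ value per system, in the same system order for both rows; a value close to 1 then means that the filtered dataset ranks the systems much like the unfiltered baseline does.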

For persons,²² organizations²³ and places²⁴ the results achieved by the systems are rather similar, but do not perfectly correlate with the baseline (first row). For persons and organizations PBOH seems to be the best system. KEA produces the best results for places and for the entities not falling into these categories (others). The others category strongly correlates with the baseline.

²²
http://gerbil.aksw.org/gerbil/experiment?id=201711280013

²³
http://gerbil.aksw.org/gerbil/experiment?id=2017112800143

²⁴
http://gerbil.aksw.org/gerbil/experiment?id=201711280015

The next 2 rows separate the annotations into a dataset containing entities with an itsrdf:taClassRef statement (with Classes²⁵) and one without (without Classes²⁶). The first dataset correlates very strongly with the baseline. For the annotations without class assignment the correlation is less clear; furthermore, the annotation performance was comparably low.

²⁵
http://gerbil.aksw.org/gerbil/experiment?id=201711280028

²⁶
http://gerbil.aksw.org/gerbil/experiment?id=201711280020

Another filtering was performed according to the membership of entities in typical classes of three different domains: Music,²⁷ Science,²⁸ and Movie/TV.²⁹ In every domain a different system performed best. The Pearson value for Music indicates a lower correlation.

²⁷
http://gerbil.aksw.org/gerbil/experiment?id=201712060008 http://gerbil.aksw.org/gerbil/experiment?id=201712110000

²⁸
http://gerbil.aksw.org/gerbil/experiment?id=201712060009 http://gerbil.aksw.org/gerbil/experiment?id=201712110001

²⁹
http://gerbil.aksw.org/gerbil/experiment?id=201712060007

The last four rows show datasets filtered according to thresholds of the proposed measures. For the first, we removed the first and last decile partitions to avoid bias caused by disambiguation and redirect resources, too popular and unpopular entities, entities without information about PageRank and HITS, extremely short and large contexts, and extreme homonyms and synonyms (likelihood of confusion). Furthermore, the density was restricted to a moderate level around the break-even points between precision and recall to avoid major bias caused by extremely high and low density. The filtered dataset is denoted as the ‘low skew’ dataset.³⁰

³⁰

http://gerbil.aksw.org/gerbil/experiment?id=201712100002

The dataset contains 765 annotations in 118 documents. Considering Table 4, a grey cell background indicates that this partition was not included in the ‘low skew’ dataset.
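The decile-based restriction can be illustrated with a minimal sketch. It assumes that every annotation is a dictionary carrying a numeric value for the measure in question and that partitions are equally sized rank-based deciles; the function names and the `pagerank` key are hypothetical and not taken from the library described in this paper.

```python
from typing import Dict, List


def decile_partition(values: List[float]) -> List[int]:
    """Assign each value a decile index 0..9 by rank (equally sized bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    deciles = [0] * len(values)
    for rank, idx in enumerate(order):
        deciles[idx] = min(9, rank * 10 // len(values))
    return deciles


def keep_middle_deciles(annotations: List[Dict], measure: str) -> List[Dict]:
    """Drop annotations whose measure value falls into the first or last decile."""
    deciles = decile_partition([a[measure] for a in annotations])
    return [a for a, d in zip(annotations, deciles) if 1 <= d <= 8]


# Hypothetical input: 20 annotations with made-up popularity scores.
annotations = [{"id": i, "pagerank": float(i)} for i in range(20)]
low_skew = keep_middle_deciles(annotations, "pagerank")
```

Applying such a filter once per measure and intersecting the results would correspond to the combined restriction described above; the actual partition boundaries are those of Table 4.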

Conversely, the ‘high skew’ dataset³¹ comprises all annotations that fall into the intersection of the opposite filters (grey cells of Table 4). This results in only 66 annotations in 22 documents.

³¹ http://gerbil.aksw.org/gerbil/experiment?id=201712100003

Table 5 shows that the results for the ‘low skew’ dataset are overall better than those for the ‘high skew’ dataset. Surprisingly, however, three systems (KEA, AGDISTIS, Dexter) achieve a larger f-measure on the ‘high skew’ dataset than on the ‘low skew’ dataset. With 0.898, the Pearson value suggests a slightly better correlation with the baseline for the ‘low skew’ dataset than for the ‘high skew’ dataset with 0.866.
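For readers who want to reproduce such a comparison, the following sketch computes a Pearson correlation coefficient between two score vectors. It assumes that the Pearson value relates the per-system f-measures on a filtered dataset to the per-system f-measures of the baseline; the numbers below are invented for illustration and are not taken from Table 5.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant vectors")
    return cov / (norm_x * norm_y)


# Made-up f-measures of four hypothetical systems (not from Table 5).
baseline = [0.55, 0.60, 0.42, 0.71]
filtered = [0.58, 0.66, 0.40, 0.75]
r = pearson(baseline, filtered)
```

A system without results has to be dropped from both vectors before computing the coefficient, which is one reason why the value becomes less informative when systems are missing.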

For the ‘high skew’ dataset, 66 annotations might not be very representative, but applying all restrictions resulted in this rather small dataset. To increase its size, we relaxed the restrictions slightly and created the ‘medium skew’ dataset.³²

³² http://gerbil.aksw.org/gerbil/experiment?id=201804090002

For this dataset the filters were not applied all at once. Following a ‘leave-one-out’ principle, one dataset was created for each of the 6 filters (see header of Table 4): the dataset for a particular filter was restricted only by the other filters. Finally, the resulting sets were joined, yielding 595 annotations. The results presented in Table 5 show that the systems performed similarly to the ‘high skew’ results. Four systems performed slightly better, the results of three diminished. Unfortunately, we were not able to produce results for Dexter and Entityclassifier.eu; thus, the correlation coefficient is not informative. In general, the results of both skewed datasets (‘high skew’ and ‘medium skew’) should be treated with caution, since they are created on purpose from outliers and very likely contain bias. Consequently, these results are not trustworthy, and a system performing well on the ‘high skew’ datasets, e.g. KEA, need not perform well overall. However, PBOH performs best on the ‘low skew’ dataset, and this seems to be an objective and reliable result.
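The ‘leave-one-out’ construction can be sketched as follows. The sketch assumes that each of the opposite (outlier) filters is available as a predicate that returns True when an annotation falls into the grey partitions of the respective measure; the predicates and annotation fields below are hypothetical placeholders.

```python
from typing import Callable, Dict, List

Annotation = Dict[str, bool]
Filter = Callable[[Annotation], bool]


def high_skew(annotations: List[Annotation], filters: List[Filter]) -> List[Annotation]:
    """Intersection: keep annotations accepted by all outlier filters."""
    return [a for a in annotations if all(f(a) for f in filters)]


def medium_skew(annotations: List[Annotation], filters: List[Filter]) -> List[Annotation]:
    """Union of the leave-one-out sets: for each filter, apply only the others."""
    selected = {}
    for skipped in range(len(filters)):
        others = [f for i, f in enumerate(filters) if i != skipped]
        for idx, annotation in enumerate(annotations):
            if all(f(annotation) for f in others):
                selected[idx] = annotation
    return [selected[i] for i in sorted(selected)]


# Hypothetical example with three instead of six filters.
filters = [lambda a: a["pr"], lambda a: a["hits"], lambda a: a["conf"]]
annotations = [
    {"pr": True, "hits": True, "conf": True},    # outlier for every measure
    {"pr": True, "hits": True, "conf": False},   # outlier for all but one
    {"pr": True, "hits": False, "conf": False},  # outlier for one measure only
]
strict = high_skew(annotations, filters)
relaxed = medium_skew(annotations, filters)
```

By construction the relaxed set contains every annotation of the strict set plus those that miss exactly one filter, which is why it is larger.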

The last two remixed datasets are derived from the ‘low skew’ dataset. The first one was compiled with the intent to include only annotations that are comparatively ‘easy’ to disambiguate.³³ The other one includes annotations that are considered more ‘difficult’ to resolve.³⁴

³³ http://gerbil.aksw.org/gerbil/experiment?id=201712120003 http://gerbil.aksw.org/gerbil/experiment?id=201804020001

³⁴ http://gerbil.aksw.org/gerbil/experiment?id=201712120004

In Table 4, the green, orange, and white partitions belong to the easy dataset, whereas the red, orange, and white partitions belong to the difficult dataset. Thus, the easy dataset preferably contains annotations with more popular entities and a lower likelihood of confusion of entities and surface forms. For the difficult dataset, annotations with unpopular entities and a higher likelihood of confusion of entities and surface forms are considered. We did not further restrict the number of annotations and the density values compared to the ‘low skew’ dataset, because the resulting datasets would have been too small.

KEA performed well on the dataset that was considered easier, but not on the difficult dataset, where PBOH is ahead of all other systems. The average numbers of the easy and difficult datasets suggest that expectations have been fulfilled: the dataset considered more difficult is in fact more difficult to solve, and the easy dataset is easier to solve than the others. The results for the difficult dataset correlate only slightly with the overall results.

Fig. 25.

Annotation density as relative number of annotations respective document length in words.

4.2.9. Measures of remixed datasets

For a more detailed view on the data, the characteristics of the remixed datasets have been calculated and are presented in Figs. 25, 26, and 27.

Fig. 26.

Average number of surface forms (SF) per entity (blue, left) and average number of entities per surface form (red/hatched, right) indicating the likelihood of confusion for each dataset.

Figure 25 shows the density values of the remixed datasets. Since the datasets are filtered on annotation level and some annotations are therefore not included, it is to be expected that the density values are overall smaller than those of the unfiltered datasets (see Fig. 13). For the experiments conducted as D2KB tasks, the density does not influence the results. For A2KB tasks it might be more useful to remix on document level instead of annotation level.

Fig. 26 presents the likelihood of confusion. As expected, the difficult dataset contains a larger average number of entities per surface form, indicating more homonyms compared to the easy dataset. Furthermore, the number of surface forms per entity is smaller for the difficult dataset than for the easy dataset, indicating a smaller number of synonyms. We might conclude that items of the Science category are more difficult to disambiguate than items of the Place category. The ‘high skew’ category contains almost only one item per surface form and per entity, respectively. Revising the data revealed that with the filtering of this category (cf. Table 4, grey background cells) partition 9 for the likelihoods of confusion has been completely cleared out by the other restrictions (PageRank, HITS, etc.). Thus, it seems that there exist some dependencies between the measures.

Fig. 27 presents the dominance of entities and surface forms. In general, the dominance of entities is larger for the remixed datasets than for the original datasets. This is to be expected, because by filtering out annotations in the dataset (reducing $S^{D}$ and $E^{D}$), the remaining entities gain more dominance.

Fig. 27.

Average dominance for surface forms (blue) and entities (red/hatched) per dataset.

4.2.10. Dataset coverage

Table 6
Coverage of origin datasets and remixed datasets. For each origin dataset, the first line gives the number of annotations, the second line the share of the column's dataset, and the third line the share of the origin dataset.

Dataset	Value	Complete	Low skew	High skew	Med skew	Easy	Difficult
All datasets	annotations	16821	765	66	595	235	98
DBp. Spotl.	annotations	330	14	0	3	4	0
	% of column	1.96%	1.83%	0.00%	0.50%	1.70%	0.00%
	% of origin	-	4.24%	0.00%	0.91%	1.21%	0.00%
KORE50	annotations	144	3	2	7	1	0
	% of column	0.86%	0.39%	3.03%	1.18%	0.43%	0.00%
	% of origin	-	2.08%	1.39%	4.86%	0.69%	0.00%
MSNBC	annotations	650	137	0	5	16	34
	% of column	3.86%	17.91%	0.00%	0.84%	6.81%	34.69%
	% of origin	-	21.08%	0.00%	0.77%	2.46%	5.23%
IITB	annotations	11182	357	63	558	100	42
	% of column	66.48%	46.67%	95.45%	93.78%	42.55%	42.86%
	% of origin	-	3.19%	0.56%	4.99%	0.89%	0.38%
N3-RSS500	annotations	1000	0	1	12	0	0
	% of column	5.94%	0.00%	1.52%	2.02%	0.00%	0.00%
	% of origin	-	0.00%	0.10%	1.20%	0.00%	0.00%
N3-Reuters-128	annotations	880	89	0	5	26	11
	% of column	5.23%	11.63%	0.00%	0.84%	11.06%	11.22%
	% of origin	-	10.11%	0.00%	0.57%	2.95%	1.25%
ACE2004	annotations	253	3	0	4	1	0
	% of column	1.50%	0.39%	0.00%	0.67%	0.43%	0.00%
	% of origin	-	1.19%	0.00%	1.58%	0.40%	0.00%
News-100	annotations	1655	1	0	1	0	1
	% of column	9.84%	0.13%	0.00%	0.17%	0.00%	1.02%
	% of origin	-	0.06%	0.00%	0.06%	0.00%	0.06%
AQUAINT	annotations	727	161	0	0	87	10
	% of column	4.32%	21.05%	0.00%	0.00%	37.02%	10.20%
	% of origin	-	22.15%	0.00%	0.00%	11.97%	1.38%

To observe how the origin datasets are distributed over the remixed datasets, the following analysis was performed. Table 6 shows the coverage of the origin datasets (rows) and the remixed datasets (columns). The first data row shows the number of annotations in the remixed datasets. Column ‘Complete’ corresponds to the join of all origin datasets. The origin datasets are described row by row, and each origin dataset comprises three lines of numbers. The first line shows the number of annotations covered in the dataset of the column, e.g. the KORE50 dataset contains 144 annotations, of which 3 also belong to the ‘low skew’ dataset. The second line shows the relative share of the column, e.g. the KORE50 dataset contributes 0.86% to the ‘Complete’ set of annotations and 0.39% to the ‘low skew’ dataset. The third line relates the number of annotations of the column to the size of the origin dataset, meaning that e.g. 2.08% of the KORE50 dataset also belong to the ‘low skew’ dataset. Special aspects are highlighted through bold font.
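The three lines per origin dataset follow from simple ratios. The sketch below recomputes them for the KORE50 row using the counts reported in Table 6; the function and variable names are chosen for illustration only.

```python
from typing import Dict, Tuple


def coverage(origin: Dict[str, int], totals: Dict[str, int],
             column: str) -> Tuple[int, float, float]:
    """Return (count, % of the column's dataset, % of the origin dataset)."""
    count = origin[column]
    share_of_column = round(100 * count / totals[column], 2)
    share_of_origin = round(100 * count / origin["Complete"], 2)
    return count, share_of_column, share_of_origin


# Counts taken from Table 6.
totals = {"Complete": 16821, "Low skew": 765, "High skew": 66,
          "Med skew": 595, "Easy": 235, "Difficult": 98}
kore50 = {"Complete": 144, "Low skew": 3, "High skew": 2,
          "Med skew": 7, "Easy": 1, "Difficult": 0}

low = coverage(kore50, totals, "Low skew")    # (3, 0.39, 2.08)
high = coverage(kore50, totals, "High skew")  # (2, 3.03, 1.39)
```

The second value answers "how much of the remixed dataset stems from this origin dataset", the third "how much of the origin dataset ended up in the remixed dataset".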

The IITB dataset contributes almost two thirds of all annotations in the experiments, which also leads to a large coverage over the remixed datasets. 4.99% of its annotations fall into the ‘med skew’ category. IITB and KORE50 seemingly are the most ‘high skew’ datasets. However, the overall number of ‘high skew’ annotations is considerably low, so that no origin dataset appears to suffer from too much skew.

On the other hand, MSNBC and AQUAINT are considerably low skewed datasets. Over 20% of their annotations fall into that category.

With 11.97% of annotations AQUAINT has the largest fraction of easy annotations. The dataset with the largest relative number of difficult annotations is MSNBC with 5.23%. Surprisingly, the KORE50 dataset does not contribute to the difficult dataset at all, which contradicts KORE50’s creation intention.

In summary, the share of ‘high skew’ elements overall is rather small. There is no dataset that should be excluded in further evaluation experiments because it is completely ‘out of order’.

5. Conclusion

In this paper an extension of the GERBIL framework has been introduced to enable a more fine grained evaluation of NEL systems.

According to the predefined entity types, the KORE50 benchmark dataset contains the most persons, N3-Reuters-500 the most organizations, and ACE2004 the most places. The IITB dataset on the other hand contains almost no persons, organizations, or places. According to the PageRank algorithm the DBpedia Spotlight dataset contains the most prominent entities, while the Micropost 2014 Test dataset contains the most entities with medium and low prominence. N3-RSS contains the fewest popular and OKE 2015 gold standard the fewest medium and low prominence entities. The HITS value showed a more diverse picture with Micropost 2014 Train containing the most popular entities, MSNBC with the most medium prominence entities, and WES2015 with the most low prominence entities. On the other hand, IITB contains the fewest high prominence entities and OKE 2015 gold standard follows with the fewest medium prominence entities. N3-RSS-500 contains the fewest low prominence entities.

A stand-alone library has been introduced to enrich documents encoded in the NIF format with additional meta information. This enables researchers to remix existing NIF-based datasets according to their needs in a reproducible manner.

An exhaustive example was presented of how to use the library to reorganize datasets according to the measures introduced earlier. To this end, datasets were combined and partitioned to determine and visualize, for each system, correlations between a dataset property and the system’s performance. It was ascertained that systems fail on homonyms with a likelihood of confusion beyond ca. 1,700 entities mapping to the surface form. From the analysis of the entities’ likelihood of confusion, it was confirmed that redirect and disambiguation resources strongly bias the overall results. However, the overall performance increases the more surface forms an entity maps to. It was also shown that the PageRank of entities correlates with the systems’ performance, but only up to a certain threshold. Interestingly, for the HITS measure the systems produced poor results on low to medium values, but very good results on very low and larger values. It was further shown that not all systems are robust against a rising number of annotations to disambiguate in a text. Many systems tend to suffer a loss of recall with larger numbers of items to disambiguate. While FOX performs very well on smaller contexts, KEA benefits from larger numbers of annotations in a context. Finally, the density measure shows that text with rather few annotations can promote recall and demote precision very unevenly.

Furthermore, an overall comparison of different filtered datasets was given, including a focus on specific domains such as persons, organizations, places, music, science, and movies/TV. Although KEA and PBOH perform well in the majority of cases, they are not necessarily the best performing systems. Babelfy performs very well on the science domain; thus, there are domain-specific and dataset-structure-specific preferences across the systems. Therefore, it is of major importance always to take the characteristics of datasets into account for entity linking benchmarks.

It is impossible to define what a perfect ‘one for all’ dataset should look like. However, we attempted to compile at least one dataset that is almost free of the apparent biasing factors ascertained from the proposed measures. To determine the ‘difficulty’ of a dataset, the confusion and popularity measures seem to be appropriate, but only in combination with a moderate size of context and a balanced density. Extreme outliers should be avoided where possible. Redirect and disambiguation resources also distort the results considerably.

From the remixing we have learned that there are in fact domain differences in the performance of the annotation systems. The systems have their peculiarities with respect to the introduced measures, and there are differences in the quality of the datasets. However, we could not find evidence that the datasets under consideration contain a harmful number of inappropriate annotations.

Further biasing factors identified in the datasets are NIL (notInWiki) annotations and the mixture of language versions of DBpedia, as for example caused by including the News-100 dataset. Both should be taken into account in further versions of this work. Unfortunately, the applied online annotation systems were not always available. Moreover, it is not clear what the current development state of the systems is or how many systems exist that are not connected to GERBIL, which might also be worthwhile to include in further analysis.

Ongoing research is focused on the implementation of additional measures, such as those introduced by [10,24]. The performance breakdown of the annotation systems should also include the dominance and maximum recall measures. More datasets such as WES2015 and the Microposts series should be included in future versions.

Also, we would like to introduce difficulty levels for datasets along with new properties for annotations, which might be useful for further remixing, e.g. a distinction of the NEL annotation for common and proper nouns, or the dependency on temporal context. The inter-system agreement might also be a valuable measure to include in an evaluation.

The results of this work as well as the provided source code and the public online service make it possible to improve further benchmarks, to optimize systems at an unprecedented level of detail, and to find the right tool or method for the desired annotation task.

In summary, the evaluation at a finer-grained level allows a better understanding of the NEL process and also promotes the development of improved NEL systems.

Footnotes

Mathematical notation

$a = (s, e, i, l)$	An annotation with surface form s, entity e, text index i, and length l.
E	A set of entities.
$E^{D}$	Set of entities in dataset D.
$E^{d}$	Set of entities in a document d.
$E (s)$	Set of entities for the surface form s.
$W_{E}$	A mapping (dictionary) from surface forms to entities $W_{E} : S \to E$ of an annotation system.
$W_{E} (s)$	Set of entities in dictionary $W_{E}$ for surface form s.
S	Set of surface forms.
$S^{D}$	Set of surface forms in dataset D.
$S^{d}$	Set of surface forms in document d.
$S (e)$	Set of surface forms for the entity e.
$W_{S}$	A mapping (dictionary) from entities to surface forms $W_{S} : E \to S$ of an annotation system.
$W_{S} (e)$	Set of surface forms in the dictionary $W_{S}$ for entity e.
P	Arbitrary scoring algorithm (e.g. PageRank, HITS) to estimate popularity.

Formula overview

Average number of annotations:
$$na(D) = \frac{\sum_{d \in D} |d_{a}|}{|D|}. \tag{11}$$

Average number of not annotated documents:
$$nad(D) = \frac{|\{d : |d_{a}| = 0\}|}{|D|}. \tag{12}$$

Density of a document d:
$$density(d) = \frac{|d_{a}|}{|d_{t}|}. \tag{13}$$

Density of dataset D:
$$\begin{aligned} {density}_{micro}(D) &= \frac{\sum_{d \in D} density(d)}{|D|}, \\ {density}_{macro}(D) &= \frac{\sum_{d \in D} |d_{a}|}{\sum_{d \in D} |d_{t}|}. \end{aligned} \tag{14}$$

Set of entities with prominence in interval $[a, b]$ for a scoring algorithm P:
$$E_{a,b}^{D}(P) = \{e \in E^{D} : a \leqslant P(e) \leqslant b\}. \tag{15}$$
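Equations (11) to (14) translate directly into code. The following sketch assumes that a document is represented as a pair (number of annotations $|d_{a}|$, number of words $|d_{t}|$) and that every document contains at least one word; the representation and the function names are illustrative and not the data model of the library.

```python
from typing import List, Tuple

Document = Tuple[int, int]  # (|d_a| annotations, |d_t| words), assumed layout


def avg_annotations(docs: List[Document]) -> float:
    """Eq. (11): average number of annotations per document."""
    return sum(a for a, _ in docs) / len(docs)


def not_annotated_ratio(docs: List[Document]) -> float:
    """Eq. (12): fraction of documents without any annotation."""
    return sum(1 for a, _ in docs if a == 0) / len(docs)


def density_micro(docs: List[Document]) -> float:
    """Eq. (14), first line: mean of the per-document densities of Eq. (13)."""
    return sum(a / t for a, t in docs) / len(docs)


def density_macro(docs: List[Document]) -> float:
    """Eq. (14), second line: all annotations divided by all words."""
    return sum(a for a, _ in docs) / sum(t for _, t in docs)


# Made-up example: three documents.
docs = [(2, 10), (0, 10), (6, 20)]
```

The two density variants differ whenever document lengths vary: the first weights every document equally, the second weights every word equally.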

Average likelihood of confusion for all surface forms of dataset D:
$$\begin{aligned} {lc}_{micro}^{sf}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} |W_{E}(s) \cup E^{D}(s)|}{|d_{a}|}}{|D|}, \\ {lc}_{macro}^{sf}(D, W) &= \frac{\sum_{s \in S^{D}} |W_{E}(s) \cup E^{D}(s)|}{|S^{D}|}. \end{aligned} \tag{16}$$

Average likelihood of confusion for all entities of dataset D:
$$\begin{aligned} {lc}_{micro}^{e}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} |W_{S}(e) \cup S^{D}(e)|}{|d_{a}|}}{|D|}, \\ {lc}_{macro}^{e}(D, W) &= \frac{\sum_{e \in E^{D}} |W_{S}(e) \cup S^{D}(e)|}{|E^{D}|}. \end{aligned} \tag{17}$$

Dominance of surface forms:
$$\begin{aligned} {dom}_{micro}^{sf}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} \frac{|E^{d}(s)|}{|W_{E}(s)|}}{|d_{a}|}}{|D|}, \\ {dom}_{macro}^{sf}(D, W) &= \frac{\sum_{s \in S^{D}} \frac{|E^{D}(s)|}{|W_{E}(s)|}}{|S^{D}|}. \end{aligned} \tag{18}$$

Dominance of entities:
$$\begin{aligned} {dom}_{micro}^{e}(D, W) &= \frac{\sum_{d \in D} \frac{\sum_{a \in d_{a}} \frac{|S^{d}(e)|}{|W_{S}(e)|}}{|d_{a}|}}{|D|}, \\ {dom}_{macro}^{e}(D, W) &= \frac{\sum_{e \in E^{D}} \frac{|S^{D}(e)|}{|W_{S}(e)|}}{|E^{D}|}. \end{aligned} \tag{19}$$

Maximum recall:
$$\begin{aligned} {mr}_{micro}(S, W) &= \frac{\sum_{d \in D} \left(1 - \frac{|S^{d} \setminus W_{S}|}{|S^{d}|}\right)}{|D|}, \\ {mr}_{macro}(S, W) &= 1 - \frac{|S^{D} \setminus W_{S}|}{|S^{D}|}. \end{aligned} \tag{20}$$

Set of entities in dataset D with type T, where $E^{T}$ is the set of all entities with type T:
$$E^{D}(T) = \{e \in E^{D} : e \in E^{T}\}. \tag{21}$$
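As an illustration of the macro variants of the likelihood of confusion for surface forms, the dominance of surface forms, and the maximum recall, consider the following sketch. It assumes that a dataset is a list of (surface form, entity) pairs, that $W_{E}$ and $W_{S}$ are plain dictionaries of sets, that surface forms unknown to $W_{E}$ contribute 0 to the dominance, and that $S^{D} \setminus W_{S}$ denotes the dataset's surface forms that do not occur among the surface forms stored in $W_{S}$. These are interpretations made for the example, not definitions from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (surface form s, entity e)


def entities_per_surface_form(dataset: List[Pair]) -> Dict[str, Set[str]]:
    """E^D(s): entities annotated with surface form s in the dataset."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for s, e in dataset:
        result[s].add(e)
    return result


def lc_macro_sf(dataset: List[Pair], w_e: Dict[str, Set[str]]) -> float:
    """Macro likelihood of confusion over all surface forms of the dataset."""
    e_d = entities_per_surface_form(dataset)
    return sum(len(w_e.get(s, set()) | ents) for s, ents in e_d.items()) / len(e_d)


def dom_macro_sf(dataset: List[Pair], w_e: Dict[str, Set[str]]) -> float:
    """Macro dominance of surface forms (unknown surface forms count as 0)."""
    e_d = entities_per_surface_form(dataset)
    total = 0.0
    for s, ents in e_d.items():
        known = w_e.get(s, set())
        if known:  # assumption: skip the undefined ratio for unknown forms
            total += len(ents) / len(known)
    return total / len(e_d)


def mr_macro(dataset: List[Pair], w_s: Dict[str, Set[str]]) -> float:
    """Macro maximum recall: share of dataset surface forms known to W_S."""
    surface_forms = {s for s, _ in dataset}
    known = set().union(*w_s.values())
    return 1 - len(surface_forms - known) / len(surface_forms)


# Made-up toy data.
dataset = [("Paris", "dbr:Paris"), ("Paris", "dbr:Paris_Hilton"),
           ("Berlin", "dbr:Berlin")]
w_e = {"Paris": {"dbr:Paris", "dbr:Paris_Hilton", "dbr:Paris,_Texas",
                 "dbr:Paris_(mythology)"},
       "Berlin": {"dbr:Berlin"}}
w_s = {"dbr:Paris": {"Paris", "City of Light"}, "dbr:Berlin": {"Berlin"}}
```

In this toy example ‘Paris’ maps to four dictionary entities, two of which occur in the dataset, so its dominance is 0.5, while ‘Berlin’ is fully covered.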

References

1. Bhatia and Jain, Context sensitive entity linking of search queries in enterprise knowledge graphs, in: Proceedings of the International Semantic Web Conference 2016 (ESWC’16), Springer, Cham, 2016, pp. 50–54. doi:10.1007/978-3-319-47602-5_11.

2. A.E. Cano, Rizzo, Varga, Rowe, Stankovic and A.-S. Dadzie, Making sense of microposts: (#Microposts2014) named entity extraction & linking challenge, in: Proceedings of the 4th Workshop on Making Sense of Microposts Co-Located with the 23rd International World Wide Web Conference (WWW’14), CEUR Workshop Proceedings, Vol. 1141, 2014, pp. 54–60.

3. Ceccarelli, Lucchese, Orlando, Perego and Trani, Dexter: An open source framework for entity linking, in: Proceedings of the 6th International Workshop on Exploiting Semantic Annotations in Information Retrieval, ACM, 2013, pp. 17–20. doi:10.1145/2513204.2513212.

4. Cornolti, Ferragina and Ciaramita, A framework for benchmarking entity-annotation systems, in: Proceedings of the 22nd International Conference on World Wide Web (WWW’13), ACM, 2013, pp. 249–260. doi:10.1145/2488388.2488411.

5. Cucerzan, Large-Scale Named Entity Disambiguation Based on Wikipedia Data, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), ACL, 2007, pp. 708–716, available at http://www.aclweb.org/anthology/D/D07/D07-1074.

6. Dojchinovski and Kliegr, Entityclassifier.eu: Real-time classification of entities in text with Wikipedia, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, Vol. 8190, Springer, Berlin, Heidelberg, 2013, pp. 654–658. doi:10.1007/978-3-642-40994-3_48.

7. Dragoni, Cabrio, Tonelli and Villata, Enriching a small artwork collection through semantic linking, in: Proceedings of the International Semantic Web Conference (ESWC’16), Lecture Notes in Computer Science, Vol. 9678, Springer, Cham, 2016, pp. 724–740. doi:10.1007/978-3-319-34129-3_44.

8. Frontini, Brando and J.-G. Ganascia, Semantic web based named entity linking for digital humanities and heritage texts, in: 1st International Workshop Semantic Web for Scientific Heritage at the 12th ESWC 2015 Conference, CEUR-WS, Vol. 1364, Portorož, Slovenia, 2015, pp. 77–88.

9. O.-E. Ganea, Lucchi, Eickhoff and Hofmann, Probabilistic bag-of-hyperlinks model for entity linking, in: Proceedings of the 25th International Conference on World Wide Web (WWW’16), ACM, 2016, pp. 927–938. doi:10.1145/2872427.2882988.

10. Hachey, Nothman and Radford, Cheap and easy entity evaluation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), ACL, 2014, pp. 464–469. doi:10.3115/v1/P14-2076.

11. Hellmann, Lehmann, Auer and Brümmer, Integrating NLP using linked data, in: International Semantic Web Conference (ISWC’13), Lecture Notes in Computer Science, Vol. 8219, Springer, Berlin, Heidelberg, 2013, pp. 98–113. doi:10.1007/978-3-642-41338-4_7.

12. Hoffart, Seufert, D.B. Nguyen, Theobald and Weikum, KORE: Keyphrase Overlap Relatedness for Entity Disambiguation, in: 21st ACM International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 2012, pp. 545–554. doi:10.1145/2396761.2396832.

13. Hoffart, M.A. Yosef, Bordino, Fürstenau, Pinkal, Spaniol, Taneva, Thater and Weikum, Robust disambiguation of named entities in text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, ACL, 2011, pp. 782–792.

14. Kulkarni, Singh, Ramakrishnan and Chakrabarti, Collective Annotation of Wikipedia Entities in Web Text, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2009, pp. 457–466. doi:10.1145/1557019.1557073.

15. Ling, Singh and D.S. Weld, Design challenges for entity linking, Transactions of the Association for Computational Linguistics 3 (2015), 315–328.

16. J.L. Martinez-Rodriguez, Hogan and Lopez-Arevalo, Information extraction meets the semantic web: A survey, Semantic Web (accepted 2018; to appear), available at http://www.semantic-web-journal.net/system/files/swj1909.pdf. doi:10.3233/SW-180333.

17. P.N. Mendes, Jakob, García-Silva and Bizer, Dbpedia spotlight: Shedding Light on the Web of Documents, in: Proceedings of the 7th International Conference on Semantic Systems (I-Semantic’11), ACM, New York, NY, USA, 2011, pp. 1–8. doi:10.1145/2063518.2063519.

18. Milne and I.H. Witten, Learning to Link with Wikipedia, in: Proceedings of the 17th ACM International Conference on Information and Knowledge Management, ACM, New York, NY, USA, 2008, pp. 509–518. doi:10.1145/1458082.1458150.

19. Mitchell, Strassel, Huang and Zakhary, ACE 2004 multilingual training corpus LDC2005T09, web download, Linguistic Data Consortium, Philadelphia, 2005, available at https://catalog.ldc.upenn.edu/ldc2005t09.

20. Moro, Raganato and Navigli, Entity Linking meets Word Sense Disambiguation: A Unified Approach, Transactions of the Association for Computational Linguistics 2 (2014), 231–244.

21. A.G. Nuzzolese, A.L. Gentile, Presutti, Gangemi, Garigliotti and Navigli, Open Knowledge Extraction Challenge, in: Semantic Web Evaluation Challenge, CCIS, Vol. 548, Springer, Cham, 2015, pp. 3–15. doi:10.1007/978-3-319-25518-7_1.

22. Page, Brin, Motwani and Winograd, The PageRank citation ranking: Bringing order to the Web, Technical report, Stanford InfoLab, 1999, available at http://ilpubs.stanford.edu:8090/422/.

23. Piccinno and Ferragina, From TagME to WAT: A new entity annotator, in: Proceedings of the 1st International Workshop on Entity Recognition & Disambiguation (ERD’14), ACM, New York, NY, USA, 2014, pp. 55–62. doi:10.1145/2633211.2634350.

24. Pradhan, Luo, Recasens, E.H. Hovy, Ng and Strube, Scoring coreference partitions of predicted mentions: A reference implementation, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, 2014, pp. 30–35. doi:10.3115/v1/P14-2006.

25. Reddy, Knuth and Sack, DBpedia Graph Measures, Hasso Plattner Institute, Potsdam, Germany, 2014, available at http://s16a.org/node/6.

26. Rizzo, A.E.C. Basave, Pereira and Varga, Making sense of Microposts (#microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge, in: 5th Workshop on Making Sense of Microposts at 24th Int. World Wide Web Conference, CEUR-WS, Vol. 1395, 2015, pp. 44–53.

27. Rizzo and Troncy, NERD: A framework for unifying named entity recognition and disambiguation extraction tools, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, ACL, Stroudsburg, PA, USA, 2012, pp. 73–76.

28. Rizzo, van Erp and Troncy, Benchmarking the Extraction and Disambiguation of Named Entities on the Semantic Web, in: 9th Int. Conf. on Language Resources and Evaluation (LREC’14), European Language Resources Association (ELRA), 2014.

29. Röder, Usbeck, Hellmann, Gerber and Both, N³ – A collection of datasets for named entity recognition and disambiguation in the NLP interchange format, in: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC’14), European Language Resources Association (ELRA), 2014.

30. Röder, Usbeck and A.N. Ngomo, GERBIL – Benchmarking named entity recognition and linking consistently, Semantic Web 9(5) (2018), 605–625. doi:10.3233/SW-170286.

31. Shen, Wang and Han, Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions, IEEE Transactions on Knowledge and Data Engineering 27(2) (2015), 443–460. doi:10.1109/TKDE.2014.2327028.

32. Singhal, Introducing the knowledge graph: Things, not strings, Official Google Blog, May 2012.

33. Speck and A.-C.N. Ngomo, Named entity recognition using FOX, in: Proceedings of the ISWC 2014 Posters & Demonstrations Track Within the 13th International Semantic Web Conference (ISWC 2014), CEUR-WS, Vol. 1272, 2014, pp. 85–88.

34. Steinmetz, Knuth and Sack, Statistical Analyses of Named Entity Disambiguation Benchmarks, in: Proceedings of NLP & DBpedia 2013 Workshop at 12th International Semantic Web Conference, CEUR-WS, Vol. 1064, 2013.

35. Tietz, Waitelonis, Jäger and Sack, Smart media navigator: Visualizing recommendations based on linked data, in: Proceedings of the Industry Track at the International Semantic Web Conference 2014 Co-Located with the 13th International Semantic Web Conference (ISWC’14), CEUR-WS, Vol. 1383, 2014.

36. Usbeck, A.-C.N. Ngomo, Röder, Gerber, S.A. Coelho, Auer and Both, AGDISTIS – Graph-based disambiguation of named entities using linked data, in: Proceedings of the 2014 International Semantic Web Conference (ISWC’14), Lecture Notes in Computer Science, Vol. 8796, Springer, Cham, 2014, pp. 457–471. doi:10.1007/978-3-319-11964-9_29.

37. Usbeck, Röder, A.-C. Ngonga Ngomo, Baron, Both, Brümmer, Ceccarelli, Cornolti, Cherix, Eickmann, Ferragina, Lemke, Moro, Navigli, Piccinno, Rizzo, Sack, Speck, Troncy, Waitelonis and Wesemann, GERBIL – General entity annotation benchmark framework, in: Proceedings of the 24th International Conference on World Wide Web (WWW’15), International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2015, pp. 1133–1143. doi:10.1145/2736277.2741626.

38. van Erp, Mendes, Paulheim, Ilievski, Plu, Rizzo and Waitelonis, Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job, in: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association, ELRA-ELDA, Paris, France, 2016, pp. 4373–4379. ISBN 978-2-9517408-9-1.

39. Waitelonis, Exeler and Sack, Linked data enabled generalized vector space model to improve document retrieval, in: Proceedings of the Third NLP&DBpedia Workshop (NLP & DBpedia 2015) Co-Located with the 14th International Semantic Web Conference 2015 (ISWC’15), CEUR-WS, Vol. 1581, 2015, pp. 33–44.

40. Waitelonis, Jürges and Sack, Don’t Compare Apples to Oranges: Extending GERBIL for a Fine Grained NEL Evaluation, in: Proceedings of the 12th International Conference on Semantic Systems (SEMANTiCS’16), ACM, New York, NY, USA, 2016, pp. 65–72. doi:10.1145/2993318.2993334.

41. Waitelonis, Plank and Sack, TIB|AV-portal: Integrating automatically generated video annotations into the web of data, in: International Conference on Theory and Practice of Digital Libraries (TPDL’16), Lecture Notes in Computer Science, Vol. 9819, Springer, Cham, 2016, pp. 429–433. doi:10.1007/978-3-319-43997-6_37.

42. Waitelonis and Sack, Named entity linking in #tweets with KEA, in: Proceedings of the 6th Workshop on ‘Making Sense of Microposts’ Co-Located with the 25th International World Wide Web Conference (WWW’16), CEUR-WS, Vol. 1691, 2016, pp. 61–63.

43. J.G. Zheng, Howsmon, Zhang, Hahn, McGuinness, Hendler and Ji, Entity linking for biomedical literature, BMC Medical Informatics and Decision Making 15(1) (2015), S4. doi:10.1186/1472-6947-15-S1-S4.


¹ http://aksw.org/Projects/GERBIL.html

³ https://github.com/santifa/gerbil/

⁹ https://github.com/santifa/hfts

¹² http://gerbil.s16a.org/

¹³ https://github.com/santifa/hfts/blob/master/Results.md

¹⁵ http://gerbil.aksw.org/gerbil/experiment?id=201712050002

¹⁶ http://gerbil.aksw.org/gerbil/experiment?id=201712060001

¹⁷ http://gerbil.aksw.org/gerbil/experiment?id=201712060011

¹⁸ http://gerbil.aksw.org/gerbil/experiment?id=201711280011

²⁰ http://gerbil.aksw.org/gerbil/experiment?id=201712050010

²¹ http://gerbil.aksw.org/gerbil/experiment?id=201711230013