Abstract
The ability to compare systems from the same domain is of central importance for their introduction into complex applications. In the domains of named entity recognition and entity linking, the large number of systems and their orthogonal evaluation w.r.t. measures and datasets has led to an unclear landscape regarding the abilities and weaknesses of the different approaches. We present GERBIL, an evaluation framework for semantic entity annotation that addresses these issues.
Keywords
Introduction
Named Entity Recognition (NER) and Named Entity Linking/Disambiguation (NEL/D) as well as other natural language processing (NLP) tasks play a key role in annotating RDF knowledge from unstructured data. While manifold annotation tools have been developed over recent years to address (some of) the subtasks related to the extraction of structured data from unstructured data [20,28,38,40,42,48,52,59,62], the provision of comparable results for these tools remains a tedious problem. The issue of comparability of results is not to be regarded as being intrinsic to the annotation task. Indeed, it is now well established that scientists spend between 60 and 80% of their time preparing data for experiments [23,30,47]. That data preparation is so tedious in the annotation domain is mostly due to the different formats of gold standards as well as the different data representations across reference datasets. These restrictions have led to authors evaluating their approaches on datasets (1) that are available to them and (2) for which writing a parser and an evaluation tool can be carried out with reasonable effort. In addition, many different quality measures have been developed and used actively across the annotation research community to evaluate the same task, creating difficulties when comparing results across publications on the same topics. For example, while some authors publish macro-F-measures and simply call them F-measures, others publish micro-F-measures for the same purpose, leading to significant discrepancies across the scores. The same holds for the evaluation of how well entities match. Indeed, partial matches and complete matches have been used in previous evaluations of annotation tools [11,57]. This heterogeneous landscape of tools, datasets and measures leads to a poor repeatability of experiments, which makes the evaluation of the real performance of novel approaches against the state of the art rather difficult.
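As a purely illustrative example (the numbers are invented for the sake of the argument), assume an annotator is evaluated on two documents: D1 contains 10 gold entities, of which the annotator returns 8 correctly and nothing else, while D2 contains 2 gold entities, for which the annotator returns 2 incorrect annotations. Then micro-P = (8 + 0)/(8 + 2) = 0.80, micro-R = (8 + 0)/(10 + 2) ≈ 0.67 and thus micro-F1 ≈ 0.73, whereas macro-F1 = (F1(D1) + F1(D2))/2 = (0.89 + 0.00)/2 ≈ 0.44. Reporting both values simply as "F-measure" would make the very same system appear considerably stronger or weaker depending on the publication.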
Thus, we present GERBIL, a general framework for benchmarking semantic entity annotation systems. This paper is a significant extension of [64], including the progress of the framework since its initial release.
In the rest of this paper, we explain the core principles which we followed to create GERBIL and detail our new contributions. Thereafter, we present the state of the art in benchmarking Named Entity Recognition, Typing and Linking. In Section 4, we present the GERBIL framework itself in more detail.
Insights into the difficulties of current evaluation setups have led to a movement towards the creation of frameworks to ease the evaluation of solutions that address the same annotation problem, see Section 3.
FAIR principles and how GERBIL addresses each of them
To ensure that the Available at
After the release of
Finally,
Named Entity Recognition and Entity Linking have gained significant momentum with the growth of Linked Data and structured knowledge bases. Over the past few years, the problem of result comparability has thus led to the development of a handful of frameworks.
The BAT-framework [11] is designed to facilitate the benchmarking of NER, NEL/D and concept tagging approaches. BAT compares seven existing entity annotation approaches using Wikipedia as reference. Moreover, it defines six different task types, five different matchings and six evaluation measures, and provides five datasets. Rizzo et al. [52] present a state-of-the-art study of NER and NEL systems for annotating newswire and micropost documents using well-known benchmark datasets, namely CoNLL 2003 and Microposts 2013 for NER as well as AIDA/CoNLL and Microposts 2014 [4] for NED. The authors propose a common schema, the NERD ontology, which aligns the type taxonomies used by the different systems.
Over the course of the last 25 years, several challenges, workshops and conferences have dedicated themselves to the comparable evaluation of information extraction (IE) systems. Starting in 1993, the Message Understanding Conference (MUC) introduced a first systematic comparison of information extraction approaches [60]. Ten years later, the Conference on Computational Natural Language Learning (CoNLL) offered the beginnings of a shared task on named entity recognition and published the CoNLL corpus [61]. In addition, the Automatic Content Extraction (ACE) challenge [17], organized by NIST, evaluated several approaches but was discontinued in 2008. Since 2009, the Text Analysis Conference has hosted the workshop on knowledge base population (TAC-KBP) [37], where mainly linguistically motivated approaches are published. The Senseval challenge, originally concerned with classical NLP disciplines, widened its focus in 2007 and changed its name to SemEval to account for the recently recognized impact of semantic technologies [31]. The Making Sense of Microposts workshop series (#Microposts) established an entity recognition challenge in 2013 and an entity linking challenge in 2014, both focusing on tweets and microposts [55]. In 2014, Carmel et al. [6] introduced one of the first Web-based evaluation systems for NER and NED, which formed the centerpiece of the entity recognition and disambiguation (ERD) challenge. Here, all frameworks are evaluated against the same unseen dataset and provided with the corresponding results.
Architecture overview

Overview of GERBIL's architecture.
Experiments run in our framework can be configured in several ways. In the following, we present some of the most important parameters of experiments available in GERBIL.
Experiment types
An experiment type defines the problem that has to be solved by the benchmarked system. Cornolti et al.’s [11] BAT-framework offers six different experiment types, namely (scored) annotation (S/A2KB), disambiguation (D2KB) – also known as linking – and (scored and ranked) concept annotation (S/R/C2KB) of texts. In [52], the authors propose two types of experiments, highlighting the strengths and weaknesses of the analyzed systems. Thereby, performing
We implement 8 types of experiments:
With this extension, our framework can now deal with gold standard datasets and annotators that link to any knowledge base, e.g., DBpedia, BabelNet [45], etc., as long as the necessary identifiers are URIs. We were thus able to implement 37 new gold standard datasets, cf. Section 4.4, and 15 new annotators linking entities to any knowledge base instead of solely to Wikipedia, as in previous works, cf. Section 4.3.1. With this extensible interface, further annotators and datasets can be added with little effort.
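To illustrate what such an adapter conceptually amounts to, the following Java sketch shows a strongly simplified annotator interface. The names (Marking, A2KBAnnotator, annotate) are chosen for this illustration and do not necessarily correspond to the classes in our code base.

```java
import java.util.List;

/** Minimal sketch of an annotation result: a text span linked to a knowledge base URI. */
class Marking {
    final int start;        // begin index of the marking within the document text
    final int length;       // length of the marked surface form
    final String entityUri; // URI of the linked entity (any KB, e.g., DBpedia or BabelNet)

    Marking(int start, int length, String entityUri) {
        this.start = start;
        this.length = length;
        this.entityUri = entityUri;
    }
}

/** Hypothetical adapter interface that a benchmarked A2KB system would implement. */
interface A2KBAnnotator {
    /** Recognizes entity mentions in the text and links them to knowledge base URIs. */
    List<Marking> annotate(String documentText) throws Exception;
}
```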
Matching
A matching defines which conditions the result of an annotator has to fulfill to be a correct result, i.e., to match an annotation of the gold standard. An annotation has either a position, a meaning (i.e., a linked entity or a type) or both. Therefore, we can define an annotation as the combination of a position in the text (its start index and length) and a meaning (the URI of the linked entity or type), where either part may be absent depending on the experiment type.
The first matching type
For the D2KB experiments, matching is expanded to
The strong annotation matching can also be used for A2KB and Sa2KB experiments. However, in practice this exact matching can be misleading. A document can contain a gold standard named entity such as, for example, "President Barack Obama", while an annotator marks only "Barack Obama". Under strong matching, this partially overlapping annotation would be counted as an error, although the annotator found the correct entity. The weak annotation matching therefore accepts annotations whose positions overlap with the gold standard marking.
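The difference between the two position matchings can be summarized by two predicates. The following Java sketch is a simplified reading of strong (exact boundaries) versus weak (overlapping boundaries) matching, not necessarily GERBIL's actual implementation:

```java
/** Simplified position matching between a gold standard marking and a system marking. */
final class PositionMatching {

    /** Strong matching: begin index and length have to be exactly equal. */
    static boolean strongMatch(int goldStart, int goldLength, int sysStart, int sysLength) {
        return goldStart == sysStart && goldLength == sysLength;
    }

    /** Weak matching: the two markings only have to overlap in at least one character. */
    static boolean weakMatch(int goldStart, int goldLength, int sysStart, int sysLength) {
        int goldEnd = goldStart + goldLength;
        int sysEnd = sysStart + sysLength;
        return goldStart < sysEnd && sysStart < goldEnd;
    }
}
```

For the example above, a gold marking "President Barack Obama" (start 0, length 22) and a system marking "Barack Obama" (start 10, length 12) fail strongMatch but pass weakMatch.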
However, the evaluation of whether two given meanings match each other is more challenging than a simple comparison of the two URIs, since the same real-world entity can be identified by different URIs in different knowledge bases.
The key insight behind the solution to this problem in GERBIL is that all URIs identifying the same entity – e.g., URIs connected via redirects or owl:sameAs links – can be gathered into a URI set. Two meanings are then considered to match if their URI sets share at least one URI.

Schema of the four components of the entity matching process.
Second, the URI set retrieval as well as the URI checking cause considerable communication overhead. Since our implementation of this communication is considerate of the KB endpoints and inserts delays between single requests, these steps slow down the evaluation. However, our future developments will attempt to reduce this drawback.
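To illustrate the URI set retrieval and the URI checking, the following Java sketch resolves a URI (adding a redirect target to the set) and tests two URI sets for a non-empty intersection, inserting a delay before each request. All class and method names are chosen for this example; the actual retrieval in GERBIL is more elaborate.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashSet;
import java.util.Set;

/** Simplified sketch of URI-set-based meaning matching with polite, delayed KB requests. */
class UriSetMatcher {
    private final HttpClient client = HttpClient.newHttpClient();
    private final long delayMillis; // pause between requests to spare the KB endpoints

    UriSetMatcher(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Two meanings match if the sets of URIs identifying them share at least one element. */
    static boolean meaningsMatch(Set<String> goldUris, Set<String> systemUris) {
        Set<String> intersection = new HashSet<>(goldUris);
        intersection.retainAll(systemUris);
        return !intersection.isEmpty();
    }

    /** Retrieves a (partial) URI set for an entity URI; sleeps before issuing the request. */
    Set<String> retrieveUriSet(String entityUri) throws Exception {
        Thread.sleep(delayMillis); // be considerate of the endpoint
        Set<String> uris = new HashSet<>();
        uris.add(entityUri);
        HttpRequest request = HttpRequest.newBuilder(URI.create(entityUri)).build();
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        // A redirect target (the client does not follow redirects by default) identifies the same entity.
        response.headers().firstValue("Location").ifPresent(uris::add);
        return uris;
    }
}
```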
While all cases are taken into account for the normal measures, the other measures take only subsets of these cases into account, as indicated by the ticks in the table below.
The different classification cases that can occur during the evaluation. A dash means that there is no URI set that could be used for the matching. A tick shows that this case is taken into account while calculating the measure
To support the development of new approaches, we implemented additional diagnostic capabilities such as the calculation of correlations between dataset features and annotator performance [63]. In particular, we calculate the Spearman correlation between document attributes (e.g., the number of persons) and performance (i.e., F-measure) to quantify how strongly the two are related. Figure 3 shows the correlation between the performance of systems and selected features of the datasets. This can help determine the strengths and weaknesses of the different approaches.
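As a minimal sketch of how such a correlation can be computed, the following example uses Apache Commons Math; the feature values and F-measures are placeholders and not taken from Figure 3.

```java
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

public class FeatureCorrelation {
    public static void main(String[] args) {
        // Placeholder values: one entry per dataset (or document group).
        double[] personsPerDocument = {0.5, 1.2, 2.0, 3.1, 4.4}; // dataset feature
        double[] microF1 = {0.78, 0.74, 0.69, 0.61, 0.55};       // annotator performance

        // Spearman's rank correlation quantifies the monotonic relation between the two.
        double rho = new SpearmansCorrelation().correlation(personsPerDocument, microF1);
        System.out.printf("Spearman correlation: %.3f%n", rho);
    }
}
```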

Absolute correlation values of the annotators’ Micro F1-scores and the dataset features for the A2KB experiment and weak annotation matching.
We describe the exact requirements for the structure of the NIF document on our project website’s wiki, as NIF offers several ways to build a NIF-based document or corpus.
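To give a rough impression of such a structure, the following Jena-based sketch creates a minimal NIF context resource carrying a document's text and character indices. The vocabulary terms follow NIF Core, but the document URI is invented for this example, the index literals are simplified to plain strings, and the authoritative requirements remain those documented on the wiki.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class MinimalNifDocument {
    static final String NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#";

    public static void main(String[] args) {
        String text = "Barack Obama visited Berlin.";
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("nif", NIF);

        // The context resource carries the full document text and its character range.
        Resource context = model.createResource("http://example.org/doc1#char=0," + text.length());
        context.addProperty(RDF.type, model.createResource(NIF + "Context"));
        context.addProperty(model.createProperty(NIF + "isString"), text);
        context.addProperty(model.createProperty(NIF + "beginIndex"), "0");
        context.addProperty(model.createProperty(NIF + "endIndex"), String.valueOf(text.length()));

        model.write(System.out, "TURTLE");
    }
}
```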
Currently,
entityclassifier.eu: Dojchinovski and Kliegr [18] present their approach based on hypernyms and a Wikipedia-based entity classification system which identifies salient words. The input is transformed into a lower-dimensional representation that keeps the same output quality for all sizes of input text.
Overview of implemented annotator systems. Brackets indicate that an adapter has been implemented but cannot be used in the live system.
Table 3 compares the annotation systems implemented in GERBIL.
Datasets, their formats and features. Groups of datasets, e.g., for a single challenge, have been grouped together. A ⋆ indicates various inline or keyfile annotation formats. The experiments follow their definition in Section 4.2
BAT enables the evaluation of different approaches using the AQUAINT, MSNBC, IITB and the four AIDA/CoNLL datasets (Train A, Train B, Test and Complete). With GERBIL, this list of datasets is extended significantly.
We capitalize upon the uptake of publicly available, NIF-based corpora in recent years [53,58].
The extensibility of datasets in
The licenses and instructions can be found at
To describe annotators in a similar fashion, we extended DataID for services. The class
Offering such detailed and structured experimental results opens new research avenues in terms of tool and dataset diagnostics and increases decision makers’ ability to choose the right settings for the right use case. In addition to individually configurable experiments, GERBIL provides an aggregated view of recent experiment results, e.g., in the form of spider diagrams.

Example spider diagram of recent A2KB experiments with weak annotation matching, derived from our online interface.
One of
Number of tasks executed per annotator. By caching results, we did not need to execute 12,466 tasks but only 9,906. Data from 15 February 2015.
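The reduction shown in the table above stems from re-using the results of identical experiment tasks. The following Java sketch illustrates the idea of such a cache; the key structure and the stored score are illustrative assumptions rather than our actual implementation, which, for instance, may also take the age of cached results into account.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Illustrative cache: identical (annotator, dataset, experiment type) tasks are executed only once. */
class ExperimentResultCache {
    /** Hypothetical cache key for an experiment task. */
    record TaskKey(String annotator, String dataset, String experimentType) {}

    private final Map<TaskKey, Double> cachedMicroF1 = new ConcurrentHashMap<>();

    /** Returns the cached score if available; otherwise runs the task and caches its result. */
    double getOrRun(TaskKey key, Supplier<Double> runTask) {
        return cachedMicroF1.computeIfAbsent(key, k -> runTask.get());
    }
}
```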
Furthermore, we evaluated the amount of time that five experienced developers needed to write an evaluation script for their framework and how long they needed to evaluate their framework using GERBIL. The time needed to write an evaluation script for a system and a single dataset is compared below to the time needed to write an adapter for GERBIL.

Comparison of the effort needed to implement an adapter for an annotation system with and without GERBIL.
In this paper, we presented and evaluated GERBIL, a framework for benchmarking semantic entity annotation systems.
In the future,
Another development is the further support of developers with direct feedback, i.e., showing the annotations that have been marked as incorrect in the documents. This feature has not been implemented because of licensing issues. However, we think that it would be possible to implement it without license violations for datasets that are publicly available.
Acknowledgements
This work was supported by the Eurostars projects DIESEL (E!9367) and QAMEL (E!9725) as well as the European Union’s H2020 research and innovation action HOBBIT under the Grant Agreement number 688227.
