Abstract
The large number of tweets generated daily provides decision makers with the means to obtain insights into recent events around the globe in near real-time. The main barrier to extracting such insights is the impossibility of manually inspecting such a diverse and dynamic volume of information. This problem has attracted the attention of industry and research communities, resulting in algorithms for automatically extracting semantics from tweets and linking them to machine-readable resources. While a tweet is superficially comparable to any other textual content, it hides a complex and challenging structure that requires domain-specific computational approaches for mining its semantics. The NEEL challenge series, established in 2013, has contributed to the collection of emerging trends in the field and the definition of standardised benchmark corpora for entity recognition and linking in tweets, ensuring high-quality labelled data that facilitates comparisons between different approaches. This article reports the findings and lessons learnt through an analysis of specific characteristics of the created corpora, their limitations, lessons learnt from the different participants, and pointers for furthering the field of entity recognition and linking in tweets.
Introduction
Tweets have been proven to be useful in different applications and contexts such as music recommendation, spam detection, emergency response, market analysis, and decision making. The limited number of tokens in a tweet, however, implies a lack of sufficient contextual information for understanding its content. A commonly used approach is to extract named entities from the tweet and link them to external, machine-readable resources that enrich its context.
The automated identification, classification and linking of named entities has proven to be challenging due to, among other things, the inherent characteristics of tweets:
The NEEL challenge series, first established in 2013 and run yearly since then, has captured a community need for making sense of tweets through a wealth of high-quality annotated corpora and for monitoring the emerging trends in the field. The first edition of the challenge, named the Concept Extraction (CE) Challenge [16], focused on entity identification and classification. A step further into this task is to ground entities in tweets by linking them to knowledge base referents. This prompted the Named Entity Extraction and Linking (NEEL) Challenge the following year [15]. Building on the tasks proposed in 2013 and 2014, these two research avenues prompted the Named Entity rEcognition and Linking (NEEL) Challenge in 2015 [65]. In 2015, the role of the named entity type in the grounding process was investigated, as well as the identification of named entities that cannot be grounded because they have no knowledge base referent (defined as NIL). The English DBpedia 2014 dataset was the designated referent knowledge base for the 2015 NEEL challenge, and the evaluation was performed by automatically querying, live, the Web APIs prepared by the participants, which also made it possible to measure computing time. The 2016 edition [68] consolidated the 2015 edition, using the English DBpedia 2015-04 version as referent knowledge base. This edition proposed an offline evaluation in which computing time was not taken into account in the final evaluation.
The four challenges have published four incremental, manually labelled benchmark corpora. The creation of the corpora followed strict guidelines and protocols, resulting in high-quality labelled data that can be used as seeds for reasoning and supervised approaches. Despite these protocols, the corpora have strengths and weaknesses that we have discovered over the years; these are discussed in this article.
The purpose of each challenge was to set up an open and competitive environment that would encourage participants to deliver novel approaches or improve on existing ones for recognising and linking entities from tweets to either a referent knowledge base entry or NIL where such an entry does not exist. From the first (in 2013) to the 2016 NEEL challenge, thirty research teams have submitted at least one entry to the competitions proposing state-of-the-art approaches. More than three hundred teams have explicitly acquired the corpora in the four years, underlining the importance of the challenges in the field.1 This number does not account for the teams who experimented with the corpora out of the challenges’ timeline.
This paper reports on the findings and lessons learnt from the last four years of NEEL challenges, analysing the corpora in detail, highlighting their limitations, and distilling guidance from the top-performing approaches of the different participants. The resulting body of work has implications for researchers, application designers and social media engineers who wish to harvest information from tweets for their own objectives. The remainder of this paper is structured as follows: in Section 2 we compare the NEEL series with recent shared tasks in entity recognition and linking and explain the need that prompted its establishment. Next, in Section 3, the decisions behind the different versions of the NEEL challenge are introduced and the initiative is compared against the other shared tasks. We then detail the steps followed in generating the four different corpora in Section 4, followed by a quantitative and qualitative analysis of the corpora in Section 5. We then list the different approaches presented and identify the emerging trends in Section 6, grounding the trends according to the evaluation strategies presented in Section 7. Section 8 reports the participants' results and provides an error analysis. We conclude and list our future activities in Section 9. The appendices provide the NEEL Taxonomy (Appendix A) and the NEEL Challenge Annotation Guidelines (Appendix B).
The first research challenge to identify the importance of the recognition of entities in textual documents was held in 1997 during the 7th Message Understanding Conference (MUC-7) [20]. In this challenge, the term
Recognising an entity in a textual document was the first big challenge, but after overcoming this obstacle, the research community moved on to a second, equally challenging task: disambiguating entities. This problem arises when a mention in text may refer to more than one entity. For instance, the mention
The Entity Disambiguation task was popularised after Bunescu and Pasca [13], in 2006, explored the use of an encyclopaedia as a source of entities. In particular, after [23] demonstrated the benefit of using Wikipedia,6
In 2009, the TAC-KBP challenge [54] introduced a new problem to both the Entity Recognition and Entity Disambiguation communities. In Entity Recognition, a mention is recognised in text without information about the exact entity referred to by the mention. On the other hand, Entity Disambiguation focuses only on the resolution of entities that have a referent in a provided knowledge base. The TAC-KBP challenge highlighted the problem that a mention identified in text may not have a referent entity in the knowledge base. In this case, the suggestion was to link such a mention to a NIL entity in order to indicate that it is not present in the knowledge base. This problem was referred to as Named Entity Linking and it remains a hard and open research problem. Nowadays, however, the terms Entity Disambiguation and Entity Linking have become interchangeable.
Since the TAC-KBP challenge, there has been an explosion in the number of algorithms proposed for Entity Linking using a variety of textual documents, Knowledge Bases, and even using different definitions of entities. This variety, whilst beneficial, also extends to how approaches are evaluated, regarding metrics and gold standard datasets used. Such diversity makes it difficult to perform comparisons between various Entity Linking algorithms and creates the need for benchmark initiatives.
In this section, we first introduce the main components of the Entity Linking task and their possible variations, followed by a typical workflow used to approach the task, the expected output of each step, and three strategies for the evaluation of Entity Linking systems. We conclude with an overview of benchmark initiatives and their decisions regarding the use of Entity Linking components and evaluation strategies.
Entity Linking is defined as
In this definition, Entity Linking contains three main components: text, knowledge base, and entity. The features of each component may vary, and consequently, have an impact on the results of algorithms used to perform the task. For instance, a state-of-the-art solution based on long textual documents may have a poor performance when evaluated over short documents with little contextual information within the text. In a similar manner, a solution developed to link entities of types Person, Location, and Organisation may not be able to link entities of type Movie. Therefore, the choice of each component defines which type of solutions are being evaluated by each specific benchmark initiative.
Textual document
In Entity Linking, textual documents are usually divided into two main categories: long text and short text. Long textual documents usually contain more than 400 words; examples include news articles and web sites. Short documents, such as microposts7 (the term used in the social media field to refer to tweets and social media posts in general), contain far fewer words.
Long textual documents provide a series of document-level features that can be explored for Entity Linking such as: the presence of multiple entity mentions in a single document; well-written text (expressed by the lack or relative absence of misspellings); and the availability of contextual information that supports the grounding of each mention. Contextual information entails the supporting facts that help in deciding the best knowledge base entry to be linked with a given mention. For instance, let us assume the knowledge base has two candidate entries to be linked with the mention
Short text documents are considered more challenging than long ones because they have the exact opposite features such as: the presence of few entity mentions in a single document (due to the limited size of the text); the presence of misspellings or phonetic spelling (e.g. “I call u 2morrow” rather than “I call you tomorrow”); and the low availability of contextual information within the text. It is important to note though that even within the short text category there are still important distinctions between microposts and search queries that may impact the performance of Entity Linking algorithms.
In performing a search, it is expected that the search query will be composed by a mention to the entity of interest being searched and, sometimes, by additional contextual information. Therefore, despite the challenge of being a short text document, search queries are assumed to contain at least one mention to an entity and likely to contain additional contextual information. However, for microposts this assumption does not hold.
Microposts do not necessarily have an entity as target. For instance, a document with the content “So happy today!!!” does not explicitly cite any entity mention. Also, microposts may be used to talk about entities without providing any context within the text, as in “Adele, you rock!!”. In this aspect, Entity Linking for microposts is more challenging than for search queries because it is unclear if a micropost will contain an entity and context for the linking. Furthermore, microposts are also more likely to contain misspellings and phonetic writing than search queries. If a search engine user misspells a term then it is very likely that she will not find the desired information. In this case, it is safe to assume that search engine users will try to avoid misspellings and phonetic writing as much as possible. On the other hand, in micropost communities, misspellings and phonetic writing are used as strategies to shorten words, thus enabling the communication of more information within a single micropost. Therefore, misspelling and phonetic writing are common features of microposts and need to be taken into consideration when performing Entity Linking.
The second component of Entity Linking we consider is the knowledge base used. Knowledge bases differ from each other regarding the domains of knowledge they cover (e.g. domain-specific or encyclopedic knowledge), the features used to describe entries (e.g. long textual descriptions, attribute-value pairs, or relationship links between entities), their rate of updates, and the number of entities they cover.
As with textual documents, different characteristics will impact the Entity Linking task. The domain covered by the knowledge base will influence which entity mentions can possibly have a link. If there is a mismatch between the domain of the text (e.g. biomedical text) and the domain of the knowledge base (e.g. politics) then all, or most, entity mentions found in the text will have no reference in the knowledge base. In the extreme case of complete mismatch, the Entity Linking process is reduced to Entity Recognition. Therefore, in order to perform linking, the knowledge base should be at least partially related to the domain of the text being linked.
Furthermore, the features used to describe entities in the knowledge base influence which algorithms can make use of it. For instance, if entities are represented only through textual descriptions, a text-based algorithm needs to be used to find the best mention-entry link. If, however, knowledge base entries are only described through relationship links with other entities then a graph-based algorithm may be more suitable.
The third characteristic of a knowledge base which impacts Entity Linking is its rate of updates. Static knowledge bases (i.e. knowledge bases that are never or infrequently updated) represent only the status of a given domain at the moment they were generated. Any entity which becomes relevant to that domain after that point in time will not be represented in the knowledge base. Therefore, in a textual document, only mentions of entities that existed prior to the creation of the knowledge base will have a link; all others will be linked to NIL. The faster entities change in a given domain, the more likely it is for the knowledge base to become outdated. If the text and the knowledge base are completely disjoint, all mentions in the text would invariably be linked to NIL. Depending on the textual document to be linked, the rate of updates may or may not be an important feature. Social and news media are more likely to see a faster change in their entities of interest than manufacturing reports, for instance.
Another characteristic of a knowledge base which may impact Entity Linking is the number of entities it covers. Two knowledge bases with otherwise identical characteristics may still vary in the number of entities they cover. When applied to Entity Linking, the more entities a knowledge base covers, the more likely there will be a match between text and knowledge base. Of course, in this case we should assume that both knowledge bases focus on representing key entities in their domain rather than long-tail ones.
Entity
The third component of interest for Entity Linking is the definition of entity. Despite their importance for the Entity Linking task, entities are not formally defined. Instead, entities are defined either through example or through the data available. Named entities are the most common case of definition by example. Named entities were introduced in 1997 as part of the Message Understanding Conference as instances of the Person, Organisation, and Geo-political types. An extension of named entities is usually performed through the inclusion of additional types such as Location, Facility, or Movie. In these cases there is no formal definition of entities; rather, they are exemplars of a set of categories.
An alternative definition of entities assumes that entities are anything represented by the knowledge base. In other words, the definition of entity is given by the data available (in this case, data from the knowledge base). Whereas this definition makes the Entity Linking task easier by not requiring any refined “human-like” reasoning about types, it makes it impossible to identify NIL links. If an entity is anything in the knowledge base, how could we ever have, by definition, an entity which is not in the knowledge base?
The choice of entity definition will depend on the Entity Linking task desired. If the goal is to consider links to NIL, then the definition based on types is the most suitable; otherwise the definition based on the knowledge base may be used.
Typical Entity Linking workflow and evaluation strategies

Fig. 1. Typical Entity Linking workflow with expected output of each step.
Regardless of the different Entity Linking components, most proposed systems for Entity Linking follow a workflow similar to the one presented in Fig. 1. This workflow is composed of the following steps: Mention Detection, Entity Typing, Candidate Detection, Candidate Selection, and NIL Clustering. Note that, although it is usually a sequential workflow, there are approaches that create a feedback loop between different steps, or merge two or more steps into a single one.
The Mention Detection step receives textual documents as input and recognises all terms in the text that refer to entities; the goal of this step is to perform typical Named Entity Recognition. Next, the Entity Typing step detects the type of each mention previously recognised; this task is usually framed as a categorisation problem. Candidate Detection then receives the detected mentions and produces a list of all entries in the knowledge base that are candidates to be linked with each mention. In the Candidate Selection step, these candidate lists are processed and, by making use of the available contextual information, the correct link for each mention, either an entry from the knowledge base or a NIL reference, is provided. Last, the NIL Clustering step receives a series of mentions linked to NIL as input and generates clusters of mentions referring to the same entity, i.e. each cluster contains all NIL mentions representing one, and only one, entity, and no two clusters represent the same entity.
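The five steps above can be sketched as a minimal, self-contained pipeline. Everything here is an illustrative assumption (the toy knowledge base, the capitalisation-based detector, the stub type classifier and first-candidate selector), not the approach of any participant system:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    surface: str           # text span as it appears in the micropost
    entity_type: str = ""  # filled in by entity_typing
    link: str = ""         # knowledge base entry, or "NIL" if no referent exists

# Toy knowledge base: surface form -> candidate entries (illustrative only)
KB = {"adele": ["dbpedia:Adele", "dbpedia:Adele_(given_name)"]}

def mention_detection(text):
    # Stub detector: capitalised tokens are treated as entity mentions.
    return [Mention(tok.strip(",!.")) for tok in text.split() if tok[:1].isupper()]

def entity_typing(mentions):
    for m in mentions:
        m.entity_type = "Person"  # stub classifier; real systems categorise per mention
    return mentions

def candidate_detection(mention):
    # All knowledge base entries that could match this mention.
    return KB.get(mention.surface.lower(), [])

def candidate_selection(mention, candidates):
    # Stub: pick the first candidate; real systems score candidates using context.
    mention.link = candidates[0] if candidates else "NIL"
    return mention

def nil_clustering(mentions):
    # Group NIL mentions by normalised surface form, one cluster per entity.
    clusters = {}
    for m in mentions:
        if m.link == "NIL":
            clusters.setdefault(m.surface.lower(), []).append(m)
    return clusters

def link_entities(text):
    mentions = entity_typing(mention_detection(text))
    for m in mentions:
        candidate_selection(m, candidate_detection(m))
    return mentions, nil_clustering(mentions)
```

For instance, `link_entities("Adele, you rock!!")` links the single detected mention to the toy knowledge base, while a mention absent from the knowledge base is linked to NIL and placed in a cluster.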
The evaluation of Entity Linking systems is based on this typical workflow and can be of three types: end-to-end, step-by-step, or partial end-to-end.
An
The opposite to an end-to-end evaluation is a
Finally, the
The number of variations in Entity Linking makes it hard to benchmark Entity Linking systems. Different research communities focus on different types of text and knowledge base, and different algorithms will perform better or worse on any specific step. In this section, we present the Entity Linking benchmark initiatives to date, the Entity Linking specifications used, and the communities involved. The challenges are summarised in Table 1.
Named Entity Recognition and Linking challenges since 2013
Entity Linking was first introduced in 2009 as a challenge for the Text Analysis Conference.8
In 2009, the TAC-KBP benchmark was not concerned with the recognition of entities in text, particularly because its entities of interest were instances of the types Organisation, Geo-political, and Person, and the recognition of these types of entities in text was already a well-established task in the community. The challenge was therefore mainly concerned with correct Entity Typing and Candidate Selection. In later years, Mention Detection and NIL Clustering were also included in the TAC-KBP pipeline [47]. More entity types, such as Location and Facility, as well as multiple languages, are now also considered [48].
Characteristics that have been constant in TAC-KBP are the use of long textual documents, entities given by Type, and the use of encyclopedic knowledge bases. A reason for long textual documents would be that this type of text is more likely to contain contextual information to populate a knowledge base, in particular news articles and web sites. The use of entities given by Type is a direct consequence of the availability of named entity recognition algorithms based on types and the need for NIL detection. The use of an encyclopedic knowledge base was because Person, Organisation, and Geo-political entities are not domain-specific and due to the availability of Wikipedia as a free available knowledge base on the Web.
The Entity Recognition and Disambiguation (ERD) challenge [17] was a benchmark initiative organised in 2014 as part of the SIGIR conference9
The Information Retrieval community, and consequently the ERD challenge, focuses on the processing of large amounts of information. Therefore, the systems evaluated should provide not only the correct results but also fulfill basic standards for large scale web systems, i.e. they should be available through Web APIs for public use, they should accept a minimum number of requests without timeout, and they should ensure a minimum uptime availability. All these standards were translated into the evaluation method of the ERD challenge that required systems to have a given Web API available for querying during the time of the evaluation. Also, large scale web systems are evaluated regarding how useful their output is for the task at hand regardless of the internal algorithms used, so the evaluation used by ERD was an end-to-end evaluation using standard information retrieval evaluation metrics (i.e. precision, recall, and f-measure).
The community of natural language processing and computational linguistics within the ACL-IJCNLP11
In 2015, the Workshop on Noisy User-generated Text (W-NUT) [4] promoted the study of documents that are not written in standard English, with tweets as the focus of its two shared tasks. One of these tasks was targeted at the normalisation of text. In other words, expressions such as “r u coming 2” should be normalised into standard English on the form of “are you coming to”. The second task proposed named entity recognition within tweets in which systems were required to detect mentions to entities corresponding to a list of ten entity types. This proposed task corresponds to the first two steps of the Entity Linking workflow: Mention Detection and Entity Typing.
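The normalisation task can be illustrated with a minimal dictionary-based sketch. The lexicon and function name below are hypothetical; actual W-NUT systems combine such lexicons with character-level and context-sensitive models:

```python
# Toy substitution lexicon mapping common micropost shortenings to standard English.
NORM_LEXICON = {"r": "are", "u": "you", "2": "to", "2morrow": "tomorrow"}

def normalise(micropost):
    # Replace each token found in the lexicon; leave all other tokens untouched.
    return " ".join(NORM_LEXICON.get(tok.lower(), tok) for tok in micropost.split())
```

Under this lexicon, `normalise("r u coming 2")` yields the standard-English form used in the example above. A purely lexical approach cannot resolve ambiguous tokens (e.g. “2” as a digit versus “to”), which is why context models are needed in practice.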
Word Sense Disambiguation and Entity Linking are two tasks that perform disambiguation of textual documents through links with a knowledge base. Their main difference is that the former disambiguates the meaning of words with respect to a dictionary of word senses, whereas the latter disambiguates words with respect to a list of entity referents. These two tasks have historically been treated as distinct since they require knowledge bases of a dissimilar nature. However, with the development of BabelNet, a knowledge base containing both entities and word senses, Word Sense Disambiguation and Entity Linking could finally be performed using a single knowledge base.
In 2015, a shared task for Multilingual All-Words Sense Disambiguation and Entity Linking [57] using BabelNet was proposed as part of the International Workshop on Semantic Evaluation (SemEval).12
Named Entity Recognition and Entity Linking have been active research topics since their introduction by MUC-7 in 1997 and TAC-KBP in 2009, respectively. The main focus of these initiatives had been on long textual documents, such as news articles, or web sites. Meanwhile, microposts emerged as a new type of communication on the Social Web and have been a widespread format to express opinions, sentiments, and facts about entities. The popularisation of microposts through the use of Twitter,13
The evolution of the NEEL challenge followed the evolution of Entity Linking. The challenge was first held in 2013 under the name of Concept Extraction (CE) and was concerned with the detection of mentions of entities in microposts and the specification of their types. The following year, under the acronym NEEL, the challenge also included linking mentions to an encyclopedic knowledge base or to NIL. In 2015 and 2016, NEEL was expanded to also include the clustering of NIL mentions.
To propose a fair benchmark for approaches to Entity Linking with microposts, the organisation of the NEEL challenge had to make certain decisions concerning different Entity Linking components and the available strategies for evaluation, always taking into consideration the trends and needs of the research community focused on the Web and microposts. In this section, we provide the motivation for these decisions. A discussion on their impact will be provided in later sections.
Taking this into account, we chose to use DBpedia [49], a structured knowledge base based on Wikipedia, mainly because it is frequently updated with entities appearing in events covered in social media. Another motivation for using DBpedia is that its format lends itself better to the task than Wikipedia itself. Each NEEL version used the latest available version of DBpedia.
In 2013, the list of entity types was based on the taxonomy used in CoNLL 2003 [71]. From 2014 onwards, the NEEL Taxonomy (Appendix A) was created with the goal of providing a more fine-grained classification of entities, representing a vast range of entities of interest in the context of the Web. The types of entities used and how the NEEL Taxonomy was built are described in Section 4.
The NEEL challenge has used different evaluation settings in different versions of the challenge. Each change has its own motivation, but the main focus for each of them was to provide a fair and comprehensive evaluation of the submitted systems.
The first decision regards the submission of a results file versus evaluation through Web APIs. Both approaches have their advantages and disadvantages. The use of a file lowers the bar for new participants in the challenge because they do not need to develop a Web API in addition to the usual Entity Linking steps, nor to keep a Web server available during the whole evaluation process. This was the model adopted in 2013, 2014, and 2016. However, during NEEL 2014, some participants suggested that the challenge should apply a blind evaluation, i.e. the participants should see the input data only at query time, in order to avoid the common mistake of tuning a system on the evaluation data. Therefore, in 2015 the submission of evaluation results was changed to Web API calls. The impact of this change was that a few teams could not participate in the challenge, mainly because their Web server was not available during the evaluation or their API did not have the correct signature. This format of evaluation also required extra effort from the organisers, who had to notify participant teams that their web servers were unavailable. Given the problems generated and the lack of any real benefit, the organisation opted to return to the file-based submission of system results used in previous years.
The second decision concerns the evaluation strategy, which impacts both the metrics used and the overall benchmark ranking. Here, the options are an end-to-end, a partial end-to-end, or a step-by-step evaluation. Borrowing from the named entity recognition community, the first two versions of the challenge (i.e. 2013 and 2014) were based on an end-to-end evaluation. In this evaluation, standard metrics (i.e. precision, recall, and f-measure) were applied on top of the aggregated results of the system. A drawback of end-to-end evaluation in Entity Linking is that if one step in the typical workflow performs poorly, its error propagates through to the last step; the evaluation therefore only measures the aggregated errors of all steps. This was not a problem when systems were required to perform one or two simple steps, but once the challenge started requiring a larger number of steps, a more fine-grained evaluation became necessary.
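As an illustration, an end-to-end evaluation with micro-averaged precision, recall and f-measure can be computed by comparing sets of annotation tuples. This is a generic sketch of the metrics, not the exact scorer used by the challenges, and the (tweet_id, mention, link) tuple shape is an assumption:

```python
def precision_recall_f1(gold, predicted):
    """Micro-averaged scores over hashable annotation tuples,
    e.g. (tweet_id, mention, link)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                        # exact matches only
    p = tp / len(predicted) if predicted else 0.0     # precision
    r = tp / len(gold) if gold else 0.0               # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean
    return p, r, f1
```

Because the tuples must match exactly, an error in any workflow step (a wrong mention boundary, a wrong link) makes the whole annotation count as wrong, which is precisely the error-propagation drawback described above.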
A partial end-to-end strategy evaluates the output of each Entity Linking step by analysing only the final result of the system. This evaluation uses different metrics for each part of the workflow and has been successfully used across multiple TAC-KBP editions. Therefore, due to its benefits for the research community, the partial end-to-end evaluation was also applied in the NEEL challenge in 2015 and 2016. Furthermore, the NEEL challenge applied this strategy using the same evaluation tool as TAC-KBP [44], with the aim of enabling an easier interchange of participants between the two communities.
The step-by-step evaluation has never been applied within the NEEL series. Despite its robustness in eliminating error propagation, it is very time consuming, in particular if participant systems do not implement the typical workflow. The evaluation process for each year, as well as the specific metrics used, will be discussed in Section 7.
In the next sections we will explain in detail how the NEEL challenges were organised, how the benchmark corpora were generated semi-manually, details of participant systems in each year, and the impact of each change in the participation in subsequent years.
The organisation of the NEEL challenges led to the yearly release of datasets of high value for the research community. Over the years, the datasets increased in size and coverage.
Collection procedure and statistics
The initial 2013 challenge dataset contains 4,265 tweets collected from the end of 2010 to the beginning of 2011 using the Twitter firehose with no explicit hashtag search. These tweets cover a variety of topics, including comments on news and politics. The dataset was split into 66% training and 33% test.
The second 2014 challenge dataset contains 3,505 event-annotated tweets, where each entity was linked to its corresponding DBpedia URI. This dataset was collected as part of the Redites project14
The 2015 challenge dataset extends the 2014 dataset. This dataset consists of tweets published over a longer period, between 2011 and 2013. In addition to this, we also collected tweets from the Twitter firehose from November 2014 covering both event (such as the UCI Cyclo-cross World Cup) and non-event tweets. The dataset was split into training (58%), consisting of the entire 2014 dataset, development (8%), which enabled participants to tune their systems, and test (34%) from the newly added 2015 tweets.
The 2016 challenge dataset builds on the 2014 and 2015 datasets, and consists of tweets extracted from the Twitter firehose from 2011 to 2013 and from 2014 to 2015 via a selection of popular hashtags. This dataset was split into training (65%) consisting of the entire 2015 dataset, development (1%), and test (34%) sets from the newly collected tweets for the 2016 challenge.
General statistics of the training, dev, and test data sets.
Only 300 tweets have been randomly selected to be manually annotated and included in the gold standard.
These figures refer to the 300 tweets of the gold standard.
Statistics describing the training, development and test sets are provided in Table 2. In all but the 2015 challenge, the training datasets presented a higher rate of named entities linked to DBpedia than the development and test datasets. The percentage of tweets that mention at least one entity is 74.42% in the training and 72.96% in the test set for the 2013 dataset; 32% in the training and 40% in the test set for the 2014 dataset; 57.83% in the training set, 77.4% in the development set, and 82.05% in the test set for the 2015 dataset; and 67.60% in the training set, 100% in the development set, and 9.35% in the test set for the 2016 dataset. The overlap of entities between the training and test data is 8.09% for the 2013 dataset, 13.27% for the 2014 dataset, 4.6% for the 2015 dataset, and 6.59% for the 2016 dataset. Following the Terms of Use of Twitter, for all four challenge datasets participants were only provided with the tweet IDs and the annotations; the tweet text had to be downloaded from Twitter.
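Entity-overlap figures of this kind reduce to a simple set operation. Since the exact definition behind the reported percentages is not spelled out here, the sketch below assumes one plausible reading (the fraction of distinct test-set entities that also appear in the training set):

```python
def entity_overlap(train_entities, test_entities):
    """Fraction of distinct test-set entities also seen in training.
    NOTE: this definition is an assumption; the challenge reports may have
    computed the overlap percentages differently (e.g. over mentions)."""
    train, test = set(train_entities), set(test_entities)
    return len(train & test) / len(test) if test else 0.0
```

A low overlap (4.6% to 13.27% across the four datasets) means systems cannot simply memorise training entities and must generalise to unseen ones.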
The taxonomy for annotating the entities changed from a four-class taxonomy in 2013, based on the taxonomy used in CoNLL 2003 [71], to an extended seven-type taxonomy, namely the NEEL Taxonomy (Appendix A), which is derived from the NERD Ontology [67]. This new taxonomy was introduced to provide a more fine-grained classification of the entities, covering also names of characters, products and events. Furthermore, it better addresses the semantic diversity of named entities in textual documents, as shown in [66]. Table 3 shows the mapping between the two classification schemes.
Mapping between the taxonomy used in the first challenge of the NEEL challenge series (left column) and the taxonomy used from 2014 onwards (right column)
Entity type statistics for the two data sets from 2013
Entity type statistics for the three data sets from 2015
Summary statistics of the entity types are provided in Tables 4, 5, and 6 for the 2013, 2015, and 2016 corpora respectively. The statistics cover the observable data in the corpora; thus, the distributions of implicit classes in the 2014 corpus are not reported. The class information was deliberately removed from that release, in line with the task's final objective of fostering end-to-end solutions.
Entity type statistics for the three data sets from 2016. The statistics of the Test set refer to the manually annotated set of tweets selected to generate the gold standard
In the 2013 challenge, 4 annotators created the gold standard. In the 2014 challenge, a total of 14 annotators with different backgrounds were involved, including computer scientists, social scientists, social web experts, semantic web experts and natural language processing experts. In the 2015 challenge, 3 annotators generated the annotations, and in the 2016 challenge, 2 experts took on the manual annotation campaign.
The annotation process for the 2013 dataset started with the unannotated corpus and consists of the following steps:
Manual annotation: the corpus was split into four quarters, each was annotated by a different human annotator.
Consistency: for consistency checking, each annotator also checked the annotations performed by the other three to verify correctness.
Consensus: for the annotations without consensus, discussions among the four annotators were used to come to a final conclusion. This process resolved the annotation inconsistencies.
Adjudication: a very small number of errors was also reported by the participants, which was taken into account in the final version of the dataset.
With the inclusion of entity links, the annotation process for the 2014 and 2015 datasets was amended to consist of the following phases:
Unsupervised automated annotation: the corpus was initially annotated using the NERD framework [69], which extracted potential entity mentions, candidate links to DBpedia, and entity types. The NERD framework was used as an off-the-shelf annotation tool, i.e. without domain-specific training.
Manual annotation: the labeled data set was divided into batches, with different annotators – three in the 2014 challenge, and two in the 2015 challenge – assigned to each batch. In this phase, manual annotations were performed using an annotation tool: CrowdFlower for the 2014 challenge dataset (with selected expert annotators rather than the crowd), and GATE for the 2015 challenge, since GATE allows entities to be annotated according to an ontology and inter-annotator agreement to be computed on the dataset.
Consistency: the annotators – the three experts in the 2014 challenge, and a fourth annotator in the 2015 challenge – double-checked the annotations and generated the gold standard (for the training, development and test sets). Three main tasks were carried out here: (i) consistency checking of entity types; (ii) consistency checking of URIs; (iii) resolution of ambiguous cases raised by the annotators. The annotators looped through Phases 2 and 3 of the process until the problematic cases were resolved.
NIL Clustering: particular to the 2015 challenge, a seed cluster generation algorithm, which merges string- and type-identical named entity mentions, was used to generate an initial NIL clustering.
Consensus: also in the 2015 challenge, based on the results of the seed algorithm, the third annotator manually verified all NIL clusters in order to remove links asserted to the wrong cluster, and merge clusters referring to the same entity. Special attention was paid to name variations such as acronyms, misspellings, and similar names.
Adjudication: the challenge participants reported incorrect or missing annotations. Each reported mention was evaluated by one of the challenge chairs to check compliance with the NEEL Challenge Annotation Guidelines, and additions and corrections were made as required.
In the 2016 challenge, the training set was built on top of the 2014 and 2015 datasets in order to provide continuity with previous years and to build upon existing findings. The 2016 challenge used the NEEL Challenge Annotation Guidelines provided in 2015. Due to the intensity of the annotation task, 10% of the test set was annotated manually. The participants were asked to annotate the entire corpus of tweets.
Manual annotation: the data set was divided into 2 batches, one for each annotator. In this phase, annotations were performed using GATE. The annotators were asked to analyse the annotations generated in Phase 1 by adding or removing entity annotations as required. The annotators were also asked to mark any ambiguous cases encountered. Along with the batches, the annotators received the NEEL Challenge Annotation Guidelines.
Consistency: the annotators checked each other's annotations and generated the gold standard (for the training, development and test sets). Three main tasks were carried out here: (i) consistency checking of entity types; (ii) consistency checking of URIs; (iii) resolution of ambiguous cases raised by the annotators. The annotators iterated between Phases 1 and 2 until the problematic cases were resolved.
NIL Clustering: an unsupervised NIL Clustering generation was performed, using a seed cluster generation algorithm based on exact string matching of mention strings and their types.
Consensus: one of the two expert annotators went through all NIL clusters in order to, where appropriate, include or exclude them from a given cluster.
Adjudication: where the challenge participants reported incorrect or missing annotations. Each reported mention was evaluated by one of the challenge chairs to check compliance with the Challenge Annotation Guidelines, and additions and corrections were made as required.
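The seed cluster generation used in the 2015 and 2016 campaigns can be sketched as follows. This is a minimal sketch under our own assumptions: the flattened `(mention_id, surface_form, entity_type)` tuple layout and the function name are hypothetical, while the merging criterion (exact string and type identity) follows the description above.

```python
from collections import defaultdict

def seed_nil_clusters(nil_mentions):
    """Group NIL mentions whose surface form and type match exactly.

    nil_mentions: iterable of (mention_id, surface_form, entity_type)
    tuples -- a hypothetical flattened view of the corpus annotations.
    Returns a dict mapping (normalised surface form, type) to mention ids.
    """
    clusters = defaultdict(list)
    for mention_id, surface, etype in nil_mentions:
        # String-identical (case-folded) and type-identical mentions
        # are merged into the same seed cluster.
        clusters[(surface.lower(), etype)].append(mention_id)
    return dict(clusters)

mentions = [
    ("t1:0", "J. Doe", "Person"),
    ("t2:5", "j. doe", "Person"),
    ("t3:2", "J. Doe", "Organization"),
]
clusters = seed_nil_clusters(mentions)
```

The resulting seed clusters are then manually verified, merged or split by the annotators, as described in the Consensus phase above.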
The inter-annotator agreement (IAA) for the challenge datasets (2014, 2015 and 2016) is presented in Table 7. The inter-annotator agreement for the 2013 dataset could not be computed, as the challenge settings and intermediate data were not preserved in that edition.
Inter-annotator agreement on the challenge datasets
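Pairwise agreement of the kind reported in Table 7 can be computed, for instance, with Cohen's kappa over aligned token labels. This is an illustrative sketch only; the exact agreement measure used in each edition may differ from the one shown here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label.
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["Person", "O", "O", "Location", "O"]
b = ["Person", "O", "Person", "Location", "O"]
print(round(cohens_kappa(a, b), 3))  # → 0.688
```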
The lessons learnt from building high quality gold standards are: the annotation process must be guided by annotation guidelines; at least two annotators must be involved to ensure consistency; and feedback from the participants is valuable in improving the quality of the datasets, providing annotations complementary to the cases found by experts. The annotation guidelines, written by experts, must describe the annotation task (for instance, entity types and the NEEL Taxonomy) through examples, and must be regularly updated during the manual annotation stage to document special cases and issues encountered. To speed up the annotation process, it is good practice to employ an annotation tool; we used GATE because the annotation process was guided by a taxonomy-centric view. The annotation task took less time when the annotators shared the same background (e.g. all annotators were semantic web and natural language processing experts with experience in information extraction).
While the main goals of the 2013–2016 challenges were the same, and the 2014–2016 corpora are largely built on top of each other, there are some differences among the datasets. In this section, we analyse the different datasets according to the characteristics of the entities and events annotated in them. We reuse measures and scripts from [76] and add a readability analysis of the corpora. Note that for the Entity Linking analyses we can only compare the 2014–2016 NEEL corpora, since the 2013 corpus (CE2013) does not contain entity links.
Entity overlap
Table 8 presents the entity overlap between the different datasets. Each row in the table represents the percentage of unique entities present in that dataset that are also represented in the other datasets.
Entity overlap in the analysed datasets. Behind the dataset name in each row the number of unique entities present in that dataset is given. For each dataset pair the overlap is given as the number of entities and percentage (in parentheses)
We define the true confusability of a surface form (by surface form we refer to the lexical value of the mention) as the number of distinct meanings, i.e. entity referents, associated with it in the corpus.
The confusability of a location name offers only a rough
Confusability stats for analysed datasets. Average stands for average number of meanings per surface form, Min. and Max. stand for the minimum and maximum number of meanings per surface form found in the corpus respectively, and
We define the true dominance of an entity resource, where an entity resource is an entry in a knowledge base that describes that entity, for example a DBpedia resource.
The dominance statistics for the analysed datasets are presented in Table 10. The dominance scores for all corpora are quite high and the standard deviation is low, meaning that in the vast majority of cases a single resource is associated with a given surface form in the annotations, leaving little variance for an automatic disambiguation system to resolve.
Dominance stats for analysed datasets
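The confusability and dominance statistics above can be computed from the gold annotations along the following lines. This is a sketch under our own assumptions: the pair-based input format is hypothetical, and dominance is simplified here to a per-surface-form ratio (the relative frequency of the most common resource), consistent with the observation that a single resource is usually associated with a surface form.

```python
from collections import Counter, defaultdict

def confusability_and_dominance(annotations):
    """annotations: iterable of (surface_form, resource_uri) pairs.

    Confusability of a surface form: the number of distinct resources
    it is linked to in the corpus. Dominance (simplified here per
    surface form): the relative frequency of its most common resource.
    """
    by_surface = defaultdict(Counter)
    for surface, resource in annotations:
        by_surface[surface.lower()][resource] += 1
    confusability = {s: len(c) for s, c in by_surface.items()}
    dominance = {s: c.most_common(1)[0][1] / sum(c.values())
                 for s, c in by_surface.items()}
    return confusability, dominance

ann = [("Paris", "dbr:Paris"), ("Paris", "dbr:Paris"),
       ("Paris", "dbr:Paris_Hilton"), ("NEEL", "NIL1")]
conf, dom = confusability_and_dominance(ann)
```

Averaging, minimum and maximum over the per-surface-form values yields the summary figures reported in the tables.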
In this section, we have analysed the corpora in terms of their variance in named entities and readability.
As the datasets are built on top of each other, they show a fair amount of entity overlap. This need not be a problem if there is enough variation among the entities, but the confusability and dominance statistics show that there are very few entities in our datasets with many different referents (“John Smiths”), and when such an entity is present, often only one of its referents is meant. To remedy this, future entity linking corpora should take care to balance the entity distribution and include more variety.
We experimented with various readability measures to assess the reading difficulty of the tweet corpora. These measures indicate that tweets are generally not very difficult in terms of word and sentence length, but the abbreviations and slang present in tweets make them harder to interpret for readers outside the target community. To the best of our knowledge, there is no readability metric that takes this into account; we therefore chose not to include those experimental results in this article.
Emerging trends and systems overview
In the remainder of this analysis, we focus on two main tasks, namely Mention Detection and Candidate Selection. Thirty different approaches were applied in four editions of the challenge since 2013. Table 11 lists all ranked teams.
Per year submissions and number of runs for each team
Whilst there are substantial differences between the proposed approaches, a number of trends can be observed in the top-performing named entity recognition and linking approaches for tweets. Firstly, we observe the widespread adoption of data-driven approaches: while the first and second years of the challenge saw extensive use of off-the-shelf tools, the top-ranking systems from 2013–2016 show a high dependence on the training data. This is not surprising, since these approaches are supervised, but it clearly suggests that labeled data is necessary to reach top performance. Additionally, the extensive use of knowledge bases as dictionaries of typed entities and as holders of entity relations has dramatically improved performance over the years. This strategy overcomes the lexical limitations of a tweet and performs well on the identification of entities available in the referent knowledge base. A phase common to all submitted approaches is normalisation, i.e. smoothing the lexical variations of the tweets and translating them into language structures that can be better parsed by state-of-the-art approaches expecting more formal and well-formed text. Whilst the linguistic workflow favours sequential solutions, Entity Recognition and Linking for tweets is often proposed as a joint step using large knowledge bases as referent entity directories. While knowledge bases support linking entities to mentions in text, they cannot support the identification of novel and emerging entities. Ad-hoc solutions for generating NILs in tweets have been proposed, ranging from edit distance-based solutions to the use of Brown clustering.
Between the first NEEL challenge on Concept Extraction (CE) and the 2016 edition we observe the following:
tweet normalisation as the first step of any approach. This is generally defined as preprocessing, and it increases the expressiveness of the tweets, e.g. via the expansion of Twitter accounts and hashtags with the actual names of the entities they represent, the conversion of non-ASCII characters, and, generally, noise filtering;
the contribution of knowledge bases in the mention detection and typing task. This leads to higher coverage, which, along with the linguistic analysis and type prediction, better fits this particular domain;
the use of high performing end-to-end approaches for the candidate selection. Such a methodology was further developed with the addition of fuzzy distance functions operating over n-grams and acronyms;
the inclusion of a pruning stage to filter out candidate entities. This was presented in various approaches ranging from Learning-to-Rank to recasting the problem as a classification task. We observed that the approach based on a classifier reached better performance (in particular, the classifier that performed best for this task was implemented using a SVM based on a radial basis function kernel), however it required an extensive feature engineering of the feature set used as training;
utilising hierarchical clustering of mentions to aggregate exact mentions of the same entity in the text and thus complementing the knowledge base entity directory in case of absence of an entity;
a considerable decrease in off-the-shelf systems. These were popular in the first editions of NEEL, but in later editions their performance grew increasingly limited as the task became more constrained.
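The n-gram lexicon lookup with fuzzy matching mentioned in the list above can be sketched as follows. This is an illustrative sketch: `difflib`'s similarity ratio stands in for the edit-distance and acronym-aware functions used by the participants, and the lexicon is a toy example rather than a real DBpedia-derived dictionary.

```python
import difflib

def ngrams(tokens, max_n=3):
    """All token n-grams up to length max_n, joined into strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def lookup_candidates(tweet_tokens, lexicon, cutoff=0.9):
    """Match tweet n-grams against a lexicon of lower-cased entity labels.

    lexicon maps a label to a knowledge-base entry; difflib's similarity
    ratio stands in for the edit-distance functions used by participants.
    """
    candidates = {}
    labels = list(lexicon)
    for gram in ngrams(tweet_tokens):
        close = difflib.get_close_matches(gram.lower(), labels,
                                          n=1, cutoff=cutoff)
        if close:
            candidates[gram] = lexicon[close[0]]
    return candidates

lexicon = {"world cup": "dbr:World_Cup", "new york": "dbr:New_York_City"}
cands = lookup_candidates("watching the world cup in newyork".split(), lexicon)
```

Note how the fuzzy match recovers the concatenated "newyork", a typical tweet-style spelling.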
Table 12 provides an overview of the methods and features used in these four years, grouped according to the workflow steps listed in Fig. 1.
Map of the approaches per sub-task applied in the NEEL series of challenges from 2013 until 2016
https://github.com/semanticize/semanticizer
http://www.alchemyapi.com
http://www.opencalais.com
http://www.zemanta.com
Submissions and number of runs for each team for the Mention Detection phase
Table 13 presents a description of the approaches used for Mention Detection combined with Typing. Participants approached the task using lexical similarity matchers, machine learning algorithms, and hybrid methods combining the two. For 2013, the strategies yielding the best results were hybrid, with models relying on the application of off-the-shelf systems (e.g., AIDA [45], ANNIE [24], OpenNLP).
The 2014 systems approached the Mention Detection task by adding lexicons and features computed from DBpedia resources. System 14, the best performing system, matched n-grams computed from the text against lexicon entries taken from DBpedia. From the 2014 challenge on, we observe more approaches favouring recall in Mention Detection, while relying less on linguistic features. System 15, proposed by the same authors as the best performing system in 2014, addressed the Mention Detection task with a large set of linguistic and lexicon-related features (such as the probability of the candidate obtained from the Microsoft Web N-Gram services, or its appearance in WordNet), using an SVM classifier with a radial basis function kernel specifically trained on the challenge data. This approach resulted in high precision, but slightly penalised recall.
The 2015 best performing approach for Mention Detection, System 20, was largely inspired by the 2014 winning approach: n-grams were used to look up resources in DBpedia, together with a set of lexical features such as POS tags and position in tweets. The type was assigned by a Random Forest classifier specifically trained on the challenge dataset, using DBpedia-related features (such as PageRank [11]), word embeddings (contextual features), temporal popularity knowledge of an entity extracted from Wikipedia page view data, string similarity measures between the title of the entity and the mention (such as edit distance), and linguistic features (such as POS tags, position in tweets, and capitalisation).
The 2016 best performing system, System 26, implements a lexicon matcher that matches entities in the knowledge base against unigrams computed from the text. The approach includes a preliminary tweet normalisation stage, resolving acronyms and hashtags into mentions written in natural language.
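A normalisation stage of this kind can be sketched as follows. The lookup tables for handles and acronyms are hypothetical; real systems derive such expansions from Twitter profile data and gazetteer resources, and the hashtag splitter shown here handles only simple CamelCase patterns.

```python
import re
import unicodedata

def normalise_tweet(text, handle_names=None, acronyms=None):
    """Rough tweet normalisation: ASCII folding, @handle and hashtag
    expansion, acronym resolution. Lookup tables are caller-provided."""
    handle_names = handle_names or {}
    acronyms = acronyms or {}
    # Fold non-ASCII characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Expand user mentions with known display names.
    text = re.sub(r"@(\w+)",
                  lambda m: handle_names.get(m.group(1), m.group(1)), text)
    # Split CamelCase (and acronym-prefixed) hashtags into words.
    word = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")
    text = re.sub(r"#(\w+)", lambda m: " ".join(word.findall(m.group(1))), text)
    # Expand known acronyms token by token.
    return " ".join(acronyms.get(tok, tok) for tok in text.split())

out = normalise_tweet("@BBC covers the #UCICyclocross race",
                      handle_names={"BBC": "British Broadcasting Corporation"},
                      acronyms={"UCI": "Union Cycliste Internationale"})
```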
From 2014 on, the challenge task required participants to produce systems that were also able to link the detected mentions to their corresponding DBpedia resource (if existing). Table 14 describes the approaches taken by the 2014, 2015 and 2016 participants for the Candidate Detection and Selection, and NIL Clustering stages. In 2014, most of the systems proposed a Candidate Selection step subsequent to the Mention Detection stage, implementing the conventional linguistic pipeline: first detecting the mention, then looking for referents of the mention in the external knowledge base. This resulted in a set of candidate links, which were then ranked according to the similarity of the link to the mention and the surrounding text. However, the best performing system (System 14) approached the Candidate Selection jointly with Mention Detection and link assignment, proposing a so-called end-to-end system. As opposed to most participants, who used off-the-shelf tools, System 14 proposed a SMART gradient boosting algorithm [33], specifically trained on the challenge dataset with textual features (such as textual similarity and contextual similarity), graph-based features (such as semantic cohesiveness between entity–entity and entity–mention pairs), and statistical features (such as mention popularity using the Web as archive). The majority of the systems, including System 14, applied name normalisation for feature extraction, which was useful for identifying entities originally appearing as hashtags or username mentions. Among the most commonly used external knowledge sources are NER dictionaries (e.g., Google CrossWiki), knowledge base gazetteers (e.g., Yago, DBpedia), weighted lexicons (e.g., Freebase, Wikipedia), and other sources (e.g., Microsoft Web N-gram).
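As an illustration of such feature-based candidate ranking, the following sketch scores candidates with a two-feature linear combination. This toy score stands in for learned rankers such as the gradient boosting model above; the feature names, the `popularity` values and the weights are our own assumptions.

```python
import difflib

def rank_candidates(mention, candidates, weights=(0.7, 0.3)):
    """Rank candidate knowledge-base entries for a mention.

    candidates: list of dicts with 'uri', 'label' and a 'popularity'
    score normalised to [0, 1]. The two-feature linear score is an
    illustrative stand-in for the learned rankers described above.
    """
    w_sim, w_pop = weights
    def score(c):
        # Surface similarity between the mention and the candidate label.
        sim = difflib.SequenceMatcher(None, mention.lower(),
                                      c["label"].lower()).ratio()
        return w_sim * sim + w_pop * c["popularity"]
    return sorted(candidates, key=score, reverse=True)

cands = [
    {"uri": "dbr:Paris", "label": "Paris", "popularity": 0.9},
    {"uri": "dbr:Paris_Hilton", "label": "Paris Hilton", "popularity": 0.4},
]
best = rank_candidates("paris", cands)[0]
```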
In the 2015 challenge, System 20 (ranked first) proposed an enhanced version of the 2014 challenge winner's approach, combined with a pruning stage meant to increase the precision of the Candidate Selection while considering the role of the entity type assigned by a Conditional Random Field (CRF) classifier. In particular, System 20 is a five-stage sequential approach: preprocessing, generation of potential entity mentions, candidate selection, NIL detection, and entity mention typing. In the preprocessing stage, a tokenisation and Part-of-Speech (POS) tagging approach based on [37] was used, along with the extraction of tweet timestamps. The generation of potential entity mentions is addressed by computing n-grams (with
In 2016, the top performing system, System 26, proposed a lexicon-based joint Mention Extraction and Candidate Selection approach, where unigrams from tweets are mapped to DBpedia entities. A preprocessing stage cleans the tweets, assigns part-of-speech tags, and normalises the initial tweets by converting alphabetic, numeric, and symbolic Unicode characters to ASCII equivalents. For each entity candidate, the system considers local and context-related features. Local features include the edit distance between the candidate labels and the n-gram, the candidate's link-graph popularity, its DBpedia type, the provenance of the label, and the best-matching surface form. The context-related features assess the relation of a candidate entity to the other candidates within the given context; they include graph distance measurements, connected component analysis, and centrality and density observations using the DBpedia graph as pivot. The candidates are ranked by a confidence score, which is used to decide whether the entity actually describes the mention. If the confidence score is lower than an empirically determined threshold, the mention is annotated as NIL.
The other approaches implement linguistic pipelines where the Candidate Selection is performed by looking up entities whose exact lexical value matches DBpedia titles, redirect pages, and disambiguation pages. We also observed a reduction in complexity for the NIL clustering, which came to rely only on the lexical distance of the mentions: System 27 used the Monge–Elkan similarity measure [21], while System 28 experimented with the normalised Damerau–Levenshtein distance, which performed better than Brown clustering [12].
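NIL clustering over a normalised Damerau–Levenshtein distance can be sketched as follows. This uses the restricted (optimal string alignment) variant of the distance; the greedy single-link grouping and the threshold value are our own simplifications, not the participants' exact procedure.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalised_dl(a, b):
    return osa_distance(a, b) / max(len(a), len(b), 1)

def cluster_nils(mentions, threshold=0.3):
    """Greedy single-link grouping of NIL mentions under a distance cap."""
    clusters = []
    for m in mentions:
        for cluster in clusters:
            if any(normalised_dl(m.lower(), o.lower()) <= threshold
                   for o in cluster):
                cluster.append(m)
                break
        else:
            clusters.append([m])
    return clusters

groups = cluster_nils(["Jon Smith", "Jno Smith", "Acme Corp"])
```

The transposed spelling "Jno Smith" lands in the same cluster as "Jon Smith", the kind of name variation the Consensus phase pays special attention to.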
Submissions and number of runs for each team for the Candidate Selection phase
In this section, the evaluation metrics used in the different challenges are described.
2013 evaluation measures
In 2013, the submitted systems were evaluated based on performance in extracting a mention and assigning its correct class as assigned in the Gold Standard
We performed a
Since we require strict matches, a system must both detect the correct mention
Precision and recall are then computed on a per-entity-type basis as:
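With $TP_t$, $FP_t$ and $FN_t$ denoting the true positives, false positives and false negatives for entity type $t$ (notation ours, reconstructing the standard definitions):

```latex
P_t = \frac{TP_t}{TP_t + FP_t}, \qquad
R_t = \frac{TP_t}{TP_t + FN_t}, \qquad
F1_t = \frac{2 \, P_t \, R_t}{P_t + R_t}
```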
Submissions were evaluated offline: participants were asked to annotate a test set within a short time window and to send the results in a TSV (tab separated value) file.
In 2014, a system was evaluated on the mentions it extracted and the links it assigned. We considered all DBpedia v3.9 resources valid. Since the 2014 NEEL Challenge, we opted to weigh all instances of an entity equally.
The evaluation procedure involved an
Submissions were evaluated offline, where participants were asked to annotate in a short time window the TS and to send the results in a TSV file.
In the 2015 and 2016 editions of the NEEL challenge, systems were evaluated according to the number of mentions correctly detected, their type correctly asserted (i.e. output of Mention Detection and Entity Typing), the links correctly assigned between a mention in a tweet and a knowledge base entry, and a NIL assigned when no knowledge base entry disambiguates the mention.
The required outputs were measured using a set of three evaluation metrics:
The strong_typed_mention_match metric jointly evaluates the Mention Detection and Typing stages: a mention counts as correct only if both its span and its type match the gold standard.
The strong_link_match metric evaluates the Candidate Selection stage: a link counts as correct only if the knowledge base entry assigned to a mention matches the gold standard link.
The last metric in our evaluation score is the Constrained Entity-Alignment F-measure (CEAF) [53]. This is a metric that measures coreference chains and is used to jointly evaluate Candidate Selection and NIL Clustering steps. Let
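For reference, CEAF can be summarised as follows (notation ours, after [53]): given gold clusters $G_i$ and system clusters $S_j$, CEAF searches for the one-to-one alignment $g^{*}$ maximising the total similarity $\Phi$, where for mention-based CEAF the cluster similarity is $\phi(G_i, S_j) = |G_i \cap S_j|$:

```latex
\Phi(g) = \sum_{(G_i, S_j) \in g} \phi(G_i, S_j), \qquad
P = \frac{\Phi(g^{*})}{\sum_j \phi(S_j, S_j)}, \qquad
R = \frac{\Phi(g^{*})}{\sum_i \phi(G_i, G_i)}
```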
In 2015, submissions were evaluated through an online process: participants were required to implement their systems as publicly accessible web services following a REST-based protocol, and could submit up to 10 contending entries to a registry of NEEL challenge services. Each endpoint had a Web address (URI) and a name, referred to as the runID. Upon receiving the registration of the REST endpoint, calls to the contending entry were scheduled for two different time windows, namely,

As setting up a REST API increased the implementation load on the participants, we reverted to an offline evaluation setup in 2016. As in previous challenges, participants were asked to annotate the TS during a short time window and to send the results in a TSV file, which was then evaluated by the challenge chairs.
Three editions out of four followed an offline evaluation procedure; a discontinuity was introduced in 2015 with the online evaluation procedure. Two issues were noted by the participants of the 2015 edition: (i) the increasing complexity of the task, going beyond the pure NEEL objectives; (ii) the unfair comparison of computing time with respect to big players, who can afford better computing resources than small research teams. These issues motivated the return to a conventional offline procedure for the 2016 edition. The emerging trend is the consolidation of a de-facto standard scorer, proposed in TAC-KBP and now successfully adopted and widely used in our community. This scorer measures the performance of the approaches across the entire annotation pipeline: Mention Extraction, Candidate Selection, Typing, and the detection of novel and emerging entities in highly dynamic contexts such as tweets.
Results
This section presents a compilation of the NEEL challenge results across the years. As the NEEL task evolved, the results among these years are not entirely comparable. Table 15 shows results for the 2013 challenge task, where we report scores averaged for the four entity types analysed on this task.
Scores achieved for the NEEL 2013 submissions
The 2013 task consisted of building systems that could identify four entity types (i.e., Person, Location, Organisation and Miscellaneous) in a tweet. This task proved to be challenging, with some approaches favouring precision over recall. The best precision was obtained by Team 1, which used a combination of rule-based and data-driven approaches, achieving 76.4% precision. For recall, results varied across the four entity types, with the miscellaneous and organisation types ranking lowest. Averaging over entity types, the best results were obtained by Team 2, whose solution relied on gazetteers. All top-3 teams ranked by F-measure followed a hybrid approach combining rules and gazetteers.
The 2014 challenge task extended the concept extraction challenge by considering not only entity type recognition but also the linking of entities to the DBpedia v3.9 knowledge base. Table 16 presents the results for this task, which follow the evaluation described in Section 7. There was a clear winner, proposed by the Microsoft Research Lab Redmond, that outperformed all other systems on all three metrics.
Scores achieved for the NEEL 2014 submissions
The 2015 task extended the 2014 recognition and linking tasks with a clustering task: participants had to provide clusters where each cluster contained only mentions of the same real-world entity. For 2015 we also computed the latency of each system. Table 17 presents a ranked list of results for the 2015 submissions. The last column shows the final score for each participant following Eq. (14). Here the winner (Team 20) outperformed the second best with a boost in tagging
Scores achieved for the NEEL 2015 submissions. Tagging refers to
Finally, the 2016 challenge followed the same task as 2015. Table 18 presents a ranked list of results for the 2016 submissions. Team 26 outperformed all other participants, with an overall
Scores achieved for the NEEL 2016 submissions. Tagging refers to
The NEEL challenge series was established in 2013 to foster the development of novel automated approaches for mining semantics from tweets and providing standardised benchmark corpora enabling the community to compare systems.
This paper describes the decisions and procedures followed in setting up and running the task. We first described the annotation procedures used to create the NEEL corpora over the years. The procedures were incrementally adjusted over time to provide continuity and ensure reusability of the approaches over the different editions. While the consolidation has provided consistent labeled data, it has also shown the robustness of the community.
We also described the different approaches proposed by the NEEL challenge participants. Over the years, we witnessed a convergence of the approaches towards data-driven solutions supported by knowledge bases. Knowledge bases are prominently used as a source for discovering known entities, relations among data, and labelled data for selecting candidates and suggesting novel entities. Data-driven approaches have become, with variations, the leading solution. Despite the consolidated number of options for addressing the challenge task, the participants' results show that the NEEL task remains challenging in the microposts domain.
Furthermore, we explained the different evaluation strategies used in different challenges. These changes were driven by a desire to ensure fairness of the evaluation, transparency, and correctness. These adaptations involve the use of in-house scoring tools in 2013 and 2014, which were made publicly available and discussed in the community. Since 2015 the TAC-KBP challenge scorer was adopted to both leverage the wide experience developed in the TAC-KBP community and break down the analysis to account for the clustering.
Thanks to the yearly releases of the annotations and tweet IDs under a public license, the NEEL corpus has started to become widely adopted. Beyond the thirty teams who completed the evaluations over four years, more than three hundred participants have contacted the NEEL organisers with requests to acquire the corpora. The teams come from more than twenty different countries, from both academia and industry. The 2014 and 2015 winners were companies operating in the field, respectively Microsoft and Studio Ousia; the 2013 and 2016 winners were academic teams. The success of the NEEL challenges is also illustrated by the sponsorships of the challenges offered by companies (eBay
The NEEL challenges also triggered the interest of local communities such as NEEL-IT. This community is pushing the NEEL Challenge Annotation Guidelines (with minor variations due to the intra-language dependencies) and know-how to create a benchmark for sharing the algorithms and results of mining semantics from Italian tweets. In 2015, we also built bridges with the TAC community. We plan to strengthen these and to involve a larger audience of potential participants ranging from Linguistics, Machine Learning, Knowledge Extraction, Data and Web Science.
Future work involves the generation of corpora that account for the low variance of entity-type semantics. We aim to create larger datasets covering a broader range of entity types and domains within the Twitter sphere. The 2015 enhancements to the evaluation strategy, which accounted for computational time, highlighted new challenges concerning an algorithm's efficiency versus its efficacy. Since more efforts on handling large-scale data mining involve distributed computing and optimisation, we aim to develop new evaluation strategies that ensure the fairness of the results when asking participants to produce large-scale annotations in a small window of time. Among future efforts, we aim to identify the differences in performance among the disparate systems and their approaches, first characterising what can be considered an error in the context of the challenge, and then deriving insightful conclusions about the building blocks needed to automatically build an optimal system.
Finally, given the increasing interest in adopting the NEEL guidelines in creating corpora for other languages, we aim to develop a multilingual NEEL challenge as a future activity.
Acknowledgements
This work was supported by the H2020 FREME project (GA no. 644771), by the research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, and by the CLARIAH-CORE project financed by the Netherlands Organisation for Scientific Research (NWO).
NEEL taxonomy
Thing:
languages
ethnic groups
nationalities
religions
diseases
sports
astronomical objects
Event:
holidays
sport events
political events
social events
Character:
fictional characters
comic characters
title characters
Location:
public places (squares, opera houses, museums, schools, markets, airports, stations, swimming pools, hospitals, sports facilities, youth centers, parks, town halls, theatres, cinemas, galleries, universities, churches, medical centers, parking lots, cemeteries)
regions (villages, towns, cities, provinces, countries, continents, dioceses, parishes)
commercial places (pubs, restaurants, depots, hostels, hotels, industrial parks, nightclubs, music venues, bike shops)
buildings (houses, monasteries, creches, mills, army barracks, castles, retirement homes, towers, halls, rooms, vicarages, courtyards)
Organization:
companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives)
subdivisions of companies
brands
political parties
government bodies (ministries, councils, courts, political unions)
press names (magazines, newspapers, journals)
public organizations (schools, universities, charities)
collections of people (sport teams, associations, theater companies, religious orders, youth organizations, musical bands)
Person:
people’s names (titles and roles are not included, such as Dr. or President)
Product:
movies
tv series
music albums
press products (journals, newspapers, magazines, books, blogs)
devices (cars, vehicles, electronic devices)
operating systems
programming languages
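The taxonomy groups these subcategories under the seven top-level NEEL types (Thing, Event, Character, Location, Organization, Person, Product). As an illustration only, not part of the challenge tooling, such a taxonomy can be represented as a simple lookup from top-level type to subcategories; the abbreviated lists and helper below are assumptions for the sketch.

```python
# Illustrative sketch: the NEEL taxonomy as a Python mapping from
# top-level type to subcategories (lists abbreviated for brevity).
NEEL_TAXONOMY = {
    "Thing": ["languages", "ethnic groups", "nationalities",
              "religions", "diseases", "sports", "astronomical objects"],
    "Event": ["holidays", "sport events", "political events", "social events"],
    "Character": ["fictional characters", "comic characters", "title characters"],
    "Location": ["public places", "regions", "commercial places", "buildings"],
    "Organization": ["companies", "subdivisions of companies", "brands",
                     "political parties", "government bodies", "press names",
                     "public organizations", "collections of people"],
    "Person": ["people's names"],
    "Product": ["movies", "tv series", "music albums", "press products",
                "devices", "operating systems", "programming languages"],
}


def neel_type_of(subcategory):
    """Return the top-level NEEL type for a subcategory, or None."""
    for neel_type, subcategories in NEEL_TAXONOMY.items():
        if subcategory in subcategories:
            return neel_type
    return None
```

A lookup like this lets an annotation tool validate that every assigned type resolves to one of the seven top-level classes.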
NEEL Challenge annotation guidelines
The challenge task consists of three consecutive stages: 1) extraction and typing of entity mentions within a tweet; 2) linking of each mention to an entry in the English DBpedia (in the 2016 NEEL Challenge, DBpedia 2015-04 was the referent knowledge base), or to NIL when no knowledge base referent exists; 3) clustering of the NIL mentions that refer to the same entity.
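The three stages yield, per mention, an offset span in the tweet, a link (a DBpedia URI or a NIL cluster identifier) and a NEEL type. The sketch below shows one way such an annotation record could be modelled and parsed from a tab-separated line; the field names and column order are illustrative assumptions, not the official submission format.

```python
from dataclasses import dataclass


@dataclass
class NeelAnnotation:
    """One annotated entity mention in a tweet (illustrative schema:
    field names and column order are assumptions, not the official format)."""
    tweet_id: str
    start: int       # character offset where the mention begins
    end: int         # character offset where the mention ends
    link: str        # DBpedia URI, or a NIL cluster identifier (e.g. "NIL1")
    neel_type: str   # one of the NEEL taxonomy types, e.g. "Person"

    def is_nil(self):
        # Mentions without a DBpedia referent are grounded to NIL clusters.
        return self.link.startswith("NIL")


def parse_annotation(line):
    """Parse one tab-separated annotation line (hypothetical layout)."""
    tweet_id, start, end, link, neel_type = line.rstrip("\n").split("\t")
    return NeelAnnotation(tweet_id, int(start), int(end), link, neel_type)
```

For example, `parse_annotation("123456\t0\t11\thttp://dbpedia.org/resource/Barack_Obama\tPerson")` would produce a linked `Person` mention, while a `link` field of `NIL1` would mark a mention to be clustered with other co-referring NIL mentions.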
