Abstract
The large number of tweets generated daily provides decision makers with the means to obtain insights into recent events around the globe in near real-time. The main barrier to extracting such insights is the impossibility of manually inspecting such a diverse and dynamic volume of information. This problem has attracted the attention of industry and research communities, resulting in algorithms for automatically extracting semantics from tweets and linking them to machine-readable resources. While a tweet is superficially comparable to any other textual content, it hides a complex and challenging structure that requires domain-specific computational approaches for mining its semantics. The NEEL challenge series, established in 2013, has contributed to the collection of emerging trends in the field and the definition of standardised benchmark corpora for entity recognition and linking in tweets, ensuring high-quality labelled data that facilitates comparisons between different approaches. This article reports the findings and lessons learnt through an analysis of specific characteristics of the created corpora, their limitations, lessons learnt from the different participants, and pointers for furthering the field of entity recognition and linking in tweets.
Introduction
Tweets have been proven to be useful in different applications and contexts such as music recommendation, spam detection, emergency response, market analysis, and decision making. The limited number of tokens in a tweet, however, implies a lack of sufficient contextual information for understanding its content. A commonly used approach is to extract named entities from the tweet and link them to external, machine-readable resources that enrich its context.
The automated identification, classification and linking of named entities has proven to be challenging due to, among other things, the inherent characteristics of tweets:
The NEEL challenge series, first established in 2013 and run yearly since then, has captured a community need for making sense of tweets through a wealth of high-quality annotated corpora and for monitoring the emerging trends in the field. The first edition of the challenge, named the Concept Extraction (CE) Challenge [16], focused on entity identification and classification. A step further into this task is to ground entities in tweets by linking them to knowledge base referents. This prompted the Named Entity Extraction and Linking (NEEL) Challenge the following year [15]. Building on the tasks proposed in 2013 and 2014, these two research avenues prompted the Named Entity rEcognition and Linking (NEEL) Challenge in 2015 [65]. In 2015, the role of the named entity type in the grounding process was investigated, as well as the identification of named entities that cannot be grounded because they have no knowledge base referent (defined as NIL). The English DBpedia 2014 dataset was the designated referent knowledge base for the 2015 NEEL challenge, and the evaluation was performed by automatically querying, live, the Web APIs prepared by the participants, which also made it possible to measure computing time. The 2016 edition [68] consolidated the 2015 edition, using the English DBpedia 2015-04 version as referent knowledge base. This edition proposed an offline evaluation in which computing time was not taken into account in the final evaluation.
The four challenges have published four incremental, manually labelled benchmark corpora. The creation of the corpora followed strict guidelines and protocols, resulting in high-quality labelled data that can be used as seeds for reasoning and supervised approaches. Despite these protocols, the corpora have strengths and weaknesses that we have discovered over the years; these are discussed in this article.
The purpose of each challenge was to set up an open and competitive environment that would encourage participants to deliver novel approaches or improve on existing ones for recognising and linking entities from tweets to either a referent knowledge base entry or NIL where such an entry does not exist. From the first (in 2013) to the 2016 NEEL challenge, thirty research teams have submitted at least one entry to the competitions proposing state-of-the-art approaches. More than three hundred teams have explicitly acquired the corpora in the four years, underlining the importance of the challenges in the field.1 This number does not account for the teams who experimented with the corpora out of the challenges’ timeline.
This paper reports on the findings and lessons learnt from the last four years of NEEL challenges, analysing the corpora in detail, highlighting their limitations, and distilling guidance from the top-performing approaches of the different participants. The resulting body of work has implications for researchers, application designers and social media engineers who wish to harvest information from tweets for their own objectives. The remainder of this paper is structured as follows: in Section 2 we compare the NEEL series with recent shared tasks in entity recognition and linking and explain the need that prompted its establishment. Next, in Section 3, the decisions behind the different versions of the NEEL challenge are introduced and the initiative is compared against the other shared tasks. We then detail the steps followed in generating the four different corpora in Section 4, followed by a quantitative and qualitative analysis of the corpora in Section 5. We then list the different approaches presented and identify the emerging trends in Section 6, grounding the trends according to the evaluation strategies presented in Section 7. Section 8 reports the participants' results and provides an error analysis. We conclude and list our future activities in Section 9. The appendices provide the NEEL Taxonomy (Appendix A) and the NEEL Challenge Annotation Guidelines (Appendix B).
The first research challenge to identify the importance of the recognition of entities in textual documents was held in 1997 during the 7th Message Understanding Conference (MUC-7) [20]. In this challenge, the term
Recognising an entity in a textual document was the first big challenge, but after overcoming this obstacle, the research community moved on to a second, equally challenging task: disambiguating entities. This problem arises when a mention in text may refer to more than one entity. For instance, the mention
The Entity Disambiguation task was popularised after Bunescu and Pasca [13], in 2006, explored the use of an encyclopaedia as a source of entities. In particular, after [23] demonstrated the benefit of using Wikipedia,6
In 2009, the TAC-KBP challenge [54] introduced a new problem to both the Entity Recognition and Entity Disambiguation communities. In Entity Recognition, a mention is recognised in text without information about the exact entity referred to by the mention. On the other hand, Entity Disambiguation focuses only on the resolution of entities that have a referent in a provided knowledge base. The TAC-KBP challenge highlighted the problem that a mention identified in text may not have a referent entity in the knowledge base. In this case, the suggestion was to link such a mention to a NIL entity in order to indicate that it is not present in the knowledge base. This problem was referred to as Named Entity Linking and it remains a hard and open research problem. Nowadays, however, the terms Entity Disambiguation and Entity Linking have become interchangeable.
Since the TAC-KBP challenge, there has been an explosion in the number of algorithms proposed for Entity Linking using a variety of textual documents, Knowledge Bases, and even using different definitions of entities. This variety, whilst beneficial, also extends to how approaches are evaluated, regarding metrics and gold standard datasets used. Such diversity makes it difficult to perform comparisons between various Entity Linking algorithms and creates the need for benchmark initiatives.
In this section, we first introduce the main components of the Entity Linking task and their possible variations, followed by a typical workflow used to approach the task, the expected output of each step, and three strategies for the evaluation of Entity Linking systems. We conclude with an overview of benchmark initiatives and their decisions regarding the use of Entity Linking components and evaluation strategies.
Entity Linking is defined as
In this definition, Entity Linking contains three main components: text, knowledge base, and entity. The features of each component may vary, and consequently, have an impact on the results of algorithms used to perform the task. For instance, a state-of-the-art solution based on long textual documents may have a poor performance when evaluated over short documents with little contextual information within the text. In a similar manner, a solution developed to link entities of types Person, Location, and Organisation may not be able to link entities of type Movie. Therefore, the choice of each component defines which type of solutions are being evaluated by each specific benchmark initiative.
Textual document
In Entity Linking, textual documents are usually divided into two main categories: long text and short text. Long textual documents usually contain more than 400 words; examples include news articles and web sites. Short documents, such as microposts7 (the term used in the social media field to refer to tweets and social media posts in general), contain far fewer words.
Long textual documents provide a series of document-level features that can be explored for Entity Linking such as: the presence of multiple entity mentions in a single document; well-written text (expressed by the lack or relative absence of misspellings); and the availability of contextual information that supports the grounding of each mention. Contextual information entails the supporting facts that help in deciding the best knowledge base entry to be linked with a given mention. For instance, let us assume the knowledge base has two candidate entries to be linked with the mention
Short text documents are considered more challenging than long ones because they have the exact opposite features such as: the presence of few entity mentions in a single document (due to the limited size of the text); the presence of misspellings or phonetic spelling (e.g. “I call u 2morrow” rather than “I call you tomorrow”); and the low availability of contextual information within the text. It is important to note though that even within the short text category there are still important distinctions between microposts and search queries that may impact the performance of Entity Linking algorithms.
In performing a search, it is expected that the search query will be composed by a mention to the entity of interest being searched and, sometimes, by additional contextual information. Therefore, despite the challenge of being a short text document, search queries are assumed to contain at least one mention to an entity and likely to contain additional contextual information. However, for microposts this assumption does not hold.
Microposts do not necessarily have an entity as target. For instance, a document with the content “So happy today!!!” does not explicitly cite any entity mention. Also, microposts may be used to talk about entities without providing any context within the text, as in “Adele, you rock!!”. In this aspect, Entity Linking for microposts is more challenging than for search queries because it is unclear if a micropost will contain an entity and context for the linking. Furthermore, microposts are also more likely to contain misspellings and phonetic writing than search queries. If a search engine user misspells a term then it is very likely that she will not find the desired information. In this case, it is safe to assume that search engine users will try to avoid misspellings and phonetic writing as much as possible. On the other hand, in micropost communities, misspellings and phonetic writing are used as strategies to shorten words, thus enabling the communication of more information within a single micropost. Therefore, misspelling and phonetic writing are common features of microposts and need to be taken into consideration when performing Entity Linking.
The second component of Entity Linking we consider is the knowledge base used. Knowledge bases differ from each other regarding the domains of knowledge they cover (e.g. domain-specific or encyclopedic knowledge), the features used to describe entries (e.g. long textual descriptions, attribute-value pairs, or relationship links between entities), their rate of updates, and the number of entities they cover.
As with textual documents, different characteristics will impact the Entity Linking task. The domain covered by the knowledge base will influence which entity mentions can possibly have a link. If there is a mismatch between the domain of the text (e.g. biomedical text) and the domain of the knowledge base (e.g. politics) then all, or most, entity mentions found in the text will have no reference in the knowledge base. In the extreme case of complete mismatch, the Entity Linking process is reduced to Entity Recognition. Therefore, in order to perform linking, the knowledge base should be at least partially related to the domain of the text being linked.
Furthermore, the features used to describe entities in the knowledge base influence which algorithms can make use of it. For instance, if entities are represented only through textual descriptions, a text-based algorithm needs to be used to find the best mention-entry link. If, however, knowledge base entries are only described through relationship links with other entities then a graph-based algorithm may be more suitable.
The third characteristic of a knowledge base which impacts Entity Linking is its rate of updates. Static knowledge bases (i.e. knowledge bases that are never or infrequently updated) represent only the status of a given domain at the moment they were generated. Any entity which becomes relevant to that domain after that point in time will not be represented in the knowledge base. Therefore, in a textual document, only mentions of entities that existed prior to the creation of the knowledge base will have a link; all others will be linked to NIL. The faster entities change in a given domain, the more likely it is for the knowledge base to become outdated. If the text and the knowledge base are completely disjoint, all mentions in the text would invariably be linked to NIL. Depending on the textual document to be linked, the rate of updates may or may not be an important feature. Social and news media are more likely to see a faster change in their entities of interest than manufacturing reports, for instance.
Another characteristic of a knowledge base which may impact Entity Linking is the number of entities it covers. Two knowledge bases with otherwise identical characteristics may still vary in the number of entities they cover. When applied to Entity Linking, the more entities a knowledge base covers, the more likely there will be a match between text and knowledge base. Of course, in this case we should assume that both knowledge bases focus on representing key entities in their domain rather than long-tail ones.
Entity
The third component of interest for Entity Linking is the definition of entity. Despite their importance for the Entity Linking task, entities are not formally defined. Instead, entities are defined either through example or through the data available. Named entities are the most common case of definition by example. Named entities were introduced in 1997 as part of the Message Understanding Conference as instances of the Person, Organisation, and Geo-political types. An extension of named entities is usually performed through the inclusion of additional types such as Location, Facility, or Movie. In these cases there is no formal definition of entities; rather, they are exemplars of a set of categories.
An alternative definition of entities assumes that entities are anything represented by the knowledge base. In other words, the definition of entity is given by the data available (in this case, data from the knowledge base). Whereas this definition makes the Entity Linking task easier by not requiring any refined “human-like” reasoning about types, it makes it impossible to identify NIL links. If an entity is anything in the knowledge base, how could we ever have, by definition, an entity which is not in the knowledge base?
The choice of entity definition will depend on the Entity Linking task desired. If the goal is to consider links to NIL, then the definition based on types is the most suitable; otherwise the definition based on the knowledge base may be used.
Typical Entity Linking workflow and evaluation strategies

Fig. 1. Typical Entity Linking workflow with expected output of each step.
Regardless of the different Entity Linking components, most proposed systems for Entity Linking follow a workflow similar to the one presented in Fig. 1. This workflow is composed of the following steps: Mention Detection, Entity Typing, Candidate Detection, Candidate Selection, and NIL Clustering. Note that, although it is usually a sequential workflow, there are approaches that create a feedback loop between different steps, or merge two or more steps into a single one.
The Mention Detection step receives textual documents as input and recognises all terms in the text that refer to entities; the goal of this step is to perform typical Named Entity Recognition. Next, the Entity Typing step detects the type of each mention previously recognised; this task is usually framed as a categorisation problem. Candidate Detection then receives the detected mentions and produces a list of all entries in the knowledge base that are candidates to be linked with each mention. In the Candidate Selection step, these candidate lists are processed and, by making use of the available contextual information, the correct link for each mention, either an entry from the knowledge base or a NIL reference, is provided. Last, the NIL Clustering step receives a series of mentions linked to NIL as input and generates clusters of mentions referring to the same entity, i.e. each cluster contains all NIL mentions representing one, and only one, entity, and no two clusters represent the same entity.
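The five steps above can be sketched as a minimal, self-contained pipeline. Everything here is an illustrative assumption (the toy knowledge base, the capitalisation-based detector, the stub type classifier and first-candidate selector), not the approach of any participant system:

```python
from dataclasses import dataclass

@dataclass
class Mention:
    surface: str           # text span as it appears in the micropost
    entity_type: str = ""  # filled in by entity_typing
    link: str = ""         # knowledge base entry, or "NIL" if no referent exists

# Toy knowledge base: surface form -> candidate entries (illustrative only)
KB = {"adele": ["dbpedia:Adele", "dbpedia:Adele_(given_name)"]}

def mention_detection(text):
    # Stub detector: capitalised tokens are treated as entity mentions.
    return [Mention(tok.strip(",!.")) for tok in text.split() if tok[:1].isupper()]

def entity_typing(mentions):
    for m in mentions:
        m.entity_type = "Person"  # stub classifier; real systems categorise per mention
    return mentions

def candidate_detection(mention):
    # All knowledge base entries that could match this mention.
    return KB.get(mention.surface.lower(), [])

def candidate_selection(mention, candidates):
    # Stub: pick the first candidate; real systems score candidates using context.
    mention.link = candidates[0] if candidates else "NIL"
    return mention

def nil_clustering(mentions):
    # Group NIL mentions by normalised surface form, one cluster per entity.
    clusters = {}
    for m in mentions:
        if m.link == "NIL":
            clusters.setdefault(m.surface.lower(), []).append(m)
    return clusters

def link_entities(text):
    mentions = entity_typing(mention_detection(text))
    for m in mentions:
        candidate_selection(m, candidate_detection(m))
    return mentions, nil_clustering(mentions)
```

For instance, `link_entities("Adele, you rock!!")` links the single detected mention to the toy knowledge base, while a mention absent from the knowledge base is linked to NIL and placed in a cluster.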
The evaluation of Entity Linking systems is based on this typical workflow and can be of three types: end-to-end, step-by-step, or partial end-to-end.
An
The opposite to an end-to-end evaluation is a
Finally, the
The number of variations in Entity Linking makes it hard to benchmark Entity Linking systems. Different research communities focus on different types of text and knowledge base, and different algorithms will perform better or worse on any specific step. In this section, we present the Entity Linking benchmark initiatives to date, the Entity Linking specifications used, and the communities involved. The challenges are summarised in Table 1.
Named Entity Recognition and Linking challenges since 2013
Entity Linking was first introduced in 2009 as a challenge for the Text Analysis Conference.8
In 2009, the TAC-KBP benchmark was not concerned with the recognition of entities in text, particularly because its entities of interest were instances of the types Organisation, Geo-political, and Person, and the recognition of these types of entities in text was already a well-established task in the community. The challenge was therefore mainly concerned with correct Entity Typing and Candidate Selection. In later years, Mention Detection and NIL Clustering were also included in the TAC-KBP pipeline [47]. More entity types, such as Location and Facility, as well as multiple languages, are now also considered [48].
Characteristics that have been constant in TAC-KBP are the use of long textual documents, entities given by Type, and the use of encyclopedic knowledge bases. A reason for long textual documents would be that this type of text is more likely to contain contextual information to populate a knowledge base, in particular news articles and web sites. The use of entities given by Type is a direct consequence of the availability of named entity recognition algorithms based on types and the need for NIL detection. The use of an encyclopedic knowledge base was because Person, Organisation, and Geo-political entities are not domain-specific and due to the availability of Wikipedia as a free available knowledge base on the Web.
The Entity Recognition and Disambiguation (ERD) challenge [17] was a benchmark initiative organised in 2014 as part of the SIGIR conference9
The Information Retrieval community, and consequently the ERD challenge, focuses on the processing of large amounts of information. Therefore, the systems evaluated should provide not only the correct results but also fulfill basic standards for large scale web systems, i.e. they should be available through Web APIs for public use, they should accept a minimum number of requests without timeout, and they should ensure a minimum uptime availability. All these standards were translated into the evaluation method of the ERD challenge that required systems to have a given Web API available for querying during the time of the evaluation. Also, large scale web systems are evaluated regarding how useful their output is for the task at hand regardless of the internal algorithms used, so the evaluation used by ERD was an end-to-end evaluation using standard information retrieval evaluation metrics (i.e. precision, recall, and f-measure).
The community of natural language processing and computational linguistics within the ACL-IJCNLP11
In 2015, the Workshop on Noisy User-generated Text (W-NUT) [4] promoted the study of documents that are not written in standard English, with tweets as the focus of its two shared tasks. One of these tasks was targeted at the normalisation of text. In other words, expressions such as “r u coming 2” should be normalised into standard English on the form of “are you coming to”. The second task proposed named entity recognition within tweets in which systems were required to detect mentions to entities corresponding to a list of ten entity types. This proposed task corresponds to the first two steps of the Entity Linking workflow: Mention Detection and Entity Typing.
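The normalisation task can be illustrated with a minimal dictionary-based sketch. The lexicon and function name below are hypothetical; actual W-NUT systems combine such lexicons with character-level and context-sensitive models:

```python
# Toy substitution lexicon mapping common micropost shortenings to standard English.
NORM_LEXICON = {"r": "are", "u": "you", "2": "to", "2morrow": "tomorrow"}

def normalise(micropost):
    # Replace each token found in the lexicon; leave all other tokens untouched.
    return " ".join(NORM_LEXICON.get(tok.lower(), tok) for tok in micropost.split())
```

Under this lexicon, `normalise("r u coming 2")` yields the standard-English form used in the example above. A purely lexical approach cannot resolve ambiguous tokens (e.g. “2” as a digit versus “to”), which is why context models are needed in practice.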
Word Sense Disambiguation and Entity Linking are two tasks that perform disambiguation of textual documents through links with a knowledge base. Their main difference is that the former disambiguates the meaning of words with respect to a dictionary of word senses, whereas the latter disambiguates words with respect to a list of entity referents. These two tasks have historically been treated as distinct since they require knowledge bases of a dissimilar nature. However, with the development of BabelNet, a knowledge base containing both entities and word senses, Word Sense Disambiguation and Entity Linking could finally be performed using a single knowledge base.
In 2015, a shared task for Multilingual All-Words Sense Disambiguation and Entity Linking [57] using BabelNet was proposed as part of the International Workshop on Semantic Evaluation (SemEval).12
Named Entity Recognition and Entity Linking have been active research topics since their introduction by MUC-7 in 1997 and TAC-KBP in 2009, respectively. The main focus of these initiatives had been on long textual documents, such as news articles, or web sites. Meanwhile, microposts emerged as a new type of communication on the Social Web and have been a widespread format to express opinions, sentiments, and facts about entities. The popularisation of microposts through the use of Twitter,13
The evolution of the NEEL challenge followed the evolution of Entity Linking. The challenge was first held in 2013 under the name of Concept Extraction (CE) and was concerned with the detection of mentions of entities in microposts and the specification of their types. The following year, under the acronym NEEL, the challenge also included linking mentions to an encyclopedic knowledge base or to NIL. In 2015 and 2016, NEEL was expanded to also include the clustering of NIL mentions.
To propose a fair benchmark for approaches to Entity Linking with microposts, the organisation of the NEEL challenge had to make certain decisions concerning different Entity Linking components and the available strategies for evaluation, always taking into consideration the trends and needs of the research community focused on the Web and microposts. In this section, we provide the motivation for these decisions. A discussion on their impact will be provided in later sections.
Taking this into account, we chose to use DBpedia [49], a structured knowledge base based on Wikipedia, mainly because it is frequently updated with entities appearing in events covered in social media. Another motivation for using DBpedia is that its format lends itself better to the task than Wikipedia itself. Each NEEL version used the latest available version of DBpedia.
In 2013, the list of entity types was based on the taxonomy used in CoNLL 2003 [71]. From 2014 onwards, the NEEL Taxonomy (Appendix A) was created with the goal of providing a more fine-grained classification of entities, representing a vast range of entities of interest in the context of the Web. The types of entities used and how the NEEL Taxonomy was built are described in Section 4.
The NEEL challenge has used different evaluation settings in different versions of the challenge. Each change has its own motivation, but the main focus for each of them was to provide a fair and comprehensive evaluation of the submitted systems.
The first decision regards the submission of a results file versus evaluation through Web APIs. Both approaches have their advantages and disadvantages. The use of a file lowers the bar for new participants in the challenge because they do not need to develop a Web API in addition to the usual Entity Linking steps, nor to keep a Web server available during the whole evaluation process. This was the model adopted in 2013, 2014, and 2016. However, during NEEL 2014, some participants suggested that the challenge should apply a blind evaluation, i.e. the participants should see the input data only at query time, in order to avoid the common mistake of tuning a system on the evaluation data. Therefore, in 2015 the submission of evaluation results was changed to Web API calls. The impact of this change was that a few teams could not participate in the challenge, mainly because their Web server was not available during the evaluation or their API did not have the correct signature. This format of evaluation also required extra effort from the organisers, who had to notify participant teams that their web servers were unavailable. Given the problems generated and the lack of any real benefit, the organisation opted to return to the file-based submission of system results used in previous years.
The second decision concerns the evaluation strategy, which impacts both the metrics used and the overall benchmark ranking. Here, the options are an end-to-end, a partial end-to-end, or a step-by-step evaluation. Borrowing from the named entity recognition community, the first two versions of the challenge (i.e. 2013 and 2014) were based on an end-to-end evaluation. In this evaluation, standard metrics (i.e. precision, recall, and f-measure) were applied on top of the aggregated results of the system. A drawback of end-to-end evaluation in Entity Linking is that if one step in the typical workflow performs poorly, its error propagates through to the last step; the evaluation therefore only measures the aggregated errors of all steps. This was not a problem when systems were required to perform one or two simple steps, but once the challenge started requiring a larger number of steps, a more fine-grained evaluation became necessary.
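As an illustration, an end-to-end evaluation with micro-averaged precision, recall and f-measure can be computed by comparing sets of annotation tuples. This is a generic sketch of the metrics, not the exact scorer used by the challenges, and the (tweet_id, mention, link) tuple shape is an assumption:

```python
def precision_recall_f1(gold, predicted):
    """Micro-averaged scores over hashable annotation tuples,
    e.g. (tweet_id, mention, link)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                        # exact matches only
    p = tp / len(predicted) if predicted else 0.0     # precision
    r = tp / len(gold) if gold else 0.0               # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean
    return p, r, f1
```

Because the tuples must match exactly, an error in any workflow step (a wrong mention boundary, a wrong link) makes the whole annotation count as wrong, which is precisely the error-propagation drawback described above.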
A partial end-to-end strategy evaluates the output of each Entity Linking step by analysing only the final result of the system. This evaluation uses different metrics for each part of the workflow and has been successfully used across multiple TAC-KBP editions. Therefore, due to its benefits for the research community, the partial end-to-end evaluation was also applied in the NEEL challenge in 2015 and 2016. Furthermore, the NEEL challenge applied this strategy using the same evaluation tool as TAC-KBP [44], with the aim of enabling an easier interchange of participants between the two communities.
The step-by-step evaluation has never been applied within the NEEL series. Despite its robustness in eliminating error propagation, it is very time consuming, in particular if participant systems do not implement the typical workflow. The evaluation process for each year, as well as the specific metrics used, will be discussed in Section 7.
In the next sections we will explain in detail how the NEEL challenges were organised, how the benchmark corpora were generated semi-manually, details of participant systems in each year, and the impact of each change in the participation in subsequent years.
The organisation of the NEEL challenges led to the yearly release of datasets of high value for the research community. Over the years, the datasets increased in size and coverage.
Collection procedure and statistics
The initial 2013 challenge dataset contains 4,265 tweets collected from the end of 2010 to the beginning of 2011 using the Twitter firehose with no explicit hashtag search. These tweets cover a variety of topics, including comments on news and politics. The dataset was split into 66% training and 33% test.
The second 2014 challenge dataset contains 3,505 event-annotated tweets, where each entity was linked to its corresponding DBpedia URI. This dataset was collected as part of the Redites project14
The 2015 challenge dataset extends the 2014 dataset. This dataset consists of tweets published over a longer period, between 2011 and 2013. In addition to this, we also collected tweets from the Twitter firehose from November 2014 covering both event (such as the UCI Cyclo-cross World Cup) and non-event tweets. The dataset was split into training (58%), consisting of the entire 2014 dataset, development (8%), which enabled participants to tune their systems, and test (34%) from the newly added 2015 tweets.
The 2016 challenge dataset builds on the 2014 and 2015 datasets, and consists of tweets extracted from the Twitter firehose from 2011 to 2013 and from 2014 to 2015 via a selection of popular hashtags. This dataset was split into training (65%) consisting of the entire 2015 dataset, development (1%), and test (34%) sets from the newly collected tweets for the 2016 challenge.
General statistics of the training, dev, and test data sets.
Only 300 tweets have been randomly selected to be manually annotated and included in the gold standard.
These figures refer to the 300 tweets of the gold standard.
Statistics describing the training, development and test sets are provided in Table 2. In all but the 2015 challenge, the training datasets presented a higher rate of named entities linked to DBpedia than the development and test datasets. The percentage of tweets that mention at least one entity is 74.42% in the training and 72.96% in the test set for the 2013 dataset; 32% in the training and 40% in the test set for the 2014 dataset; 57.83% in the training set, 77.4% in the development set, and 82.05% in the test set for the 2015 dataset; and 67.60% in the training set, 100% in the development set, and 9.35% in the test set for the 2016 dataset. The overlap of entities between the training and test data is 8.09% for the 2013 dataset, 13.27% for the 2014 dataset, 4.6% for the 2015 dataset, and 6.59% for the 2016 dataset. Following the Terms of Use of Twitter, for all four challenge datasets participants were only provided with the tweet IDs and the annotations; the tweet text had to be downloaded from Twitter.
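Entity-overlap figures of this kind reduce to a simple set operation. Since the exact definition behind the reported percentages is not spelled out here, the sketch below assumes one plausible reading (the fraction of distinct test-set entities that also appear in the training set):

```python
def entity_overlap(train_entities, test_entities):
    """Fraction of distinct test-set entities also seen in training.
    NOTE: this definition is an assumption; the challenge reports may have
    computed the overlap percentages differently (e.g. over mentions)."""
    train, test = set(train_entities), set(test_entities)
    return len(train & test) / len(test) if test else 0.0
```

A low overlap (4.6% to 13.27% across the four datasets) means systems cannot simply memorise training entities and must generalise to unseen ones.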
The taxonomy for annotating the entities changed from a four-class taxonomy in 2013, based on the taxonomy used in CoNLL 2003 [71], to an extended seven-type taxonomy, namely the NEEL Taxonomy (Appendix A), which is derived from the NERD Ontology [67]. This new taxonomy was introduced to provide a more fine-grained classification of the entities, covering also names of characters, products and events. Furthermore, it better addresses the semantic diversity of named entities in textual documents, as shown in [66]. Table 3 shows the mapping between the two classification schemes.
Mapping between the taxonomy used in the first challenge of the NEEL challenge series (left column) and the taxonomy used from 2014 onwards (right column)
Entity type statistics for the two data sets from 2013
Entity type statistics for the three data sets from 2015
Summary statistics of the entity types are provided in Tables 4, 5, and 6 for the 2013, 2015, and 2016 corpora respectively. The statistics cover the observable data in the corpora; thus, the distributions of implicit classes in the 2014 corpus are not reported. The class information was deliberately removed from that release, in line with the task's final objective of fostering end-to-end solutions.
Entity type statistics for the three data sets from 2016. The statistics of the Test set refer to the manually annotated set of tweets selected to generate the gold standard
In the 2013 challenge, 4 annotators created the gold standard. In the 2014 challenge, a total of 14 annotators with different backgrounds were involved, including computer scientists, social scientists, social web experts, semantic web experts and natural language processing experts. In the 2015 challenge, 3 annotators generated the annotations, and in the 2016 challenge, 2 experts took on the manual annotation campaign.
The annotation process for the 2013 dataset started with the unannotated corpus and consists of the following steps:
Manual annotation: the corpus was split into four quarters, each was annotated by a different human annotator.
Consistency: for consistency checking, each annotator also checked the annotations performed by the other three to verify correctness.
Consensus: for the annotations without consensus, discussions among the four annotators were used to come to a final conclusion. This process resolved the annotation inconsistencies.
Adjudication: a very small number of errors was also reported by the participants, which was taken into account in the final version of the dataset.
With the inclusion of entity links, the annotation process for the 2014 and 2015 datasets was amended to consist of the following phases:
Unsupervised automated annotation: the corpus was initially annotated using the NERD framework [69], which extracted potential entity mentions, candidate links to DBpedia, and entity types. The NERD framework was used as an off-the-shelf annotation tool, i.e. without domain-specific training.
Manual annotation: the labeled data set was divided into batches, with different annotators – three in the 2014 challenge, and two in the 2015 challenge – assigned to each batch. In this phase, manual annotations were performed using an annotation tool: CrowdFlower for the 2014 challenge dataset (with selected expert annotators rather than the crowd), and GATE for the 2015 challenge, since GATE allows entities to be annotated according to an ontology and inter-annotator agreement to be computed on the dataset.
Consistency: the annotators – the three experts in the 2014 challenge, and a fourth annotator in the 2015 challenge – double-checked the annotations and generated the gold standard (for the training, development and test sets). Three main tasks were carried out here: (i) consistency checking of entity types; (ii) consistency checking of URIs; (iii) resolution of ambiguous cases raised by the annotators. The annotators looped through Phases 2 and 3 of the process until the problematic cases were resolved.
NIL Clustering: particular to the 2015 challenge, a seed cluster generation algorithm, which merges string- and type-identical named entity mentions, was used to generate an initial NIL clustering.
Consensus: also in the 2015 challenge, based on the results of the seed algorithm, the third annotator manually verified all NIL clusters in order to remove links asserted to the wrong cluster, and merge clusters referring to the same entity. Special attention was paid to name variations such as acronyms, misspellings, and similar names.
Adjudication: the challenge participants reported incorrect or missing annotations. Each reported mention was evaluated by one of the challenge chairs to check compliance with the NEEL Challenge Annotation Guidelines, and additions and corrections were made as required.
In the 2016 challenge, the training set was built on top of the 2014 and 2015 datasets in order to provide continuity with previous years and to build upon existing findings. The 2016 challenge used the NEEL Challenge Annotation Guidelines provided in 2015. Due to the intensity of the annotation task, 10% of the test set was annotated manually. The participants were asked to annotate the entire corpus of tweets.
Manual annotation: the data set was divided into 2 batches, one for each annotator. In this phase, annotations were performed using GATE. The annotators were asked to analyse the annotations generated in Phase 1 by adding or removing entity annotations as required. The annotators were also asked to mark any ambiguous cases encountered. Along with the batches, the annotators received the NEEL Challenge Annotation Guidelines.
Consistency: the annotators checked each other's annotations and generated the gold standard (for the training, development and test sets). Three main tasks were carried out here: (i) consistency checking of entity types; (ii) consistency checking of URIs; (iii) resolution of ambiguous cases raised by the annotators. The annotators iterated between Phases 1 and 2 until the problematic cases were resolved.
NIL Clustering: an unsupervised NIL Clustering generation was performed, using a seed cluster generation algorithm based on exact string matching of mention strings and their types.
Consensus: one of the two expert annotators went through all NIL clusters in order to, where appropriate, include or exclude them from a given cluster.
Adjudication: where the challenge participants reported incorrect or missing annotations. Each reported mention was evaluated by one of the challenge chairs to check compliance with the Challenge Annotation Guidelines, and additions and corrections were made as required.
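The seed cluster generation used in the 2015 and 2016 campaigns can be sketched as follows. This is a minimal sketch under our own assumptions: the flattened `(mention_id, surface_form, entity_type)` tuple layout and the function name are hypothetical, while the merging criterion (exact string and type identity) follows the description above.

```python
from collections import defaultdict

def seed_nil_clusters(nil_mentions):
    """Group NIL mentions whose surface form and type match exactly.

    nil_mentions: iterable of (mention_id, surface_form, entity_type)
    tuples -- a hypothetical flattened view of the corpus annotations.
    Returns a dict mapping (normalised surface form, type) to mention ids.
    """
    clusters = defaultdict(list)
    for mention_id, surface, etype in nil_mentions:
        # String-identical (case-folded) and type-identical mentions
        # are merged into the same seed cluster.
        clusters[(surface.lower(), etype)].append(mention_id)
    return dict(clusters)

mentions = [
    ("t1:0", "J. Doe", "Person"),
    ("t2:5", "j. doe", "Person"),
    ("t3:2", "J. Doe", "Organization"),
]
clusters = seed_nil_clusters(mentions)
```

The resulting seed clusters are then manually verified, merged or split by the annotators, as described in the Consensus phase above.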
The inter-annotator agreement (IAA) for the challenge datasets (2014, 2015 and 2016) is presented in Table 7. The inter-annotator agreement for the 2013 dataset could not be computed, as the challenge settings and intermediate data were not preserved in that edition.
Inter-annotator agreement on the challenge datasets
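Pairwise agreement of the kind reported in Table 7 can be computed, for instance, with Cohen's kappa over aligned token labels. This is an illustrative sketch only; the exact agreement measure used in each edition may differ from the one shown here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label.
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["Person", "O", "O", "Location", "O"]
b = ["Person", "O", "Person", "Location", "O"]
print(round(cohens_kappa(a, b), 3))  # → 0.688
```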
The lessons learnt from building high quality gold standards are: the annotation process must be guided by annotation guidelines; at least two annotators must be involved to ensure consistency; and feedback from the participants is valuable in improving the quality of the datasets, providing annotations complementary to the cases found by experts. The annotation guidelines, written by experts, must describe the annotation task (for instance, entity types and the NEEL Taxonomy) through examples, and must be regularly updated during the manual annotation stage to document special cases and issues encountered. To speed up the annotation process, it is good practice to employ an annotation tool; we used GATE because the annotation process was guided by a taxonomy-centric view. The annotation task took less time when the annotators shared the same background (e.g. all annotators were semantic web and natural language processing experts with experience in information extraction).
While the main goals of the 2013–2016 challenges were the same, and the 2014–2016 corpora are largely built on top of each other, there are some differences among the datasets. In this section, we analyse the different datasets according to the characteristics of the entities and events annotated in them. We reuse measures and scripts from [76] and add a readability analysis of the corpora. Note that for the Entity Linking analyses we can only compare the 2014–2016 NEEL corpora, since the 2013 corpus (CE2013) does not contain entity links.
Entity overlap
Table 8 presents the entity overlap between the different datasets. Each row in the table represents the percentage of unique entities present in that dataset that are also represented in the other datasets.
Entity overlap in the analysed datasets. Behind the dataset name in each row the number of unique entities present in that dataset is given. For each dataset pair the overlap is given as the number of entities and percentage (in parentheses)
We define the true confusability of a surface form (by surface form we refer to the lexical value of the mention) as the number of distinct meanings, i.e. entity referents, associated with it in the corpus.
The confusability of a location name offers only a rough
Confusability stats for analysed datasets. Average stands for average number of meanings per surface form, Min. and Max. stand for the minimum and maximum number of meanings per surface form found in the corpus respectively, and
We define the true dominance of an entity resource, where an entity resource is an entry in a knowledge base that describes that entity, for example a DBpedia resource.
The dominance statistics for the analysed datasets are presented in Table 10. The dominance scores for all corpora are quite high and the standard deviation is low, meaning that in the vast majority of cases a single resource is associated with a given surface form in the annotations, leaving little variance for an automatic disambiguation system to resolve.
Dominance stats for analysed datasets
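The confusability and dominance statistics above can be computed from the gold annotations along the following lines. This is a sketch under our own assumptions: the pair-based input format is hypothetical, and dominance is simplified here to a per-surface-form ratio (the relative frequency of the most common resource), consistent with the observation that a single resource is usually associated with a surface form.

```python
from collections import Counter, defaultdict

def confusability_and_dominance(annotations):
    """annotations: iterable of (surface_form, resource_uri) pairs.

    Confusability of a surface form: the number of distinct resources
    it is linked to in the corpus. Dominance (simplified here per
    surface form): the relative frequency of its most common resource.
    """
    by_surface = defaultdict(Counter)
    for surface, resource in annotations:
        by_surface[surface.lower()][resource] += 1
    confusability = {s: len(c) for s, c in by_surface.items()}
    dominance = {s: c.most_common(1)[0][1] / sum(c.values())
                 for s, c in by_surface.items()}
    return confusability, dominance

ann = [("Paris", "dbr:Paris"), ("Paris", "dbr:Paris"),
       ("Paris", "dbr:Paris_Hilton"), ("NEEL", "NIL1")]
conf, dom = confusability_and_dominance(ann)
```

Averaging, minimum and maximum over the per-surface-form values yields the summary figures reported in the tables.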
In this section, we have analysed the corpora in terms of their variance in named entities and readability.
As the datasets are built on top of each other, they show a fair amount of entity overlap. This need not be a problem if there is enough variation among the entities, but the confusability and dominance statistics show that there are very few entities in our datasets with many different referents (“John Smiths”), and when such an entity is present, often only one of its referents is meant. To remedy this, future entity linking corpora should take care to balance the entity distribution and include more variety.
We experimented with various readability measures to assess the reading difficulty of the tweet corpora. These measures indicate that tweets are generally not very difficult in terms of word and sentence length, but the abbreviations and slang present in tweets make them harder to interpret for readers outside the target community. To the best of our knowledge, there is no readability metric that takes this into account; we therefore chose not to include those experimental results in this article.
Emerging trends and systems overview
In the remainder of this analysis, we focus on two main tasks, namely Mention Detection and Candidate Selection. Thirty different approaches were applied in four editions of the challenge since 2013. Table 11 lists all ranked teams.
Per year submissions and number of runs for each team
Whilst there are substantial differences between the proposed approaches, a number of trends can be observed in the top-performing named entity recognition and linking approaches for tweets. Firstly, we observe the widespread adoption of data-driven approaches: while the first and second years of the challenge saw extensive use of off-the-shelf tools, the top-ranking systems from 2013–2016 show a high dependence on the training data. This is not surprising, since these approaches are supervised, but it clearly suggests that labeled data is necessary to reach top performance. Additionally, the extensive use of knowledge bases as dictionaries of typed entities and as holders of entity relations has dramatically improved performance over the years. This strategy overcomes the lexical limitations of a tweet and performs well on the identification of entities available in the referent knowledge base. A phase common to all submitted approaches is normalisation, i.e. smoothing the lexical variations of the tweets and translating them into language structures that can be better parsed by state-of-the-art approaches expecting more formal and well-formed text. Whilst the linguistic workflow favours sequential solutions, Entity Recognition and Linking for tweets is often proposed as a joint step using large knowledge bases as referent entity directories. While knowledge bases support linking entities to mentions in text, they cannot support the identification of novel and emerging entities. Ad-hoc solutions for generating NILs in tweets have been proposed, ranging from edit distance-based solutions to the use of Brown clustering.
Between the first NEEL challenge on Concept Extraction (CE) and the 2016 edition we observe the following:
tweet normalisation as the first step of any approach. This is generally defined as preprocessing, and it increases the expressiveness of the tweets, e.g. via the expansion of Twitter accounts and hashtags with the actual names of the entities they represent, the conversion of non-ASCII characters, and, generally, noise filtering;
the contribution of knowledge bases in the mention detection and typing task. This leads to higher coverage, which, along with the linguistic analysis and type prediction, better fits this particular domain;
the use of high performing end-to-end approaches for the candidate selection. Such a methodology was further developed with the addition of fuzzy distance functions operating over n-grams and acronyms;
the inclusion of a pruning stage to filter out candidate entities. This was presented in various approaches ranging from Learning-to-Rank to recasting the problem as a classification task. We observed that the approach based on a classifier reached better performance (in particular, the classifier that performed best for this task was implemented using a SVM based on a radial basis function kernel), however it required an extensive feature engineering of the feature set used as training;
utilising hierarchical clustering of mentions to aggregate exact mentions of the same entity in the text and thus complementing the knowledge base entity directory in case of absence of an entity;
a considerable decrease in off-the-shelf systems. These were popular in the first editions of NEEL, but in later editions their performance grew increasingly limited as the task became more constrained.
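The n-gram lexicon lookup with fuzzy matching mentioned in the list above can be sketched as follows. This is an illustrative sketch: `difflib`'s similarity ratio stands in for the edit-distance and acronym-aware functions used by the participants, and the lexicon is a toy example rather than a real DBpedia-derived dictionary.

```python
import difflib

def ngrams(tokens, max_n=3):
    """All token n-grams up to length max_n, joined into strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def lookup_candidates(tweet_tokens, lexicon, cutoff=0.9):
    """Match tweet n-grams against a lexicon of lower-cased entity labels.

    lexicon maps a label to a knowledge-base entry; difflib's similarity
    ratio stands in for the edit-distance functions used by participants.
    """
    candidates = {}
    labels = list(lexicon)
    for gram in ngrams(tweet_tokens):
        close = difflib.get_close_matches(gram.lower(), labels,
                                          n=1, cutoff=cutoff)
        if close:
            candidates[gram] = lexicon[close[0]]
    return candidates

lexicon = {"world cup": "dbr:World_Cup", "new york": "dbr:New_York_City"}
cands = lookup_candidates("watching the world cup in newyork".split(), lexicon)
```

Note how the fuzzy match recovers the concatenated "newyork", a typical tweet-style spelling.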
Table 12 provides an overview of the methods and features used in these four years, grouped according to the workflow steps listed in Fig. 1.
Map of the approaches per sub-task applied in the NEEL series of challenges from 2013 until 2016
https://github.com/semanticize/semanticizer
http://www.alchemyapi.com
http://www.opencalais.com
http://www.zemanta.com
Submissions and number of runs for each team for the Mention Detection phase
Table 13 presents a description of the approaches used for Mention Detection combined with Typing. Participants approached the task using lexical similarity matchers, machine learning algorithms, and hybrid methods combining the two. For 2013, the strategies yielding the best results were hybrid, with models relying on the application of off-the-shelf systems (e.g., AIDA [45], ANNIE [24], OpenNLP).
The 2014 systems approached the Mention Detection task by adding lexicons and features computed from DBpedia resources. System 14, the best performing system, matched n-grams computed from the text against lexicon entries taken from DBpedia. From the 2014 challenge on, we observe more approaches favouring recall in Mention Detection, while relying less on linguistic features. System 15, proposed by the same authors as the best performing system in 2014, addressed the Mention Detection task with a large set of linguistic and lexicon-related features (such as the probability of the candidate obtained from the Microsoft Web N-Gram services, or its appearance in WordNet), using an SVM classifier with a radial basis function kernel specifically trained on the challenge data. This approach resulted in high precision, but slightly penalised recall.
The 2015 best performing approach for Mention Detection, System 20, was largely inspired by the 2014 winning approach: n-grams were used to look up resources in DBpedia, together with a set of lexical features such as POS tags and position in tweets. The type was assigned by a Random Forest classifier specifically trained on the challenge dataset, using DBpedia-related features (such as PageRank [11]), word embeddings (contextual features), temporal popularity knowledge of an entity extracted from Wikipedia page view data, string similarity measures between the title of the entity and the mention (such as edit distance), and linguistic features (such as POS tags, position in tweets, and capitalisation).
The 2016 best performing system, System 26, implements a lexicon matcher that matches entities in the knowledge base against unigrams computed from the text. The approach includes a preliminary tweet normalisation stage, resolving acronyms and hashtags into mentions written in natural language.
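A normalisation stage of this kind can be sketched as follows. The lookup tables for handles and acronyms are hypothetical; real systems derive such expansions from Twitter profile data and gazetteer resources, and the hashtag splitter shown here handles only simple CamelCase patterns.

```python
import re
import unicodedata

def normalise_tweet(text, handle_names=None, acronyms=None):
    """Rough tweet normalisation: ASCII folding, @handle and hashtag
    expansion, acronym resolution. Lookup tables are caller-provided."""
    handle_names = handle_names or {}
    acronyms = acronyms or {}
    # Fold non-ASCII characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # Expand user mentions with known display names.
    text = re.sub(r"@(\w+)",
                  lambda m: handle_names.get(m.group(1), m.group(1)), text)
    # Split CamelCase (and acronym-prefixed) hashtags into words.
    word = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")
    text = re.sub(r"#(\w+)", lambda m: " ".join(word.findall(m.group(1))), text)
    # Expand known acronyms token by token.
    return " ".join(acronyms.get(tok, tok) for tok in text.split())

out = normalise_tweet("@BBC covers the #UCICyclocross race",
                      handle_names={"BBC": "British Broadcasting Corporation"},
                      acronyms={"UCI": "Union Cycliste Internationale"})
```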
From 2014 on, the challenge task required participants to produce systems that were also able to link the detected mentions to their corresponding DBpedia resource (if existing). Table 14 describes the approaches taken by the 2014, 2015 and 2016 participants for the Candidate Detection and Selection, and NIL Clustering stages. In 2014, most of the systems proposed a Candidate Selection step subsequent to the Mention Detection stage, implementing the conventional linguistic pipeline: first detecting the mention, then looking for referents of the mention in the external knowledge base. This resulted in a set of candidate links, which were then ranked according to the similarity of the link to the mention and the surrounding text. However, the best performing system (System 14) approached the Candidate Selection jointly with Mention Detection and link assignment, proposing a so-called end-to-end system. As opposed to most participants, who used off-the-shelf tools, System 14 proposed a SMART gradient boosting algorithm [33], specifically trained on the challenge dataset with textual features (such as textual similarity and contextual similarity), graph-based features (such as semantic cohesiveness between entity–entity and entity–mention pairs), and statistical features (such as mention popularity using the Web as archive). The majority of the systems, including System 14, applied name normalisation for feature extraction, which was useful for identifying entities originally appearing as hashtags or username mentions. Among the most commonly used external knowledge sources are NER dictionaries (e.g., Google CrossWiki), knowledge base gazetteers (e.g., Yago, DBpedia), weighted lexicons (e.g., Freebase, Wikipedia), and other sources (e.g., Microsoft Web N-gram).
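As an illustration of such feature-based candidate ranking, the following sketch scores candidates with a two-feature linear combination. This toy score stands in for learned rankers such as the gradient boosting model above; the feature names, the `popularity` values and the weights are our own assumptions.

```python
import difflib

def rank_candidates(mention, candidates, weights=(0.7, 0.3)):
    """Rank candidate knowledge-base entries for a mention.

    candidates: list of dicts with 'uri', 'label' and a 'popularity'
    score normalised to [0, 1]. The two-feature linear score is an
    illustrative stand-in for the learned rankers described above.
    """
    w_sim, w_pop = weights
    def score(c):
        # Surface similarity between the mention and the candidate label.
        sim = difflib.SequenceMatcher(None, mention.lower(),
                                      c["label"].lower()).ratio()
        return w_sim * sim + w_pop * c["popularity"]
    return sorted(candidates, key=score, reverse=True)

cands = [
    {"uri": "dbr:Paris", "label": "Paris", "popularity": 0.9},
    {"uri": "dbr:Paris_Hilton", "label": "Paris Hilton", "popularity": 0.4},
]
best = rank_candidates("paris", cands)[0]
```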
In the 2015 challenge, System 20 (ranked first) proposed an enhanced version of the 2014 challenge winner's approach, combined with a pruning stage meant to increase the precision of the Candidate Selection while considering the role of the entity type assigned by a Conditional Random Field (CRF) classifier. In particular, System 20 is a five-stage sequential approach: preprocessing, generation of potential entity mentions, candidate selection, NIL detection, and entity mention typing. In the preprocessing stage, a tokenisation and Part-of-Speech (POS) tagging approach based on [37] was used, along with the extraction of tweet timestamps. The generation of potential entity mentions is addressed by computing n-grams (with
In 2016, the top performing system, System 26, proposed a lexicon-based joint Mention Extraction and Candidate Selection approach, where unigrams from tweets are mapped to DBpedia entities. A preprocessing stage cleans the tweets, assigns part-of-speech tags, and normalises the initial tweets by converting alphabetic, numeric, and symbolic Unicode characters to ASCII equivalents. For each entity candidate, the system considers local and context-related features. Local features include the edit distance between the candidate labels and the n-gram, the candidate's link-graph popularity, its DBpedia type, the provenance of the label, and the best-matching surface form. The context-related features assess the relation of a candidate entity to the other candidates within the given context; they include graph distance measurements, connected component analysis, and centrality and density observations using the DBpedia graph as pivot. The candidates are ranked by a confidence score, which is used to decide whether the entity actually describes the mention. If the confidence score is lower than an empirically determined threshold, the mention is annotated as NIL.
The other approaches implement linguistic pipelines where the Candidate Selection is performed by looking up entities whose exact lexical value matches DBpedia titles, redirect pages, and disambiguation pages. We also observed a reduction in complexity for the NIL clustering, which came to rely only on the lexical distance of the mentions: System 27 used the Monge–Elkan similarity measure [21], while System 28 experimented with the normalised Damerau–Levenshtein distance, which performed better than Brown clustering [12].
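NIL clustering over a normalised Damerau–Levenshtein distance can be sketched as follows. This uses the restricted (optimal string alignment) variant of the distance; the greedy single-link grouping and the threshold value are our own simplifications, not the participants' exact procedure.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalised_dl(a, b):
    return osa_distance(a, b) / max(len(a), len(b), 1)

def cluster_nils(mentions, threshold=0.3):
    """Greedy single-link grouping of NIL mentions under a distance cap."""
    clusters = []
    for m in mentions:
        for cluster in clusters:
            if any(normalised_dl(m.lower(), o.lower()) <= threshold
                   for o in cluster):
                cluster.append(m)
                break
        else:
            clusters.append([m])
    return clusters

groups = cluster_nils(["Jon Smith", "Jno Smith", "Acme Corp"])
```

The transposed spelling "Jno Smith" lands in the same cluster as "Jon Smith", the kind of name variation the Consensus phase pays special attention to.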
Submissions and number of runs for each team for the Candidate Selection phase
In this section, the evaluation metrics used in the different challenges are described.
2013 evaluation measures
In 2013, the submitted systems were evaluated based on performance in extracting a mention and assigning its correct class as assigned in the Gold Standard
We performed a
Since we require strict matches, a system must both detect the correct mention
Precision and recall are then computed on a per-entity-type basis as:
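With $TP_t$, $FP_t$ and $FN_t$ denoting the true positives, false positives and false negatives for entity type $t$ (notation ours, reconstructing the standard definitions):

```latex
P_t = \frac{TP_t}{TP_t + FP_t}, \qquad
R_t = \frac{TP_t}{TP_t + FN_t}, \qquad
F1_t = \frac{2 \, P_t \, R_t}{P_t + R_t}
```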
Submissions were evaluated offline: participants were asked to annotate a test set within a short time window and to send the results in a TSV (tab separated value) file.
In 2014, a system was evaluated on the mentions it extracted and the links it assigned. We considered all DBpedia v3.9 resources valid. Since the 2014 NEEL Challenge, we opted to weigh all instances of an entity equally.
The evaluation procedure involved an
Submissions were evaluated offline, where participants were asked to annotate in a short time window the TS and to send the results in a TSV file.
In the 2015 and 2016 editions of the NEEL challenge, systems were evaluated according to the number of mentions correctly detected, their type correctly asserted (i.e. output of Mention Detection and Entity Typing), the links correctly assigned between a mention in a tweet and a knowledge base entry, and a NIL assigned when no knowledge base entry disambiguates the mention.
The required outputs were measured using a set of three evaluation metrics:
The strong_typed_mention_match metric jointly evaluates the Mention Detection and Typing stages: a mention counts as correct only if both its span and its type match the gold standard.
The strong_link_match metric evaluates the Candidate Selection stage: a link counts as correct only if the knowledge base entry assigned to a mention matches the gold standard link.
The last metric in our evaluation score is the Constrained Entity-Alignment F-measure (CEAF) [53]. This is a metric that measures coreference chains and is used to jointly evaluate Candidate Selection and NIL Clustering steps. Let
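For reference, CEAF can be summarised as follows (notation ours, after [53]): given gold clusters $G_i$ and system clusters $S_j$, CEAF searches for the one-to-one alignment $g^{*}$ maximising the total similarity $\Phi$, where for mention-based CEAF the cluster similarity is $\phi(G_i, S_j) = |G_i \cap S_j|$:

```latex
\Phi(g) = \sum_{(G_i, S_j) \in g} \phi(G_i, S_j), \qquad
P = \frac{\Phi(g^{*})}{\sum_j \phi(S_j, S_j)}, \qquad
R = \frac{\Phi(g^{*})}{\sum_i \phi(G_i, G_i)}
```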
In 2015, submissions were evaluated through an online process: participants were required to implement their systems as publicly accessible web services following a REST-based protocol, and could submit up to 10 contending entries to a registry of NEEL challenge services. Each endpoint had a Web address (URI) and a name, referred to as the runID. Upon receiving the registration of the REST endpoint, calls to the contending entry were scheduled for two different time windows, namely,

As setting up a REST API increased the implementation load on the participants, we reverted to an offline evaluation setup in 2016. As in previous challenges, participants were asked to annotate the TS during a short time window and to send the results in a TSV file, which was then evaluated by the challenge chairs.
Three editions out of four followed an offline evaluation procedure; a discontinuity was introduced in 2015 with the online evaluation procedure. Two issues were noted by the participants of the 2015 edition: (i) the increasing complexity of the task, going beyond the pure NEEL objectives; (ii) the unfair comparison of computing time with respect to big players, who can afford better computing resources than small research teams. These issues motivated the return to a conventional offline procedure for the 2016 edition. The emerging trend is the consolidation of a de-facto standard scorer, proposed in TAC-KBP and now successfully adopted and widely used in our community. This scorer measures the performance of the approaches across the entire annotation pipeline: Mention Extraction, Candidate Selection, Typing, and the detection of novel and emerging entities in highly dynamic contexts such as tweets.
Results
This section presents a compilation of the NEEL challenge results across the years. As the NEEL task evolved, the results among these years are not entirely comparable. Table 15 shows results for the 2013 challenge task, where we report scores averaged for the four entity types analysed on this task.
Scores achieved for the NEEL 2013 submissions
The 2013 task consisted of building systems that could identify four entity types (i.e., Person, Location, Organisation and Miscellaneous) in a tweet. This task proved to be challenging, with some approaches favouring precision over recall. The best precision was obtained by Team 1, which used a combination of rule-based and data-driven approaches, achieving 76.4% precision. For recall, results varied across the four entity types, with the miscellaneous and organisation types ranking lowest. Averaging over entity types, the best results were obtained by Team 2, whose solution relied on gazetteers. All top-3 teams ranked by F-measure followed a hybrid approach combining rules and gazetteers.
The 2014 challenge task extended the concept extraction challenge by considering not only entity type recognition but also the linking of entities to the DBpedia v3.9 knowledge base. Table 16 presents the results for this task, which follow the evaluation described in Section 7. There was a clear winner, proposed by the Microsoft Research Lab Redmond, that outperformed all other systems on all three metrics.
Scores achieved for the NEEL 2014 submissions
The 2015 task extended the 2014 recognition and linking tasks with a clustering task: participants had to provide clusters where each cluster contained only mentions of the same real-world entity. For 2015 we also computed the latency of each system. Table 17 presents a ranked list of results for the 2015 submissions. The last column shows the final score for each participant following Eq. (14). Here the winner (Team 20) outperformed the second best with a boost in tagging
Scores achieved for the NEEL 2015 submissions. Tagging refers to
Finally, the 2016 challenge followed the same task as 2015. Table 18 presents a ranked list of results for the 2016 submissions. Team 26 outperformed all other participants, with an overall
Scores achieved for the NEEL 2016 submissions. Tagging refers to
The NEEL challenge series was established in 2013 to foster the development of novel automated approaches for mining semantics from tweets and providing standardised benchmark corpora enabling the community to compare systems.
This paper describes the decisions and procedures followed in setting up and running the task. We first described the annotation procedures used to create the NEEL corpora over the years. The procedures were incrementally adjusted over time to provide continuity and ensure reusability of the approaches over the different editions. While the consolidation has provided consistent labeled data, it has also shown the robustness of the community.
We also described the different approaches proposed by the NEEL challenge participants. Over the years, we witnessed a convergence of the approaches towards data-driven solutions supported by knowledge bases. Knowledge bases are prominently used as a source for discovering known entities, relations among data, and labelled data for selecting candidates and suggesting novel entities. Data-driven approaches have become, with variations, the leading solution. Despite the consolidated number of options for addressing the challenge task, the participants' results show that the NEEL task remains challenging in the microposts domain.
Furthermore, we explained the different evaluation strategies used in different challenges. These changes were driven by a desire to ensure fairness of the evaluation, transparency, and correctness. These adaptations involve the use of in-house scoring tools in 2013 and 2014, which were made publicly available and discussed in the community. Since 2015 the TAC-KBP challenge scorer was adopted to both leverage the wide experience developed in the TAC-KBP community and break down the analysis to account for the clustering.
Thanks to the yearly releases of the annotations and tweet IDs under a public license, the NEEL corpus has started to become widely adopted. Beyond the thirty teams who completed the evaluations over four years, more than three hundred participants have contacted the NEEL organisers with requests to acquire the corpora. The teams come from more than twenty different countries, from both academia and industry. The 2014 and 2015 winners were companies operating in the field, respectively Microsoft and Studio Ousia; the 2013 and 2016 winners were academic teams. The success of the NEEL challenges is also illustrated by the sponsorships of the challenges offered by companies (eBay
The NEEL challenges also triggered the interest of local communities such as NEEL-IT. This community is pushing the NEEL Challenge Annotation Guidelines (with minor variations due to the intra-language dependencies) and know-how to create a benchmark for sharing the algorithms and results of mining semantics from Italian tweets. In 2015, we also built bridges with the TAC community. We plan to strengthen these and to involve a larger audience of potential participants ranging from Linguistics, Machine Learning, Knowledge Extraction, Data and Web Science.
Future work involves the generation of corpora that account for the low variance of entity-type semantics. We aim to create larger datasets covering a broader range of entity types and domains within the Twitter sphere. The 2015 enhancements to the evaluation strategy, which accounted for computational time, highlighted new challenges concerning an algorithm's efficiency versus its efficacy. Since more efforts on handling large-scale data mining involve distributed computing and optimisation, we aim to develop new evaluation strategies that ensure the fairness of the results when asking participants to produce large-scale annotations in a small window of time. Among future efforts, we aim to identify the differences in performance among the disparate systems and their approaches, first characterising what can be considered an error in the context of the challenge, and then deriving insightful conclusions about the building blocks needed to automatically build an optimal system.
Finally, given the increasing interest in adopting the NEEL guidelines in creating corpora for other languages, we aim to develop a multilingual NEEL challenge as a future activity.
Acknowledgements
This work was supported by the H2020 FREME project (GA no. 644771), by the research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, and by the CLARIAH-CORE project financed by the Netherlands Organisation for Scientific Research (NWO).
NEEL taxonomy
Thing:
languages
ethnic groups
nationalities
religions
diseases
sports
astronomical objects
Event:
holidays
sport events
political events
social events
Character:
fictional characters
comic characters
title characters
Location:
public places (squares, opera houses, museums, schools, markets, airports, stations, swimming pools, hospitals, sports facilities, youth centers, parks, town halls, theatres, cinemas, galleries, universities, churches, medical centers, parking lots, cemeteries)
regions (villages, towns, cities, provinces, countries, continents, dioceses, parishes)
commercial places (pubs, restaurants, depots, hostels, hotels, industrial parks, nightclubs, music venues, bike shops)
buildings (houses, monasteries, creches, mills, army barracks, castles, retirement homes, towers, halls, rooms, vicarages, courtyards)
Organization:
companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives)
subdivisions of companies
brands
political parties
government bodies (ministries, councils, courts, political unions)
press names (magazines, newspapers, journals)
public organizations (schools, universities, charities)
collections of people (sport teams, associations, theater companies, religious orders, youth organizations, musical bands)
Person:
people’s names (titles and roles are not included, such as Dr. or President)
Product:
movies
tv series
music albums
press products (journals, newspapers, magazines, books, blogs)
devices (cars, vehicles, electronic devices)
operating systems
programming languages
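The taxonomy groups these subcategories under the seven top-level NEEL types (Thing, Event, Character, Location, Organization, Person, Product). As an illustration only, not part of the challenge tooling, such a taxonomy can be represented as a simple lookup from top-level type to subcategories; the abbreviated lists and helper below are assumptions for the sketch.

```python
# Illustrative sketch: the NEEL taxonomy as a Python mapping from
# top-level type to subcategories (lists abbreviated for brevity).
NEEL_TAXONOMY = {
    "Thing": ["languages", "ethnic groups", "nationalities",
              "religions", "diseases", "sports", "astronomical objects"],
    "Event": ["holidays", "sport events", "political events", "social events"],
    "Character": ["fictional characters", "comic characters", "title characters"],
    "Location": ["public places", "regions", "commercial places", "buildings"],
    "Organization": ["companies", "subdivisions of companies", "brands",
                     "political parties", "government bodies", "press names",
                     "public organizations", "collections of people"],
    "Person": ["people's names"],
    "Product": ["movies", "tv series", "music albums", "press products",
                "devices", "operating systems", "programming languages"],
}


def neel_type_of(subcategory):
    """Return the top-level NEEL type for a subcategory, or None."""
    for neel_type, subcategories in NEEL_TAXONOMY.items():
        if subcategory in subcategories:
            return neel_type
    return None
```

A lookup like this lets an annotation tool validate that every assigned type resolves to one of the seven top-level classes.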
NEEL Challenge annotation guidelines
The challenge task consists of three consecutive stages: 1) extraction and typing of entity mentions within a tweet; 2) linking of each mention to an entry in the English DBpedia (in the 2016 NEEL Challenge, DBpedia 2015-04 was the referent knowledge base), or to NIL when no knowledge base referent exists; 3) clustering of the NIL mentions that refer to the same entity.
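The three stages yield, per mention, an offset span in the tweet, a link (a DBpedia URI or a NIL cluster identifier) and a NEEL type. The sketch below shows one way such an annotation record could be modelled and parsed from a tab-separated line; the field names and column order are illustrative assumptions, not the official submission format.

```python
from dataclasses import dataclass


@dataclass
class NeelAnnotation:
    """One annotated entity mention in a tweet (illustrative schema:
    field names and column order are assumptions, not the official format)."""
    tweet_id: str
    start: int       # character offset where the mention begins
    end: int         # character offset where the mention ends
    link: str        # DBpedia URI, or a NIL cluster identifier (e.g. "NIL1")
    neel_type: str   # one of the NEEL taxonomy types, e.g. "Person"

    def is_nil(self):
        # Mentions without a DBpedia referent are grounded to NIL clusters.
        return self.link.startswith("NIL")


def parse_annotation(line):
    """Parse one tab-separated annotation line (hypothetical layout)."""
    tweet_id, start, end, link, neel_type = line.rstrip("\n").split("\t")
    return NeelAnnotation(tweet_id, int(start), int(end), link, neel_type)
```

For example, `parse_annotation("123456\t0\t11\thttp://dbpedia.org/resource/Barack_Obama\tPerson")` would produce a linked `Person` mention, while a `link` field of `NIL1` would mark a mention to be clustered with other co-referring NIL mentions.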
