Information extraction meets the Semantic Web: A survey

Abstract

We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.

Keywords

Information Extraction; Entity Linking; Keyword Extraction; Topic Modeling; Relation Extraction; Semantic Web

1. Introduction

The Semantic Web pursues a vision of the Web where increased availability of structured content enables higher levels of automation. Berners-Lee [20] described this goal as being to “enrich human readable web data with machine readable annotations, allowing the Web’s evolution as the biggest database in the world”. However, making annotations on information from the Web is a non-trivial task for human users, particularly if some formal agreement is required to ensure that annotations are consistent across sources. Likewise, there is simply too much information available on the Web – information that is constantly changing – for it to be feasible to apply manual annotation to even a significant subset of what might be of relevance.

While the amount of structured data available on the Web has grown significantly in the past years, there is still a significant gap between the coverage of structured and unstructured data available on the Web [248]. Mika referred to this as the semantic gap [206], whereby the demand for structured data on the Web outstrips its supply. For example, in an analysis of the 2013 Common Crawl dataset, Meusel et al. [202] found that of the 2.2 billion webpages considered, 26.3% contained some structured metadata. Thus, despite initiatives like Linking Open Data [275], Schema.org [201,205] (promoted by Google, Microsoft, Yahoo, and Yandex) and the Open Graph Protocol [128] (promoted by Facebook), this semantic gap is still observable on the Web today [202,206].

As a result, methods to automatically extract or enhance the structure of various corpora have been a core topic in the context of the Semantic Web. Such processes are often based on Information Extraction methods, which in turn are rooted in techniques from areas such as Natural Language Processing, Machine Learning and Information Retrieval. The combination of techniques from the Semantic Web and from Information Extraction can be seen from two perspectives: on the one hand, Information Extraction techniques can be applied to populate the Semantic Web, while on the other hand, Semantic Web techniques can be applied to guide the Information Extraction process. In some cases, both aspects are considered together, where an existing Semantic Web ontology or knowledge-base is used to guide the extraction, which further populates the given ontology and/or knowledge-base (KB).1

¹
Herein we adopt the convention that the term “ontology” refers primarily to terminological knowledge, meaning that it describes classes and properties of the domain, such as person, knows, country, etc. On the other hand, we use the term “KB” to refer to primarily “assertional knowledge”, which describes specific entities (aka. individuals) of the domain, such as Barack Obama, China, etc.

In the past years, we have seen a wealth of research dedicated to Information Extraction in a Semantic Web setting. While many such papers come from within the Semantic Web community, many recent works have come from other communities, where, in particular, general-knowledge Semantic Web KBs – such as DBpedia [171], Freebase [26] and YAGO2 [139] – have been broadly adopted as references for enhancing Information Extraction tasks. Given the wide variety of works emerging in this particular intersection from various communities (sometimes under different nomenclatures), we see that a comprehensive survey is needed to draw together the techniques proposed in such works. Our goal is then to provide such a survey.

Survey Scope: This survey provides an overview of published works that directly involve both Information Extraction methods and Semantic Web technologies. Given that both are very broad areas, we must be rather explicit in our inclusion criteria.

With respect to Semantic Web technologies, to be included in the scope of this survey, a work must make non-trivial use of an ontology, knowledge-base, tool or language that is founded on one of the core Semantic Web standards: RDF/RDFS/OWL/SKOS/SPARQL.2

²

Works that simply mention general terms such as “semantic” or “ontology” may be excluded by this criterion if they do not also directly use or depend upon a Semantic Web standard.

By Information Extraction methods, we focus on the extraction and/or linking of three main elements from an (unstructured or semi-structured) input source.

Entities: anything with named identity, typically an individual (e.g., Barack Obama, 1961).

Concepts: a conceptual grouping of elements. We consider two types of concepts:

Classes: a named set of individuals (e.g., U.S. President(s));

Topics: categories to which individuals or documents relate (e.g., U.S. Politics).

Relations: an n-ary tuple of entities ($n ⩾ 2$) with a predicate term denoting the type of relation (e.g., marry(Barack Obama, Michelle Obama, Chicago)).

More formally, we can consider entities as atomic elements from the domain, concepts as unary predicates, and relations as n-ary ( $n ⩾ 2$ ) predicates. We take a rather liberal interpretation of concepts to include both classes based on set-theoretic subsumption of instances (e.g., OWL classes [136]), as well as topics that form categories over which broader/narrower relations can be defined (e.g., SKOS concepts [207]). This is rather a practical decision that will allow us to draw together a collective summary of works in the interrelated areas of Terminology Extraction, Keyword Extraction, Topic Modeling, etc., under one heading.
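One compact way to write this reading down is sketched here; the symbols $\Delta$ (the domain of discourse), $e$, $C$ and $R$ are our own illustrative notation and are not taken from the surveyed works.

$$
e \in \Delta, \qquad C \subseteq \Delta, \qquad R \subseteq \Delta^{n} \quad (n \geq 2)
$$

Under this reading, Barack Obama is an element of $\Delta$, U.S. President(s) is a subset of $\Delta$ that contains it, and the marry example above is a tuple in a ternary relation over $\Delta^{3}$.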

Returning to “extracting and/or linking”, we consider the extraction process as identifying mentions referring to such entities/concepts/relations in the unstructured or semi-structured input, while we consider the linking process as associating a disambiguated identifier in a Semantic Web ontology/KB for a mention, possibly creating one if not already present and using it to disambiguate and link further mentions.
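The "link, or create and then reuse" behaviour described above can be sketched as follows. This is a minimal illustration under strong assumptions: the function name `link_mention`, the `ex:` prefix and the naming scheme for new identifiers are hypothetical, and the sketch ignores disambiguation entirely by matching on a normalized label.

```python
from typing import Dict


def link_mention(mention: str, kb_labels: Dict[str, str], new_prefix: str = "ex:") -> str:
    """Return the KB identifier for a mention, minting a new one if none exists."""
    key = mention.strip().lower()
    if key not in kb_labels:
        # Hypothetical naming scheme for newly created identifiers.
        kb_labels[key] = new_prefix + mention.strip().replace(" ", "_")
    return kb_labels[key]


# Toy KB index from normalized labels to identifiers (illustrative only).
kb = {"hawaii": "dbr:Hawaii"}

existing = link_mention("Hawaii", kb)    # already present in the KB
created = link_mention("Jane Doe", kb)   # not present: a new identifier is created
reused = link_mention("jane doe", kb)    # a later mention links to the new identifier
print(existing, created, reused)
```

The third call shows the point made in the text: once an identifier has been created for a previously unseen element, further mentions can be linked to it.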

Information Extraction Tasks: The survey deals with various Information Extraction tasks. We now give an introductory summary of the main tasks considered (though we note that the survey will delve into each task in much more depth later):

Named Entity Recognition:

demarcate the locations of mentions of entities in an input text;

aka. Entity Recognition, Entity Extraction;

e.g., in the sentence “Barack Obama was born in Hawaii”, mark the phrases “Barack Obama” and “Hawaii” as entity mentions.

Entity Linking:

associate mentions of entities with an appropriate disambiguated KB identifier;

involves, or is sometimes synonymous with, Entity Disambiguation;3

often used for the purposes of Semantic Annotation;

e.g., associate “Hawaii” with the DBpedia identifier dbr:Hawaii for the U.S. state (rather than the identifiers for various songs or books of the same name).4

³

In some cases Entity Linking is considered to include both recognition and disambiguation; in other cases, it is considered synonymous with disambiguation applied after recognition.

⁴

We use well-known IRI prefixes as consistent with the lookup service hosted at: http://prefix.cc. All URLs in this paper were last accessed on 2018/05/30.

Terminology Extraction:

extract the main phrases that denote concepts relevant to a given domain described by a corpus, sometimes inducing hierarchical relations between concepts;

aka. Term Extraction, often used for the purposes of Ontology Learning;

e.g., identify from a text on Oncology that “breast cancer” and “melanoma” are important concepts in the domain;

optionally identify that both of the above concepts are specializations of “cancer”;

terms may be linked to a KB/ontology.

Keyphrase Extraction:

extract the main phrases that categorize the subject/domain of a text (unlike Terminology Extraction, the focus is often on describing the document, not the domain);

aka. Keyword Extraction, which is often generically applied to cover extraction of multi-word phrases; often used for the purposes of Semantic Annotation;

e.g., identify that the keyphrases “breast cancer” and “mammogram” help to summarize the subject of a particular document;

keyphrases may be linked to a KB/ontology.

Topic Modeling:

Cluster words/phrases frequently co-occurring together in the same context; these clusters are then interpreted as being associated to abstract topics to which a text relates;

aka. Topic Extraction, Topic Classification;

e.g., identify that words such as “cancer”, “breast”, “doctor”, “chemotherapy” tend to co-occur frequently and thus conclude that a document containing many such occurrences is about a particular abstract topic.

Topic Labeling:

For clusters of words identified as abstract topics, extract a single term or phrase that best characterizes the topic;

aka. Topic Identification, esp. when linked with an ontology/KB identifier; often used for the purposes of Text Classification;

e.g., identify that the topic {“cancer”, “breast”, “doctor”, “chemotherapy”} is best characterized with the term “cancer” (potentially linked to dbr:Cancer for the disease and not, e.g., the astrological sign).

Relation Extraction:

Extract potentially n-ary relations (for $n ⩾ 2$ ) from an unstructured (i.e., text) or semi-structured (e.g., HTML table) source;

a goal of the area of Open Information Extraction;

e.g., in the sentence “Barack Obama was born in Hawaii”, extract the binary relation wasBornIn(Barack Obama,Hawaii);

binary relations may be represented as RDF triples after linking entities and linking the predicate to an appropriate property (e.g., mapping wasBornIn to the DBpedia property dbo:birthPlace);

n-ary ( $n ⩾ 3$ ) relations are often represented with a variant of reification [134,272].
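To make the last two points concrete, the following sketch turns the binary wasBornIn example into a triple and represents the ternary marry example with an intermediate node carrying one triple per argument, which is one common variant of reification. Triples are plain Python tuples of prefixed names. The lookup tables stand in for the entity- and predicate-linking steps, and the names `ex:Marriage`, `ex:spouse1`, `ex:spouse2`, `ex:place` and `_:m1` are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Stand-ins for the results of entity linking and predicate linking.
ENTITY_LINKS: Dict[str, str] = {
    "Barack Obama": "dbr:Barack_Obama",
    "Michelle Obama": "dbr:Michelle_Obama",
    "Hawaii": "dbr:Hawaii",
    "Chicago": "dbr:Chicago",
}
PROPERTY_LINKS: Dict[str, str] = {"wasBornIn": "dbo:birthPlace"}


def binary_to_triple(predicate: str, subj: str, obj: str) -> Triple:
    """Map an extracted binary relation to a single triple."""
    return (ENTITY_LINKS[subj], PROPERTY_LINKS[predicate], ENTITY_LINKS[obj])


def nary_to_triples(node: str, relation_class: str, roles: Dict[str, str]) -> List[Triple]:
    """Represent an n-ary relation as a node with one triple per argument."""
    triples: List[Triple] = [(node, "rdf:type", relation_class)]
    for role, mention in roles.items():
        triples.append((node, role, ENTITY_LINKS[mention]))
    return triples


born = binary_to_triple("wasBornIn", "Barack Obama", "Hawaii")
marriage = nary_to_triples(
    "_:m1",
    "ex:Marriage",  # hypothetical class and role properties
    {"ex:spouse1": "Barack Obama", "ex:spouse2": "Michelle Obama", "ex:place": "Chicago"},
)
print(born)
for triple in marriage:
    print(triple)
```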

Note that we will use a more simplified nomenclature { Entity, Concept, Relation } × { Extraction, Linking } as previously described to structure our survey with the goal of grouping related works together; thus, works on Terminology Extraction, Keyphrase Extraction, Topic Modeling and Topic Labeling will be grouped under the heading of Concept Extraction and Linking.

Again we are only interested in such tasks in the context of the Semantic Web. Our focus is on unstructured (text) inputs, but we will also give an overview of methods for semi-structured inputs (markup documents and tables) towards the end of the survey.

Related Areas, Surveys and Novelty: There are a variety of areas that relate and overlap with the scope of this survey, and likewise there have been a number of previous surveys in these areas. We now discuss such areas and surveys, how they relate to the current contribution, and outline the novelty of the current survey.

As we will see throughout this survey, Information Extraction (IE) from unstructured sources – i.e., textual corpora expressed primarily in natural language – relies heavily on Natural Language Processing (NLP). A number of resources have been published within the intersection of NLP and the Semantic Web (SW), where we can point, for example, to a recent book published by Maynard et al. [191] in 2016, which likewise covers topics relating to IE. However, while IE tools may often depend on NLP processing techniques, this is not always the case, where many modern approaches to tasks such as Entity Linking do not use a traditional NLP processing pipeline. Furthermore, unlike the introductory textbook by Maynard et al. [191], our goal here is to provide a comprehensive survey of the research works in the area. Note that we also provide a brief primer on the most important NLP techniques in a supplementary appendix, discussed later.

On the other hand, Data Mining involves extracting patterns inherent in a dataset. Example Data Mining tasks include classification, clustering, rule mining, predictive analysis, outlier detection, recommendation, etc. Knowledge Discovery refers to a higher-level process to help users extract knowledge from raw data, where a typical pipeline involves selection of data, pre-processing and transformation of data, a Data Mining phase to extract patterns, and finally evaluation and visualization to help users gain knowledge from the raw data and provide feedback. Some IE techniques may rely on extracting patterns from data, which can be seen as a Data Mining step;5 however, Information Extraction need not use Data Mining techniques, and many Data Mining tasks – such as outlier detection – have only a tenuous relation to Information Extraction. A survey of approaches that combine Data Mining/Knowledge Discovery with the Semantic Web was published by Ristoski and Paulheim [261] in 2016.

⁵

In fact, the term “Information Extraction” pre-dates the term “Data Mining” in its modern interpretation.

With respect to our survey, both Natural Language Processing and Data Mining form part of the background of our scope, but as discussed, Information Extraction has a rather different focus to both areas, neither covering nor being covered by either.

On the other hand, relating more specifically to the intersection of Information Extraction and the Semantic Web, we can identify the following (sub-)areas:

Semantic Annotation:

aims to annotate documents with entities, classes, topics or facts, typically based on an existing ontology/KB. Some works on Semantic Annotation fall within the scope of our survey as they include extraction and linking of entities and/or concepts (though not typically relations). A survey focused on Semantic Annotation was published by Uren et al. [301] in 2006.

Ontology-Based Information Extraction:

refers to leveraging the formal knowledge of ontologies to guide a traditional Information Extraction process over unstructured corpora. Such works fall within the scope of this survey. A prior survey of Ontology-Based Information Extraction was published by Wimalasuriya and Dou [313] in 2010.

Ontology Learning:

helps automate the (costly) process of ontology building by inducing an (initial) ontology from a domain-specific corpus. Ontology Learning also often includes Ontology Population, meaning that instances of concepts and relations are also extracted. Such works fall within our scope. A survey of Ontology Learning was provided by Wong et al. [316] in 2012.

Knowledge Extraction:

aims to lift an unstructured or semi-structured corpus into an output described using a knowledge representation formalism (such as OWL). Thus Knowledge Extraction can be seen as Information Extraction but with a stronger focus on using knowledge representation techniques to model outputs. In 2013, Gangemi [112] provided an introduction and comparison of fourteen tools for Knowledge Extraction over unstructured corpora.

Other related terms such as “Semantic Information Extraction” [110], “Knowledge-Based Information Extraction” [140], “Knowledge-Graph Completion” [179], and so forth, have also appeared in the literature. However, many such titles are used specifically within a given community, whereas works in the intersection of IE and SW have appeared in many communities. For example, “Knowledge Extraction” is used predominantly by the SW community and not others.6

⁶

Here we mean “Knowledge Extraction” in an IE-related context. Other works on generating explanations from neural networks use the same term in an unrelated manner.

Hence our survey can be seen as drawing together works in such (sub-)areas under a more general scope: works involving IE techniques in a SW setting.

Intended Audience: This survey is written for researchers and practitioners who are already quite familiar with the main SW standards and concepts – such as the RDF, RDFS, OWL and SPARQL standards, etc. – but are not necessarily familiar with IE techniques. Hence we will not introduce SW concepts (such as RDF, OWL, etc.) herein. Otherwise, our goal is to make the survey as accessible as possible. For example, in order to make the survey self-contained, in Appendix A we provide a detailed primer on some traditional NLP and IE processes; the techniques discussed in this appendix are, in general, not in the scope of the survey, since they do not involve SW resources, but are heavily used by works that fall in scope. We recommend that readers unfamiliar with the IE area read the appendix as a primer before proceeding to the main body of the survey. Knowledge of some core Information Retrieval concepts – such as TF–IDF, PageRank, cosine similarity, etc. – and some core Machine Learning concepts – such as logistic regression, SVM, neural networks, etc. – may be necessary to understand finer details, but not to understand the main concepts.

Nomenclature: The area of Information Extraction is associated with a diverse nomenclature that may vary in use and connotation from author to author. Such variations may at times be subtle and at other times be entirely incompatible. Part of this relates to the various areas in which Information Extraction has been applied and the variety of areas from which it draws influence. We will attempt to use generalized terminology and indicate when terminology varies.

Survey Methodology: Based on the previous discussion, this survey includes papers that:

deal with extraction and/or linking of entities, concepts and/or relations,

deal with some Semantic Web standard – namely RDF, RDFS or OWL – or a resource published or otherwise using those standards,

have details published, in English, in a relevant workshop, conference or journal since 1999,

consider extraction from unstructured sources.

For finding in-scope papers, our methodology begins with a definition of keyphrases appropriate to the section at hand. These keyphrases are divided into lists of IE-related terms (e.g., “entity extraction”, “entity linking”) and SW-related terms (e.g., “ontology”, “linked data”), where we apply a conjunction of their products to create search phrases (e.g., “entity extraction ontology”). Given the diverse terminology used in different communities, often we need to try many variants of keyphrases to capture as many papers as possible. Table 1 lists the base keyphrases used to search for papers; the final keyword searches are given by the set $(E \cup C \cup R) ‖ SW$ , where “‖” denotes concatenation (with a delimiting space).

Table 1

Keywords used to search for candidate papers. E/C/R list keywords relating to entities, concepts and relations; SW lists keywords relating to the Semantic Web

Type	Keyword set
E	"coreference resolution", "entity disambiguation", "entity linking", "entity recognition", "entity resolution", "named entity", "semantic annotation"
C	"concept models", "glossary extraction", "group detection", "keyphrase assignment", "keyphrase extraction", "keyphrase recognition", "keyword assignment", "keyword extraction", "keyword recognition", "latent variable models", "LDA", "LSA", "pLSA", "term extraction", "term recognition", "terminology mining", "topic extraction", "topic identification", "topic modeling"
R	"OpenIE", "open information extraction", "open knowledge extraction", "relation detection", "relation extraction", "semantic relation"
SW	"linked data", "ontology", "OWL", "RDF", "RDFS", "semantic web", "SPARQL", "web of data"
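The construction of search phrases from Table 1 can be sketched as follows: each IE-related keyphrase is concatenated, with a delimiting space, to each SW-related keyphrase. The keyword lists are copied from Table 1; the function name `search_phrases` is our own, and the sketch covers only the base keyphrases, not the further variants tried in practice.

```python
from itertools import product
from typing import List

E = ["coreference resolution", "entity disambiguation", "entity linking",
     "entity recognition", "entity resolution", "named entity", "semantic annotation"]
C = ["concept models", "glossary extraction", "group detection", "keyphrase assignment",
     "keyphrase extraction", "keyphrase recognition", "keyword assignment",
     "keyword extraction", "keyword recognition", "latent variable models",
     "LDA", "LSA", "pLSA", "term extraction", "term recognition",
     "terminology mining", "topic extraction", "topic identification", "topic modeling"]
R = ["OpenIE", "open information extraction", "open knowledge extraction",
     "relation detection", "relation extraction", "semantic relation"]
SW = ["linked data", "ontology", "OWL", "RDF", "RDFS", "semantic web",
      "SPARQL", "web of data"]


def search_phrases(ie_terms: List[str], sw_terms: List[str]) -> List[str]:
    """Concatenate every IE term with every SW term, separated by a space."""
    return [f"{ie} {sw}" for ie, sw in product(ie_terms, sw_terms)]


phrases = search_phrases(E + C + R, SW)
print(len(phrases))   # (7 + 19 + 6) * 8 = 256 base search phrases
print(phrases[0])     # "coreference resolution linked data"
```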

Our survey methodology consists of four initial phases to search, extract and filter papers. For each defined keyphrase, we (I) perform a search on Google Scholar for related papers, merging and deduplicating lists of candidate papers (numbering in the thousands in total); (II) we initially apply a rough filter for relevance based on the title and type of publication; (III) we filter for relevance by abstract; and (IV) finally we filter for relevance by the body of the paper.

To collect further literature, while reading relevant papers, we also take note of other works referenced in related works, works that cite more prominent relevant papers, and also check the bibliography of prominent authors in the area for other papers that they have written; such works were added in phase III to be later filtered in phase IV. Table 2 presents the numbers of papers considered by each phase of the methodology.7

⁷

Table 2 refers to papers considering text as input; a further 20 papers considering semi-structured inputs are presented later in the survey, which will bring the total to 109 selected papers.

Table 2

Number of papers included in the survey (by phase). E/C/R denote counts of highlighted papers in this survey relating to entities, concepts, and relations, resp.; Σ denotes the sum of E + C + R by phase

Phase	E	C	R	Σ
Seed collection (I)	2,418	8,666	8,008	19,092
Filter by title (II)	114	167	148	429
Filter by abstract (III)	100	115	102	317
Final list (IV)	25	36	28	89
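As a simple check on Table 2, the following sketch recomputes the Σ column and the fraction of papers carried from each phase to the next. The counts are copied from the table; the function names are our own. Since some papers were added in phase III from reference lists, these fractions are only a rough indication of how selective each filter was.

```python
from typing import Dict, List, Tuple

# Counts copied from Table 2 (entities, concepts, relations per phase).
PHASES: List[Tuple[str, Dict[str, int]]] = [
    ("Seed collection (I)", {"E": 2418, "C": 8666, "R": 8008}),
    ("Filter by title (II)", {"E": 114, "C": 167, "R": 148}),
    ("Filter by abstract (III)", {"E": 100, "C": 115, "R": 102}),
    ("Final list (IV)", {"E": 25, "C": 36, "R": 28}),
]


def phase_totals(phases: List[Tuple[str, Dict[str, int]]]) -> List[int]:
    """Sum E + C + R for each phase."""
    return [sum(counts.values()) for _, counts in phases]


def retention(totals: List[int]) -> List[float]:
    """Fraction of papers kept from each phase to the next."""
    return [later / earlier for earlier, later in zip(totals, totals[1:])]


totals = phase_totals(PHASES)
rates = retention(totals)
for (name, _), total in zip(PHASES, totals):
    print(f"{name}: {total}")
for (name, _), rate in zip(PHASES[1:], rates):
    print(f"kept at {name}: {rate:.1%}")
```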

We provide further details of our survey online, including the lists of papers considered by each phase.8

⁸

http://www.tamps.cinvestav.mx/~lmartinez/survey/

We may include out-of-scope papers to the extent that they serve as important background for the in-scope papers: for example, it is important for an uninitiated reader to understand some of the core techniques considered in the traditional Information Extraction area and to understand some of the core standards and resources considered in the core Semantic Web area. Furthermore, though not part of the main survey, in Section 5, we provide a brief overview of otherwise related papers that consider semi-structured input sources, such as markup documents, tables, etc.

Survey Structure: The structure of the remainder of this survey is as follows:

Section 2

discusses extraction and linking of entities for unstructured sources.

Section 3

discusses extraction and linking of concepts for unstructured sources.

Section 4

discusses extraction and linking of relations for unstructured sources.

Section 5

discusses techniques adapted specifically for extracting entities, concepts and/or relations from semi-structured sources.

Section 6

concludes the survey with a discussion.

Additionally, Appendix A provides a primer on classical Information Extraction techniques for readers previously unfamiliar with the IE area; we recommend that such readers review this material before continuing.

2. Entity extraction & linking

Entity Extraction & Linking (EEL)9 refers to identifying mentions of entities in a text and linking them to a reference KB provided as input.

⁹
We note that naming conventions can vary widely: sometimes Named Entity Linking (NEL) is used; sometimes the acronym (N)ERD is used for (Named) Entity Recognition & Disambiguation; sometimes EEL is used as a synonym for NED; other phrases can also be used, such as Named Entity Extraction (NEE), or Named Entity Resolution, or variations on the idea of semantic annotation or semantic tagging (which we consider applications of EEL).

Entity Extraction can be performed using an off-the-shelf Named Entity Recognition (NER) tool as used in traditional IE scenarios (see Appendix A.1); however such tools typically extract entities for limited numbers of types, such as persons, organizations, places, etc.; on the other hand, the reference KB may contain entities from hundreds of types. Hence, while some Entity Extraction & Linking tools rely on off-the-shelf NER tools, others define bespoke methods for identifying entity mentions in text, typically using entities’ labels in the KB as a dictionary to guide the extraction.

Once entity mentions are extracted from the text, the next phase involves linking – or disambiguating – these mentions by assigning them to KB identifiers. Typically, each mention in the text is either associated with a single KB identifier chosen by the process as the most likely match, or associated with multiple KB identifiers, each with a weight (aka. support) indicating confidence in the match, which allows the application to choose which entity links to trust.

Example: In Listing 1, we provide an excerpt of an EEL response given by the online DBpedia Spotlight demo10 in JSON format. Within the result, the “@URI” attribute is the selected identifier obtained from DBpedia, the “@support” is a degree of confidence in the match, the “@types” list matches classes from the KB, the “@surfaceForm” represents the text of the entity mention, the “@offset” indicates the character position of the mention in the text, the “@similarityScore” indicates the strength of a match with the entity label in the KB, and the “@percentageOfSecondRank” indicates the ratio of the support computed for the first- and second-ranked documents thus indicating the level of ambiguity.

¹⁰

http://dbpedia-spotlight.github.io/demo/

Listing 1.

DBpedia spotlight EEL example

Of course, the exact details of the output of an EEL process will vary from tool to tool, but such a tool will minimally return a KB identifier and the location of the entity mention; a support will also often be returned.
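To illustrate how an application might consume such a response, the following sketch parses a JSON record carrying the attributes described above and keeps only links above a chosen similarity threshold. The sample record is invented for illustration and is not actual output of the demo; the wrapper key "Resources", the string-encoded values, the value given for “@types” and the threshold are all assumptions.

```python
import json
from typing import List, Tuple

TEXT = "Barack Obama was born in Hawaii"

# Invented record for illustration only; NOT real output from DBpedia Spotlight.
SAMPLE_RESPONSE = """
{"Resources": [
  {"@URI": "http://dbpedia.org/resource/Hawaii",
   "@support": "1000",
   "@types": "DBpedia:Place",
   "@surfaceForm": "Hawaii",
   "@offset": "25",
   "@similarityScore": "0.99",
   "@percentageOfSecondRank": "0.01"}
]}
"""


def confident_links(response_text: str, min_similarity: float = 0.9) -> List[Tuple[int, str, str]]:
    """Return (offset, surface form, KB identifier) for sufficiently strong matches."""
    resources = json.loads(response_text).get("Resources", [])
    links = []
    for res in resources:
        if float(res["@similarityScore"]) >= min_similarity:
            links.append((int(res["@offset"]), res["@surfaceForm"], res["@URI"]))
    return links


links = confident_links(SAMPLE_RESPONSE)
for offset, surface, uri in links:
    # The offset locates the mention in the original text.
    assert TEXT[offset:offset + len(surface)] == surface
    print(offset, surface, uri)
```

The identifier and the mention location are the minimal outputs noted above; the threshold on the score shows how an application could use a returned weight to decide which links to trust.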

Applications: EEL is used in a variety of applications, such as semantic annotation [41], where entities mentioned in text can be further detailed with reference data from the KB; semantic search [296], where search over textual collections can be enhanced – for example, to disambiguate entities or to find categories of relevant entities – through the structure provided by the KB; question answering [300], where the input text is a user question and the EEL process can identify which entities in the KB the question refers to; focused archival [81], where the goal is to collect and preserve documents relating to particular entities; detecting emerging entities [137], where entities that do not yet appear in the KB, but may be candidates for adding to the KB, are extracted.11

¹¹

Emerging entities are also sometimes known as Out-Of Knowledge-Base (OOKB) entities or Not In Lexicon (NIL) entities.

EEL can also serve as the basis for later IE processes, such as Topic Modeling, Relation Extraction, etc., as discussed later.

Process: As stated by various authors [62,154,243,246], the EEL process is typically composed of two main steps: recognition, where relevant entity mentions in the text are found; and disambiguation, where entity mentions are mapped to candidate identifiers with a final weighted confidence. Since these steps are (often) loosely coupled, this section surveys the various techniques proposed for the recognition task and thereafter discusses disambiguation.

2.1. Recognition

The goal of EEL is to extract and link entity mentions in a text with entity identifiers in a KB; some tools may additionally detect and propose identifiers for emerging entities that are not yet found in the KB [231,238,247]. In both cases, the first step is to mark entity mentions in the text that can be linked (or proposed as an addition) to the KB. Thus traditional NER tools – discussed in Appendix A.1 – can be used. However, in the context of EEL where a target KB is given as input, there can be key differences between a typical EEL recognition phase and traditional NER:

In cases where emerging entities are not detected, the KB can provide a full list of target entity labels, which can be stored in a dictionary that is used to find mentions of those entities. While dictionaries can be found in traditional NER scenarios, these often refer to individual tokens that strongly indicate an entity of a given type, such as common first or family names, lists of places and companies, etc. On the other hand, in EEL scenarios, the dictionary can be populated with complete entity labels from the KB for a wider range of types; in scenarios not involving emerging entities, this dictionary will be complete for the entities to recognize. Of course, this can lead to a very large dictionary, depending on the KB used.

Relating to the previous point, (particularly) in scenarios where a complete dictionary is available, the line between extraction and linking can become blurred since labels in the dictionary from the KB will often be associated with KB identifiers; hence, dictionary-based detection of entities will also provide initial links to the KB. Such approaches are sometimes known as End-to-End (E2E) approaches [247], where extraction and linking phases become more tightly coupled.

In traditional NER scenarios, extracted entity mentions are typically associated with a type, usually with respect to a number of trained types such as person, organization, and location. However, in many EEL scenarios, the types are already given by the KB and are in fact often much richer than what traditional NER models support.
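The first two points can be illustrated with a minimal dictionary-based recognizer that scans the text for the longest token sequence matching a KB label and, because labels are stored with identifiers, returns candidate links at the same time. This is a sketch under simplifying assumptions (whitespace tokenization, exact matching, a four-token window, an illustrative toy dictionary) and is not the method of any particular surveyed system.

```python
from typing import Dict, List, Set, Tuple


def spot_mentions(text: str, dictionary: Dict[str, Set[str]],
                  max_len: int = 4) -> List[Tuple[str, Set[str]]]:
    """Greedy longest-match lookup of KB labels over whitespace tokens."""
    tokens = text.split()
    mentions = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest window first so that longer labels win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).strip(".,;:!?")
            if candidate in dictionary:
                mentions.append((candidate, dictionary[candidate]))
                matched = n
                break
        i += matched if matched else 1
    return mentions


# Toy dictionary from KB labels to candidate identifiers (illustrative).
DICTIONARY = {
    "Barack Obama": {"dbr:Barack_Obama"},
    "Obama": {"dbr:Barack_Obama", "dbr:Obama,_Fukui"},
    "Hawaii": {"dbr:Hawaii"},
}

found = spot_mentions("Barack Obama was born in Hawaii.", DICTIONARY)
for label, candidates in found:
    print(label, sorted(candidates))
```

Note that the longer label “Barack Obama” is preferred over the ambiguous shorter label “Obama”, and that each recognized mention already comes with candidate identifiers.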

In this section, we thus begin by discussing the preparation of a dictionary and methods used for recognizing entities in the context of EEL.

2.1.1. Dictionary

The predominant method for performing EEL relies on using a dictionary – also known as a lexicon or gazetteer – which maps labels of target entities in the KB to their identifiers; for example, a dictionary might map the label “Bryan Cranston” to the DBpedia IRI dbr:Bryan_Cranston. In fact, a single KB entity may have multiple labels (aka. aliases) that map to one identifier, such as “Bryan Cranston”, “Bryan Lee Cranston”, “Bryan L. Cranston”, etc. Furthermore, some labels may be ambiguous, where a single label may map to a set of identifiers; for example, “Boston” may map to dbr:Boston, dbr:Boston_(band), and so forth. Hence a dictionary may map KB labels to identifiers in a many-to-many fashion. Finally, for each KB identifier, a dictionary may contain contextual features to help disambiguate entities in a later stage; for example, context information may tell us that dbr:Boston is typed as dbo:City in the KB, or that known mentions of dbr:Boston in a text often have words like “population” or “metropolitan” nearby.
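A dictionary of this kind might be organized as sketched here: one index from labels to sets of identifiers (many-to-many), and one from identifiers to contextual features. The structure and names (`label_to_ids`, `id_to_context`, `rank_candidates`) are hypothetical, the context words given for the band are invented, and the overlap count used for ranking is only a crude placeholder for the disambiguation techniques discussed later.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

label_to_ids: Dict[str, Set[str]] = defaultdict(set)
id_to_context: Dict[str, Dict[str, Set[str]]] = {}


def add_entity(identifier: str, labels: Iterable[str],
               types: Iterable[str] = (), context_words: Iterable[str] = ()) -> None:
    """Register an entity under all of its labels, with optional context features."""
    for label in labels:
        label_to_ids[label.lower()].add(identifier)
    id_to_context[identifier] = {"types": set(types), "words": set(context_words)}


add_entity("dbr:Bryan_Cranston",
           ["Bryan Cranston", "Bryan Lee Cranston", "Bryan L. Cranston"])
add_entity("dbr:Boston", ["Boston"], types=["dbo:City"],
           context_words=["population", "metropolitan"])
add_entity("dbr:Boston_(band)", ["Boston"], types=["dbo:Band"],
           context_words=["album", "guitar"])  # invented context words


def rank_candidates(label: str, nearby_words: Iterable[str]) -> List[str]:
    """Order candidate identifiers by overlap between context and nearby words."""
    nearby = {w.lower() for w in nearby_words}
    scored = [(len(id_to_context[i]["words"] & nearby), i)
              for i in label_to_ids.get(label.lower(), ())]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


print(sorted(label_to_ids["boston"]))
print(rank_candidates("Boston", ["the", "population", "of"]))
```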

Thus, with respect to dictionaries, the first important aspect is the selection of entities to consider (or, indeed, the source from which to extract a selection of entities). The second important aspect – particularly given large dictionaries and/or large corpora of text – is the use of optimized indexes that allow for efficient matching of mentions with dictionary labels. The third aspect to consider is the enrichment of each entity in the dictionary with contextual information to improve the precision of matches. We now discuss these three aspects of dictionaries in turn.

Selection of entities: In the context of EEL, an obvious source from which to form the dictionary is the labels of target entities in the KB. In many Information Extraction scenarios, KBs pertaining to general knowledge are employed; the most commonly used are:

DBpedia [171]

A KB extracted from Wikipedia and used by ADEL [247], DBpedia Spotlight [200], ExPoSe [238], Kan-Dis [145], NERSO [125], Seznam [96], SDA [42] and THD [84], as well as works by Exner and Nugues [97], Nebhi [226], Giannini et al. [118], amongst others;

Freebase [26]

A collaboratively-edited KB – previously hosted by Google but now discontinued in favor of Wikidata [294] – used by JERL [184], Kan-Dis [145], NEMO [67], Neofonie [158], NereL [281], Seznam [96], Tulip [181], as well as works by Zheng et al. [330], amongst others;

Wikidata [309]

A collaboratively-edited KB hosted by the Wikimedia Foundation that, although released more recently than other KBs, has been used by HERD [284];

YAGO(2) [139]

Another KB extracted from Wikipedia with richer meta-data, used by AIDA [140], AIDA-Light [230], CohEEL [122], J-NERD [231], KORE [138] and LINDEN [278], as well as works by Abedini et al. [1], amongst others.

These KBs are tightly coupled with owl:sameAs links establishing KB-level coreference and are also tightly coupled with Wikipedia; this implies that once entities are linked to one such KB, they can be transitively linked to the other KBs mentioned, and vice versa. KBs that are tightly coupled with Wikipedia in this manner are popular choices for EEL since they describe a comprehensive set of entities that cover numerous domains of general interest; furthermore, the text of Wikipedia articles on such entities can form a useful source of contextual information.

On the other hand, many of the entities in these general-interest KBs may be irrelevant for certain application scenarios. Some systems support selecting a subset of entities from the KB to form the dictionary, potentially pertaining to a given domain or a selection of types. For example, DBpedia Spotlight [200] can build a dictionary from the DBpedia entities returned as results for a given SPARQL query. Such a pre-selection of relevant entities can help reduce ambiguity and tailor EEL for a given application.

Conversely, in EEL scenarios targeting niche domains not covered by Wikipedia and its related KBs, custom KBs may be required. For example, for the purposes of supporting multilingual EEL, Babelfy [216] constructs its own KB from a unification of Wikipedia, WordNet, and BabelNet. In the context of Microsoft Research, JERL [184] uses a proprietary KB (Microsoft’s Satori) alongside Freebase. Other approaches make minimal assumptions about the KB used, where earlier EEL approaches such as SemTag [82] and KIM [250] only assume that KB entities are associated with labels (in experiments, SemTag [82] uses Stanford’s TAP KB, while KIM [250] uses a custom KB called KIMO).

Dictionary matching and indexing: In order to match mentions with the dictionary in an efficient manner – with $O (1)$ or $O (\log (n))$ lookup performance – optimized data structures are required, which depend on the form of matching employed. The need for efficiency is particularly important for some of the KBs previously mentioned, where the number of target entities involved can go into the millions. The size of the input corpora is also an important consideration: while slower (but potentially more accurate) matching algorithms can be tolerated for smaller inputs, such algorithms are impractical for larger input texts.

A major challenge is that desirable matches may not be an exact match, but may rather only be captured by an approximate string-matching algorithm. While one could consider, for example, approximate matching based on regular expressions or edit distances, such measures do not lend themselves naturally to index-based approaches. Instead, for large dictionaries, or large input corpora, it may be necessary to trade recall (i.e., the percentage of correct spots captured) for efficiency by using coarser matching methods. Likewise, it is important to note that KBs such as DBpedia enumerate multiple “alias” labels for entities (extracted from the redirect entries in Wikipedia), which if included in the dictionary, can help to improve recall while using coarser matching methods.

A popular approach to index the dictionary is to use some variation on a prefix tree (aka. trie), such as used by the Aho–Corasick string-searching algorithm, which can find mentions of an input list of strings within an input text in time linear in the combined size of the inputs and output. The main idea is to represent the dictionary as a prefix tree where nodes refer to letters, and transitions refer to sequences of letters in a dictionary word; further transitions are put from failed matches (dead-ends) to the node representing the longest matching prefix in the dictionary. With the dictionary preloaded into the index, the text can then be streamed through the index to find (prefix) matches. Phrases are typically indexed separately to allow both word-level and phrase-level matching. This algorithm is implemented by GATE [68] and LingPipe [38], with the latter being used by DBpedia Spotlight [200].
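To make the role of the failure transitions concrete, the following is a minimal, character-level sketch of an Aho–Corasick automaton in plain Python. It is not the GATE or LingPipe implementation; the function names are hypothetical, matching is exact and case-sensitive, and word boundaries are ignored for brevity.

```python
from collections import deque

def build_automaton(labels):
    """Build a character trie with failure links over the dictionary labels."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # labels recognised on reaching each state
    for label in labels:
        state = 0
        for ch in label:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(label)
    # Breadth-first pass: link each state to the longest proper suffix of
    # its path that is also a prefix of some label.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_mentions(text, automaton):
    """Stream the text through the automaton; return (start offset, label)."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # follow failure links on a dead-end
        state = goto[state].get(ch, 0)
        for label in out[state]:
            matches.append((i - len(label) + 1, label))
    return matches

automaton = build_automaton(["New York City", "York City", "Boston"])
print(find_mentions("She moved from Boston to New York City.", automaton))
# [(15, 'Boston'), (25, 'New York City'), (29, 'York City')]
```

Note that the text is read once, left to right, and that overlapping labels such as "York City" inside "New York City" are both reported; deciding which of them to keep is left to a later step.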

The main drawback of tries is that, for the matching process to be performed efficiently, the dictionary index must fit in memory, which may be prohibitive for very large dictionaries. For this reason, the Lucene/Solr Tagger implements a more general finite state transducer that also reuses suffixes and byte-encodings to reduce space [70]; this index is used by HERD [284] and Tulip [181] to store KB labels.

In other cases, rather than using traditional Information Extraction frameworks, some authors have proposed to implement custom indexing methods. To give some examples, KIM [250] uses a hash-based index over tokens in an entity mention;¹² AIDA-Light [230] uses a Locality Sensitive Hashing (LSH) index to find approximate matches in cases where an initial exact-match lookup fails; and so forth.

¹² This implementation was later integrated into GATE: https://gate.ac.uk/sale/tao/splitch13.html.

Of course, the problem of indexing the dictionary is closely related to the problem of inverted indexing in Information Retrieval, where keywords are indexed against the documents that contain them. Such inverted indexes have proven their scalability and efficiency in Web search engines such as Google, Bing, etc., and likewise support simple forms of approximate matching based on, for example, stemming or lemmatization, which pre-normalize document and query keywords. Exploiting this natural link to Information Retrieval, the ADEL [247], AGDISTIS [302], Kan-Dis [145], TagMe [101] and WAT [243] systems use inverted-indexing schemes such as Lucene¹³ and Elastic.¹⁴

¹³ http://lucene.apache.org/core/

¹⁴ https://www.elastic.co; note that ElasticSearch is in fact based on Lucene.
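The idea of an inverted index with pre-normalized keywords can be sketched in a few lines of Python. This is only a toy stand-in for engines such as Lucene: the `normalize` function is a deliberately crude substitute for real stemming, and the `ex:` identifiers and function names are hypothetical.

```python
from collections import defaultdict

def normalize(token):
    """Crude stand-in for stemming: lower-case and strip a trailing 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(labels_to_ids):
    """Index each normalized label token against the identifiers using it."""
    index = defaultdict(set)
    for label, ids in labels_to_ids.items():
        for token in label.split():
            index[normalize(token)].update(ids)
    return index

def lookup(index, mention):
    """Return identifiers whose labels contain every token of the mention."""
    postings = [index.get(normalize(t), set()) for t in mention.split()]
    return set.intersection(*postings) if postings else set()

labels = {
    "Boston": {"dbr:Boston", "dbr:Boston_(band)"},
    "Boston Red Sox": {"ex:Boston_Red_Sox"},
    "The Rolling Stones": {"ex:The_Rolling_Stones"},
}
index = build_index(labels)
```

With this index, the query "rolling stone" still retrieves the entity labeled "The Rolling Stones", since both sides are normalized before comparison, which illustrates the coarse approximate matching mentioned above.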

To manage the structured data associated with entities, such as identifiers or contextual features, some tools use more relational-style data management systems. For example, AIDA [140] uses the PostgreSQL relational database to retrieve entity candidates, while ADEL [247] and Neofonie [158] use the Couchbase¹⁵ and Redis¹⁶ NoSQL stores, respectively, to manage the labels and meta-data of their dictionaries.

¹⁵ http://www.couchbase.com

¹⁶ https://redis.io/

Contextual features: Rather than being a flat map of entity labels to (sets of) KB identifiers, dictionaries often include contextual features to later help disambiguate candidate links. Such contextual features may be categorized as being structured or unstructured.

Structured contextual features are those that can be extracted directly from a structured or semi-structured source. In the context of EEL, such features are often extracted from the reference KB itself. For example, each entity in the dictionary can be associated with the (labels of the) types of that entity, but also perhaps the labels of the properties that are defined for it, or a count of the number of triples it is associated with, or the entities it is related to, or its centrality (and thus “importance”) in the graph-structure of the KB, and so forth.

On the other hand, unstructured contextual features are those that must instead be extracted from textual corpora. In most cases, this will involve extracting statistics and patterns from an external reference corpus that potentially has already had its entities labeled (and linked with the KB). Such features may capture patterns in text surrounding the mentions of an entity, entities that are frequently mentioned close together, patterns in the anchor-text of links to a page about that entity, in how many documents a particular entity is mentioned, how many times it tends to be mentioned in a particular document, and so forth; clearly such information will not be available from the KB itself.

A very common choice of text corpora for extracting both structured and unstructured contextual features is Wikipedia, whose use in this setting was – to the best of our knowledge – first proposed by Bunescu and Pasca [33], then later followed by many other subsequent works [39,40,42,66,101,243,246,255]. The widespread use of Wikipedia can be explained by the unique advantages it has for such tasks:

The text in Wikipedia is primarily factual and available in a variety of languages.

Wikipedia has broad coverage, with documents about entities in a variety of domains.

Articles in Wikipedia can be directly linked to the entities they describe in various KBs, including DBpedia, Freebase, Wikidata, YAGO(2), etc.

Mentions of entities in Wikipedia often provide a link to the article about that entity, thus providing labeled examples of entity mentions and associated examples of anchor text in various contexts.

Aside from the usual textual features such as term frequencies and co-occurrences, a variety of richer features are available from Wikipedia that may not be available in other textual corpora, including disambiguation pages, redirections of aliases, category information, info-boxes, article edit history, and so forth.¹⁷

¹⁷ Information from info-boxes, disambiguation pages, redirects and categories is also represented in a structured format in DBpedia.

We will further discuss how contextual features – stored as part of the dictionary – can be used for disambiguation later in this section.

2.1.2. Spotting

We now assume a dictionary that maps labels (e.g., “Bryan Cranston”, “Bryan Lee Cranston”, etc.) to a (set of) KB identifier(s) for the entity in question (e.g., “dbr:Bryan_Cranston”) and potentially some contextual information (e.g., often co-occurs with “dbr:Breaking_Bad”, anchor text often uses the term “Heisenberg”, etc.). In the next step, we identify entity mentions in the input text. We refer to this process as spotting, where we survey key approaches.

Token-based: Given that entity mentions may consist of multiple sequential words – aka. n-grams – the brute-force option would be to send all n-grams in the input text to the dictionary, for n up to, say, the maximum number of words found in a dictionary entry, or a fixed parameter. We refer generically to these n-grams as tokens and to these methods for extracting n-grams as tokenization. Sometimes these methods are referred to as window-based spotting or recognition techniques.

A number of systems use such a form of tokenization. SemTag uses the TAP ontology for seeking entity mentions that match tokens from the input text. In AIDA-Light [230], AGDISTIS [302], Lupedia [204], and NERSO [125], recognition uses sliding windows over the text for varying-length n-grams.
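The brute-force window-based approach can be sketched as follows. This is a minimal illustration rather than the implementation of any of the systems cited: whitespace splitting stands in for a real tokenizer, the dictionary is a plain Python `dict`, and the `ex:` identifiers are hypothetical.

```python
def ngrams(tokens, max_n):
    """Yield (start index, n-gram) for every window of 1..max_n words."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, " ".join(tokens[i:i + n])

def spot(text, dictionary):
    """Send every n-gram to the dictionary and keep those that match."""
    tokens = text.split()
    # Window size is bounded by the longest label in the dictionary.
    max_n = max(len(label.split()) for label in dictionary)
    return [(i, gram, dictionary[gram])
            for i, gram in ngrams(tokens, max_n) if gram in dictionary]

dictionary = {
    "New York City": ["ex:New_York_City"],
    "York City": ["ex:York_City_(England)"],
}
print(spot("I love New York City", dictionary))
# [(3, 'York City', ['ex:York_City_(England)']), (2, 'New York City', ['ex:New_York_City'])]
```

For a text of k words, this issues on the order of k × max_n dictionary lookups, most of which fail, and it also returns the overlapping match on "York City".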

Although relatively straightforward, a fundamental weakness with token-based methods relates to performance: given a large text, the dictionary-lookup implementation will have to be very efficient to deal with the number of tokens a typical such process will generate, many of which will be irrelevant. While some basic features, such as capitalization, can also be taken into account to filter (some) tokens, still, not all mentions may have capitalization, and many irrelevant or incoherent entities can still be retrieved; for example, by decomposing the text “New York City”, the second bi-gram may produce York City in England as a candidate, though (probably) irrelevant to the mention. Such entities are known as overlapping entities, where post-processing must be applied (discussed later).

POS-based: A natural way to try to improve upon lexical tokenization methods in End-to-End systems is to use some initial understanding of the grammatical role of words in the text, where POS-tags are used in order to be more selective with respect to what tokens are sent to be matched against the dictionary.

A first idea is to use POS-tags to quickly filter individual words that are likely to be irrelevant, where traditional NLP/IE libraries can be used in a preprocessing step. For example, ADEL [247], AIDA [140], Babelfy [216] and WAT [243] use the Stanford POS-tagger to focus on extracting entity mentions from words tagged as NNP (proper noun, singular) and NNPS (proper noun, plural). DBpedia Spotlight [200] rather relies on LingPipe POS-tagging, where verbs, adjectives, adverbs, and prepositions from the input text are disregarded from the process.

On the other hand, entity mentions may involve words that are not nouns and may be disregarded by the system; this is particularly common for entity types not usually considered by traditional NER tools, including titles of creative works like “Breaking Bad”.¹⁸ Heuristics such as analysis of capitalization can be used in certain cases to prevent filtering useful words; however, in other cases where words are not capitalized, the process will likely fail to recognize such mentions unless further steps are taken. Along those lines, to improve recall, Babelfy [216] first uses a POS-tagger to identify nouns that match substrings of entity labels in the dictionary and then checks the surrounding text of the noun to try to expand the entity mention captured (using a maximum window of five words).

¹⁸ See Listing 8 where “Breaking” is tagged VBG (verb, gerund/present participle) and “Bad” as JJ (adjective).

Parser-based: Rather than developing custom methods, one could consider using more traditional NER techniques to identify entity mentions in the text. Such an approach could also be used, for example, to identify emerging entities not mentioned in the KB. However, while POS-tagging is generally quite efficient, applying a full constituency or dependency parse (aka. deep parsing methods) might be too expensive for large texts. On the other hand, recognizing entity mentions often does not require full parse trees.

As a trade-off, in traditional NER, shallow-parsing methods are often applied: such methods annotate an initial grouping – or chunking [191] – of individual words, materializing a shallow tier of the full parse-tree [68,88]. In the context of NER, noun-phrase chunks (see Listing 9 for an example NP/noun phrase annotation) are particularly relevant. As an example, the THD system [84] uses GATE’s rule-based Java Annotation Patterns Engine (JAPE) [68,295], consisting of regular-expression-like patterns over sequences of POS tags; more specifically, to extract entity mentions, THD uses the JAPE pattern NNP+, which will capture sequences of one-or-more proper nouns. A similar approach is taken by ExtraLink [34], which uses SProUT [88]’s XTDL rules – composed of regular-expressions over sequences of tokens typed with POS tags or dictionaries – to extract entity mentions.
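The effect of a pattern such as NNP+ can be approximated in plain Python by grouping consecutive proper-noun tokens. The sketch below is not JAPE and does not call a real tagger: the function name is hypothetical and the POS tags in the example are assigned by hand for illustration.

```python
from itertools import groupby

def proper_noun_chunks(tagged_tokens):
    """Capture maximal runs of one-or-more NNP tokens (the pattern NNP+)."""
    chunks = []
    for is_nnp, group in groupby(tagged_tokens, key=lambda wt: wt[1] == "NNP"):
        if is_nnp:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

# Hand-tagged input standing in for the output of a POS-tagger.
tagged = [("Bryan", "NNP"), ("Cranston", "NNP"), ("stars", "VBZ"),
          ("in", "IN"), ("Breaking", "VBG"), ("Bad", "JJ")]
print(proper_noun_chunks(tagged))
# ['Bryan Cranston']
```

The same example shows the recall problem of relying on proper nouns alone: with these tags, a title such as "Breaking Bad" is never proposed as a mention.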

As discussed in Appendix x, machine learning methods have become increasingly popular in recent years for parsing and NER. Hoffart et al. [137] propose combining AIDA and YAGO2 with Stanford NER – using a pre-trained Conditional Random Fields (CRF) classifier – to identify emerging entities. Likewise, ADEL [247] and UDFS [78] also use Stanford NER, while JERL [184] uses a custom unified CRF model that simultaneously performs extraction and linking. On the other hand, WAT [243] relies on OpenNLP’s NER tool based on a Maximum Entropy model. Going one step further, J-NERD [231] uses the dependency parse-tree (extracted using a Stanford parser), where dependencies between nouns are used to create a tree-based model for each sentence, which are then combined into a global model across sentences, which in turn is fed into a subsequent approximate inference process based on Gibbs sampling.

One limitation of using machine-learning techniques in this manner is that they must be trained on a specific corpus. While Stanford NER and OpenNLP provide a set of pre-trained models, these tend to only cover the traditional NER types of person, organization, location and perhaps one or two more (or a generic miscellaneous type). On the other hand, a KB such as DBpedia may contain thousands of entity types, where off-the-shelf models would only cover a fraction thereof. Custom models can, however, be trained using these frameworks given appropriately labeled data, where for example ADEL [247] additionally trains models to recognize professions, or where UDFS [78] trains for ten types on a Twitter dataset, etc. However, richer types require richly-typed labeled data to train on. One option is to use sub-class hierarchies to select higher-level types from the KB to train with [231]. Furthermore, as previously discussed, in EEL scenarios, the types of entities are often given by the KB and need not be given by the NER tool: hence, other “non-standard” types of entities can be labeled “miscellaneous” to train for generic recognition.

On the other hand, a benefit of using parsers based on machine learning is that they can significantly reduce the amount of lookups required on the dictionary since, unlike token-based methods, initial entity mentions can be detected independently of the KB dictionary. Likewise, such methods can be used to detect emerging entities not yet featured in the KB.

Hybrid: The techniques described previously are sometimes complementary, where a number of systems thus apply hybrid approaches combining various such techniques. One such system is ADEL [247], which uses a mix of three high-level recognition techniques: persons, organizations and locations are extracted using Stanford NER; mentions based on proper nouns are extracted using Stanford POS; and more challenging mentions not based on proper nouns are extracted using an (unspecified) dictionary approach; entity mentions produced by all three approaches are fed into a unified disambiguation and pruning phase. A similar approach is taken by the FOX (Federated knOwledge eXtraction Framework) [289], which uses ensemble learning to combine the results of four NER tools – namely Stanford NER, Illinois NET, Ottawa BalIE, and OpenNLP – where the resulting entity mentions are then passed through the AGDISTIS [302] tool to subsequently link them to DBpedia.

2.2. Disambiguation

We assume that a list of candidate identifiers has now been retrieved from the KB for each mention of interest using the techniques previously described. However, some KB labels in the dictionary may be ambiguous and may refer to multiple candidate identifiers. Likewise, the mentions in the text may not exactly match any single label in the dictionary. Thus an individual mention may be associated with multiple initial candidates from the KB, where a distinguishing feature of EEL systems is the disambiguation phase, whose goal is to decide which KB identifiers best match which mentions in the text. To achieve this, the disambiguation phase will typically involve various forms of filtering and scoring of the initial candidate identifiers, considering both the candidates for individual entity mentions, as well as (collectively) considering candidates proposed for entity mentions in a region of the text. Disambiguation may thus result in:

mentions being pruned as irrelevant to the KB (or proposed as emerging entities),

candidates being pruned as irrelevant to a mention, and/or

candidates being assigned a score – called a support – for a particular mention.

In some systems, phases of pruning and scoring may interleave, while in others, scoring is applied first and pruning is applied (strictly) thereafter.

A wide variety of approaches to disambiguation can be found in the EEL literature. Our goal, in this survey, is thus to organize and discuss the main approaches used thus far. Along these lines, we will first discuss some of the low-level features that can be used to help with the disambiguation process. Thereafter we discuss how these features can be combined to select a final set of mentions and candidates and/or to compute a support for each candidate identifier of a mention.

2.2.1. Features for disambiguation

In order to perform disambiguation of candidate KB identifiers for an entity mention, one may consider information relating to the mention itself, to the keywords surrounding the mention, to the candidates for surrounding mentions, and so forth. In fact, a range of features have been proposed in the literature to support the disambiguation process. To structure the discussion of such features, we organize them into the following five high-level categories:

Mention-based (M):

Such features rely on information about the entity mention itself, such as its text, capitalization, the recognition score for the mention, the presence of overlapping mentions, or the presence of abbreviated mentions.

Keyword-based (K):

Such features rely on collecting contextual keywords for candidates and/or mentions from reference sources of text (often using Wikipedia). Keyword-based similarity measures can then be applied over pairs or sets of contexts.

Graph-based (G):

Such features rely on constructing a (weighted) graph representing mentions and/or candidates and then applying analyses over the graph, such as to determine cocitation measures, dense-subgraphs, distances, or centrality.

Category-based (C):

Such features rely on categorical information that captures the high-level domain of mentions, candidates and/or the input text itself, where Wikipedia categories are often used.

Linguistic-based (L):

Such features rely on the grammatical role of words, or on the grammatical relation between words or chunks in the text (as produced by traditional NLP tools).

These categories reflect the type of information from which the features are extracted and will be used to structure this section, allowing us to introduce increasingly more complex types of sources from which to compute features. However, we can also consider an orthogonal conceptualization of features based on what they say about mentions or candidates:

Mention-only (mo):

A feature about the mention independent of other mentions or candidates.

Mention–mention (mm):

A feature between two or more mentions independent of their candidates.

Candidate-only (co):

A feature about a candidate independent of other candidates or mentions.

Mention–candidate (mc):

A feature about the candidate of a mention independent of other mentions.

Candidate–candidate (cc):

A feature between two or more candidates independent of their mentions.

Various (v):

A feature that may involve multiple of the above, or map mentions and/or candidates to a higher-level (or latent) feature, such as domain.

We will now discuss these features in more detail in order of the type of information they consider.

Mention-based: With respect to disambiguation, important initial information can be gleaned from the mentions themselves, in terms of the text of the mention, the type selected by the NER tool (where available), and the relation of the mention to other neighboring mentions in a specific region of text.

To begin, the strings of mentions can be used for disambiguation. While recognition often relies on matching a mention to a dictionary, this process is typically implemented using various forms of indexes that allow for efficiently matching substrings (such as prefixes, suffixes or tokens) or full strings. However, once a smaller set of initial candidates has been identified, more fine-grained string-matching can be applied between the respective mention and candidate labels. For example, given a mention “Bryan L. Cranston” and two candidates with labels “Bryan L. Reuss” (as the longest prefix match) and “Bryan Cranston” (as a keyword match), one could apply an edit-distance measure to refine these candidates. Along these lines, for example, ADEL [247] and NERFGUN [126] use Levenshtein edit-distance, while DoSeR [336] and AIDA-Light [230] use a trigram-based Jaccard similarity.¹⁹

¹⁹ More specifically, each input string is decomposed into a set of 3-character substrings, where the Jaccard coefficient (the cardinality of the intersection divided by the cardinality of the union) of both sets is computed.

A natural limitation of such a feature is that it will score different candidates with the same labels with precisely the same score; hence such features are typically combined with other disambiguation features.
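Both string-based measures are short enough to sketch directly. The following is a minimal illustration in plain Python, not the code of any cited system; the function names are hypothetical, and the example strings are those used in the discussion above.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def trigrams(s):
    """Set of all 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_jaccard(a, b):
    """Size of the trigram intersection divided by size of the union."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

mention = "Bryan L. Cranston"
print(levenshtein(mention, "Bryan Cranston"))      # 3
print(trigram_jaccard(mention, "Bryan Cranston"))  # 0.6875
print(trigram_jaccard(mention, "Bryan L. Reuss"))  # 0.35
```

Under both measures, the candidate label "Bryan Cranston" is closer to the mention than "Bryan L. Reuss", even though the latter shares the longer prefix.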

Whenever the recognition phase produces a type for entity mentions independently of the types available in the KB – as typically happens when a traditional NER tool is used – this NER-based type can be compared with the type of each candidate in the KB. Given that relatively few types are produced by NER tools (without using the KB) – where the most widely accepted types are person, organization and location – these types can be mapped manually to classes in the KB, where class inference techniques can be applied to also capture candidates that are instances of more specific classes. We note that both ADEL [247] and J-NERD [231] incorporate such a feature (both recently proposed approaches). While this can be a useful feature for disambiguating some entities, the KB will often contain types not covered by the NER tool (at least using off-the-shelf pre-trained models).

The recognition process itself may produce a score for a mention indicating a confidence that it is referring to a (named) entity; this can additionally be used as a feature in the disambiguation phase, where, for example, a mention for which only weakly-related KB candidates are found is more likely to be rejected if its recognition score is also low. A simple such feature may capture capitalization, where HERD [284] and Tulip [181] mark lower-case mentions as “tentative” in the disambiguation phase, indicating that they need stronger evidence during disambiguation not to be pruned. Another popular feature, called keyphraseness by Mihalcea and Csomai [203], measures the number or ratio of times the mention appears in the anchor text of a link in a contextual corpus such as Wikipedia; this feature is considered by AIDA [140], DBpedia Spotlight [200], NERFGUN [126], HERD [284], etc.

We already mentioned how spotting may result in overlapping entity mentions being recognized, where, for example, the mention “York City” may overlap with the mention “New York City”. A natural approach to resolve such overlaps – used, for example, by ADEL [247], AGDISTIS [302], HERD [284], KORE [138] and Tulip [181] – is to try to expand entity mentions to a maximal possible match. While this seems practical, some of the “nested” entity mentions may be worth keeping. Consider “New York City Police Department”; while this is a maximal (aka. external) entity mention referring to an organization, it may also be valuable to maintain the nested “New York City” mention. As such, in the traditional NER literature, Finkel and Manning [105] argued for Nested Named Entity Recognition, which preserves relevant overlapping entities. However, we are not aware of any work on EEL that directly considers relevant nested entities, though systems such as Babelfy [216] explicitly allow overlapping entities. In certain (probably rare) cases, ambiguous overlaps without a maximal entity may occur, such as for “The third Man Ray exhibition”, where “The Third Man” may refer to a popular 1949 movie, while “Man Ray” may refer to an American artist; neither is a nested nor an external entity. Though there are works on NER that model such (again, probably rare) cases [183], we are not aware of EEL approaches that explicitly consider these cases.
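The maximal-match heuristic can be sketched as a simple filter over character spans. This is an illustrative assumption about how such a filter might look, not the procedure of any cited system; spans are hypothetical `(start, end, text)` triples with an exclusive end offset.

```python
def keep_maximal(spans):
    """Drop every span that is strictly nested inside a longer span."""
    kept = []
    for span in spans:
        start, end, _ = span
        nested = any(o_start <= start and end <= o_end
                     and (o_end - o_start) > (end - start)
                     for o_start, o_end, _ in spans)
        if not nested:
            kept.append(span)
    return kept

# Nested case: "York City" lies inside "New York City".
nested = [(0, 13, "New York City"), (4, 13, "York City")]
# Crossing case from "The third Man Ray exhibition": no maximal mention.
crossing = [(0, 13, "The third Man"), (10, 17, "Man Ray")]

print(keep_maximal(nested))    # [(0, 13, 'New York City')]
print(keep_maximal(crossing))  # both spans are kept
```

The second call shows why maximal matching alone does not settle crossing overlaps: neither span contains the other, so both survive and must be resolved by other evidence.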

We can further consider abbreviated forms of mentions where a “complete mention” is used to introduce an entity, which is thereafter referred to using a shorter mention. For example, a text may mention “Jimmy Wales” in the introduction, but in subsequent mentions, the same entity may be referred to as simply “Wales”; clearly, without considering the presence of the longer entity mention, the shorter mention could be erroneously linked to the country. In fact, this is a particular form of coreference, where short mentions, rather than pronouns, are used to refer to an entity in subsequent mentions. A number of approaches – such as those proposed by Cucerzan [66] or Durrett and Klein [91] for linking to Wikipedia, as well as systems such as ADEL [247], AGDISTIS [302], KORE [138], MAG [217] and Seznam [96] linking to RDF KBs – try to map short mentions to longer mentions appearing earlier in the text. On the other hand, counterexamples appear to be quite common, where, for example, a text on Enzo Ferrari may simultaneously use “Ferrari” as a mention for the person and the car company he founded; automatically disambiguating individual mentions may then prove difficult in such cases. Hence, this feature will often be combined with context features, described in the following.
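A simple version of this heuristic maps each short mention to the longest earlier mention that contains it as a whole word. The sketch below is an illustrative assumption rather than the method of any cited system; mentions are given as plain strings in document order and the function name is hypothetical.

```python
def expand_short_mentions(mentions):
    """Replace a mention by the longest earlier mention containing it as a word."""
    resolved = []
    for i, mention in enumerate(mentions):
        target = mention
        for earlier in mentions[:i]:
            if len(earlier) > len(target) and mention in earlier.split():
                target = earlier
        resolved.append(target)
    return resolved

print(expand_short_mentions(["Jimmy Wales", "Wikipedia", "Wales"]))
# ['Jimmy Wales', 'Wikipedia', 'Jimmy Wales']
```

Such a rule would, however, also rewrite every later "Ferrari" to "Enzo Ferrari" in the counterexample above, which is why it is best treated as one feature among several rather than as a hard decision.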

Keyword-based: A variety of keyword-based techniques from the area of Information Retrieval (IR) are relevant not only to the recognition process, but also to the disambiguation process. While recognition can be done efficiently at large scale using inverted indexes, for example, relevance measures can be used to help score and rank candidates. A natural idea is to consider a mention as a keyword query posed against a textual document created to describe each KB entity, where IR measures of relevance can be used to score candidates. A typical IR measure used to determine the relevance of a document to a given keyword query is TF–IDF, where the core intuition is to consider documents that contain more mentions (term-frequency: TF) of relatively rare keywords (inverse-document frequency: IDF) in the keyword query to be more relevant to that query. Another typical measure is to use cosine similarity, where documents (and keyword queries) are represented as vectors in a normalized numeric space (known as a Vector Space Model (VSM) that may use, for example, numeric TF–IDF values), where the similarity of two vectors can be computed by measuring the cosine of the angle between them.

Systems relying on IR-based measures for disambiguation include: DBpedia Spotlight [200], which defines a variant called TF–ICF, where ICF denotes inverse-candidate frequency, considering the ratio of candidates that mention the term; THD [84], which uses the Lucene-based search API of Wikipedia implementing measures similar to TF–IDF; SDA [42], which builds a textual context for each KB entity from Wikipedia based on article titles, content, anchor text, etc., where candidates are ranked based on cosine-similarity; NERFGUN [126], which compares mentions against the abstracts of Wikipedia articles referring to KB entities using cosine-similarity;²⁰ etc.

²⁰ The term “abstracts of Wikipedia articles” refers to the first paragraph of the Wikipedia article, which is seen as providing a textual overview of the entity in question [126,171].

Other approaches consider an extended textual context not only for the KB entities, but also for the mentions. For example, considering the input sentence “Santiago frequently experiences strong earthquakes.”, although Santiago is an ambiguous label that may refer to (e.g.) several cities, the term “earthquake” will most frequently appear in connection with Santiago de Chile, which can be determined by comparing the keywords in the context of the mention with keywords in the contexts of the candidates. Such an approach is used by SemTag [82], which performs Entity Linking with respect to the TAP KB: however, rather than build a context from an external source like Wikipedia, the system instead extracts a context from the text surrounding human-labeled instances of linked entity mentions in a reference text.
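Comparing the context of a mention with the contexts of its candidates can be sketched with TF–IDF weights and cosine similarity. This is a minimal illustration and not the scoring used by SemTag or any other cited system: the candidate descriptions are invented one-line contexts, the `ex:` identifiers are hypothetical, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return a sparse TF-IDF vector per document plus the IDF table."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF (assumption)
    vectors = {name: {t: c * idf[t] for t, c in Counter(toks).items()}
               for name, toks in tokenized.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(context_text, docs):
    """Rank candidates by similarity between their context and the mention's."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(context_text.lower().split())
    query = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    return sorted(docs, key=lambda name: cosine(query, vectors[name]),
                  reverse=True)

# Invented candidate contexts, for illustration only.
candidate_contexts = {
    "ex:Santiago_de_Chile": "santiago capital chile earthquakes andes",
    "ex:Santiago_de_Cuba": "santiago city cuba caribbean port",
}
ranking = rank("santiago frequently experiences strong earthquakes",
               candidate_contexts)
print(ranking[0])  # ex:Santiago_de_Chile
```

The term "santiago" appears in both contexts and so contributes equally to each score, whereas "earthquakes" appears in only one and decides the ranking.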

Other, more recent approaches follow a similar distributional idea – where words are considered similar by merit of appearing frequently in similar contexts – but apply more modern techniques. Amongst these, CohEEL [122] builds a statistical language model for each KB entity according to the frequency of terms appearing in its associated Wikipedia article; this model is used during disambiguation to estimate the probability of the observed keywords surrounding a mention being generated if the mention referred to a particular KB entity. A related approach is used in the DoSeR system [336], where word embeddings are used for disambiguation: in such an approach, words are represented as vectors in a fixed-dimensional numeric space where words that often co-occur with similar words will have similar vectors, allowing, for example, words to be predicted according to their context; the DoSeR system then computes word embeddings for KB entities using known entity links to model the context in which those entities are mentioned in the text, which can subsequently be used to predict further mentions of such entities based on the mention’s context.
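The language-model idea can be illustrated with a simple unigram model per entity. The sketch below assumes add-one (Laplace) smoothing and uses made-up token lists in place of Wikipedia articles; it is not a description of how CohEEL itself estimates its models.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return term counts and the total token count for one entity's text."""
    return Counter(tokens), len(tokens)

def log_likelihood(context, model, vocab_size, alpha=1.0):
    """Log-probability of the context keywords under a smoothed unigram model."""
    counts, total = model
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in context
    )

# Hypothetical article tokens for two candidate entities.
articles = {
    "Santiago_de_Chile": "chile capital andes earthquakes strong earthquakes seismic".split(),
    "Santiago_de_Cuba": "cuba caribbean port city music rum carnival".split(),
}
vocab = {t for tokens in articles.values() for t in tokens}
models = {name: train_unigram(tokens) for name, tokens in articles.items()}

# Keywords observed around the mention "Santiago" in the input text.
context = ["strong", "earthquakes"]
scores = {
    name: log_likelihood(context, model, len(vocab))
    for name, model in models.items()
}
best = max(scores, key=scores.get)
```

The candidate whose model assigns the highest likelihood to the surrounding keywords is preferred; smoothing prevents unseen keywords from driving the probability to zero.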

Another related approach is to consider collective assignment: rather than disambiguating one mention at a time by considering mention–candidate similarity, the selection of a candidate for one mention can affect the scoring of candidates for another mention. For example, considering the sentence “Santiago is the second largest city of Cuba”, even though Santiago de Chile has the highest prior probability to be the entity referred to by “Santiago” (being a larger city mentioned more often), one may find that Santiago de Cuba has a stronger relation to Cuba (mentioned nearby) than all other candidates for the mention “Santiago” – or, in other words, Santiago de Cuba is the most coherent with Cuba, which can override the higher prior probability of Santiago de Chile. While this is similar to the aforementioned distributional approaches, a distinguishing feature of collective assignment is to consider not only surrounding keywords, but also candidates for surrounding entity mentions. A seminal such approach – for EEL with respect to Wikipedia – was proposed by Cucerzan [66], where a cosine-similarity measure is applied between not only the contexts of mentions and their associated candidates, but also between candidates for neighboring entity mentions; disambiguation then attempts to simultaneously maximize the similarity of mentions to candidates as well as the similarity amongst the candidates chosen for other nearby entity mentions.

This idea of collective assignment would become influential in later works linking entities to RDF-based KBs. For example, the KORE [138] system extended AIDA [140] with a measure called keyphrase overlap relatedness (not to be confused with overlapping mentions), where mentions and candidates are associated with a keyword context, and where the relatedness of two contexts is based on the Jaccard similarity of their sets of keywords; this measure is then used to perform a collective assignment. To avoid computing pair-wise similarity over potentially large sets of candidates, the authors propose to use locality-sensitive hashing, where the idea is to hash contexts into a space such that similar contexts will be hashed into the same region (aka. bucket), allowing relatedness to be computed for the mentions and candidates in each bucket. Collective assignment based on comparing the textual contexts of candidates would become popular in many subsequent systems, including AIDA-Light [230], JERL [184], J-NERD [231], Kan-Dis [145], and so forth. Collective assignment is also the key principle underlying many of the graph-based techniques discussed in the following.
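To make the combination of Jaccard similarity and locality-sensitive hashing more concrete, the following sketch uses MinHash signatures split into bands, which is one common way to realize LSH for set similarity. The choice of MinHash, the hash function, the band layout and the keyword sets are all assumptions for illustration and may differ from what KORE implements.

```python
import hashlib

def jaccard(a, b):
    """Jaccard similarity of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def minhash_signature(keywords, num_hashes=16):
    """MinHash signature of a non-empty keyword set, using salted MD5 hashes."""
    return tuple(
        min(int(hashlib.md5(f"{i}:{kw}".encode("utf-8")).hexdigest(), 16)
            for kw in keywords)
        for i in range(num_hashes)
    )

def lsh_buckets(contexts, num_hashes=16, bands=8):
    """Group context names whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = {}
    for name, keywords in contexts.items():
        sig = minhash_signature(keywords, num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(name)
    return buckets

# Hypothetical keyword contexts for one mention (m1) and two candidates.
contexts = {
    "m1": {"rock", "band", "album"},
    "e1": {"rock", "band", "album"},
    "e2": {"harbor", "city", "massachusetts"},
}
buckets = lsh_buckets(contexts)

# Only names sharing a bucket with the mention need an exact comparison.
neighbours = set()
for names in buckets.values():
    if "m1" in names:
        neighbours |= names - {"m1"}
exact = {name: jaccard(contexts["m1"], contexts[name]) for name in neighbours}
```

Similar contexts tend to agree on at least one band and so land in a common bucket, whereas dissimilar contexts rarely do; the exact Jaccard score is then computed only within buckets.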

Graph-based: During disambiguation, useful information can be gained from the graph of connections between entities in a contextual source such as Wikipedia, or in the target KB itself. First, graphs can be used to determine the prior probability of a particular entity; for example, considering the sentence “Santiago is named after St. James.”, the context does not directly help to disambiguate the entity, but by applying link analysis, it could be determined that (with respect to a given reference corpus) the candidate entity most commonly referred to by the mention “Santiago” is Santiago de Chile. Second, as per the previous example for Cucerzan’s [66] keyword-based disambiguation – “Santiago is the second largest city of Cuba” – one may find that a collective assignment can override the higher prior probability of an isolated candidate to instead maintain a high relatedness – or coherence – of candidates selected in a particular part of text, where similarity graphs can be used to determine the coherence of candidates.

A variety of entity disambiguation approaches rely on the graph structure of Wikipedia, where a seminal approach was proposed by Medelyan et al. [198] and later refined by Milne and Witten [209]. The graph-structure of Wikipedia is used to perform disambiguation based on two main concepts: commonness and relatedness. Commonness is measured as the (prior) probability that a given entity mention is used in the anchor text to point to the Wikipedia article about a given candidate entity; as an example, one could consider that the plurality of anchor texts in Wikipedia containing the (ambiguous) mention “Santiago” would link to the article on Santiago de Chile; thus this entity has a higher commonness than other candidates. On the other hand, relatedness is a cocitation measure of coherence based on how many articles in Wikipedia link to the articles of both candidates: how many inlinking documents they share relative to their total inlinks. Thereafter, unambiguous candidates help to disambiguate ambiguous candidates for neighboring entity mentions based on relatedness, which is weighted against commonness to compute a support for all candidates; Medelyan et al. [198] argue that the relative balance between relatedness and commonness depends on the context, where for example if “Cuba” is mentioned close to “Santiago”, and “Cuba” has high commonness and low ambiguity, then this should override the commonness of Santiago de Chile since the context clearly relates to Cuba, not Chile.
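The two measures can be sketched as follows. Commonness is computed directly from anchor-text counts as described above; for relatedness, the snippet uses the normalized link-overlap formula commonly attributed to Milne and Witten, which should be read as an assumed concrete form of the cocitation measure rather than as the exact definition used in each cited work. All counts and link sets are invented.

```python
import math

def commonness(anchor_counts, mention, entity):
    """Fraction of anchors with this text that link to the given entity."""
    counts = anchor_counts.get(mention, {})
    total = sum(counts.values())
    return counts.get(entity, 0) / total if total else 0.0

def relatedness(inlinks_a, inlinks_b, num_articles):
    """Cocitation relatedness from shared inlinks, clipped to the range [0, 1]."""
    a, b = set(inlinks_a), set(inlinks_b)
    shared = a & b
    if not shared:
        return 0.0
    numerator = math.log(max(len(a), len(b))) - math.log(len(shared))
    denominator = math.log(num_articles) - math.log(min(len(a), len(b)))
    if denominator <= 0:
        return 1.0  # both link sets cover every article
    return max(0.0, 1.0 - numerator / denominator)

# Hypothetical anchor statistics: how often "Santiago" links to each article.
anchor_counts = {
    "Santiago": {
        "Santiago_de_Chile": 70,
        "Santiago_de_Cuba": 20,
        "Santiago_de_Compostela": 10,
    }
}
prior = commonness(anchor_counts, "Santiago", "Santiago_de_Chile")

# Hypothetical sets of article ids linking to each candidate.
inlinks = {
    "Santiago_de_Cuba": {1, 2, 3, 4},
    "Cuba": {2, 3, 4, 5, 6},
    "Santiago_de_Chile": {7, 8, 9},
}
rel_cuba = relatedness(inlinks["Santiago_de_Cuba"], inlinks["Cuba"], 1000)
rel_chile = relatedness(inlinks["Santiago_de_Chile"], inlinks["Cuba"], 1000)
```

In this toy setting, Santiago de Chile has the higher commonness, but Santiago de Cuba is more related to Cuba, which is the tension that the weighting of the two measures has to resolve.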

Further approaches then built upon and refined Milne and Witten’s notion of commonness and relatedness. For example, Kulkarni et al. [168] propose a collective assignment method based on two types of score: a compatibility score defined between a mention and a candidate, computed using a selection of standard keyword-based approaches; and Milne and Witten’s notion of relatedness defined between pairs of candidates. The goal then is to find the selection of candidates (one per mention) that maximizes the sum of the compatibility scores and all pairwise relatedness scores amongst selected candidates. While this optimization problem is NP-hard, the authors propose to use approximations based on integer linear programming and hill climbing algorithms.
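The objective described above can be written down directly for small inputs. The sketch below enumerates every assignment by brute force, which is only feasible for a handful of mentions and is not one of the approximation methods proposed by Kulkarni et al.; the scores are invented for illustration.

```python
from itertools import combinations, product

def best_assignment(candidates, compatibility, relatedness):
    """Exhaustively pick one candidate per mention, maximizing the sum of
    mention-candidate compatibility and pairwise candidate relatedness."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[m] for m in mentions)):
        score = sum(compatibility[(m, e)] for m, e in zip(mentions, choice))
        score += sum(
            relatedness.get(frozenset((e1, e2)), 0.0)
            for e1, e2 in combinations(choice, 2)
        )
        if score > best_score:
            best, best_score = choice, score
    return dict(zip(mentions, best)), best_score

# Hypothetical candidates and scores for "Santiago is the second largest city of Cuba".
candidates = {
    "Santiago": ["Santiago_de_Chile", "Santiago_de_Cuba"],
    "Cuba": ["Cuba"],
}
compatibility = {
    ("Santiago", "Santiago_de_Chile"): 0.6,
    ("Santiago", "Santiago_de_Cuba"): 0.3,
    ("Cuba", "Cuba"): 0.9,
}
relatedness = {
    frozenset(("Santiago_de_Cuba", "Cuba")): 0.8,
    frozenset(("Santiago_de_Chile", "Cuba")): 0.1,
}
assignment, score = best_assignment(candidates, compatibility, relatedness)
```

The number of assignments grows exponentially with the number of mentions, which is why approximate methods are needed in practice.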

Another approach using the notions of commonness and relatedness is that of TagMe [101]; however, rather than relying on the relatedness of unambiguous entities to disambiguate a context, TagMe instead proposes a more complex voting scheme, where the candidates for each entity mention can vote for the candidates of surrounding entity mentions based on relatedness; candidates with higher commonness have stronger votes. Candidates with a commonness below a fixed threshold are pruned, after which one of two algorithms is used to decide final candidates: Disambiguation by Classifier (DC), which uses commonness and relatedness as features to classify correct candidates; and Disambiguation by Threshold (DT), which selects the top-ϵ candidates by relatedness and then chooses the remaining candidate with the highest commonness (experimentally, the authors deem $ϵ = 0.3$ to offer the best results).
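A rough sketch of such a voting scheme, followed by a DT-style selection, is given below. It assumes that a vote is the average of relatedness weighted by commonness over the voting mention's candidates, and that "top-ϵ" denotes the top fraction ϵ of candidates by total vote; both are interpretations made for illustration, and the function names and scores are hypothetical.

```python
import math

def vote(voters, target, rel, voter_commonness):
    """Average vote of one mention's candidates for a candidate of another mention."""
    if not voters:
        return 0.0
    return sum(rel(c, target) * voter_commonness[c] for c in voters) / len(voters)

def disambiguate_by_threshold(mention, candidates, commonness, rel, epsilon=0.3):
    """Keep the top fraction of candidates by total vote, then pick by commonness."""
    totals = {
        target: sum(
            vote(candidates[other], target, rel, commonness[other])
            for other in candidates if other != mention
        )
        for target in candidates[mention]
    }
    ranked = sorted(totals, key=totals.get, reverse=True)
    keep = max(1, math.ceil(epsilon * len(ranked)))  # always keep at least one
    return max(ranked[:keep], key=lambda e: commonness[mention][e])

# Hypothetical candidates, priors and pairwise relatedness values.
candidates = {
    "Santiago": ["Santiago_de_Chile", "Santiago_de_Cuba", "Santiago_de_Compostela"],
    "Cuba": ["Cuba"],
}
commonness = {
    "Santiago": {
        "Santiago_de_Chile": 0.7,
        "Santiago_de_Cuba": 0.2,
        "Santiago_de_Compostela": 0.1,
    },
    "Cuba": {"Cuba": 1.0},
}
pair_rel = {
    frozenset(("Santiago_de_Cuba", "Cuba")): 0.9,
    frozenset(("Santiago_de_Chile", "Cuba")): 0.2,
    frozenset(("Santiago_de_Compostela", "Cuba")): 0.1,
}

def rel(a, b):
    return pair_rel.get(frozenset((a, b)), 0.0)

strict = disambiguate_by_threshold("Santiago", candidates, commonness, rel, epsilon=0.3)
loose = disambiguate_by_threshold("Santiago", candidates, commonness, rel, epsilon=1.0)
```

With a small ϵ the votes of the surrounding mention dominate, while ϵ = 1 keeps all candidates and falls back to choosing by commonness alone.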

While the aforementioned tools link entity mentions to Wikipedia, other approaches linking to RDF-based KBs have followed adaptations of such ideas. One such tool is AIDA [140], which performs two main steps: collective mapping and graph reduction. In the collective mapping step, the tool creates a weighted graph that includes mentions and initial candidates as nodes: first, mentions are connected to their candidates by a weighted edge denoting their similarity as determined from a keyword-based disambiguation approach; second, entity candidates are connected by a weighted edge denoting their relatedness based on (1) the same notion of relatedness introduced by Milne and Witten [209], combined with (2) the distance between the two entities in the YAGO KB. The resulting graph is referred to as the mention–entity graph, whose edges are weighted in a similar manner to the measures considered by Kulkarni et al. [168]. In the subsequent graph reduction phase, the candidate nodes with the lowest weighted degree in this graph are pruned iteratively while preserving at least one candidate entity for each mention, resulting in an approximation of the densest possible (disambiguated) subgraph.
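The graph reduction phase can be approximated with a simple greedy loop. The sketch below repeatedly removes the candidate with the lowest weighted degree while ensuring that every mention keeps at least one candidate; it is a simplification under assumed data structures and does not reproduce the pre-processing, objective tracking or post-processing of the actual AIDA algorithm.

```python
def greedy_dense_subgraph(mention_edges, entity_edges):
    """mention_edges: {mention: {entity: weight}};
    entity_edges: {frozenset({e1, e2}): weight}.
    Returns one remaining entity per mention."""
    remaining = {m: set(es) for m, es in mention_edges.items()}

    def weighted_degree(entity):
        alive = set().union(*remaining.values())
        degree = sum(
            weights[entity]
            for m, weights in mention_edges.items() if entity in remaining[m]
        )
        degree += sum(
            w for pair, w in entity_edges.items()
            if entity in pair and pair <= alive
        )
        return degree

    while True:
        # Entities that are the last candidate of some mention must be kept.
        protected = {e for es in remaining.values() if len(es) == 1 for e in es}
        removable = {e for es in remaining.values() if len(es) > 1 for e in es}
        removable -= protected
        if not removable:
            break
        victim = min(sorted(removable), key=weighted_degree)
        for es in remaining.values():
            es.discard(victim)

    # If several candidates survive for a mention, fall back to the edge weight.
    return {m: max(sorted(es), key=lambda e: mention_edges[m][e])
            for m, es in remaining.items()}

# Hypothetical mention-entity graph with invented weights.
mention_edges = {
    "Santiago": {"Santiago_de_Chile": 0.6, "Santiago_de_Cuba": 0.3},
    "Cuba": {"Cuba": 0.9},
}
entity_edges = {
    frozenset(("Santiago_de_Cuba", "Cuba")): 0.8,
    frozenset(("Santiago_de_Chile", "Cuba")): 0.1,
}
result = greedy_dense_subgraph(mention_edges, entity_edges)
```

In this example Santiago de Chile has the lowest weighted degree and is pruned first, leaving a subgraph in which the remaining candidates are strongly connected to each other.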

The concept of computing a dense subgraph of the mention–entity graph was reused in later systems. For example, the AIDA-Light [230] system (a variant of AIDA with focus on efficiency) uses keyword-based features to determine the weights on mention–entity and entity–entity edges in the mention–entity graph, from which a subgraph is then computed. As another variant on the dense subgraph idea, Babelfy [216] constructs a mention–entity graph but where edges between entity candidates are assigned based on semantic signatures computed using the Random Walk with Restart algorithm over a weighted version of a custom semantic network (BabelNet); thereafter, an approximation of the densest subgraph is extracted by iteratively removing the least coherent vertices – considering the fraction of mentions connected to a candidate and its degree – until the number of candidates for each mention is below a specified threshold.

Rather than trying to compute a dense subgraph of the mention–entity graph, other approaches instead use standard centrality measures to score nodes in various forms of graph induced by the candidate entities. NERSO [125] constructs a directed graph of entity candidates retrieved from DBpedia based on the links between their articles on Wikipedia; over this graph, the system applies a variant on a closeness centrality measure, which, for a given node, is defined as the inverse of the average length of the shortest path to all other reachable nodes; for each mention, the centrality, degree and type of node are then combined into a final support for each candidate. On the other hand, the WAT system [243] extends TagMe [101] with various features, including a score based on the PageRank of nodes in the mention–entity graph, which loosely acts as a context-specific version of the commonness feature (PageRank is itself a variant of eigenvector centrality, which can be conceptualized as the probability of being at a node after an arbitrarily long random walk starting from a random node). ADEL [247] likewise considers a feature based on the PageRank of entities in the DBpedia KB, while HERD [284], DoSeR [336] and Seznam [96] use PageRank over variants of a mention–entity graph. Using another popular centrality-based measure, AGDISTIS [302] first creates a graph by expanding the neighborhood of the nodes corresponding to candidates in the KB up to a fixed width; the approach then applies Kleinberg’s HITS algorithm, using the authority score to select disambiguated entities.
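Two of these centrality measures can be sketched over a small directed graph. The closeness function follows the definition given above (inverse of the average shortest-path length to reachable nodes), while the PageRank function is a textbook power iteration with an assumed damping factor of 0.85 and uniform redistribution of rank from nodes without outlinks; neither is the exact variant used by the cited systems, and the graph is invented.

```python
from collections import deque

def closeness(graph, source):
    """Inverse of the average shortest-path length from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    lengths = [d for n, d in dist.items() if n != source]
    return len(lengths) / sum(lengths) if lengths else 0.0

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list graph."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    size = len(nodes)
    rank = {n: 1.0 / size for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not graph.get(n))
        new = {n: (1.0 - damping) / size + damping * dangling / size for n in nodes}
        for n, outs in graph.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

# Hypothetical link graph between three candidate entities.
chain = {"a": ["b"], "b": ["c"], "c": []}
closeness_a = closeness(chain, "a")
ranks = pagerank(chain)
```

Closeness rewards nodes from which other candidates are reachable in few steps, whereas PageRank rewards nodes that receive links from other well-ranked nodes.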

In a variant of the centrality theme, Kan-Dis [145] uses two graph-based measures. The first measure is a baseline variant of Katz’s centrality applied over the candidates in the KB’s graph [237], where a parametrized sum over the k shortest paths between two nodes is taken as a measure of their relatedness such that two nodes are more similar the shorter the k shortest paths between them are. The second measure is then a weighted version of the baseline, where edges on paths are weighted based on the number of similar edges from each node, such that, for example, a path between two nodes through a country for the relation “resident” will have less effect on the overall relatedness of those nodes than a more “exclusive” path through a music-band with the relation “member”.
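One way to write the baseline measure, under the assumption that each path contributes a weight that decays geometrically with its length, is the following, where $SP_k(a, b)$ denotes the set of the k shortest paths between nodes $a$ and $b$, $|p|$ is the length of path $p$, and $\alpha$ is the (assumed) damping parameter:

$$\mathrm{rel}_k(a, b) = \sum_{p \in SP_k(a, b)} \alpha^{|p|}, \qquad 0 < \alpha < 1$$

Under this form, shorter paths contribute larger terms, so that two nodes are scored as more related the shorter the k shortest paths between them. The weighted variant can then be read as replacing the uniform length $|p|$ with a sum of per-edge costs that are higher for less exclusive edges; the precise weighting scheme is given in the cited works.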

Other systems apply variations on this theme of graph-based disambiguation. KIM [250] selects the candidate related to the most previously-selected candidates by some relation in the KB; DoSeR [336] likewise considers entities as related if they are directly connected in the KB and considers the degree of nodes in the KB as a measure of commonness; and so forth.

Category-based: Rather than trying to measure the coherence of pairs of candidates through keyword contexts or cocitations or their distance in the KB, some works propose to map candidates to higher-level category information and use such categories to determine the coherence of candidates. Most often, the category information from Wikipedia is used.

The earliest approaches to use such category information were those linking mentions to Wikipedia identifiers. For example, in cases where the keyword-based contexts of candidates contained insufficient information to derive reliable similarity measures, Bunescu and Pasca [33] propose to additionally use terms from the article categories to extend these contexts and learn correlations between keywords appearing in the mention context and categories found in the candidate context. A similar such idea – using category information from Wikipedia to enrich the contexts of candidates – was also used by Cucerzan [66].

A number of approaches also use categorical information to link entities to RDF KBs. An early such proposal was the LINDEN [278] approach, which was based on constructing a graph containing nodes representing candidates in the KB, their contexts, and their categories; edges are then added connecting candidates to their contexts and categories, while categories are connected by their taxonomic relations. Contextual and categorical information was taken from Wikipedia. A cocitation-based notion of candidate–candidate relatedness similar to that of Medelyan et al. [198] is then combined with another candidate–candidate relatedness measure based on the probability of an entity in the KB falling under the most-specific shared ancestor of the categories of both entities.

As previously discussed, AIDA-Light [230] determines mention–candidate and candidate–candidate similarities using a keyword-based approach, where the similarities are used to construct a weighted mention–entity graph; this graph is also enhanced with categorical information from YAGO (itself derived from Wikipedia and WordNet), where category nodes are added to the graph and connected to the candidates in those categories; additionally, weighted edges between candidates can be computed based on their distance in the categorical hierarchy. J-NERD [231] likewise uses similar features based on latent topics computed from Wikipedia’s categories.

Linguistic-based: Some more recent approaches propose to apply joint inference to combine disambiguation with other forms of linguistic analysis. Conceptually the idea is similar to that of using keyword contexts, but with a deeper analysis that also considers further linguistic information about the terms forming the context of a mention or a candidate.

We have already seen examples of how the recognition task can sometimes gain useful information from the disambiguation task. For example, in the sentence “Nurse Ratched is a character in Ken Kesey’s novel One Flew over the Cuckoo’s Nest”, the latter mention – “One Flew over the Cuckoo’s Nest” – is a challenging example for recognition due to its length, broken capitalization, uses of non-noun terms, and so forth; however, once disambiguated, the related entities could help to recognize the right boundaries for the mention. As another example, in the sentence “Bill de Blasio is the mayor of New York City”, disambiguating the latter entity may help recognize the former and vice versa (e.g., avoiding demarcating “Bill” or “York City” as mentions).

Recognizing this interdependence of recognition and disambiguation, one of the first approaches proposed to perform these tasks jointly was NereL [281], which applies a first high-recall NER pass that both underestimates and overestimates (potentially overlapping) mention boundaries, where features of these candidate mentions are combined with features for the candidate identifiers for the purposes of a joint inference step. A more complex unified model was later proposed by Durrett [91], which captured features not only for recognition (POS-tags, capitalization, etc.) and disambiguation (string-matching, PageRank, etc.), but also for coreference (type of mention, mention length, context, etc.), over which joint inference is applied. JERL [184] also uses a unified model for representing the NER and NED tasks, where word-level features (such as POS tags, dictionary hits, etc.) are combined with disambiguation features (such as commonness, coherence, categories, etc.), subsequently allowing for joint inference over both. J-NERD [231] likewise uses features based on Stanford’s POS tagger and dependency parser, dictionary hits, coherence, categories, etc., to represent recognition and disambiguation in a unified model for joint inference.

Aside from joint recognition and disambiguation, other types of unified models have also been proposed. Babelfy [216] applies a joint approach to model and perform Named Entity Disambiguation and Word Sense Disambiguation in a unified manner. As an example, in the sentence “Boston is a rock group”, the word “rock” can have various senses, where knowing that in this context it is used in the sense of a music genre will help disambiguate “Boston” as referring to the music group and not the city; on the other hand, disambiguating the entity “Boston” can help disambiguate the word sense of “rock”, and thus we have an interdependence between the two tasks. Babelfy thus combines candidates for both word senses and entity mentions into a single semantic interpretation graph, from which (as previously mentioned) a dense (and thus coherent) sub-graph is extracted. Another approach applying joint Named Entity Disambiguation and Word Sense Disambiguation is Kan-Dis [145], where nouns in the text are extracted and their senses modeled as a graph – weighted by the notion of semantic relatedness described previously – from which a dense subgraph is extracted.

Summary of features: Given the breadth of features covered, we provide a short recap of the main features for reference:

Mention-based:

Given the initial set of mentions identified and the labels of their corresponding candidates, we can consider:

A mention-only feature produced by the NER tool to indicate the confidence in a particular mention;

Mention–candidate features based on the string similarity between mention and candidate labels, or matches between mention (NER) and candidate (KB) types;

Mention–mention features based on overlapping mentions, or the use of abbreviated references from a previous mention.

Keyword-based:

Considering various types of textual contexts extracted for mentions (e.g., varying length windows of keywords surrounding the mention) and candidates (e.g., Wikipedia anchor texts, article texts, etc.), we can compute:

Mention–candidate features considering various keyword-based similarity measures over their contexts (e.g., TF–IDF with cosine similarity; Jaccard similarity, word embeddings, and so forth);

Candidate–candidate features based on the same types of similarity measures over candidate contexts.

Graph-based:

Considering the graph-structure of a reference source such as Wikipedia, or the target KB, we can consider:

Candidate-only features, such as prior probability based on centrality, etc.;

Mention–candidate features, based on how many links use the mention’s text to link to a document about the candidate;

Candidate–candidate coherence features, such as cocitation, distance, density of subgraphs, topical coherence, etc.

Category-based:

Considering the graph-structure of a reference source such as Wikipedia, or the target KB, we can consider:

Candidate–category features based on membership of the candidate to the category;

Text–category coherence features based on categories of candidates;

Candidate–candidate features based on taxonomic similarity of associated categories.

Linguistic-based:

Considering POS tags, word senses, coreferences, parse trees of the input text, etc., we can consider:

Mention-only features based on POS or other NER features;

Mention–mention features based on dependency analysis, or the coherence of candidates associated with them;

Mention–candidate features based on coherence of sense-aware contexts;

Candidate–candidate features based on connection through semantic networks.

This list of useful features for disambiguation is by no means complete and has continuously expanded as further Entity Linking papers have been published. Furthermore, EEL systems may use features not covered, typically exploiting specific information available in a particular KB, a particular reference source, or a particular input source. As some brief examples, we can mention that NEMO [67] uses geo-coordinate information extracted from Freebase to determine a geographical coherence over candidates, Yerva et al. [320] consider features computed from user profiles on Twitter and other social networks, ZenCrowd [77] considers features drawn from crowdsourcing, etc.

2.2.2. Scoring and pruning

As we have seen, a wide range of features have been proposed for the purposes of the disambiguation task. A general question then is: how can such features be weighted and combined into a final selection of candidates, or a final support for each candidate?

The most straightforward option is to consider a high-level feature used to score candidates (potentially using other features on a lower level), where for example AGDISTIS [302] relies on final HITS authority scores, DBpedia Spotlight [200] on TF–ICF scores, NERSO [125] on closeness centrality and degree, THD [84] on Wikipedia search rankings, etc.

Another option is to parameterize weights or thresholds for features and find the best values for them individually over a labeled dataset, which is used, for example, by Babelfy [216] to tune the parameters of its Random-Walk-with-Restart algorithm and the number of candidates to be pruned by its densest-subgraph approximation, or by AIDA [140] to configure thresholds and weights for prior probabilities and coherence.

An alternative method is to allow users to configure such parameters themselves, where AIDA [140] and DBpedia Spotlight [200] offer users the ability to configure parameters and thresholds for prior probabilities, coherence measures, tolerable level of ambiguity, and so forth. In this manner, a human expert can configure the system for a particular application, for example, tuning to trade precision for recall, or vice-versa.

Yet another option is to define a general objective function that then turns the problem of selecting the best candidates into an optimization problem, allowing the final candidate assignment to be (approximately) inferred. One such method is Kulkarni et al.’s [168] collective assignment approach, which uses integer linear programming and hill-climbing methods to compute a candidate assignment that (approximately) maximizes mention–candidate and candidate–candidate similarity weights. Another such method is JERL [184], which models entity recognition and disambiguation in a joint model over which dynamic programming methods are applied to infer final candidates. Systems optimizing for dense entity–mention subgraphs – such as AIDA [140], Babelfy [216] or Kan-Dis [144] – follow similar techniques.

Table 3
Overview of entity extraction & linking systems. KB denotes the main knowledge-base used; Matching and Indexing refer to methods used to match/index entity labels from the KB; Context refers to the sources of contextual information used; Recognition refers to the process for identifying entity mentions; Disambiguation refers to the types of high-level disambiguation features used (M:Mention, K:Keyword, G:Graph, C:Category, L:Linguistic); “—” denotes no information found, not used or not applicable

System	Year	KB	Matching	Indexing	Context	Recognition	Disambiguation
ADEL [247]	2016	DBpedia	Keyword	Elastic Couchbase	Wikipedia	Tokens Stanford POS Stanford NER	M, G
AGDISTIS [302]	2014	Any	Keyword	Lucene	Wikipedia	Tokens	M, G
AIDA [140]	2011	YAGO2	Keyword	Postgres	Wikipedia	Stanford NER	M, K, G
AIDA-Light [230]	2014	YAGO2	Keyword LSH	Dictionary LSH	Wikipedia	Tokens	M, K, G, C
Babelfy [216]	2014	Wikipedia WordNet BabelNet	Substring	—	Wikipedia	Stanford POS	M, G, L
CohEEL [122]	2016	YAGO2	Keywords	—	Wikipedia	Stanford NER	K, G
DBpedia Spotlight [200]	2011	DBpedia	Substring	Aho–Corasick	Wikipedia	LingPipe POS	M, K
DoSeR [336]	2016	DBpedia YAGO3	Exact Keywords	Custom	Wikipedia	—	M, G
ExPoSe [238]	2014	DBpedia	Substring	Aho–Corasick	Wikipedia	LingPipe POS	M, K
ExtraLink [34]	2003	Custom (Tourism)	—	—	—	SProUT XTDL	[Manual]
GianniniCDS [118]	2015	DBpedia	Substring	SPARQL	Wikipedia	—	C
JERL [184]	2015	Freebase Satori	—	—	Wikipedia	Hybrid (CRF)	K, G, C, L
J-NERD [231]	2016	YAGO2	Keyword	Dictionary	Wikipedia	Hybrid (CRF)	M, K, C, L
Kan-Dis [145]	2015	DBpedia Freebase	Keyword	Lucene	Wikipedia	Stanford NER	K, G, L
KIM [250]	2004	KIMO	Keyword	Hashmap	—	GATE JAPE	G
KORE [138]	2012	YAGO2	Keyword LSH	Postgres	Wikipedia	Stanford NER	M, K, G
LINDEN [278]	2012	YAGO1	—	—	Wikipedia	—	C
MAG [217]	2017	Any	Keyword Substring	Lucene	Wikipedia	Stanford POS	M, G
NereL [281]	2013	Freebase	Keyword	Freebase API	Wikipedia	UIUC NER Illinois Chunker Tokens	M, K, G, C, L
NERFGUN [126]	2016	DBpedia	Substring	Dictionary	Wikipedia	—	M, K, G
NERSO [125]	2012	DBpedia	Exact	SPARQL	Wikipedia	Tokens	G
SDA [42]	2011	DBpedia	Keyword	—	Wikipedia	Tokens	K
SemTag [82]	2003	TAP	Keyword	—	Lab. Data	Tokens	K
THD [84]	2012	DBpedia	Keyword	Lucene	Wikipedia	GATE JAPE	K, G
Weasel [299]	2015	DBpedia	Substring	Dictionary	Wikipedia	Stanford NER	K, G

A final approach is to use classifiers to learn appropriate weights and parameters for different features based on labeled data. Amongst such approaches, we can mention that ADEL [247] performs experiments with k-NN, Random Forest, Naive Bayes and SVM classifiers, finding k-NN to perform best; AIDA [140], LINDEN [278] and WAT [243] use SVM variants to learn feature weights; HERD [284] uses logistic regression to assign weights to features; and so forth. All such methods rely on labeled data to train the classifiers over; we will discuss such datasets later when discussing the evaluation of EEL systems.
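As an illustration of learning feature weights from labeled data, the sketch below fits a logistic regression model by stochastic gradient descent over two features per mention–candidate pair (commonness and relatedness). The training examples, learning rate and number of epochs are invented, and the code is a generic textbook procedure rather than the training setup of any cited system.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Probability that a mention-candidate pair is a correct link."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(examples, epochs=500, learning_rate=0.5):
    """Fit feature weights by stochastic gradient descent on the log-loss."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical labeled pairs: (commonness, relatedness) -> 1 if the link is correct.
examples = [
    ((0.7, 0.1), 0),
    ((0.2, 0.9), 1),
    ((0.6, 0.2), 0),
    ((0.3, 0.8), 1),
    ((0.5, 0.7), 1),
    ((0.4, 0.1), 0),
]
weights, bias = train(examples)
p_coherent = predict(weights, bias, (0.2, 0.9))
p_popular = predict(weights, bias, (0.7, 0.1))
```

In this toy dataset the correct candidates are those with high relatedness, so the learned weight for relatedness is positive; with real labeled data, the learned weights reflect how informative each feature is for the corpus at hand.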

Table 4

Overview of disambiguation features used by EEL systems. (M:Mention-based, K:Keyword-based, G:Graph-based, C:Category-based, L:Linguistic-based) (mo:mention-only, mm:mention–mention, mc:mention–candidate, co:candidate-only, cc:candidate–candidate; v:various)

System	String similarity – [M | mc]	Type comparison – [M | mc]	Keyphraseness – [M | mo]	Overlapping mentions – [M | mm]	Abbreviations – [M | mm]	Mention–candidate contexts – [K | mc]	Candidate–candidate contexts – [K | cc]	Commonness/Prior – [G | mc]	Relatedness (Cocitation) – [G | cc]	Relatedness (KB) – [G | cc]	Centrality – [G | co]	Categories – [C | v]	Linguistic – [L | v]
ADEL [247]	✓	✓		✓	✓	✓		✓			✓
AGDISTIS [302]	✓			✓						✓	✓
AIDA [140]			✓			✓		✓	✓	✓
AIDA-Light [230]	✓					✓	✓					✓
Babelfy [216]											✓		✓
CohEEL [122]						✓		✓		✓
DBpedia Spotlight [200]	✓		✓			✓
DoSeR [336]	✓							✓			✓
ExPoSe [238]	✓	✓	✓			✓
GianniniCDS [118]												✓
JERL [184]			✓			✓	✓	✓	✓				✓
J-NERD [231]	✓	✓	✓		✓	✓	✓					✓	✓
Kan-Dis [145]						✓	✓			✓	✓	✓	✓
KIM [250]										✓
KORE [138]			✓		✓	✓	✓	✓	✓	✓
LINDEN [278]									✓			✓
MAG [217]	✓			✓	✓	✓				✓	✓		✓
NereL [281]	✓		✓				✓	✓	✓	✓		✓	✓
NERFGUN [126]	✓		✓				✓			✓	✓
NERSO [125]											✓
SDA [42]						✓
SemTag [82]						✓
THD [84]						✓					✓
Weasel [299]						✓			✓		✓

Such methods for scoring and classifying results can be used to compute a final set of mentions and their candidates, either selecting a single candidate for each mention or associating multiple candidates with a support by which they can be ranked.

2.3. System summary and comparison

Table 3 provides an overview of how the EEL techniques discussed in this section are used by highlighted systems that: deal with a resource (e.g., a KB) using one of the Semantic Web standards; deal with EEL over plain text; have a publication offering system details; and are standalone systems. Based on these criteria, we exclude systems discussed previously that deal only with Wikipedia (since they do not directly relate to the Semantic Web). In this table, Year indicates the year of publication, KB denotes the primary Knowledge Base used for evaluation, Matching corresponds to the manner in which raw mentions are detected and matched with KB entities (Keyword refers to keyword search, Substring refers to methods such as prefix matching, LSH refers to Locality Sensitive Hashing), Indexing refers to the manner in which KB meta-data are indexed, Context refers to external sources used to enrich the description of entities, Recognition indicates how raw mentions are extracted from the text, and Disambiguation indicates the high-level strategies used to pair each mention with its most suitable KB identifier (if any). Regarding the latter dimension, Table 4 further details the features used by each system based on the categorization given in Section 2.2.1.

Given the breadth of approaches now available for the EEL task, a challenging question is then: which EEL approach should I choose for application X? Different options are associated with different strengths and weaknesses, where we can highlight the following key considerations in terms of application requirements:

KB selection: While some tools are general and accept or can be easily adapted to work with arbitrary KBs, other tools are more tightly coupled with a particular KB, relying on features inherent to that KB or a contextual source such as Wikipedia. Hence the selection of a particular target KB may already suggest the suitability of some tools over others. For example, ADEL and DBpedia Spotlight rely on the structure provided by DBpedia; AIDA and KORE on YAGO2; while ExtraLink, KIM, and SemTag are focused on custom ontologies.

Domain selection: When working within a specific topical domain, the number of entities to consider will often be limited. However, certain domains may involve types of entity mentions that are atypical; for example, while types such as persons, organizations and locations are well-recognized, in the medical domain, diseases or (branded) drugs may not be well recognized and may require special training or configuration. Examples of domain-specific EEL approaches include Sieve [89] (using the SNOMED-CT ontology), and that proposed by Zheng et al. [329] (based on a KB constructed from BioPortal ontologies²³).

²³
https://bioportal.bioontology.org

Text characteristics: Aside from the domain (be it specific or open), the nature of the text input can better suit one type of system over another. For example, even considering a fixed medical domain, Tweets mentioning illnesses will offer unique EEL challenges (short context, slang, lax capitalization, etc.) versus news articles, webpages or encyclopedic articles about diseases, where again, certain tools may be better suited for certain input text characteristics. For example, TagMe [101] focuses on EEL over short texts, while approaches such as UDFS [78] and those proposed by Yerva et al. [320] and Yosef et al. [322] focus more specifically on Tweets.

Language: Language can be an important factor in the selection of an EEL system, where certain tools may rely on resources (stemmers, lemmatizers, POS-taggers, parsers, training datasets, etc.) that assume a particular language. Likewise, tools that do not use any language-specific resources may still rely to varying extents on features (such as capitalization, distinctive proper nouns, etc.) that will be present to varying extents in different languages. While many EEL tools are designed or evaluated primarily around the English language, others offer explicit support for multiple languages [269]; amongst these multilingual systems, we can mention Babelfy [216], DBpedia Spotlight [71] and MAG [217].

Emerging entities: As data change over time, new entities are constantly generated. An application may thus need to detect emerging entities, which is only supported by some approaches; for example, approaches by Hoffart et al. [137] and Guo et al. [124] extract emerging entities with NIL annotations in cases where the confidence of KB candidates is below a threshold. On the other hand, even if an application does not need recognition of emerging entities, when considering a given approach or tool, it may be important to consider the cost/feasibility of periodically updating the KB in dynamic scenarios (e.g., recognizing emerging trends in social media).

Performance and overhead: In scenarios where EEL must be applied over large and/or highly dynamic inputs, the performance of the EEL system becomes a critical consideration, where tools can vary in orders of magnitude with respect to runtimes. Likewise, EEL systems may have prohibitive hardware requirements, such as having to store the entire dictionary in primary memory, and/or the need to collectively model all mentions and entities in a given text in memory, etc. The requirements of a particular system can then be an important practical factor in certain scenarios. For example, the AIDA-Light [230] system greatly improves on the runtime performance of AIDA [321], with a slight loss in precision.

Output quality: Quality is often defined as “fit for purpose”, where an EEL output fit for one application/purpose might be unfit for another. For example, a semi-supervised application where a human expert will later curate links might emphasize recall over the precision of the top-ranked candidate chosen, since rejecting erroneous candidates is faster than searching for new ones manually [77]. On the other hand, a completely automatic system may prefer a cautious output, prioritizing precision over recall. Likewise, some applications may only care if an entity is linked once in a text, while others may put a high priority on repeated (short) mentions also being linked. Different purposes provide different instantiations of the notion of quality, and thus may suggest the fitness of one tool over another. Such variability of quality is seen in, for example, GERBIL [303] benchmark results,²⁴ where the best system for one dataset may perform worse on another dataset with different characteristics.

²⁴
http://gerbil.aksw.org/gerbil/overview

Various other considerations, such as availability of software, availability of appropriate training data, licensing of software, API restrictions, costs, etc., will also often apply.
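To make the emerging-entities consideration above more concrete, the following minimal Python sketch shows the general idea of thresholding candidate confidence to produce NIL annotations. It is only an illustration under stated assumptions: the function name `link_or_nil`, the threshold value of 0.5, and the candidate scores are hypothetical and are not taken from any of the cited systems.

```python
from typing import Dict, Optional

# NIL stands for "no suitable KB identifier", i.e., a possible emerging entity.
NIL = None


def link_or_nil(candidates: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring KB identifier, or NIL if its confidence is too low.

    `candidates` maps KB identifiers to confidence scores; both the scores and
    the default threshold are hypothetical values chosen for illustration.
    """
    if not candidates:
        return NIL
    best_id, best_score = max(candidates.items(), key=lambda kv: kv[1])
    return best_id if best_score >= threshold else NIL


# Hypothetical candidate scores for two mentions.
scores = {
    "Paris": {"dbr:Paris": 0.92, "dbr:Paris_Hilton": 0.05},
    "Zorblatt Inc.": {"dbr:Zorba_the_Greek": 0.11},
}
links = {mention: link_or_nil(cands) for mention, cands in scores.items()}
print(links)
```

In such a scheme, raising the threshold yields more NIL annotations (favoring precision over recall for KB links), which ties the choice of threshold to the output-quality consideration discussed above.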

In summary, no one EEL system fits all and EEL remains an active area of research. In order to exploit the inherent strengths and weaknesses of different EEL systems, a variety of ensemble approaches have been proposed. Furthermore, a wide variety of benchmarks and datasets have been proposed for evaluating and comparing such systems. We discuss ensemble systems and EEL evaluation in the following sections.

2.4. Ensemble systems

As previously discussed, different EEL systems may be associated with different strengths and weaknesses. A natural idea is then to combine the results of multiple EEL systems in an ensemble approach (as seen elsewhere, for example, in Machine Learning algorithms [80]). The main goal of ensemble methods is thus to compare and exploit complementary aspects of the underlying systems such that the results obtained are better than those possible using any single such system. Five such ensemble systems are:

NERD (2012) [ 263 ]

(Named Entity Recognition and Disambiguation) uses an ontology to integrate the input and output of ten NER and EEL tools, namely AlchemyAPI, DBpedia Spotlight, Evri, Extractiv, Lupedia, OpenCalais, Saplo, Wikimeta, Yahoo! Content Extractor, and Zemanta. Later works proposed classifier-based methods (Naive Bayes, k-NN, SVM) for combining results [265].

BEL (2014) [ 334 ]

(Bagging for Entity Linking) recognizes entity mentions through Stanford NER, later retrieving entity candidates from YAGO that are disambiguated by means of a majority-voting algorithm according to various ranking classifiers applied over the mentions’ contexts.

Dexter (2014) [ 40 ]

uses TagMe and WikiMiner combined with a collective linking approach to match entity mentions in a text with Wikipedia identifiers, where the authors propose switching approaches depending on the features of the input document(s), such as domain, length, etc.

NTUNLP (2014) [ 47 ]

performs EEL with respect to Freebase using a combination of DBpedia Spotlight and TagMe results, extended with a custom EEL method using the Freebase search API. Thresholds are applied over all three methods and overlapping mentions are filtered.

WESTLAB (2016) [ 41 ]

uses Stanford NER & ADEL to recognize entity mentions, subsequently merging the output of four linking systems, namely AIDA, Babelfy, DBpedia Spotlight and TagMe.
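As a rough illustration of how the outputs of several EEL systems might be combined, the following Python sketch applies simple majority voting over the KB identifiers proposed for each mention. This is a simplification under stated assumptions rather than the method of any system listed above: the function `majority_vote`, the `min_votes` parameter and the sample annotations are hypothetical, and mentions are assumed to match only when their character offsets are identical (real ensembles must also handle overlapping mentions).

```python
from collections import Counter
from typing import Dict, List, Tuple

Mention = Tuple[int, int]       # (start offset, end offset) in the input text
Annotation = Dict[Mention, str]  # mention -> KB identifier proposed by one system


def majority_vote(outputs: List[Annotation], min_votes: int = 2) -> Annotation:
    """Keep, for each mention, the KB identifier proposed by the most systems.

    A link is kept only if at least `min_votes` systems agree on it.
    """
    merged: Annotation = {}
    mentions = {m for out in outputs for m in out}
    for mention in sorted(mentions):
        votes = Counter(out[mention] for out in outputs if mention in out)
        kb_id, count = votes.most_common(1)[0]
        if count >= min_votes:
            merged[mention] = kb_id
    return merged


# Hypothetical outputs of three EEL systems over the same text.
system_a = {(0, 5): "dbr:Paris", (10, 16): "dbr:France"}
system_b = {(0, 5): "dbr:Paris", (10, 16): "dbr:France_national_football_team"}
system_c = {(0, 5): "dbr:Paris_Hilton", (10, 16): "dbr:France", (20, 25): "dbr:Seine"}

combined = majority_vote([system_a, system_b, system_c])
print(combined)
```

Voting is only one possible combination strategy; as noted above, other ensembles rely on trained classifiers, per-system thresholds, or switching between systems depending on the input.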

2.5. Evaluation

EEL involves two high-level tasks: recognition and disambiguation. Thus, evaluation may consider the recognition phase separately, or the disambiguation phase separately, or the entire EEL process as a whole. Given that the evaluation of recognition is well-covered by the traditional NER literature, here we focus on evaluations that consider whether or not the recognized mentions are deemed correct and whether or not the assigned KB identifier is deemed correct.

Given the wide range of EEL approaches proposed in the literature, we do not discuss details of the evaluations of individual tools conducted by the authors themselves. Rather we will discuss some of the most commonly used evaluation datasets and then discuss evaluations conducted by third parties to compare various selections of EEL systems.

Datasets: A variety of datasets have been used to evaluate the EEL process in different settings and under different assumptions. Here we enumerate some datasets that have been used to evaluate multiple tools:

AIDA–CoNLL [ 140 ]

The CoNLL-2003 dataset²⁵ consists of 1,393 Reuters news articles whose entities were manually identified and typed for the purposes of training and evaluating traditional NER tools. For the purposes of training and evaluating AIDA [140], the authors manually linked the entities to YAGO. This dataset was later used by ADEL [247], AIDA-Light [230], Babelfy [216], HERD [284], JERL [184], J-NERD [231], KORE [138], amongst others.

²⁵
https://www.clips.uantwerpen.be/conll2003/ner.tgz

AQUAINT [ 209 ]

The AQUAINT dataset contains 50 English documents collected from the Xinhua News Service, New York Times, and the Associated Press. Each document contains about 250–300 words, where the first mention of an entity is manually linked to Wikipedia. The dataset was first proposed and used by Milne and Witten [209], and later used by AGDISTIS [302].

ELMD [ 239 ]

The ELMD dataset contains 47,254 sentences with 92,930 annotated and classified entity mentions extracted from a collection of Last.fm artist biographies. This dataset was automatically annotated through the ELVIS system,²⁶ which homogenizes and combines the output of different Entity Linking tools. It was manually verified to have a precision of 0.94 and is available online.²⁷

²⁶
https://github.com/sergiooramas/elvis

²⁷
https://www.upf.edu/web/mtg/elmd

IITB [ 168 ]

The IITB dataset contains 103 English webpages taken from a handful of domains relating to sports, entertainment, science and technology; the text of the webpages is scraped and semi-automatically linked with Wikipedia. The dataset was first proposed and used by Kulkarni et al. [168] and later used by AGDISTIS [302].

Meij [ 199 ]

This dataset contains 562 manually annotated tweets sampled from 20 “verified users” on Twitter and linked with Wikipedia. The dataset was first proposed by Meij et al. [199], and later used by Cornolti et al. [62] to form part of a more general purpose EEL benchmark.

KORE-50 [ 138 ]

The KORE-50 dataset contains 50 English sentences designed to offer a challenging set of examples for Entity Linking tools; the sentences relate to various domains, including celebrities, music, business, sports, and politics. The dataset emphasizes short sentences, entity mentions with a high number of occurrences, highly ambiguous mentions, and entities with low prior probability. The dataset was first proposed and used for KORE [138], and later reused by Babelfy [216] and Kan-Dis [145], amongst others.

MEANTIME [ 212 ]

The MEANTIME dataset consists of 120 English Wikinews articles on topics relating to finance, with translations to Spanish, Italian and Dutch. Entities are annotated with links to DBpedia resources. This dataset has been recently used by ADEL [247].

MSNBC [ 66 ]

The MSNBC dataset contains 20 English news articles from 10 different categories, which were semi-automatically annotated. The dataset was proposed and used by Cucerzan [66], and later reused to evaluate AGDISTIS [302], LINDEN [278] and by Kulkarni et al. [168].

VoxEL [ 268 ]

The VoxEL dataset contains 15 news articles (on politics) in 5 different languages sourced from the VoxEurop website.²⁸ It was manually annotated with the NIFify system²⁹ using two different criteria for labelling: a strict version containing 204 annotated mentions (per language) of persons, organizations and locations; and a relaxed version containing 674 annotated mentions (per language) of Wikipedia entities.

²⁸
https://voxeurop.eu/

²⁹
https://github.com/henryrosalesmendez/NIFify

WP [ 138 ]

The WP dataset samples English Wikipedia articles relating to heavy metal musical groups. Articles with related categories are retrieved and sentences with at least three named entities (found by anchor text in links) are kept; in total, 2,019 sentences are considered. The dataset was first proposed and used for the KORE [138] system and also later used by AIDA-Light [230].

Aside from being used for evaluation, we note that such datasets – particularly larger ones like AIDA–CoNLL – can be (and are) used for training purposes. Moreover, although varied gold standard datasets have been proposed for EEL, Jha et al. [152] identified some issues regarding such datasets, for example, data consensus (there is a lack of consensus on standard rules for annotating entities), updates (KB links change over time), and annotation quality (regarding the number and expertise of evaluations and judges of the dataset). Thus, Jha et al. [152] propose the Eaglet system for detecting such issues over existing datasets.

Metrics: Traditional metrics such as Precision, Recall, and F-measure are applied to evaluate EEL systems. Moreover, micro and macro variants are also applied in systems such as AIDA [321], DoSeR [336] and frameworks such as BAT [62] and GERBIL [303]; taking Precision as an example, macro-Precision considers the average Precision over individual documents or sentences, while micro-Precision considers the entire gold standard as one test without distinguishing the individual documents or sentences. Other systems and frameworks may use measures that distinguish the type of entity or the type of mention, where, for example, the GERBIL framework distinguishes results for KB entities from emerging entities.
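The difference between the macro and micro variants can be illustrated with a small Python sketch. The data are hypothetical, the helper names are ours, and returning 0.0 for a document with no predicted links is an assumed convention (frameworks may treat this case differently).

```python
from typing import List, Set, Tuple

Link = Tuple[str, str]            # (mention, KB identifier)
Doc = Tuple[Set[Link], Set[Link]]  # (predicted links, gold-standard links)


def precision(predicted: Set[Link], gold: Set[Link]) -> float:
    """Fraction of predicted links that appear in the gold standard."""
    return len(predicted & gold) / len(predicted) if predicted else 0.0


def macro_precision(docs: List[Doc]) -> float:
    """Average of the per-document precision values."""
    return sum(precision(p, g) for p, g in docs) / len(docs)


def micro_precision(docs: List[Doc]) -> float:
    """Precision over all predictions pooled into one test."""
    correct = sum(len(p & g) for p, g in docs)
    total = sum(len(p) for p, _ in docs)
    return correct / total if total else 0.0


# Hypothetical evaluation data: document 1 has 1 of 1 predictions correct,
# document 2 has 1 of 4 predictions correct.
doc1 = ({("a", "E1")}, {("a", "E1")})
doc2 = (
    {("a", "E1"), ("b", "X"), ("c", "Y"), ("d", "Z")},
    {("a", "E1"), ("b", "E2"), ("c", "E3"), ("d", "E4")},
)
docs = [doc1, doc2]

macro = macro_precision(docs)  # (1.0 + 0.25) / 2 = 0.625
micro = micro_precision(docs)  # 2 correct / 5 predicted = 0.4
print(macro, micro)
```

Here the macro variant gives equal weight to each document, so the short but perfectly annotated document raises the score, whereas the micro variant gives equal weight to each predicted link.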

Third-party comparisons: A number of third-party evaluations have been conducted in order to compare various EEL tools. Note that we focus on evaluations that include a disambiguation step, and thus exclude studies that focus only on NER (e.g., [135]).

As previously discussed, Rizzo and Troncy [264] proposed the NERD approach to integrate various Entity Linking tools with online APIs. They also provided some comparative results for these tools, namely Alchemy, DBpedia Spotlight, Evri, OpenCalais and Zemanta [263]. More specifically, they compared the number of entities detected by each tool from 1,000 New York Times articles, considering six entity types: person, organization, country, city, time and number. These results show that while the commercial black box tools managed to detect thousands of entities, DBpedia Spotlight only detected 16 entities in total; to the best of our knowledge, the quality of the entities extracted was not evaluated. However, in follow-up work by Rizzo et al. [265], the authors use the AIDA–CoNLL dataset and a Twitter dataset to compare the linking precision, recall and F-measure of Alchemy, DataTXT, DBpedia Spotlight, Lupedia, TextRazor, THD, Yahoo! and Zemanta. In these experiments, Alchemy generally had the highest recall, DataTXT or TextRazor the highest precision, while TextRazor had the best F-measure for both datasets.

Gangemi [112] presented an evaluation of tools for Knowledge Extraction on the Semantic Web (or tools trivially adaptable to such a setting). Using a sample text obtained from an extract of an online article of The New York Times³⁰ as input, he evaluated the precision, recall, F-measure and accuracy of several tools for diverse tasks, including Named Entity Recognition, Entity Linking (referred to as Named Entity Resolution), Topic Detection, Sense Tagging, Terminology Extraction, Terminology Resolution, Relation Extraction, and Event Detection. Focusing on the EEL task, he evaluated nine tools: AIDA, Stanbol, CiceroLite, DBpedia Spotlight, FOX, FRED+Semiosearch, NERD, Semiosearch and Wikimeta. In these results, AIDA, CiceroLite and NERD had perfect precision (1.00), while Wikimeta had the highest recall (0.91); in a combined F-measure, Wikimeta fared best (0.80), with AIDA (0.78), FOX (0.74) and CiceroLite (0.71) not far behind. On the other hand, the observed precision (0.75) and in particular recall (0.27) of DBpedia Spotlight were relatively low.

³⁰
http://www.nytimes.com/2012/12/09/world/middleeast/syrian-rebels-tied-to-al-qaeda-play-key-role-in-war.html

Cornolti et al. [62] presented an evaluation framework for Entity Linking systems, called the BAT-framework.³¹ The authors used this framework to evaluate five systems – AIDA, DBpedia Spotlight, Illinois Wikifier, M&W Miner and TagMe (v2) – with respect to five publicly available datasets – AIDA–CoNLL, AQUAINT, IITB, Meij and MSNBC – that offer a mix of different types of inputs in terms of domains, lengths, densities of entity mentions, and so forth. In their experiments, quite consistently across the various datasets and configurations, AIDA tended to have the highest precision, TagMe and M&W Miner tended to have the highest recall, while TagMe tended to have the highest F-measure; one exception to this trend was the IITB dataset based on long webpages, where DBpedia Spotlight had the highest recall (0.50), while AIDA had very low recall (0.04); on the other hand, for this dataset, M&W Miner had the best F-measure (0.52). An interesting aspect of Cornolti et al.’s study is that it includes performance experiments, where the authors found that TagMe was an order of magnitude faster for the AIDA–CoNLL dataset than any other tool while still achieving the best F-measure on that dataset; on the other hand, AIDA and DBpedia Spotlight were amongst the slowest tools, being around 2–3 orders of magnitude slower than TagMe.

³¹
https://github.com/marcocor/bat-framework

Trani et al. [40] and Usbeck et al. [303] later provided evaluation frameworks based on the BAT-framework. First, Trani et al. proposed DEXTER-EVAL, which allows users to quickly load and run evaluations following the BAT-framework.³² Later, Usbeck et al. [303] proposed GERBIL,³³ where the tasks defined for the BAT-framework are reused. GERBIL additionally packages six new tools (AGDISTIS, Babelfy, Dexter, NERD, KEA and WAT), six new datasets, and offers improved extensibility to facilitate the integration of new annotators, datasets, and measures. However, the focus of the paper is on the framework and although some results are presented as examples, they only involve particular systems or particular datasets.

³²
https://github.com/diegoceccarelli/dexter-eval

³³
http://aksw.org/Projects/GERBIL.html

Derczynski et al. [79] focused on a variety of tasks over tweets, including NER/EL, which has unique challenges in terms of having to process short texts with little context, heavy use of abbreviated mentions, lax capitalization and grammar, etc., but also has unique opportunities for incorporating novel features, such as user or location modeling, tags, followers, and so forth. While a variety of approaches are evaluated for NER, with respect to EEL, the authors evaluated four systems – DBpedia Spotlight, TextRazor, YODIE and Zemanta – over two Twitter datasets – a custom dataset (where entity mentions are given to the system for disambiguation) and the Meij dataset (where the raw tweet is given). In general, the systems struggled in both experiments. YODIE – a system with adaptations for Twitter – performed best in the first disambiguation task (note that TextRazor was not tested). In the second task, DBpedia Spotlight had the best recall (0.48), TextRazor had the highest precision (0.65) while Zemanta had the best F-measure (0.41) (note that YODIE was not run in this second test).

Challenge events: A variety of EEL-related challenge events have been co-located with conferences and workshops, providing a variety of standardized tasks and calling for participants to apply their techniques to the tasks in question and submit their results. These challenge events thus offer an interesting format for empirical comparison of different tools in this space. Amongst such events considering an EEL-related task, we can mention:

Entity Recognition and Disambiguation (ERD)

is a challenge at the Special Interest Group on Information Retrieval Conference (SIGIR), where the ERD’14 challenge [37] featured two tasks for linking mentions to Freebase: a short-text track considering 500 training and 500 test keyword searches from a commercial engine, and a long-text track considering 100 training and 100 testing documents scraped from webpages.

Knowledge Base Population (KBP)

is a track at the NIST Text Analysis Conference (TAC) with an Entity Linking Track, providing a variety of EEL-related tasks (including multi-lingual scenarios), as well as training corpora, validators and scorers for task performance.³⁴

³⁴

http://nlp.cs.rpi.edu/kbp/2014/

Making Sense of Microposts (#Microposts2016)

is a workshop at the World Wide Web Conference (WWW) with a Named Entity rEcognition and Linking (NEEL) Challenge, providing a gold standard dataset for evaluating named entity recognition and linking tasks over microposts, such as found on Twitter.³⁵

³⁵

http://microposts2016.seas.upenn.edu/challenge.html

Open Knowledge Extraction (OKE)

is a challenge hosted by the European Semantic Web Conference (ESWC), which typically contains two tasks, the first of which is an EEL task using the GERBIL framework [303]; ADEL [247] won in 2015 while WESTLAB [41] won in 2016.³⁶

³⁶

https://project-hobbit.eu/events/open-knowledge-extraction-oke-challenge-at-eswc-2017/

Workshop on Noisy User-generated Text (W-NUT)

is hosted by the Annual Meeting of the Association for Computational Linguistics (ACL), which provides training and development data based on the CoNLL data format.³⁷

³⁷

http://noisy-text.github.io/2017/

We highlight the diversity of conferences at which such events have been hosted – covering Linguistics, the Semantic Web, Natural Language Processing, Information Retrieval, and the Web – indicating the broad interest in topics relating to EEL.

2.6. Summary

Many EEL approaches have been proposed in the past 15 years or so – in a variety of communities – for matching entity mentions in a text with entity identifiers in a KB; we also notice that the popularity of such works increased immensely with the availability of Wikipedia and related KBs. Despite the diversity in proposed approaches, the EEL process is comprised of two conceptual steps: recognition and disambiguation.

In the recognition phase, entity mentions in the text are identified. In EEL scenarios, the dictionary will often play a central role in this phase, indexing the labels of entities in the KB as well as contextual information. Subsequently, mentions in the text referring to the dictionary can be identified using string-, token- or NER-based approaches, generating candidate links to KB identifiers. In the disambiguation phase, candidates are scored and/or selected for each mention; here, a wide range of features can be considered, relying on information extracted about the mention, the keywords in the context of the mentions and the candidates, the graph induced by the similarity and/or relatedness of mentions and candidates, the categories of an external reference corpus, or the linguistic dependencies in the input text. These features can then be combined by various means – thresholds, objective functions, classifiers, etc. – to produce a final candidate for each mention or a support for each candidate.
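The two conceptual steps summarized above can be sketched in a few lines of Python. This is a deliberately minimal illustration under stated assumptions, not the design of any particular system: the dictionary entries, prior values and context keywords are hypothetical, recognition uses simple longest-match token lookup, and disambiguation adds a prior to a keyword-overlap count in place of the richer features discussed in this section.

```python
import re
from typing import Dict, List, Set, Tuple

Candidate = Tuple[str, float, Set[str]]  # (KB identifier, prior, context keywords)

# Hypothetical dictionary indexing lower-cased entity labels and contextual information.
DICTIONARY: Dict[str, List[Candidate]] = {
    "saturn": [
        ("dbr:Saturn", 0.6, {"planet", "rings", "solar", "orbit"}),
        ("dbr:Saturn_(mythology)", 0.4, {"roman", "god", "myth"}),
    ],
    "rome": [("dbr:Rome", 1.0, {"italy", "city", "roman"})],
}


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def recognize(tokens: List[str], max_len: int = 3) -> List[Tuple[int, int, str]]:
    """Token-based recognition: find the longest n-grams that match a dictionary label."""
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = " ".join(tokens[i:i + n])
            if label in DICTIONARY:
                mentions.append((i, i + n, label))
                i += n
                break
        else:  # no dictionary label starts at token i
            i += 1
    return mentions


def disambiguate(label: str, context: Set[str]) -> str:
    """Select the candidate maximizing prior + number of keywords shared with the text."""
    def score(candidate: Candidate) -> float:
        _, prior, keywords = candidate
        return prior + len(keywords & context)
    return max(DICTIONARY[label], key=score)[0]


def link(text: str) -> List[Tuple[str, str]]:
    tokens = tokenize(text)
    context = set(tokens)
    return [(label, disambiguate(label, context)) for _, _, label in recognize(tokens)]


print(link("In Rome, Saturn was worshipped as a Roman god of myth"))
print(link("Saturn is a planet with rings"))
```

In the first sentence the keyword overlap outweighs the higher prior of the planet, so the mythological reading is selected; in the second, the planet is selected. A threshold over the final score could additionally be used to return a support per candidate rather than a single link.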

2.7. Open questions

While the EEL task has been widely studied in recent years, many important research questions remain open, where our survey suggests the following:

Defining “Entity”: A foundational question that remains open is to rigorously define what is an “entity” in the context of EEL [152,180,270]. The traditional definition from the NER community considers mentions of entities from fixed classes, such as Person, Place, or Organization. However, EEL is often conducted with respect to KBs that contain entities from potentially hundreds of classes. Hence some tools and datasets choose to adopt a more relaxed notion of “entity”; for example, the KORE dataset contains the element dbr:Rock_music that might be considered by some as a concept and not an entity (and hence the subject of word sense disambiguation [216,224] rather than EEL). There is also a lack of consistency regarding how emerging entities not in the KB, overlapping entities, coreferences, etc., should be handled [180,270]. Thus, a key open question relates to finding an agreement on what entities may, should and/or must be extracted and linked as part of the EEL process.

Multilingual EEL: EEL approaches have traditionally focused on English texts. However, more and more approaches are considering EEL over non-English texts, including Babelfy [216], MAG [217], THD [84], and updated versions of legacy systems such as DBpedia Spotlight [71]. Such systems face a number of open challenges, including the development of language-specific or language-agnostic components (e.g., having POS taggers for different languages), the disparity of reference information available for different languages (e.g., Wikipedia is more complete for English than other languages), as well as being robust to language variations (e.g., differences in alphabet, capitalization, punctuation) [269].

Specialized Settings: While the majority of EEL approaches consider relatively clean and long text documents as input – such as news articles – other applications may require EEL over noisy or short text. One example that has received attention recently is the application of EEL methods over Twitter [78,79,320,322], which presents unique challenges – such as the frequent use of slang and abbreviations, a lack of punctuation and capitalization, as well as having limited context – but also presents unique opportunities – such as leveraging user profiles and social context. Beyond Twitter, EEL could be applied in any number of specialized settings, each with its own challenges and opportunities, raising further open questions.

Novel Techniques: Improving the precision and recall of EEL will likely remain an open question for years to come; however, we can identify two main trends that are likely to continue into the future. The first trend is the use of modern Machine Learning techniques for EEL; for example, Deep Learning [121] has been investigated in the context of improving both recognition [87] and disambiguation [107]. The second trend is towards approaches that consider multiple related tasks in a joint approach, be it to combine recognition and disambiguation [184,231], or to combine word sense disambiguation and EEL [145,216], etc. Novel techniques are required to continue to improve the quality of EEL results.

Evaluation: Though benchmarking frameworks such as BAT [62] and GERBIL [303] represent important milestones towards better evaluating and comparing EEL systems, potentially much more work can be done along these lines. With respect to datasets, creating gold standards often requires significant manual labor, where mistakes may sometimes be introduced in the annotation process [152], or datasets may be labeled with respect to incompatible notions of “entity” [152,180,270]. Moreover, the domains [89] and languages [269] covered by existing datasets are limited. Aside from the need for more labeled datasets, evaluations tend to consider EEL systems as complex black boxes, which obfuscates the reasons for a particular system’s success or failure; more fine-grained evaluation of techniques – rather than systems – could potentially offer more fundamental insights into the EEL process, leading to further research questions.

3. Concept extraction & linking

A given corpus may refer to one or more domains, such as Medicine, Finance, War, Technology, and so forth. Such domains may be associated with various concepts indicating a more specific topic, such as “breast cancer”, “solid state disks”, etc. Concepts (unlike many entities) are often hierarchical, where, for example, “breast cancer”, “melanoma”, etc., may indicate concepts that specialize the more general concept of “cancer”, which in turn specializes the concept of “disease”, etc.
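The hierarchical nature of concepts can be illustrated with a minimal Python sketch that encodes the example above as a mapping from each concept to its more general concept. The `BROADER` mapping and the helper functions are hypothetical and assume, for simplicity, that each concept has at most one direct generalization.

```python
from typing import Dict, List

# Each concept maps to the more general concept it specializes (hypothetical data
# taken from the example in the text).
BROADER: Dict[str, str] = {
    "breast cancer": "cancer",
    "melanoma": "cancer",
    "cancer": "disease",
}


def generalizations(concept: str) -> List[str]:
    """Walk up the hierarchy, from the most specific to the most general concept."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def specializations(concept: str) -> List[str]:
    """All concepts that directly or indirectly specialize the given concept."""
    return sorted(c for c in BROADER if concept in generalizations(c))


print(generalizations("breast cancer"))  # ['cancer', 'disease']
print(specializations("disease"))        # ['breast cancer', 'cancer', 'melanoma']
```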

For the purposes of this section, we coin the generic phrase Concept Extraction & Linking to encapsulate the following three related but subtly distinct Information Extraction tasks – as discussed in Appendix x – that can be brought to bear in terms of gaining a greater understanding of the concepts spoken about in a corpus, which in turn can help, for example, to understand the important concepts in the domain that a collection of documents is about, or the topic of a document.

Terminology Extraction (TE):

Given a corpus we know to be in a given domain, we may be interested to learn what terms/concepts are core to the terminology of that domain.³⁸ (See Listing 12, Appendix x, for an example.)

³⁸
Also known as Term Extraction [100], Term Recognition [3], Vocabulary Extraction [85], Glossary Extraction [60], etc.

Keyphrase Extraction (KE):

This task focuses on extracting important keyphrases for a given document.³⁹ In contrast with TE, which focuses on extracting important concepts relevant to a given domain, KE is focused on extracting important concepts relevant to a particular text. (See Listing 13, Appendix x, for an example.)

³⁹
Often simply referred to as Keyword Extraction [151,215].

Topic Modeling (TM):

The goal of Topic Modeling is to analyze co-occurrences of related keywords and cluster them into candidate groupings that potentially capture higher-level semantic “topics”.⁴⁰ (See Listing 14, Appendix x, for an example.)

⁴⁰
Sometimes referred to as topic extraction [113] or topic classification [305].

There is a clear connection between TE and KE: though the goals are somewhat divergent – the former focuses on understanding the domain itself while the latter focuses on categorizing documents – both require extraction of domain terms/keyphrases from text and hence we summarize works in both areas together.
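As a rough illustration of the term-extraction step shared by TE and KE, the following Python sketch ranks the words of one document by TF-IDF with respect to a small corpus. TF-IDF is used here only as a common generic baseline and is not attributed to any specific work covered in this section; the corpus, the function `tfidf_keywords` and the restriction to single-word terms (with no stop-word filtering) are simplifying assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(doc_index: int, corpus: List[str], k: int = 3) -> List[str]:
    """Rank the terms of one document by term frequency x inverse document frequency."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(1 for d in docs if term in d)  # number of documents containing the term
        return math.log(n / df)

    scored = {term: count * idf(term) for term, count in docs[doc_index].items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]


# Hypothetical three-document corpus.
corpus = [
    "melanoma is a cancer of the skin and melanoma spreads",
    "breast cancer is a cancer of the breast",
    "the disk is a solid state disk",
]

for i in range(len(corpus)):
    print(i, tfidf_keywords(i, corpus, k=1))
```

Terms that occur in every document (such as “is”) receive a score of zero, while terms frequent in one document but rare elsewhere rank highest; a TE setting would instead aggregate such scores over a domain corpus, typically contrasted against a general reference corpus.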

Likewise, the methods employed and the results gained through TE and KE may also overlap with the previously studied task of Entity Extraction & Linking (EEL). Abstractly, one can consider EEL as focusing on the extraction of individuals, such as “Saturn”; on the other hand, TE and KE focus on the extraction of conceptual terms, such as “planets”. However, this distinction is often fuzzy, since TE and KE approaches may identify “Saturn” as a term referring to a domain concept, while EEL approaches may identify “planets” as an entity mention. Indeed, some papers that claim to perform Keyphrase Extraction are indistinguishable from techniques for performing entity extraction/linking [203], and vice versa.

However, we can draw some clear general distinctions between EEL and the domain extraction tasks discussed in this section: the goal in EEL is to extract all entities mentioned, while the goal in TE and KE is to extract a succinct set of domain-relevant keywords that capture the terminology of a domain or the subject of a document. When compared with EEL, another distinguishing aspect of TE, KE and TM is that while the former task will produce a flat list of candidate identifiers for entity mentions in a text, the latter tasks (often) go further and attempt to induce hierarchical relations or clusters from the extracted terminology.

In this section, we discuss works relating to TE, KE and TM that directly relate to the Semantic Web, be it to help in the process of building an ontology or KB, or using an ontology or KB to guide the extraction process, or linking the results of the extraction process to an ontology or KB. We highlight that this section covers a wide diversity of works from authors working in a wide variety of domains, with different perspectives, using different terminology; hence our goal is to cover the main themes and to abstract some common aspects of these works rather than to capture the full detail of all such heterogeneous approaches.

Example: A sample of TE and KE results is provided in Listing 2, based on the examples provided in Listings 12 and 13 (see Appendix x). One motivation for applying such techniques in the context of the Semantic Web is to link the extracted terms with disambiguated identifiers from a KB. The example output consists of (hypothetical) RDF triples linking extracted terms to categorical concepts described in the DBpedia KB. These linked categories in the KB are then associated with hierarchical relations that may be used to generalize or specialize the topic of the document.

Listing 2.

Concept extraction and linking example

Applications: In the context of the Semantic Web, a core application of CEL tasks – and a major focus of TE in particular – is to help with the creation, validation or extension of a domain ontology. Automatically extracting an expressive domain ontology from text is, of course, an inherently challenging task that falls within the area of ontology learning [31,51,186,225,316]. In the context of TE, the focus is on extracting a terminological ontology [100,169] (aka. lexicalized ontology [227], termino-ontology [227] or simple ontology [43]), which captures terms referring to important concepts in the domain, potentially including a taxonomic hierarchy between concepts or identifying terms that are aliases for the same concept. The resulting concepts (and hierarchy) may be used, for example, in a semi-automated ontology building process to seed or extend the concepts in the ontology.⁴¹

⁴¹ In the context of ontology building, some authors distinguish an onomasiological process from a semasiological process, where the former process relates to taking a known concept in an ontology and extracting the terms by which it may be referred to in a text, while the latter process involves taking terms and extracting their underlying conceptual meaning in the form of an ontology [36].

Other applications relate to categorizing documents in a corpus according to their key concepts, and thus by topic and/or domain; this is the focus of the KE and TM tasks in particular. When these high-level topics are related back to a particular KB, this can enable various forms of semantic search [123,173,296], for example to navigate the hierarchy of domains/topics represented by the KB while browsing or searching documents. Other applications include text enrichment or semantic annotation whereby terms in a text are tagged with structured information from a reference KB or ontology [73,74,162,308].

Process: The first step in all such tasks is the extraction of candidate domain terms/keywords in the text, which may be performed using variations on the methods for EEL; this process may also involve a reference ontology or KB used for dictionary or learning purposes, or to seed patterns. The second step is to perform a filtering of the terms, selecting only those that best reflect the concepts of the domain or the subject of a document. A third optional step is to induce a hierarchy or clustering of the extracted terms, which may lead to either a taxonomy or a topic model; in the case of a topic model, a further step may be to identify a singular term that identifies each cluster. A final optional step may be to link terms or topic identifiers to an existing KB, including disambiguation where necessary (if not already implicit in a previous step). In fact, the steps described in this process may not always be sequential; for example, where a reference KB or ontology is used, it may not be necessary to induce a hierarchy from the terms since such a hierarchy may already be given by the reference source.

3.1. Terminology/keyphrase recognition

We consider a term to be a textual mention of a domain-specific concept, such as “cancer”. Terms may also be composed of more than one word and indeed by relatively complex phrases, such as “inner planets of the solar system”. Terminology then refers more generically to the collection of terms or specialized vocabulary pertinent to a particular domain. In the context of TE, terms/terminology are typically understood as describing a domain, while in the context of KE, keyphrases are typically understood as describing a particular document. However, the extraction processes of TE and KE are similar; hence we proceed by generically discussing the extraction of terms.

In fact, approaches to extract raw candidate terms follow a similar line to that for extracting raw entity mentions in the context of EEL. Generic preprocessing methods such as stop-word removal, stemming and/or lemmatization are often applied, along with tokenization. Some term extraction methods then rely on window-based methods, extracting n-grams up to a predefined length [100]. Other term extractors apply POS-tagging and then define shallow syntactic patterns to capture, for example, noun phrases (“solar system”), noun phrases prefixed by adjectives (“inner planets”), and so forth [60,85,220]. Other systems use an ensemble of methods to extract a broad selection of terms that are subsequently filtered [60].
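As a rough illustration of the window-based strategy, the following minimal sketch extracts word n-grams up to a predefined length and discards those that start or end with a stop word. The stop-word list, the function name `candidate_ngrams` and the boundary rule are assumptions made for illustration and are not taken from any of the cited systems.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "are", "is", "to"}

def candidate_ngrams(text, max_len=3):
    """Return word n-grams (up to max_len words) that neither start nor end with a stop word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = set()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            window = tokens[start:start + length]
            if len(window) < length:
                break  # ran past the end of the text
            if window[0] in STOP_WORDS or window[-1] in STOP_WORDS:
                continue
            candidates.add(" ".join(window))
    return candidates

terms = candidate_ngrams("The inner planets of the solar system are rocky.")
```

The resulting set is deliberately over-generous (it also contains n-grams with internal stop words), reflecting the idea that raw candidates are produced first and filtered afterwards.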

There are, however, subtle differences when compared with extracting entities, particularly when considering traditional NER scenarios looking for names of people, organizations, places, etc.; when extracting terms, for example, capitalization becomes less useful as a signal, and syntactic patterns may need to be more complex to identify concepts such as “inner planets of [the] solar system”. Furthermore, the features considered when filtering term candidates change significantly when compared with those for filtering entity mentions (as we will discuss presently). For these reasons, various systems have been proposed specifically for extracting terms, including KEA [315], TExSIS [185], TermeX [76] and YaTeA [9] (here taking a selection reused by the highlighted systems). Such systems differ from EEL particularly in terms of the filtering process, discussed presently.

3.2. Filtering

Once a set of candidate terms has been identified, a range of features can be used for either automatic or semi-automatic filtering. These features can be broadly categorized as being linguistic or statistical; however, other contextual features can be used, which will be described presently. Furthermore, filtering can be applied with respect to a domain-specific dictionary of terms as taken from a reference KB or ontology.

Linguistic features relate to lexical or syntactic aspects of the term itself, where the most basic such feature would be the number of words forming the term (more words indicating more specific terms and vice-versa). Other linguistic features can likewise include generic aspects such as POS tags [73,123,173,215,220], shallow syntactic patterns [60,85,119,215], etc.; such features may be used in the initial extraction of terms or as a post-filtering step. Furthermore, terms may be filtered or selected based on appearing in a particular hierarchical branch of terminology, such as terms relating to forms of cancer; these techniques will be discussed in the next subsection.

As explained in Appendix x, producing and maintaining linguistic patterns/rules is a time-consuming task, which in turn results in incomplete rules. Statistical measures instead look more broadly at the usage of a particular term in a corpus. In terms of such measures, two key properties of terms are often analyzed in this context [60]: unithood and termhood.

Unithood refers to the cohesiveness of the term as referring to a single concept, which is often assessed through analysis of collocations: expressions in which the meaning of an individual word may differ widely from its meaning in isolation, such that the meaning of the word depends directly on the expression. An example collocation might be “mean squared error” where, in particular, the individual words “mean” and “squared” taken in isolation have meanings unrelated to the expression; the phrase “mean squared error” thus has high unithood. There are then a variety of measures to detect collocations, most based on the idea of comparing the expected number of times the collocation would be found if occurrences of the individual words were independent versus the number of times the collocation actually appears [90]. As an example, Lossio-Ventura et al. [308] use Web search engines to determine collocations, where they estimate the unithood of a candidate term as the ratio of search results returned for the exact phrase (“mean squared error”) versus the number of results returned for all three terms (mean AND squared AND error). Unithood thus addresses a particular challenge – similar to that of overlapping entities in EEL (recalling “New York City”) – where extraction of partial mentions (such as “squared error”) may lose meaning, particularly if linkable to the KB.
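The search-based ratio can be sketched as follows. The hit counts are hypothetical numbers chosen for illustration (no search engine is queried), and the function name `unithood` and the handling of a zero denominator are assumptions rather than details from the cited work.

```python
def unithood(phrase_hits, conjunction_hits):
    """Ratio of exact-phrase hits to hits for the conjunction of the individual words."""
    if conjunction_hits == 0:
        return 0.0  # assumption: no evidence at all means no unithood
    return phrase_hits / conjunction_hits

# Hypothetical hit counts: (exact phrase, word1 AND word2 AND word3).
hits = {
    "mean squared error": (800, 1000),
    "error mean squared": (5, 1000),
}
scores = {phrase: unithood(p, c) for phrase, (p, c) in hits.items()}
```

Under these made-up counts, the conventional word order scores far higher than the scrambled one, which is the behavior such a ratio is meant to capture.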

The second form of statistical measure, called termhood, refers to the relevance of the term to the domain in question. To measure termhood, variations on the theme of the TF–IDF measure are commonly used [60,85,100,123,173,308], where, for example, terms that appear often in a (domain) specific text (high TF) but appear less often in a general corpus (high IDF) indicate higher termhood. Note that termhood relates closely to Topic Modeling measures, where the context of terms is used to find topically-related terms; such approaches will be discussed later.
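One simple way to read the TF–IDF idea for termhood is sketched below: term frequency is computed over a domain document and inverse document frequency over a general corpus. The smoothing in the IDF, the toy data and the name `termhood` are assumptions; the cited works use their own variants, and multi-word terms would need to be tokenized as single units first.

```python
import math
from collections import Counter

def termhood(term, domain_doc, general_corpus):
    """TF in the domain document times a smoothed IDF over a general corpus."""
    tf = Counter(domain_doc)[term] / len(domain_doc)
    docs_with_term = sum(1 for doc in general_corpus if term in doc)
    idf = math.log((1 + len(general_corpus)) / (1 + docs_with_term))
    return tf * idf

# Toy data (hypothetical): one tokenized domain document and a small general corpus.
domain_doc = ["carcinoma", "patient", "carcinoma", "treatment", "patient", "carcinoma"]
general_corpus = [
    ["patient", "waited", "outside"],
    ["weather", "report", "today"],
    ["patient", "treatment", "plan"],
    ["football", "match", "result"],
]

carcinoma_score = termhood("carcinoma", domain_doc, general_corpus)
patient_score = termhood("patient", domain_doc, general_corpus)
```

Here "carcinoma" is frequent in the domain document and absent from the general corpus, so it receives a higher score than "patient", which is common in both.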

Other features can rather be contextual, looking at the position of terms in the text [252]; such features are particularly important in the context of identifying keyphrases/terms that capture the domain or topic of a given document. The first such feature is known as the phrase depth, which measures how early in the document the term first appears: phrases that appear early on (e.g., in the title or first paragraph) are deemed to be most relevant to the document or the domain it describes. Likewise, terms that appear throughout the entire document are considered more relevant: hence the phrase lifespan – the ratio of the document lying between the first and last occurrence of the term – is also considered as an important feature [252].
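The two positional features can be sketched over a tokenized document as follows. The exact normalization (depth as one minus the relative position of the first occurrence, lifespan as the token span divided by document length) is an assumption made for illustration, as is the restriction to single-token terms.

```python
def position_features(tokens, term):
    """Return (depth, lifespan) for a single-token term, or None if the term is absent."""
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return None
    n = len(tokens)
    depth = 1.0 - positions[0] / n                   # 1.0 means the term opens the document
    lifespan = (positions[-1] - positions[0]) / n    # share of the document between first and last use
    return depth, lifespan

tokens = "cancer research shows that early screening reduces cancer deaths".split()
cancer_features = position_features(tokens, "cancer")
screening_features = position_features(tokens, "screening")
```

In this toy document, "cancer" appears first and spans most of the text, whereas "screening" appears once in the middle and therefore has a lifespan of zero.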

A KB can also be used to filter terms through a linking process. The simplest such procedure is to filter out terms that cannot be linked to the KB [162]. Other proposed methods rather apply a graph-based filtering, where terms are first linked to the KB and then the sub-graph of the KB induced by the terms is extracted; subsequently, terms in the graph that are disconnected [46] or exhibiting low centrality [144] can be filtered. This process will be described in more detail later.
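A minimal sketch of graph-based filtering is given below, under the assumption that "disconnected" means a linked node has no KB edge to any other linked node in the induced sub-graph. The KB fragment, node names and function name are hypothetical; the cited approaches define their own graphs and criteria.

```python
# Hypothetical KB fragment: undirected edges between KB nodes.
KB_EDGES = {
    ("Carcinoma", "Cancer"),
    ("Chemotherapy", "Cancer"),
    ("Saturn", "Planet"),
}

def filter_disconnected(linked_nodes, edges):
    """Keep only nodes that share at least one KB edge with another linked node."""
    linked = set(linked_nodes)
    kept = set()
    for a, b in edges:
        # Only edges with both endpoints linked belong to the induced sub-graph.
        if a in linked and b in linked:
            kept.update((a, b))
    return kept

linked = ["Carcinoma", "Cancer", "Chemotherapy", "Saturn"]
kept = filter_disconnected(linked, KB_EDGES)
```

"Saturn" is dropped because its only neighbor in the toy KB ("Planet") was not linked from the text, leaving it isolated in the induced sub-graph.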

3.3. Hierarchy induction

Often the extracted terms will refer to concepts with some semantic relations that are themselves useful to model as part of the process. The semantic relations most often considered are synonymy (e.g., “heart attack” and “myocardial infarction” being synonyms) and hypernymy/hyponymy (e.g., “migraine” is a hyponym of “headache” with the former being a more specific form of the latter, while conversely “headache” is a hypernym of “migraine”). While synonymy induces groups of terms with (almost exactly) the same meaning, hypernymy/hyponymy induces a hierarchical structure over the terms. Such relations and structures can be extracted either by analysis of the text itself and/or through the information gained from some reference source. These relations can then be used to filter relevant terminology, or to induce an initial semantic structure that can be formalized as a taxonomy (e.g., expressed in the SKOS standard [207]), or as a formal ontology (e.g., expressed in the OWL standard [136]), and so forth. As such, this topic relates heavily to the area of ontology learning, where we refer to the textbook by Cimiano [49] and the more recent survey of Wong et al. [316] for details; here our goal is to capture the main ideas.

In terms of detecting hypernymy from the text itself, a key method relies on distinguishing the head term, which signifies the more general hypernym in a (potentially) multi-word term, from modifier terms, which then specialize the hypernym [7,32,153,227,306]. For example, the head term of “metastatic breast cancer” is “cancer”, while “breast” and “metastatic” are modifiers that specialize the head term and successively create hyponyms. As a more complex example, the head term of “inner planets of the solar system” would be “planets”, while “inner” and “of the solar system” are modifying phrases. Analysis of the head/modifier terms thus allows for automatic extraction of hypernymic relations, starting with the most general head term, such as “cancer”, and then subsequently extending to hyponyms by successively adding modifiers appearing in a given multi-word term, such as “breast cancer”, “metastatic breast cancer”, and so forth.
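For the simple case of English terms of the form "modifier ... modifier head", the successive-modifier idea can be sketched as below. Treating the last word as the head is an assumption that does not cover post-modified terms such as “inner planets of the solar system”, which would need syntactic analysis; the function name is hypothetical.

```python
def hyponym_chain(term):
    """Return (hyponym, hypernym) pairs for a term whose head is assumed to be its last word."""
    words = term.split()
    # Successive suffixes, most general first: 'cancer', 'breast cancer', ...
    suffixes = [" ".join(words[i:]) for i in range(len(words) - 1, -1, -1)]
    return [(suffixes[i + 1], suffixes[i]) for i in range(len(suffixes) - 1)]

pairs = hyponym_chain("metastatic breast cancer")
```

Each pair reads "the first element is a hyponym of the second"; a single-word term yields no pairs.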

Of course, analyzing head/modifier terms will miss hypernyms not involving modifiers, and synonyms; for example, the hyponym “carcinoma” of “cancer” is unlikely to be revealed by such analysis. An alternative approach is to rely on lexico-syntactic patterns to detect synonymy or hypernymy. A common set of patterns to detect hypernymy are Hearst patterns [130], which look for certain connectives between noun phrases. As an example, such patterns may detect from the phrase “cancers, such as carcinomas, …” that “carcinoma” is a hyponym of “cancer”. Hearst-like patterns are then used by a variety of systems (e.g., [7,55,119,153,162,192]). While such patterns can capture additional hypernyms with high precision [130], Buitelaar et al. [31] note that finding such patterns in practice is rare and that the approach tends to offer low recall. Hence, approaches have been proposed to use the vast textual information of the Web to find instances of such patterns using, for example, Web search engines such as Google [50], the abstracts of Wikipedia articles [298], amongst other Web sources.
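A single Hearst-style pattern ("X such as Y" with an optional second conjunct) can be sketched with a regular expression as follows. This is a deliberately narrow illustration: it matches single-word noun phrases only, performs no lemmatization (so plural forms are returned as found), and the pattern and function name are assumptions rather than the patterns used by any cited system.

```python
import re

# One 'such as' pattern over single-word noun phrases, with an optional 'and/or' conjunct.
SUCH_AS = re.compile(r"(\w+),? such as (\w+)(?: (?:and|or) (\w+))?")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence.lower()):
        hypernym, first, second = match.groups()
        pairs.append((first, hypernym))
        if second is not None:
            pairs.append((second, hypernym))
    return pairs

found = hearst_such_as("Cancers, such as carcinomas and sarcomas, are treated early.")
```

Practical implementations typically run such patterns over POS-tagged or chunked text so that multi-word noun phrases and longer enumerations are handled.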

Another approach that potentially offers higher recall is to use statistical analyses of large corpora of text. Many such approaches (e.g., [6,51,52,58]) are based, for example, on distributional semantics, which aggregates the context (surrounding terms) in which a given term appears in a large corpus. The distributional hypothesis then considers that terms with similar contexts are semantically related. Within this grouping of approaches, one can then find more specific strategies based on various forms of clustering [51], Formal Concept Analysis [52], LDA [58], embeddings [6], etc., to find and induce a hierarchy from terms based on their context. These can then be used as the basis to detect synonyms; or more often to induce a hierarchy of hypernyms, possibly adding hidden concepts – fresh hypernyms of cohyponyms – to create a connected tree of more/less specific domain terms.
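The distributional hypothesis can be illustrated with a minimal sketch that builds bag-of-words context vectors within a fixed window and compares them with cosine similarity. The toy sentences, window size and function names are assumptions; the cited approaches build on far larger corpora and on clustering, FCA, LDA or embeddings rather than raw counts.

```python
import math
from collections import Counter

def context_vector(term, sentences, window=2):
    """Count the words occurring within `window` tokens of each occurrence of `term`."""
    vec = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for t in sent[lo:hi] if t != term)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

sentences = [
    "doctors treat carcinoma with chemotherapy".split(),
    "doctors treat sarcoma with chemotherapy".split(),
    "astronomers observe saturn with telescopes".split(),
]
vectors = {t: context_vector(t, sentences) for t in ("carcinoma", "sarcoma", "saturn")}
sim_related = cosine(vectors["carcinoma"], vectors["sarcoma"])
sim_unrelated = cosine(vectors["carcinoma"], vectors["saturn"])
```

In this toy corpus, "carcinoma" and "sarcoma" share identical contexts and are therefore far more similar to each other than either is to "saturn"; a clustering step over such similarities could then group them as cohyponyms.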

Of course, reference resources that already contain semantic relations between terms can be used to aid in this process. One important such resource is WordNet [208], which, for a given term, provides a set of possible semantic senses in terms of what it might mean (homonymy/polysemy [304]), as well as a set of synonyms called synsets. Those synsets are then related by various semantic relations, including hypernymy, meronymy (part of), etc. WordNet is thus a useful reference for understanding the semantic relations between concepts, used by a variety of systems (e.g., [55,153,225,325], etc.). Other systems rather rely on, for example, Wikipedia categorizations [6,7,197] in combination with reference KBs. A core challenge, however, when using such approaches is the problem of word sense disambiguation [224] (sometimes called the semantic interpretation problem [225]): given a term, determine the correct sense in which it is used. We refer to the survey by Navigli [225] for discussion.

An alternative to extracting semantic relations between terms in the text is to instead rely on the existing relations in a given KB [6,46,144,305]. That is to say, if the terms (or indeed simply entities) can be linked to a suitable existing KB, then semantic relations can be extracted from the KB itself rather than the text. This approach is often applied by tools described in the following section (e.g., [46,144,305]), whose goal is to understand the domain of a document rather than attempting to model a domain from text.

3.4. Topic modeling

While the previous methods are mostly concerned with extracting a terminology from a corpus that describes a given domain (e.g., for the purposes of building a domain-specific ontology), other works are concerned with modeling and potentially identifying the domain to which the documents in a given corpus pertain (e.g., for the purposes of classifying documents). We refer to these latter approaches generically as Topic Modeling approaches [177,198]. Such approaches are based on analysis of terms (or sometimes entities) extracted from the text over which Topic Modeling approaches can be applied to cluster and analyze thematically related terms (e.g., “carcinoma”, “malignant tumor”, “chemotherapy”). Thereafter, topic labeling [5,144] can be (optionally) applied to assign such groupings of terms a suitable KB identifier referring to the topic in question (e.g., dbr:Cancer). Application areas for such techniques include Information Retrieval [241], Recommender Systems [155], Text Classification [143], Cognitive Science [244], and Social Network Analysis [259], to name but a few.

For applying Topic Modeling, one can of course first consider directly applying the traditional methods proposed in the literature: LSA, pLSA and/or LDA (see Appendix x for discussion). However, these approaches have a number of drawbacks. First, such approaches typically work on individual words and not multi-word terms (though extensions have been proposed to consider multi-word terms). Second, topics are considered as latent variables associated with a probability of generating words, and thus are not directly “labeled”, making them difficult to explain or externalize (though, again, labeled extensions have also been proposed, for example for LDA). Third, words are never semantically interpreted in such models, but are rather considered as symbolic references over which statistical/probabilistic inference can be applied. Hence a number of approaches have emerged that propose to use the structured information available in KBs and/or ontologies to enhance the modeling of topics in text. The starting point for all such approaches is to extract some terms from the text, using approaches previously outlined: some rely simply on token- or POS-based methods to extract terms, which can be filtered by frequency or TF–IDF variants to capture domain relevance [5,149,150], whereas others rather prefer entity recognition tools (which are subsequently mapped to higher-level topics through relations in the KB, as we describe later) [46,170,298].

With extracted terms in hand, the next step for many approaches – departing from traditional Topic Modeling – is to link those terms to a given KB, where the semantic relations of the KB can be exploited to generate more meaningful topics. The most straightforward such approach is to assume an ontology that offers a concept/class hierarchy to which extracted terms from the document are mapped. Thus the ontology can be seen as guiding the Topic Modeling process, and in fact can be used to select a label for the topic. One such approach is to apply a statistical analysis over the term-to-concept mapping. For example, in such a setting, Jain and Pareek [149] propose the following: for each concept in the ontology, count the ratio of extracted terms mapped to it or its (transitive) sub-concepts, and take that ratio as an indication of the relevance of the concept in terms of representing a high-level topic of the document. Another approach is to consider the spanning tree(s) induced by the linked terms in the hierarchy, taking the lowest common ancestor(s) as a high-level topic [144]. However, as noted by Hulpuş et al. [144], such approaches relying on class hierarchies tend to elect very generic topic labels, where they give the example of “Barack Obama” being captured under the generic concept person, rather than a more interesting concept such as U.S. President. To tackle this problem, a number of approaches have proposed to link terms – including entity mentions – to Wikipedia’s categories, from which more fine-grained topic labels can be selected for a given text [64,147,276].
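The ratio-based relevance described above can be sketched as follows, reading it as: for each concept, the fraction of extracted terms mapped to that concept or to any of its transitive sub-concepts. The hierarchy, the term-to-concept mapping and the function name are hypothetical, and this is an interpretation of the description given here rather than the authors' implementation.

```python
# Hypothetical concept hierarchy, given as child -> parent.
PARENT = {
    "Carcinoma": "Cancer",
    "Sarcoma": "Cancer",
    "Cancer": "Disease",
    "Migraine": "Disease",
}

def concept_relevance(term_to_concept, parent):
    """Fraction of extracted terms mapped to each concept or its (transitive) sub-concepts."""
    counts = {}
    for concept in term_to_concept.values():
        node = concept
        while node is not None:  # credit the concept and all of its ancestors
            counts[node] = counts.get(node, 0) + 1
            node = parent.get(node)
    total = len(term_to_concept)
    return {concept: n / total for concept, n in counts.items()}

mapping = {"carcinoma": "Carcinoma", "sarcoma": "Sarcoma", "tumor": "Cancer", "migraine": "Migraine"}
relevance = concept_relevance(mapping, PARENT)
```

The root concept trivially reaches a ratio of 1.0 in this sketch, which illustrates why hierarchy-based approaches tend to favor very generic labels unless the ratio is combined with some preference for more specific concepts.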

Other approaches apply traditional Topic Modeling methods (typically pLSA or LDA) in conjunction with information extracted from the KB. Some approaches propose to apply Topic Modeling in an initial phase directly over the text; for example, Canopy [144] applies LDA over the input documents to group words into topics and then subsequently links those words with DBpedia for labeling each topic (described later). On the other hand, other approaches apply Topic Modeling after initially linking terms to the KB; for example, Todor et al. [298] first link terms to DBpedia in order to enrich the text with annotations of types, categories, hypernyms, etc., where the enriched text is then passed through an LDA process. Some recent approaches rather extend traditional topic models to consider information from the KB during the inference of topic-related distributions. Along these lines, for example, Allahyari and Kochut [5] propose an LDA variant, called “OntoLDA”, which introduces a latent variable for concepts (taken from DBpedia and linked with the text), which sits between words and topics: a document is then considered to contain a distribution of (latent) topics, which contains a distribution of (latent) concepts, which contains a distribution of (observable) words. Another such hybrid model, but rather based on pLSA, is proposed by Chen et al. [46] where the probability of a concept mention (or a specific entity mention⁴²) being generated by a topic is computed based on the distribution of topics in which the concept or entity appears and the same probability for entities that are related in the KB (with a given weight).

⁴² They use the term entity to refer to both concepts, such as person, and individuals, such as Barack Obama.

The result of these previous methods – applying Topic Modeling in conjunction with a KB-linking phase – is a set of topics associated with a set of terms that are in turn linked with concepts/entities in the KB. Interestingly, the links to the KB then facilitate labeling each topic by selecting one (or a few) core term(s) that help capture or explain the topic. More specifically, a number of graph-based approaches have recently been proposed to choose topic labels [5,144,150], which typically begin by selecting, for each topic, the nodes in the KB that are linked by terms under that topic, and then extracting a sub-graph of the KB in the neighborhood of those nodes, where typically the largest connected component is considered to be the topical/thematic graph [5,150]. The goal, thereafter, is to select the “label node(s)” that best summarize(s) the topic, for which a number of approaches apply centrality measures on the topical graph: Janik and Kochut [150] investigate use of a closeness centrality measure, Allahyari and Kochut [5] propose to use the authority score of HITS (later mapping central nodes to DBpedia categories), while Hulpuş et al. [144] investigate various centrality measures, including closeness, betweenness, information and random-walk variants, as well as “focused” centrality measures that assign special weights to nodes in the topic (not just in the neighborhood). On the other hand, Varga et al. [305] propose to extract a KB sub-graph (from DBpedia or Freebase) describing entities linked from the text (containing information about classes, properties and categories), over which weighting schemes are applied to derive input features for a machine learning model (SVM) that classifies the topics of microposts.
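As an illustration of centrality-based label selection, the sketch below computes standard closeness centrality by breadth-first search over a small, connected, undirected topical graph and picks the most central node as the label. The graph is hypothetical and the sketch shows only the generic measure, not the specific (e.g., focused or weighted) variants investigated in the cited works.

```python
from collections import deque

def closeness(graph, node):
    """Closeness centrality: (reachable nodes) / (sum of shortest-path distances)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in an unweighted graph
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical topical graph (undirected, as symmetric adjacency sets).
TOPIC_GRAPH = {
    "Cancer": {"Carcinoma", "Chemotherapy", "Oncology"},
    "Carcinoma": {"Cancer"},
    "Chemotherapy": {"Cancer", "Oncology"},
    "Oncology": {"Cancer", "Chemotherapy"},
}
label = max(TOPIC_GRAPH, key=lambda n: closeness(TOPIC_GRAPH, n))
```

In this toy graph the node adjacent to all others is selected, matching the intuition that a good label node lies close to every term of the topic.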

3.5. Representation

Domain knowledge extracted through the previous processes may be represented using a variety of Semantic Web formats. In the ontology building process, induced concepts may be exported to RDFS/OWL [136] for further reasoning tasks or ontology refinement and development. However, RDFS/OWL makes a distinction between concepts and individuals that may be inappropriate for certain modeling requirements; for example, while a term such as “US Presidents” could be considered as a sub-topic of “US Politics” since any document about the former could be considered also as part of the latter, the former is neither a sub-concept nor an instance of the latter in the set-theoretic setting of OWL.⁴³

⁴³ It is worth noting that OWL does provide means for meta-modeling (aka. punning), where concepts can be simultaneously considered as groups of individuals when reasoning at a terminological level, and as individuals when reasoning at an assertional level.

For such scenarios, the Simple Knowledge Organization System (SKOS) [207] was standardized for modeling more general forms of conceptual hierarchies, taxonomies, thesauri, etc., including semantic relations such as broader-than (e.g., hypernym-of), narrower-than (e.g., hyponym-of), exact-match (e.g., synonym-of), close-match (e.g., near-synonym-of), related (e.g., within-same-topic-as), etc.; the standard also offers properties to define primary labels and aliases for concepts.
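To make the SKOS representation concrete, the following sketch serializes extracted concepts as N-Triples lines using the SKOS properties for primary labels (`skos:prefLabel`), aliases (`skos:altLabel`) and the broader-than relation (`skos:broader`). The `http://example.org/concept/` namespace, the concept names and the helper function are assumptions for illustration; a real system would more likely use an RDF library than string formatting.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EX = "http://example.org/concept/"  # hypothetical namespace, an assumption

def skos_triples(concept, pref_label, alt_labels=(), broader=None):
    """Serialize one concept as N-Triples lines using SKOS properties."""
    subject = f"<{EX}{concept}>"
    lines = [f'{subject} <{SKOS}prefLabel> "{pref_label}"@en .']
    lines += [f'{subject} <{SKOS}altLabel> "{alt}"@en .' for alt in alt_labels]
    if broader is not None:
        lines.append(f"{subject} <{SKOS}broader> <{EX}{broader}> .")
    return lines

triples = (
    skos_triples("Migraine", "migraine", broader="Headache")
    + skos_triples("HeartAttack", "heart attack", alt_labels=["myocardial infarction"])
)
```

Here the hyponym “migraine” points to its hypernym “headache” via `skos:broader`, while the synonym “myocardial infarction” is recorded as an alternative label of the same concept.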

Aside from these Semantic Web standards, a number of other representational formats have been proposed in the literature. The Lexicon Model for Ontologies, aka. LEMON [53], was proposed as a format to associate ontological concepts with richer linguistic information, which, on a high level, can be seen as a model that bridges the world of formal ontologies and the world of natural language (written, spoken, etc.); the core LEMON concept is a lexical entry, which can be a word, affix or phrase (e.g., “cancer”); each lexical entry can have different forms (e.g., “cancers”, “cancerous”), and can have multiple senses (e.g., “ex:cancer_sense1” for medicine, “ex:cancer_sense2” for astrology, etc.); both lexical entries and senses can then be linked to their corresponding ontological concepts (or individuals).

Table 5

Overview of concept extraction & linking systems. Setting denotes the primary domain in which experiments are conducted; Goal indicates the stated aim of the system (KE: Keyphrase Extraction, OB: Ontology Building, SA: Semantic Annotation, TE: Terminology Extraction, TM: Topic Modeling); Recognition summarizes the term extraction method used; Filtering summarizes methods used to select suitable domain terms; Relations indicates the semantic relations extracted between terms in the text (Hyp.: Hyponyms, Syn.: Synonyms, Mer.: Meronyms); Linking indicates KBs/ontologies to which terms are linked; Reference indicates other sources used; Topic indicates the method(s) used to model (and label) topics. “—” denotes no information found, not used or not applicable

Name	Year	Setting	Goal	Recognition	Filtering	Relations	Linking	Reference	Topic
AllahyariK [5]	2015	Multi-domain	TM	Token-based	Statistical	—	DBpedia	Wikipedia	LDA/Graph
Canopy [144]	2013	Multi-domain	TM	—	—	—	DBpedia	Wikipedia	LDA/Graph
CardilloWRJVS [36]	2013	Medicine	TE	TExSIS	Manual	—	SNOMED, DBpedia	—	—
ChemuduguntaHSS [43]	2008	Science	SA	Token-based	Statistical	—	—	CIDE, ODP	LDA
ChenJYYZ [46]	2016	Comp. Sci., News	TM	DBpedia Spotlight	Graph/Stat.	—	DBpedia	—	pLSA/Graph
CimianoHS [52]	2005	Tourism, Finance	OB	POS-based	Statistical	Hyp.	—	—	—
CRCTOL [153]	2010	Terrorism, Sports	OB	POS-based	Statistical	Hyp.	—	WordNet	—
Distiller [73]	2014	—	KE	Stat./Lexical	Hybrid	—	DBpedia	—	—
DolbyFKSS [85]	2009	I.T., Energy	TE	POS-based	Statistical	—	DBpedia, Freebase	—	—
F-STEP [197]	2013	News	TE	WikiMiner	WikiMiner	Hyp.	DBpedia, Freebase	Wikipedia	—
FGKBTE [100]	2014	Crime, Terrorism	TE	Token-based	Statistical	—	FunGramKB	—	—
GillamTA [119]	2005	Nanotechnology	OB	Patterns	Hybrid	Hyp.	—	—	—
GullaBI [123]	2006	Petroleum	OB	POS-based	Statistical	—	—	—	—
JainP [149]	2010	Comp. Sci.	TM	POS-based	Stat./Manual	—	Custom onto.	—	Hierarchy
JanikK [150]	2008	News	TM	—	Statistical	—	Wikipedia	—	Graph
LauscherNRP [170]	2016	Politics	TM	DBpedia Spotlight	Statistical	—	DBpedia	—	L-LDA
LemnitzerVKSECM [173]	2007	E-Learning	OB	POS-based	Statistical	Hyp.	OntoWordNet	Web search	—
LiTeWi [60]	2016	Software, Science	OB	Ensemble	WikiMiner	—	Wikipedia	—	—
LossioJRT [308]	2016	Biomedical	TE	POS-based	Statistical	—	UMLS, SNOMED	Web search	—
MoriMIF [215]	2004	Social Data	KE	TermeX	Statistical	—	—	Google	—
MuñozGCHN [220]	2011	Telecoms	KE	POS-based	Hybrid	—	DBpedia	Wikipedia	—
OntoLearn [225,306]	2001	Tourism	OB	POS-based	Statistical	Various	—	WordNet, Google	—
OntoLT [32]	2004	Neurology	OB	POS-based	Statistical	Hyp.	Any	—	—
OSEE [161,162]	2012	Bioinformatics	SA	POS-based	KB-based	—	Gene Ontology	Various (OBO)	—
OwlExporter [314]	2010	Software	OB	POS-based	KB-based	—	Any	—	—
PIRATES [252]	2010	Software	SA	KEA	Hybrid	—	SEOntology	—	—
SPRAT [192]	2009	Fishery	OB	Patterns	TermRaider	Syn. Hyp.	FAO	WordNet	—
TaxoEmbed [6]	2016	Multi-domain	OB	—	KB-based	Hyp.	Wikidata, YAGO	Wikipedia, BabelNet	—
Text2Onto [55,186]	2005	—	OB	POS/JAPE	Statistical	Hyp. Mer.	—	WordNet	—
TodorLAP [298]	2016	News	TM	DBpedia Spotlight	—	Hyp.	DBpedia	Wikipedia	LDA
TyDI [227]	2010	Biotechnology	OB	—	YaTeA	Syn. Hyp.	—	—	—
VargaCRCH [305]	2014	Microposts	TM	OpenCalais	—	—	DBpedia, Freebase	—	Graph/ML
ZhangYT [325]	2009	News	TE	Manual	Statistical	—	—	WordNet	—

Along related lines, Hellmann et al. [131] propose the NLP Interchange Format (NIF), whose goal is to enhance the interoperability of NLP tools by using an ontology to describe common terms and concepts; the format can provide Linked Data as output for further data reuse. Other proposals have also been made in terms of publishing linguistic resources as Linked Data. Cimiano et al. [54] propose such an approach for publishing and linking terminological resources following the Linked Data principles, combining the LEMON, SKOS, and PROV-O vocabularies in their core model; OnLit was proposed by Klimek et al. [166] as a Linked Data version of the LiDo Glossary of Linguistic Terms; etc. For further information, we refer the reader to the editorial by McCrae et al. [195], which offers an overview of terminological/linguistic resources published as Linked Data.

3.6. System summary and comparison

Based on the previously discussed techniques, in Table 5, we provide an overview of highlighted CEL systems that deal with the Semantic Web in a direct way, and that have a publication offering details; a legend is provided in the caption of the table. Note that in the Recognition and Filtering columns, some systems delegate these tasks to external recognition tools – such as DBpedia Spotlight [200], KEA [315], OpenCalais,⁴⁴ TermeX [76], TermRaider,⁴⁵ TExSIS [185], WikiMiner [210], YaTeA [9] – which are indicated in the respective columns as appropriate.

⁴⁴ http://www.opencalais.com/

⁴⁵ https://gate.ac.uk/projects/neon/termraider.html

The approaches reviewed in this section might be applied for diverse and heterogeneous cases. Thus, comparing CEL approaches is not a trivial task; however, we can mention some general aspects and considerations when choosing a particular CEL approach.

Target task: As per the Goal column of Table 5, the surveyed approaches cover a number of different related tasks with different applications. These include Keyphrase Extraction, Ontology Building, Semantic Annotation, Terminology Extraction, Topic Modeling, etc. An important consideration when choosing an approach is thus to select one that best fits the given application.

Language: Although the approaches described in this section provide strategies for processing text in English, different languages can also be covered in the TE/KE tasks using similar techniques (e.g., the approach proposed by Lemnitzer et al. [173]). Moreover, term translation is the focus of some approaches, such as OntoLearn [225,306]. Finally, KBs such as DBpedia or BabelNet offer multilingual resources that can support the extraction of terms in varied languages.

Output quality: The quality of CEL output depends on numerous factors, such as the ontologies and datasets used, the manual intervention involved in the process, etc. For example, approaches such as OSEE [161,162], OntoLearn [225,306], or that proposed by Chemudugunta et al. [43], rely on ontologies for recognizing or filtering terms; while this strategy can increase precision, terms not covered by the ontology may not be identified at all, which may lower recall. On the other hand, approaches by Cardillo et al. [36] and Zhang et al. [325] involve manual intervention; although this is a costly process, it can help ensure a higher quality result over smaller input corpora of text.

Domain: Although a specific domain is commonly used for testing (e.g. Biomedical, Finance, News, Terrorism, etc.), CEL approaches rely on NLP tools that can be employed for varied domains in a general fashion. However, some CEL approaches may be built with a specific KB in mind. For example, Cardillo et al. [36] and Lossio et al. [308] use SNOMED-CT for the medical domain, while F-STEP [197], Distiller [73], and Dolby et al. [85] use DBpedia. On the other hand, KB-agnostic approaches such as OntoLT [32] and OwlExporter [314] generalize to any KB/domain.

Text characteristics/Recognition: Different features can be used during the extraction and filtering of terms and topics. For example, some systems deal with the recognition of multi-word expressions (e.g., OntoLearn [225,306]), or contextual features provided by the FCA (as proposed by Cimiano et al. [52]) or position of words in a text (e.g., PIRATES [252]). Such a selection of features may influence the results in different ways for particular applications; since it may be difficult to anticipate a priori how such factors will affect an application, it may be best to evaluate such approaches empirically for the particular setting.

Efficiency and scalability: When faced with a large input corpus, the efficiency of a CEL approach can become a major factor. Some CEL approaches rely on computationally-expensive NLP tasks (e.g., deep parsing), while other approaches rely on more lightweight statistical tasks to extract and filter terms. Further steps to extract a hierarchy or link terms with a KB may introduce additional computational cost. Unfortunately, however, efficiency (in terms of runtimes) is generally not reported in the CEL papers surveyed, which rather focus on metrics to assess output quality.

We can conclude that comparing CEL approaches is complicated not only by the diversity of methods proposed and the goals targeted, but also by a lack of standardized, comparative evaluation frameworks; we will discuss this issue in the following subsection.

3.7. Evaluation

Given the diversity of approaches gathered together in this section, we remark that the evaluation strategies employed are likewise equally diverse. In particular, evaluation varies depending on the particular task considered (be it TE, KE, TM or some combination thereof) and the particular application in mind (be it ontology building, text classification, etc.). Evaluation in such contexts is often further complicated by the potentially subjective nature of the goal of such approaches. When assessing the quality of the output, some questions may be straightforward to answer, such as: Is this phrase a cohesive term (unithood/precision)? On the other hand, evaluation must somehow deal with more subjective domain-related questions, such as: Is this a domain-relevant term (termhood/precision)? Have we captured all domain-relevant terms appearing in the text (recall)? Is this taxonomy of terms correct (precision)? Does this label represent the terms forming this topic (precision)? Does this document have these topics (precision)? Are all topics of the document captured (recall)? And so forth. Such questions are inherently subjective, may raise disagreement amongst human evaluators [173], and may require expertise in the given domain to answer adequately.

Datasets: For evaluating CEL approaches, notably there are many Web-based corpora that have been pre-classified with topics or keywords, often annotated by human experts – such as users, moderators or curators of a particular site – through, for example, tagging systems. These can be reused for evaluation of domain extraction tools, in particular to see if automated approaches can recreate the high-quality classifications or topic models inherent in such corpora. Some such corpora that have been used include: BBC News⁴⁶ [144,298], British Academic Written Corpus⁴⁷ [5,144], British National Corpus⁴⁸ [52], CNN News [150], DBLP⁴⁹ [46], eBay⁵⁰ [74], Enron Emails,⁵¹ Twenty Newsgroups⁵² [46,298], Reuters News⁵³ [52,174,325], StackExchange⁵⁴ [144], Web News [298], Wikipedia categories [64], Yahoo! categories [297], and so forth. Other datasets have been specifically created as benchmarks for such approaches. In the context of KE, for example, gold standards such as SemEval [163] and Crowd500 [189] have been created, with the latter being produced through a crowdsourcing process; such datasets were used by Gagnon et al. [111] and Jean-Louis et al. [151] for KE-related third-party evaluations. In the context of TE, existing domain ontologies can be used – or manually created and linked with text – to serve as a gold standard for evaluation [27,32,52,162,227,325].

⁴⁶ http://mlg.ucd.ie/datasets/bbc.html

⁴⁷ http://www2.warwick.ac.uk/fac/soc/al/research/collections/bawe/

⁴⁸ http://www.natcorp.ox.ac.uk/

⁴⁹ http://dblp.uni-trier.de/db/

⁵⁰ http://www.ebay.com

⁵¹ https://www.cs.cmu.edu/~./enron/

⁵² https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

⁵³ http://www.daviddlewis.com/resources/testcollections/reuters21578/ and http://www.daviddlewis.com/resources/testcollections/rcv1/.

⁵⁴ https://archive.org/details/stackexchange

Rather than employing a pre-annotated gold standard, an alternative strategy is to apply the approach under evaluation to a non-annotated corpus and thereafter seek human judgment on the output, typically in comparison with baseline approaches from the literature. Such an approach is employed in the context of TE by Nédellec et al. [227], Kim and Tuan [162], and Dolby et al. [85]; or in the context of KE by Muñoz-García et al. [220]; or in the context of TM by Hulpuş et al. [144] and Lauscher et al. [170]; and so forth. In such evaluations, TM approaches are often compared against traditional approaches such as LDA [5], PLSA [46], hierarchical clustering [127], etc.

Metrics: The most commonly used metrics for evaluating CEL approaches are precision, recall, and $F_{1}$ measure. When comparing with baseline approaches over a priori non-annotated corpora, recall can be a manually-costly aspect to assess; hence comparative rather than absolute measures such as relative recall are sometimes used [162]. In cases where results are ranked, precision@k measures may be used to assess the quality of the top-k terms extracted [308].
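As a minimal sketch of the metrics just mentioned, the following Python snippet computes set-based precision, recall and $F_{1}$ for a list of extracted terms against a gold standard, along with precision@k over a ranked list. The function names and the toy term lists are hypothetical and chosen only for illustration; the surveyed systems may match terms less strictly than the exact string matching assumed here.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 of extracted terms against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked terms found in the gold standard.

    Divides by k even if fewer than k terms were returned (one common convention).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for term in ranked[:k] if term in gold) / k


# Hypothetical toy data: a ranked list of extracted terms and a gold standard.
ranked_terms = ["entity linking", "ontology", "web page", "topic model", "last year"]
gold_terms = {"knowledge base", "entity linking", "ontology", "topic model"}

precision, recall, f1 = precision_recall_f1(ranked_terms, gold_terms)
p_at_2 = precision_at_k(ranked_terms, gold_terms, k=2)
```

Note that recall requires the full set of gold terms, which is why it is costly to assess over corpora that were not annotated a priori.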

Particular tasks may be associated with specialized measures. Evaluations in the area of Topic Modeling may also consider perplexity (a measure derived from the log-likelihood of a held-out test set, where lower values indicate a better fit) and/or coherence [266], where a topic is considered coherent if it covers a high ratio of the words/terms appearing in the textual context from which it was extracted (measured by metrics such as Pointwise Mutual Information (PMI) [60], Normalized Mutual Information (NMI), etc.). On the other hand, for Terminology Extraction approaches with hierarchy induction, measures such as Semantic Cotopy [52] and Taxonomic Overlap [52] may be used to compare the output hierarchy with that of a gold-standard ontology.
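To make two of these measures concrete, the sketch below computes perplexity in its usual form, as the exponential of the negative average per-token log-likelihood of held-out text, and a document-level PMI for a pair of words. The function names and inputs are illustrative assumptions: the log-probabilities would come from a trained topic model, and coherence scores are often obtained by aggregating PMI over pairs of a topic's top words, which is not shown here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model assigns to held-out tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def pmi(word_a, word_b, documents):
    """Document-level PMI of two words; `documents` is a list of sets of words."""
    if not documents:
        raise ValueError("need at least one document")
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0:
        return float("-inf")  # the two words never co-occur in a document
    return math.log(p_ab / (p_a * p_b))


# Hypothetical toy corpus, given as one set of words per document.
toy_documents = [{"a", "b"}, {"a", "b"}, {"c"}, {"d"}]
```

A model that assigns uniform probability 1/4 to every held-out token has a perplexity of 4, which matches the intuition of perplexity as an effective branching factor.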

In cases where a fixed number of users/judges are in charge of developing a gold-standard dataset or assessing the output of systems in an a posteriori manner, the agreement among them is expressed as Cohen’s or Fleiss’ κ-measure. In this sense, Randolph [254] provides a typical description (and usage examples) of the κ-measure, where the considered aspects are the number of cases or instances to evaluate, the number of human judges, the number of categories, and the number of judges who assigned the case to the same category.
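As an illustration of how these aspects combine, the following sketch computes Fleiss' κ from a table whose rows are the cases, whose columns are the categories, and whose cells count the judges who assigned that case to that category. This is the standard fixed-marginal formulation written from its textbook definition, not code taken from the cited description; the function name and the toy ratings are assumptions.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i][j] = number of judges assigning case i to category j."""
    n_cases = len(ratings)
    n_categories = len(ratings[0])
    n_judges = sum(ratings[0])
    if n_judges < 2:
        raise ValueError("need at least two judges per case")
    if any(sum(row) != n_judges for row in ratings):
        raise ValueError("every case must be rated by the same number of judges")

    # Observed agreement: average proportion of agreeing judge pairs per case.
    per_case = [
        (sum(count * count for count in row) - n_judges) / (n_judges * (n_judges - 1))
        for row in ratings
    ]
    observed = sum(per_case) / n_cases

    # Chance agreement: based on the overall share of assignments per category.
    shares = [
        sum(row[j] for row in ratings) / (n_cases * n_judges)
        for j in range(n_categories)
    ]
    expected = sum(share * share for share in shares)

    if expected == 1.0:
        return 1.0  # degenerate case: all judges always chose the same single category
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 4 cases, 2 categories, 3 judges per case.
perfect_agreement = [[3, 0], [0, 3], [3, 0], [0, 3]]
split_votes = [[2, 1], [1, 2], [2, 1], [1, 2]]
```

Values close to 1 indicate strong agreement, while values at or below 0 indicate agreement no better than chance.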

Third-party comparisons: To the best of our knowledge, there has been little work on third-party evaluations for comparing CEL approaches, perhaps because of the diversity of goals and methods applied, the aforementioned challenges associated with evaluating CEL systems, etc. Among the available results, Gangemi [112] compared three commercial tools – Alchemy, OpenCalais and PoolParty – for Topic Extraction, where Alchemy had the highest F-measure. In a separate evaluation for Terminology Extraction – comparing Alchemy, CiceroLite, FOX [288], FRED [113] and Wikimeta – Gangemi [112] reported that FRED [113] had the highest F-measure.⁵⁵

⁵⁵ We do not include Alchemy, CiceroLite, or Wikimeta in the discussion of Table 5 since we have not found any publication providing details on their implementation. Though FOX [288] and FRED [113] do have publications, few details are provided on how CEL is implemented. We will, however, discuss FRED [113] in the context of Relation Extraction in Section 4.

3.8. Summary

In this section, we gather together three approaches for extracting domain-related concepts from a text: Terminology Extraction (TE), Keyphrase Extraction (KE), and Topic Modeling (TM). While the first task is typically concerned with applications involving ontology building, or otherwise extracting a domain-specific terminology from an appropriate corpus, the latter two tasks are typically concerned with understanding the domain of a given text. As we have seen in this section, all tasks relate in important aspects, particularly in the identification of domain-relevant concepts; indeed, TE and TM further share the goal of extracting relationships between the extracted terms, be it to induce a hierarchy of hypernyms, to find synonyms, or to find thematic clusters of terms.

While all of the discussed approaches rely – to varying degrees – on techniques proposed in the traditional IE/NLP literature for such tasks, the use of reference ontologies and/or KBs has proven useful for a number of technical aspects inherent to these tasks:

using the entity labels and aliases of a domain-specific KB as a dictionary to guide the extraction of conceptual domain terms in a text [162,308];

linking terms to KB concepts and using the semantic relations in the KB to find (un)related terms [5,46,144,150,197];

enriching text with additional information taken from the KB [298];

classifying text with respect to an ontological concept hierarchy [149];

building topic models that include semantic relations from the KB [5,46];

determining topic labels/identifiers based on the centrality of nodes in the KB graph [5,46,144];

representing and integrating terminological knowledge [36,53,54,131,195,207].

On the other hand, such processes are also often used to create or otherwise enrich Semantic Web resources, such as for ontology building applications, where TE can be used to extract a set of domain-relevant concepts – and possibly some semantic relations between them – to either seed or enhance the creation of a domain-specific ontology [49,51,55,225,306,316].

3.9. Open questions

Although the tasks involved in CEL are used in varied applications and domains, some general open questions can be abstracted from the previous discussion, where we highlight the following:

Interrelation of Tasks: Under the heading of CEL, in this survey we have grouped three main tasks: Terminology Extraction (TE), Keyphrase Extraction (KE) and Topic Modeling (TM). These tasks are often considered in isolation – sometimes by different communities and for different applications – but as per our discussion, they also share clear overlaps. An important open question is thus with respect to how these tasks can be generalized, how approaches to each task can potentially complement each other or, conversely, where the objectives of such tasks necessarily diverge and require distinct techniques to solve.

Specialized Settings/Multilingual CEL: As was previously discussed for EEL, most approaches for CEL consider complete texts of high quality, such as technical documents, papers, etc. On the other hand, CEL has not been well explored in other settings, such as online discussion fora, Twitter, etc., which may present a different set of challenges. Likewise, most focus has been on English texts, where works such as LEMON [53] highlight the impact that approaches considering multiple languages could have.

Contextual Information: As previously discussed, TE, KE, and TM can be used to support the extraction of topics from text. As a consequence, such topics can be used to enrich the input text and existing KBs (such as DBpedia). This would be useful in scenarios where input documents need to include further contextual information or to be organized into (potentially new) categories.

Crowdsourcing: A challenging aspect of CEL is the inherent subjectivity with respect to evaluating the output of such methods. Hence some approaches have proposed to leverage crowdsourcing, amongst which we can mention the creation of the Crowd500 dataset [189] for evaluating KE approaches. An open question is then how to develop this idea further and leverage crowdsourcing for solving specific sub-tasks of CEL or for supporting evaluation, including for other tasks relating to TE or TM (though of course such an approach may not be suitable for specialist domains requiring expert knowledge).

Linking: While a variety of approaches have been proposed to extract terminology/keyphrases from text, only more recently has there been a trend towards linking such mentions with a KB [5,46,144,150,162]. While such linking could be seen as a form of EEL, and as being related to word sense disambiguation, it is not necessarily subsumed by either: many EEL approaches tend to focus on mentions of named entities, while word sense disambiguation does not typically consider multi-term phrases. Hence, an interesting subject is to explore custom techniques for linking the conceptual terminology/keyphrases produced by the TE and KE processes to a given KB, which may include, for example, synonym expansion, domain-specific filtering, etc.

Benchmark Framework: In Section 3.7, we discussed the diverse ways in which CEL approaches are evaluated. While a number of gold standard datasets have been proposed for specific tasks [163,189], we did not encounter much reuse of such datasets, where the only systematic third-party evaluation we could find for such approaches was that conducted by Gangemi [112]. We thus observe a strong need for a standardized benchmarking framework for CEL approaches, with appropriate metrics and datasets. Such a framework would need to address a number of key open questions relating to the diversity of approaches that CEL encompasses, as well as the subjectivity inherent in its goals (e.g., deciding what terminology is important to a domain, what keyphrases are important to a document, etc.).

4. Relation extraction & linking

At the heart of any Semantic Web KB are relations between entities [279]. Thus an important traditional IE task in the context of the Semantic Web is Relation Extraction (RE), which is the process of finding relationships between entities in the text. Unlike the tasks of Terminology Extraction or Topic Modeling that aim to extract fixed relationships between concepts (e.g., hypernymy, synonymy, relatedness, etc.), RE aims to extract instances of a broader range of relations between entities (e.g., born-in, married-to, interacts-with, etc.). Relations extracted may be binary relations or even higher arity n-ary relations. When the predicate of the relation – and the entities involved in it – are linked to a KB, or when appropriate identifiers are created for the predicate and entities, the results can be used to (further) populate the KB with new facts. However, first it is also necessary to represent the output of the Relation Extraction process as RDF: while binary relations can be represented directly as triples, n-ary relations require some form of reified model to encode. By Relation Extraction and Linking (REL), we then refer to the process of extracting and representing relations from unstructured text, subsequently linking their elements to the properties and entities of a KB.

In this section, we thus discuss approaches for extracting relations from text and linking their constituent predicates and entities to a given KB; we also discuss representations used to encode REL results as RDF for subsequent inclusion in the KB.

Example: In Listing 3, we provide a hypothetical (and rather optimistic) example of REL with respect to DBpedia. Given a textual statement, the output provides an RDF representation of entities interacting through relationships associated with properties of an ontology. Note that there may be further information in the input not represented in the output, such as that the entity “Bryan Lee Cranston” is a person, or that he is particularly known for portraying “Walter White”; the number and nature of the facts extracted depend on many factors, such as the domain, the ontology/KB used, the techniques employed, etc.

Listing 3.

Relation extraction and linking example

Note that Listing 3 exemplifies direct binary relations. Many REL systems rather extract n-ary relations with generic role-based connectives. In Listing 4, we provide a real-world example given by the online FRED demo;⁵⁶ for brevity, we exclude some output triples not directly pertinent to the example. Here we see that relations are rather represented in an n-ary format, where for example, the relation “portrays” is represented as an RDF resource connected to the relevant entities by role-based predicates, where “Walter White” is given by the predicate dul:associatedWith, “Breaking Bad” is given by the predicate fred:in, and “Bryan Lee Cranston” is given by an indirect path of three predicates vn.role:Agent/fred:for⁻/vn.role:Theme; the type of relation is then given as fred:Portray.

⁵⁶ http://wit.istc.cnr.it/stlab-tools/fred/demo

Listing 4.

FRED relation extraction example

Applications: One of the main applications for REL is KB Population,⁵⁷ where relations extracted from text are added to the KB. For example, REL has been applied to (further) populate KBs in the domains of Medicine [236,273], Terrorism [148], Sports [260], among others. Another important application for REL is to perform Structured Discourse Representation, where arguments implicit in a text are parsed and potentially linked with an ontology or KB to explicitly represent their structure [12,109,113]. REL is also used for Question Answering (Q&A) [300], whose purpose is to answer natural language questions over KBs, where approaches often begin by applying REL on the question text to gain an initial structure [317,333]. Other interesting applications have been to mine deductive inference rules from text [178], or for pattern recognition over text [120], or to verify or provide textual references for existing KB triples [106].

⁵⁷ Also known as Ontology Population [49,57,61,192,314].

Process: The REL process can vary depending on the particular methodology adopted. Some systems rely on traditional RE processes (e.g., [92,93,223]), where extracted relations are linked to a KB after extraction; other REL systems – such as those based on distant supervision – use binary relations in the KB to identify and generalize patterns from text mentioning the entities involved, which are then used to subsequently extract and link further relations. Generalizing, we structure this section as follows. First, many (mostly distant supervision) REL approaches begin by identifying named entities in the text, either through NER (generating raw mentions) or through EEL (additionally providing KB identifiers). Second, REL requires a method for parsing relations from text, which in some cases may involve using a traditional RE approach. Third, distant supervision REL approaches use existing KB relations to find and learn from example relation mentions, from which general patterns and/or features are extracted and used to generate novel relations. Fourth, an REL approach may apply a clustering procedure to group relations based on hypernymy or equivalence. Fifth, REL approaches – particularly those focused on extracting n-ary relations – must define an appropriate RDF representation to serialize output relations. Finally, in order to link the resulting relations to a given KB/ontology, REL often considers an explicit mapping step to align identifiers.

4.1. Entity extraction (and linking)

The first step of REL often consists of identifying entity mentions in the text. Here we distinguish three strategies, where, in general, many works follow the EEL techniques previously discussed in Section 2, particularly those adopting the first two strategies.

The first strategy is to employ an end-to-end EEL system – such as DBpedia Spotlight [200], Wikifier [97], etc. – to match entities in the raw text. The benefit of this strategy is that KB identifiers are directly identified for subsequent phases.

The second strategy is to employ a traditional NER tool – often from Stanford CoreNLP [11,113,213,223,233] – and potentially link the resulting mentions to a KB. This strategy has the benefit of being able to identify mentions of emerging entities, allowing relations to be extracted about entities not already in the KB.

The third strategy is to skip the NER/EL phase and instead directly apply an off-the-shelf RE/OpenIE tool or an existing dependency-based parser (discussed later) over the raw text to extract relational structures; such structures then embed parsed entity mentions over which EEL can be applied (potentially using an existing EEL system such as DBpedia Spotlight [109] or TagMe [101]). This has the benefit of using established RE techniques and potentially capturing emerging entities; however, such a strategy does not leverage knowledge of existing relations in the KB for extracting relation mentions (since relations are extracted prior to accessing the KB).

In summary, entities may be extracted and linked before or after relations are extracted. Processing entities before relations can help to filter sentences that do not involve relationships between known entities, to find examples of sentences expressing known relations in the KB for training purposes, etc. On the other hand, processing entities after relations allows direct use of traditional RE and OpenIE tools, and may help to extract more complex (e.g., n-ary) relations involving entities that are not supported by a particular EEL approach (e.g., emerging entities), etc. These issues will be discussed further in the sections that follow.

In the context of REL, when extracting and linking entities, Coreference Resolution (CR) plays a very important role. While other EEL applications may not require capturing every coreference in a text – e.g., it may be sufficient to capture that the entity is mentioned at least once in a document for semantic search or annotation tasks – in the context of REL, not capturing coreferences will potentially lose many relations. Consider again Listing 3, where the second sentence begins “He is known for ...”; CR is necessary to understand that “He” refers to “Bryan Lee Cranston”, to extract that he portrays “Walter White”, and so forth. In Listing 4, the portrays relation is connected (indirectly) to the node identifying “Bryan Lee Cranston”; this is possible because FRED uses Stanford CoreNLP’s CR methods. A number of other REL systems [97,116] likewise apply CR to improve recall of relations extracted.

4.2. Parsing relations

The next phase of REL systems often involves parsing structured descriptions from relation mentions in the text. The complexity of such structures can vary widely depending on the nature of the relation mention, the particular theory by which the mention is parsed, the use of pronouns, and so forth. In particular, while some tools rather extract simple binary relations of the form $p (s, o)$ with a designated subject–predicate–object, others may apply a more abstract semantic representation of n-ary relations with various dependent terms playing various roles.

In terms of parsing simpler binary relations, as mentioned previously, a number of tools use existing OpenIE systems, which apply a recursive extraction of relations from webpages, where extracted relations are used to guide the process of extracting further relations. In this setting, for example, Dutta et al. [93] use NELL [214] and ReVerb [99], Liu et al. [182] use PATTY [223], while Soderland and Mandhani [285] use TextRunner [16] to extract relations; these relations are later linked with an ontology or KB.

In terms of parsing potentially more complex n-ary relations, a variety of methods can be applied. A popular method is to begin with a dependency-based parse of the relation mention. For example, Grafia [109] uses a Stanford PCFG parser to extract dependencies in a relation mention, over which CR and EEL are subsequently applied. Likewise, other approaches using a dependency parser to extract an initial syntactic structure from relation mentions include DeepDive [233], PATTY [223], Refractive [98] and works by Mintz et al. [213], Nguyen and Moschitti [232], etc.

Other works rather apply higher-level theories of language understanding to the problem of modeling relations. One such theory is that of frame semantics [103], which considers that people understand sentences by recalling familiar structures evoked by a particular word; a common example is that of the term “revenge”, which evokes a structure involving various constituents, including the avenger, the retribution, the target of revenge, the original victim being avenged, and the original offense. These structures are then formally encoded as frames, categorized by the word senses that evoke the frame, encapsulating the constituents as frame elements. Various collections of frames have then been defined – with FrameNet [14] being a prominent example – to help identify frames and annotate frame elements in text. Such frames can be used to parse n-ary relations, as used for example by Refractive [98], PIKES [61] or Fact Extractor [106].

A related theory used to parse complex n-ary relations is that of Discourse Representation Theory (DRT) [156], which offers a more logic-based perspective for reasoning about language. In particular, DRT is based on the idea of Discourse Representation Structures (DRS), which offer a first-order-logic (FOL) style representation of the claims made in language, incorporating n-ary relations, and even allowing negation, disjunction, equalities, and implication. The core idea is to build up a formal encoding of the claims made in a discourse spanning multiple sentences where the equality operator, in particular, is used to model coreference across sentences. These FOL style formulae are contextualized as boxes that indicate conjunction.

Tools such as Boxer [28] then allow for extracting such DRS “boxes” following a neo-Davidsonian representation, which at its core involves describing events. Consider the example sentence “Barack Obama met Raul Castro in Cuba”; we could consider representing this as $meet (BO, RC, CU)$ with $BO$ denoting “Barack Obama”, etc.⁵⁸

⁵⁸ Here we use a rather distinct representation of arguments in the relation for space/visual reasons and to follow the notation used by Boxer (which is based on variables).

Now consider “Barack Obama met with Raul Castro in 2016”; if we represent this as $meet (BO, RC, 2016)$, we see that the meaning of the third argument $2016$ conflicts with the role of $CU$ as a location earlier even though both are prefixed by the preposition “in”. Instead, we will create an existential operator to represent the meeting; considering “Barack Obama met briefly with Raul Castro in 2016 while in Cuba”, we could write (e.g.):

$$\begin{array}{l} \exists e : meet (e), Agent (e, BO), CoAgent (e, RC), \\ briefly (e), Theme (e, CU), Time (e, 2016) \end{array}$$

where $e$ denotes the event being described, essentially decomposing the complex n-ary relation into a conjunction of unary and binary relations.⁵⁹

⁵⁹ One may note that this is analogous to the same process of representing n-ary relations in RDF [134].
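The decomposition described above can be sketched in a few lines of Python. The snippet below takes an event variable, a predicate, a set of role fillers and optional modifiers, and emits the corresponding unary and binary atoms; it then renders them as RDF-like triples. The function names and the choice of rdf:type for unary atoms are our own assumptions for illustration and do not reflect the actual output vocabulary of Boxer or any REL tool.

```python
def decompose_event(event, predicate, roles, modifiers=()):
    """Decompose an n-ary relation into unary and binary atoms about one event variable."""
    atoms = [(predicate, event)]                                  # e.g., meet(e)
    atoms.extend((modifier, event) for modifier in modifiers)     # e.g., briefly(e)
    atoms.extend((role, event, filler) for role, filler in roles.items())  # e.g., Agent(e, BO)
    return atoms


def atoms_to_triples(atoms):
    """Render atoms as triples: unary atoms as rdf:type (an assumed encoding), binary as-is."""
    triples = []
    for atom in atoms:
        if len(atom) == 2:
            predicate, event = atom
            triples.append((event, "rdf:type", predicate))
        else:
            role, event, filler = atom
            triples.append((event, role, filler))
    return triples


# "Barack Obama met briefly with Raul Castro in 2016 while in Cuba"
atoms = decompose_event(
    "e",
    "meet",
    {"Agent": "BO", "CoAgent": "RC", "Theme": "CU", "Time": "2016"},
    modifiers=["briefly"],
)
triples = atoms_to_triples(atoms)
```

Since every atom mentions the same event variable, further roles (e.g., a purpose or an instrument) can be added without changing the arity of any existing relation, which is the main appeal of this representation.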

Note that expressions such as $Agent (e, BO)$ are considered as semantic roles, contrasted with syntactic roles; if we consider “Barack Obama met with Raul Castro”, then $BO$ has the syntactic role of subject and $RC$ the role of object, but if we swap – “Raul Castro met with Barack Obama” – while the syntactic roles swap, we see little difference in semantic meaning: both $BO$ and $RC$ play the semantic role of (co-)agents in the event. The roles played by members in an event denoted by a verb are then given by various syntactic databases, such as VerbNet [165]⁶⁰ and PropBank [164]. The Boxer [28] tool then uses VerbNet to create DRS-style boxes encoding such neo-Davidsonian representations of events denoted by verbs. In turn, REL tools such as LODifier [12] and FRED [113] (see Listing 4) use Boxer to extract relations encoded in these DRS boxes.

⁶⁰ See https://verbs.colorado.edu/verb-index/vn/meet-36.3.php.

4.3. Distant supervision

There are a number of significant practical shortcomings of using resources such as FrameNet, VerbNet, and PropBank to extract relations. First, being manually-crafted, they are not necessarily complete for all possible relations and syntactic patterns that one might consider and, indeed, are often only available in English. Second, the parsing method involved may be quite costly to run over all sentences in a very large corpus. Third, the relations extracted are complex and may not conform to the typically binary relations in the KB; creating a posteriori mappings may be non-trivial.

An alternative data-driven method for extracting relations – based on distant supervision⁶¹ – has thus become increasingly popular in recent years, with a seminal work by Mintz et al. [213] leading to a flurry of later refinements and extensions. The core hypothesis behind this method is that given two entities with a known relation in the KB, sentences in which both entities are mentioned in a text are likely to also mention the relation. Hence, given a KB predicate (e.g., dbo:genre), we can consider the set of known binary relations between pairs of entities from the KB (e.g., (dbr:Breaking_Bad,dbr:Drama), (dbr:X_Files,dbr:Science_Fiction), etc.) with that predicate and look for sentences that mention both entities, hypothesizing that the sentence offers an example of a mention of that relation (e.g., “in the drama series Breaking Bad”, or “one of the most popular Sci-Fi shows was X-Files”). From such examples, patterns and features can be generalized to find fresh KB relations involving other entities in similar such mentions appearing in the text.

⁶¹ Also known as weak supervision [141].
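The core hypothesis of distant supervision described above can be illustrated with a small sketch that, for the entity pairs of one KB predicate, collects the sentences mentioning both entities. The function name, the label dictionary and the sentences are hypothetical; in particular, the naive case-insensitive substring matching used here is only a stand-in for the EEL step that real systems would apply.

```python
def find_training_mentions(kb_pairs, labels, sentences):
    """Collect (subject, object, sentence) examples where both entities are mentioned.

    kb_pairs: (subject, object) pairs known to hold for one KB predicate.
    labels:   maps each entity identifier to a list of surface forms (assumed given).
    """
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for subj, obj in kb_pairs:
            mentions_subj = any(label.lower() in lowered for label in labels[subj])
            mentions_obj = any(label.lower() in lowered for label in labels[obj])
            if mentions_subj and mentions_obj:
                examples.append((subj, obj, sentence))
    return examples


# Known pairs for the predicate dbo:genre, taken from the running example.
genre_pairs = [("dbr:Breaking_Bad", "dbr:Drama"), ("dbr:X_Files", "dbr:Science_Fiction")]
surface_forms = {
    "dbr:Breaking_Bad": ["Breaking Bad"],
    "dbr:Drama": ["drama"],
    "dbr:X_Files": ["X-Files"],
    "dbr:Science_Fiction": ["Sci-Fi", "science fiction"],
}
corpus = [
    "He starred in the drama series Breaking Bad.",
    "One of the most popular Sci-Fi shows was X-Files.",
    "Breaking Bad premiered in 2008.",
]
examples = find_training_mentions(genre_pairs, surface_forms, corpus)
```

Only the first two sentences are kept as candidate mentions of dbo:genre; the third mentions just one entity of a known pair. Such examples are noisy by construction, since co-occurrence of two entities does not guarantee that the sentence expresses the KB relation.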

The first step for distant supervision methods is to find sentences containing mentions of two entities that have a known binary relation in the KB. This step essentially relies on the EEL process described earlier and can draw on techniques from Section 2. Note that examples may be drawn from external documents, where, for example, Sar-graphs [167] proposes to use Bing’s Web search to find documents containing both entities in an effort to build a large collection of example mentions for known KB relations. In particular, being able to draw from more examples allows for increasing the precision and recall of the REL process by finding better quality examples for training [233].

Once a list of sentences containing pairs of entities is extracted, these sentences need to be analyzed to extract patterns and/or features that can be applied to other sentences. For example, as a set of lexical features, Mintz et al. [213] propose to use the sequence of words between the two entities, to the left of the first entity and to the right of the second entity; a flag to denote which entity came first; and a set of POS tags. Other features proposed in the literature include matching the label of the KB property (e.g., dbo:birthPlace – “Birth Place”) and the relation mention for the associated pair of entities (e.g., “was born in”) [182]; the number of words between the two entity mentions [117]; the frequency of n-grams appearing in the text window surrounding both entities, where more frequent n-grams (e.g., “was born in”) are indicative of general patterns rather than specific details for the relation of a particular pair of entities (e.g., “prematurely in a taxi”) [223], etc.
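As a rough illustration of such lexical features, the following sketch computes the word sequence between the two mentions, a small window on either side, the order flag and the word distance. The function name, the window size and the feature names are assumptions made here for illustration; POS tags are omitted since they would require an external tagger.

```python
def lexical_features(tokens, span1, span2, window=2):
    """Lexical features for two entity mentions in one tokenized sentence.

    tokens: list of word tokens.
    span1, span2: (start, end) token offsets of the mentions, end exclusive.
    """
    first, second = sorted([span1, span2])
    return {
        "between": tuple(tokens[first[1]:second[0]]),
        "left": tuple(tokens[max(0, first[0] - window):first[0]]),
        "right": tuple(tokens[second[1]:second[1] + window]),
        # Flag denoting which entity came first in the sentence.
        "first_entity_is_span1": first == span1,
        # Number of words between the two mentions.
        "words_between": second[0] - first[1],
    }


tokens = "The singer Elvis Presley was born in Tupelo in 1935".split()
features = lexical_features(tokens, (2, 4), (7, 8))
```

Here the mention spans are given by hand; in practice they would come from the entity recognition step.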

Aside from shallow lexical features, systems often parse the example relations to extract syntactic dependencies between the entities. A common method, again proposed by Mintz et al. [213] in the context of distant supervision, is to consider dependency paths, which are (shortest) paths in the dependency parse tree between the two entities; they also propose to include window nodes – terms on either side of the dependency path – as a syntactic feature to capture more context. Both the lexical and syntactic features proposed by Mintz et al. were then reused in a variety of subsequent related works using distant supervision, including Knowledge Vault [86], DeepDive [233], and many more besides.
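A dependency path of this kind can be sketched as a shortest-path search over the parse tree. The edges below are written by hand for the sentence "Elvis was born in Tupelo" and are an assumption, not the output of a particular parser; edge labels and window nodes are left out for brevity.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over an undirected view of the dependency tree."""
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for node in neighbours.get(path[-1], []):
            if node not in visited:
                visited.add(node)
                queue.append(path + [node])
    return None  # the two tokens are not connected


# (head, dependent) pairs, hand-written for illustration.
edges = [("born", "Elvis"), ("born", "was"), ("born", "Tupelo"), ("Tupelo", "in")]
path = shortest_dependency_path(edges, "Elvis", "Tupelo")
```

The resulting path passes through the verb "born" and skips the auxiliary and the preposition, which is what makes such paths more general than the raw word sequence.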

Once a set of features is extracted from the relation mentions for pairs of entities with a known KB relation, the next step is to generalize and apply those features for other sentences in the text. Mintz et al. [213] originally proposed to use a multi-class logistic regression classifier: for training, the approach extracts all features for a given entity pair (with a known KB relation) across all sentences in which that pair appears together, which are used to train the classifier for the original KB relation; for classification, all entities are identified by Stanford NER, and for each pair of entities appearing together in some sentence, the same features are extracted from all such sentences and passed to the classifier to predict a KB relation between them.

A variety of works followed up on and further refined this idea. For example, Riedel et al. [257] note that many sentences containing the entity pair will not express the KB relation and that a significant percentage of entity pairs will have multiple KB relations; hence combining features for all sentences containing the entity pair produces noise. To address this issue, they propose an inference model based on the assumption that, for a given KB relation between two entities, at least one sentence (rather than all) will constitute a true mention of the relation; this is realized by introducing a set of binary latent variables for each such sentence to predict whether or not that sentence expresses the relation. Subsequently, for the MultiR system, Hoffmann et al. [141] proposed a model further taking into consideration that relation mentions may overlap, meaning that a given mention may simultaneously refer to multiple KB relations; this idea was later refined by Surdeanu et al. [292], who proposed a similar multi-instance multi-label (MIML-RE) model capturing the idea that a pair of entities may have multiple relations (labels) in the KB and may be associated with multiple relation mentions (instances) in the text.

Another complication arising in learning through distant supervision is that of negative examples, where Semantic Web KBs such as Freebase, DBpedia and YAGO are necessarily incomplete and thus should be interpreted under an Open World Assumption (OWA): just because a relation is not in a KB, it does not mean that it is not true. Likewise, for a relation mention involving a pair of entities, if that pair does not have a given relation in the KB, it should not be considered as a negative example for training. Hence, to generate useful negative examples for training, the approach by Surdeanu et al. [292], Knowledge Vault [86], the approach by Min et al. [211], etc., propose a heuristic called (in [86]) a Local Closed World Assumption (LCWA), which assumes that if a relation $p(s, o)$ exists in the KB, then any relation $p(s, o')$ not in the KB is a negative example; e.g., if $born(BO, US)$ exists in the KB, then $born(BO, X)$ should be considered a negative example assuming it is not in the KB. While obviously this is far from infallible – working well for functional-esque properties like $capital$ but less well for often multi-valued properties like $child$ – it has proven useful in practice [86,211,292]; even if it produces false negative examples, it will produce far fewer than would be produced by considering any relation not in the KB as false, and the benefit of having true negative examples amortizes the cost of potentially producing false negatives.
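The LCWA heuristic can be sketched in a few lines, reusing the $born$ example above. The function name and the three-way labeling are illustrative choices made here, not taken from the cited works.

```python
def lcwa_label(candidate, kb):
    """Label a candidate triple (s, p, o) as positive, negative or unknown."""
    s, p, _ = candidate
    if candidate in kb:
        return "positive"
    # LCWA: the KB knows some value for (s, p), so other values are assumed false.
    if any(ks == s and kp == p for ks, kp, _ in kb):
        return "negative"
    # Nothing is known for (s, p): under the OWA the candidate stays unlabeled.
    return "unknown"


kb = {("BO", "born", "US")}
candidates = [("BO", "born", "US"), ("BO", "born", "X"), ("MO", "born", "US")]
labels = [lcwa_label(c, kb) for c in candidates]
```

Only the second candidate becomes a negative training example; the third is left out of training because the KB holds no $born$ relation for its subject.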

A further complication in distant supervision is with respect to noise in automatically labeled relation mentions caused, for example, by incorrect EEL results where entity mentions are linked to an incorrect KB identifier. To tackle this issue, a number of DS-based approaches include a seed selection process to try to select high-quality examples and reduce noise in labels. Along these lines, for example, Augenstein et al. [11] propose to filter DS-labeled examples involving ambiguous entities; for example, the relation mention “New York is a state in the U.S.” may be discarded since “New York” could be mistakenly linked to the KB identifier for the city, which may lead to a noisy example for a KB property such as has-city.

Other approaches based on distant supervision rather propose to extract generalized patterns from relation mentions for known KB relations. Such systems include BOA [117] and PATTY [223], which extract sequences of tokens between entity pairs with a known KB relation, replacing the entity pairs with (typed) variables to create generalized patterns associated with that relation, extracting features used to filter low-quality patterns; an example pattern in the case of PATTY would be “<Person> is the lead-singer of <MusicBand>” as a pattern for dbo:bandMember where, e.g., MusicBand indicates the expected type of the entity replacing that variable.
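The following sketch illustrates the general idea of pattern generalization: entity mentions are replaced by typed variables and patterns with little support are filtered out. The toy sentences, the function name and the support threshold are assumptions; the actual pattern languages and filtering features of BOA and PATTY are richer than this.

```python
from collections import Counter


def generalize(sentence, subj, subj_type, obj, obj_type):
    """Replace the first occurrence of each entity mention with a typed variable."""
    pattern = sentence.replace(subj, f"<{subj_type}>", 1)
    return pattern.replace(obj, f"<{obj_type}>", 1)


# Toy example mentions for entity pairs with a known KB relation.
examples = [
    ("Freddie Mercury is the lead-singer of Queen", "Freddie Mercury", "Queen"),
    ("Thom Yorke is the lead-singer of Radiohead", "Thom Yorke", "Radiohead"),
    ("Bono sings with U2", "Bono", "U2"),
]

support = Counter(
    generalize(sentence, subj, "Person", obj, "MusicBand")
    for sentence, subj, obj in examples
)
# Keep only patterns observed for at least two entity pairs.
frequent = [pattern for pattern, count in support.items() if count >= 2]
```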

We also highlight a more recent trend towards alternative distant supervision methods based on embeddings (e.g., [179,312,324]). Such approaches have the benefit of not relying on NLP-based parsing tools, but rather relying on distributional representations of words, entities and/or relations in a fixed-dimensional vector space that, rather than producing a discrete parse-tree structure, provides a semantic representation of text in a (continuous) numeric space. Approaches such as proposed by Lin et al. [179] go one step further: rather than computing embeddings only over the text, such approaches also compute embeddings for the structured KB, in particular, the KB entities and their associated properties; these KB embeddings can be combined with textual embeddings to compute, for example, similarity between relation mentions in the text and relations in the KB.

We remark that tens of other DS-based approaches have recently been published using Semantic Web KBs in the linguistic community, most often using Freebase as a reference KB, taking an evaluation corpus from the New York Times (originally compiled by Riedel et al. [257]). While strictly speaking such works would fall within the scope of this survey, upon inspection, many do not provide any novel use of the KB itself, but rather propose refinements to the machine learning methods used. Hence we consider further discussion of such approaches as veering away from the core scope of this survey, particularly given their number. Herein, rather than enumerating all works, we have instead captured the seminal works and themes in the area of distant supervision for REL; for further details on distant supervision for REL in a Semantic Web setting, we can instead refer the interested reader to the Ph.D. dissertation of Augenstein [10].

4.4. Relation clustering

Relation mentions extracted from the text may refer to the same KB relation using different terms, or may imply the existence of a KB relation through hypernymy/sub-property relations. For example, mentions of the form “X is married to Y”, “X is the spouse of Y”, etc., can be considered as referring to the same KB property (e.g., dbo:spouse), while a mention of the form “X is the husband of Y” can likewise be considered as referring to that KB property, though in an implied form through hypernymy. Some REL approaches thus apply an analysis of such semantic relations – typically synonymy or hypernymy – to cluster textual mentions, where external resources – such as WordNet, FrameNet, VerbNet, PropBank, etc., – are often used for such purposes. These clustering techniques can then be used to extend the set of mentions/patterns that map to a particular KB relation.

An early approach applying such clustering was Artequakt [4], which leverages WordNet knowledge – specifically synonyms and hypernyms – to detect which pairs of relations can be considered equivalent or more specific than one another. A more recent version of such an approach is proposed by Gerber et al. [116] in the context of their RdfLiveNews system, where they define a similarity measure between relation patterns composed of a string similarity measure and a WordNet-based similarity measure, as well as the domain(s) and range(s) of the target KB property associated with the pattern; thereafter, a graph-based clustering method is applied to group similar patterns, where within each group, a similarity-based voting mechanism is used to select a single pattern deemed to represent that group. A similar approach was employed by Liu et al. [182] for clustering mentions, combining a string similarity measure and a WordNet-based measure; however they note that WordNet is not suitable for capturing similarity between terms with different grammatical roles (e.g., “spouse”, “married”), where they propose to combine WordNet with a distributional-style analysis of Wikipedia to improve the similarity measure. Such a technique is also used by Dutta et al. [93] for clustering relation mentions using a Jaccard-based similarity measure for keywords and a WordNet-based similarity measure for synonyms; these measures are used to create a graph over which Markov clustering is run.
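A simplified version of such similarity-based clustering is sketched below. It assumes a hand-written synonym table as a stand-in for WordNet, a Jaccard measure over keywords, and a greedy merge of patterns whose similarity reaches a threshold; this merge is a simpler substitute for the Markov or graph-based clustering used in the cited works, and all names are hypothetical.

```python
STOPWORDS = {"x", "y", "is", "was", "the", "of", "to", "in"}
# Hand-written stand-in for a WordNet lookup: maps a term to a canonical form.
SYNONYMS = {"spouse": "married"}


def keywords(pattern):
    """Lower-case content words of a pattern, normalized through SYNONYMS."""
    words = set(pattern.lower().split()) - STOPWORDS
    return {SYNONYMS.get(word, word) for word in words}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_patterns(patterns, threshold=0.5):
    """Greedily merge patterns that are similar to any member of a cluster."""
    clusters = []
    for pattern in patterns:
        linked = [
            c for c in clusters
            if any(jaccard(keywords(pattern), keywords(q)) >= threshold for q in c)
        ]
        merged = {pattern}
        for c in linked:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters


patterns = [
    "X is married to Y",
    "X was married to Y",
    "X is the spouse of Y",
    "X was born in Y",
]
clusters = cluster_patterns(patterns)
```

Without the synonym table, "spouse" and "married" would share no keyword, which mirrors the observation that purely string-based measures miss such pairs.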

An alternative clustering approach for generalized relation patterns is to instead consider the sets of entity pairs that each such pattern considers. Soderland and Mandhani [285] propose a clustering of patterns based on such an idea: if one pattern captures a (near) subset of the entity pairs that another pattern captures, they consider the former pattern to be subsumed by the latter and consider the former pattern to infer relations pertaining to the latter. A similar approach is proposed by Nakashole et al. [223] in the context of their PATTY system, where subsumption of relation patterns is likewise computed based on the sets of entity pairs that they capture; to enable scalable computation, the authors propose an implementation based on the MapReduce framework. Another approach along these lines – proposed by Riedel et al. [258] – is to construct what the authors call a universal schema, which involves creating a matrix that maps pairs of entities to KB relations and relation patterns associated with them (be it from training or test data); over this matrix, various models are proposed to predict the probability that a given relation holds between a pair of entities given the other KB relations and patterns the pair has been (probabilistically) assigned in the matrix.
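Subsumption based on entity-pair sets can be sketched as a set-overlap test. The support sets, the placeholder entity names and the overlap threshold below are assumptions for illustration only.

```python
def subsumes(general, specific, support, min_overlap=0.8):
    """True if `general` captures (nearly) all entity pairs captured by `specific`."""
    pairs_specific = support[specific]
    if not pairs_specific:
        return False
    overlap = len(pairs_specific & support[general]) / len(pairs_specific)
    return overlap >= min_overlap


# Entity pairs matched by each pattern (placeholder entity names).
support = {
    "X is married to Y": {("e1", "e2"), ("e3", "e4"), ("e5", "e6")},
    "X is the husband of Y": {("e1", "e2"), ("e3", "e4")},
}

husband_implies_married = subsumes("X is married to Y", "X is the husband of Y", support)
married_implies_husband = subsumes("X is the husband of Y", "X is married to Y", support)
```

The test is asymmetric: every pair of the more specific pattern is covered by the general one, while only two of the three pairs are covered in the other direction.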

4.5. RDF representation

In order to populate Semantic Web KBs, it is necessary for the REL process to represent output relations as RDF triples. In the case of those systems that produce binary relations, each such relation will typically be represented as an RDF triple unless additional annotations about the relation – such as its provenance – are also captured. In the case of systems that perform EEL and a DS-style approach, it is furthermore the case that new IRIs typically will not need to be minted since the EEL process provides subject/object IRIs while the DS labeling process provides the predicate IRI from the KB. This process has the benefit of also directly producing RDF triples under the native identifier scheme of the KB. However, for systems that produce n-ary relations – e.g., according to frames, DRT, etc. – in order to populate the KB, an RDF representation must be defined. Some systems go further and provide RDFS/OWL axioms that enrich the output with well-defined semantics for the terms used [113].

The first step towards generating an RDF representation is to mint new IRIs for the entities and relations extracted. The BOA [117] framework proposes to first apply Entity Linking using a DS-style approach (where predicate IRIs are already provided), where for emerging entities not found in the KB, IRIs are minted based on the mention text. The FRED system [113] likewise begins by minting IRIs to represent all of the elements, roles, etc., produced by the Boxer DRT-based parser, thus skolemizing the events: grounding the existential variables used to denote such events with a constant (more specifically, an IRI).
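Minting an IRI from mention text can be as simple as normalizing whitespace and percent-encoding the result, as in the following sketch. The namespace `http://example.org/resource/` and the function name are hypothetical, and the cited systems may apply different conventions.

```python
from urllib.parse import quote

BASE = "http://example.org/resource/"  # hypothetical namespace for minted IRIs


def mint_iri(mention):
    """Build an IRI from mention text: collapse whitespace, then percent-encode."""
    local_name = "_".join(mention.split())
    return BASE + quote(local_name, safe="_")


iri_plain = mint_iri("Better  Call Saul")
iri_accent = mint_iri("Beyoncé")
```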

Next, an RDF representation must be applied to structure the relations into RDF graphs. In cases where binary relations are not simply represented as triples, existing mechanisms for RDF reification – namely RDF n-ary relations, RDF reification, singleton properties, named graphs, etc. (see [134,272] for examples and more detailed discussion) – can, in theory, be adopted. In general, however, most systems define bespoke representations (most similar to RDF n-ary relations). Among these, Freitas et al. [109] propose a bespoke RDF-based discourse representation format that they call Structured Discourse Graphs capturing the subject, predicate and object of the relation, as well as (general) reification and temporal annotations; LODifier [12] maps Boxer output to RDF by mapping unary relations to rdf:type triples and binary relations to triples with a custom predicate, using RDF reification to represent the disjunction, negation, etc., present in the DRS output; FRED [113] applies an n-ary–relation-style representation of the DRS-based relations extracted by Boxer, likewise mapping unary relations to rdf:type triples and binary relations to triples with a custom predicate (see Listing 4); etc.
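To make the difference between a plain triple and an n-ary–relation-style representation concrete, the sketch below serializes both as N-Triples strings. The `http://example.org/` namespace, the role names and both helper functions are assumptions; they do not reproduce the vocabulary of FRED, LODifier or any other system.

```python
EX = "http://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def binary_triple(s, p, o):
    """A binary relation represented directly as one triple."""
    return [f"<{EX}{s}> <{EX}{p}> <{EX}{o}> ."]


def nary_relation(node_id, relation_type, roles):
    """n-ary style: one node per relation instance, one triple per role."""
    node = f"<{EX}{node_id}>"
    triples = [f"{node} <{RDF_TYPE}> <{EX}{relation_type}> ."]
    triples += [f"{node} <{EX}{role}> <{EX}{filler}> ." for role, filler in roles.items()]
    return triples


direct = binary_triple("BO", "born", "US")
nary = nary_relation("birth1", "Birth", {"child": "BO", "place": "US"})
```

The n-ary form needs a freshly minted identifier for the relation instance (`birth1` here), which is where further annotations such as provenance or time can be attached.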

Rather than creating a bespoke RDF representation, other systems rather try to map or project extracted relations directly to the native identifier scheme and data model of the reference KB. Likewise, those systems that first create a bespoke RDF representation may apply an a posteriori mapping to the KB/ontology. Such methods for performing mappings are now discussed.

4.6. Relation mapping

While in a distant supervision approach, the patterns and features extracted from textual relation mentions are directly associated with a particular (typically binary) KB property, REL systems based on other extraction methods – such as parsing according to legacy OpenIE systems, or frames/DRS theory – are still left to align the extracted relations with a given KB.

A common approach – similar to distant supervision – is to map pairs of entities in the parsed relation mentions to the KB to identify what known relations correspond to a given relation pattern.⁶² This process is more straightforward when the extracted relations are already in a binary format, as produced, for example, by OpenIE systems. Dutta et al. [93] apply such an approach to map the relations extracted by OpenIE systems – namely the NELL and ReVerb tools – to DBpedia properties: the entities in triples extracted from such OpenIE systems are mapped to DBpedia by an EEL process, where existing KB relations are fed into an association-rule mining process to generate candidate mappings for a given OpenIE predicate and pair of entity-types. These rules are then applied over clusters of OpenIE relations to generate fresh DBpedia triples.

⁶²
More specifically, we distinguish between distant supervision approaches that use KB entities and relations to extract relation mentions (as discussed previously), and the approaches here, which extract such mentions without reference to the KB and rather map to the KB in a subsequent step, using matches between existing KB relations and parsed mentions to propose candidate KB properties.
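The idea of deriving candidate mappings from co-occurrence with existing KB relations can be sketched as a simple confidence computation. This is a loose simplification: it ignores entity types and clustering, uses placeholder entity names, and the function name and threshold are assumptions rather than the procedure of Dutta et al.

```python
from collections import Counter, defaultdict


def mine_mappings(openie_triples, kb, min_confidence=0.5):
    """Map each OpenIE phrase to KB properties that often hold for the same pairs."""
    counts = defaultdict(Counter)
    totals = Counter()
    for s, phrase, o in openie_triples:
        totals[phrase] += 1
        for prop in kb.get((s, o), []):
            counts[phrase][prop] += 1
    return {
        phrase: {
            prop: n / totals[phrase]
            for prop, n in props.items()
            if n / totals[phrase] >= min_confidence
        }
        for phrase, props in counts.items()
    }


# OpenIE output whose entities are assumed to be already linked to the KB.
openie_triples = [
    ("e1", "was born in", "e2"),
    ("e3", "was born in", "e4"),
    ("e5", "was born in", "e6"),
]
# Known KB properties per entity pair; (e5, e6) has none yet.
kb = {
    ("e1", "e2"): ["dbo:birthPlace"],
    ("e3", "e4"): ["dbo:birthPlace", "dbo:deathPlace"],
}
mappings = mine_mappings(openie_triples, kb)
```

A mapping retained this way could then be used to propose a new `dbo:birthPlace` triple for the pair that has no KB relation so far.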

In the case of systems that natively extract n-ary relations – e.g., those systems based on frames or DRS – the process of mapping such relations to a binary KB relation – sometimes known as projection of n-ary relations [167] – is considerably more complex. Rather than trying to project a binary relation from an n-ary relation, some approaches thus rather focus on mapping elements of n-ary relations to classes in the KB. Such an approach is adopted by Gerber et al. [116] for mapping elements of binary relations extracted via learned patterns to DBpedia entities and classes. The FRED system [113] likewise provides mappings of its DRS-based relations to various ontologies and KBs, including WordNet and DOLCE ontologies (using WSD) and the DBpedia KB (using EEL).

On the other hand, other systems do propose techniques for projecting binary relations from n-ary relations and linking them with KB properties; such a process must not only identify the pertinent KB property (or properties), but also the subject and object entities for the given n-ary relation; furthermore, for DRS-style relations, care must be taken since the statement may be negated or may be part of a disjunction. Along those lines, Exner and Nugues [97] initially proposed to generate triples from DRS relations by means of a combinatorial approach, filtering relations expressed with negation. In follow-up work on the Refractive system, Exner and Nugues [98] later propose a method to map n-ary relations extracted using PropBank to DBpedia properties: existing relations in the KB are matched to extracted PropBank roles such that more matches indicate a better property match; thereafter, the subject and object of the KB relation are generalized to their KB class (used to identify subject/object in extracted relations), and the relevant KB property is proposed as a candidate for other instances of the same role (without a KB relation) and pairs of entities matching the given types. Legalo [251] proposes a method for mapping FRED results to binary KB relations by concatenating the labels of nodes on paths in the FRED output between elements identified as (potential) subject/object pairs, where these concatenated path labels are then mapped to binary KB properties to project new RDF triples. Rouces et al. [272], on the other hand, propose a rule-based approach to project binary relations from FrameNet patterns, where dereification rules are constructed to map suitable frames to binary triples by mapping frame elements to subject and object positions, creating a new predicate from appropriate conjugate verbs, further filtering passive verb forms with no clear binary relation.

4.7. System summary and comparison

Based on the previously discussed techniques, an overview of the highlighted REL systems is provided in Table 6, with a column legend provided in the caption. As before, we highlight approaches that are directly related with the Semantic Web and that offer a peer-reviewed publication with novel technical details regarding REL. With respect to the Entity Recognition column, note that many approaches delegate this task to external tools and systems – such as DBpedia Spotlight [200], GATE [68], Stanford CoreNLP [187], TagMe [101], Wikifier [255], etc. – which are mentioned in the respective column.

Table 6
Overview of relation extraction and linking systems. Entity Recognition denotes the NER or EEL strategy used; Parsing denotes the method used to parse relation mentions (Cons.: Constituency Parsing, Dep.: Dependency Parsing, DRS: Discourse Representation Structures, Emb.: Embeddings, SRL: Semantic Role Labeling); PS refers to the Property Selection method (PG: Property Generation, RM: Relation Mapping, DS: Distant Supervision); Rep. refers to the reification model used for representation (SR: Standard Reification, BR: Binary Relation, n-ary: n-ary relation); KB refers to the main knowledge-base used; “—” denotes no information found, not used or not applicable.

System	Year	Entity recognition	Parsing	PS	Rep.	KB	Domain
Artequakt [4]	2003	GATE	Patterns	RM	BR	Artists ontology	Artists
AugensteinMC [11]	2016	Stanford	Features	DS	BR	Freebase	Open
BOA [117]	2012	DBpedia Spotlight	Patterns, Features	DS	BR	DBpedia	News, Wikipedia
DeepDive [233]	2012	Stanford	Dep., Features	DS	BR	Freebase	Open
DuttaMS [92,93]	2015	Keyword	OpenIE	DS	BR	DBpedia	Open
ExnerN [97]	2012	Wikifier	Frames	DS	BR	DBpedia	Wikipedia
Fact Extractor [106]	2017	Wiki Machine	Frames	DS	n-ary	DBpedia	Football
FRED [113]	2016	Stanford, TagMe	DRS	PG/RM	n-ary	DBpedia/BabelNet	Open
Graphia [109]	2012	DBpedia Spotlight	Dep.	PG	SR	DBpedia	Wikipedia
Knowledge Vault [86]	2014	—	Features	DS	BR	Freebase	Open
LinSLLS [179]	2016	Stanford	Emb.	DS	BR	Freebase	News
LiuHLZLZ [182]	2013	Stanford	Dep., Features	DS	BR	YAGO	News
LODifier [12]	2012	Wikifier	DRS	RM	SR	WordNet	Open
MIML-RE [292]	2012	Stanford	Dep., Features	DS	BR	Freebase	News, Wikipedia
MintzBSJ [213]	2009	Stanford	Dep., Features	DS	BR	Freebase	Wikipedia
MultiR [141]	2011	Stanford	Dep., Features	DS	BR	Freebase	Wikipedia
Nebhi [226]	2013	GATE	Patterns, Dep.	DS	BR	DBpedia	News
NguyenM [232]	2011	—	Dep., Cons.	DS	BR	YAGO	Wikipedia
PATTY [223]	2013	Stanford	Dep., Patterns	RM	—	YAGO	Wikipedia
PIKES [61]	2016	DBpedia Spotlight	SRL	RM	n-ary	DBpedia	Open
PROSPERA [222]	2011	Keyword	Patterns	RM	BR	YAGO	Open
RdfLiveNews [116]	2013	DBpedia Spotlight	Patterns	PG/DS	BR	DBpedia	News
Refractive [98]	2014	Stanford	Frames	DS	SR	—	Wikipedia
RiedelYM [257]	2010	Stanford	Dep., Features	DS	BR	Freebase	News
Sar-graphs [167]	2016	Dictionary	Dep.	DS	—	Freebase/BabelNet	Open
TakamatsuSN [293]	2012	Hyperlinks	Dep.	DS	BR	Freebase	Wikipedia
Wsabie [312]	2013	Stanford	Dep., Features, Emb.	DS	BR	Freebase	News

Choosing an REL strategy for a particular application scenario can be complex given that every approach has pros and cons regarding the application at hand. However, we can identify some key considerations that should be taken into account:

Binary vs. n-ary: Does the application require binary relations or does it require n-ary relations? Oftentimes the results of systems that produce binary relations can be easier to integrate with existing KBs already composed of such, where DS-based approaches, in particular, will produce triples using the identifier scheme of the KB itself. On the other hand, n-ary relations may capture more nuances in the discourse implicit in the text, for example, capturing semantic roles, negation, disjunction, etc., in complex relations.

Identifier creation: Does the application require finding and identifying new instances in the text not present in the KB? Does it require finding and identifying emerging relations? Most DS-based approaches do not consider minting new identifiers but rather focus on extracting new triples within the KB’s universe (the sets of identifiers it provides). However, there are some exceptions, such as BOA [117]. On the other hand, most REL systems dealing with n-ary relations mint new IRIs as part of their output representation.

Language: Does the application require extraction for a language other than English? Though not discussed previously, we note that almost all systems presented here are evaluated only for English corpora, the exceptions being BOA [117], which is tested for both English and German text; and the work by Fossati et al. [106], which is tested for Italian text. Thus in scenarios involving other languages, it is important to consider to what extent an approach relies on a language-specific technique, such as POS-tagging, dependency parsing, etc. Unfortunately, given the complexity of REL, most works are heavily reliant on such language-specific components. Possible solutions include trying to replace the particular component with its equivalent in another language (which is not guaranteed to work as well as the component tested in the original evaluation), or, as proposed for the FRED [113] tool, using an existing API (e.g., Bing!, Google, etc.) to translate the text to the supported language (typically English), with the obvious caveat of the potential for translation errors (though such services are continuously improving in parallel with, e.g., Deep Learning).

Scale & Efficiency: In applications dealing with large corpora, scalability and efficiency become crucial considerations. With some exceptions, most of the approaches do not explicitly tackle the question of scale and efficiency. On the other hand, REL should be highly parallelizable given that processing of different sentences, paragraphs and/or documents can be performed independently assuming some globally-accessible knowledge from the KB. Parallelization has been used, e.g., by Nakashole et al. [223], who cluster relational patterns using a distributed MapReduce framework. Indeed, initiatives such as Knowledge Vault – using standard DS-based REL techniques to extract 1.6 billion triples from a large-scale Web corpus – provide a practical demonstration that, with careful engineering and selection of techniques, REL can be applied to corpora at a very large (potentially Web) scale.

Various other considerations, such as availability or licensing of software, provision of an API, etc., may also need to be taken into account.

Of course, a key consideration when choosing an REL approach is the quality of output produced by that approach, which can be assessed using the evaluation protocols discussed in the following section.

4.8. Evaluation

REL is a challenging task, where evaluation is likewise complicated by a number of fundamental factors. In general, human judgment is often required to assess the quality of the output of systems performing such a task, but such assessments can often be subjective. Creating a gold-standard dataset can likewise be complicated, particularly for those systems producing n-ary relations, requiring an expert informed on the particular theory by which such relations are extracted; likewise, in DS-related scenarios, the expert must label the data according to the available KB relations, which may be a tedious task requiring in-depth knowledge of the KB. Rather than creating a gold-standard dataset, another approach is to apply a posteriori assessment of the output by human judges, i.e., run the process over unlabeled text, generate relations, and have the output validated by human judges; while this would appear more reasonable for systems based on frames or DRS – where creating a gold-standard for such complex relations would be arduous at best – there are still problems in assessing, for example, recall.⁶³

⁶³
Likewise we informally argue that a human judge presented with results of a system is more likely to confirm that output and give it the benefit of subjectivity, especially when compared with the creation of a gold standard dataset where there is more freedom in choice of relations and more ample opportunity for subjectivity.

Rather than relying on costly manual annotations, some systems rather propose automated methods of evaluation based on the KB where, for example, parts of the KB are withheld and then experiments are conducted to see if the tool can reinstate the withheld facts or not; however, such approaches offer rather approximate evaluation since the KB is incomplete, the text may not even mention the withheld triples, and so forth.
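A held-out evaluation of this kind reduces to comparing the extracted triples against the withheld ones, as in the following sketch. The triples use placeholder identifiers and the function name is hypothetical.

```python
def held_out_scores(predicted, withheld):
    """Precision and recall of predicted triples against withheld KB triples."""
    predicted, withheld = set(predicted), set(withheld)
    true_positives = len(predicted & withheld)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(withheld) if withheld else 0.0
    return precision, recall


withheld = {
    ("e1", "p", "e2"),
    ("e3", "p", "e4"),
    ("e5", "p", "e6"),
    ("e7", "p", "e8"),
}
predicted = {
    ("e1", "p", "e2"),
    ("e3", "p", "e4"),
    # Counted as an error here, although it may be true and merely absent from the KB.
    ("e9", "p", "e10"),
}
precision, recall = held_out_scores(predicted, withheld)
```

Both figures are approximate for the reasons given above: precision is penalized by correct triples that the KB lacks, and recall by withheld triples that the text never mentions.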

In summary, then, approaches for evaluating REL are quite diverse and in many cases there are no standard criteria for assessing the adequacy of a particular evaluation method. Here we discuss some of the main themes for evaluation, broken down by datasets used, how evaluators are employed to judge the output, how automated evaluation can be conducted, and what are the typical metrics considered.

Datasets: Most approaches consider REL applied to general-domain corpora, such as Wikipedia articles, newspaper articles, or even webpages. However, to simplify evaluation, many approaches may restrict REL to consider a domain-specific subset of such corpora, a fixed subset of KB properties or classes, and so forth. For example, Fossati et al. [106] focus their REL efforts on the Wikipedia articles about Italian soccer players using a selection of relevant frames; Augenstein et al. [11] apply evaluation for relations pertaining to entities in seven Freebase classes for which relatively complete information is available, using the Google Search API to find relevant documents for each such entity; and so forth.

A number of standard evaluation datasets have, however, emerged, particularly for approaches based on distant supervision. A widely reused gold-standard dataset, for example, was that initially proposed by Riedel et al. [257] for evaluating their system, where they select Freebase relations pertaining to people, businesses and locations (corresponding also to NER types) and then link them with New York Times articles, first using Stanford NER to find entities, then linking those entities to Freebase, and finally selecting the appropriate relation (if any) to label pairs of entities in the same sentence with; this dataset was later reused by a number of works [141,258,292]. Other such evaluation resources have since emerged. Google Research⁶⁴ provides five REL corpora, with relation mentions from Wikipedia linked with manual annotation to five Freebase properties indicating institutions, date of birth, place of birth, place of death, and education degree. Likewise, the Text Analysis Conference often hosts a Knowledge Base Population (TAC–KBP) track, where evaluation resources relating to the REL task can be found;⁶⁵ such resources have been used and further enhanced, for example, by Surdeanu et al. [292] for their evaluation (whose dataset was in turn used by other works, e.g., by Min et al. [211], DeepDive [233], etc.). Another such initiative is hosted at the European Semantic Web Conference, where the Open Knowledge Extraction challenge (ESWC–OKE) has hosted materials relating to REL using RDFa annotations on webpages as labeled data.⁶⁶

⁶⁴
https://code.google.com/archive/p/relation-extraction-corpus/downloads

⁶⁵
For example, see https://tac.nist.gov/2017/KBP/ColdStart/index.html.

⁶⁶
For example, see Task 3: https://github.com/anuzzolese/oke-challenge-2016.

Note that all prior evaluation datasets relate to binary relations of the form subject–predicate–object. Creating gold standard datasets for n-ary relations is complicated by the heterogeneity of representations that can be employed in terms of frames, DRS or other theories used. To address this issue, Gangemi et al. [114] proposed the construction of RDF graphs by means of logical patterns known as motifs that are extracted by the FRED tool and thereafter manually corrected and curated by evaluators to follow best Semantic Web practices; the result is a corpus annotated by instances of such motifs that can be reused for evaluation of REL tools producing similar such relations.

Evaluators: In scenarios for which a gold standard dataset is not available – or not feasible to create – the results of the REL process are often directly evaluated by humans. Many papers assign experts to evaluate the results, typically (we assume) authors of the papers, though often little detail on the exact evaluation process is given, aside from a rater agreement expressed as Cohen’s or Fleiss’ κ-measure for a fixed number of evaluators (as discussed in Section 3.7).
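For concreteness, the following is a minimal sketch of how Cohen’s κ could be computed for two evaluators judging the same extracted relations. The function name and the toy judgments are our own illustrative assumptions and are not taken from any of the surveyed papers.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical judgments (1 = relation correct, 0 = incorrect).
rater_a = [1, 1, 0, 0, 1, 0]
rater_b = [1, 0, 0, 0, 1, 1]
print(cohens_kappa(rater_a, rater_b))
```

Fleiss’ κ generalizes the same idea to more than two evaluators; it is not covered by this sketch.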

Aside from expert evaluation, some works leverage crowdsourcing platforms for labeling training and test datasets, where a broad range of users contribute judgments for a relatively low price. Amongst such works, we can mention Mintz et al. [213] using Amazon’s Mechanical Turk⁶⁷ for evaluating relations, while Legalo [251] and Fossati et al. [106] use the Crowdflower⁶⁸ platform (now called Figure Eight).

⁶⁷ https://www.mturk.com/mturk/welcome

⁶⁸ https://www.figure-eight.com/

Automated evaluation: Some works have proposed methods for performing automated evaluation of REL processes, in particular for testing DS-based methods. A common approach is to perform held-out experiments, where KB relations are (typically randomly) omitted from the training/DS phase and then metrics are defined to see how many KB relations are returned by the process, giving an indicator of recall; the intuition of such approaches is that REL is often used for completing an incomplete KB, and thus by holding back KB triples, one can test the process to see how many such triples the process can reinstate. Such an approach avoids expensive manual labeling but is poorly suited to measuring precision, since the KB is incomplete and an extracted triple that is missing from the KB may still be correct; it likewise assumes that held-out KB relations are both correct and mentioned in the text. On the other hand, such experiments can help gain insights at larger scales for a more diverse range of properties, and can be used to assess a relative notion of precision (e.g., to tune parameters); they have thus been used by Mintz et al. [213], Takamatsu et al. [293], Knowledge Vault [86], Lin et al. [179], etc. Furthermore, as mentioned previously, some works – including Knowledge Vault [86] – adopt a partial Closed World Assumption as a heuristic to generate negative examples taking into account the incompleteness of the KB; more specifically, extracted triples of the form $(s, p, o')$ are labeled incorrect if (and only if) a triple $(s, p, o)$ is present in the KB but $(s, p, o')$ is not.
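The two ideas above can be illustrated with a minimal sketch, assuming triples are represented as plain (subject, property, object) tuples. All function names and identifiers below are hypothetical; the surveyed systems use their own splitting and labeling procedures.

```python
import random


def split_held_out(kb_triples, fraction=0.2, seed=0):
    """Randomly hold back a fraction of KB triples from the training/DS phase."""
    triples = sorted(kb_triples)
    random.Random(seed).shuffle(triples)
    cut = int(len(triples) * fraction)
    return set(triples[cut:]), set(triples[:cut])  # (training, held-out)


def held_out_recall(extracted, held_out):
    """Share of held-out triples that the extraction process reinstated."""
    if not held_out:
        return 0.0
    return len(set(extracted) & held_out) / len(held_out)


def label_with_partial_cwa(extracted, kb_triples):
    """Label extracted triples True, False or None (unknown).

    (s, p, o') is labelled False only if the KB has some (s, p, o) but
    not (s, p, o'); if the KB knows nothing about (s, p), no label is given.
    """
    known = {}
    for s, p, o in kb_triples:
        known.setdefault((s, p), set()).add(o)
    labels = {}
    for s, p, o in extracted:
        objects = known.get((s, p))
        labels[(s, p, o)] = None if objects is None else (o in objects)
    return labels


# Hypothetical KB and extractor output.
kb = {("Breaking_Bad", "network", "AMC")}
output = [("Breaking_Bad", "network", "AMC"),
          ("Breaking_Bad", "network", "Fox"),
          ("Fallen", "network", "ABC")]
for triple, label in label_with_partial_cwa(output, kb).items():
    print(triple, label)
```

Note that the third triple receives no label rather than a negative one: since the KB states no network for that subject, the heuristic cannot distinguish a novel correct fact from an error.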

Metrics: Standard evaluation measures are typically applied, including precision, recall, F-measure, accuracy, Area-Under-Curve (AUC–ROC), and so forth. However, given that relations may be extracted for multiple properties, sometimes macro-measures such as Mean Average Precision (MAP) are applied to summarize precision across all such properties rather than taking a micro precision measure [92,213,293]. Given the subjectivity inherent in evaluating REL, Fossati et al. [106] use a strict and lenient version of precision/recall/F-measure, where the former requires the relation to be exact and complete, while the latter also considers relations that are partially correct; relating to the same theme, the Legalo system includes confidence as a measure indicating the level of agreement and trust in crowdsourced evaluators for a given experiment. Some systems produce confidence or support values for relations, in which case P@k measures are sometimes used to measure the precision for the top-k results [179,182,213,257]. Finally, given that REL is inherently composed of several phases, some works present metrics for various parts of the task; as an example, for extracted triples, Dutta [93] considers a property precision (is the mapped property correct?), instance precision (are the mapped subjects/objects correct?), triple precision (is the extracted triple correct?), amongst other measures to, for example, indicate the ratio of extracted triples new to the KB.
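To make the distinction between these measures concrete, the following sketch computes P@k, MAP (macro-averaged over properties) and micro precision over ranked judgments. It assumes that results for each property are given as a list of booleans ordered by decreasing confidence, and that average precision is normalized by the number of correct results retrieved; both are our own simplifying assumptions, as are the property names.

```python
def precision_at_k(ranked_correct, k):
    """Precision over the top-k results of a confidence-ranked list."""
    top = ranked_correct[:k]
    return sum(top) / len(top) if top else 0.0


def average_precision(ranked_correct):
    """Mean of P@i over the ranks i at which a correct result appears."""
    hits, total = 0, 0.0
    for i, ok in enumerate(ranked_correct, start=1):
        if ok:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0


def mean_average_precision(per_property):
    """Macro measure: every property contributes equally."""
    if not per_property:
        return 0.0
    scores = [average_precision(r) for r in per_property.values()]
    return sum(scores) / len(scores)


def micro_precision(per_property):
    """Micro measure: every extracted relation contributes equally."""
    flat = [ok for ranked in per_property.values() for ok in ranked]
    return sum(flat) / len(flat) if flat else 0.0


# Hypothetical judgments for two properties, ranked by confidence.
results = {
    "birthPlace": [True, True, False, True],
    "spouse": [False, True],
}
print(precision_at_k(results["birthPlace"], 2))
print(mean_average_precision(results))
print(micro_precision(results))
```

In this toy example the property with few extractions pulls the macro score down more strongly than it would a micro score weighted by the number of relations, which is the motivation for reporting both.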

Third-party comparisons: While some REL papers include prior state-of-the-art approaches in their evaluations for comparison purposes, we are not aware of a third-party study providing evaluation results of REL systems. Although Gangemi [112] provides a comparative evaluation of Alchemy, CiceroLite, FRED and ReVerb – all with public APIs available – for extracting relations from a paragraph of text on the Syrian war, he does not publish results for a linking phase; FRED is the only REL tool tested that outputs RDF.

Despite a lack of third-party evaluation results, some comparative metrics can be gleaned from the use of standard datasets over several papers relating to distant supervision; we stress, however, that these are often published in the context of evaluating a particular system (and hence are not strictly third-party comparisons⁶⁹). With respect to DS-based approaches, and as previously mentioned, a prominent dataset used is the one proposed by Riedel et al. [257], with articles from the New York Times corpus annotated with 53 different types of relations from Freebase; the training set contains 18,252 relations, while the test set contains 1,950 relations. Lin et al. [179] then used this dataset to perform a held-out evaluation, comparing their approach with that of Mintz et al. [213], Hoffmann et al. [141], and Surdeanu et al. [292], for which source code is available. These results show that fixing 50% precision, Mintz et al. achieved 5% recall, Hoffmann et al. and Surdeanu et al. achieved 10% recall, while the best approach by Lin et al. achieved 33% recall. As a general conclusion, these results suggest that there is still considerable room for improvement in the area of REL based on distant supervision.

⁶⁹ We remark that the results of Gangemi [112] are strictly not third-party either due to the inclusion of results from FRED [113].

4.9. Summary

This section presented the task of Relation Extraction and Linking in the context of the Semantic Web. The applications for such a task include KB Population, Structured Discourse Representation, Machine Reading, Question Answering, Fact Verification, amongst a variety of others. We discussed relevant papers following a high-level process consisting of: entity extraction (and coreference resolution), relation parsing, distant supervision, relation clustering, RDF representation, relation mapping, and evaluation. It is worth noting, however, that not all systems follow these steps in the presented order and not all systems apply (or even require) all such steps. For example, entity extraction may be conducted during relation parsing (where particular arguments can be considered as extracted entities), distant supervision does not require a formal representation nor relation-mapping phase, and so forth. Hence the presented flow of techniques should be considered illustrative, not prescriptive.

In general, we can distinguish two types of REL systems: those that produce binary relations, and those that produce n-ary relations (although binary relations can subsequently be projected from the latter tools [251,272]). With respect to binary relations, distant supervision has become a dominating theme in recent approaches, where KB relations are used, in combination with EEL and often CR, to find example mentions of binary KB relations, generalizing patterns and features that can be used to extract further mentions and, ultimately, novel KB triples; such approaches are enabled by the existence of modern Semantic Web KBs with rich factual information about a broad range of entities of general interest. Other approaches for extracting binary relations rather rely on mapping the results of existing OpenIE systems to KBs/ontologies. With respect to extracting n-ary relations, such approaches rely on more traditional linguistic techniques and resources to extract structures according to frame semantics or Discourse Representation Theory; the challenge thereafter is to represent the results as RDF and, in particular, to map the results to an existing KB, ontology, or collection thereof.

4.10. Open questions

REL is a very important task for populating the Semantic Web. Several techniques have been proposed for this task in order to cover the extraction of binary and n-ary relations from text. However, some aspects could still be improved or developed further:

Relation types: Unlike EEL where particular types of entities are commonly extracted, in REL it is not easy to define the types of relations to be extracted and linked to the Semantic Web. Previous studies, such as the one presented by Storey [290], provide an organization of relations – identified from disciplines such as linguistics, logic, and cognitive psychology – that can be incorporated into traditional database management systems to capture the semantics of real world information. However, to the best of our knowledge, a thorough categorization of semantic relationships on the Semantic Web has not been presented, which in turn, could be useful for defining requirements of information representation, standards, rules, etc., and their representation in existing standards (RDF, RDFS, OWL).

Specialized Settings/Multilingual REL: In brief, we can again raise the open question of adapting REL to settings with noisy text (such as Twitter) and generalizing REL approaches to work with multiple languages. In this context, DS approaches may prove to be more successful given that they rely more on statistical/learning frameworks (i.e., they do not require curated databases of relations, roles, etc., which are typically specific to a language), and given that KBs such as Wikidata, DBpedia and BabelNet can provide examples of relations in multiple languages.

Datasets: The preferred evaluation method for the analyzed approaches is through an a posteriori manual assessment of represented data. However, this is an expensive task that requires human judges with adequate knowledge of the domain, language, and representation structures. Although there are a couple of labeled datasets already published (particularly for DS approaches), the definition of further datasets would benefit the evaluation of approaches under more diverse conditions. The problem of creating a reference gold standard would then depend on the first point, relating to what types of relations should be targeted for extraction from text in domain-specific and/or open-domain settings, and how the output should be represented to allow comparison with the labeled relations for the dataset.

Evaluation: Existing REL approaches extract different outputs relating to particular entity types, domains, structures, and so on. Thus, evaluating/comparing different approaches is not a straightforward task. Another challenge is to allow for a more fine-grained evaluation of REL approaches, which are typically complex pipelines involving various algorithms, resources, and often external tools, where noisy elements extracted in some early stage of the process can have a major negative effect on the final output, making it difficult to interpret the cause of poor evaluation results or the key points that should be improved.

5. Semi-structured information extraction

The primary focus of the survey – and the sections thus far – is on Information Extraction over unstructured text. However, the Web is full of semi-structured content, where HTML, in particular, allows for demarcating titles, links, lists, tables, etc., imposing a limited structure on documents. While it is possible to simply extract the text from such sources and apply previous methods, the structure available in the source, though limited, can offer useful hints for the IE process.

Hence a number of works have emerged proposing Information Extraction methods using Semantic Web languages/resources targeted at semi-structured sources. Some works are aimed at building or otherwise enhancing Semantic Web KBs (where, in fact, many of the KBs discussed originated from such a process [139,171]). Other works rather focus on enhancing or annotating the structure of the input corpus using a Semantic Web KB as reference. Some works make significant reuse of previously discussed techniques for plain text – particularly Entity Linking and sometimes Relation Extraction – adapted for a particular type of input document structure. Other works rather focus on custom techniques for extracting information from the structure of a particular data source.

Table 7
Overview of information extraction systems for markup documents. Task denotes the IE task(s) considered (EEL: Entity Extraction & Linking, CEL: Concept Extraction & Linking, REL: Relation Extraction & Linking); Structure denotes the type of document structure leveraged for the IE task; “—” denotes no information found, not used or not applicable

System	Year	Task	Source	Domain	Structure	KB
COHSE [18]	2008	CEL	Webpages	Medical	Hyperlinks	Any (SKOS)
DBpedia [171]	2007	EEL/CEL/REL	Wikipedia	Open	Wiki	—
DeVirgilio [74]	2011	EEL/CEL	Webpages	Commerce	HTML (DOM)	DBpedia
Epiphany [2]	2011	EEL/CEL/REL	Webpages	Open	HTML (DOM)	Any (SPARQL)
Knowledge Vault [86]	2014	EEL/CEL/REL	Webpages	Open	HTML (DOM)	Freebase
Legalo [251]	2014	REL	Webpages	Open	Hyperlinks	—
LIEGE [277]	2012	EEL	Webpages	Open	Lists	YAGO
LODIE [115]	2014	EEL/REL	Webpages	Open	HTML (DOM)	Any (SPARQL)
RathoreR [282]	2014	CEL	Wikipedia	Physics	Titles	Custom ontology
YAGO (2007) [139]	2007	EEL/CEL/REL	Wikipedia	Open	Wiki	—

Our goal in this section is thus to provide an overview of some of the most popular techniques and tools that have emerged in recent years for Information Extraction over semi-structured sources of data using Semantic Web languages/resources. Given that the techniques vary widely in terms of the type of structure considered, we organize this section differently from those that came before. In particular, we proceed by discussing two prominent types of semi-structured sources – markup documents and tables – and discuss works that have been proposed for extracting information from such sources using Semantic Web KBs.

We do not include languages or approaches for mapping from one explicit structure to another (e.g., R2RML [72]), nor that rely on manual scraping (e.g., Piggy Bank [146]), nor tools that simply apply existing IE frameworks (e.g., Magpie [94], RDFaCE [159], SCMS [229]). Rather we focus on systems that extract and/or disambiguate entities, concepts, and/or relations from the input sources and that have methods adapted to exploit the partial structure of those sources (i.e., they do not simply extract and apply IE processes over plain text). Again, we only include proposals that in some way directly involve a Semantic Web standard (RDF(S)/OWL/SPARQL, etc.), or a resource described in those standards, be it to populate a Semantic Web KB, or to link results with such a KB.

5.1. Markup documents

The content of the Web has traditionally been structured according to the HyperText Markup Language (HTML), which lays out a document structure for webpages to follow. While this structure is primarily perceived as a way to format, display and offer navigational links between webpages, it can also be – and has been – leveraged in the context of Information Extraction. Such structure includes, for example, the presence of hyperlinks, title tags, paths in the HTML parse tree, etc. Other Web content – such as Wikis – may be formatted in markup other than HTML, where we include frameworks for such formats here. We provide an overview of these works in Table 7. Given that all such approaches implement diverse methods that depend on the markup structure leveraged, we will not discuss techniques in detail. However, we will provide more detailed discussion for IE techniques that have been proposed for HTML tables in a following section.

COHSE (2008) [18] (Conceptual Open Hypermedia Service) uses a reference taxonomy to provide personalized semantic annotation and hyperlink recommendation for the current webpage that a user is browsing. A use-case is discussed for such annotation/recommendation in the biomedical domain, where a SKOS taxonomy can be used to recommend links to further material on more/less specific concepts appearing in the text, with different types of users (e.g., doctors, the public) receiving different forms of recommended links.

DBpedia (2007) [171] is a prominent initiative to extract a rich RDF KB from Wikipedia. The main source of extracted information comes from the semi-structured info-boxes embedded in the top right of Wikipedia articles; however, further information is also extracted from abstracts, hyperlinks, categories, and so forth. While much of the extracted information is based on manually-specified mappings for common attributes, components are provided for higher-recall but lower-precision automatic extraction of info-box information, including recognition of datatypes, etc.

DeVirgilio (2011) [74] uses Keyphrase Extraction to semantically annotate webpages, linking keywords to DBpedia. The approach breaks webpages down into “semantic blocks” describing specific elements based on HTML elements; Keyphrase Extraction is then applied over individual blocks. Evaluation is conducted in the E-Commerce domain, adding RDFa annotations using the GoodRelations vocabulary [132].

Epiphany (2011) [2] aims to semantically annotate webpages with RDFa, incorporating embedded links to existing Linked Data KBs. The process is based on an input KB, where labels of instances, classes and properties are extracted. A custom IE pipeline is then defined to chunk text and match it with the reference labels, with disambiguation performed based on existing relations in the KB for resolved entities. Facts from the KB are then matched to the resolved instances and used to embed RDFa annotations in the webpage.

Knowledge Vault (2014) [86] was discussed before in the context of Relation Extraction & Linking over text. However, the system also includes a component for extracting features from the structure of HTML pages. More specifically, the system extracts the Document Object Model (DOM) from a webpage, which is essentially a hierarchical tree of HTML tags. For relations identified on the webpage using a DS approach, the path in the DOM tree between both entities (for which an existing KB relation exists) is extracted as a feature.
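As a rough illustration of such a feature, the sketch below computes the sequence of tags on the path between two entity mentions by way of their lowest common ancestor in the tag tree. It uses Python’s standard html.parser module; the encoding of the path as a list of tag names, the handling of void tags and the function names are our own assumptions rather than the representation used by Knowledge Vault.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _PathCollector(HTMLParser):
    """Records, for each text node, the chain of open tags leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open tags as (name, unique id)
        self.counter = 0
        self.text_paths = []  # (text, path) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.counter += 1  # the id distinguishes sibling tags of equal name
        self.stack.append((tag, self.counter))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.text_paths.append((data.strip(), list(self.stack)))


def dom_path_between(html, mention_a, mention_b):
    """Tags from mention_a up to the common ancestor and down to mention_b."""
    parser = _PathCollector()
    parser.feed(html)
    parser.close()

    def find(mention):
        for text, path in parser.text_paths:
            if mention in text:
                return path
        return None

    path_a, path_b = find(mention_a), find(mention_b)
    if path_a is None or path_b is None:
        return None
    common = 0  # length of the shared prefix of both paths
    while (common < min(len(path_a), len(path_b))
           and path_a[common] == path_b[common]):
        common += 1
    up = [tag for tag, _ in reversed(path_a[common:])]
    ancestor = path_a[common - 1][0] if common else "#document"
    down = [tag for tag, _ in path_b[common:]]
    return up + [ancestor] + down


row = "<table><tr><td><a>Walter</a></td><td>Breaking Bad</td></tr></table>"
print(dom_path_between(row, "Walter", "Breaking Bad"))
```

Pairs of entities that recur with the same path across a page (for example, in successive rows of a list or table) would thus share a feature value that a classifier could exploit.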

Legalo (2014) [251] applies Relation Extraction based on the hyperlinks of webpages that describe entities, with the intuition that the anchor text (or more generalized context) of the hyperlink will contain textual hints about the relation between both entities. More specifically, a frame-based representation of the textual context of the hyperlink is extracted and linked with a KB; next, to create a label for a direct binary relation (an RDF triple), rules are applied on the frame-based representation to concatenate labels on the shortest path, adding event and role tags. The label is then linked to properties in existing vocabularies.

LIEGE (2012) [277] (Link the entIties in wEb lists with the knowledGe basE) performs EEL with respect to YAGO and Wikipedia over the text elements of HTML lists embedded in webpages. The authors propose specific features for disambiguation in the context of such lists where, in particular, the main assumption is that the entities appearing in an HTML list will often correspond to the same concept; this intuition is captured with a similarity-based measure that, for a given list, computes the distance of the types of candidate entities in the class hierarchy of YAGO. Other typical disambiguation features for EEL, such as prior probability, keyword-based similarities between entities, etc., are also applied.

LODIE (2014) [115] proposes a method for using Linked Data to perform enhanced wrapper induction: leveraging the often regular structure of webpages on the same website to extract a mapping that serves to extract information in bulk from all its pages. LODIE then proposes to map webpages to an existing KB to identify the paths in the HTML parse tree that lead to known entities for concepts (e.g., movies), their attributes/relations (e.g., runtime, director), and associated values. These learned paths can then be applied to unannotated webpages on the site to extract further (analogous) information.
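The general idea of learning paths from known entities can be sketched as follows, under strong simplifying assumptions: paths are encoded as chains of tag names (qualified by the class attribute where present), known entities are matched by exact label, and the single most frequent path is kept. None of these choices, nor the function names, are taken from LODIE itself.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _TextPaths(HTMLParser):
    """Collects (path, text) pairs; a path is the chain of open tags."""

    def __init__(self):
        super().__init__()
        self.stack, self.items = [], []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.items.append(("/".join(self.stack), data.strip()))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_path(pages, kb_labels):
    """Return the path that most often leads to a label known in the KB."""
    votes = Counter(path for html in pages
                    for path, text in text_paths(html) if text in kb_labels)
    return votes.most_common(1)[0][0] if votes else None


def apply_path(html, path):
    """Extract the text found under a learned path on an unseen page."""
    return [text for p, text in text_paths(html) if p == path]


# Hypothetical pages from one website; two titles are known to the KB.
pages = ['<div><h1 class="title">Breaking Bad</h1><p>AMC</p></div>',
         '<div><h1 class="title">Fallen</h1><p>ABC</p></div>']
path = learn_path(pages, {"Breaking Bad", "Fallen"})
unseen = '<div><h1 class="title">Sneaky Pete</h1><p>Amazon</p></div>'
print(path, apply_path(unseen, path))
```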

RathoreR (2014) [282] focus on Topic Modeling for webpages guided by a reference ontology. The overall process involves applying Keyphrase Extraction over the textual content of the webpage, mapping the keywords to an ontology, and then using the ontology to decide the topic. However, the authors propose to leverage the structure of HTML, where keywords extracted from the title, the meta-tags or the section-headers are analyzed first; if no topic is found, the process resorts to using keywords from the body of the document.

YAGO (2007) [139] is another major initiative for extracting information from Wikipedia in order to create a Semantic Web KB. Most information is extracted from info-boxes, but also from categories, titles, etc. The system also combines information from GeoNames, which provides geographic context; and WordNet, which allows for extracting cleaner taxonomies from Wikipedia categories. A distinguishing aspect of YAGO2 is the ability to capture temporal information as a first-class dimension of the KB, where entities and relations/attributes are associated with a hierarchy of properties denoting start/end dates.

It is interesting to note that KBs such as DBpedia [171] and YAGO2 [139] – used in so many of the previous IE works discussed throughout the survey – are themselves the result of IE processes, particularly over Wikipedia. This highlights something of a “snowball effect”, where as IE methods improve, new KBs arise, and as new KBs arise, IE methods improve.⁷⁰

⁷⁰ Though of course, we should not underestimate the value of Wikipedia itself as a raw source for IE tasks.

5.2. Tables

Tabular data are common on the Web, where HTML tables embedded in webpages are plentiful and often contain rich, semi-structured, factual information [35,65]. Hence, extracting information from such tables is indeed a tempting prospect. However, web tables are primarily designed with human readability in mind rather than machine readability. Web tables, while numerous, can thus be highly heterogeneous and idiosyncratic: even tables describing similar content can vary widely in terms of structuring that content [65]. More specifically, the following complications arise when trying to extract information from such tables:

Although Web tables are easy to identify (using the <table> HTML tag), many Web tables are used purely for layout or other presentational purposes (e.g., navigational sidebars, forms, etc.); thus, a preprocessing step is often required to isolate factual tables from HTML [35].

Even tables containing factual data can vary greatly in structure: they may be “transposed”, or may simply list attributes in one column and values in another, or may represent a matrix of values. Sometimes a further subset – called “relational tables” [35] – is thus extracted, where the table contains a column header, with subsequent rows comprising tuples in the relation.

Even relational tables may contain irregular structure, including cells with multiple rows separated by an informal delimiter (e.g., a comma), nested tables as cell values, merged cells with vertical and/or horizontal orientation, tables split into various related sections, and so forth [245].

Although column headers can be identified as such using <th> HTML tags, there is no fixed schema: for example, columns may not always have a fixed domain of values, there may be no obvious primary key or foreign keys, there may be hierarchical (i.e., multi-row) headers, etc.

Column names and cell values often lack clear identifiers or typing: Web tables often contain potentially ambiguous human-readable labels.

There have thus been numerous works on extracting information from tables, sometimes referred to as table interpretation, table annotation, etc. (e.g., [35,45,65,245,307,318], to name some prominent works). The goal of such works is to interpret the implicit structure of tables so as to categorize them for search; or to integrate the information they contain and enable performing joins over them, be it to extend tables with information from other tables, or extracting the information to an external unified representation that can be queried.

More recently, a variety of approaches have emerged using Semantic Web KBs as references to help with extracting information from tables (sometimes referred to as semantic table interpretation, semantic table annotation, etc.). We discuss such approaches herein.

Process: While proposed approaches vary significantly, more generally, given a table and a KB, such works aim to link tables/columns to KB classes, link columns or tuples of columns to KB properties, and link individual cells to KB entities. The aim can then be to annotate the table with respect to the KB (useful for, e.g., later integrating or retrieving tables), and indeed to extract novel entities or relations from the table to further populate the KB. Hence we consider this an IE scenario. While methods discussed previously for IE over unstructured sources can be leveraged for tables, the presence of a tabular structure does suggest the applicability of novel features for the IE process. For example, one might expect in some tables to find that elements of the same column pertain to the same type, or pairs of entities on the same row to have a similar relation as analogous pairs on other rows. On the other hand, cells in a table have a different textual context, which may be the caption, the text referring to the table, etc., rather than the surrounding text; hence, for example, distributional approaches intended for text may not be directly applicable for tables.
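The intuition that cells of the same column tend to share a type can be sketched as a simple majority vote over the KB classes of the entities linked in that column. The function name, the support threshold and the class labels are hypothetical; the systems discussed below use considerably richer features and inference.

```python
from collections import Counter


def column_types(rows, entity_types, min_support=0.5):
    """Assign each column the KB class shared by most of its linked cells.

    `rows` is a list of equal-length tuples of cell values; `entity_types`
    maps a cell value to the KB classes of the entity it was linked to.
    Columns whose best class covers too few cells are left untyped (None).
    """
    result = []
    for column in zip(*rows):
        votes = Counter()
        for cell in column:
            votes.update(entity_types.get(cell, ()))
        if not votes:
            result.append(None)
            continue
        best_class, count = votes.most_common(1)[0]
        result.append(best_class if count / len(column) >= min_support else None)
    return result


# Hypothetical linking result for two columns of a table.
rows = [("Hal", "Malcolm in the Middle"),
        ("Walter", "Breaking Bad"),
        ("Vince", "Sneaky Pete")]
types = {"Malcolm in the Middle": ["TVSeries"],
         "Breaking Bad": ["TVSeries"],
         "Sneaky Pete": ["TVSeries"],
         "Walter": ["FictionalCharacter"]}
print(column_types(rows, types))
```

The first column is left untyped here because only one of its three cells could be linked, illustrating how short and ambiguous labels limit what a purely cell-based vote can achieve.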

Table 8
Overview of information extraction systems for Web tables. EEL and REL denote the entity extraction & linking and relation extraction & linking strategies used; Annotation denotes the elements of the table considered by the approach (P: Protagonist, E: Entities, S: Subject column, T: Column types, R: Relations, T′: Table type); KB denotes the reference KB used (WDC: WebDataCommons, BTC: Billion Triple Challenge 2014); ‘—’ denotes no information found, not used or not applicable

System	Year	EEL	REL	Annotation	KB	Domain
AIDA [140]	2011	AIDA	—	E	YAGO	Wikipedia
DRETa [219]	2014	Wikilinks	Features	PER	DBpedia	Wikipedia
Knowledge Vault [86]	2014	—	Features	ER	Freebase	Web
LimayeSC [176]	2010	Keyword	Features	ETR	YAGO	Wikipedia, Web
MSJ Engine [172]	2015	—	—	EST	WDC, BTC	Web
MulwadFJ [218]	2013	Keyword	Features	ETR	DBpedia, YAGO	Wikipedia, Web
ONDINE [30]	2013	Keyword	Features	ETR	Custom ontology	Microbes, Chemistry, Aeronautics
RitzeB [262]	2017	Various	Features	ESRT′	DBpedia	Web
TabEL [21]	2015	String-based	—	E	YAGO	Wikipedia
TableMiner⁺ [327]	2017	Various	Features	ESTR	Freebase	Wikipedia, Movies, Music
ZwicklbauerEGS [335]	2015	—	—	T	DBpedia	Wikipedia

Example: Consider an HTML table embedded in a webpage about the actor Bryan Cranston as follows:

Character	Series	Network	Ep.
Uncle Russell	Raising Miranda	CBS	9
Hal	Malcolm in the Middle	Fox	151
Walter	Breaking Bad	AMC	62
Lucifer Light Bringer	Fallen	ABC	4
Vince	Sneaky Pete	Amazon	10

We see that the table contains various entities, and that entities in the same column tend to correspond to a particular type. We also see that entities on each row often have implicit relations between them, organized by column; for example, on each row, there are binary relations between the elements of the Character and Series columns, the Series and Network columns, and (more arguably in the case that multiple actors play the same character) between the Character and Ep. columns. Furthermore, we note that some relations exist from Bryan Cranston – the subject of the webpage – to the elements of various columns of the table.⁷¹ The approaches we enumerate here attempt to identify entities in table cells, assign types to columns, extract binary KB relations across columns, and so forth.

⁷¹ In fact, we could consider each tuple as an n-ary relation involving Bryan Cranston; however, this goes more towards a Direct Mapping representation of the table [8,83]; rather the methods we discuss focus on extraction of binary relations.

However, we also see some complications in the table structure, where some values span multiple cells. While this particular issue is relatively trivial to deal with – where simply duplicating values into each spanned cell is effective [245] – a real-world collection of (HTML) tables may exhibit further such complications; here we gave a relatively clean example.
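The duplication of spanned values can be sketched as follows, assuming each cell has already been parsed into a (text, rowspan, colspan) triple; this input format and the function name are our own assumptions for illustration.

```python
def expand_spans(rows):
    """Duplicate spanned values into every cell they cover.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) triple.
    Returns a rectangular list of lists of strings ("" for missing cells).
    """
    grid = {}  # (row index, column index) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a span from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


# Hypothetical fragment in which one Network cell spans two rows.
fragment = [
    [("Series A", 1, 1), ("Network X", 2, 1)],
    [("Series B", 1, 1)],
]
for row in expand_spans(fragment):
    print(row)
```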

Systems: We now discuss works that aim to extract entities, concepts or relations from tables, using Semantic Web KBs. We also provide an overview of these works in Table 8.⁷²

⁷² We also note that many such works were covered by the recent survey of Ristoski and Paulheim [261], but with more of an emphasis on data mining aspects. We are interested in such papers from a related IE perspective where raw entities/concepts/relations are extracted; hence they are also included here for completeness.

AIDA (2011) [ 140 ]

is primarily an Entity Linking tool (discussed in more detail previously in Section 2), but it provides parsers for extracting and linking entities in HTML tables; however, no table-specific features are discussed in the paper.

DRETa (2014) [ 219 ]

aims to extract relations in the form of DBpedia triples from Wikipedia’s tables. The process uses internal Wikipedia hyperlinks in tables to link cells to DBpedia entities. Relations are then analyzed on a row-by-row basis, where an existing relation in DBpedia between two entities in one row is postulated as a candidate relation for pairs of entities in the corresponding columns of other rows; implicit relations from the entity of the article containing the table and the entities in each column of the table are also considered for generating candidate relations. These relations – extracted as DBpedia triples – are then filtered using classifiers that consider a range of features for the source cells, columns, rows, headers, etc., thus generating the final triples.

Knowledge Vault (2014) [ 86 ]

extracts relations from 570 million Web tables. First, an EEL process is applied to identify entities in a given table. Next, these entities are matched to Freebase and compared with existing relations. These relations are then proposed as candidate relations between the two columns of the table in question. Thereafter, ambiguous columns are discarded with respect to the existing KB relations and extracted facts are assigned a confidence score based on the EEL process. A total of 9.4 million Freebase facts are ultimately extracted in the final result.

LimayeSC (2010) [ 176 ]

propose a probabilistic model that, given YAGO as a reference KB and a Web table as input, simultaneously assigns entities to cells, types to columns, and relations to pairs of columns. The core intuition is that the assignment of a candidate to one of these three aspects affects the assignment of the other two, and hence a collective assignment can boost accuracy. A variety of features are thus defined over the table in relation to YAGO, over which joint inference is applied to optimize a collective assignment.

MSJ Engine (2015) [ 172 ]

(Mannheim Search Join Engine) aims to extend a given input (HTML) table with additional attributes (columns) and associated values (cells) using a reference data corpus comprising Linked Data KBs and other tables. The engine first identifies a “subject” column of the input table deemed to contain the names of the primary entities described; the datatype (domain) of other columns is then identified. This meta-description is used to search for other data with the same entities using information retrieval techniques. Thereafter, retrieved tables are (left-outer) joined with the input table based on a fuzzy match of columns, using the attribute names, ontological hierarchies and instance overlap measures.

MulwadFJ (2013) [ 218 ]

aim to annotate tables with respect to a reference KB by linking columns to classes, cells to (fresh) entities or literals, and pairs of columns to properties denoting their relation. The KB that they consider combines DBpedia, YAGO and Wikipedia. Candidate entities are derived using keyword search on the cell value and surrounding values for context; candidate column classes are taken as the union of all classes in the KB for candidate entities in that column; candidate relations for pairs of columns are chosen based on existing KB relations between candidate entities in those columns; thereafter, a joint inference step is applied to select a suitable collective assignment of cell-to-entity, column-to-class and column-pair-to-property mappings.

ONDINE (2013) [ 30 ]

uses specialized ontologies to guide the annotation and subsequent extraction of information from Web tables. A core ontology encodes general concepts, unit concepts for quantities, and relations between concepts. On the other hand, a domain ontology is used to capture a class hierarchy in the domain of extraction, where classes are associated with labels. Table columns are then categorized by the ontology classes and tuples of columns are categorized by ontology relations, using a combination of cosine-similarity matching on the column names and the column values. Fuzzy sets are then used to represent a given annotation, encoding uncertainty, with an RDF-based representation used to represent the result. The extracted fuzzy information can then be queried using SPARQL.

RitzeB (2017) [ 262 ]

enumerate and evaluate a variety of features that can be brought to bear for extracting information from tables. They consider a taxonomy of features that covers: features extracted from the table itself, including from a single (header/value) cell, or multiple cells; and features extracted from the surrounding context of the table, including page attributes (e.g., title) or free text. Using these features, they then consider three matching tasks with respect to DBpedia and an input table: row-to-entity, column-to-property, and table-to-class, where various linking strategies are defined. The scores of these matchers are then aggregated and tested against a gold standard to determine the usefulness of individual features, linking strategies and aggregation metrics on the precision/recall of the resulting assignments.

TabEL (2015) [ 21 ]

focuses on the task of EEL for tables with respect to YAGO, where they begin by applying a standard EEL process over cells: extracting mentions and generating candidate KB identifiers. Multiple entities can be extracted per cell. Thereafter, various features are assigned to candidates, including prior probabilities, string similarity measures, and so forth. However, they also include special features for tables, including a repetition feature to check if the mention has been linked elsewhere in the table and also a measure of semantic similarity for entities assigned to the same row or table; these features are encoded into a model over which joint inference is applied to generate a collective assignment.

TableMiner⁺ (2017) [ 327 ]

annotates tables with respect to Freebase by first identifying a subject column considered to contain the names of the entities being primarily described. Next, a learning phase is applied on each entity column (distinguished from columns containing datatype values) to annotate the column and the entities it contains; this process can involve sampling of values to increase efficiency. Next, an update/refinement phase is applied to collectively consider the (keyword-based) similarity across column annotations. Relations are then extracted from the subject column to other columns based on existing triples in the KB and keyword similarity metrics.

ZwicklbauerEGS (2013) [ 335 ]

focus on the problem of assigning a DBpedia type to each column of an input table. The process involves three steps. First, a set of candidate identifiers is extracted for each cell. Next, the types (both classes and categories) are extracted from each candidate. Finally, for a given column, the type most frequently extracted for the entities in its cells is assigned as the type for that column.
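The three-step column-typing process just described can be sketched in a few lines of Python. This is an illustrative reading only, not the authors' implementation: the dictionaries `candidates` and `types_of`, the `ex:` identifiers, the choice to count each type at most once per cell, and the alphabetical tie-break are all assumptions.

```python
from collections import Counter


def type_column(cells, candidates, types_of):
    """Assign a type to a column by majority vote over its cells.

    cells:      list of cell strings in the column
    candidates: dict mapping a cell string to candidate KB identifiers
    types_of:   dict mapping a KB identifier to its types
    All three structures are hypothetical stand-ins for KB lookups.
    """
    votes = Counter()
    for cell in cells:
        cell_types = set()
        # Step 1: candidate identifiers for the cell.
        for entity in candidates.get(cell, []):
            # Step 2: types of each candidate.
            cell_types.update(types_of.get(entity, []))
        # Assumption: each type is counted at most once per cell.
        votes.update(cell_types)
    if not votes:
        return None
    # Step 3: most frequent type; ties are broken alphabetically (assumption).
    return min(votes, key=lambda t: (-votes[t], t))


# Toy data with hypothetical identifiers.
toy_cells = ["Breaking Bad", "Fallen", "Sneaky Pete"]
toy_candidates = {
    "Breaking Bad": ["ex:Breaking_Bad"],
    "Fallen": ["ex:Fallen_(series)", "ex:Fallen_(album)"],
    "Sneaky Pete": ["ex:Sneaky_Pete"],
}
toy_types = {
    "ex:Breaking_Bad": ["ex:TelevisionShow"],
    "ex:Fallen_(series)": ["ex:TelevisionShow"],
    "ex:Fallen_(album)": ["ex:Album"],
    "ex:Sneaky_Pete": ["ex:TelevisionShow"],
}
column_type = type_column(toy_cells, toy_candidates, toy_types)
```

In this toy example the ambiguous cell contributes votes to two types, but the column-level vote still favours the type shared with the other cells.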

Summary: We see that custom IE processes dedicated to tabular input formats using Semantic Web KBs form a burgeoning but still relatively recent area of research; techniques combine traditional IE methods, as described previously, with novel low-level table-specific features and high-level global inference models that capture the dependencies in linking between different columns of the same table, different cells of the same column or row, etc.

Also, approaches vary in what they annotate. For example, while Zwicklbauer et al. [335] focus on typing columns, and AIDA [140] and TabEL [21] focus on annotating entities, most works annotate various aspects of the table, in particular for the purposes of extracting relations. Amongst those approaches extracting relations, we can identify an important distinction: those that begin by identifying a subject column to which all other relations extend [172,262,327], and those that rather extract relations between any pair of columns in the table [30,86,176,218,219]. All approaches that we found for Relation Extraction, however, rely on extracting a set of features and then applying machine learning methods to classify likely-correct relations; similarly, almost all approaches rely on a “distant supervision” style algorithm, where seed relations in the KB appearing in rows of the table are used as a feature to identify candidate relations between column pairs. In terms of other annotations, we note that DRETa [219] extracts the protagonist of a table as the main entity that the containing webpage is about (considered an entity with possible relations to entities in the table), while Ritze and Bizer [262] extract a type for each table that is based on the type(s) of entities in the subject column.
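To make the “distant supervision” idea concrete, the following Python sketch counts how often an existing KB relation holds between the entities of two columns on the same row, and then proposes that relation for the other rows. It is a simplified illustration under stated assumptions: cells are assumed to be already linked to entity identifiers (with `None` for unlinked cells), the table is assumed rectangular, the `ex:` identifiers are hypothetical, and the classifier-based filtering used by the surveyed systems is omitted.

```python
from collections import Counter
from itertools import permutations


def candidate_relations(rows, kb):
    """Count, per ordered column pair, the KB properties that already hold
    between the entities of those columns on some row (the seed relations).

    rows: list of rows, each a list of entity identifiers or None
    kb:   set of (subject, property, object) triples
    """
    props_by_pair = {}
    for s, p, o in kb:
        props_by_pair.setdefault((s, o), set()).add(p)
    support = Counter()
    for row in rows:
        for i, j in permutations(range(len(row)), 2):
            s, o = row[i], row[j]
            if s is None or o is None:
                continue
            for p in props_by_pair.get((s, o), ()):
                support[(i, p, j)] += 1
    return support


def propose_triples(rows, support, min_support=1):
    """Apply each sufficiently supported (column, property, column) pattern
    to every row, yielding candidate triples (unfiltered)."""
    proposed = set()
    for (i, p, j), n in support.items():
        if n < min_support:
            continue
        for row in rows:
            if row[i] is not None and row[j] is not None:
                proposed.add((row[i], p, row[j]))
    return proposed


# Toy data with hypothetical identifiers.
toy_rows = [
    ["ex:Walter", "ex:Breaking_Bad", "ex:AMC"],
    ["ex:Vince", "ex:Sneaky_Pete", "ex:Amazon"],
]
toy_kb = {("ex:Breaking_Bad", "ex:network", "ex:AMC")}
toy_support = candidate_relations(toy_rows, toy_kb)
new_triples = propose_triples(toy_rows, toy_support) - toy_kb
```

Here a single seed triple in the KB is enough to propose the same property for the second row; in practice, the support counts would serve as one feature among many for a classifier that decides which candidates to keep.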

5.3. Other formats

Information Extraction has also been applied to various other formats in conjunction with Semantic Web KBs and/or ontologies. Amongst these, a number of works have proposed specialized EEL techniques for multimedia formats, including approaches for performing EEL with respect to images [17], audio (speech) [19,253], and video [175,204,310]. Other works have focused on IE techniques in the context of social platforms, such as for Twitter [79,320,322], tagging systems [160,287], or for other user-generated content, such as keyword search logs [63], etc.

Techniques inspired by IE have also been applied to structured input formats, including Semantic Web KBs themselves. For example, a variety of approaches have recently been proposed to model topics for Semantic Web KBs, either to identify the main topics within a KB, or to identify related KBs [25,240,267,283]. Given that such methods apply to structured input formats, these works veer away from pure Information Extraction and head more towards the related areas of Data Mining and Knowledge Discovery – as discussed already in a recent survey by Ristoski and Paulheim [261] – where the goal is to extract high-level patterns from data for applications including KB refinement, recommendation tasks, clustering, etc. We thus consider such works as outside the current scope.

6. Discussion

In this survey, we have discussed a wide variety of works that lie at the intersection of the Information Extraction and Semantic Web areas. In particular, we discussed works that extract entities, concepts and relations from unstructured and semi-structured sources, linking them with Semantic Web KBs/ontologies.

Trends: The works that we have surveyed span almost two decades. Interpreting some trends from Tables 3, 5, 6, 7 & 8, we see that earlier works (prior to ca. 2009) in this intersection related more specifically to Information Extraction tasks that were either intended to build or populate domain-specific ontologies, or were guided by such ontologies. Such ontologies were assumed to model the conceptual domain under analysis but typically without providing an extensive list of entities; as such, traditional IE methods were used involving NER of a limited range of types, machine-learning models trained over manually-labeled corpora, handcrafted linguistic patterns and rules to bootstrap extraction, generic linguistic resources such as WordNet for modeling word sense/hypernyms/synsets, deep parsing, and so forth.

However, post 2009, we notice a shift towards using general-domain KBs – DBpedia, Freebase, YAGO, etc. – that provide extensive lists of entities (with labels and aliases), a wide variety of types and categories, graph-structured representations of cross-domain knowledge, etc. We also see a related trend towards more statistical, data-driven methods. We posit that this shift is due to two main factors: (i) the expansion of Wikipedia as a reference source for general domain knowledge – and related seminal works proposing its exploitation for IE tasks – which, in turn, naturally translate into using KBs such as DBpedia and YAGO extracted from Wikipedia; (ii) advancement in statistical NLP techniques that emphasize understanding of language through relatively shallow analyses of large corpora of text (for example, techniques based on the distributional hypothesis) rather than use of manually crafted patterns, training over labeled resources, or deep linguistic parsing. Of course, we also see works that blend both worlds, making the most of both linguistic and statistical techniques in order to augment IE processes.

Another general trend we have observed is one towards more “holistic” methods – such as collective assignment, joint models, etc. – that consider the interdependencies implicit in extracting increasingly rich machine-readable information from text. On the one hand, we can consider intra-task dependencies being modeled where, for example, linking one entity mention to a particular KB entity may affect how other surrounding entities are linked. On the other hand, more and more in the recent literature we can see inter-task dependencies being modeled, where the tasks of NER and EEL [184,231], or WSD and EEL [145,216], or EEL and REL [11], etc., are seen as interdependent. We see this trend of jointly modeling several interrelated aspects of IE as set to continue, following the idea that improving IE methods requires looking at the “bigger picture” and not just one aspect in isolation.

Communities: In terms of the 109 highlighted papers in this survey for EEL, CEL, REL and Semi-Structured Inputs – i.e., those papers referenced in the first columns of Tables 3, 5, 6, 7 & 8 – we performed a meta-analysis of the venues (conferences or journals) at which they were published, and the primary area(s) associated with that venue. The results are compiled in Table 9, showing 18 (of 55) venues with at least two such papers; for compiling these results, we count workshops and satellite events under the conference with which they were co-located. While Semantic Web venues top the list, we notice a significant number of papers in venues associated with other areas.

Table 9

Top venues where highlighted papers are published. Venue denotes publication series, Area(s) denotes the primary CS area(s) of the venue; E/C/R/S denote counts of highlighted papers in this survey relating to Entities, Concepts, Relations and Semi-Structured input, resp.; Σ denotes the sum of E + C + R + S

Venue	Area(s)	E	C	R	S	Σ
ISWC	SW	3	2	2	2	9
Sem. Web J.	SW			4	2	6
ACL	NLP	1		4		5
EKAW	SW	1	1	1	2	5
EMNLP	NLP	2	1	2		5
ESWC	SW	2	1	1		4
J. Web Sem.	SW	1	1	1	1	4
WWW	Web	3		1		4
Int. Sys.	AI		2	1		3
WSDM	DM/Web		1	1	1	3
AIRS	IR			2		2
CIKM	DB/IR	2				2
SIGKDD	DM			1	1	2
JASIST	Other		2			2
NLDB	NLP/DB		2			2
OnTheMove	DB/SW		1		1	2
PVLDB	DB				2	2
Trans. ACL	NLP	2				2

In order to perform a higher-level analysis of the areas from which the highlighted works have emerged, we mapped venues to areas (as shown for the venues in Table 9). In some cases the mapping from venues to areas was quite clear (e.g., ISWC → Semantic Web), while in others we chose to assign two main areas to a venue (e.g., WSDM → Web/Data Mining). Furthermore, we assigned venues in multidisciplinary or otherwise broader areas (e.g., Information Science) to a general classification: Other. Table 10 then aggregates the areas in which all highlighted papers were published; in the case that a paper is published at a venue assigned to two areas, we count the paper as +0.5 in each area. The table is ordered by the total number of highlighted papers published. In this analysis, we see that while the plurality of papers come from the Semantic Web community, the majority (roughly two-thirds) do not, with many coming from the NLP, AI and DB communities, amongst others. We can also see, for example, that NLP papers tend to focus on unstructured inputs, while Database and Data Mining papers rather tend to target semi-structured inputs.
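The fractional counting scheme described above can be illustrated with a short Python sketch. The function name `aggregate_by_area` and the toy inputs are assumptions for illustration; splitting a paper's weight equally across a venue's areas reduces to the +0.5 rule when a venue has two areas.

```python
from collections import defaultdict


def aggregate_by_area(paper_venues, venue_areas):
    """Aggregate paper counts per area.

    paper_venues: list with one venue name per paper
    venue_areas:  dict mapping a venue name to its list of areas
    A paper at a venue with two areas contributes 0.5 to each area.
    Returns (area, total) pairs ordered by decreasing total, then by name.
    """
    totals = defaultdict(float)
    for venue in paper_venues:
        areas = venue_areas[venue]
        share = 1.0 / len(areas)
        for area in areas:
            totals[area] += share
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy input (not the survey's actual paper list).
toy_venue_areas = {"ISWC": ["SW"], "WSDM": ["DM", "Web"], "ACL": ["NLP"]}
toy_papers = ["ISWC", "ISWC", "WSDM", "ACL"]
toy_totals = aggregate_by_area(toy_papers, toy_venue_areas)
```

Because each paper contributes a total weight of one, the per-area totals always sum to the number of papers, which is why the column totals in Table 10 match the overall paper counts.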

Most generally, we see that works developing Information Extraction techniques in a Semantic Web context have been pursued within a variety of communities; in other words, the use of Semantic Web KBs has become popular in a variety of other (non-SW) research communities interested in Information Extraction.

Table 10

Top areas where highlighted papers are published. E/C/R/S denote counts of highlighted papers in this survey relating to Entities, Concepts, Relations and Semi-Structured input, resp.; Σ denotes the sum of E + C + R + S

Area	E	C	R	S	Σ
Semantic Web (SW)	9	12.5	11	8.5	41
Nat. Lang. Proc. (NLP)	6	5	7	1	19
Art. Intelligence (AI)	4	6	2	1	13
Databases (DB)	1	1.5	2.5	4	9
Other		7		2	9
Information Retr. (IR)	2	2	2		6
Web	3	0.5	1.5	0.5	5.5
Data Mining (DM)		0.5	1.5	3	5
Machine Learning (ML)		1	0.5		1.5
Total	25	36	28	20	109

Final remarks: Our goal with this work was to provide not only a comprehensive survey of literature in the intersection of the Information Extraction and Semantic Web areas, but also to – insofar as possible – offer an introductory text to those new to the area.

Hence we have focused on providing a survey that is as self-contained as possible, including a primer on traditional IE methods, and thereafter an overview on the extraction and linking of entities, concepts and relations, both for unstructured sources (the focus of the survey), as well as an overview of such techniques for semi-structured sources. In general, methods for extracting and linking relations, for example, often rely on methods for extracting and linking entities, which in turn often rely on traditional IE and NLP techniques. Along similar lines, techniques for Information Extraction over semi-structured sources often rely heavily on similar techniques used for unstructured sources. Thus, aside from providing a literature survey for those familiar with such areas, we believe that this survey also offers a useful entry-point for the uninitiated reader, spanning all such interrelated topics.

Likewise, as previously discussed, the relevant literature has been published by various communities, using sometimes varying terminology and techniques, with different perspectives and motivation, but often with a common underlying (technical) goal. By drawing together the literature from different communities, we hope that this survey will help to bridge such communities and to offer a broader understanding of the research literature at this now busy intersection where Information Extraction meets the Semantic Web.

Acknowledgements

This work was funded in part by the Millennium Institute for Foundational Research on Data (IMFD) and Fondecyt, Grant No. 1181896. We would also like to thank the reviewers as well as Henry Rosales-Méndez and Ana B. Rios-Alvarado for their helpful comments on the survey.

Primer: Traditional information extraction

Information Extraction (IE) refers to the automatic extraction of implicit information from unstructured or semi-structured data sources. Along these lines, IE methods are used to identify entities, concepts and/or semantic relations that are not otherwise explicitly structured in a given source. IE is not a new area and dates back to the origins of Natural Language Processing (NLP), where it was seen as a use-case of NLP: to extract (semi-)structured data from text. Applications of IE have broadened in recent years, particularly in the context of the Web, including the areas of Knowledge Discovery, Information Retrieval, etc.

To keep this survey self-contained, in this appendix, we will offer a general introduction to traditional IE techniques as applied to primarily textual sources. Techniques can vary widely depending on the type of source considered (short strings, documents, forms, etc.), the available reference information considered (databases, labeled data, tags, etc.), expected results, and so forth. Rather than cover the full diversity of methods that can be found in the literature – for which we rather refer the reader to a dedicated survey such as that provided by Sarawagi [274] – our goal will be to cover core tasks and concepts found in traditional IE pipelines, as are often (re)used by works in the context of the Semantic Web. We will also focus primarily on English-centric examples and tools, though much of the discussion generalizes (assuming the availability of appropriate resources) to other languages, which we discuss as appropriate.

References

1. Abedini, Mahmoudi and A.H. Jadidinejad, From text to knowledge: Semantic entity extraction using YAGO ontology, International Journal of Machine Learning and Computing 1(2) (2011), 113–119. doi:10.7763/IJMLC.2011.V1.17.

2. Adrian, Hees, Herman, Sintek and Dengel, Epiphany: Adaptable RDFa generation linking the Web of Documents to the Web of Data, in: Knowledge Engineering and Management by the Masses – 17th International Conference, EKAW, Cimiano and H.S. Pinto, eds, Springer, 2010, pp. 178–192. doi:10.1007/978-3-642-16438-5_13.

3. Akalya and Sherine, Term recognition and extraction based on semantics for ontology construction, International Journal of Computer Science Issues IJCSI 9(2) (2012), 163–169.

4. Alani, Kim, D.E. Millard, M.J. Weal, Hall, P.H. Lewis and Shadbolt, Automatic ontology-based knowledge extraction from Web documents, IEEE Intelligent Systems 18(1) (2003), 14–21. doi:10.1109/MIS.2003.1179189.

5. Allahyari and Kochut, Automatic topic labeling using ontology-based topic models, in: 14th IEEE International Conference on Machine Learning and Applications ICMLA, Tao, L.A. Kurgan, Palade, Goebel, Holzinger, Verspoor and M.A. Wani, eds, IEEE, 2015, pp. 259–264. doi:10.1109/ICMLA.2015.88.

6. L.E. Anke, Camacho-Collados, C.D. Bovi and Saggion, Supervised distributional hypernym discovery via domain adaptation, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), Su, Carreras and Duh, eds, ACL, 2016, pp. 424–435.

7. L.E. Anke, Saggion, Ronzano and Navigli, ExTaSem! Extending, taxonomizing and semantifying domain terminologies, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Dale and M.P. Wellman, eds, AAAI, 2016, pp. 2594–2600.

8. Arenas, Bertails, Prud’hommeaux and Sequeda, A Direct Mapping of Relational Data to RDF. W3C Recommendation, 2012, https://www.w3.org/TR/rdb-direct-mapping/.

9. Aubin and Hamon, Improving term extraction with terminological resources, in: Advances in Natural Language Processing, 5th International Conference on NLP, FinTAL, Salakoski, Ginter, Pyysalo and Pahikkala, eds, Springer, 2006, pp. 380–387. doi:10.1007/11816508.

10. Augenstein, Web Relation Extraction with Distant Supervision, PhD Thesis, The University of Sheffield, 2016, http://etheses.whiterose.ac.uk/13247/.

11. Augenstein, Maynard and Ciravegna, Distantly supervised Web relation extraction for knowledge base population, Semantic Web 7(4) (2016), 335–349. doi:10.3233/SW-150180.

12. Augenstein, Padó and Rudolph, LODifier: Generating Linked Data from unstructured text, in: The Semantic Web: Research and Applications – 9th Extended Semantic Web Conference, ESWC, Simperl, Cimiano, Polleres, Ó. Corcho and Presutti, eds, Springer, 2012, pp. 210–224. doi:10.1007/978-3-642-30284-8_21.

13. Bach and Badaskar, A review of relation extraction, in: Literature Review for Language and Statistics II, 2, 2007.

14. C.F. Baker, C.J. Fillmore and J.B. Lowe, The Berkeley FrameNet project, in: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics COLING-ACL, Boitet and Whitelock, eds, Morgan Kaufmann Publishers/ACL, 1998, pp. 86–90.

15. Baldridge, The OpenNLP project, 2010, http://opennlp.apache.org/index.html.

16. Banko, M.J. Cafarella, Soderland, Broadhead and Etzioni, Open information extraction from the Web, in: International Joint Conference on Artificial Intelligence (IJCAI), M.M. Veloso, ed., 2007.

17. Bartolini, Giovannetti, Marchi, Montemagni, Andreatta, Brunelli, Stecher and Bouquet, Multimedia information extraction in ontology-based semantic annotation of product catalogues, in: Semantic Web Applications and Perspectives (SWAP), Tummarello, Bouquet and Signore, eds, Proceedings of the 3rd Italian Semantic Web Workshop, CEUR-WS.org, 2006.

18. Bechhofer, Yesilada, Stevens, Jupp and Horan, Using ontologies and vocabularies for dynamic linking, IEEE Internet Computing 12(3) (2008), 32–39. doi:10.1109/MIC.2008.68.

19. Benton and Dredze, Entity Linking for spoken language, in: North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mihalcea, J.Y. Chai and Sarkar, eds, ACL, 2015, pp. 225–230.

20. Berners-Lee and Fischetti, Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor, 1st edn, Harper, San Francisco, 1999.

21. C.S. Bhagavatula, Noraset and Downey, TabEL: Entity Linking in Web tables, in: International Semantic Web Conference (ISWC), Arenas, Ó. Corcho, Simperl, Strohmaier, d’Aquin, Srinivas, P.T. Groth, Dumontier, Heflin, Thirunarayan and Staab, eds, Springer, 2015, pp. 425–441. doi:10.1007/978-3-319-25007-6_25.

22. Bird, Klein and Loper, Natural Language Processing with Python, O’Reilly, 2009.

23. D.M. Blei, Probabilistic topic models, Commun. ACM 55(4) (2012), 77–84. doi:10.1145/2133806.2133826.

24. D.M. Blei, A.Y. Ng and M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.

25. Böhm, Kasneci and Naumann, Latent topics in graph-structured data, in: Information and Knowledge Management (CIKM), Chen, Lebanon, Wang and M.J. Zaki, eds, ACM Press, 2012, pp. 2663–2666. doi:10.1145/2396761.2398718.

26. K.D. Bollacker, Evans, Paritosh, Sturge and Taylor, Freebase: A collaboratively created graph database for structuring human knowledge, in: International Conference on Management of Data, (SIGMOD), J.T.-L. Wang, ed., 2008, pp. 1247–1250. doi:10.1145/1376616.1376746.

27. Bordea, Lefever and Buitelaar, SemEval-2016 task 13: Taxonomy extraction evaluation (TExEval-2), in: International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), Bethard, D.M. Cer, Carpuat, Jurgens, Nakov and Zesch, eds, 2016, pp. 1081–1091.

28. Bos, Wide-coverage semantic analysis with Boxer, in: Conference on Semantics in Text Processing, (STEP), Bos and Delmonte, eds, ACL, 2008, pp. 277–286.

29. Brill, A simple rule-based part of speech tagger, in: Applied Natural Language Processing Conference, (ANLP), 1992, pp. 152–155. doi:10.3115/974499.974526.

30. Buche, Dibie-Barthélemy, Ibanescu and Soler, Fuzzy Web data tables integration guided by an ontological and terminological resource, IEEE Trans. Knowl. Data Eng. 25(4) (2013), 805–819. doi:10.1109/TKDE.2011.245.

31. Buitelaar and Magnini, Ontology learning from text: An overview, in: Ontology Learning from Text: Methods, Applications and Evaluation, Vol. 123, IOS Press, 2005, pp. 3–12.

32. Buitelaar, Olejnik and Sintek, A Protégé plug-in for ontology extraction from text based on linguistic analysis, in: The Semantic Web: Research and Applications, First European Semantic Web Symposium, (ESWS), Bussler, Davies, Fensel and Studer, eds, Springer, 2004, pp. 31–44. doi:10.1007/978-3-540-25956-5_3.

33. Bunescu and Pasca, Using encyclopedic knowledge for named entity disambiguation, in: European Chapter of the Association for Computational Linguistics (EACL), McCarthy and Wintner, eds, 2006, pp. 9–16.

34. Busemann, Drozdzynski, Krieger, Piskorski, Schäfer, Uszkoreit and Xu, Integrating information extraction and automatic hyperlinking, in: Annual Meeting of the Association for Computational Linguistics (ACL), Companion Volume to the Proceedings, Funakoshi, Kübler and Otterbacher, eds, 2003, pp. 117–120.

35. M.J. Cafarella, A.Y. Halevy, D.Z. Wang, Wu and Zhang, WebTables: Exploring the power of tables on the Web, PVLDB 1(1) (2008), 538–549. doi:10.14778/1453856.1453916.

36. Cardillo, Roumier, Jamoulle and Vander Stichele, Using ISO and Semantic Web standards for creating a multilingual medical interface terminology: A use case for heart failure, in: International Conference on Terminology and Artificial Intelligence, 2013.

37. Carmel, Chang, Gabrilovich, B.P. Hsu and Wang, ERD’14: Entity Recognition and Disambiguation challenge, in: Conference on Research and Development in Information Retrieval, SIGIR, Geva, Trotman, Bruza, C.L.A. Clarke and Järvelin, eds, ACM, 2014, p. 1292. doi:10.1145/2600428.2600734.

38. Carpenter and Baldwin, Text Analysis with LingPipe 4, LingPipe Publishing, 2011.

39. Ceccarelli, Lucchese, Orlando, Perego and Trani, Dexter: An open source framework for Entity Linking, in: Proceedings of the Sixth International Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR), Co-Located with CIKM, P.N. Bennett, Gabrilovich, Kamps and Karlgren, eds, ACM, 2013, pp. 17–20. doi:10.1145/2513204.2513212.

40. Ceccarelli, Lucchese, Orlando, Perego and Trani, Dexter 2.0 – an open source tool for semantically enriching data, in: International Semantic Web Conference (ISWC), Posters & Demonstrations Track, Horridge, Rospocher and van Ossenbruggen, eds, CEUR-WS.org, 2014, pp. 417–420.

41. Chabchoub, Gagnon and Zouaq, Collective disambiguation and semantic annotation for entity linking and typing, in: Semantic Web Challenges – Third SemWebEval Challenge at (ESWC), Sack, Dietze, Tordai and Lange, eds, Springer, 2016, pp. 33–47. doi:10.1007/978-3-319-46565-4_3.

42. Charton, Gagnon and Ozell, Automatic Semantic Web annotation of named entities, in: Canadian Conference on Artificial Intelligence, C.J. Butz and Lingras, eds, Springer, 2011, pp. 74–85. doi:10.1007/978-3-642-21043-3_10.

43. Chemudugunta, Holloway, Smyth and Steyvers, Modeling documents by combining semantic concepts with unsupervised statistical learning, in: International Semantic Web Conference (ISWC), A.P. Sheth, Staab, Dean, Paolucci, Maynard, T.W. Finin and Thirunarayan, eds, Springer, 2008, pp. 229–244. doi:10.1007/978-3-540-88564-1_15.

44. Chen and C.D. Manning, A fast and accurate dependency parser using neural networks, in: Empirical Methods in Natural Language Processing (EMNLP), Moschitti, Pang and Daelemans, eds, ACL, 2014, pp. 740–750.

45. Chen, Tsai and Tsai, Mining tables from large scale HTML texts, in: International Conference on Computational Linguistics (COLING), Morgan Kaufmann, 2000, pp. 166–172.

46. Chen, J.M. Jose, Yu, Yuan and Zhang, Probabilistic topic modelling with semantic graph, in: Advances in Information Retrieval, European Conference on Information Retrieval (ECIR), Ferro, Crestani, Moens, Mothe, Silvestri, G.M. Di Nunzio, Hauff and Silvello, eds, Springer, 2016, pp. 240–251. doi:10.1007/978-3-319-30671-1_18.

47. Chiu, Shih, Lee, Shao, Cai, Wei and Chen, NTUNLP approaches to recognizing and disambiguating entities in long and short text at the ERD challenge 2014, in: International Workshop on Entity Recognition & Disambiguation (ERD), Carmel, M.-W. Chang, Gabrilovich, B.-J.P. Hsu and Wang, eds, ACM, 2014, pp. 3–12. doi:10.1145/2633211.2634363.

48. Christodoulopoulos, Goldwater and Steedman, Two decades of unsupervised POS induction: How far have we come?, in: Empirical Methods in Natural Language Processing (EMNLP), ACL, 2010, pp. 575–584.

49. Cimiano, Ontology Learning and Population from Text – Algorithms, Evaluation and Applications, Springer, 2006. doi:10.1007/978-0-387-39252-3.

50. Cimiano, Handschuh and Staab, Towards the self-annotating Web, in: World Wide Web Conference (WWW), S.I. Feldman, Uretsky, Najork and C.E. Wills, eds, ACM, 2004, pp. 462–471. doi:10.1145/988672.988735.

51. Cimiano, Hotho and Staab, Comparing conceptual, divisive and agglomerative clustering for learning taxonomies from text, in: European Conference on Artificial Intelligence (ECAI), R.L. de Mántaras and Saitta, eds, IOS Press, 2004, pp. 435–439.

52. Cimiano, Hotho and Staab, Learning concept hierarchies from text corpora using formal concept analysis, J. Artif. Intell. Res. 24 (2005), 305–339. doi:10.1613/jair.1648.

53. Cimiano, J.P. McCrae and Buitelaar, Lexicon model for ontologies: Community report, W3C Final Community Group Report, 2016, https://www.w3.org/2016/05/ontolex/.

54. Cimiano, J.P. McCrae, Rodríguez-Doncel, Gornostay, Gómez-Pérez, Siemoneit and Lagzdins, Linked terminology: Applying Linked Data principles to terminological resources, in: Electronic Lexicography in the 21st Century (eLex), 2015.

55. Cimiano and Völker, Text2Onto: A framework for ontology learning and data-driven change discovery, in: Natural Language Processing and Information Systems, Montoyo, Muñoz and Métais, eds, Springer, Berlin Heidelberg, 2005, pp. 227–238. doi:10.1007/11428817_21.

56. Clark and C.D. Manning, Deep reinforcement learning for mention-ranking coreference models, in: Empirical Methods in Natural Language Processing (EMNLP), Su, Carreras and Duh, eds, The Association for Computational Linguistics, 2016, pp. 2256–2262.

57. Colace, De Santo, Greco, Amato, Moscato and Picariello, Terminological ontology learning and population using latent Dirichlet allocation, Journal of Visual Languages & Computing 25(6) (2014), 818–826. doi:10.1016/j.jvlc.2014.11.001.

58. Colace, De Santo, Greco, Moscato and Picariello, Probabilistic approaches for sentiment analysis: Latent Dirichlet allocation for ontology building and sentiment extraction, in: Sentiment Analysis and Ontology Engineering – an Environment of Computational Intelligence, Pedrycz and Chen, eds, Springer, 2016, pp. 75–91. doi:10.1007/978-3-319-30319-2_4.

59. Collins, Discriminative training methods for Hidden Markov Models: Theory and experiments with Perceptron algorithms, in: Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2002, pp. 1–8.

60. Conde, Larrañaga, Arruarte, J.A. Elorriaga and Roth, litewi: A combined term extraction and Entity Linking method for eliciting educational ontologies from textbooks, Journal of the Association for Information Science and Technology 67(2) (2016), 380–399. doi:10.1002/asi.23398.

61. Corcoglioniti, Rospocher and A.P. Aprosio, Frame-based ontology population with PIKES, IEEE Trans. Knowl. Data Eng. 28(12) (2016), 3261–3275. doi:10.1109/TKDE.2016.2602206.

62. Cornolti, Ferragina and Ciaramita, A framework for benchmarking entity-annotation systems, in: World Wide Web Conference (WWW), Schwabe, V.A.F. Almeida, Glaser, R.A. Baeza-Yates and S.B. Moon, eds, International World Wide Web Conferences Steering Committee/ACM, 2013. doi:10.1145/2488388.2488411.

63. Cornolti, Ferragina, Ciaramita, Rüd and Schütze, A Piggyback system for joint entity mention detection and linking in Web queries, in: World Wide Web Conference (WWW), Bourdeau, Hendler, Nkambou, Horrocks and B.Y. Zhao, eds, ACM, 2016, pp. 567–578. doi:10.1145/2872427.2883061.

64. Coursey, Mihalcea and W.E. Moen, Using encyclopedic knowledge for automatic topic identification, in: Conference on Computational Natural Language Learning (CoNLL), Stevenson and Carreras, eds, ACL, 2009, pp. 210–218.

65. Crestan and Pantel, Web-scale table census and classification, in: Web Search and Web Data Mining (WSDM), King, Nejdl and Li, eds, ACM, 2011, pp. 545–554. doi:10.1145/1935826.1935904.

66. Cucerzan, Large-scale named entity disambiguation based on Wikipedia data, in: Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Eisner, ed., ACL, 2007, pp. 708–716.

67. Cucerzan, Name entities made obvious: The participation in the ERD 2014 evaluation, in: Workshop on Entity Recognition & Disambiguation, ERD, Carmel, Chang, Gabrilovich, B.P. Hsu and Wang, eds, ACM, 2014, pp. 95–100. doi:10.1145/2633211.2634360.

68. Cunningham, GATE, a general architecture for text engineering, Computers and the Humanities 36(2) (2002), 223–254. doi:10.1023/A:1014348124664.

69. da Silva Conrado, Di Felippo, T.A.S. Pardo and S.O. Rezende, A survey of automatic term extraction for Brazilian Portuguese, Journal of the Brazilian Computer Society 20(1) (2014), 1–28. doi:10.1186/1678-4804-20-1.

70. Daciuk, Mihov, B.W. Watson and Watson, Incremental construction of minimal acyclic finite state automata, Computational Linguistics 26(1) (2000), 3–16. doi:10.1162/089120100561601.

71. Daiber, Jakob, Hokamp and P.N. Mendes, Improving efficiency and accuracy in multilingual entity extraction, in: International Conference on Semantic Systems (I-SEMANTICS), Sabou, Blomqvist, Di Noia, Sack and Pellegrini, eds, 2013, pp. 121–124. doi:10.1145/2506182.2506198.

72. Das, Sundara and Cyganiak, R2RML: RDB to RDF Mapping Language, W3C Recommendation, 2012, https://www.w3.org/TR/r2rml/.

73. De Nart, Tasso and Degl’Innocenti, A semantic metadata generator for Web pages based on keyphrase extraction, in: International Semantic Web Conference ISWC, Posters & Demonstrations Track, Horridge, Rospocher and van Ossenbruggen, eds, CEUR-WS.org, 2014, pp. 201–204.

74. De Virgilio, RDFa based annotation of Web pages through keyphrases extraction, in: On the Move (OTM), Meersman, T.S. Dillon, Herrero, Kumar, Reichert, Qing, B.C. Ooi, Damiani, D.C. Schmidt, White, Hauswirth, Hitzler and M.K. Mohania, eds, Springer, 2011, pp. 644–661. doi:10.1007/978-3-642-25106-1_18.

75. Del Corro and Gemulla, ClausIE: Clause-based open information extraction, in: World Wide Web Conference (WWW), Schwabe, V.A.F. Almeida, Glaser, R.A. Baeza-Yates and S.B. Moon, eds, ACM, 2013, pp. 355–366. doi:10.1145/2488388.2488420.

76. Delac, Krleza, Snajder, B.D. Basic and Saric, TermeX: A tool for collocation extraction, in: Computational Linguistics and Intelligent Text Processing (CICLing), A.F. Gelbukh, ed., Springer, 2009, pp. 149–157. doi:10.1007/978-3-642-00382-0_12.

77. Demartini, D.E. Difallah and Cudré-Mauroux, ZenCrowd: Leveraging probabilistic reasoning and crowdsourcing techniques for large-scale Entity Linking, in: World Wide Web Conference (WWW), Mille, F.L. Gandon, Misselis, Rabinovich and Staab, eds, ACM, 2012, pp. 469–478. doi:10.1145/2187836.2187900.

78. Derczynski, Augenstein and Bontcheva, USFD: Twitter NER with drift compensation and Linked Data, in: Proceedings of the Workshop on Noisy User-Generated Text, Xu, Han and Ritter, eds, Association for Computational Linguistics, 2015. doi:10.18653/v1/W15-4306.

79. Derczynski, Maynard, Rizzo, van Erp, Gorrell, Troncy, Petrak and Bontcheva, Analysis of Named Entity Recognition and Linking for Tweets, Information Processing & Management 51(2) (2015), 32–49. doi:10.1016/j.ipm.2014.10.006.

80. T.G. Dietterich, Ensemble methods in machine learning, in: International Workshop on Multiple Classifier Systems (MCS), Kittler and Roli, eds, Springer, 2000, pp. 1–15. doi:10.1007/3-540-45014-9_1.

81. Dietze, Maynard, Demidova, Risse, Peters, Doka and Stavrakas, Entity extraction and consolidation for social web content preservation, in: International Workshop on Semantic Digital Archives, Mitschick, Loizides, Predoiu, Nürnberger and Ross, eds, 2012, pp. 18–29.

82. Dill, Eiron, Gibson, Gruhl, R.V. Guha, Jhingran, Kanungo, Rajagopalan, Tomkins, J.A. Tomlin and J.Y. Zien, SemTag and Seeker: Bootstrapping the Semantic Web via automated semantic annotation, in: World Wide Web Conference (WWW), Hencsey, White, Y.R. Chen, Kovács and Lawrence, eds, ACM, 2003, pp. 178–186. doi:10.1145/775152.775178.

83. Ding, DiFranzo, Graves, Michaelis, Li, D.L. McGuinness and Hendler, Data-gov wiki: Towards linking government data, in: Linked Data Meets Artificial Intelligence, AAAI, 2010.

84. Dojchinovski and Kliegr, Recognizing, classifying and linking entities with Wikipedia and DBpedia, in: Workshop on Intelligent and Knowledge Oriented Technologies (WIKT), 2012, pp. 41–44.

85. Dolby, Fokoue, Kalyanpur, Schonberg and Srinivas, Extracting enterprise vocabularies using Linked Open Data, in: International Semantic Web Conference (ISWC), Bernstein, D.R. Karger, Heath, Feigenbaum, Maynard, Motta and Thirunarayan, eds, Springer, 2009, pp. 779–794. doi:10.1007/978-3-642-04930-9_49.

86. Dong, Gabrilovich, Heitz, Horn, Lao, Murphy, Strohmann, Sun and Zhang, Knowledge vault: A Web-scale approach to probabilistic knowledge fusion, in: International Conference on Knowledge Discovery and Data Mining (SIGKDD), S.A. Macskassy, Perlich, Leskovec, Wang and Ghani, eds, ACM, 2014, pp. 601–610. doi:10.1145/2623330.2623623.

87. C.N. dos Santos and Guimarães, Boosting Named Entity Recognition with neural character embeddings, CoRR (2015), arXiv:1505.05008.

88. Drozdzynski, Krieger, Piskorski, Schäfer and Xu, Shallow processing with unification and typed feature structures – foundations and applications, Künstliche Intelligenz 18(1) (2004), 17.

89. D’Souza and Ng, Sieve-based entity linking for the biomedical domain, in: Association for Computational Linguistics: Short Papers, ACL, 2015, pp. 297–302.

90. Dunning, Accurate methods for the statistics of surprise and coincidence, Computational Linguistics 19(1) (1993), 61–74.

91. Durrett and Klein, A joint model for entity analysis: Coreference, typing, and linking, TACL 2 (2014), 477–490.

92. Dutta, Meilicke and Stuckenschmidt, Semantifying triples from open information extraction systems, in: European Starting AI Researcher Symposium (STAIRS), Ulle and Leite, eds, IOS Press, 2014, pp. 111–120. doi:10.3233/978-1-61499-421-3-111.

93. Dutta, Meilicke and Stuckenschmidt, Enriching structured knowledge with open information, in: World Wide Web Conference (WWW), Gangemi, Leonardi and Panconesi, eds, ACM, 2015, pp. 267–277. doi:10.1145/2736277.2741139.

94. Dzbor, Motta and Domingue, Magpie: Experiences in supporting Semantic Web browsing, J. Web Sem. 5(3) (2007), 204–222. doi:10.1016/j.websem.2007.07.001.

95. Earley, An efficient context-free parsing algorithm, Commun. ACM 13(2) (1970), 94–102. doi:10.1145/362007.362035.

96. Eckhardt, Hresko, Procházka and Smrs, Entity Linking based on the co-occurrence graph and entity probability, in: International Workshop on Entity Recognition & Disambiguation (ERD), Carmel, Chang, Gabrilovich, B.P. Hsu and Wang, eds, ACM, 2014, pp. 37–44. doi:10.1145/2633211.2634349.

97. Exner and Nugues, Entity extraction: From unstructured text to DBpedia RDF triples, in: The Web of Linked Entities Workshop (WoLE 2012), CEUR-WS, 2012, pp. 58–69.

98. Exner and Nugues, Refractive: An open source tool to extract knowledge from syntactic and semantic relations, in: Language Resources and Evaluation Conference (LREC), Calzolari, Choukri, Declerck, Goggi, Grobelnik, Maegaard, Mariani, Mazo, Moreno, Odijk and Piperidis, eds, ELRA, 2014.

99. Fader, Soderland and Etzioni, Identifying relations for open information extraction, in: Empirical Methods in Natural Language Processing (EMNLP), ACL, 2011, pp. 1535–1545.

100. Á. Felices-Lago and P.U. Gómez-Moreno, FunGramKB term extractor: A tool for building terminological ontologies from specialised corpora, in: Studies in Language Companion Series, Huang, Koudas, G.J.F. Jones, Wu, Collins-Thompson and An, eds, John Benjamins Publishing Company, 2014, pp. 251–270.

101. Ferragina and Scaiella, TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities), in: Conference on Information and Knowledge Management (CIKM), Huang, Koudas, G.J.F. Jones, Wu, Collins-Thompson and An, eds, ACM, 2010, pp. 1625–1628. doi:10.1145/1871437.1871689.

102. D.A. Ferrucci and Lally, UIMA: An architectural approach to unstructured information processing in the corporate research environment, Natural Language Engineering 10(3–4) (2004), 327–348. doi:10.1017/S1351324904003523.

103. C.J. Fillmore, Frame semantics and the nature of language, Annals of the New York Academy of Sciences 280(1) (1976), 20–32. doi:10.1111/j.1749-6632.1976.tb25467.x.

104. J.R. Finkel, Grenager and Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, in: Annual Meeting of the Association for Computational Linguistics (ACL), Knight, H.T. Ng and Oflazer, eds, ACL, 2005, pp. 363–370.

105. J.R. Finkel and C.D. Manning, Nested Named Entity Recognition, in: Empirical Methods in Natural Language Processing (EMNLP), ACL, 2009, pp. 141–150.

106. Fossati, Dorigatti and Giuliano, N-ary relation extraction for simultaneous T-Box and A-Box knowledge base augmentation, Semantic Web 9(4) (2018), 413–439. doi:10.3233/SW-170269.

107. Francis-Landau, Durrett and Klein, Capturing semantic similarity for Entity Linking with convolutional neural networks, CoRR (2016), arXiv:1604.00734.

108. K.T. Frantzi, Ananiadou and Mima, Automatic recognition of multi-word terms: The C-value/NC-value method, Int. J. on Digital Libraries 3(2) (2000), 115–130. doi:10.1007/s007999900023.

109. Freitas, D.S. Carvalho, J.C. Da Silva, O’Riain and Curry, A semantic best-effort approach for extracting structured discourse graphs from Wikipedia, in: Workshop on the Web of Linked Entities (ISWC-WLE), 2012.

110. D.S. Friedlander, Semantic Information Extraction, CRC Press, 2005.

111. Gagnon, Zouaq and Jean-Louis, Can we use Linked Data semantic annotators for the extraction of domain-relevant expressions?, in: World Wide Web Conference (WWW), Carr, A.H.F. Laender, B.F. Lóscio, King, Fontoura, Vrandecic, Aroyo, Palazzo, de Oliveira, Lima and Wilde, eds, ACM, 2013, pp. 1239–1246. doi:10.1145/2487788.2488157.

112. Gangemi, A comparison of knowledge extraction tools for the Semantic Web, in: Extended Semantic Web Conference (ESWC), Cimiano, Ó. Corcho, Presutti, Hollink and Rudolph, eds, Springer, 2013, pp. 351–366. doi:10.1007/978-3-642-38288-8_24.

113. Gangemi, Presutti, D.R. Recupero, A.G. Nuzzolese, Draicchio and Mongiovì, Semantic Web machine reading with FRED, Semantic Web 8(6) (2017), 873–893. doi:10.3233/SW-160240.

114. Gangemi, D.R. Recupero, Mongiovì, A.G. Nuzzolese and Presutti, Identifying motifs for evaluating open knowledge extraction on the Web, Knowl.-Based Syst. 108 (2016), 33–41. doi:10.1016/j.knosys.2016.05.023.

115. A.L. Gentile, Zhang and Ciravegna, Self training wrapper induction with Linked Data, in: Text, Speech and Dialogue (TSD), Springer, 2014, pp. 285–292. doi:10.1007/978-3-319-10816-2_35.

116. Gerber, Hellmann, Bühmann, Soru, Usbeck and A.-C. Ngonga Ngomo, Real-time RDF extraction from unstructured data streams, in: International Semantic Web Conference (ISWC), Vol. 8218, Springer, 2013, pp. 135–150. doi:10.1007/978-3-642-41335-3_9.

117. Gerber and A.-C. Ngonga Ngomo, Extracting multilingual natural-language patterns for RDF predicates, in: Knowledge Engineering and Knowledge Management (EKAW), Springer-Verlag, 2012, pp. 87–96. doi:10.1007/978-3-642-33876-2_10.

118. Giannini, Colucci, F.M. Donini and Di Sciascio, A logic-based approach to named-entity disambiguation in the Web of Data, in: Advances in Artificial Intelligence, Gavanelli, Lamma and Riguzzi, eds, Springer, 2015, pp. 367–380. doi:10.1007/978-3-319-24309-2_28.

119. Gillam, Tariq and Ahmad, Terminology and the construction of ontology, Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 11(1) (2005), 55–81.

120. M.L. Goldstein, S.A. Morris and G.G. Yen, Bridging the gap between data acquisition and inference ontologies – towards ontology based link discovery, SPIE 5071 (2003), 117.

121. Goodfellow, Bengio and Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.

122. Grütze, Kasneci, Zuo and Naumann, CohEEL: Coherent and efficient named entity linking through random walks, J. Web Sem. 37–38 (2016), 75–89. doi:10.1016/j.websem.2016.03.001.

123. J.A. Gulla, H.O. Borch and J.E. Ingvaldsen, Unsupervised keyphrase extraction for search ontologies, in: International Conference on Applications of Natural Language to Information Systems (NLDB), Springer, 2006, pp. 25–36.

124. Guo and Barbosa, Robust named entity disambiguation with random walks, Semantic Web (2018), 1–21. doi:10.3233/SW-170273.

125. Hakimov, S.A. Oto and Dogdu, Named Entity Recognition and Disambiguation using Linked Data and graph-based centrality scoring, in: International Workshop on Semantic Web Information Management (SWIM), De Virgilio, Giunchiglia and Tanca, eds, ACM, 2012, pp. 4:1–4:7. doi:10.1145/2237867.2237871.

126. Hakimov, ter Horst, Jebbara, Hartung and Cimiano, Combining textual and graph-based features for named entity disambiguation using undirected probabilistic graphical models, in: Knowledge Engineering and Knowledge Management (EKAW), Blomqvist, Ciancarini, Poggi and Vitali, eds, Springer, 2016, pp. 288–302. doi:10.1007/978-3-319-49004-5_19.

127. M.M. Hassan, Karray and M.S. Kamel, Automatic document topic identification using Wikipedia hierarchical ontology, in: Information Science, Signal Processing and Their Applications (ISSPA), IEEE, 2012, pp. 237–242. doi:10.1109/ISSPA.2012.6310552.

128. Haugen, Abstract: The open graph protocol design decisions, in: International Semantic Web Conference (ISWC), Revised Selected Papers, Part II, P.F. Patel-Schneider, Pan, Hitzler, Mika, Zhang, J.Z. Pan, Horrocks and Glimm, eds, Springer, 2010, p. 338. doi:10.1007/978-3-642-17749-1_25.

129. D.G. Hays, Dependency theory: A formalism and some observations, Language 40(4) (1964), 511–525. doi:10.2307/411934.

130. M.A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: International Conference on Computational Linguistics (COLING), 1992, pp. 539–545.

131. Hellmann, Lehmann, Auer and Brümmer, Integrating NLP using Linked Data, in: International Semantic Web Conference (ISWC), Springer, 2013, pp. 98–113. doi:10.1007/978-3-642-41338-4_7.

132. Hepp, GoodRelations: An ontology for describing products and services offers on the web, in: Knowledge Engineering and Knowledge Management (EKAW), Springer, 2008, pp. 329–346. doi:10.1007/978-3-540-87696-0_29.

133. Hepple, Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers, in: Annual Meeting of the Association for Computational Linguistics (ACL), 2000.

134. Hernández, Hogan and Krötzsch, Reifying RDF: What works well with Wikidata?, in: International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), Liebig and Fokoue, eds, 2015, p. 32.

135. Heuss, Humm, Henninger and Rippl, A comparison of NER tools w.r.t. a domain-specific vocabulary, in: International Conference on Semantic Systems (SEMANTICS), Sack, Filipowska, Lehmann and Hellmann, eds, ACM, 2014, pp. 100–107. doi:10.1145/2660517.2660520.

136. Hitzler, Krötzsch, Parsia, P.F. Patel-Schneider and Rudolph, OWL 2 Web Ontology Language Primer, 2nd edn, W3C Recommendation, 2012, https://www.w3.org/TR/owl2-primer/.

137. Hoffart, Altun and Weikum, Discovering emerging entities with ambiguous names, in: World Wide Web Conference (WWW), Chung, A.Z. Broder, Shim and Suel, eds, ACM, 2014, pp. 385–396. doi:10.1145/2566486.2568003.

138. Hoffart, Seufert, D.B. Nguyen, Theobald and Weikum, KORE: Keyphrase overlap relatedness for entity disambiguation, in: Information and Knowledge Management (CIKM), Chen, Lebanon, Wang and M.J. Zaki, eds, ACM, 2012, pp. 545–554. doi:10.1145/2396761.2396832.

139. Hoffart, F.M. Suchanek, Berberich and Weikum, YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, Artif. Intell. 194 (2013), 28–61. doi:10.1016/j.artint.2012.06.001.

140. Hoffart, M.A. Yosef, Bordino, Fürstenau, Pinkal, Spaniol, Taneva, Thater and Weikum, Robust disambiguation of named entities in text, in: Empirical Methods in Natural Language Processing (EMNLP), ACL, 2011, pp. 782–792.

141. Hoffmann, Zhang, Ling, L.S. Zettlemoyer and D.S. Weld, Knowledge-based weak supervision for information extraction of overlapping relations, in: Annual Meeting of the Association for Computational Linguistics (ACL), Lin, Matsumoto and Mihalcea, eds, ACL, 2011, pp. 541–550.

142. Hofmann, Unsupervised learning by probabilistic latent semantic analysis, Machine Learning 42(1) (2001), 177–196. doi:10.1023/A:1007617005950.

143. Huang, Wang and Y.L. Murphey, Text categorization using topic model and ontology networks, in: International Conference on Data Mining (DMIN), 2014.

144. Hulpuş, Hayes, Karnstedt and Greene, Unsupervised graph-based topic labelling using DBpedia, in: Web Search and Web Data Mining (WSDM), Leonardi, Panconesi, Ferragina and Gionis, eds, ACM, 2013, pp. 465–474. doi:10.1145/2433396.2433454.

145. Hulpuş, Prangnawarat and Hayes, Path-based semantic relatedness on Linked Data and its use to word and entity disambiguation, in: International Semantic Web Conference (ISWC), Springer, 2015, pp. 442–457. doi:10.1007/978-3-319-25007-6_26.

146. Huynh, Mazzocchi and D.R. Karger, Piggy bank: Experience the Semantic Web inside your web browser, J. Web Sem. 5(1) (2007), 16–27. doi:10.1016/j.websem.2006.12.002.

147. D.T. Huynh, T.H. Cao, P.H.T. Pham and T.N. Hoang, Using hyperlink texts to improve quality of identifying document topics based on Wikipedia, in: International Conference on Knowledge and Systems Engineering (KSE), IEEE, 2009, pp. 249–254.

148. Inyaem, Meesad, Haruechaiyasak and Tran, Construction of fuzzy ontology-based terrorism event extraction, in: International Conference on Knowledge Discovery and Data Mining (WKDD), IEEE, 2010, pp. 391–394. doi:10.1109/WKDD.2010.113.

149. Jain and Pareek, Automatic topic(s) identification from learning material: An ontological approach, in: Computer Engineering and Applications (ICCEA), Vol. 2, IEEE, 2010, pp. 358–362.

150. Janik and Kochut, Wikipedia in action: Ontological knowledge in text categorization, in: International Conference on Semantic Computing (ICSC), IEEE, 2008, pp. 268–275. doi:10.1109/ICSC.2008.53.

151. Jean-Louis, Zouaq, Gagnon and Ensan, An assessment of online semantic annotators for the keyword extraction task, in: Pacific Rim International Conference on Artificial Intelligence PRICAI, D.N. Pham and Park, eds, Springer, 2014, pp. 548–560. doi:10.1007/978-3-319-13560-1_44.

152. Jha, Röder and A.-C. Ngonga Ngomo, All that glitters is not gold – rule-based curation of reference datasets for Named Entity Recognition and Entity Linking, in: ESWC, Springer, 2017, pp. 305–320. doi:10.1007/978-3-319-58068-5_19.

153. Jiang and Tan, CRCTOL: A semantic-based domain ontology learning system, JASIST 61(1) (2010), 150–168. doi:10.1002/asi.21231.

154. Jovanovic, Bagheri, Cuzzola, Gasevic, Jeremic and Bashash, Automated semantic tagging of textual content, IT Professional 16(6) (2014), 38–46. doi:10.1109/MITP.2014.85.

155. Kabutoya, Sumi, Iwata, Uchiyama and Uchiyama, A topic model for recommending movies via Linked Open Data, in: Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 1, IEEE, 2012, pp. 625–630. doi:10.1109/WI-IAT.2012.23.

156. Kamp, A theory of truth and semantic representation, in: Formal Semantics – the Essential Readings, Portner and B.H. Partee, eds, Blackwell, 1981, pp. 189–222.

157. Karlsson, Voutilainen, Heikkilae and Anttila, Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text, Vol. 4, Walter de Gruyter, 1995.

158. Kemmerer, Großmann, Müller, Adolphs and Ehrig, The Neofonie NERD system at the ERD challenge 2014, in: International Workshop on Entity Recognition & Disambiguation (ERD), Carmel, M.-W. Chang, Gabrilovich, B.-J.P. Hsu and Wang, eds, ACM, 2014, pp. 83–88. doi:10.1145/2633211.2634358.

159. Khalili, Auer and Hladky, The RDFa content editor – from WYSIWYG to WYSIWYM, in: Computer Software and Applications Conference (COMPSAC), IEEE, 2012, pp. 531–540. doi:10.1109/COMPSAC.2012.72.

160. H.L. Kim, Scerri, J.G. Breslin, Decker and Kim, The state of the art in tag ontologies: A semantic model for tagging and folksonomies, in: International Conference on Dublin Core and Metadata Applications (DC), 2008, pp. 128–137.

161. Kim and Rebholz-Schuhmann, Improving the extraction of complex regulatory events from scientific text by using ontology-based inference, J. Biomedical Semantics 2(S-5) (2011), S3.

162. Kim and L.A. Tuan, Hybrid pattern matching for complex ontology term recognition, in: Conference on Bioinformatics, Computational Biology and Biomedicine (BCB), Ranka, Kahveci and Singh, eds, ACM, 2012, pp. 289–296. doi:10.1145/2382936.2382973.

163. S.N. Kim, Medelyan, M.-Y. Kan and Baldwin, SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles, in: International Workshop on Semantic Evaluation (SemEval), Erk and Strapparava, eds, Association for Computational Linguistics, 2010, pp. 21–26.

164. Kingsbury and Palmer, From Treebank to PropBank, in: Language Resources and Evaluation Conference (LREC), ELRA, 2002.

165. Kipper, Korhonen, Ryant and Palmer, Extending VerbNet with novel verb classes, in: Language Resources and Evaluation Conference (LREC), ELRA, 2006, pp. 1027–1032.

166. Klimek, J.P. McCrae, Lehmann, Chiarcos and Hellmann, OnLiT: An ontology for linguistic terminology, in: International Conference on Language, Data, and Knowledge (LDK), Springer, 2017, pp. 42–57. doi:10.1007/978-3-319-59888-8_4.

167. Krause, Hennig, Moro, Weissenborn, Xu, Uszkoreit and Navigli, Sar-graphs: A language resource connecting linguistic knowledge with semantic relations from knowledge graphs, J. Web Sem. 37–38 (2016), 112–131. doi:10.1016/j.websem.2016.03.004.

168. Kulkarni, Singh, Ramakrishnan and Chakrabarti, Collective annotation of Wikipedia entities in Web text, in: International Conference on Knowledge Discovery and Data Mining (SIGKDD), KDD ’09, J.F. Elder IV, Fogelman-Soulié, P.A. Flach and M.J. Zaki, eds, ACM, 2009, pp. 457–466. doi:10.1145/1557019.1557073.

169. Lacasta, J.N. Iso and F.J.Z. Soria, Terminological Ontologies – Design, Management and Practical Applications, Springer, 2010. doi:10.1007/978-1-4419-6981-1.

170. Lauscher, Nanni, Ruiz Fabo and S.P. Ponzetto, Entities as topic labels: Combining Entity Linking and labeled LDA to improve topic interpretability and evaluability, Italian Journal of Computational Linguistics 2(2) (2016), 67–88.

171. Lehmann, Isele, Jakob, Jentzsch, Kontokostas, P.N. Mendes, Hellmann, Morsey, van Kleef, Auer and Bizer, DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6(2) (2015), 167–195. doi:10.3233/SW-140134.

172. Lehmberg, Ritze, Ristoski, Meusel, Paulheim and Bizer, The Mannheim search join engine, J. Web Sem. 35 (2015), 159–166. doi:10.1016/j.websem.2015.05.001.

173. Lemnitzer, Vertan, Killing, K.I. Simov, Evans, Cristea and Monachesi, Improving the search for learning objects with keywords and ontologies, in: European Conference on Technology Enhanced Learning (EC-TEL), Wolpers, Duval and Klamma, eds, Springer, Berlin, Heidelberg, 2007, pp. 202–216. doi:10.1007/978-3-540-75195-3_15.

174. D.D. Lewis, Yang, T.G. Rose and Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research 5 (2004), 361–397.

175. Li, Yang and Luo, Semantic video Entity Linking based on visual content and metadata, in: International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 4615–4623. doi:10.1109/ICCV.2015.524.

176. Limaye, Sarawagi and Chakrabarti, Annotating and searching web tables using entities, types and relationships, PVLDB 3(1) (2010), 1338–1347. doi:10.14778/1920841.1921005.

177. C.-Y. Lin, Knowledge-based automatic topic identification, in: Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 1995, pp. 308–310.

178. Lin and Pantel, DIRT – discovery of inference rules from text, in: International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2001, pp. 323–328. doi:10.1145/502512.502559.

179. Lin, Shen, Liu, Luan and Sun, Neural relation extraction with selective attention over instances, in: Association for Computational Linguistics (ACL), Volume 1: Long Papers, ACL, 2016.

180. Ling, Singh and D.S. Weld, Design challenges for Entity Linking, TACL 3 (2015), 315–328.

181. Lipczak, Koushkestani and E.E. Milios, Tulip: Lightweight entity recognition and disambiguation using Wikipedia-based topic centroids, in: International Workshop on Entity Recognition & Disambiguation (ERD), Carmel, Chang, Gabrilovich, B.P. Hsu and Wang, eds, 2014, pp. 31–36. doi:10.1145/2633211.2634351.

182. Liu, He, Liu, Zhou, Liu and Zhao, Open relation mapping based on instances and semantics expansion, in: Asia Information Retrieval Societies Conference (AIRS), R.E. Banchs, Silvestri, Liu, Zhang, Gao and Lang, eds, Springer, 2013, pp. 320–331. doi:10.1007/978-3-642-45068-6_28.

183. Lu and Roth, Joint mention extraction and classification with mention hypergraphs, in: Empirical Methods in Natural Language Processing (EMNLP), Màrquez, Callison-Burch, Su, Pighin and Marton, eds, ACL, 2015, pp. 857–867.

184. Luo, Huang, Lin and Nie, Joint entity recognition and disambiguation, in: Empirical Methods in Natural Language Processing (EMNLP), Màrquez, Callison-Burch, Su, Pighin and Marton, eds, ACL, 2015, pp. 879–888.

185. Macken, Lefever and Hoste, TExSIS: Bilingual terminology extraction from parallel corpora using chunk-based alignment, Terminology 19(1) (2013), 1–30. doi:10.1075/term.19.1.01mac.

186. Maedche and Staab, Ontology Learning for the Semantic Web, IEEE Intelligent Systems 16(2) (2001), 72–79. doi:10.1109/5254.920602.

187. C.D. Manning, Surdeanu, Bauer, J.R. Finkel, Bethard and McClosky, The Stanford CoreNLP natural language processing toolkit, in: Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 55–60.

188. M.P. Marcus, Santorini and M.A. Marcinkiewicz, Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics 19(2) (1993), 313–330.

189. Marujo, Gershman, J.G. Carbonell, R.E. Frederking and J.P. Neto, Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization, in: Language Resources and Evaluation Conference (LREC), 2012.

190. Mausam, Schmitz, Soderland, Bart and Etzioni, Open language learning for information extraction, in: Empirical Methods in Natural Language Processing (EMNLP) and (CoNLL), Tsujii, Henderson and Pasca, eds, ACL, 2012, pp. 523–534.

191. Maynard, Bontcheva and Augenstein, Natural Language Processing for the Semantic Web, Morgan & Claypool, 2016.

192. Maynard, Funk and Peters, Using lexico-syntactic ontology design patterns for ontology creation and population, in: Workshop on Ontology Patterns (WOP), Blomqvist, Sandkuhl, Scharffe and Svátek, eds, CEUR-WS.org, 2009.

193. J.D. Mcauliffe and D.M. Blei, Supervised topic models, in: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2008, pp. 121–128.

194.

J.F.

McCarthy and

W.G.

Lehnert, Using decision trees for coreference resolution, in: International Joint Conference on Artificial Intelligence (IJCAI), 1995, pp. 1050–1055.

195.

J.P.

McCrae,

Moran,

Hellmann and

Brümmer, Multilingual Linked Data, Semantic Web6(4) (2015), 315–317. doi:10.3233/SW-150178.

196. Medelyan, NLP keyword extraction tutorial with RAKE and Maui, 2014, online, https://www.airpair.com/nlp/keyword-extraction-tutorial.

197. Medelyan, Manion, Broekstra, Divoli, A.L. Huang and Witten, Constructing a focused taxonomy from a document collection, in: Extended Semantic Web Conference (ESWC), Cimiano, Ó. Corcho, Presutti, Hollink and Rudolph, eds, Springer, 2013. doi:10.1007/978-3-642-38288-8_25.

198. Medelyan, I.H. Witten and Milne, Topic indexing with Wikipedia, in: Wikipedia and Artificial Intelligence: An Evolving Synergy, 2008, p. 19.

199. Meij, Weerkamp and de Rijke, Adding semantics to microblog posts, in: Web Search and Web Data Mining (WSDM), Adar, Teevan, Agichtein and Maarek, eds, ACM, 2012, pp. 563–572. doi:10.1145/2124295.2124364.

200. P.N. Mendes, Jakob, García-Silva and Bizer, DBpedia spotlight: Shedding light on the Web of Documents, in: International Conference on Semantic Systems (I-Semantics), Ghidini, A.-C. Ngonga Ngomo, S.N. Lindstaedt and Pellegrini, eds, ACM, 2011, pp. 1–8. doi:10.1145/2063518.2063519.

201. Meusel, Bizer and Paulheim, A Web-scale study of the adoption and evolution of the schema.org vocabulary over time, in: Web Intelligence, Mining and Semantics (WIMS), 2015, pp. 15:1–15:11. doi:10.1145/2797115.2797124.

202. Meusel, Petrovski and Bizer, The WebDataCommons Microdata, RDFa and Microformat dataset series, in: International Semantic Web Conference (ISWC), Mika, Tudorache, Bernstein, Welty, C.A. Knoblock, Vrandecic, P.T. Groth, N.F. Noy, Janowicz and C.A. Goble, eds, Springer, 2014, pp. 277–292. doi:10.1007/978-3-319-11964-9_18.

203. Mihalcea and Csomai, Wikify!: Linking documents to encyclopedic knowledge, in: Information and Knowledge Management (CIKM), ACM, 2007, pp. 233–242. doi:10.1145/1321440.1321475.

204. Mihaylov and Palmisano, D4.5 integration of advanced modules in the annotation framework. Deliverable of the NoTube FP7 EU project (project no. 231761), 2011, http://notube3.files.wordpress.com/2012/01/notube_d4-5-integration-of-advanced-modules-in-annotation-framework-vm33.pdf.

205. Mika, On schema.org and why it matters for the Web, IEEE Internet Computing 19(4) (2015), 52–55. doi:10.1109/MIC.2015.81.

206. Mika, Meij and Zaragoza, Investigating the semantic gap through query log analysis, in: International Semantic Web Conference (ISWC), Springer, 2009, pp. 441–455. doi:10.1007/978-3-642-04930-9_28.

207. Miles and Bechhofer, SKOS Simple Knowledge Organization System Reference, W3C Recommendation, 2009, https://www.w3.org/2004/02/skos/.

208. G.A. Miller, WordNet: A lexical database for English, Commun. ACM 38(11) (1995), 39–41. doi:10.1145/219717.219748.

209. Milne and I.H. Witten, Learning to link with Wikipedia, in: Information and Knowledge Management (CIKM), ACM, 2008, pp. 509–518. doi:10.1145/1458082.1458150.

210. D.N. Milne and I.H. Witten, An open-source toolkit for mining Wikipedia, Artif. Intell. 194 (2013), 222–239. doi:10.1016/j.artint.2012.06.007.

211. Min, Grishman, Wan, Wang and Gondek, Distant supervision for relation extraction with an incomplete knowledge base, in: North American Chapter of the (ACL), Vanderwende, Daumé III and Kirchhoff, eds, ACL, 2013, pp. 777–782.

212. Minard, Speranza, Urizar, Altuna, van Erp, Schoen and van Son, Meantime, the newsreader multilingual event and time corpus, in: Language Resources and Evaluation Conference (LREC), Calzolari, Choukri, Declerck, Goggi, Grobelnik, Maegaard, Mariani, Mazo, Moreno, Odijk and Piperidis, eds, ELRA, 2016.

213. Mintz, Bills, Snow and Jurafsky, Distant supervision for relation extraction without labeled data, in: Annual Meeting of the Association for Computational Linguistics (ACL), Su, Su and Wiebe, eds, ACL, 2009, pp. 1003–1011.

214. T.M. Mitchell, W.W. Cohen, E.R. Hruschka Jr., P.P. Talukdar, Betteridge, Carlson, B.D. Mishra, Gardner, Kisiel, Krishnamurthy, Lao, Mazaitis, Mohamed, Nakashole, E.A. Platanios, Ritter, Samadi, Ritter, Settles, R.C. Wang, D.T. Wijaya, Gupta, Chen, Saparov, Greaves and Welling, Never-ending learning, in: Conference on Artificial Intelligence (AAAI), Bonet and Koenig, eds, AAAI, 2015, pp. 2302–2310.

215. Mori, Matsuo, Ishizuka and Faltings, Keyword extraction from the Web for FOAF metadata, in: Workshop on Friend of a Friend, Social Networking and the Semantic Web, 2004.

216. Moro, Raganato and Navigli, Entity Linking meets Word Sense Disambiguation: A unified approach, Transactions of the Association for Computational Linguistics 2 (2014), 231–244.

217. Moussallem, Usbeck, Röder and A.-C. Ngonga Ngomo, MAG: A multilingual, knowledge-base agnostic and deterministic Entity Linking approach, in: Knowledge Capture Conference (K-CAP), Ó. Corcho, Janowicz, Rizzo, Tiddi and Garijo, eds, ACM, 2017, pp. 9:1–9:8. doi:10.1145/3148011.3148024.

218. Mulwad, Finin and Joshi, Semantic message passing for generating Linked Data from tables, in: International Semantic Web Conference (ISWC), Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Springer, 2013, pp. 363–378. doi:10.1007/978-3-642-41335-3_23.

219. Muñoz, Hogan and Mileo, Using Linked Data to mine RDF from Wikipedia’s tables, in: Web Search and Web Data Mining (WSDM), Carterette, Diaz, Castillo and Metzler, eds, ACM, 2014, pp. 533–542. doi:10.1145/2556195.2556266.

220. Muñoz-García, García-Silva, Corcho, de la Higuera-Hernández and Navarro, Identifying topics in social media posts using DBpedia, in: Networked and Electronic Media Summit (NEM), 2011.

221. Nadeau and Sekine, A survey of Named Entity Recognition and Classification, Lingvisticae Investigationes 30(1) (2007), 3–26. doi:10.1075/li.30.1.03nad.

222. Nakashole, Theobald and Weikum, Scalable knowledge harvesting with high precision and high recall, in: Web Search and Web Data Mining (WSDM), King, Nejdl and Li, eds, ACM, 2011, pp. 227–236. doi:10.1145/1935826.1935869.

223. Nakashole, Weikum and F.M. Suchanek, Discovering semantic relations from the Web and organizing them with PATTY, SIGMOD Record 42(2) (2013), 29–34. doi:10.1145/2503792.2503799.

224. Navigli, Word Sense Disambiguation: A survey, ACM Comput. Surv. 41(2) (2009), 10:1–10:69. doi:10.1145/1459352.1459355.

225. Navigli, Velardi and Gangemi, Ontology learning and its application to automated terminology translation, IEEE Intelligent Systems 18(1) (2003), 22–31. doi:10.1109/MIS.2003.1179190.

226. Nebhi, A rule-based relation extraction system using DBpedia and syntactic parsing, in: Conference on NLP & DBpedia (NLP-DBPEDIA), Hellmann, Filipowska, Barrière, P.N. Mendes and Kontokostas, eds, CEUR-WS.org, 2013, pp. 74–79.

227. Nedellec, Golik, Aubin and Bossy, Building large lexicalized ontologies from text: A use case in automatic indexing of biotechnology patents, in: Knowledge Engineering and Knowledge Management (EKAW), Cimiano and H.S. Pinto, eds, Springer, 2010, pp. 514–523. doi:10.1007/978-3-642-16438-5_41.

228. Nelson, Wallis and Aarts, Exploring Natural Language: Working with the British Component of the International Corpus of English, Vol. 29, John Benjamins Publishing, 2002.

229. A.-C. Ngonga Ngomo, Heino, Lyko, Speck and Kaltenböck, SCMS – semantifying content management systems, in: International Semantic Web Conference (ISWC), Aroyo, Welty, Alani, Taylor, Bernstein, Kagal, N.F. Noy and Blomqvist, eds, Springer, 2011, pp. 189–204. doi:10.1007/978-3-642-25093-4_13.

230. D.B. Nguyen, Hoffart, Theobald and Weikum, AIDA-light: High-throughput named-entity disambiguation, in: World Wide Web Conference (WWW), Bizer, Heath, Auer and Berners-Lee, eds, CEUR-WS.org, 2014.

231. D.B. Nguyen, Theobald and Weikum, J-NERD: Joint named entity recognition and disambiguation with rich linguistic features, TACL 4 (2016), 215–229.

232. T.-V.T. Nguyen and Moschitti, End-to-end relation extraction using distant supervision from external semantic repositories, in: Annual Meeting of the Association for Computational Linguistics (ACL): Human Language Technologies, ACL, 2011, pp. 277–282.

233. Niu, Zhang, Ré and J.W. Shavlik, DeepDive: Web-scale knowledge-base construction using statistical learning and inference, in: International Workshop on Searching and Integrating New Web Data Sources, Brambilla, Ceri, Furche and Gottlob, eds, CEUR-WS.org, 2012, pp. 25–28.

234. Nivre, Dependency parsing, Language and Linguistics Compass 4(3) (2010), 138–152. doi:10.1111/j.1749-818X.2010.00187.x.

235. Nivre, de Marneffe, Ginter, Goldberg, Hajic, C.D. Manning, R.T. McDonald, Petrov, Pyysalo, Silveira, Tsarfaty and Zeman, Universal dependencies v1: A multilingual Treebank collection, in: Language Resources and Evaluation Conference (LREC), Calzolari, Choukri, Declerck, Goggi, Grobelnik, Maegaard, Mariani, Mazo, Moreno, Odijk and Piperidis, eds, 2016.

236. Novácek, Laera, Handschuh and Davis, Infrastructure for dynamic knowledge integration – automated biomedical ontology extension using textual resources, Journal of Biomedical Informatics 41(5) (2008), 816–828. doi:10.1016/j.jbi.2008.06.003.

237. B.P. Nunes, Dietze, M.A. Casanova, Kawase, Fetahu and Nejdl, Combining a co-occurrence-based and a semantic measure for Entity Linking, in: Extended Semantic Web Conference (ESWC), Cimiano, Ó. Corcho, Presutti, Hollink and Rudolph, eds, Springer, 2013, pp. 548–562. doi:10.1007/978-3-642-38288-8_37.

238. Olieman, Azarbonyad, Dehghani, Kamps and Marx, Entity Linking by focusing DBpedia candidate entities, in: International Workshop on Entity Recognition & Disambiguation (ERD), Carmel, Chang, Gabrilovich, B.P. Hsu and Wang, eds, ACM, 2014, pp. 13–24. doi:10.1145/2633211.2634353.

239. Oramas, L.E. Anke, Sordo, Saggion and Serra, ELMD: An automatically generated Entity Linking gold standard dataset in the music domain, in: International Conference on Language Resources and Evaluation (LREC), Calzolari, Choukri, Declerck, Goggi, Grobelnik, Maegaard, Mariani, Mazo, Moreno, Odijk and Piperidis, eds, ELRA, 2016.

240. Ouksili, Kedad and Lopes, Theme identification in RDF graphs, in: International Conference on Model and Data Engineering (MEDI), Y.A. Ameur, Bellatreche and G.A. Papadopoulos, eds, Springer, 2014, pp. 321–329. doi:10.1007/978-3-319-11587-0_30.

241. Ozcan and Y.A. Aslangdogan, Concept based information access using ontologies and latent semantic analysis, Dept. of Computer Science and Engineering 8 (2004), 2004.

242. Pazienza, Pennacchiotti and Zanzotto, Terminology extraction: An analysis of linguistic and statistical approaches, Knowledge mining (2005), 255–279. doi:10.1007/3-540-32394-5_20.

243. Piccinno and Ferragina, From TagME to WAT: A new entity annotator, in: International Workshop on Entity Recognition & Disambiguation (ERD), Carmel, Chang, Gabrilovich, B.P. Hsu and Wang, eds, ACM, 2014, pp. 55–62. doi:10.1145/2633211.2634350.

244. Pirnay-Dummer and Walter, Bridging the world’s knowledge to individual knowledge using latent semantic analysis and Web ontologies to complement classical and new knowledge assessment technologies, Technology, Instruction, Cognition & Learning 7(1) (2009).

245. Pivk, Cimiano, Sure, Gams, Rajkovic and Studer, Transforming arbitrary tables into logical form with TARTAR, Data Knowl. Eng. 60(3) (2007), 567–595. doi:10.1016/j.datak.2006.04.002.

246. Plu, Rizzo and Troncy, A hybrid approach for entity recognition and linking, in: Semantic Web Evaluation Challenges – Second SemWebEval Challenge at ESWC 2015, Gandon, Cabrio, Stankovic and Zimmermann, eds, Springer, 2015, pp. 28–39. doi:10.1007/978-3-319-25518-7_3.

247. Plu, Rizzo and Troncy, Enhancing Entity Linking by combining NER models, in: Extended Semantic Web Conference (ESWC), Sack, Dietze, Tordai and Lange, eds, 2016. doi:10.1007/978-3-319-46565-4_2.

248. Polleres, Hogan, Harth and Decker, Can we ever catch up with the Web?, Semantic Web 1(1–2) (2010), 45–52. doi:10.3233/SW-2010-0016.

249. Poon and P.M. Domingos, Joint unsupervised coreference resolution with Markov logic, in: Empirical Methods in Natural Language Processing (EMNLP), 2008, pp. 650–659. doi:10.3115/1613715.1613796.

250. Popov, Kiryakov, Ognyanoff, Manov and Kirilov, KIM – a semantic platform for information extraction and retrieval, Natural Language Engineering 10(3–4) (2004), 375–392. doi:10.1017/S135132490400347X.

251. Presutti, A.G. Nuzzolese, Consoli, Gangemi and D.R. Recupero, From hyperlinks to Semantic Web properties using open knowledge extraction, Semantic Web 7(4) (2016), 351–378. doi:10.3233/SW-160221.

252. Pudota, Dattolo, Baruzzo, Ferrara and Tasso, Automatic keyphrase extraction and ontology mining for content-based tag recommendation, Int. J. Intell. Syst. 25(12) (2010), 1158–1186. doi:10.1002/int.20448.

253. Raimond, Ferne, Smethurst and Adams, The BBC world service archive prototype, J. Web Sem. 27 (2014), 2–9. doi:10.1016/j.websem.2014.07.005.

254. J.J. Randolph, Free-marginal multirater kappa (multirater k [free]): An alternative to Fleiss’ fixed-marginal multirater kappa, Joensuu Learning and Instruction Symposium (2005).

255. Ratinov, Roth, Downey and Anderson, Local and global algorithms for disambiguation to Wikipedia, in: Association for Computational Linguistics (ACL): Human Language Technologies, Lin, Matsumoto and Mihalcea, eds, ACL, 2011, pp. 1375–1384.

256. Ratnaparkhi, Learning to parse natural language with maximum entropy models, Machine Learning 34(1–3) (1999), 151–175. doi:10.1023/A:1007502103375.

257. Riedel, Yao and McCallum, Modeling relations and their mentions without labeled text, in: Machine Learning and Knowledge Discovery in Databases PKDD, J.L. Balcázar, Bonchi, Gionis and Sebag, eds, Springer, 2010, pp. 148–163. doi:10.1007/978-3-642-15939-8_10.

258. Riedel, Yao, McCallum and B.M. Marlin, Relation extraction with matrix factorization and universal schemas, in: Association for Computational Linguistics (ACL): Human Language Technologies, Vanderwende, Daumé III and Kirchhoff, eds, ACL, 2013, pp. 74–84.

259. S.A. Ríos, Aguilera, Bustos, Omitola and Shadbolt, Leveraging Social Network Analysis with topic models and the Semantic Web, in: Web Intelligence and Intelligent Agent Technology, J.F. Hübner, Petit and Suzuki, eds, IEEE Computer Society, 2011, pp. 339–342. doi:10.1109/WI-IAT.2011.127.

260. A.B. Rios-Alvarado, López-Arévalo and V.J.S. Sosa, Learning concept hierarchies from textual resources for ontologies construction, Expert Systems with Applications 40(15) (2013), 5907–5915. doi:10.1016/j.eswa.2013.05.005.

261. Ristoski and Paulheim, Semantic Web in data mining and knowledge discovery: A comprehensive survey, J. Web Sem. 36 (2016), 1–22. doi:10.1016/j.websem.2016.01.001.

262. Ritze and Bizer, Matching web tables to DBpedia – a feature utility study, in: International Conference on Extending Database Technology (EDBT), Markl, Orlando, Mitschang, Andritsos, Sattler and Breß, eds, OpenProceedings.org, 2017, pp. 210–221. doi:10.5441/002/edbt.2017.20.

263. Rizzo and Troncy, NERD: A framework for unifying Named Entity Recognition and Disambiguation extraction tools, in: European Chapter of the Association for Computational Linguistics (ACL), Daelemans, Lapata and Màrquez, eds, ACL, 2012, pp. 73–76.

264. Rizzo and Troncy, NERD: Evaluating Named Entity Recognition tools in the Web of Data, in: International Semantic Web Conference (ISWC), Demo Session, 2011.

265. Rizzo, van Erp and Troncy, Benchmarking the extraction and disambiguation of named entities on the Semantic Web, in: Language Resources and Evaluation Conference (LREC), Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis, eds, 2014.

266. Röder, Both and Hinneburg, Exploring the space of topic coherence measures, in: Web Search and Web Data Mining (WSDM), Cheng, Li, Gabrilovich and Tang, eds, ACM, 2015, pp. 399–408. doi:10.1145/2684822.2685324.

267. Röder, A.-C. Ngonga Ngomo, Ermilov and Both, Detecting similar Linked Datasets using topic modelling, in: Extended Semantic Web Conference (ESWC), Sack, Blomqvist, d’Aquin, Ghidini, S.P. Ponzetto and Lange, eds, Springer, 2016, pp. 3–19. doi:10.1007/978-3-319-34129-3_1.

268. Rosales-Méndez, Hogan and Poblete, VoxEL: A benchmark dataset for multilingual Entity Linking, in: International Semantic Web Conference (ISWC), Bontcheva, Vrandečič, Presutti, M.C. Suárez-Figueroa, Celino, Sabou, L.-A. Kaffee and Simperl, eds, Springer, 2018.

269. Rosales-Méndez, Poblete and Hogan, Multilingual Entity Linking: Comparing English and Spanish, in: International Workshop on Linked Data for Information Extraction (LD4IE) Co-Located with the 16th International Semantic Web Conference (ISWC), A.L. Gentile, A.G. Nuzzolese and Zhang, eds, 2017, pp. 62–73.

270. Rosales-Méndez, Poblete and Hogan, What should Entity Linking link?, in: Alberto Mendelzon International Workshop on Foundations of Data Management (AMW), Olteanu and Poblete, eds, 2018.

271. Rose, Engel, Cramer and Cowley, Automatic keyword extraction from individual documents, Text Mining (2010), 1–20. doi:10.1002/9780470689646.ch1.

272. Rouces, de Melo and Hose, Framebase: Representing n-ary relations using semantic frames, in: Extended Semantic Web Conference (ESWC), Gandon, Sabou, Sack, d’Amato, Cudré-Mauroux and Zimmermann, eds, Springer, 2015, pp. 505–521. doi:10.1007/978-3-319-18818-8_31.

273. Sánchez and Moreno, Learning medical ontologies from the Web, Knowledge Management for Health Care Procedures (2008), 32–45. doi:10.1007/978-3-540-78624-5_3.

274. Sarawagi, Information extraction, Found. Trends databases 1(3) (2008), 261–377. doi:10.1561/1900000003.

275. Schmachtenberg, Bizer and Paulheim, Adoption of the Linked Data best practices in different topical domains, in: International Semantic Web Conference (ISWC), Mika, Tudorache, Bernstein, Welty, C.A. Knoblock, Vrandecic, P.T. Groth, N.F. Noy, Janowicz and C.A. Goble, eds, Springer, 2014, pp. 245–260. doi:10.1007/978-3-319-11964-9_16.

276. Schönhofen, Identifying document topics using the Wikipedia category network, Web Intelligence and Agent Systems: An International Journal 7(2) (2009), 195–207. doi:10.3233/WIA-2009-0162.

277. Shen, Wang, Luo and Wang, LIEGE: Link entities in Web lists with knowledge base, in: Knowledge Discovery and Data Mining (KDD), Yang, Agarwal and Pei, eds, ACM, 2012, pp. 1424–1432. doi:10.1145/2339530.2339753.

278. Shen, Wang, Luo and Wang, LINDEN: linking named entities with knowledge base via semantic knowledge, in: World Wide Web Conference (WWW), Mille, F.L. Gandon, Misselis, Rabinovich and Staab, eds, ACM, 2012, pp. 449–458. doi:10.1145/2187836.2187898.

279. Sheth, I.B. Arpinar and Kashyap, Relationships at the heart of Semantic Web: Modeling, discovering, and exploiting complex semantic relationships, in: Enhancing the Power of the Internet, Nikravesh, Azvine, Yager and L.A. Zadeh, eds, Springer, 2004, pp. 63–94. doi:10.1007/978-3-540-45218-8_4.

280. Siddiqi and Sharan, Keyword and keyphrase extraction techniques: A literature review, International Journal of Computer Applications 109(2) (2015). doi:10.5120/19161-0607.

281. Sil and Yates, Re-ranking for joint named-entity recognition and linking, in: Information and Knowledge Management (CIKM), He, Iyengar, Nejdl, Pei and Rastogi, eds, ACM, 2013, pp. 2369–2374. doi:10.1145/2505515.2505601.

282. Singh Rathore and Roy, Ontology based Web page topic identification, International Journal of Computer Applications 85(6) (2014), 35–40. doi:10.5120/14849-3211.

283. Sleeman, Finin and Joshi, Topic modeling for RDF graphs, in: International Workshop on Linked Data for Information Extraction (LD4IE) Co-Located with International Semantic Web Conference (ISWC), A.L. Gentile, Zhang, d’Amato and Paulheim, eds, CEUR-WS.org, 2015.

284. Södergren, HERD – Hajen Entity Recognition and Disambiguation, 2016.

285. Soderland and Mandhani, Moving from textual relations to ontologized relations, in: AAAI, 2007.

286. W.M. Soon, H.T. Ng and C.Y. Lim, A machine learning approach to coreference resolution of noun phrases, Computational Linguistics 27(4) (2001), 521–544. doi:10.1162/089120101753342653.

287. Specia and Motta, Integrating folksonomies with the Semantic Web, in: European Semantic Web Conference (ESWC), Franconi, Kifer and May, eds, Springer, 2007, pp. 624–639. doi:10.1007/978-3-540-72667-8_44.

288. Speck and A.-C. Ngonga Ngomo, Named Entity Recognition using FOX, in: International Semantic Web Conference (ISWC), Posters & Demonstrations Track, Horridge, Rospocher and van Ossenbruggen, eds, CEUR-WS.org, 2014, pp. 85–88.

289. Speck and A.-C. Ngonga Ngomo, Ensemble learning for Named Entity Recognition, in: International Semantic Web Conference (ISWC), Mika, Tudorache, Bernstein, Welty, C.A. Knoblock, Vrandecic, P.T. Groth, N.F. Noy, Janowicz and C.A. Goble, eds, Springer, 2014, pp. 519–534. doi:10.1007/978-3-319-11964-9_33.

290. V.C. Storey, Understanding semantic relationships, VLDB J. 2(4) (1993), 455–488. doi:10.1007/BF01263048.

291. Sun, Chen, Zeng, Lu, Shi and Ma, Supervised latent semantic indexing for document categorization, in: International Conference on Data Mining (ICDM), IEEE, 2004, pp. 535–538. doi:10.1109/ICDM.2004.10004.

292. Surdeanu, Tibshirani, Nallapati and C.D. Manning, Multi-instance multi-label learning for relation extraction, in: Empirical Methods in Natural Language Processing (EMNLP), EMNLP-CoNLL ’12, Tsujii, Henderson and Pasca, eds, ACL, 2012, pp. 455–465.

293. Takamatsu, Sato and Nakagawa, Reducing wrong labels in distant supervision for relation extraction, in: Association for Computational Linguistics (ACL), ACL, 2012, pp. 721–729.

294. T.P. Tanon, Vrandecic, Schaffert, Steiner and Pintscher, From Freebase to Wikidata: The great migration, in: World Wide Web Conference (WWW), Bourdeau, Hendler, Nkambou, Horrocks and Ben Zhao, eds, ACM, 2016, pp. 1419–1428. doi:10.1145/2872427.2874809.

295. Thakker, Osman and Lakin, GATE JAPE Grammar Tutorial, Nottingham Trent University Technical Report, 2009, https://gate.ac.uk/sale/thakker-jape-tutorial/GATE.

296. Thomas, Starlinger, Vowinkel, Arzt and Leser, GeneView: A comprehensive semantic search engine for PubMed, Nucleic acids research 40(W1) (2012), W585–W591. doi:10.1093/nar/gks563.

297. Tiun, Abdullah and T.E. Kong, Automatic topic identification using ontology hierarchy, in: Computational Linguistics and Intelligent Text Processing (CICLing), A.F. Gelbukh, ed., Springer, 2001, pp. 444–453. doi:10.1007/3-540-44686-9_43.

298. Todor, Lukasiewicz, Athan and Paschke, Enriching topic models with DBpedia, in: On the Move to Meaningful Internet Systems, Debruyne, Panetto, Meersman, T.S. Dillon, Kühn, O’Sullivan and C.A. Ardagna, eds, Springer, 2016, pp. 735–751. doi:10.1007/978-3-319-48472-3_46.

299. Tristram, Walter, Cimiano and Unger, Weasel: A machine learning based approach to entity linking combining different features, in: NLP&DBpedia Workshop, Co-Located with International Semantic Web Conference (ISWC), Paulheim, van Erp, Filipowska, P.N. Mendes and Brümmer, eds, CEUR-WS.org, 2015, pp. 25–32.

300. Unger, Freitas and Cimiano, An introduction to question answering over Linked Data, in: Reasoning on the Web in the Big Data Era, Koubarakis, G.B. Stamou, Stoilos, Horrocks, P.G. Kolaitis, Lausen and Weikum, eds, Springer, 2014, pp. 100–140. doi:10.1007/978-3-319-10587-1_2.

301. V.S. Uren, Cimiano, Iria, Handschuh, Vargas-Vera, Motta and Ciravegna, Semantic annotation for knowledge management: Requirements and a survey of the state of the art, J. Web Sem. 4(1) (2006), 14–28. doi:10.1016/j.websem.2005.10.002.

302. Usbeck, A.-C. Ngonga Ngomo, Röder, Gerber, S.A. Coelho, Auer and Both, AGDISTIS – graph-based disambiguation of named entities using Linked Data, in: International Semantic Web Conference (ISWC), Mika, Tudorache, Bernstein, Welty, C.A. Knoblock, Vrandecic, P.T. Groth, N.F. Noy, Janowicz and C.A. Goble, eds, Springer, 2014, pp. 457–471. doi:10.1007/978-3-319-11964-9_29.

303. Usbeck, Röder, A.-C. Ngonga Ngomo, Baron, Both, Brümmer, Ceccarelli, Cornolti, Cherix, Eickmann, Ferragina, Lemke, Moro, Navigli, Piccinno, Rizzo, Sack, Speck, Troncy, Waitelonis and Wesemann, GERBIL: general entity annotator benchmarking framework, in: World Wide Web Conference (WWW), Gangemi, Leonardi and Panconesi, eds, ACM, 2015, pp. 1133–1143. doi:10.1145/2736277.2741626.

304. Utt and Padó, Ontology-based distinction between polysemy and homonymy, in: International Conference on Computational Semantics, IWCS, Bos and Pulman, eds, ACL, 2011.

305. Varga, A.E.C. Basave, Rowe, Ciravegna and He, Linked knowledge sources for topic classification of microposts: A semantic graph-based approach, Web Semantics: Science, Services and Agents on the World Wide Web 26 (2014), 36–57. doi:10.1016/j.websem.2014.04.001.

306. Velardi, Fabriani and Missikoff, Using text processing techniques to automatically enrich a domain ontology, in: FOIS, 2001, pp. 270–284. doi:10.1145/505168.505194.

307. Venetis, A.Y. Halevy, Madhavan, Pasca, Shen, Wu, Miao and Wu, Recovering semantics of tables on the Web, PVLDB 4(9) (2011), 528–538. doi:10.14778/2002938.2002939.

308. J.A.L. Ventura, Jonquet, Roche and Teisseire, Biomedical term extraction: Overview and a new methodology, Inf. Retr. Journal 19(1–2) (2016), 59–99. doi:10.1007/s10791-015-9262-2.

309. Vrandecic and Krötzsch, Wikidata: A free collaborative knowledgebase, Commun. ACM 57(10) (2014), 78–85. doi:10.1145/2629489.

310. Waitelonis and Sack, Augmenting video search with Linked Open Data, in: International Conference on Semantic Systems (I-Semantics), Paschke, Weigand, Behrendt, Tochtermann and Pellegrini, eds, Verlag der Technischen Universität, Graz, 2009, pp. 550–558.

311. Wartena and Brussee, Topic detection by clustering keywords, in: International Workshop on Database and Expert Systems Applications (DEXA), IEEE, 2008, pp. 54–58. doi:10.1109/DEXA.2008.120.

312. Weston, Bordes, Yakhnenko and Usunier, Connecting language and knowledge bases with embedding models for relation extraction, in: Empirical Methods in Natural Language Processing (EMNLP), ACL, 2013, pp. 1366–1371.

313. D.C. Wimalasuriya and Dou, Ontology-based information extraction: An introduction and a survey of current approaches, J. Information Science 36(3) (2010), 306–323. doi:10.1177/0165551509360123.

314. Witte, Khamis and Rilling, Flexible ontology population from text: The OwlExporter, in: International Conference on Language Resources and Evaluation (LREC), Calzolari, Choukri, Maegaard, Mariani, Odijk, Piperidis, Rosner and Tapias, eds, ELRA, 2010.

315. I.H. Witten, G.W. Paynter, Frank, Gutwin and C.G. Nevill-Manning, KEA: practical automatic keyphrase extraction, in: Conference on Digital Libraries (DL), ACM, 1999, pp. 254–255. doi:10.1145/313238.313437.

316. Wong, Liu and Bennamoun, Ontology learning from text: A look back and into the future, ACM Computing Surveys (CSUR) 44(4) (2012), 20. doi:10.1145/2333112.2333115.

317. Xu, Reddy, Feng, Huang and Zhao, Question answering on Freebase via relation extraction and textual evidence, in: Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

318. Yakout, Ganjam, Chakrabarti and Chaudhuri, Infogather: Entity augmentation and attribute discovery by holistic matching with Web tables, in: International Conference on Management of Data (SIGMOD), K.S. Candan, Chen, R.T. Snodgrass, Gravano and Fuxman, eds, ACM, 2012, pp. 97–108. doi:10.1145/2213836.2213848.

319. Yamada and Matsumoto, Statistical dependency analysis with support vector machines, in: International Workshop on Parsing Technologies (IWPT), Vol. 3, 2003, pp. 195–206.

320. S.R. Yerva, Catasta, Demartini and Aberer, Entity disambiguation in Tweets leveraging user social profiles, in: Information Reuse & Integration, IRI, IEEE, 2013, pp. 120–128. doi:10.1109/IRI.2013.6642462.

321. M.A. Yosef, Hoffart, Bordino, Spaniol and Weikum, AIDA: An online tool for accurate disambiguation of named entities in text and tables, PVLDB 4(12) (2011), 1450–1453.

322. M.A. Yosef, Hoffart, Ibrahim, Boldyrev and Weikum, Adapting AIDA for Tweets, in: Workshop on Making Sense of Microposts Co-Located with World Wide Web Conference (WWW), Rowe, Stankovic and Dadzie, eds, CEUR-WS.org, 2014, pp. 68–69.

323. D.H. Younger, Recognition and parsing of context-free languages in time n^3, Information and Control 10(2) (1967), 189–208. doi:10.1016/S0019-9958(67)80007-X.

324. Zeng, Liu, Chen and Zhao, Distant supervision for relation extraction via piecewise convolutional neural networks, in: Empirical Methods in Natural Language Processing (EMNLP), Màrquez, Callison-Burch, Su, Pighin and Marton, eds, ACL, 2015, pp. 1753–1762.

325. Zhang, Yoshida and Tang, Using ontology to improve precision of terminology extraction from documents, in: Expert Systems with Applications, Vol. 36, 2009, pp. 9333–9339. doi:10.1016/j.eswa.2008.12.034.

326. Zhang, Named Entity Recognition: Challenges in document annotation, gazetteer construction and disambiguation, PhD thesis, The University of Sheffield, 2013.

327. Zhang, Effective and efficient semantic table interpretation using Tableminer+, Semantic Web 8(6) (2017), 921–957. doi:10.3233/SW-160242.

328. Zhao, Xing, M.A. Kabir, Sawada, Li and Lin, HDSKG: harvesting domain specific knowledge graph from content of webpages, in: IEEE International Conference on Software Analysis (SANER), Pinzger, Bavota and Marcus, eds, IEEE Computer Society, 2017, pp. 56–67. doi:10.1109/SANER.2017.7884609.

329. Zheng, Howsmon, Zhang, Hahn, D.L. McGuinness, J.A. Hendler and Ji, Entity Linking for biomedical literature, BMC Med. Inf. & Decision Making 15(S-1) (2015), S4. doi:10.1186/1472-6947-15-S1-S4.

330. Zheng, Si, Li, E.Y. Chang and Zhu, Entity disambiguation with Freebase, in: International Conferences on Web Intelligence, WI, IEEE, 2012, pp. 82–89. doi:10.1109/WI-IAT.2012.26.

331. Zhu, Zhang, Chen, Zhang and Zhu, Fast and accurate shift-reduce constituent parsing, in: Annual Meeting of the Association for Computational Linguistics (ACL), 2013, pp. 434–443.

332. Zhu, Zhu and Wang, Improving shift-reduce constituency parsing with large-scale unlabeled data, Natural Language Engineering 21(1) (2015), 113–138. doi:10.1017/S1351324913000119.

333.

Zou,

Huang,

Wang,

J.X.

Yu,

He and

Zhao, Natural language question answering over RDF: A graph data driven approach, in: International Conference on Management of Data (SIGMOD),

C.E.

Dyreson,

Li and

M.T.

Özsu, eds, ACM, 2014, pp. 313–324. doi:10.1145/2588555.2610525.

334.

Zuo,

Kasneci,

Grütze and

Naumann, BEL: Bagging for Entity Linking, in: International Conference on Computational Linguistics (COLING),

Hajic and

Tsujii, eds, ACL, 2014, pp. 2075–2086.

335.

Zwicklbauer, Einsiedler, Granitzer and Seifert, Towards disambiguating Web tables, in: International Semantic Web Conference (ISWC), Posters & Demonstrations Track, Blomqvist and Groza, eds, CEUR-WS.org, 2013, pp. 205–208.

336.

Zwicklbauer, Seifert and Granitzer, DoSeR – A knowledge-base-agnostic framework for entity disambiguation using semantic embeddings, in: Extended Semantic Web Conference (ESWC), Sack, Blomqvist, d’Aquin, Ghidini, S.P. Ponzetto and Lange, eds, Springer, 2016, pp. 182–198. doi:10.1007/978-3-319-34129-3_12.

Footnotes

¹²
This implementation was later integrated into GATE: https://gate.ac.uk/sale/tao/splitch13.html.

¹⁸
See Listing 8 where “Breaking” is tagged VBG (verb, gerund/present participle) and “Bad” as JJ (adjective).

¹⁹
More specifically, each of the two input strings is decomposed into a set of 3-character substrings, and the Jaccard coefficient of the two sets (the cardinality of their intersection divided by the cardinality of their union) is computed.

²³
https://bioportal.bioontology.org

²⁵
https://www.clips.uantwerpen.be/conll2003/ner.tgz

³⁸
Also known as Term Extraction [100], Term Recognition [3], Vocabulary Extraction [85], Glossary Extraction [60], etc.

⁴²
They use the term entity to refer to both concepts, such as person, and individuals, such as Barack Obama.

⁴³
It is worth noting that OWL does provide means for meta-modeling (a.k.a. punning), where concepts can be simultaneously considered as groups of individuals when reasoning at a terminological level, and as individuals when reasoning at an assertional level.

⁴⁴
http://www.opencalais.com/

⁴⁶
http://mlg.ucd.ie/datasets/bbc.html

⁵⁸
Here we use a rather distinct representation of arguments in the relation for space/visual reasons and to follow the notation used by Boxer (which is based on variables).

⁶¹
Also known as weak supervision [141].

⁷⁰
Though of course, we should not underestimate the value of Wikipedia itself as a raw source for IE tasks.
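
The string comparison described in footnote 19 can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from any surveyed system; lower-casing the inputs and returning 0 when both trigram sets are empty are assumptions made for the example rather than details stated in the text.

```python
from __future__ import annotations


def trigrams(text: str, n: int = 3) -> set[str]:
    """Return the set of n-character substrings of text (n = 3 by default)."""
    # Strings shorter than n yield an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Cardinality of the intersection divided by cardinality of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having no similarity.
        return 0.0
    return len(a & b) / len(union)


def trigram_similarity(s: str, t: str) -> float:
    """Jaccard coefficient over the 3-character substrings of two strings."""
    # Assumption: comparison is case-insensitive.
    return jaccard(trigrams(s.lower()), trigrams(t.lower()))
```

For instance, "abcd" and "bcde" decompose into {abc, bcd} and {bcd, cde}, which share one of three distinct trigrams, giving a coefficient of 1/3.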