JRC-Names: Multilingual entity name variants and titles as Linked Data

Abstract

Since 2004 the European Commission’s Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large numbers of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser used name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamín/Biniamin/Беньямин/بنيامين Netanyahu/Netanjahu/Nétanyahou/Netahny/Нетаньяху/نتنياهو). This entity name variant data, known as JRC-Names, has been available for public download since 2011. In this article, we report on our efforts to render JRC-Names as Linked Data (LD), using the lexicon model for ontologies (lemon). Besides adhering to Semantic Web standards, this new release goes beyond the initial one in that it includes titles found next to the names, as well as date ranges when the titles and the name variants were found. It also establishes links towards existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking. JRC-Names is publicly available through the dataset catalogue of the European Union’s Open Data Portal.

Keywords

Multilingual Semantic Web; linguistic Linked Data; lemon; named entity; name variants

1. Introduction

Enhanced by Semantic Web technologies, the Linked Data publishing paradigm has become increasingly attractive in recent years [4,31], giving rise to an ever-growing Web of Data.1

¹
As evidenced by the Linked Open Data (LOD) cloud: http://lod-cloud.net/state/.

The availability of such machine-readable, formally defined and interlinked data that can be used by computational agents bears the potential of a better and knowledgeable use of information and forms the basis of the Semantic Web vision [6]. Yet, even if the ‘global giant graph’ is under way, several challenges still need to be addressed before attaining web-scale data integration and full access to knowledge.

A crucial point relates to the natural language interfacing and processing capabilities of the Semantic Web (SW). Indeed, if the Semantic Web is inherently language-independent [29], the question arises of how to mediate between, on the one hand, language-agnostic data representations and, on the other, language-based information needs and content. For that to happen, it is crucial to enrich structured data with linguistic information in several languages, and to enhance the Semantic Web infrastructure with language processing applications [10]. Overcoming the gap between the Web of Data and natural languages presents challenges and opportunities for both Semantic Web and Natural Language Processing (NLP), which stand here in a mutually beneficial relationship.

With regards to the Semantic Web, such developments are key in several respects. Multilingual linguistic information can first support data integration. Given the growing trend towards the publication of non-English data sources and the risk of ‘monolingual islands’ of data that do not interoperate [28], cross-lingual mappings between datasets are necessary. In this context, the lexicalisation of data on a multilingual basis can be of great help [60]. Linguistic knowledge can also ease data access. Particularly, it can support the development of ontology-based Question-Answering systems in order to allow users to interact with data using their own language(s) [37,62]. Finally, even if data can be interlinked and accessed in several languages, the vast majority of content (i.e. the Web of Documents) remains unstructured. In order to facilitate information discovery and to further develop the scope of structured data, content needs to be marked-up with semantic metadata. This relies again on the availability of web-based linguistic information and technologies.

With respect to Natural Language Processing, adopting Linked Data principles for the distribution of linguistic resources can bring many benefits, including: resource interoperability, both at a structural and conceptual level; resource integration (via interlinking); and resource maintenance (via a rich ecosystem of technologies allowing, among other things, continuous updating) [15]. Based on such insights, members of the NLP and SW communities – in particular the Open Linguistics Working Group and the W3C Ontology-Lexica Community Group2 – joined efforts for the definition of best practice [28] and the design of principled models for the representation of linguistic information [40,46]. This laid the foundation for the development of a Linguistic Linked Open Data cloud (LLOD) and provided a real impetus for the publication and the use of linguistic data collections on the Web.3

²
http://linguistics.okfn.org and www.w3.org/community/ontolex/.

³
http://linguistic-lod.org/.

Apart from an interoperable set of linguistic resources, NLP can additionally benefit from the plethora of semantic resources and knowledge bases (KB) available on the Semantic Web, e.g., as Linked Data. Finally, a web-based integration of NLP tools is foreseeable in the medium term. Some steps have already been taken in this direction with the definition of the NLP Interchange Format (NIF) [32], and further progress is being achieved through several international initiatives.4

⁴

Such as the LIDER project (lider-project.eu) and the BPMLOD and LD4LT W3C community groups (w3.org/community/bpmlod|ld4lt).

The task of Entity Linking (EL) is particularly representative of the symbiotic relationship between SW and NLP. It illustrates the evolution of information extraction from a document-centric to a semantic-centric viewpoint [43,50] and is at the core of many knowledge extraction tools for the Semantic Web [18,26]. This task requires aligning textual mentions of entities with a unique identifier in a knowledge base, typically Wikipedia or DBpedia [36]. As in traditional named entity recognition, entities of interest are usually of type person, organisation and geo-political, although they can be extended to others. Many EL approaches have been developed [11,19,24,48], all of which acknowledge the lexical gap between KBs and textual content, especially the problem of entity surface form variation. Indeed, alternative spellings, abbreviations, aliases or other types of lexical variation make entity mention spotting and/or candidate selection difficult. When systems are provided with extra surface forms, their performance increases, particularly with noisy texts [12] or specific domains [67]. There is thus a need for lexical information regarding entity names, especially across languages.

In this paper we present the release of a multilingual named entity resource for person and organisation names, namely JRC-Names, as Linked Data. The resource is freely available and comprises hundreds of thousands of entity names and their multilingual variants in over twenty languages, including across scripts. This is a follow-up to a first release [56], from which it differs in that (1) it is rendered as Linked Data using the Lexicon Model for Ontologies, and (2) it contains much more information, such as titles of persons and date ranges when title and name variants were found. Besides increasing the discoverability and reusability of the resource, the Linked Data release of JRC-Names can help better address the challenges of data integration and multilingual access, as well as help the SW embrace the web of unstructured documents, e.g. through entity linking.

The remainder of the paper is organised as follows. In Section 2 we introduce the JRC-Names resource; we briefly explain how it was produced (2.1), account for the quality of the resource (2.2) and specify what is included in the dataset (2.3). Next, we describe its conversion to Linked Data (Section 3) and present its interconnections with other datasets (Section 4). We then give accessibility details (Section 5) and summarise known and potential usages (Section 6); finally, after the discussion of related work (Section 7), we conclude and consider future work (Section 8).

2. JRC-Names

2.1. Resource creation: Multilingual NER from the news

JRC-Names is a by-product of the Europe Media Monitor (EMM) family of news analysis applications, which gathers and analyses up to 220,000 news articles per day fully automatically in about 70 different languages from up to 7,000 news sites (status January 2015; [54]). Once gathered, news texts enter a pipeline of different modules which cluster related news, link news clusters over time and across languages, and – for currently twenty-one languages – recognise direct speech quotations and perform named entity recognition (NER) and classification for the entity types person and organisation. Location names are also recognised, through a lookup procedure, and disambiguated via document-based heuristics.

NER is performed using a number of manually curated language-independent rules that make use of language-specific lists of titles and other words/phrases that are typically found next to names. As regards person names, these pattern words can be titles (president), professions or occupations (tennis player, playboy), references to countries, regions, ethnic or religious groups (French, Bavarian, Berber, Muslim), age expressions (57-year-old), verbal phrases (deceased) and more. Such phrases, which we generally refer to as trigger words because they include far more than only titles, can be further modified (former) or occur in combination (57-year-old former British Prime Minister). Trigger word lists are produced in a combination of machine learning and manual collection from online sources. Those found historically next to each name are stored in order to build up a frequency-ranked repository of common titles (and more) for each entity. Organisation name recognition is performed in a similar manner, i.e. it makes use of lists of typical organisation name parts (organisation, club, international, bank, etc.). However, it is relatively weakly developed in EMM and, due to a coarse entity type categorisation, other entity types are included such as Belfast Agreement, Nobel Prize, Red Mosque or World War I. We refer the reader to [53] for further details about the NER system.
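As a rough illustration of how trigger words can anchor name recognition, the following Python sketch matches one or more trigger expressions followed by a sequence of capitalised tokens, and counts which trigger phrases were seen next to each name. It is not the EMM rule set: the trigger list, the regular expression, the helper names (`find_names`, `update_repository`) and the placeholder person names are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, tiny English trigger list; real lists are far larger and language-specific.
TRIGGERS = ["president", "prime minister", "tennis player", "former", "british"]

_alternatives = "|".join(re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True))
# Trigger words match case-insensitively; age expressions ("57-year-old") are matched by pattern.
_TRIGGER = rf"(?i:{_alternatives}|\d{{1,3}}-year-old)"
# A candidate name is simplified here to two or more capitalised tokens.
_NAME = r"[A-Z][\w'-]+(?:\s+[A-Z][\w'-]+)+"
PATTERN = re.compile(rf"\b((?:{_TRIGGER}\s+)+)({_NAME})")


def find_names(text):
    """Return (name, trigger phrase) pairs for names introduced by trigger words."""
    return [(m.group(2), " ".join(m.group(1).split())) for m in PATTERN.finditer(text)]


def update_repository(repository, text):
    """Record which trigger phrases occurred next to each recognised name."""
    for name, trigger in find_names(text):
        repository[name][trigger] += 1
    return repository


# Placeholder names, not real persons.
repository = defaultdict(Counter)
update_repository(repository, "The 57-year-old former British Prime Minister Jane Doe spoke.")
update_repository(repository, "Former British Prime Minister Jane Doe met tennis player John Roe.")
print(repository["Jane Doe"].most_common())
```

Calling `most_common()` on a name's counter gives the frequency-ranked view of its trigger phrases; a production system would additionally need language identification, per-language lists and the manual curation described above.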

Besides NER applied to multilingual news, JRC-Names is also the result of a name variant matching process. The NER tool identifies over 500 new name forms per day and, for each of them, the system must determine whether it refers to a new entity or whether it is a spelling variant of an existing entity name. To this end, a language-independent name matching algorithm is applied, which computes a similarity measure (edit distance) between different name representations. These are obtained after several transformation steps including transliteration, normalisation and vowel removal to create consonant signatures. A newly identified name is merged with an existing one if their overall similarity is above an empirically defined threshold, and kept as a separate entity otherwise. More advanced approaches for name similarity across scripts have been explored in [49].
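The matching steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EMM algorithm: the transliteration table covers only a handful of Cyrillic letters, the overall similarity is assumed to be the plain average of the full-form and consonant-signature similarities, and the threshold value of 0.75 is invented for the example.

```python
import unicodedata

# Hypothetical, deliberately tiny Cyrillic-to-Latin table (а, е, н, т, у, х, ь, я).
CYR2LAT = {
    "\u0430": "a", "\u0435": "e", "\u043d": "n", "\u0442": "t",
    "\u0443": "u", "\u0445": "kh", "\u044c": "", "\u044f": "ya",
}


def transliterate(name):
    """Lowercase the name and map known Cyrillic letters to Latin ones."""
    return "".join(CYR2LAT.get(ch, ch) for ch in name.lower())


def normalise(name):
    """Strip diacritics and punctuation, keeping letters and spaces."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    return "".join(ch for ch in decomposed if ch.isalpha() or ch == " ")


def consonant_signature(name):
    """Remove vowels (here: a, e, i, o, u, y) from a normalised name."""
    return "".join(ch for ch in name if ch not in "aeiouy")


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def overall_similarity(name_a, name_b):
    """Assumed combination: average of full-form and consonant-signature similarity."""
    a, b = normalise(transliterate(name_a)), normalise(transliterate(name_b))
    return (similarity(a, b) + similarity(consonant_signature(a), consonant_signature(b))) / 2


def is_variant(new_name, known_name, threshold=0.75):
    """Merge decision; the threshold is illustrative, not the empirically tuned one."""
    return overall_similarity(new_name, known_name) >= threshold


print(is_variant("Netanjahu", "Netanyahu"), is_variant("Obama", "Netanyahu"))
```

Comparing consonant signatures in addition to the full forms makes the measure more tolerant of vowel variation, which is a frequent source of differences between transliterations of the same name.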

It is important to clarify the concept of language with respect to names and their variants. We avoid talking about certain name variants as being in a certain language. Instead, we prefer to consider that a certain name variant is more frequently found in texts written in a certain language. The same variant may also be found in other languages, but probably with different distributions. For instance, Michail Gorbatschow is the most frequent spelling used in German news when referring to the former Soviet leader Михаил Горбачев, while Mikhaïl Gorbatchev is more frequent in French and this variant is also found in Portuguese texts. This relative frequency information is useful if the purpose is to generate an easy-to-read text in another language (e.g. during Machine Translation).

Finally, let us consider the question of morphological inflection. Like other lexical units, proper names are morphologically inflected in many languages. Inflection mechanisms are numerous and heterogeneous, and they can be very difficult to handle when dealing with many languages. Some of the inflected forms found for the surname of the current US president are Obamával (Hungarian), Obamę (Polish) and Obamas (German). In order to avoid the storage of all inflected forms in the database (inefficient and untidy) while keeping the possibility to capture at least a large part of their occurrences in texts, EMM pre-generates the most common inflections for a subset of known name variants or it uses suffix replacement rules during the NER process. This mechanism makes it possible to recognise a majority of name inflections in text and to return the base form for that name. Hence, morphological inflections of entity names are not meant to be part of JRC-Names. However, several of them have erroneously been missed as morphological variants and they have been categorised as variants of known names. This is rather an aesthetic issue because, from a practical point of view, their presence improves the lookup procedure of names in text.
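To make the idea of suffix replacement rules concrete, here is a small Python sketch that maps the three inflected forms quoted above back to a known base form. The rules, the rule format and the function name `base_form` are assumptions chosen to cover these examples only; they are not EMM's actual rules.

```python
import unicodedata

# Hypothetical per-language rules: (inflected suffix, replacement suffix).
SUFFIX_RULES = {
    "hu": [("ával", "a"), ("át", "a")],
    "pl": [("ę", "a")],
    "de": [("s", "")],
}

# Known base forms of names (illustrative subset).
KNOWN_NAMES = {"Obama"}


def base_form(token, lang, known=KNOWN_NAMES):
    """Return the known base form of a possibly inflected name, or None."""
    token = unicodedata.normalize("NFC", token)
    if token in known:
        return token
    for suffix, replacement in SUFFIX_RULES.get(lang, []):
        if token.endswith(suffix):
            candidate = token[: len(token) - len(suffix)] + replacement
            # Accept a rewrite only if it yields a name that is already known.
            if candidate in known:
                return candidate
    return None


for form, lang in [("Obamával", "hu"), ("Obamę", "pl"), ("Obamas", "de")]:
    print(form, lang, base_form(form, lang))
```

Restricting the rewrite to candidates that are already known keeps ordinary words ending in the same suffix from being mistaken for inflected names.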

Since 2004, the software has identified about 1.75 million different person names and about 10,000 organisation names. In addition to these ‘canonical’ name forms, the database contains about 390,000 additional lexical variants. It grows by about 700 name forms (new names or variants of known names) per week.

2.2. Resource quality

The JRC’s software recognises entities in annotated gold standard NER corpora with an average Precision of 92.13% and a Recall of 50.33% for the nine languages De, En, Es, Hu, It, Nl, Pt, Ro and Tr. Precision is highest for English (96.83%) and lowest for Portuguese (83.41%). Recall is highest for Hungarian (73.89%) and lowest for Turkish (31.70%). The evaluation values of the real-life NER system are actually better than that because of the specific settings of JRC’s system, which are geared towards (a) recognising each name at least once in a whole cluster of related news and (b) grounding each name to a real-life entity.

The result of EMM’s automatic NER and variant merging process is subject to a (light-weight) human moderation process. Manual intervention is carried out daily (for at most one hour on average), focusing on the most frequently mentioned names and on regular mistakes that affect large numbers of entities. The human moderator also has the possibility to mine – assisted by an automatic tool – name variants from cross-lingual Wikipedia links and to download entity images. This semi-automatic Wikipedia mining increases the number of languages for name variants beyond the ones covered by the NER system. Although extremely valuable, the manual verification mends only a small part of the data and JRC-Names remains the product of an automated process and, as a consequence, contains noise. The main types of errors consist of non-entities (e.g. Red Piano or French Doctor), wrong name extents (e.g. Even Obama) and wrong entity types (e.g. Merlin Biosciences as a person). Additionally, it is possible that different entities have been merged into one and, conversely, that homonyms have the same identifier, as no disambiguation mechanism is in place. In order to keep most mistakes out of the JRC-Names distribution and also to stick to the more useful entities, only those entities whose frequencies are above a threshold are included in JRC-Names, as we shall see in the next section.

2.3. Content of the linked dataset

A first version of JRC-Names was released in 2011 in the form of a tab-separated text file, accompanied by a Java library for fast lookup. The named entity resource file corresponds to a subset of EMM’s database, and it has since been available on JRC’s website,5 where a daily update ensures that recent names are also included. This initial version was the subject of a coarse-grained transformation to RDF during the MLODE 2012 workshop,6 where participants collaboratively worked on bootstrapping the LLOD.

⁵
https://ec.europa.eu/jrc/en/language-technologies/jrc-names.

⁶
Multilingual Linked Open Data for Enterprises: http://sabre2012.infai.org/mlode.

The present Linked Data version of JRC-Names takes a leap forward from there in that it (1) encodes the data using a lexical data model, namely lemon, and (2) contains further types of data. The dataset is composed of the following:

Person and organisation entity names. Those entities must have been found in at least five different news clusters (i.e. all mentions in all clustered articles of the same day count only as one).7

⁷

As mentioned in Section 2.1, EMM groups related news articles into ‘news clusters’ and deals with each cluster as a meta-document. Frequency counts of named entities are relative to these clusters, and not to single news articles.

Name variants. They must satisfy the threshold of having been found in at least 2 different news clusters.

Trigger words. They correspond to titles and function names that have been found in news articles next to the person mentions (cf. Section 2.1). Trigger words are included if they were found in at least 5 different news clusters.

Time stamps. Each name variant or title is accompanied by two time stamps: the first insertion date into the database (when EMM first found this title), and the last update date. This information is useful to detect changing titles, e.g. when a person is mentioned with different positions.

Frequency information. Each name variant has a news cluster frequency count.

Prior probabilities. Name variants have monolingual and multilingual prior probabilities, which reflect how likely an entity is mentioned with a specific variant in a certain language, or across all languages.8

⁸

The prior probability of a specific variant of an entity is calculated by dividing the frequency count of this variant by the sum of the frequency counts of other variants of the same entity in the same or across languages.

For multilingual name variants harvested from Wikipedia, there is neither frequency nor time stamp information.
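The inclusion thresholds and the prior probabilities listed above can be illustrated with a short Python sketch. The cluster counts are invented, and the denominator is an interpretation: the sketch assumes that it covers all variants of the entity, including the variant in question, so that the priors of an entity sum to one within a language and across languages.

```python
from collections import defaultdict

MIN_ENTITY_CLUSTERS = 5   # an entity must occur in at least five news clusters
MIN_VARIANT_CLUSTERS = 2  # a name variant must occur in at least two news clusters


def select_variants(entity_clusters, variants):
    """Apply the inclusion thresholds; variants are (name, language, cluster count) tuples."""
    if entity_clusters < MIN_ENTITY_CLUSTERS:
        return []
    return [v for v in variants if v[2] >= MIN_VARIANT_CLUSTERS]


def prior_probabilities(variants):
    """Map (name, language) to a (monolingual prior, multilingual prior) pair."""
    total = sum(freq for _, _, freq in variants)
    per_language = defaultdict(int)
    for _, lang, freq in variants:
        per_language[lang] += freq
    return {
        (name, lang): (freq / per_language[lang], freq / total)
        for name, lang, freq in variants
    }


# Invented cluster counts for illustration only.
observed = [
    ("Michail Gorbatschow", "de", 80),
    ("Mikhail Gorbatschow", "de", 20),
    ("Mikhaïl Gorbatchev", "fr", 100),
    ("Gorbatschov", "de", 1),  # below the variant threshold, so it is dropped
]
kept = select_variants(entity_clusters=200, variants=observed)
for key, (mono, multi) in prior_probabilities(kept).items():
    print(key, round(mono, 2), round(multi, 2))
```

With these figures, the first German variant gets a monolingual prior of 0.8 (80 of 100 German occurrences) and a multilingual prior of 0.4 (80 of 200 occurrences overall).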

3. Multilingual entity names as Linked Data

The resource consists of lexical knowledge, i.e. name variants in multiple languages, about individuals, i.e. person and organisation entities. Lemon and other linguistic vocabularies (Section 3.2) were used to render JRC-Names as Linked Data (Sections 3.3 and 3.4).

3.1. The lemon model

lemon is a model to represent linguistic information relative to ontologies in RDF. More specifically, it allows one to specify the meaning of lexical units as well as to describe their constructions with respect to the vocabulary of an ontology. In line with the principle of semantics by reference [9,39], lemon maintains a clean separation between the lexical layer, which deals with the morphological and syntactic description of lexical entries (words or phrases), and the ontological layer, responsible for describing the meaning (or resolving the reference) of the lexical entries. The model builds on previous work for representing lexica and combines the strengths of LexInfo [16] and of the Linguistic Information Repository [45], both based on the Lexical Mark-up Framework [25]. The core of the lemon model9 consists of the following elements:

⁹
http://lemon-model.net/lemon#.

Lexicon, which collects lexical entries and is marked with a language,

Lexical entry, which comprises all syntactic forms of an entry,

Lexical form, which represents the surface realization of a lexical entry, usually in the form of a written representation,

Lexical sense, which represents the usage of a lexical entry as a reference to an ontological entity.

The lexical sense acts, among other things, as a ‘glue’10 between a lexical entry and an ontological entity and, as such, corresponds to the reification of the meaning of an entry [17]. lemon is linguistically agnostic and allows any vocabulary of linguistic categories to be used. The model has already been used to represent various existing lexica [7,22,23,41,42,63,66] and proposals have been made for its extension [13,30,35]. Meeting the challenge of representing lexica and connecting them to ontologies is the current focus of the W3C OntoLex Community Group,11 which is actively working towards lemon’s final specification.

¹⁰
The expression is from [17].

¹¹
http://www.w3.org/community/ontolex.

3.2. Other vocabularies

Apart from lemon, which enables the representation of most JRC-Names data, other controlled vocabularies are used: LexInfo and OLiA, which provide linguistic categories and mapping between linguistic schemes, are used to specify linguistic categories and relation properties of name variants [14,16]; lexvo, which provides global IDs for language-related objects, is used to encode language information [20]; and the DBpedia ontology, which organises Wikipedia concepts, is used to encode entity types. As regards meta-data information, the VoID [1] and the DCTerms vocabularies are used. Finally, when no existing vocabulary could answer our needs, we defined our classes and properties in a dedicated vocabulary.12

¹²
URLs of all vocabularies are mentioned in Fig. 1.

3.3. Representing entities and their multilingual name variants

At the ontological level, JRC-Names entities are encoded as dbo:Person or dbo:Organisation. Each entity has a language-independent ‘base name’, i.e. the variant chosen for display purposes inside EMM. The base name is either the most frequently found variant in the news (across languages), or the variant found on Wikipedia, or a frequent Latin script version of a name originally written in another script. This base name is therefore not marked with a language (although it is typically a name form that is frequently found in English text) and is encoded as the skos:prefLabel of the RDF entity.

At the lexical level, entity name variants are encoded as lemon:LexicalEntry, the language of which is specified through lemon and lexvo language properties (ISO-639-1 and 3). These lexical entries are also defined as olia:NamedEntity and get further characterised with the lexinfo properNoun part-of-speech.

JRC-Names exhibits a relatively high degree of lexical variation. There are multiple scripts (e.g. Latin vs. Cyrillic Barack Obama – Барак Обама), omission or addition of name parts (Barack Hussein Obama Jr.), inflected forms (Barack Obamát), typos (Barrac Obama), inversion of name parts (Obama Barack) and various other forms (e.g. Barack O’Bama). Because the collection of variants is based on string similarity, formally very different units such as diachronic variants or aliases (Eric Blair, alias George Orwell) do not exist in the resource (or if so, they were manually entered). Variant types, however, are not specified in JRC-Names. As a consequence, even if lemon offers the possibility to represent term variation at the level of surface form, word or sense [46,47], name variants are all lemon:LexicalEntry (i.e. words), although some could be conceived as different lemon:Forms of a variant. Accordingly, name variants of the same language (and of the same entity) are related through lemon:lexicalVariant relations.

Figure 1.

Graphical illustration of JRC-Names data representation, with the example of the entity Jean-Claude Juncker.

The path from name variants to their referent is set via lemon:LexicalSense. As reification of the relation between a word and a concept (here an entity), a lexical sense can support the expression of information which is neither of lexical nor of ontological nature. JRC-Names associates contextual information to entity name variants, that is to say their news cluster frequency and the dates of their first insertion and last update in the database. Based on news cluster frequencies, we additionally compute monolingual and multilingual prior probabilities. This information is rendered as properties of name variant lexical senses. Such properties are circumstantial and do not qualify the linguistic usage but the incidence of the association of a given variant with a specific entity (how many times this name appears with this referent, and when this association was first and last observed). This is the reason why we did not use the lemon:context property, which concentrates on pragmatics or discourse properties such as register or temporal and geographical usage constraints. With regard to proper names, such a context could for example specify the time span usage of Byzantium vs. Constantinople vs. Istanbul, or the register difference between Michael Schumacher and Schumy.

Lexical senses additionally allow the expression of translation relations between name variants in different languages referring to the same entity. Translation relations fall indeed within the domain of lexical sense, as they shall be stated between disambiguated names (the English lexical entry London will translate into the French Londres when referring to the city, into London when referring to the writer). These relations are represented through lexinfo:translation object properties, as there was no need to use a more principled way to do it [30].
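The representation described in this section can be summarised with a small Python sketch that produces subject-predicate-object triples for one entity. It is an approximation for illustration only: the `ex:` identifiers are hypothetical, the use of lemon:canonicalForm with lemon:writtenRep for the surface string is an assumption about the concrete encoding, each relation is emitted in one direction only, and the variant frequencies are invented.

```python
from itertools import combinations


def to_triples(entity, entity_type, base_name, variants):
    """Build (subject, predicate, object) triples for an entity and its name variants.

    `variants` is a list of (name, language, cluster frequency) tuples.
    """
    triples = [
        (entity, "rdf:type", entity_type),
        (entity, "skos:prefLabel", f'"{base_name}"'),  # base name carries no language tag
    ]
    nodes = []
    for i, (name, lang, freq) in enumerate(variants):
        entry, form, sense = f"{entity}/entry{i}", f"{entity}/form{i}", f"{entity}/sense{i}"
        triples += [
            (entry, "rdf:type", "lemon:LexicalEntry"),
            (entry, "rdf:type", "olia:NamedEntity"),
            (entry, "lexinfo:partOfSpeech", "lexinfo:properNoun"),
            (entry, "lemon:language", f'"{lang}"'),
            (entry, "lemon:canonicalForm", form),          # assumed encoding of the surface form
            (form, "lemon:writtenRep", f'"{name}"@{lang}'),
            (entry, "lemon:sense", sense),
            (sense, "lemon:reference", entity),             # the sense points to the referent
            (sense, "jrc-model:clusterFreq", str(freq)),   # contextual data sits on the sense
        ]
        nodes.append((entry, sense, lang))
    for (entry_a, sense_a, lang_a), (entry_b, sense_b, lang_b) in combinations(nodes, 2):
        if lang_a == lang_b:
            # Same language, same entity: lexical variants, stated between entries.
            triples.append((entry_a, "lemon:lexicalVariant", entry_b))
        else:
            # Different languages: translation, stated between senses.
            triples.append((sense_a, "lexinfo:translation", sense_b))
    return triples


# Hypothetical identifier and invented frequencies.
triples = to_triples(
    "ex:entity/1", "dbo:Person", "Barack Obama",
    [("Barack Obama", "en", 100), ("Barrac Obama", "en", 2), ("Барак Обама", "ru", 50)],
)
print(len(triples))
```

Because every pair of variants in different languages yields a translation relation, the number of such relations grows quadratically with the number of variants per entity, whereas lexical variant relations are confined to a single language.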

3.4. Representing titles

Besides name variants in multiple languages, the dataset also contains person entity ‘titles’. As detailed in Section 2.1, titles correspond to the trigger words that helped recognise entities in texts and they consist of a heterogeneous set of nominal phrases referring to the function or the social status of a person. Titles are lexically defined as lexical entries and as olia:TitleNoun, a morphosyntactic category describing appropriately those items. They are marked with language, but their part-of-speech remains unspecified. Title lexical units refer through lexical senses to the dbo:PersonFunction class, in a kind of loose lexicalisation of this abstract concept.

As with name variants, frequency and time-stamp information is available. However, since these elements regard the relation between a title and a person entity and not the one between a title and its concept (dbo:PersonFunction), they cannot be stated on titles’ lexical senses. In other words, what is qualified here is not the linguistic relation between a word and its concept, but the factual one of a person entity having, or occurring with, its title(s). In order to correctly encode this information as well as to capture the person/title relation, we introduced a jrc-model:Occurrence class. It represents a specific occurrence of a title lexical sense and establishes the relation with a person entity via the jrc-model:hasTitle property. As expected, instances of jrc-model:Occurrence additionally hold the frequency and time properties relative to a given person/title association.

Let us mention that in a more rigorous setting the occurrence of a title lexical sense (an instantiation of jrc-model:Occurrence) should point not to the person entity (dbo:Person) but to one of its name variants with which the title originally occurred. This information is however not available in the original database, where title expressions are directly associated with person entities.
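The role of the occurrence node can be illustrated with a plain Python data structure: frequency and time stamps are attached to the person/title association rather than to the title itself, which makes it possible to follow how a person's titles change over time. The field names, the identifier and all values below are hypothetical and do not reproduce the dataset's actual property names or content.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Occurrence:
    """One person/title association with its own frequency and time stamps."""
    person: str            # identifier of the person entity (hypothetical)
    title: str             # written form of the title
    lang: str
    cluster_freq: int
    insertion_date: date   # when the title was first found with this person
    last_update: date      # when it was last found with this person


def title_timeline(occurrences, person):
    """Titles of a person ordered by first insertion date."""
    own = [o for o in occurrences if o.person == person]
    return sorted(own, key=lambda o: o.insertion_date)


def latest_title(occurrences, person):
    """The title most recently seen with the person, or None."""
    own = [o for o in occurrences if o.person == person]
    return max(own, key=lambda o: o.last_update).title if own else None


# Invented values for illustration.
occurrences = [
    Occurrence("ex:person/1", "President of the European Commission", "en",
               300, date(2014, 7, 15), date(2015, 1, 20)),
    Occurrence("ex:person/1", "Prime Minister", "en",
               120, date(2004, 3, 1), date(2013, 12, 4)),
]
print([o.title for o in title_timeline(occurrences, "ex:person/1")])
print(latest_title(occurrences, "ex:person/1"))
```

Had the time stamps been attached to the title alone, the same title shared by several persons could not carry person-specific dates, which is the motivation given above for a dedicated class.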

A graphical representation of JRC-Names entity and lexical knowledge is given in Fig. 1, with the example of the current President of the European Commission Jean-Claude Juncker. As it is not possible to represent all information, only a few items of each type of information are depicted.

4. Interlinking

JRC-Names introduces links towards two specialised datasets, New York Times and Talk of Europe, and a generic one, DBpedia [36]. The New York Times (NYT) initiated some years ago the Linked Data publication13 of its news index, or subject headings, which includes data about people and organisations (among others). As for Talk of Europe, this project curates Linked Open Data about the European Parliament; the published dataset contains all plenary debates over a fifteen-year period (1999–2014), and biographical information about the members of parliament (MEP) [64]. Interlinks of type owl:sameAs are set from JRC persons towards person entities of both datasets, based on strict label matching of non-ambiguous entities. As indicated in Table 1, 2701 links are established towards NYT, 928 towards MEP.

¹³
http://data.nytimes.com/.

DBpedia contains a great number of person entities with many properties in various languages. As briefly mentioned in the introduction, a well-known issue with knowledge bases is entity disambiguation. Although this was not the primary goal of the present work, we developed a light-weight strategy in order to link JRC-Names entities with their correct counterpart in DBpedia. Given a JRC source entity and its variants in all languages, the algorithm first looks for an exact match between the variants and the English rdfs:label of non-ambiguous person and organisation DBpedia entities. Next, if no match is found, ambiguous DBpedia candidates are selected (based on the variant surface forms) and if only one of these candidates is of the same type as the JRC source entity, then the resources are interlinked. Finally, when there is more than one possible candidate (i.e. DBpedia entities having the same type and label as the JRC one), the set of English titles of the JRC entity is considered against a selection of English properties of DBpedia candidates (dbo:office, purl:description and db-prop:title), looking again for an exact match. Overall 95,437 links were created (cf. Table 1), 64,002 thanks to the first alternative, 31,340 thanks to the second and 95 to the third. We manually evaluated the correctness of 100 randomly selected links and obtained a Precision of 91%. Errors are mainly due to EMM mixing different persons, resulting in ambiguous entities that are difficult to link. The linking strategy could be improved in several ways, e.g. by exploiting multilingual features and making a joint use of the different DBpedia chapters.
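The three-step cascade can be sketched in Python over a small in-memory index. This is a simplified reading of the strategy, not the code that produced the links: the index structure (English label mapped to candidate records), the `ex:` identifiers, the field names and the treatment of a label as non-ambiguous when it has exactly one candidate are all assumptions.

```python
def link_entity(entity_type, variants, titles, label_index):
    """Return (target URI or None, number of the step that produced the link)."""
    # Step 1: exact match between a variant and the label of a non-ambiguous candidate.
    for variant in variants:
        candidates = label_index.get(variant, [])
        if len(candidates) == 1:
            return candidates[0]["uri"], 1

    # Step 2: among ambiguous candidates, keep those of the same type as the source entity.
    same_type = {}
    for variant in variants:
        for candidate in label_index.get(variant, []):
            if candidate["type"] == entity_type:
                same_type[candidate["uri"]] = candidate
    if len(same_type) == 1:
        return next(iter(same_type)), 2

    # Step 3: exact match between the entity's English titles and candidate descriptions.
    wanted = set(titles)
    matching = [uri for uri, c in same_type.items() if wanted & set(c["descriptions"])]
    if len(matching) == 1:
        return matching[0], 3
    return None, 0


# Hypothetical index: English label -> candidate records.
LABEL_INDEX = {
    "Alice Example": [
        {"uri": "ex:Alice_Example", "type": "Person", "descriptions": []},
    ],
    "Sam Sample": [
        {"uri": "ex:Sam_Sample_(band)", "type": "Organisation", "descriptions": []},
        {"uri": "ex:Sam_Sample", "type": "Person", "descriptions": []},
    ],
    "Pat Placeholder": [
        {"uri": "ex:Pat_Placeholder_(politician)", "type": "Person",
         "descriptions": ["Prime Minister"]},
        {"uri": "ex:Pat_Placeholder_(athlete)", "type": "Person",
         "descriptions": ["tennis player"]},
    ],
}

print(link_entity("Person", ["Alice Example", "Alice Exemple"], [], LABEL_INDEX))
print(link_entity("Person", ["Sam Sample"], [], LABEL_INDEX))
print(link_entity("Person", ["Pat Placeholder"], ["Prime Minister"], LABEL_INDEX))
```

Returning the step number alongside the link makes it easy to report how many links each alternative contributes, as done in the figures above.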

Some interlinks are set at vocabulary level [34]. JRC’s classes and properties being quite specific, only a few links could be set, mostly on NYT’s vocabulary, with loose relationships (rdfs:seeAlso) from jrc-model:clusterFreq, jrc-model:insertionDate and jrc-model:lastUpdate towards New York Times associated_article_count, first_use and latest_use properties respectively. Finally, let us mention that backward links towards the MLODE dataset are set, based on JRC entity IDs.

Table 1

Statistical profile of JRC-Names RDF dataset

Data
# Lexicons (total)	170
# Lexicons (with freq. metadata)	21
# Lexical Entries	1,781,901
# Lexical Senses	1,781,901
# Person entities	331,242
# Organisation entities	7,391
Internal connectivity
# Lexical variants	2,412,394
# Translation relations	32,564,928
External connectivity
# MEP (Talk of Europe)	928
# New York Times	2,706
# DBpedia	95,437
Grand Total	72,586,712

5. Dataset features and Web access

The RDF version of JRC-Names features an overall number of 72.5 million triples. Table 1 gives further details on the statistical profile of the dataset. The majority of entities are persons, with 331,242 resources of this type against 7,391 of type organisation. Those entities are lexicalised through 1.7 million lexical entries, gathered into a total of 171 language-specific lexicons. It is worth recalling here that NER is performed for 21 languages, and that data for other languages is added through Wikipedia mining. Next, there are about 2.4 million monolingual lexical variant relations, and 32 million translation relations. Finally, external connectivity is reasonably good, with nearly a third of the entities being connected to either DBpedia, New York Times, or Talk of Europe.

Resource metadata are expressed using the VoID vocabulary; the descriptions provided include general, access and structural metadata.

The JRC-Names linked dataset is served on the web via the EU Open Data Portal with an RDF dump file,14 a public SPARQL endpoint15 and dereferenceable URIs.16 Additionally, the resource is registered on the datahub platform.17

¹⁴
https://data.europa.eu/euodp/en/data/dataset/jrc-names.

¹⁵
http://data.europa.eu/euodp/en/linked-data.

¹⁶
https://data.europa.eu/euodp/resource/jrc-names/.

¹⁷
https://datahub.io/dataset/jrc-names-ec.

Occasional updates of the LOD version of JRC-Names are foreseen to maintain appropriate synchronisation with the database.

6. Known and potential uses

JRC-Names has been used for a whole range of tasks. The main use is probably to improve the recall of searches in databases (including audio-visual ones) and text collections (including the Internet) [2,57] by expanding the initial user query with all name variants. Alternatively, name mentions in the search space can be normalised by replacing variants with a standard form. Search expansion is particularly important across scripts, as even approximate matching techniques will not find foreign-script variants of the searched name. Hands-on users of JRC-Names have either replaced the whole entity name with the set of its variants, e.g. ‘George Bush’ with ‘George Busch’, ‘George Buhs’ and ‘Corc Uolker Buş’, or they have split all entities in JRC-Names to produce lists of variants for each name part, e.g. ‘Georgius’, ‘Georges’, ‘Georg’, ‘Džordž’, etc. for the English standard spelling ‘George’. In this way, the knowledge contained in the resource can be applied to any name and not only to those of media VIPs. Another use of JRC-Names relates to Machine Translation systems, which typically have problems translating proper names [5]. This challenge can be overcome by identifying and removing names before the translation process and then reinserting the target-language equivalent [61]. Also, lists of names in two different scripts are often used to learn transliteration rules, e.g. [49]. Collections of names and their variants have been used to train and/or improve Named Entity Recognition tools [8,21,65] or to disambiguate name mentions [2], but also, more generally, to develop Language Technology tools for lesser-resourced languages [58,69]. The development of higher-level Language Technology tools, such as co-reference resolution [55] and cross-lingual linking of related documents in different languages [52], has also benefited from JRC-Names.
Furthermore, JRC-Names has been used in higher-level sociological or political studies such as tracking researchers’ mobility on the web [27] or pre-processing text for a subsequent political science study [3]. In principle, JRC-Names can also be useful as a component in Language Technology tools for opinion mining, summarisation, topic detection and tracking, and more.
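The query expansion and normalisation uses described at the start of this section can be sketched as follows. This is a toy illustration: the `CLUSTERS` dictionary and the function names are hypothetical, and a real application would load the variant clusters from the dataset rather than hard-code them.

```python
# Toy variant cluster (standard form -> known variants), using the
# example from the text; real clusters would come from JRC-Names.
CLUSTERS = {
    "George Bush": ["George Bush", "George Busch", "George Buhs",
                    "Corc Uolker Buş"],
}


def build_index(clusters):
    """Map every case-folded variant to the standard form of its entity."""
    return {v.casefold(): std for std, vs in clusters.items() for v in vs}


def expand_query(name, clusters, index):
    """Return all known variants of `name`, or [name] if it is unknown."""
    std = index.get(name.casefold())
    return list(clusters[std]) if std else [name]


def normalise(name, index):
    """Replace a variant with its standard form; unknown names are kept."""
    return index.get(name.casefold(), name)


index = build_index(CLUSTERS)
print(expand_query("george busch", CLUSTERS, index))
print(normalise("Corc Uolker Buş", index))
```

Expansion is applied on the query side and leaves the indexed collection untouched, whereas normalisation rewrites mentions in the search space; which of the two is preferable depends on whether the collection can be re-indexed.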

The LOD version of JRC-Names contains more information and links to other LOD resources. This not only widens the application areas but, most of all, opens the way to a fully automatic usage of the data. First, the machine-readable version of JRC-Names can be queried by agents18 and the retrieved information can easily be integrated into NLP web services. Second, thanks to the list of spelling variants for each name, the LD resource allows establishing richer links between unstructured natural language texts and structured information (e.g. for entity linking), and does so at a multilingual level. Furthermore, the LD resource can support cross-lingual access to information, e.g. the automatic retrieval of entity information spread over several monolingual resources, as well as cross-lingual mapping between datasets, including across scripts. Finally, interlinks towards other resources connect JRC-Names to the Web of Data, enabling further data enhancement at both content and linguistic levels: while interlinks towards the New York Times and Talk of Europe datasets can support political studies with questions such as “How and when were members of parliament or politicians mentioned in news articles?”, links towards the DBpedia nucleus provide additional lexicalisations of DBpedia person entities and have the potential to facilitate integration with other named entity resources.

¹⁸
Examples of queries are available at: http://data.europa.eu/euodp/en/linked-data.
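To illustrate how an agent might query the resource, the sketch below builds a SPARQL query for the written forms of one entity and encodes it as a SPARQL-protocol GET request. It rests on several assumptions that should be checked against the published data: that lexical entries point to their entity via lemon:sense/lemon:reference, that variants are attached through lemon:canonicalForm or lemon:otherForm, and that the endpoint URL is supplied by the caller. The function names and example URIs are hypothetical.

```python
from urllib.parse import urlencode

# lemon namespace as given in the article's footnote.
LEMON = "http://lemon-model.net/lemon#"


def variants_query(entity_uri: str, lang: str = "") -> str:
    """Build a SPARQL query for the written forms of one entity.

    Assumed modelling: entry -> sense -> reference (the entity), and
    entry -> canonicalForm/otherForm -> writtenRep (the name string).
    """
    lang_filter = f'  FILTER (lang(?name) = "{lang}")\n' if lang else ""
    return (
        f"PREFIX lemon: <{LEMON}>\n"
        "SELECT DISTINCT ?name WHERE {\n"
        f"  ?entry lemon:sense/lemon:reference <{entity_uri}> ;\n"
        "         (lemon:canonicalForm|lemon:otherForm)/lemon:writtenRep ?name .\n"
        f"{lang_filter}"
        "}"
    )


def request_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL-protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical entity URI and endpoint, for illustration only.
print(variants_query("https://example.org/entity/1", lang="de"))
print(request_url("https://example.org/sparql", "ASK {}"))
```

No request is sent here; the resulting URL can be passed to any HTTP client once the actual endpoint address and entity URIs are known.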

7. Related work

This section summarises previous efforts to compile multilingual lexical information about names, and considers named entity-related data on the LLOD.

Named entities, or proper names when limited to the core categories of person, location and organisation, represent an open word class which evolves endlessly. Dedicated resources or gazetteers are therefore not easy to acquire and require constant updates. In this context, the collaboratively built, semi-structured and multilingual Wikipedia resource appeared as a great relief, and several named entity dictionaries were built out of it [57,59,68]. Prolexbase [38], a manually produced multilingual ontology of proper names built up over many years, recently adopted a semi-automatic enrichment strategy based on Wikipedia [51]. All of these resources are the result of exploiting Wikipedia and, with the exception of [59] which makes use of LMF, they are not interoperable.

Many linguistic resources have been exposed as Linked Data recently. As for entities, they appear mainly in encyclopedic dictionaries and knowledge bases, such as BabelNet [23], DBpedia [36] and YAGO [33], but some are present in lexical resources. In the latter case, resources such as WordNet RDF [42] or lemonUBY [22] do include entity names, but in rather limited numbers and with little information about lexical variation. In the former, all entities derive from Wikipedia and are primarily the focus of encyclopedic descriptions. At the lexical level, Wikipedia is strong at providing cross-lingual and cross-script variants, but it contains only a few spelling variants within the same language and no information on morphological variants. In contrast, JRC-Names is mostly built up by recognising name variants in real-life multilingual texts. A dedicated resource has been compiled as part of DBpedia Spotlight [44], which consists of entity lexicalisations collected over the graph of labels, redirects and disambiguations of the KB. Here again, the range of name variants is limited to Wikipedia data, while JRC-Names provides name occurrences from real-life texts. Overall, the picture that emerges is one of complementarity, where various datasets could provide different types of information about entities.

8. Conclusion

We have presented the new release of the JRC-Names resource as Linked Data using lemon, a model for representing ontology lexica. This work is the continuation of previous efforts and is in line with the general effort of the European Commission to support multilingualism and language diversity. Compared with the initial release of JRC-Names in 2011, the current one is available as Linked Data and provides more information, namely person titles, occurrence time-stamps and frequency information. With name variants extracted from multilingual news, this resource complements those based on Wikipedia and contributes to the ongoing developments within the SW and NLP communities to support data access in several languages.

Future work could be manifold. At data level, it would be useful to further specify the variant types, to carry out a lemon-based publication of morphological generation rules, and to clean erroneously conflated entities (e.g. using titles). At web level, interlinking with other datasets (lexical, encyclopedic or factual) could be expanded, as well as intralinking among titles.

Acknowledgements

The Europe Media Monitor (EMM) is a multiannual group effort involving many tasks, and we would thus like to thank all past and present EMM team members for their help and dedication. We are also grateful to Valentina Fratto and her team from the EU Publication Office for the fruitful collaboration. Finally, we would also like to thank the members of the LIDER project, whose activities emphasise the need for and greatly support the development of linguistic Linked Data.

References

1. Alexander, Cyganiak, Hausenblas and Zhao, Describing Linked Datasets with the VoID Vocabulary, 2011. URL http://www.w3.org/TR/void/. W3C Interest Group Note.

2. Aliprandi, Tomas and Sérgio, Language processing and linguistic data in the CAPER project, in: Workshop on Language Resources for Public Security Applications, Held at the 8th International Conference on Language Resource and Evaluation Conference, Istanbul, Turkey, May 2012, p. 23.

3. Andreas, Gregor and Gerhard, Leipzig corpus miner – A text mining infrastructure for qualitative data analysis, in: Terminology and Knowledge Engineering 2014, Berlin, Germany, June 2014, p. 10.

4. Auer, Lehmann and A.N. Ngomo, Introduction to Linked Data and its lifecycle on the Web, in: Proc. of the 9th International Summer School 2013, Mannheim, Germany, Springer, Berlin, Heidelberg, 2013, pp. 1–90. doi:10.1007/978-3-642-39784-4_1. ISBN 978-3-642-39784-4.

5. Babych and Hartley, Improving machine translation quality with automatic named entity recognition, in: EAMT’03: Proc. of the 7th International EAMT Workshop on MT and Other Language Technology Tools, Improving MT Through Other Language Technology Tools: Resources and Tools for Building MT, Stroudsburg, PA, USA, Association for Computational Linguistics, April 2003, p. 1.

6. Bizer, Heath and Berners-Lee, Linked Data – The story so far, International Journal on Semantic Web and Information Systems (IJSWIS) 5(3) (2009), 1–22.

7. L.L. Borin, Dannélls, Forsberg and J.P. McCrae, Representing Swedish Lexical Resources in RDF with lemon, in: Proc. of the ISWC 2014 Posters & Demonstrations Track – A Track Within the 13th International Semantic Web Conference, Aachen, Germany, CEUR-WS.org 2014, pp. 329–332.

8. Buchholz and van den Bosch, Integrating seed names and ngrams for a name entity list classifier, in: Proc. of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece, May, June, European Language Resources Association, 2000.

9. Buitelaar, Ontology-based semantic lexicons: Mapping between terms and object descriptions, in: Ontology and the Lexicon, 2010, 212–223.

10. Buitelaar, K.S. Choi, Cimiano and Hovy, The multilingual Semantic Web (Dagstuhl seminar 12362), Dagstuhl Reports 2(9) (2013), 15–94, ISSN 2192-5283, http://drops.dagstuhl.de/opus/volltexte/2013/3788. doi:10.4230/DagRep.2.9.15.

11. Charton, Gagnon and Ozell, Automatic Semantic Web annotation of named entities, in: Proc. of Advances in Artificial Intelligence: 24th Canadian Conference on Artificial Intelligence (Canadian AI 2011), St. John’s, Canada, May 25–27, 2011, Springer, Berlin, Heidelberg, 2011, pp. 74–85, doi:10.1007/978-3-642-21043-3_10. ISBN 978-3-642-21043-3.

12. Charton, M.J. Meurs, Jean-Louis and Gagnon, Improving entity linking using surface form refinement, in: Proc. of the 9th International Conference on Language Resources and Evaluation, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis, eds, European Language Resources Association (ELRA), Reykjavik, Iceland, 2014. ISBN 978-2-9517408-8-4.

13. Chavula and C.M. Keet, Is lemon sufficient for building multilingual ontologies for bantu languages? in: Proc. of the 11th OWL: Experiences and Directions Workshop, CEUR-WS, Vol. 1265, Oct. 2014, pp. 61–72.

14. Chiarcos, Ontologies of linguistic annotation: Survey and perspectives, in: Proc. of the 8th International Conference on Language Resources and Evaluation, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis, eds, European Language Resources Association (ELRA), Istanbul, Turkey, May 2012. ISBN 978-2-9517408-7-7.

15. Chiarcos, J.P. McCrae, Cimiano and Fellbaum, Towards open data for linguistics: Linguistic linked data, in: New Trends of Research in Ontologies and Lexical Resources: Ideas, Projects, Systems, Springer, Berlin, Heidelberg, 2013, pp. 7–25. doi:10.1007/978-3-642-31782-8_2. ISBN 978-3-642-31782-8.

16. Cimiano, Buitelaar, J.P. McCrae and M.S. Sintek, LexInfo: A declarative model for the lexicon-ontology interface, Web Semantics: Science, Services and Agents on the World Wide Web 9(1) (March 2011), 29–51, ISSN 1570-8268. doi:10.1016/j.websem.2010.11.001.

17. Cimiano, J.P. McCrae, Buitelaar and Montiel-Ponsoda, On the role of senses in the ontology-lexicon, in: New Trends of Research in Ontologies and Lexical Resources: Ideas, Projects, Systems, Springer, Berlin, Heidelberg, 2013, pp. 43–62. doi:10.1007/978-3-642-31782-8_4. ISBN 978-3-642-31782-8.

18. Cornolti, Ferragina and Ciaramita, A framework for benchmarking entity-annotation systems, in: Proc. of the 22nd International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, ACM, New York, NY, USA, 2013, pp. 249–260, doi:10.1145/2488388.2488411. ISBN 978-1-4503-2035-1.

19. Daiber, Jakob, Hokamp and P.N. Mendes, Improving efficiency and accuracy in multilingual entity extraction, in: Proc. of the 9th International Conference on Semantic Systems, I-SEMANTICS’13, New York, NY, USA, ACM, 2013, pp. 121–124. ISBN 978-1-4503-1972-0. doi:10.1145/2506182.2506198.

20. de Melo, Lexvo.org: Language-related information for the linguistic Linked Data cloud, Semantic Web Journal 6(4) (2013), 393–400. doi:10.3233/SW-150171.

21. Derczynski, Maynard, Aswani and Bontcheva, Microblog-genre noise and impact on Semantic Annotation Accuracy, in: Proc. of the 24th ACM Conference on Hypertext and Social Media, New York, NY, USA, ACM, 2013, pp. 21–30. ISBN 978-1-4503-1967-6. doi:10.1145/2481492.2481495.

22. Eckle-Kohler, J.P. McCrae and Chiarcos, LemonUby – a large, interlinked, syntactically-rich resource for ontologies, Semantic Web Journal, Special Issue on Multilingual Linked Open Data 6(4) (2014), 371–378. doi:10.3233/SW-140159.

23. Ehrmann, Cecconi, Vannella, J.P. McCrae, Cimiano and Navigli, Representing multilingual data as Linked Data: The case of BabelNet 2.0, in: Proc. of the 9th International Conference on Language Resources and Evaluation, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis, eds, European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014. ISBN 978-2-9517408-8-4.

24. Exner and Nugues, Entity extraction: From unstructured text to DBpedia RDF triples, in: Proc. of the Web of Linked Entities Workshop in Conjunction with the 11th International Semantic Web Conference (ISWC 2012), CEUR, 2012, pp. 58–69.

25. Francopoulo, George, Calzolari, Monachini, Bel, Pet, Soria et al., Lexical Markup Framework (LMF), in: Proc. of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 2006, pp. 233–236.

26. Gangemi, A comparison of knowledge extraction tools for the Semantic Web, in: The Semantic Web: Semantics and Big Data, Proc. of the 10th International Conference (ESWC 2013), Cimiano, Corcho, Presutti, Hollink and Rudolph, eds, Springer, Montpellier, France, 2013. ISBN 978-3-642-38287-1.

27. García-Flores, Zweigenbaum, Yue and Turner, Tracking Researcher mobility on the Web using snippet semantic analysis, in: Proc. of Advances in Natural Language Processing: 8th International Conference on NLP (JapTAL 2012), Kanazawa, Japan, October 22–24, 2012, Springer, Berlin, Heidelberg, 2012, pp. 180–191. doi:10.1007/978-3-642-33983-7_18. ISBN 978-3-642-33983-7.

28. Gómez-Pérez, Vila-Suero, Montiel-Ponsoda, Gracia and Aguado de Cea, Guidelines for multilingual Linked Data, in: Proc. of the 3rd International Conference on Web Intelligence, Mining and Semantics, New York, NY, USA, ACM, 2013, ISBN 978-1-4503-1850-1.

29. Gracia, Montiel-Ponsoda, Cimiano, Gómez-Pérez, Buitelaar and J.P. McCrae, Challenges for the multilingual Web of Data, Web Semantics: Science, Services and Agents on the World Wide Web 11 (2011), 63–71, ISSN 1570-8268. http://www.sciencedirect.com/science/article/pii/S1570826811000783.

30. Gracia, Montiel-Ponsoda, Vila-Suero and Aguado-De-Cea, Enabling language resources to expose translations as Linked Data on the Web, in: Proc. of the 9th International Conference on Language Resources and Evaluation, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis, eds, European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014. ISBN 978-2-9517408-8-4.

31. Heath and Bizer, Linked data: Evolving the Web into a global data space, Synthesis Lectures on the Semantic Web: Theory and Technology 1(1) (2011), 1–136. doi:10.2200/S00334ED1V01Y201102WBE001.

32. Hellmann, Lehmann, Auer and Brümmer, Integrating NLP using Linked Data, in: Proc. of the Semantic Web – ISWC 2013: 12th International Semantic Web Conference, Part II, Sydney, NSW, Australia, October 21–25, 2013, Springer, Berlin, Heidelberg, 2013, pp. 98–113. doi:10.1007/978-3-642-41338-4_7. ISBN 978-3-642-41338-4.

33. Hoffart, F.M. Suchanek, Berberich and Weikum, YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia, Artificial Intelligence 194 (2013), 28–61, ISSN 0004-3702.

34. Janowicz, Hitzler, Adams, Kolas and Vardeman II, Five stars of Linked Data vocabulary use, Semantic Web 5(3) (2014), 173–176. doi:10.3233/SW-140135.

35. Khan, Frontini, Del Gratta, Monachini and Quochi, Generative lexicon theory and linguistic Linked Open Data, in: Proc. of the 6th International Conference on Generative Approaches to the Lexicon, 2013, pp. 62–69.

36. Lehmann, Isele, Jakob, Jentzsch, Kontokostas, P.N. Mendes, Hellmann, Morsey, van Kleef, Auer and Bizer, DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web Journal 6(2) (2013), 167–195. doi:10.3233/SW-140134.

37. Lopez, Uren, Sabou and Motta, Is question answering fit for the Semantic Web?: A survey, Semantic Web 2(2) (2011), 125–155. doi:10.3233/SW-2011-0041.

38. Maurel, Prolexbase: A multilingual relational lexical database of proper names, in: Proc. of the 6th International Conference on Language Resources and Evaluation, Calzolari, Choukri, Maegaard, Mariani, Odijk, Piperidis and Tapias, eds, European Language Resources Association (ELRA), Marrakech, Morocco, May 2008, ISBN 2-9517408-4-0. http://www.lrec-conf.org/proceedings/lrec2008/.

39. J.P. McCrae, Spohr and Cimiano, Linking lexical resources and ontologies on the Semantic Web with lemon, in: Proc. of the Semantic Web: Research and Applications: 8th Extended Semantic Web Conference, Part I (ESWC 2011), Heraklion, Crete, Greece, May 29–June 2, 2011, Springer, Berlin, Heidelberg, 2011, pp. 245–259. doi:10.1007/978-3-642-21034-1_17. ISBN 978-3-642-21034-1.

40. J.P. McCrae, Aguado de Cea, Buitelaar, Cimiano, Declerck, Gómez Pérez, Gracia, Hollink, Montiel-Ponsoda, Spohr and Wunner, Interchanging lexical resources on the Semantic Web, Language Resources and Evaluation 46(4) (2012), 701–719, ISSN 1574–0218. doi:10.1007/s10579-012-9182-3.

41. J.P. McCrae, Cimiano and Montiel-Ponsoda, Integrating WordNet and Wiktionary with lemon, in: Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata, Springer, Berlin, Heidelberg, 2012, pp. 25–34. doi:10.1007/978-3-642-28249-2_3. ISBN 978-3-642-28249-2.

42. J.P. McCrae, Fellbaum and Cimiano, Publishing and Linking WordNet using lemon and RDF, in: Proc. of the 3rd Workshop on Linked Data in Linguistics, 2014.

43. McNamee and T.H. Dang, Overview of the TAC 2009 knowledge base population track, in: Text Analysis Conference (TAC), Vol. 17, 2009, pp. 111–113.

44. P.N. Mendes, Jakob, García-Silva and Bizer, DBpedia spotlight: Shedding light on the web of documents, in: Proc. of the 7th International Conference on Semantic Systems, New York, NY, USA, ACM, 2011, pp. 1–8, ISBN 978-1-4503-0621-8. doi:10.1145/2063518.2063519.

45. Montiel-Ponsoda, Aguado de Cea and Gómez Pérez, Modelling multilinguality in ontologies, in: Proc. of the 22nd International Conference on Computational Linguistics, 2008, pp. 67–70. ISBN 978-1-905593-44-6.

46. Montiel-Ponsoda, Gracia del Río, Aguado de Cea and Gómez-Pérez, Representing translations on the Semantic Web, in: Proc. of the 2nd International Workshop on the Multilingual Semantic Web, 2011, pp. 30–42.

47. Montiel-Ponsoda, J.P. McCrae, Aguado de Cea and Garcia, Multilingual variation in the context of Linked Data, in: Proc. of the 10th International Conference on Terminology and Artificial Intelligence, 2013, pp. 19–26.

48. Pereira, Aggarwal and Buitelaar, AELA: An Adaptive Entity Linking Approach, in: Proc. of the 22nd International Conference on World Wide Web Companion, Schwabe, Almeida, Glaser, Baeza-Yates and Moon, eds, 2013, pp. 87–88. ISBN 978-1-4503-2035-1.

49. Pouliquen, Similarity of names across scripts: Edit distance using learned Costs of N-Grams, in: Advances in Natural Language Processing, Nordström and Ranta, eds, Lecture Notes in Computer Science, Vol. 5221, Springer, Berlin, Heidelberg, 2008, pp. 405–416.

50. Rao, McNamee and Dredze, Entity Linking: Finding extracted entities in a knowledge base, in: Multi-Source, Multilingual Information Extraction and Summarization, Springer, Berlin, Heidelberg, 2013, pp. 93–115, doi:10.1007/978-3-642-28569-1_5. ISBN 978-3-642-28569-1.

51. Savary, Manicki and Baron, Populating a multilingual ontology of proper names from open sources, Journal of Language Modelling 1(2) (2013), 189–225.

52. Steinberger, Multilingual and cross-lingual news analysis in the Europe Media Monitor (EMM), in: Multidisciplinary Information Retrieval, 6th Information Retrieval Facility Conference (IRFC’2013), Lupu, Kanoulas and Loizides, eds, Springer Lecture Notes in Computer Science, Vol. 8201, Springer, 2013, pp. 1–4.

53. Steinberger and Pouliquen, Cross-lingual named entity recognition, Lingvisticae Investigationes 30(1) (2007), 135–162. doi:10.1075/li.30.1.09ste.

54. Steinberger, Pouliquen and van der Goot, An introduction to the Europe Media Monitor family of applications, in: Proc. of the SIGIR 2009 Workshop (SIGIR-CLIR’2009), Boston, USA, July 2009, pp. 1–8.

55. Steinberger, Belyaeva, Crawley, Della-Rocca, Ebrahim, Ehrmann, Kabadjov, Steinberger and Van der Goot, Highly multilingual coreference resolution exploiting a mature entity repository, in: Proc. of the International Conference Recent Advances in Natural Language Processing 2011, Hissar, Bulgaria, 2011, pp. 254–260.

56. Steinberger, Pouliquen, Kabadjov and van der Goot, JRC-Names: A freely available, highly multilingual named entity resource, in: Proc. of the 8th International Conference Recent Advances in Natural Language Processing (RANLP’2011), Hissar, Bulgaria, September 2011, pp. 104–110.

57. Stern and Sagot, Resources for named entity recognition and resolution in News Wires, in: Workshop on Resource and Evaluation for Entity Resolution and Entity Management, Collocated with the 7th International Conference on Language Resources and Evaluation, Calzolari, Choukri, Maegaard, Mariani, Odijk, Piperidis, Rosner and Tapias, eds, European Language Resources Association (ELRA), Valletta, Malta, 2010.

58. Täckström, Predicting linguistic structure with incomplete and cross-lingual supervision, PhD thesis, Uppsala University, 2013.

59. Toral, Ferrández, Monachini and Muñoz, Web 2.0, Language resources and standards to automatically build a multilingual Named Entity lexicon, Language Resources and Evaluation 46(3) (2012), 383–419, ISSN 1574–0218. doi:10.1007/s10579-011-9148-x.

60. Trojahn, Fu, Zamazal and Ritze, State-of-the-Art in multilingual and cross-lingual ontology matching, in: Towards the Multilingual Semantic Web, Buitelaar and Cimiano, eds, Springer, Berlin, Heidelberg, 2014, pp. 119–135, ISBN 9783662435847. 3662435845.

61. Turchi, Atkinson, Wilcox, Crawley, Bucci, Steinberger and van der Goot, ONTS: “OPTIMA” News Translation System, in: Proc. of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012, pp. 25–30.

62. Unger and Cimiano, Pythia: Compositional meaning construction for ontology-based question answering on the Semantic Web, in: Proc. of the 16th International Conference on Applications of Natural Language to Information Systems, Springer, 2011, pp. 153–160.

63. Unger, J.P. McCrae, Walter, Winter and Cimiano, A lemon lexicon for DBpedia, in: Proc. of 1st International Workshop on NLP and DBpedia, Sydney, Australia, Hellmann, Filipowska, Barriere, Mendes and Kontokostas, eds, 2013.

64. van Aggelen, Hollink, Kemman, Kleppe and Beunders, The debates of the European Parliament as Linked Open Data, Semantic Web Journal (2016).

65. van den Bosch and Bogers, Memory-based named entity recognition in Tweets, in: Proc. of Making Sense of Microposts (MSM2013) Concept Extraction Challenge, Rio de Janeiro, Brazil, 2013, pp. 40–43.

66. Villegas and Bel, PAROLE/SIMPLE ‘Lemon’ ontology and lexicons, Semantic Web Journal 6(4) (2013), 363–369. doi:10.3233/SW-140148.

67. Weichselbraun, Streiff and Scharl, Linked enterprise data for fine grained named entity linking and Web Intelligence, in: Proc. of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), 2014, pp. 13:1–13:11. doi:10.1145/2611040.2611052. ISBN 978-1-4503-2538-7.

68. Wentland, Knopp, Silberer and Hartung, Building a multilingual lexical resource for named entity disambiguation, translation and transliteration, in: Proc. of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008, Calzolari, Choukri, Maegaard, Mariani, Odijk, Piperidis and Tapias, eds, European Language Resources Association (ELRA), 2008, ISBN 2-9517408-4-0. http://www.lrec-conf.org/proceedings/lrec2008/.

69. Zaghouani, Critical survey of the freely available Arabic corpora, in: Proc. of the 9th International Conference on Language Resources and Evaluation, Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools Workshop Programme, Calzolari, Choukri, Declerck, Loftsson, Maegaard, Mariani, Moreno, Odijk and Piperidis, eds, European Language Resources Association (ELRA), 2014, p. 1. ISBN 978-2-9517408-8-4.
