Abstract
Lexvo.org brings information about languages, words, and other linguistic entities to the Web of Linked Data. It defines URIs for terms, languages, scripts, and characters, which are not only highly interconnected but also linked to a variety of resources on the Web. Additionally, Lexvo.org publishes new datasets that contribute to the emerging Linked Data cloud of language-related information.
Introduction
Lexvo.org is a service that publishes information about numerous aspects of human language online in both human-readable and machine-readable form, contributing to the Web of Linked Data and the Semantic Web. Language is the basis of communication and the key to the tremendous body of written knowledge produced by humanity. Due to the ubiquity of textual data, the value of lexical and other linguistic information is increasingly being recognized in several communities, including the Database, Semantic Web, and Digital Library communities. At the same time, the value of making linguistic data interoperable has been receiving increased attention in linguistics and lexicography. Among others, the Open Linguistics Working Group [3] of the Open Knowledge Foundation has brought together researchers working in these areas and has begun to proselytize and educate by organizing workshops and meetings. These developments are leading to the emergence of a significant new part of the Linked Data cloud focusing on linguistic data. This article describes Lexvo.org.1
In particular, the 2015-03-01 version of the dataset.
One main focus of Lexvo.org is to provide descriptions of human languages. The term languages, here, is meant to be inclusive, encompassing specific language variants (such as dialects), and larger groups of language variants (e.g. macrolanguages and language families), given that the distinctions between such categories are largely conventional.
Language identification
In numerous application settings, there is a need to reference a given human language. For example, one may wish to express that a book is written in a specific language or that a person prefers that the user interface be set to a particular language. One of the central motivations for the Web of Linked Data is the idea of liberating data from traditional data silos by relying on shared global identifiers rather than database-dependent strings of characters. Thus, instead of establishing a new application-specific encoding for languages, data publishers can simply reuse shared global language identifiers.
The ubiquitous two-letter ISO 639-1 codes for languages (e.g. en for English) cover fewer than 200 major languages and thus fail to account for the several thousand languages in use around the world.
To address this situation, Lexvo.org, since 2008, has defined URIs of the form http://www.lexvo.org/id/iso639-3/eng for all of the 7 000 languages covered by the ISO 639-3 standard. While the Library of Congress has published ISO 639-2 as a controlled vocabulary based on SKOS [23], there is no good Linked Data alternative to Lexvo.org for ISO 639-3. Lexvo.org’s language identifiers are used by the British Library,2 among other institutions.
Obviously, even ISO 639-3 cannot be complete in the sense of covering every possible dialect. However, the standard has well-defined procedures for adding new identifiers and is regularly updated. It thus serves as a good practical solution for most language identification needs. The Glottolog project [25] serves significantly more fine-grained identifiers for language definitions proposed by individual linguists.
Lexvo.org delivers extensive descriptions of each language, extracted from sources such as Wikipedia and the Unicode CLDR. In the Lexvo.org data, these are often expressed using properties and classes from the Lexvo Ontology, a custom ontology focusing on general semantic classes and properties.3
In order to facilitate linking to Lexvo.org, the site provides mapping tables for MARC 21/USMARC language codes and also defines an alternative set of IDs based on the commonly used 2-letter ISO 639-1 language codes.
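As a minimal illustration of how such mappings can be used in practice, the sketch below builds a Lexvo.org language URI from a two-letter ISO 639-1 code. The tiny hard-coded table is only an excerpt for demonstration; real applications would use the complete mapping tables that Lexvo.org provides.

```java
import java.util.Map;

public class LanguageUri {
    // Illustrative excerpt of the ISO 639-1 -> ISO 639-3 correspondence;
    // Lexvo.org's downloadable mapping tables cover the full code set.
    static final Map<String, String> ISO1_TO_ISO3 = Map.of(
        "en", "eng", "de", "deu", "fr", "fra", "es", "spa", "zh", "zho");

    // Resolve a two-letter code to the corresponding ISO 639-3-based URI.
    static String languageUri(String iso639_1) {
        String iso3 = ISO1_TO_ISO3.get(iso639_1);
        if (iso3 == null) {
            throw new IllegalArgumentException("unknown ISO 639-1 code: " + iso639_1);
        }
        return "http://www.lexvo.org/id/iso639-3/" + iso3;
    }
}
```

For example, `languageUri("en")` yields the URI for English based on its three-letter code eng.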
Lexvo.org’s language identifiers are connected to DBpedia, YAGO, and other existing sites. Additionally, for around 400 languages, the service now provides links to text samples (specifically, the Universal Declaration of Human Rights).
Language families and language collections are described using URIs based on ISO 639-5, e.g. http://www.lexvo.org/id/iso639-5/sit for the Sino-Tibetan languages. Some of the identifiers refer not to language families per se, but to other types of collections, e.g. the set of sign languages.
Lexvo.org draws information about language collections, including relationships between them, from Wikipedia, WordNet [12], and the ISO standard. This leads to an extensive language hierarchy that is fully integrated into a general-purpose WordNet-based word sense hierarchy (see Section 3.2). From Mandarin Chinese, for instance, one can easily navigate to general Chinese, the Sinitic languages, and the Sino-Tibetan languages. While there is still considerable research and debate on certain putative language family relationships, Lexvo.org relies on a methodology that favours well-established relationships and to some extent eschews any forced distinctions between dialects, languages, and language families. Instead, it draws on official standards, WordNet, and Wikipedia to form a hierarchy of language systems of varying granularity, as described elsewhere in more detail [10].
Scripts and characters
On Lexvo.org, languages are connected to the writing systems commonly used for them, such as Cyrillic, Devanagari, or the Korean Hangul system. The identifiers for these writing systems are based on the ISO 15924 standard. By extracting Unicode Property Values from the Unicode specification, Lexvo.org also connects writing systems with the specific characters that are part of them.
URIs of the form http://www.lexvo.org/id/char/5A34 are provided for each of the several thousand abstract characters defined by the international Unicode standard. The final part here, 5A34, is the hexadecimal code point of the character.
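Such character URIs can be derived mechanically from a Unicode code point. The sketch below assumes that the final path segment is simply the uppercase hexadecimal code point, zero-padded to at least four digits, as in the example URI above.

```java
public class CharUri {
    // Build a Lexvo.org character URI from a Unicode code point, assuming
    // the final path segment is the uppercase hex code point (min. 4 digits).
    static String charUri(int codePoint) {
        return "http://www.lexvo.org/id/char/" + String.format("%04X", codePoint);
    }
}
```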
Phones
Lexvo.org was recently extended to include phonetic information. For a given phone, it provides different representations (IPA, X-SAMPA, Arpabet, etc.) and properties (e.g. labiodental, plosive).4
The phoible.org project provides a more extensive description of the phonetic properties of different languages.
Identifiers for terms
The second major focus of Lexvo.org is to identify and describe terms (or words) and their properties. In the RDF standard that much of the Semantic Web and Linked Data cloud is based on, string literals cannot serve as subjects of a statement. The same applies to string literals with language tags, and thus it is non-trivial to describe terms in RDF. Some ontologies defined OWL classes that represent words or other terms in a language. However, data publishers still needed to create the URIs for individual terms on an ad hoc basis. For instance, the W3C draft RDF representation of WordNet [27] defined URIs for the words covered by the WordNet lexical database [12].
In order to provide data publishers with a simple way of identifying any word using a URI, Lexvo.org proposed a standard, uniform scheme for referring to any term in a given language. Data publishers and developers can obtain and work with such URIs by using a simple Java API (see Section 5) or by following the specifications described below.
Formal semantics
Formally, different levels of abstraction could be chosen to refer to terms. Lexvo.org’s term URIs are intended to serve as interchange URIs that can easily be created from any word-segmented natural language text. When a term is encountered in a text document, a computer system initially only sees the surface form. Typically, one wishes to look up terms in a given knowledge source (e.g. in a thesaurus or a dictionary) without already knowing what meanings they have.
Lexvo.org’s notion of term is thus based exclusively on the surface form, at an abstraction level that does not reflect any semantic or historic differences between words sharing the same form. This notion of terms is hence at a higher level of abstraction than the standard linguistic notion of words, which typically considers the animal noun “bear” different from the verb “bear”, and typically distinguishes “bass” as in music from “bass” as a fish. Lexvo.org’s notion of terms explicitly avoids distinguishing the meanings of polysemous or homonymic words within a specific language, because in many settings all one has is the surface form without any knowledge of which specific homonymic variant one is dealing with.
Instead, Lexvo.org makes such distinctions only at the word sense level, described later on in Section 3.2. Lexvo.org thereby avoids the often rather subjective decisions about whether two word senses have a distinct enough history to count as separate words or not. However, data publishers are free to use Lexvo.org in conjunction with word entities that group together different word senses.
Lexvo.org does, however, consider the language of a term relevant to its identity. Thus, the Spanish term “con”, which means “with”, is treated as distinct from the French term “con”, which means “idiot”. This level of abstraction allows relationships between terms in different languages to be modelled using simple RDF triples. If URIs were instead based on pure string literals without language information, some rather cumbersome form of reification would be required to properly reference the languages of those terms.
Different word forms are treated as distinct terms. Here, however, there are a few minor subtleties of term identity regarding their string encodings. For multilingual applications, the ISO 10646/Unicode standards offer an appropriate set of characters for encoding strings. Formally, given a term t in a language L, the URI is constructed as follows:
1. The term t is encoded in Unicode, and the NFC normalization procedure [4] is applied to ensure a unique representation: un-normalized Unicode allows encoding a character such as “à” in either a composed or a decomposed form, and normalization ensures that there is only one canonical form.
2. The resulting Unicode code point string is encoded in UTF-8 to obtain a sequence of bytes.
3. This byte sequence is converted to an ASCII path segment by applying percent-encoding as per the RFC 3986 standard. Characters outside the unreserved set, as well as the “%” character itself, are escaped as triplets of the form “%XX”, where XX is the hexadecimal value of the byte.
The resulting path segment, together with an identifier for the language L, is appended to the base address of Lexvo.org’s term namespace.
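The steps above can be sketched in Java using only standard library facilities. The base address http://www.lexvo.org/id/term/ followed by the ISO 639-3 code is an assumption made here for illustration; the actual Lexvo.org Java API hides these details and may differ in its conventions.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class TermUri {
    // Hypothetical helper illustrating the construction steps; not the
    // official Lexvo.org API. The base address is an assumption.
    static String termUri(String term, String iso639_3) {
        // Step 1: NFC normalization for a unique Unicode representation
        String nfc = Normalizer.normalize(term, Normalizer.Form.NFC);
        // Steps 2 and 3: UTF-8 encoding, then RFC 3986 percent-encoding
        StringBuilder sb = new StringBuilder();
        for (byte b : nfc.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '-' || c == '.'
                    || c == '_' || c == '~';
            if (unreserved) {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return "http://www.lexvo.org/id/term/" + iso639_3 + "/" + sb;
    }
}
```

Note that a decomposed input such as "a" followed by a combining grave accent (U+0300) normalizes to the same URI as the precomposed “à”, and that homographs in different languages, such as the Spanish and French terms “con”, receive distinct URIs.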
Fortunately, Lexvo.org’s Java API (Section 5) hides most of these details from data publishers, instead providing a very simple interface to obtain a term URI given a string and its language.
Term descriptions and links
Capturing links to terms is particularly significant in light of the important role of natural language for the Semantic Web. In general, a non-information resource URI string itself does not convey reliable information about its intended meaning, because a URI (including class or property names) can be chosen quite arbitrarily. Oftentimes the meaning is specified using natural language definitions or characteristic labels. From a semantic perspective, however, such RDFS labels are mere string literals, which cannot themselves be described or linked any further.
In order to make the meaning of URIs more explicit, Lexvo.org proposes linking URIs to term URIs of one or more natural languages using a lexicalization property, whenever appropriate. Such a property captures the semantic relationship between a concept and its natural language lexicalizations, or between an arbitrary entity and natural language terms that refer to it.
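As an illustration, such a lexicalization link amounts to a single RDF triple. In the sketch below, both the concept URI and the property name (a “label” property in a Lexvo ontology namespace) are assumptions for illustration, not necessarily the verbatim terms of the Lexvo Ontology.

```java
public class Lexicalization {
    // Emit an N-Triples statement linking an arbitrary entity URI to a
    // Lexvo.org term URI. The property name here is an illustrative
    // assumption, not a confirmed Lexvo Ontology term.
    static String lexicalizationTriple(String entityUri, String termUri) {
        String prop = "http://lexvo.org/ontology#label";
        return "<" + entityUri + "> <" + prop + "> <" + termUri + "> .";
    }
}
```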
Unlike SKOS [23], which aims at expressing descriptions of individual vocabularies and their properties, Lexvo.org describes generic properties of words and other expressions in a language as a whole. For instance, Lexvo.org’s properties do not make any normative claims about which labels are to be preferred, but merely describe the fact that a certain term can be used in some specific sense. The focus on describing language per se also means that the Lexvo Ontology does not try to describe specific vocabularies, lexicons, or lexical entries in them. Given that Lexvo.org predates the lemon model [21] and has different goals, it relies on a somewhat different conceptual framework. Fortunately, the two models can be made interoperable to a considerable extent. While both SKOS and lemon focus on providing a rich vocabulary that can be used to express terminological or lexical knowledge, Lexvo.org’s primary focus has always been to actually provide certain kinds of language-related knowledge, as well as reusable identifiers for the entities involved.
Much of the multilingual information about words that Lexvo.org supplies is procured from Wiktionary, a well-known effort to collaboratively create dictionaries on the Web. Lexvo.org currently provides links from terms to human-readable web pages in the English, Catalan, French, German, Greek, Portuguese, Spanish, and Swedish Wiktionary editions. Lexvo.org also extracts translations as well as part-of-speech information, indicating whether a word functions as a noun or an adjective, for instance. Lexvo.org also publishes etymological relationships between words as triples (see Section 4.2). Back in 2008, Lexvo.org was the first site to publish Linked Data based on Wiktionary. More recently, other sites have emerged that aim at establishing a more direct mapping of Wiktionary’s data to RDF [16].
Lexvo.org’s term entities are also linked to corresponding concepts in external resources, such as the GEneral Multilingual Environmental Thesaurus (GEMET), the United Nations FAO AGROVOC thesaurus, the US National Agricultural Library Thesaurus, EuroVoc, and the RAMEAU subject headings. Links to upper ontologies such as OpenCyc are provided as well.
Word senses
Ideally, there would be a universal registry of word meanings that could serve as a hub in the Linked Data world that anyone could link to. Unfortunately, there are several challenges: (1) There is no obvious universal inventory of word senses. Even authoritative dictionaries differ significantly in the senses they enumerate for a given word [13]. In sources like Wiktionary, sense definitions can be quite ephemeral and vary over time as site visitors make changes to the pages. (2) Even if we had an adequate registry of senses, most existing linguistic resources do not make sense distinctions, so we cannot easily link them to the appropriate senses, as automatic disambiguation is known to be rather error-prone. (3) Even when a resource does distinguish senses, these are unlikely to be compatible with the chosen inventory. Empirical studies show that senses in different datasets often do not align in a clean way [7]. Some even go as far as arguing that word senses are not necessarily a useful notion at all [17]. For these reasons, Lexvo.org mainly focuses on term-based URIs that do not distinguish word senses.
The service does, however, also include word sense-specific URIs based on Princeton’s WordNet lexical database [12]. WordNet is the most widely used sense inventory in natural language processing and thus the closest we have to a universal word sense inventory. While designed for English, WordNet’s senses have also been used for many other languages, especially in UWN, the Universal WordNet project [9]. WordNet’s identifiers have been linked to YAGO, SUMO, OpenCyc, VerbNet, and numerous other datasets (some of which are discussed later on). Lexvo.org was the first service to put WordNet 3.0 online as Linked Data. It links English terms to their respective WordNet synsets and provides links between synsets.
Towards a Linguistic Linked Data cloud
Lexvo.org first went online in 2008, serving information from lexical resources like Wiktionary as well as descriptive information about languages. In 2010, Bernard Vatant decided to deprecate the lingvoj.org service, which had been publishing language identifiers based on Wikipedia, and instead redirected users to Lexvo.org, which provides richer descriptions of over an order of magnitude more languages. Due to these developments, a number of data publishers have recognized the value of Lexvo.org’s language descriptions.
Recently, many third parties have created linguistic datasets, leading to the emergence of a cloud of Linguistic Linked Data [3]. In order to strengthen and accelerate these efforts, Lexvo.org has begun publishing several separate new datasets on its download page.5
Roget’s Thesaurus is the most well-known English thesaurus, but the standard distribution comes in a text format that is hard to parse. Lexvo.org hosts an RDF version of the American 1911 edition of Roget’s Thesaurus [2,20]. The textual data is parsed using a rule-based top-down approach [8], which also needs to handle various formatting problems (including errors) in the original data. The RDF conversion includes both the hierarchy of topics as well as the terms associated with individual headword-level topics.
The WordNet Evocation dataset [1] provides data about associations between word senses, e.g. between senses of “car” and “road”. We obtain WordNet URIs for the respective senses, and, following the original paper, determine the median score of each evocation relationship. In order to express these scores, RDF Reification is used.
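The use of RDF reification here can be sketched as follows. The statement URI and the evocation and score property names below are placeholders for illustration, not the dataset's actual vocabulary; only the rdf:Statement / rdf:subject / rdf:predicate / rdf:object pattern is standard.

```java
public class Evocation {
    // Express a scored evocation relationship via standard RDF reification:
    // an rdf:Statement resource describes the (subject, predicate, object)
    // triple, and the median score is attached to that statement resource.
    static String[] reifiedEvocation(String stmtUri, String senseA,
                                     String senseB, double medianScore) {
        String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
        String evokes = "http://example.org/evokes";     // placeholder predicate
        String score = "http://example.org/medianScore"; // placeholder predicate
        return new String[] {
            "<" + stmtUri + "> <" + RDF + "type> <" + RDF + "Statement> .",
            "<" + stmtUri + "> <" + RDF + "subject> <" + senseA + "> .",
            "<" + stmtUri + "> <" + RDF + "predicate> <" + evokes + "> .",
            "<" + stmtUri + "> <" + RDF + "object> <" + senseB + "> .",
            "<" + stmtUri + "> <" + score + "> \"" + medianScore + "\" ."
        };
    }
}
```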
WordNet Domains delivers thematic domain markers, such as sport or medicine, for word senses.
Cross-linguistic data
Etymological WordNet [6] contributes links between words that are not semantic but etymological or derivational in nature. For instance, the English word “salary” ultimately goes back to the Latin word “sal” (salt). The RDF conversion straightforwardly maps the original relationships between words extracted from Wiktionary to triples that use Lexvo.org term URIs and predicates from the Lexvo Ontology.
Semantic frames and roles
Lexvo.org maintains RDF datasets for FrameNet [14], the PropBank lexicon [26] (not the corpus), and NomBank [22], all resources that model the phrase- and sentence-level semantic frames and roles that words express. The RDF conversions are currently quite incomplete, as they only contain a subset of information that can easily be expressed using the Lexvo Ontology and other well-known RDF predicates. However, they will be extended over time, as lemon and other models finalize their representations of such data [21].
Lexvo.org also publishes an RDF version of VerbNet [18], a resource providing a Levin-style [19] classification of verbs based on their syntactic properties. The RDF conversion includes word senses, class memberships, and class hierarchy relationships, as well as mappings to WordNet.
Sentiment analysis data
Lexvo.org hosts a Linked Data version of the MPQA Subjectivity Lexicon [28], which supplies subjectivity and sentiment polarity labels.
Additionally, the AFINN dataset has been converted [24], offering more fine-grained numeric sentiment valency scores. The Lexvo Ontology assumes scores in the range
Speech-related data
An RDF version of the CMU Pronunciation Dictionary6 is provided as well.
Processing framework
Lexvo.org relies on an automatic data processing architecture that allows for simple updates whenever any of the original data sources change. Lexvo.org is updated regularly so that its data remains reasonably current. For deprecated ISO 639-3 language codes, the system points to relevant alternative language identifiers.
The information that Lexvo.org serves is processed in a fairly sophisticated workflow system. In a first step, data is extracted from numerous sources, including official registries, Wikipedia, and Wiktionary.7
Full list at
Lexvo.org is backed by a custom Linked Data server infrastructure that makes its URIs dereferenceable online and part of the Linked Data cloud. This infrastructure also provides machine-readable dumps and mapping tables (see http://www.lexvo.org/linkeddata/resources.html for more information). The RDF dump is available under a Creative Commons CC-BY-SA license.8
Details at
Data publishers typically use Lexvo.org as a provider of URIs for language-related entities. The aforementioned Java API provides an easy way of constructing Lexvo.org-based URIs. One can provide an ISO language code as input and obtain the corresponding Lexvo.org language URI. One can also easily construct term URIs simply by supplying a string and a language. The API automatically carries out the steps detailed earlier in Section 3. Further information about this API is available online.9
Data publishers may also want to consult the Lexvo Ontology,10
Finally, data consumers use Lexvo.org to retrieve the various kinds of information elaborated earlier in Sections 2, 3, and 4. For example, they can retrieve senses and translations of a term or the geographical locations associated with a language.
In summary, Lexvo.org is a valuable service that defines standard identifiers for languages and language families, words and word senses, scripts, characters, and other language-related entities. These are being used by a multitude of data publishers in several different communities. Additionally, Lexvo.org publishes a broad spectrum of language-related information about these entities that is extensively used by numerous third-party data consumers. Lexvo.org provides online access as well as machine-readable dumps to ensure widespread availability of this information. This ecosystem of data constitutes a useful basis for applications in linguistics, natural language processing, and other areas that benefit from the considerably interlinked and interoperable nature of the resources. We believe that this provides tremendous incentives for third parties to contribute to the growing Linguistic Linked Data cloud.
