The world is losing some of its 7,000 languages. Hypothesizing that language attrition might subside if all languages were intertranslatable, the PanLex project supports panlingual lexical translation by integrating all known lexical translations. Semantic Web technologies can flexibly represent and reason with the content of its database and interlink it with linguistic and other resources and annotations. Conversely, PanLex, with its collection of translation links between more than a billion pairs of lexemes from more than 9,000 language varieties, can improve the coverage of the Linguistic Web of Data. We detail how we transformed the content of the PanLex database to RDF, established conformance with the lemon and GOLD data models, interlinked it with Lexvo and DBpedia, and published it as Linked Data and via SPARQL.
Language attrition has extinguished or threatened from 10% to over 75% of all languages in the last 60 years in various regions [10]. This attrition arguably imperils human biological knowledge and species diversity [9]. Hypothetically, panlingual intertranslatability would make all languages more useful and incentivize their preservation and revitalization.
The PanLex project is making all languages’ lexicons intertranslatable. In some contexts (e.g., profiles, catalogs, tags, search, and web navigation), lexical translation can account for most of the translation load. PanLex systematically integrates lexical translations, found in diverse sources, into a database for research, applications, and public use. The content can be interpreted as a graph that links lexemes in “is-a-translation-of” relations and permits automated inference of additional, unattested relations.
The Semantic Web initiative has led to the development of standards and technologies supporting a machine-readable and -interpretable Linked Data network, known as the Web of Data (http://www.w3.org/standards/semanticweb/). From these efforts the Linked Open Data (LOD) cloud (http://lod-cloud.net/) emerged. A growing community is leveraging Semantic Web technologies for linguistic knowledge, building a Linguistic LOD (LLOD) cloud.
Here we describe how we connected PanLex to this Linked Data network. In Section 2 we introduce the dataset, present a PanLex RDF vocabulary, and explain how we transformed the one into the other and established conformance with additional data models. Section 3 shows how we linked to other datasets of the LLOD cloud, and Section 4 is about the publication of the dataset. Usage scenarios are given in Section 5, and related work is discussed in Section 6. Finally, Section 7 concludes this paper.
Triplification of the raw data
In this section, we analyze the PanLex dataset, introduce our URI and vocabulary design, which resemble PanLex’s conceptual model, summarize how we classified PanLex’s instance data with additional data models and explain our transformation of the data to RDF.
Analysis of the original dataset
The PanLex database is created by editors who consult information sources (http://panlex.org/tech/plrefs.shtml), such as mono- and multilingual dictionaries, glossaries, standards, and thesauri. The data include single- and multi-word expressions, corresponding meanings assigned to them, and related information. PanLex data constitute editors’ interpretations of sources’ assertions that two or more expressions share a meaning.
The most important entities and relations of PanLex’s conceptual model are depicted in Fig. 1.
The source entity is the authority to which an editor attributes assertions about lexical translations.
Expressions are lexical entities, each uniquely identified with a text, i.e. a string of (Unicode) characters, and a variety of a language. Expressions resemble lemmas or dictionary-entry headwords, but differ from them in at least two ways. (1) Homographs, such as the verb “hide” (conceal) and the noun “hide” (animal skin) in English, are treated as a single expression in PanLex. (2) Multiword expressions, such as “fall in love”, traditionally found in an entry headed by one of their words, such as “fall” or “love”, are treated as independent expressions in PanLex.
Languages in PanLex are identified using ISO 639-3 codes.
Language varieties are collections of expressions. Each has a unique identifier: a language code and a distinguishing integer. For example, six dialects of Ahtna are identified as “aht-000” through “aht-005”. These labels are, themselves, treated as a (controlled) language variety, whose expressions (i.e. the labels) are translated into natural languages and other controlled languages (such as the IETF standard BCP 47, http://tools.ietf.org/html/bcp47).
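The identifier scheme for language varieties can be illustrated with a short Python sketch. The function names are hypothetical, and the three-letter code and three-digit zero padding are inferred from the “aht-000” example rather than taken from a PanLex specification.

```python
import re
from typing import Tuple

# Shape inferred from examples such as "aht-000" (an assumption):
# a three-letter language code, a hyphen, and a three-digit integer.
_VARIETY_RE = re.compile(r"^([a-z]{3})-(\d{3})$")


def format_variety(lang_code: str, index: int) -> str:
    """Build an identifier such as 'aht-005' from a code and an integer."""
    return f"{lang_code}-{index:03d}"


def parse_variety(identifier: str) -> Tuple[str, int]:
    """Split an identifier such as 'aht-005' into ('aht', 5)."""
    match = _VARIETY_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a language variety identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


# The six Ahtna varieties mentioned in the text.
ahtna = [format_variety("aht", i) for i in range(6)]
```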
Meanings are entities assigned to expressions, thereby identifying expressions as translations or synonyms. For example, a source’s translation of the German expression “klingen” into English as “ring, sound, seem” can be interpreted as (1) the assignment of a single meaning to all four expressions or (2) the assignment of two or three meanings to “klingen” and of one of those to each of the English expressions. Meanings are source-specific. The identification and consolidation of equivalent meanings of distinct sources is a research topic, not a database feature. Meanings can have properties of three types. (1) Definitions are descriptions of a meaning, consisting of text strings annotated as being in particular language varieties. (2) Domains are expressions (e.g., “medicine”) that characterize a meaning, but do not express it. (3) Meaning identifiers are strings acting as references to identifiers in a source.
Denotations are assignments of meanings to expressions. A denotation may have one or more word classes (a closed set based on OLIF, the Open Lexicon Interchange Format) and/or metadata (arbitrary strings paired as keys and values).
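To make the role of meanings and denotations concrete, the following Python sketch derives translations and synonyms from shared meanings. It is only an illustration: the tuple layout, the function name, the meaning identifier, and the variety labels “deu-000” and “eng-000” are assumptions, and the data follow the first interpretation of the “klingen” example, in which a single meaning is assigned to all four expressions.

```python
from collections import defaultdict

# Hypothetical denotations as (meaning_id, (language_variety, text)) pairs.
denotations = [
    (1, ("deu-000", "klingen")),
    (1, ("eng-000", "ring")),
    (1, ("eng-000", "sound")),
    (1, ("eng-000", "seem")),
]


def shared_meaning_expressions(expression, denotations):
    """Return all other expressions that share at least one meaning.

    Results in a different language variety are translations; results in
    the same variety are synonyms.
    """
    by_meaning = defaultdict(set)
    for meaning_id, expr in denotations:
        by_meaning[meaning_id].add(expr)
    result = set()
    for expressions in by_meaning.values():
        if expression in expressions:
            result |= expressions - {expression}
    return result


related = shared_meaning_expressions(("deu-000", "klingen"), denotations)
```

Because meanings are source-specific, a lookup of this kind only follows links attested within one source's meaning; consolidating equivalent meanings across sources is outside its scope.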
Users may define sources, attribute data to them, and define language varieties.
Among the properties of sources are licenses. Some of the license categories are public domain, request (author invites inquiries), GNU Free Documentation License (FDL), and PanLex Use Permission (specific permission for use in PanLex). The distribution of licenses is shown in Table 1.
Number of instances of main entities in the PanLex database
Entity                          Instances
Denotations                     54,278,860
Meanings                        20,773,371
Expressions                     19,790,453
Definitions                      2,747,892
Language Varieties identified        9,310
Language Varieties with data         9,239
Languages                            7,843
Sources being consulted              4,190
Sources already consulted            1,453
Users                                   23
The PanLex vocabulary
The entities and relations described above are the base for the PanLex RDF vocabulary. In general, all PanLex RDF resources reside in the namespace http://ld.panlex.org/plx/, abbreviated with plx. An example of the resulting ontology is depicted in Fig. 2 and summarized as follows. Unless otherwise noted, the URIs of instances of PanLex classes follow the pattern plx:{className}/{id}, where {className} is spelled in lower camel case and the {id} is the primary key of the corresponding database table.
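The URI pattern described above can be sketched in a few lines of Python. The function names and the primary-key value are hypothetical; only the namespace and the plx:{className}/{id} pattern come from the text.

```python
PLX = "http://ld.panlex.org/plx/"


def lower_camel(class_name: str) -> str:
    """Lower-case the first letter, e.g. 'LanguageVariety' -> 'languageVariety'."""
    return class_name[:1].lower() + class_name[1:]


def instance_uri(class_name: str, primary_key: int) -> str:
    """Build an instance URI following the plx:{className}/{id} pattern."""
    return f"{PLX}{lower_camel(class_name)}/{primary_key}"


# 187 is an arbitrary, illustrative primary key.
example_uri = instance_uri("LanguageVariety", 187)
```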
Expressions are modeled as instances of the class plx:Expression. Their original and degraded textual representations become the values of the properties rdfs:label and plx:degradedText, respectively. Their corresponding language variety is stated using plx:languageVariety.
For languages and language varieties, the classes plx:Language and plx:LanguageVariety are introduced. ISO 639-1 and ISO 639-3 codes become instances of the classes plx:Iso639-1Code and plx:Iso639-3Code.
The RDF analog of the PanLex meaning is the plx:Meaning. Entities of this class may have an identifier assigned with the plx:identifier property pointing to an xsd:string literal. Meanings may also have definitions, entities of the plx:Definition class, giving a textual representation (rdfs:label) in a certain language variety (plx:languageVariety).
Meanings and expressions are linked via denotations. These are entities of the plx:Denotation class pointing to meanings and expressions via the properties plx:denotationMeaning and plx:denotationExpression. Denotations may also have a word class assigned to them. This can be achieved with the denotation’s plx:wordClass property pointing to a plx:WordClass entity.
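As a rough illustration of how a denotation ties a meaning to an expression, the following Python sketch serializes one denotation as N-Triples lines. It assumes that classes and properties live directly under the plx namespace (e.g. plx:denotationMeaning expands to http://ld.panlex.org/plx/denotationMeaning); the function names and identifiers are hypothetical, and the literal escaping covers only backslashes, quotes, and newlines.

```python
PLX = "http://ld.panlex.org/plx/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def uri(term: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{term}>"


def literal(text: str) -> str:
    """Quote a plain string literal with minimal escaping."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def denotation_triples(denotation_id, meaning_id, expression_id, text):
    """Return N-Triples lines linking one meaning and one expression."""
    den = uri(f"{PLX}denotation/{denotation_id}")
    mng = uri(f"{PLX}meaning/{meaning_id}")
    exp = uri(f"{PLX}expression/{expression_id}")
    return [
        f"{den} {uri(RDF_TYPE)} {uri(PLX + 'Denotation')} .",
        f"{den} {uri(PLX + 'denotationMeaning')} {mng} .",
        f"{den} {uri(PLX + 'denotationExpression')} {exp} .",
        f"{mng} {uri(RDF_TYPE)} {uri(PLX + 'Meaning')} .",
        f"{exp} {uri(RDF_TYPE)} {uri(PLX + 'Expression')} .",
        f"{exp} {uri(RDFS_LABEL)} {literal(text)} .",
    ]


# Identifiers 1, 2 and 3 are illustrative, not real database keys.
triples = denotation_triples(1, 2, 3, "between")
```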
All sources share the plx:Source class. The characteristics of a source are described mainly by triples with literal objects, for example dc:title for the title of a source and dc:creator for an xsd:string containing the author’s name. At present, we support the different license categories recognized in the database by creating resources of the plx:License class.
Example of the PanLex RDF vocabulary showing one meaning of the expression ‘between’ and the corresponding source and definition.
Classes and properties used in the PanLex RDF vocabulary. Note that all rdf:type properties are omitted for brevity
The PanLex vocabulary is based on PanLex’s conceptual schema and enables all of PanLex’s data to be directly exposed as RDF. Additionally, we re-use existing vocabularies, namely the Lexicon Model for Ontologies (lemon) [7] and the General Ontology for Linguistic Description (GOLD) [4]. Since these models differ from the PanLex one, we follow an incremental approach of aligning the PanLex data with them. Table 4 shows PanLex classes with their current counterparts in lemon and GOLD, respectively. The parts implemented in our RDF conversion are displayed in Fig. 3.
Parts of the GOLD (left) and lemon model (right) re-used in PanLex (URI prefixes are omitted for brevity).
Classes considered to be similar across the re-used vocabulary
PanLex            lemon                 GOLD
plx:Denotation    –                     gold:LinguisticSign
plx:Meaning       lemon:LexicalSense    gold:SemanticUnit
plx:Expression    lemon:LexicalEntry    gold:FormUnit
RDF transformation workflow
Since new sources are added to the PanLex database on almost a daily basis and because of its current size (~18 GB), the recurrent conversion of the database to capture changes in it is impractical. As the PanLex data already reside in a relational database, the use of a virtual RDB2RDF (http://www.w3.org/2001/sw/wiki/RDB2RDF) mapping solution is a natural choice. The Sparqlify system (https://github.com/AKSW/Sparqlify) offers, besides an efficient query rewriting engine, an easy-to-use mapping language called the Sparqlification Mapping Language (SML). Essentially, an SML mapping consists of three clauses. The From clause specifies the logical SQL table (i.e. table, view, or query) to be used in the SML view. The With clause binds a set of SPARQL variables to expressions that yield RDF terms from relational columns. Finally, the Construct clause holds a set of triple patterns. Figure 4 shows an example of an SML view for the languages in PanLex: from each row of the table i1, three resources are created based on the iso3 column and bound to the variable names ?lang, ?iso3 and ?lexvo3. Resources for ?lang are typed as a Language in both the PanLex and the schema.org namespaces. This view-based approach makes it easy to perform future revisions of the RDF mapping, such as adding support for new vocabularies.
An excerpt of an SML view definition for PanLex’s languages. This example also demonstrates how “is-a” relations to schema.org and links to Lexvo are established.
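The effect of such a view can be approximated in plain Python. The sketch below is not SML and not the project's mapping; it only mimics the three clauses: iterating over rows of a table (From), deriving three resources from the iso3 column (With), and emitting triples (Construct). The PanLex URI patterns for ?lang and ?iso3 and the use of owl:sameAs as the linking predicate are assumptions; the Lexvo URI pattern and the class names plx:Language and plx:Iso639-3Code are taken from Lexvo and from the vocabulary described above.

```python
PLX = "http://ld.panlex.org/plx/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SCHEMA_LANGUAGE = "http://schema.org/Language"
LEXVO_ISO3 = "http://lexvo.org/id/iso639-3/"


def language_view(rows):
    """Emulate the view: one set of triples per row of the source table."""
    triples = []
    for row in rows:                         # "From": rows of table i1
        code = row["iso3"]
        lang = f"{PLX}language/{code}"       # "With": ?lang (assumed pattern)
        iso3 = f"{PLX}iso639-3Code/{code}"   # "With": ?iso3 (assumed pattern)
        lexvo3 = f"{LEXVO_ISO3}{code}"       # "With": ?lexvo3
        triples += [                         # "Construct": triple patterns
            (lang, RDF_TYPE, PLX + "Language"),
            (lang, RDF_TYPE, SCHEMA_LANGUAGE),
            (iso3, RDF_TYPE, PLX + "Iso639-3Code"),
            (lang, OWL_SAME_AS, lexvo3),     # linking predicate is an assumption
        ]
    return triples


view_triples = language_view([{"iso3": "aht"}])
```

In a virtual RDB2RDF setup such as the one described, these triples are not materialized; SPARQL queries are rewritten to SQL against the underlying table instead.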
Linking
The SML view in Fig. 4 establishes the interlinking of languages in PanLex with Lexvo [3]. Here we outline the interlinking with DBpedia [5], where we were interested in creating valid and dereferenceable links. Therefore, we iterated over the titles datasets (http://wiki.dbpedia.org/Downloads38), which map (non-localized) DBpedia URIs to their page titles in the respective language. For each language version we normalized the labels by applying Unicode NFKD normalization (http://unicode.org/reports/tr15/) and removing punctuation characters. Each DBpedia resource was then mapped to the PanLex expression that was equal to the resource’s normalized label in the respective language. Table 5 summarizes the number of links obtained.
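The label normalization and matching step can be sketched in Python with the standard unicodedata module. This is a minimal illustration rather than the project's interlinking code: the function names are hypothetical, “punctuation” is taken to mean every character whose Unicode general category starts with “P”, and expression texts are assumed to be normalized in the same way as the labels before comparison.

```python
import unicodedata


def normalize_label(label: str) -> str:
    """Apply NFKD normalization, then drop punctuation characters."""
    decomposed = unicodedata.normalize("NFKD", label)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("P")
    )


def link_titles(titles, expressions):
    """Match DBpedia page titles against expressions of one language.

    titles:      dict mapping DBpedia URI -> page title
    expressions: dict mapping expression text -> expression id
    Returns a list of (expression id, DBpedia URI) pairs.
    """
    index = {normalize_label(text): expr_id
             for text, expr_id in expressions.items()}
    links = []
    for dbpedia_uri, title in titles.items():
        expr_id = index.get(normalize_label(title))
        if expr_id is not None:
            links.append((expr_id, dbpedia_uri))
    return links


# Illustrative data; the expression id 42 is made up.
links = link_titles(
    {"http://dbpedia.org/resource/Ahtna": "Ahtna"},
    {"Ahtna": 42},
)
```

Exact matching after normalization keeps the links precise but, as the coverage figures suggest, it misses expressions whose wording differs from the page title.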
In total, about 2.5 million links were obtained for approx. 20 million expressions. This relatively low coverage can be attributed to frequently appearing multi-word expressions that do not match the DBpedia titles well, and to the fact that we have so far considered only DBpedia datasets for mainstream languages, whereas PanLex focuses on low-density ones.
Number of DBpedia links per language
Language      Links        Language      Links
English       1,415,241    Catalan       27,779
German          224,146    Korean        24,912
French          187,364    Turkish       22,258
Italian         147,485    Bulgarian     19,431
Spanish         117,056    Hungarian     18,203
Portuguese      112,266    Slovene       11,981
Polish          110,974    Greek          1,112
Russian          68,040    Czech         28,767
Total         2,537,015
Publishing
With our RDF conversion work, we complement existing APIs14
run by Sparqlify and Virtuoso. An overview is shown in Fig. 5. The SPARQL browser SNORQL (https://github.com/kurtjx/SNORQL) can be accessed by replacing sparql with snorql in the respective links. Our SML views and the interlinking code are hosted on GitHub (https://github.com/AKSW/PanLex-2-RDF). The created linksets are hosted in the PanLex database and are published together with the other data using Sparqlify. Finally, we offer downloads tagged with timestamps of their creation (http://ld.panlex.org/downloads/releases/).
PanLex architecture.
Dataset benefits and usage scenarios
Using Semantic Web technologies brings general benefits: simplified data integration through RDF and vocabulary reuse, enrichment of data through interlinking, support for reasoning, and exploration of the data with generic Semantic Web tools. Moreover, some applications, like the TeraDict translation lookup service (http://panlex.org/teradict/?lg=eng), can now be realized using SPARQL queries and so easily integrated into other applications. Due to space considerations, we refer the reader to the PanLex Linked Data landing page (http://ld.panlex.org), where a collection of SPARQL queries is maintained. Also, since PanLex covers a niche by providing linguistic data for non-mainstream languages, investigating its fitness for use in cross-language information retrieval, as well as in annotation projects like DBpedia Spotlight [8], seems worthwhile.
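As an indication of what a translation lookup over the PanLex vocabulary might look like, the following Python sketch builds a SPARQL query string. It is not one of the queries maintained on the landing page: the function name is hypothetical, the property URIs are assumed to live directly under the plx namespace, and the query only follows denotations that share a single meaning. No endpoint is contacted.

```python
QUERY_TEMPLATE = """PREFIX plx: <http://ld.panlex.org/plx/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?translation ?variety WHERE {
  ?source rdfs:label ?label .
  FILTER (STR(?label) = "%(text)s")
  ?d1 plx:denotationExpression ?source ;
      plx:denotationMeaning ?meaning .
  ?d2 plx:denotationMeaning ?meaning ;
      plx:denotationExpression ?target .
  ?target rdfs:label ?translation ;
          plx:languageVariety ?variety .
  FILTER (?target != ?source)
}
LIMIT %(limit)d
"""


def translation_query(text: str, limit: int = 50) -> str:
    """Return a SPARQL query for expressions sharing a meaning with `text`."""
    # Escape backslashes and double quotes so the text is a valid literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE % {"text": escaped, "limit": int(limit)}


query = translation_query("fall in love", limit=10)
```

The resulting string could be submitted to a SPARQL endpoint with any HTTP client; restricting ?variety to a target language variety would turn the lookup into a bilingual one.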
Related work
PanLex is a project whose editors integrate information discovered from many lexical resources. The extraction of information from linguistic sources, and techniques for automatically inferring translations, are relevant work discussed in [6]. An important related initiative is the Global Wordnet Association (GWA, http://www.globalwordnet.org/), which offers a platform for sharing wordnets and, among other goals, uniformly representing wordnets of different languages and establishing a universal index of meaning. Wordnets are usually focused on the definition of synsets and relations between them in a single language; GWA is helping to transform these single-language synonym silos into a virtual multilingual translation resource. PanLex is approximating that, in a different way, by integrating data from numerous wordnets along with translingual sources into a single graph. In the Semantic Web context, several standard or quasi-standard vocabularies and ontologies have been developed with the rise of the Linguistic LOD movement. Examples include the Ontologies of Linguistic Annotation (OLiA) [1] for modeling lexicon and machine-readable dictionaries, POWLA for modeling linguistic corpora [2] and the Natural Language Processing Interchange Format (NIF, http://nlp2rdf.org/nif-1-0).
Conclusions and future work
In this dataset description we detailed the PanLex database and its conversion to RDF. Based on our URI and vocabulary design, we created appropriate view definitions for the Sparqlify system, which carries out the actual RDF transformation. Furthermore, we interlinked the languages in PanLex with Lexvo, and created about 2.5 million links to DBpedia for expressions in 16 languages. With the integration of lemon and GOLD we also support data access via external linguistic ontologies.
We intend to address some limitations in the future. The relations among PanLex’s information sources, if treated as distinct datasets, could be modeled with the VoID vocabulary (http://rdfs.org/ns/void). The source entity should be refactored to reference users and information sources as distinct entities. Metadata attached to PanLex denotations are currently limited to arbitrary pairs of strings, but this sacrifices discovery possibilities when the metadata describe facts that can again be expressed with PanLex expressions. Finally, new collaborations between PanLex and related fields (e.g. language identification, language geolocation, lemmatization, transliteration, and localization) are promising areas for development.
References
1. C. Chiarcos, Grounding an ontology of linguistic annotations in the data category registry, in: Language Resource and Language Technology Standards – State of the Art, Emerging Needs, and Future Developments, Valletta, Malta, 2010, pp. 37–40.
2. C. Chiarcos, POWLA: Modeling linguistic corpora in OWL/DL, in: The Semantic Web: Research and Applications – Proc. of 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27–31, 2012, E. Simperl, P. Cimiano, A. Polleres, Ó. Corcho and V. Presutti, eds, Lecture Notes in Computer Science, Vol. 7295, Springer, 2012, pp. 225–239.
3. G. de Melo and G. Weikum, Language as a foundation of the Semantic Web, in: Proc. of the Poster and Demonstration Session at the 7th International Semantic Web Conference (ISWC 2008), C. Bizer and A. Joshi, eds, CEUR WS, Vol. 401, Karlsruhe, Germany, 2008, CEUR.
4. S. Farrar and D.T. Langendoen, A linguistic ontology for the semantic web, GLOT International 7(3) (2003), 97–100.
5. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer and C. Bizer, DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web (2014).
6. Mausam, S. Soderland, O. Etzioni, D.S. Weld, K. Reiter, M. Skinner, M. Sammer and J. Bilmes, Panlingual lexical translation via probabilistic inference, Artificial Intelligence 174(9–10) (2010), 619–637.
7. J. McCrae, D. Spohr and P. Cimiano, Linking lexical resources and ontologies on the semantic web with lemon, in: The Semantic Web: Research and Applications – Proc. of 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29 – June 2, 2011, G. Antoniou, M. Grobelnik, E.P.B. Simperl, B. Parsia, D. Plexousakis, P.D. Leenheer and J.Z. Pan, eds, LNCS, Vol. 6643, Springer, 2011, pp. 245–259.
8. P.N. Mendes, M. Jakob, A. Garcia-Silva and C. Bizer, DBpedia Spotlight: Shedding light on the web of documents, in: Proc. of the 7th International Conference on Semantic Systems (I-Semantics), C. Ghidini, A.-C.N. Ngomo, S.N. Lindstaedt and T. Pellegrini, eds, ACM International Conference Proceeding Series, ACM, 2011, pp. 1–8.
9. D. Nettle et al., Vanishing Voices: The Extinction of the World’s Languages, Oxford University Press, 2000.
10. G.F. Simons and M.P. Lewis, The world’s languages in crisis: A 20-year update, in: Responses to Language Endangerment: In Honor of Mickey Noonan, Studies in Language Companion Series, John Benjamins, 2013, pp. 3–20.