Countering language attrition with PanLex and the Web of Data

Abstract

The world is losing some of its 7,000 languages. Hypothesizing that language attrition might subside if all languages were intertranslatable, the PanLex project supports panlingual lexical translation by integrating all known lexical translations. Semantic Web technologies can flexibly represent and reason with the content of its database and interlink it with linguistic and other resources and annotations. Conversely, PanLex, with its collection of translation links between more than a billion pairs of lexemes from more than 9,000 language varieties, can improve the coverage of the Linguistic Web of Data. We detail how we transformed the content of the PanLex database to RDF, established conformance with the lemon and GOLD data models, interlinked it with Lexvo and DBpedia, and published it as Linked Data and via SPARQL.

Keywords

Multilingual, Linked Open Data, LLOD cloud, PanLex, lexical resource, RDF, RDB2RDF, SPARQL, Sparqlify

1 Introduction

There are about 7,000 living languages,¹ but language attrition has extinguished or threatened from 10% to over 75% of all languages in the last 60 years in various regions [10]. This attrition arguably imperils human biological knowledge and species diversity [9]. Hypothetically, panlingual intertranslatability would make all languages more useful and incentivize their preservation and revitalization.

¹ http://www-01.sil.org/iso639-3/iso-639-3.tab

The PanLex project is making all languages’ lexicons intertranslatable. In some contexts (e.g., profiles, catalogs, tags, search, and web navigation), lexical translation can be most of the translation load. PanLex systematically integrates lexical translations, found in diverse sources, into a database for research, applications, and public use. The content can be interpreted as a graph linking lexemes in “is-a-translation-of” relations, and permitting automated inference to additional, unattested relations.
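The graph interpretation can be illustrated with a minimal Python sketch. The expressions, the variety identifiers, and the one-pivot inference rule below are illustrative assumptions, not PanLex's actual inference method: an unattested translation is proposed when two expressions share an attested translation into a third.

```python
from collections import defaultdict

# Hypothetical attested translation pairs; each node is (language variety, text).
ATTESTED = [
    (("deu-000", "Haus"), ("eng-000", "house")),
    (("eng-000", "house"), ("fra-000", "maison")),
    (("eng-000", "house"), ("spa-000", "casa")),
]


def build_graph(pairs):
    """Build an undirected graph of "is-a-translation-of" edges."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def infer_translations(graph, source, target_variety):
    """Map each unattested candidate in target_variety to the number of
    distinct pivot expressions that connect it to source."""
    direct = set(graph.get(source, ()))
    support = defaultdict(int)
    for pivot in direct:
        for candidate in graph[pivot]:
            if candidate == source or candidate in direct:
                continue  # skip the source itself and already attested links
            if candidate[0] == target_variety:
                support[candidate] += 1
    return dict(support)


graph = build_graph(ATTESTED)
candidates = infer_translations(graph, ("deu-000", "Haus"), "fra-000")
print(candidates)
```

The pivot count is only a crude confidence signal; a candidate supported by several independent pivots is less likely to result from a polysemous intermediate expression.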

The Semantic Web initiative has led to the development of standards and technologies supporting a machine-readable and -interpretable Linked Data network, known as the Web of Data.² From these efforts the Linked Open Data (LOD) cloud³ emerged. A growing community is leveraging Semantic Web technologies for linguistic knowledge, building a Linguistic LOD (LLOD) cloud.

² http://www.w3.org/standards/semanticweb/

³ http://lod-cloud.net/

Here we describe how we connected PanLex to this Linked Data network. In Section 2 we introduce the dataset, present a PanLex RDF vocabulary, and explain how we transformed the one into the other and established conformance with additional data models. Section 3 shows how we linked to other datasets of the LLOD cloud, and Section 4 is about the publication of the dataset. Usage scenarios are given in Section 5, and related work is discussed in Section 6. Finally, Section 7 concludes this paper.

2 Triplification of the raw data

In this section, we analyze the PanLex dataset, introduce our URI and vocabulary design, which resemble PanLex’s conceptual model, summarize how we classified PanLex’s instance data with additional data models and explain our transformation of the data to RDF.

2.1 Analysis of the original dataset

The PanLex database is created by editors who consult information sources,⁴ such as mono- and multilingual dictionaries, glossaries, standards, and thesauri. The data include single- and multi-word expressions, corresponding meanings assigned to them, and related information. PanLex data constitute editors’ interpretations of sources’ assertions that two or more expressions share a meaning.⁵

⁴ http://panlex.org/tech/plrefs.shtml

⁵ http://panlex.org/tech/doc/design/panlex-db-design.pdf

The most important entities and relations of PanLex’s conceptual model are depicted in Fig. 1.

The source entity is the authority to which an editor attributes assertions about lexical translations.

Expressions are lexical entities, each uniquely identified with a text, i.e. a string of (Unicode) characters, and a variety of a language. Expressions resemble lemmas or dictionary-entry headwords, but differ from them in at least two ways. (1) Homographs, such as the verb “hide” (conceal) and the noun “hide” (animal skin) in English, are treated as a single expression in PanLex. (2) Multiword expressions, such as “fall in love”, traditionally found in an entry headed by one of their words, such as “fall” or “love”, are treated as independent expressions in PanLex.

Languages in PanLex are identified using ISO 639-3⁶ individual and macrolanguage codes, ISO 639-2⁷ collective codes, and ISO 639-5⁸ codes.

⁶ http://www.sil.org/iso639-3/codes.asp

⁷ http://www.loc.gov/standards/iso639-2/

⁸ http://www.loc.gov/standards/iso639-5/

Language varieties are collections of expressions. Each has a unique identifier: a language code and a distinguishing integer. For example, six dialects of Ahtna are identified as “aht-000” through “aht-005”. These labels are, themselves, treated as a (controlled) language variety, whose expressions (i.e. the labels) are translated into natural languages and other controlled languages (such as the IETF standard BCP 47).⁹

⁹ http://tools.ietf.org/html/bcp47
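As a small illustration of the identifier scheme, the following Python sketch splits a variety identifier into its language code and distinguishing integer. The exact shape (three lowercase letters, a hyphen, three digits) is an assumption generalized from the “aht-000” example, and the names `VarietyId`, `parse_variety_id`, and `format_variety_id` are hypothetical.

```python
import re
from typing import NamedTuple


class VarietyId(NamedTuple):
    language_code: str  # e.g. an ISO 639 code such as "aht"
    variety_index: int  # the distinguishing integer


# Assumed shape, generalized from the "aht-000" example in the text.
_VARIETY_ID = re.compile(r"([a-z]{3})-(\d{3})")


def parse_variety_id(uid: str) -> VarietyId:
    """Split an identifier such as "aht-005" into code and integer."""
    match = _VARIETY_ID.fullmatch(uid)
    if match is None:
        raise ValueError(f"not a variety identifier: {uid!r}")
    return VarietyId(match.group(1), int(match.group(2)))


def format_variety_id(variety: VarietyId) -> str:
    """Inverse of parse_variety_id, zero-padding the integer to three digits."""
    return f"{variety.language_code}-{variety.variety_index:03d}"


ahtna_dialects = [format_variety_id(VarietyId("aht", i)) for i in range(6)]
print(ahtna_dialects)
```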

Meanings are entities assigned to expressions, thereby identifying expressions as translations or synonyms. For example, a source’s translation of the German expression “klingen” into English as “ring, sound, seem” can be interpreted as (1) the assignment of a single meaning to all four expressions or (2) the assignment of two or three meanings to “klingen” and of one of those to each of the English expressions. Meanings are source-specific. The identification and consolidation of equivalent meanings of distinct sources is a research topic, not a database feature. Meanings can have properties of three types. (1) Definitions are descriptions of a meaning, consisting of text strings annotated as being in particular language varieties. (2) Domains are expressions (e.g., “medicine”) that characterize a meaning, but do not express it. (3) Meaning identifiers are strings acting as references to identifiers in a source.

Denotations are assignments of meanings to expressions. A denotation may have one or more word classes (a closed set based on OLIF, the Open Lexicon Interchange Format) and/or metadata (arbitrary strings paired as keys and values).
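The relationship between expressions, meanings, and denotations can be sketched in Python as follows. The class and function names are hypothetical and the variety identifiers are placeholders; the sketch only shows how the two interpretations of the “klingen” example lead to different synonym sets while yielding the same translations of “klingen” itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    variety: str  # language variety identifier (placeholder values below)
    text: str     # together with the variety, this identifies the expression


@dataclass(frozen=True)
class Denotation:
    meaning_id: int
    expression: Expression


def translations(denotations, expression):
    """Return all expressions sharing at least one meaning with `expression`."""
    members_by_meaning = defaultdict(set)
    for d in denotations:
        members_by_meaning[d.meaning_id].add(d.expression)
    found = set()
    for members in members_by_meaning.values():
        if expression in members:
            found |= members - {expression}
    return found


klingen = Expression("deu-000", "klingen")
ring, sound, seem = (Expression("eng-000", t) for t in ("ring", "sound", "seem"))

# Interpretation (1): one meaning shared by all four expressions.
one_meaning = [Denotation(1, e) for e in (klingen, ring, sound, seem)]

# Interpretation (2): three meanings, each pairing "klingen" with one English expression.
three_meanings = [
    Denotation(m, e)
    for m, english in enumerate((ring, sound, seem), start=1)
    for e in (klingen, english)
]

print(sorted(e.text for e in translations(one_meaning, ring)))
print(sorted(e.text for e in translations(three_meanings, ring)))
```

Under interpretation (1) the English expressions come out as synonyms of each other; under interpretation (2) they are related only through “klingen”.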

Users may define sources, attribute data to them, and define language varieties.

Among the properties of sources are licenses. Some of the license categories are public domain, request (author invites inquiries), GNU Free Documentation License (FDL), and PanLex Use Permission (specific permission for use in PanLex). The distribution of licenses is shown in Table 1.

Table 1

Number of sources using a certain license

License	Count	License	Count	License	Count
copyright	1343	LGPL	9	PD	149
CC	387	MIT	32	other	106
FDL	24	PanLex	7	unknown	1958
GPL	172	request	5

Fig. 1.

The PanLex database schema.

Table 2 gives entity counts as of January 2014.

Table 2

Number of instances of main entities in the PanLex database

Entity	Instances
Denotations	54,278,860
Meanings	20,773,371
Expressions	19,790,453
Definitions	2,747,892
Language Varieties identified	9,310
Language Varieties with data	9,239
Languages	7,843
Sources being consulted	4,190
Sources already consulted	1,453
Users	23

2.2 The PanLex vocabulary

The entities and relations described above are the base for the PanLex RDF vocabulary. In general, all PanLex RDF resources reside in the namespace http://ld.panlex.org/plx/, abbreviated with plx. An example of the resulting ontology is depicted in Fig. 2 and summarized as follows. Unless otherwise noted, the URIs of instances of PanLex classes follow the pattern plx:{className}/{id}, where {className} is spelled in lower camel case and the {id} is the primary key of the corresponding database table.
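The URI pattern can be sketched in a few lines of Python. The function names and the sample primary key are hypothetical; only the namespace and the plx:{className}/{id} pattern come from the description above.

```python
PLX = "http://ld.panlex.org/plx/"  # namespace abbreviated as plx


def lower_camel_case(class_name: str) -> str:
    """Lower-case the first character, e.g. "LanguageVariety" -> "languageVariety"."""
    return class_name[:1].lower() + class_name[1:]


def instance_uri(class_name: str, record_id: int) -> str:
    """Build an instance URI following the pattern plx:{className}/{id}."""
    return f"{PLX}{lower_camel_case(class_name)}/{record_id}"


# 187 is a made-up primary key used only for illustration.
print(instance_uri("LanguageVariety", 187))
```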

Expressions are modeled as instances of the class plx:Expression. Their original and degraded textual representations become the values of the properties rdfs:label and plx:degradedText, respectively. Their corresponding language variety is stated using plx:languageVariety.

For languages and language varieties, the classes plx:Language and plx:LanguageVariety are introduced. ISO 639-1 and ISO 639-3 codes become instances of the classes plx:Iso639-1Code and plx:Iso639-3Code.

The RDF analog of the PanLex meaning is the plx:Meaning. Entities of this class may have an identifier assigned with the plx:identifier property pointing to an xsd:string literal. Meanings may also have definitions, entities of the plx:Definition class, giving a textual representation (rdfs:label) in a certain language variety (plx:languageVariety).

Meanings and expressions are linked via denotations. These are entities of the plx:Denotation class pointing to meanings and expressions via the properties plx:denotationMeaning and plx:denotationExpression. Denotations may also have a word class assigned to them. This can be achieved with the denotation’s plx:wordClass property pointing to a plx:WordClass entity.
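To make the denotation pattern concrete, the following Python sketch serializes one denotation as N-Triples. The numeric identifiers are invented, and the full property URIs are an assumption: they are formed by appending the property names to the plx namespace, which the text implies but does not spell out.

```python
PLX = "http://ld.panlex.org/plx/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def denotation_triples(denotation_id: int, meaning_id: int, expression_id: int):
    """Return (subject, predicate, object) URI triples for one denotation."""
    denotation = f"{PLX}denotation/{denotation_id}"
    return [
        (denotation, RDF_TYPE, f"{PLX}Denotation"),
        (denotation, f"{PLX}denotationMeaning", f"{PLX}meaning/{meaning_id}"),
        (denotation, f"{PLX}denotationExpression", f"{PLX}expression/{expression_id}"),
    ]


def to_ntriples(triples) -> str:
    """Serialize URI-only triples in N-Triples syntax, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# The identifiers 1, 2 and 3 are placeholders, not real database keys.
nt = to_ntriples(denotation_triples(1, 2, 3))
print(nt)
```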

All sources share the plx:Source class. The characteristics of a source are described mainly by triples with literal objects, for example dc:title for the title of a source and dc:creator for an xsd:string containing the author’s name. At present, we support the license categories recognized in the database by creating resources of the plx:License class.

Fig. 2.

Example of the PanLex RDF vocabulary showing one meaning of the expression ‘between’ and the corresponding source and definition.

Table 3

Classes and properties used in the PanLex RDF vocabulary. Note that all rdf:type properties are omitted for brevity

Class	Properties
plx:Source	plx:registrationDate, rdfs:label, dc:title, dc:creator, plx:license, dc:date, plx:quality, foaf:homepage, dc:publisher, dbpedia-owl:isbn
plx:Language	plx:iso639-3Code, plx:iso639-1Code
plx:LanguageVariety	plx:languageVarietyOf, rdfs:label
plx:Iso639-1Code
plx:Iso639-3Code
plx:Expression	plx:languageVariety, plx:degradedText, rdfs:label
plx:Meaning	plx:approver, plx:identifier, plx:meaningDefinition
plx:Definition	plx:languageVariety, rdfs:label
plx:Denotation	plx:denotationMeaning, plx:denotationExpression, plx:wordClass
plx:WordClass	rdfs:label
plx:License	rdfs:label

2.3 Vocabulary reuse

The PanLex vocabulary is based on PanLex’s conceptual schema and enables all of PanLex’s data to be directly exposed as RDF. We also re-use existing vocabularies, namely the Lexicon Model for Ontologies (lemon) [7] and the General Ontology for Linguistic Description (GOLD) [4]. Since these models differ from the PanLex one, we follow an incremental approach to aligning the PanLex data with them. Table 4 shows PanLex classes with their current counterparts in lemon and GOLD. The parts implemented in our RDF conversion are displayed in Fig. 3.

Fig. 3.

Parts of the GOLD (left) and lemon model (right) re-used in PanLex (URI prefixes are omitted for brevity).

Table 4

Classes considered to be similar across the re-used vocabulary

PanLex	lemon	GOLD
plx:Denotation	–	gold:LinguisticSign
plx:Meaning	lemon:LexicalSense	gold:SemanticUnit
plx:Expression	lemon:LexicalEntry	gold:FormUnit

2.4 RDF transformation workflow

Since new sources are added to the PanLex database on almost a daily basis and because of its current size (~18 GB), repeatedly converting the whole database to capture changes is impractical. As the PanLex data already reside in a relational database, the use of a virtual RDB2RDF¹⁰ mapping solution is a natural choice. The Sparqlify system¹¹ offers an efficient query-rewriting engine and an easy-to-use mapping language, the Sparqlification Mapping Language (SML). Essentially, these mappings consist of three clauses: The From clause specifies the logical SQL table (i.e. table, view, or query) to be used in the SML view. The With clause binds a set of SPARQL variables to expressions that yield RDF terms from relational columns. Finally, the Construct clause holds a set of triple patterns. Fig. 4 shows an example of an SML view for the languages in PanLex: From each row of the table i1, three resources are created based on the iso3 column and bound to the variable names ?lang, ?iso3 and ?lexvo3. Resources bound to ?lang are typed as Language in both the PanLex and the schema.org namespaces. This view-based approach makes it easy to perform future revisions of the RDF mapping, such as adding support for new vocabularies.

¹⁰ http://www.w3.org/2001/sw/wiki/RDB2RDF

¹¹ https://github.com/AKSW/Sparqlify

Fig. 4.

An excerpt of an SML view definition for PanLex’s languages. This example also demonstrates how “is-a” relations to schema.org and links to Lexvo are established.
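The effect of such a view can be approximated in Python, which is not SML syntax but mirrors the three clauses: the input rows play the role of the From clause, the variable bindings the With clause, and the yielded triples the Construct clause. Everything beyond the variable names ?lang, ?iso3 and ?lexvo3 is an assumption: the concrete PanLex URIs built from the iso3 column, the Lexvo URI pattern, and owl:sameAs as the linking predicate are chosen only for illustration.

```python
PLX = "http://ld.panlex.org/plx/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # assumed linking predicate


def language_view(rows):
    """Yield triples for each row of a table with an `iso3` column."""
    for row in rows:                                        # "From": the logical table
        code = row["iso3"]
        lang = f"{PLX}language/{code}"                      # "With": ?lang (assumed URI)
        iso3 = f"{PLX}iso639-3Code/{code}"                  # "With": ?iso3 (assumed URI)
        lexvo3 = f"http://lexvo.org/id/iso639-3/{code}"     # "With": ?lexvo3 (assumed URI)
        # "Construct": triple patterns instantiated with the bound terms
        yield (lang, RDF_TYPE, f"{PLX}Language")
        yield (lang, RDF_TYPE, "http://schema.org/Language")
        yield (lang, f"{PLX}iso639-3Code", iso3)
        yield (lang, OWL_SAME_AS, lexvo3)


triples = list(language_view([{"iso3": "aht"}]))
for triple in triples:
    print(triple)
```

In the virtual RDB2RDF setting no such triples are materialized up front; the mapping is instead used to rewrite incoming SPARQL queries to SQL.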

3 Linking

The SML view in Fig. 4 establishes the interlinking of languages in PanLex with Lexvo [3]. Here we outline the interlinking with DBpedia [5], where we were interested in creating valid and dereferenceable links. Therefore, we iterated over the titles datasets,¹² which map (non-localized) DBpedia URIs to their page titles in the respective language. For each language version we normalized the labels by applying Unicode NFKD¹³ normalization and removing punctuation characters. Each DBpedia resource was then mapped to the PanLex expression that was equal to the resource’s normalized label in the respective language. Table 5 summarizes the number of links obtained.

¹² http://wiki.dbpedia.org/Downloads38

¹³ http://unicode.org/reports/tr15/
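The normalization and matching step can be sketched in Python with the standard unicodedata module. Several details are assumptions: punctuation is taken to mean all Unicode categories starting with "P", both the DBpedia label and the expression text are normalized before comparison, and the sample data are invented; the actual interlinking code may differ on each point.

```python
import unicodedata


def normalize_label(label: str) -> str:
    """Apply NFKD normalization, then drop punctuation characters."""
    decomposed = unicodedata.normalize("NFKD", label)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("P")  # assumed punctuation test
    )


def link_expressions(dbpedia_titles: dict, expressions: dict) -> list:
    """Link expression ids to DBpedia URIs whose normalized labels are equal.

    dbpedia_titles maps DBpedia URI -> page title in one language;
    expressions maps expression id -> text in the same language.
    """
    ids_by_label = {}
    for expr_id, text in expressions.items():
        ids_by_label.setdefault(normalize_label(text), []).append(expr_id)
    links = []
    for uri, title in dbpedia_titles.items():
        for expr_id in ids_by_label.get(normalize_label(title), []):
            links.append((expr_id, uri))
    return links


# Invented sample data for one language.
titles = {"http://dbpedia.org/resource/Berlin": "Berlin"}
texts = {101: "Berlin", 102: "Paris"}
print(normalize_label("St. John's"))
print(link_expressions(titles, texts))
```

Exact equality after normalization keeps the links precise, which is consistent with the stated goal of valid links, at the cost of missing multi-word expressions whose titles are phrased differently.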

In total, about 2.5 million links were obtained for approx. 20 million expressions. This relatively low coverage can be attributed to the many multi-word expressions that do not match DBpedia titles well, and to the fact that we have so far considered only DBpedia datasets for mainstream languages, whereas PanLex focuses on low-density ones.

Table 5

Number of DBpedia links per language

Language	Links	Language	Links
English	1,415,241	Catalan	27,779
German	224,146	Korean	24,912
French	187,364	Turkish	22,258
Italian	147,485	Bulgarian	19,431
Spanish	117,056	Hungarian	18,203
Portuguese	112,266	Slovene	11,981
Polish	110,974	Greek	1,112
Russian	68,040
Czech	28,767	Total	2,537,015

4 Publishing

With our RDF conversion work, we complement existing APIs¹⁴ with Linked Data, powered by Pubby,¹⁵ and two SPARQL endpoints,¹⁶,¹⁷ run by Sparqlify and Virtuoso. An overview is shown in Fig. 5. The SPARQL browser SNORQL¹⁸ can be accessed by replacing sparql with snorql in the respective links. Our SML views and the interlinking code are hosted on GitHub.¹⁹ The created linksets are hosted in the PanLex database and are published together with the other data using Sparqlify. Finally, we offer downloads tagged with timestamps of their creation.²⁰

¹⁴ http://panlex.org/try/

¹⁵ http://wifo5-03.informatik.uni-mannheim.de/pubby/

¹⁶ http://ld.panlex.org/vsparql

¹⁷ http://ld.panlex.org/sparql

¹⁸ https://github.com/kurtjx/SNORQL

¹⁹ https://github.com/AKSW/PanLex-2-RDF

²⁰ http://ld.panlex.org/downloads/releases/

Fig. 5.

PanLex architecture.

5 Dataset benefits and usage scenarios

There are general benefits of using Semantic Web technologies, such as the potential for simplified data integration due to RDF and vocabulary reuse, the possibility of enriching data based on interlinking, drawing advantage from reasoning, and the exploration of the data through generic Semantic Web tools. Moreover, some applications, like the TeraDict translation lookup service,²¹ can now be realized using SPARQL queries and so easily integrated into other applications. Due to space considerations, we refer the reader to the PanLex Linked Data landing page,²² where a collection of SPARQL queries is maintained. Also, since PanLex covers a niche by providing linguistic data for non-mainstream languages, investigating its fitness for use in cross-language information retrieval, as well as in annotation projects like DBpedia Spotlight [8], seems worthwhile.

²¹ http://panlex.org/teradict/?lg=eng

²² http://ld.panlex.org
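A translation lookup of this kind might be expressed as a SPARQL query over the vocabulary of Section 2.2. The Python helper below only assembles the query text; it is a sketch, not one of the maintained queries. The function name, the example variety URIs, and the choice to compare labels via STR() (so that plain and language-tagged literals both match) are assumptions.

```python
def translation_query(text: str, source_variety_uri: str, target_variety_uri: str) -> str:
    """Build a SPARQL query for translations of `text` between two language varieties."""
    # Escape backslashes and double quotes for use inside a SPARQL string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX plx: <http://ld.panlex.org/plx/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?translation WHERE {{
  ?source rdfs:label ?sourceLabel ;
          plx:languageVariety <{source_variety_uri}> .
  FILTER (STR(?sourceLabel) = "{escaped}")
  ?d1 plx:denotationExpression ?source ;
      plx:denotationMeaning ?meaning .
  ?d2 plx:denotationExpression ?target ;
      plx:denotationMeaning ?meaning .
  ?target plx:languageVariety <{target_variety_uri}> ;
          rdfs:label ?translation .
}}"""


# The variety URIs below are placeholders; real ones use database primary keys.
query = translation_query(
    "klingen",
    "http://ld.panlex.org/plx/languageVariety/1",
    "http://ld.panlex.org/plx/languageVariety/2",
)
print(query)
```

The two denotation patterns share ?meaning, which encodes the rule that expressions assigned the same meaning are translations or synonyms of each other. The resulting string could then be submitted to either SPARQL endpoint with any HTTP client.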

6 Related work

PanLex is a project whose editors integrate information discovered from many lexical resources. The extraction of information from linguistic sources, and techniques for automatically inferring translations, are relevant work discussed in [6]. An important related initiative is the Global Wordnet Association (GWA),²³ which offers a platform for sharing wordnets and, among other goals, uniformly representing wordnets of different languages and establishing a universal index of meaning. Wordnets are usually focused on the definition of synsets and relations between them in a single language; GWA is helping to transform these single-language synonym silos into a virtual multilingual translation resource. PanLex is approximating that, in a different way, by integrating data from numerous wordnets along with translingual sources into a single graph. In the Semantic Web context, several standard or quasi-standard vocabularies and ontologies have been developed with the rise of the Linguistic LOD movement. Examples include the Ontologies of Linguistic Annotation (OLiA) [1] for modeling lexicon and machine-readable dictionaries, POWLA for modeling linguistic corpora [2] and the Natural Language Processing Interchange Format (NIF).²⁴

²³ http://www.globalwordnet.org/

²⁴ http://nlp2rdf.org/nif-1-0

7 Conclusions and future work

In this dataset description we detailed the PanLex database and its conversion to RDF. Based on our URI and vocabulary design, we created appropriate view definitions for the Sparqlify system, which carries out the actual RDF transformation. Furthermore, we interlinked the languages in PanLex with Lexvo, and created about 2.5 million links to DBpedia for expressions in 16 languages. With the integration of lemon and GOLD we also support data access via external linguistic ontologies.

We intend to address some limitations in the future: The relations among PanLex’s information sources, if treated as distinct datasets, could be modeled with the VoID vocabulary.²⁵ The source entity should be refactored to reference users and information sources as distinct entities. Metadata attached to PanLex denotations are currently limited to arbitrary pairs of strings, but this sacrifices discovery possibilities when the metadata describe facts that can again be expressed with PanLex expressions. Finally, new collaborations between PanLex and related fields (e.g., language identification, language geolocation, lemmatization, transliteration, and localization) are promising areas for development.

²⁵ http://rdfs.org/ns/void

References

[1] Chiarcos, Grounding an ontology of linguistic annotations in the data category registry, in: Language Resource and Language Technology Standards – State of the Art, Emerging Needs, and Future Developments, Valletta, Malta, 2010, pp. 37–40.

[2] Chiarcos, POWLA: Modeling linguistic corpora in OWL/DL, in: The Semantic Web: Research and Applications – Proc. of 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27–31, 2012, Simperl, Cimiano, Polleres, Ó. Corcho and Presutti, eds, Lecture Notes in Computer Science, Vol. 7295, Springer, 2012, pp. 225–239.

[3] de Melo and Weikum, Language as a foundation of the Semantic Web, in: Proc. of the Poster and Demonstration Session at the 7th International Semantic Web Conference (ISWC 2008), Bizer and Joshi, eds, CEUR WS, Vol. 401, Karlsruhe, Germany, 2008, CEUR.

[4] Farrar and D.T. Langendoen, A linguistic ontology for the semantic web, GLOT International 7(3) (2003), 97–100.

[5] Lehmann, Isele, Jakob, Jentzsch, Kontokostas, P.N. Mendes, Hellmann, Morsey, van Kleef, Auer and Bizer, DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web (2014).

[6] Mausam, Soderland, Etzioni, D.S. Weld, Reiter, Skinner, Sammer and Bilmes, Panlingual lexical translation via probabilistic inference, Artificial Intelligence 174(9–10) (2010), 619–637.

[7] McCrae, Spohr and Cimiano, Linking lexical resources and ontologies on the semantic web with lemon, in: The Semantic Web: Research and Applications – Proc. of 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27–31, 2012, Antoniou, Grobelnik, E.P.B. Simperl, Parsia, Plexousakis, P.D. Leenheer and J.Z. Pan, eds, LNCS, Vol. 6643, Springer, 2011, pp. 245–259.

[8] P.N. Mendes, Jakob, Garcia-Silva and Bizer, DBpedia Spotlight: Shedding light on the web of documents, in: Proc. of the 7th International Conference on Semantic Systems (I-Semantics), Ghidini, A.-C.N. Ngomo, S.N. Lindstaedt and Pellegrini, eds, ACM International Conference Proceeding Series, ACM, 2011, pp. 1–8.

[9] Nettle et al., Vanishing Voices: The Extinction of the World’s Languages, Oxford University Press, 2000.

[10] G.F. Simons and M.P. Lewis, The world’s languages in crisis: A 20-year update, in: Responses to Language Endangerment: In Honor of Mickey Noonan, Studies in Language Companion Series, John Benjamins, 2013, pp. 3–20.
