Abstract
Bilingual electronic dictionaries contain collections of lexical entries in two languages, with explicitly declared translation relations between such entries. Nevertheless, they are typically developed in isolation, in their own formats, and are accessible only through proprietary APIs. In this paper we propose the use of Semantic Web techniques to make translations available on the Web so that other semantically enabled resources can consume them directly, using standard languages and query mechanisms. In particular, we describe the conversion of the Apertium family of bilingual dictionaries and lexicons into RDF (Resource Description Framework) and how their data have been made accessible on the Web as linked data. As a result, all the converted dictionaries (many of them covering under-resourced languages) are interconnected and can be easily traversed from one to another to obtain, for instance, translations between language pairs that are not directly connected in any of the original dictionaries.
Introduction
The publication of bilingual and multilingual language resources as linked data (LD) on the Web can greatly benefit the creation of the critical mass of cross-lingual connections required by the vision of the multilingual Web of Data [9]. The benefits of sharing linguistic information on the Web of Data have recently been recognized by the language resources community, which has shown increasing interest in publishing its linguistic data and metadata as LD on the Web [2]. As a result of interlinking multilingual and open language resources, the Linguistic Linked Open Data (LLOD) cloud is emerging,1 An updated picture of the current LLOD cloud can be found at
In this article we focus on electronic bilingual dictionaries as a particular type of language resource. Bilingual dictionaries are specialized dictionaries that describe translations of words or phrases from one language to another. They can be unidirectional or bidirectional; in the latter case they allow translation to and from both languages. A bilingual dictionary can contain additional information such as part of speech, gender, declension model and other grammatical properties.
Electronic bilingual dictionaries have typically been developed in isolation, in their own formats, and are accessible only through proprietary APIs. We propose the use of Semantic Web techniques to make translations available on the Web so that other semantically enabled resources can consume them directly, using standard languages and query mechanisms. The result of a principled conversion into RDF is that the LD dictionaries are interconnected [7] and can be easily traversed from one to another to obtain, for instance, translations between language pairs that are not directly connected in any of the original dictionaries. Other potential uses of bilingual dictionaries in LD are the enhancement of LD-based machine translation [10] and cross-lingual information access over LD [9].
In particular, we have converted the Apertium family of bilingual dictionaries [4] into RDF, making their data interconnected and accessible on the Web as LD. Thus, the main contributions of this work are:
We propose a method for converting bilingual dictionaries into RDF and publishing them as LD on the Web, which we have applied to the Apertium case.
As a result, we contribute to the cloud of LLOD with 22 new linguistic datasets, many of them covering under-resourced languages.
We also analyse the resulting Apertium RDF graph and exemplify how to traverse it to obtain direct and indirect translations.
We propose two different algorithms based on the RDF graph structure to compute the confidence degree of indirect translations.
The remainder of this paper is organised as follows. In Section 2 we briefly describe the Apertium source data. Section 3 discusses how lexica and language translations can be represented as LD. Section 4 describes the method we have followed to convert the Apertium dictionaries into RDF and to publish them as LD on the Web. In Section 5 we exemplify how to traverse the Apertium RDF graph to obtain translations, and how to compute a confidence degree for indirect translations. Section 6 analyses related work and, finally, conclusions can be found in Section 7.
Apertium is a free/open-source machine translation platform [4], initially aimed at related-language pairs, which currently includes up to 40 language pairs.2
The translation engine consists of a series of assembled modules that communicate using text streams. One of these modules is the lexical transfer module, which reads lexical forms of the source language and delivers the corresponding target-language lexical forms. The module uses a bilingual dictionary that contains an equivalent for each source lexical form. Apertium dictionaries are designed so that they can be compiled into letter transducers [6], which process input strings and produce output strings. Accordingly, dictionaries are made of entries consisting of string pairs that correspond to the inputs and outputs of the transducer. The Apertium dictionaries are described in XML. Notice that the Apertium initiative benefits
During the METANET4U Project,3
A complete list of available lexicons can be found at
Thus, following the LMF model, for each bilingual Apertium lexicon a new
As it is shown in the above example, every resulting
For representing lexical content in RDF, we adopt the LExicon Model for ONtologies (
See figure and a short description at
An entry in the lexicon (word, multiword expression or even affix) that is assumed to represent a single lexical unit with common properties (e.g., PoS) across all its forms and meanings.
A form represents a particular version of a lexical entry, for example a plural or some other inflected form.
The sense refers to the usage of a lexical entry with a specific meaning and can also be considered as a reification of the relation between a lexical entry and the ontological entity that characterizes its meaning in a certain context.
The reference is an entity in the ontology that defines the formal semantics associated with the lexical entry.
The As for instance the ones contained at
Additionally other broadly used vocabularies such as Dublin Core9

Fig. 1. Representation of the translation of “bench” from English to Spanish.
In Fig. 1 we illustrate a translation that is represented using
In principle, due to the limited lexical information contained in the source data, we are not using the chosen representation mechanism to its full potential. However, we considered it important to use the proposed modelling mechanism for two reasons: 1) enabling interoperability with other lexical datasets already represented in
The models presented in this section (
Some guidelines have been proposed to produce and publish high quality multilingual LD on the Web [17]. As a further step, the W3C Best Practices for Multilingual Linked Open Data (BPMLOD) community group13
http://www.w3.org/2015/09/bpmlod-reports/bilingual-dictionaries/
Every Apertium bilingual dictionary, which came originally in a single LMF file, was converted into three different objects in RDF, namely: source lexicon, target lexicon, and translation set. This division fits naturally in the
URIs design
Among the different patterns and recommendations for defining URIs we follow the one proposed by the ISA Action15
http://ec.europa.eu/isa/actions/01-trusted-information-exchange/1-1action_en.htm
In order to construct the URIs of lexical entries, senses, and other lexical elements, we have preserved the identifiers of the original LMF data whenever possible, propagating them into the RDF representation. Some minor changes have been introduced, though. For instance, in the original LMF data the identifier of the lexical entries ended with the particle “-l” or “-r” depending on their role as “source” or “target” in the translation (see Section 2). In our case, directionality is not preserved at the level of lexicon but in the
This activity deals with the transformation into RDF of the selected data sources using the chosen representation scheme and modelling patterns. There are a number of tools that can be used to assist the developer in this task, depending on the format of the data source. In our case, Open Refine16
As a result of the transformation, three RDF files were generated, one per component (lexicons and translation set). The code in Fig. 2 contains the RDF of the single translation illustrated in Fig. 1.

Fig. 2. Example of the RDF (in Turtle syntax) generated for the translation represented in Fig. 1.
All our scripts for the Apertium RDF generation from LMF have been recorded, stored, and made available online to enable their later analysis and reuse.17
The Apertium RDF data have been linked to two external datasets: LexInfo and BabelNet.18
Once the RDF data were generated, they were loaded in a Virtuoso19
A list of the Apertium RDF dictionaries available in Datahub can be found at http://datahub.io/dataset?q=apertium+rdf&organization=oeg-upm As it is described at
Finally, in order to improve the visibility and human access to the Apertium RDF dictionaries, a web portal was developed with pointers to the published individual dictionaries and with some query facilities.24
In summary, we have transformed all the Apertium dictionaries with available versions in LMF, resulting in a total of 22 Apertium RDF bilingual dictionaries. In Table 1 we show the list of datasets along with their size in terms of number of triples and number of translations. As a result of the generation and publication process, the 22 Apertium RDF dictionaries were included in the LLOD cloud.25 A picture of the LLOD cloud as of May 2015 can be found at
Table 1. Apertium RDF datasets with their size in number of triples and translations. The language codes in the table follow the ISO-639 standard.
As a result of the generation of the Apertium dictionaries as LD, a large unified graph of linked lexical entries, senses and translations was created and made accessible on the Web. The URIs of all these elements can be seen as the nodes of such a network. Every URI is dereferenceable, meaning that when it is accessed, a response is returned that describes, in RDF, the attributes of the element and its links to other elements. In this section we explore, by means of examples, how to get direct and indirect translations from the graph. We also describe some methods to calculate the confidence degree of the indirect ones.
Exploring the graph
As a result of publishing the bilingual dictionaries in a unified graph by following consistent naming rules, a
We omit the query here, for the sake of space, but it can be found at http://files.figshare.com/2201195/ApertiumRDF_ExampleQuery4.txt
Direct translations of “bank” from English to Spanish along with their part of speech (POS)
In addition to obtaining explicitly declared translations (as in the above query), it is possible to infer indirect translations by traversing the graph through pivot lexical entries. For instance, a direct translation between “bank” and its Portuguese equivalent cannot be obtained (with a query similar to the previous one), because no English–Portuguese (or Portuguese–English) Apertium bilingual dictionary exists yet. However, the graph can be traversed to reach indirect translations from English to Portuguese through an intermediate language (e.g., Spanish). This is illustrated in Fig. 3, which shows an oversimplified fragment of the graph that results from publishing both the EN-ES and ES-PT dictionaries as LD.

Fig. 3. Simplified representation of the path between some lexical entries in EN and PT (disconnected in the original Apertium dictionaries).
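The pivot traversal described above can be sketched in a few lines of Python. This is only an illustration over in-memory toy data, not the SPARQL query used with Apertium RDF: the word pairs in `EN_ES` and `ES_PT` and the helper names `build_index` and `indirect_translations` are assumptions introduced here.

```python
from collections import defaultdict

# Toy translation sets (assumed data, NOT extracted from Apertium RDF).
EN_ES = [("bank", "banco"), ("bank", "orilla"), ("bench", "banco")]
ES_PT = [("banco", "banco"), ("orilla", "margem")]


def build_index(pairs):
    """Map each source word to the set of its direct translations."""
    index = defaultdict(set)
    for source, target in pairs:
        index[source].add(target)
    return index


def indirect_translations(word, first_hop, second_hop):
    """Return {target: set of pivot words} reachable through one pivot language."""
    result = defaultdict(set)
    for pivot in first_hop.get(word, ()):
        for target in second_hop.get(pivot, ()):
            result[target].add(pivot)
    return dict(result)


en_es = build_index(EN_ES)
es_pt = build_index(ES_PT)

# Each Portuguese candidate is kept together with the Spanish pivots that led to it.
candidates = indirect_translations("bank", en_es, es_pt)
for target, pivots in sorted(candidates.items()):
    print(target, "via", sorted(pivots))
```

Keeping the pivot words alongside each candidate mirrors the idea of returning the pivot Spanish translations together with the indirect ones, which is also what a confidence measure would need as input.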
A relatively simple SPARQL query28
http://files.figshare.com/2133242/ApertiumRDF_ExampleQuery2.txt
When using one pivot language (or more) to construct a bilingual dictionary, it is necessary to filter out inappropriate equivalences between words caused by ambiguities in the pivot language, as Fig. 4 illustrates. In fact, when using EN as the intermediate language between ES and CA, some wrong translations can be inferred in addition to the correct ones, for instance "banco"@es → "riba"@ca.
Indirect translations of “bank” into Portuguese along with the pivot Spanish translations

Fig. 4. Example of translation candidates between ES and CA (EN as pivot language) in Apertium RDF. For simplicity, the translation sets are not represented. The dashed connectors show the direct ES-CA translations.
A method to identify such incorrect translations was proposed by Tanaka and Umemura [15] when constructing bilingual dictionaries intermediated by a third language. The method, called
Following the example in Fig. 4, let us suppose that we want to obtain translations for “banco” from Spanish into Catalan using English as pivot language. The application of the OTIC method results in the following scores for the candidate translations: score("banc"@ca) = 1.0 and score("riba"@ca) = 0.66. It can be seen that the correct translation ("banco"@es → "banc"@ca) is ranked higher in this example. A service to compute the OTIC values for indirect translations is available through the Apertium RDF user interface.29
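A minimal sketch of how such scores could be computed is given below. It assumes that the score of a candidate is twice the number of pivot words it shares with the source word, divided by the total size of both pivot sets (a Dice-style overlap); this formula is not stated in the text above and is taken here as an assumption. The English pivot sets in `ES_TO_EN` and `CA_TO_EN` are likewise assumed for illustration, chosen to be consistent with the scores reported for “banc” and “riba”, and the function names are hypothetical.

```python
def otic_score(source_pivots, candidate_pivots):
    """Dice-style overlap between two sets of pivot-language words (assumed formula)."""
    total = len(source_pivots) + len(candidate_pivots)
    if total == 0:
        return 0.0
    common = len(source_pivots & candidate_pivots)
    return 2 * common / total


# Assumed pivot (English) translations for each word; illustrative only.
ES_TO_EN = {"banco": {"bank", "bench"}}
CA_TO_EN = {"banc": {"bank", "bench"}, "riba": {"bank"}}


def rank_candidates(source, source_to_pivot, target_to_pivot):
    """Score every target word sharing at least one pivot with the source word."""
    source_pivots = source_to_pivot[source]
    scores = {}
    for candidate, pivots in target_to_pivot.items():
        if source_pivots & pivots:
            scores[candidate] = otic_score(source_pivots, pivots)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates("banco", ES_TO_EN, CA_TO_EN)
for candidate, score in ranking:
    print(f"{candidate}: {score:.2f}")
```

Under these assumptions "banc" shares both pivot words with "banco" and scores 1.0, while "riba" shares only one and scores 2/3, which the text reports as 0.66.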
As an alternative to the OTIC method, we have proposed another method based on the
More details can be found in [18], where some experimental results are shown. For instance, all the possible English to Spanish translations that can be obtained in Apertium RDF via indirect paths were evaluated with this method. Comparing the results with the original direct translations available in the Apertium EN-ES dictionary, the obtained precision was 99% and the recall was 55%. The same experiment using OTIC (with Catalan as pivot) led to 77% precision and 48% recall (with threshold = 0.5 in both experiments).
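The evaluation setting just described (inferred translations filtered by a confidence threshold and compared against the direct translations of an existing dictionary) can be illustrated as follows. This is a sketch under stated assumptions: the scored pairs and the gold pairs are toy values, the function name `precision_recall` is hypothetical, and treating a score equal to the threshold as accepted is a choice made here, not something the text specifies.

```python
def precision_recall(scored_pairs, gold_pairs, threshold=0.5):
    """Compare inferred translations (kept if score >= threshold) with a gold set."""
    accepted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = accepted & gold_pairs
    precision = len(true_positives) / len(accepted) if accepted else 0.0
    recall = len(true_positives) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Toy inferred translations with confidence scores (assumed values).
inferred = {
    ("bank", "banco"): 1.0,
    ("bank", "orilla"): 0.8,
    ("bench", "orilla"): 0.6,   # wrong, but above the threshold
    ("bench", "banco"): 0.3,    # correct, but below the threshold
}

# Toy gold standard standing in for the direct translations of a dictionary.
gold = {
    ("bank", "banco"),
    ("bank", "orilla"),
    ("bench", "banco"),
    ("shore", "orilla"),
}

precision, recall = precision_recall(inferred, gold, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold typically trades recall for precision, which is why the threshold value has to be reported together with both figures.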
There have been other remarkable efforts to convert and expose multilingual linguistic data as LD on the Web. For instance DBnary [14] extracts multilingual lexical data from Wiktionary data and provides it to the community as linked open data. It covers 21 languages currently and uses
In this paper we have also discussed how new translations can be inferred between initially disconnected languages by traversing the Apertium RDF graph. Notice that the original Apertium framework includes the apertium-dixtools31
There have been previous efforts to combine existing bilingual dictionaries to create new bilingual [15] or multilingual [11] ones. However, unlike these approaches, Apertium RDF has been developed by applying Semantic Web techniques, and its lexical information is available on the Web to be consumed directly by humans or by other semantically enabled resources, using standard languages and query mechanisms. Further, Apertium RDF is now part of the much larger LLOD cloud, thus enabling easier combination with data from other LD sources.
In this paper we have described the transformation of 22 Apertium bilingual dictionaries into RDF and their publication as LD on the Web. The proposed methodology is general enough to be applied to other bilingual dictionaries. We have also discussed how to compute the confidence degree of indirect translations in the Apertium RDF graph. In our view, the publication of Apertium RDF contributes to the critical mass of cross-lingual connections required by the multilingual Web of Data to be truly useful and operational.
Despite its novelty, Apertium RDF is attracting the attention of third parties interested in reusing it. For instance, the BabelNet team is currently exploring how to improve their information in cases where BabelNet does not provide translations that can nevertheless be found in Apertium, which is particularly interesting for certain minority languages. Further, the original Apertium initiative is focusing on the Apertium RDF graph as a way to enhance their own data creation processes. For instance, a topic for
As future work, we plan to enrich the Apertium RDF graph with new datasets as soon as new LMF versions of the existing (and future) Apertium dictionaries appear. Also, an in-depth analysis of the coverage and quality of the translations among all the possible language pairs in the graph deserves a separate study.
Acknowledgements
This work has been supported by the FP7 European project LIDER (610782), by the Spanish Ministry of Economy and Competitiveness through the “Juan de la Cierva” program and the 4V project (TIN2013-46238-C4-2-R), and by IULA-UPF-CC-CLARIN.
