Abstract
Limited accessibility to language resources and technologies represents a challenge for the analysis, preservation, and documentation of natural languages other than English. Linguistic Linked (Open) Data (LLOD) holds the promise to ease the creation, linking, and reuse of multilingual linguistic data across distributed and heterogeneous resources. However, individual language resources and technologies accommodate or target different linguistic description levels, e.g., morphology, syntax, phonology, and pragmatics. In this comprehensive survey, we present the state of the art of multilinguality and LLOD with a particular focus on linguistic description levels, identifying open challenges and gaps and proposing an ideal ecosystem for multilingual LLOD across description levels. This survey seeks to contribute an introductory text for newcomers to the field of multilingual LLOD, uncover gaps and challenges to be tackled by the LLOD community with respect to linguistic description levels, and present a solid basis for a future best practice of multilingual LLOD across description levels.
Introduction
Human languages are incredibly diverse, influencing the way communities interact with one another, with their own national institutions, and within the global economy. Many globally scattered groups and organizations capture data for one or several natural languages in the form of digital language resources. Such resources make it possible to document and preserve language use and development and are thus important cultural assets [56]. Under-resourced languages in particular benefit from the consolidation of existing data and facilitated interoperability with other existing resources. However, barriers to interoperability between language resources, e.g., legal, economic, informational, technical, and methodological challenges [28], render their interchange difficult. To address these challenges and promote linguistic diversity, it is crucial to consolidate existing language data and develop technologies that facilitate the integration of information from various multilingual resources.
High-quality digital language data and resources are vital to a variety of research areas, such as linguistics, the study of low-resource languages, and digital humanities. Such data are equally important for a number of downstream applications from Natural Language Processing (NLP) to learning structured knowledge from text. The creation, linking, and reuse of multilingual linguistic data are complex due to differences in theoretical underpinnings, representation formats, and annotation and metadata coverage. In particular, differences in linguistic description levels need to be considered, such as the morphological, syntactic, and lexical levels, among others (see Section 4). This consideration requires a technology that is sufficiently generic to be applied to all levels of linguistic description and capable of integrating information from different data providers, e.g., from national research infrastructures used for hosting their respective language resources.
With this objective in mind, Chiarcos et al. [43] introduced the notion of Linguistic Linked (Open) Data (LLOD). “Open” is in brackets since proprietary data can also be published as linked data. We use LLOD to refer to the technology and the use of open, community-maintained vocabularies, regardless of the licensing and availability of the resources this is applied to.
This article is a comprehensive survey of the state of the art in multilinguality and LLOD with a particular focus on support for different linguistic description levels, in order to identify open challenges and gaps. Overall, Bosque-Gil et al. [18] have recently argued that LLD has certainly made headway, but there are still challenges to respond to. More specifically, Bosque-Gil et al. [23] and more recently Khan et al. [135] present surveys on modeling linguistic data as LLOD, where the former identify phonetics and phonology as well as dialogue structures as still under-represented. In this more comprehensive and recent survey we can confirm these findings and additionally identify pragmatics as a level with rather low coverage to date. Bosque-Gil et al. [18] also discuss some of the challenges based on the studies presented in the special issue dedicated to LLOD, and, although some coincide with ours, our analysis is more thorough and comprehensive. To the best of our knowledge, this is the first systematic survey of existing research and practices concerning linguistic description levels in multilingual LLOD resources. Building on the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) [192] method for conducting and reporting systematic reviews, and drawing on a team of 16 experts in linguistics and LLOD, this article aims to:
provide guidance for researchers and practitioners on available approaches for supporting specific linguistic description levels in the LLOD;
identify open challenges and gaps in the support of linguistic description levels across multilingual LLOD resources; and to
present a solid basis for a future best practice on how to represent, model, and link different linguistic description levels across multilingual LLOD resources.
The article is structured as follows: Section 2 introduces the preliminaries of multilinguality and LLOD. Section 3 then describes the methodology and statistical results of the conducted survey. Sections 4 and 5 detail the findings of our survey, where the former focuses on models and the types of linguistic description levels covered, while the latter concerns types of language resources with their linguistic description levels and their use. Section 6 unites challenges identified in this survey with challenges that derive from the experience of the group of experts authoring this article. Finally, prior to concluding remarks, Section 7 proposes an ideal ecosystem for multilingual LLOD, covering general challenges that need to be addressed by the (L)LOD community as well as particular challenges that pertain to multilinguality and LLOD.
Specialised terminology from linguistics is used throughout this article. For further information about the terms used, the reader is referred to the Summer Institute of Linguistics (SIL) Glossary of Linguistic Terms.
The two concepts of linking and multilinguality are of fundamental importance because they relate strongly to the distribution of data according to the FAIR data principles, which are intended to improve Findability, Accessibility, Interoperability and Reusability [237].
In the context of web technologies, the most widely adopted solution to the issue of how to perform this linking is the application of the Resource Description Framework (RDF) [66] and Linked Data [12]. Cimiano et al. [57] present the semantics of the RDF model, which was created in the late 1990s to represent linked data and knowledge in a machine-readable manner, and its most common serialisation formats, N-Triples, Turtle, RDF/XML and JSON-LD, which enable publishing RDF data on the Web. The authors also give an overview of the Web Ontology Language (OWL) and SPARQL, the standard language for querying RDF data. With the development of commonly used vocabularies for language resources, especially for the lexical domain (OntoLex-Lemon [61,160]), the so-called LLOD cloud has been developed [43,58] as an aggregator of language resources available as LOD. Subsequently, great potential has been recognised in the use of this technology to establish interoperability between existing resources, especially in applications that have previously been tackled by means of graph technologies or feature structures, such as lexical data or linguistic annotation [44,157,166]. Also, the Simple Knowledge Organisation System (SKOS) standard for representing structured controlled vocabularies is widely used for the representation of multilingual LLOD [56,74], as is its extension for labels, SKOS-XL.
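To make the triple-based RDF data model concrete, the following is a minimal, stdlib-only Python sketch: a graph as a set of (subject, predicate, object) triples, with SKOS-style language-tagged labels. All URIs and labels are invented for illustration; a real deployment would use a library such as rdflib and the standard SKOS vocabulary.

```python
# A toy RDF-like graph: a set of (subject, predicate, object) triples.
# Objects here are (text, language-tag) pairs, mimicking language-tagged
# literals such as "gatto"@it. URIs and data are illustrative only.

SKOS = "http://www.w3.org/2004/02/skos/core#"

graph = {
    # A concept with language-tagged preferred labels, SKOS-style.
    ("ex:cat", SKOS + "prefLabel", ("cat", "en")),
    ("ex:cat", SKOS + "prefLabel", ("gatto", "it")),
    ("ex:cat", SKOS + "prefLabel", ("Katze", "de")),
}

def labels(graph, subject, lang=None):
    """Return preferred labels of a subject, optionally filtered by language tag."""
    return sorted(
        text
        for s, p, (text, tag) in graph
        if s == subject
        and p.endswith("prefLabel")
        and (lang is None or tag == lang)
    )

print(labels(graph, "ex:cat"))        # all labels, any language
print(labels(graph, "ex:cat", "it"))  # Italian labels only
```

The point of the sketch is that multilinguality falls out of the data model for free: adding a label in a new language is just one more triple, and language-specific views are simple filters over the graph.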
Multilinguality has always been a central aspect of LLOD development. Initially, most LOD resources adopted
With the increasing number of available multilingual language resources as LLOD, the question of adequate support not only for multiple languages but different description levels in individual resources becomes more and more pressing. Several approaches exist for tracking information about the same item across different data sources exploiting links, such as
As a result of these trends, we find ourselves today in a situation where the semantic layer is no longer the only bridge between languages. Linking language data across languages is, in principle, also possible via the linguistic layer either statically, through pre-computed cross-lingual links, or dynamically, by computing such links on the fly. Furthermore, because the computation of such cross-lingual links can exploit a wide range of linguistic resources available in the cloud, they can be sensitive to linguistic and cultural context and can exhibit a degree of finesse and nuance not realisable from a purely semantic perspective.
The full potential of this approach is yet to be determined, which is why we feel it is opportune to carry out a systematic survey that takes into account the complex interplay of progress between (i) the different levels of linguistic description that make up the layer of linguistic information present in the LLOD, (ii) the representations and models that are used to express these different levels, and (iii) the use cases in which these have been realised.
The notion of multilinguality is pervasive, and its meaning is generally taken for granted. However, close examination of the way the concept is used reveals a variety of accepted meanings. The things that are frequently cited as being “multilingual” fall broadly into three categories: (i) language resources, (ii) tools and services, and (iii) knowledge-based structures, i.e., ontologies, knowledge graphs, taxonomies and databases. A related notion of multilinguality that is claimed for many linguistic or lexical approaches is
This definition derives from the European Language Resources Association (ELRA), to be found at
Entities in the LLOD have the essential character of being
Multilingual resources
A resource is monolingual if its contents are linguistically relevant to one language. Thus, a corpus of Italian text or an Italian wordlist is monolingual because it contains words which belong to the Italian language. It follows that a resource is multilingual if it relates to two or more languages. A prototypical example would be a code-switching corpus, e.g., that of Li et al. [153], whose words derive from both English and Mandarin. A resource can also be multilingual if it is composed of several monolingual subparts belonging to different languages. This is consistent with Schmidt and Wörner [211], for whom a multilingual resource is “any systematic collection of empirical language data enabling linguists to carry out analyses of multilingual individuals, multilingual societies or multilingual communication”.
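The definition above can be stated operationally, as a toy sketch: a resource is monolingual if exactly one language is linguistically relevant to its contents, multilingual otherwise. The language codes are ISO 639-1 and the example resources are invented for illustration.

```python
# Toy operationalisation of resource linguality: classify a resource
# by the set of ISO 639-1 codes of the languages its contents belong to.

def linguality(language_codes):
    """Classify a resource given the language codes of its contents."""
    langs = set(language_codes)
    if not langs:
        raise ValueError("a language resource must cover at least one language")
    return "monolingual" if len(langs) == 1 else "multilingual"

# An Italian wordlist: every item belongs to Italian.
print(linguality(["it", "it", "it"]))          # monolingual
# A code-switching corpus mixing English and Mandarin.
print(linguality(["en", "zh", "en"]))          # multilingual
# A collection of monolingual subcorpora in different languages.
print(linguality(["de"] * 100 + ["fr"] * 80))  # multilingual
```

Note that the last case captures the Schmidt and Wörner view: a resource built from monolingual subparts in different languages still counts as multilingual.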
The LLOD cloud is inherently multilingual due to its inclusion of corpora and resources containing data in various languages. A separate and important issue is how that information is actually represented. Ultimately, it must bottom out in the association of an entity with a universally accepted language identifier. A recent in-depth study, reported by Spahiu et al. [216], has provided valuable insights into the current state of multilinguality within LLOD datasets; note that this study only considered LLOD datasets that were available as dumps.
Multilingual services and tools
A service or tool is characterised by three things: inputs, outputs and behaviour. A service or tool will be deemed monolingual if it operates over inputs and outputs that (like monolingual corpora) are both associated with the same single natural language. Expanding this to the multilingual case, there are several possibilities: (i) input and output are in different languages (e.g., a translation service); (ii) the same service can be applied to input/output pairs in one language, but for several different languages (e.g., EN–EN and FR–FR summarisation); (iii) various combinations of (i) and (ii). It is also possible to envisage NLP services where either input or output is not in natural language as such but in some other form, such as a parse tree or an abstract meaning representation. The linguality of such structures is discussed in the next section.
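The cases above can likewise be sketched as a toy classifier over the set of (input language, output language) pairs a service supports. The service examples are invented for illustration.

```python
# Toy classification of NLP services by the language pairs they operate
# over, mirroring cases (i)-(iii) described above.

def service_linguality(pairs):
    """pairs: iterable of (input_lang, output_lang) a service supports."""
    pairs = set(pairs)
    if any(src != tgt for src, tgt in pairs):
        # Case (i): at least one cross-lingual pair, e.g. translation.
        return "multilingual (cross-lingual)"
    if len({src for src, _ in pairs}) > 1:
        # Case (ii): same-language pairs, but for several languages.
        return "multilingual (multiple monolingual)"
    return "monolingual"

print(service_linguality([("en", "fr")]))                # translation service
print(service_linguality([("en", "en"), ("fr", "fr")]))  # EN-EN and FR-FR summarisation
print(service_linguality([("en", "en")]))                # English-only tool
```

Case (iii), a combination, would simply trigger the cross-lingual branch first, since any single cross-lingual pair already makes the service multilingual in the stronger sense.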
Multilingual knowledge structure
Examples of knowledge structures are ontologies, taxonomies, etc. Items in this class have several distinguishing characteristics. First, they can be represented directly using LLOD machinery (e.g. using RDF, shared vocabulary, naming with URIs, links to other resources). Second, they are primarily
Multilinguality as language independence
LLOD embodies language independence in three ways: (i) its design principles are language-independent, (ii) it encourages
In summary, the design of LLOD supports language independence by offering principles for achieving a useful compromise between linguistic felicity and interoperability across languages. This is achieved by linking through appropriately extended shared vocabularies.
Before proceeding to a systematic review of approaches to create, represent, and reuse multilingual language data building on LLOD principles, we first introduce our methodological approach.
This section gives a detailed description of the methodology we applied to our systematic literature review, based on the well established PRISMA method [192], and provides details on the obtained results of the systematic review that serve as a basis for the comprehensive analysis in the following sections.
Methodology
The objective of this systematic review is to provide a synthesis of the state of knowledge (Sections 4 and 5) and suggestions for priorities of future research (Sections 6 and 7). The PRISMA method has specifically been designed to provide detailed reporting guidelines for such reviews to ensure a comparable and comprehensive result. This method generally consists of three stages:
Identification
Screening
Inclusion
Identification
In order to optimise our search in publication databases, a set of keywords was jointly defined by a group of, in total, 16 experts who are the authors of this article. Each keyword represented a composition of multilingual, multilinguality, multilingualism or cross-linguistic, cross-lingual and prototypical search terms for LOD, e.g., RDF, linked data, web, or simply “multilingual data”. In addition, we explicitly included linguistic description levels in the keywords, i.e., pragmatics, syntax, semantics, lexical, discourse analysis, phonology, phonetics, and morphology. In total, 41 individual keywords, e.g. [“multilingual LLOD”], and keyword combinations, e.g. [“multilingual data” AND “representation”], were jointly identified as relevant. The keywords were collected in a document, discussed in several meetings, and initially submitted to one search platform to test their potential return; keywords that yielded no results were excluded from further steps. In a second step, the keywords were rated on a scale from 1 to 10 by 6 experts, where 1 signified not relevant and 10 denoted highly relevant for this search. We calculated the average of these scores for each keyword or keyword combination to obtain a final relevance score. The list of keywords and average expert ratings are available at
These keywords represented a starting point for an extensive search on several publication platforms, which the same group of experts jointly identified as important to this task. The following search platforms for scientific publications were utilised in the proposed approach:
Scopus
Web of Science
DBLP
Google Scholar
The time period for this search was set from 2009 until 2021, which focuses our survey on more recent works, and an additional search was performed after the first submission to include papers published until 2023. We additionally assumed that important publications from before 2009 would be included in review papers that fall within the selected time period. To reduce the results to a manageable number of papers to be read by the 16 experts of this research endeavour, each paper was ranked based on its number of occurrences across platforms and the expert keyword scores introduced above. The final score for each paper was calculated by taking the score of each search keyword the paper was returned for, multiplying it by the number of occurrences across platforms, and summing these products. For instance, Paper No. 1 was found three times across platforms with the keyword [“multilingual LLOD”], which has an expert score of 9.17, resulting in a score of 27.51. The same paper was also returned once by the keyword [“multilingual information”] with an average expert score of 4.17, which makes the total score for this paper 31.68 in the final ranking. This approach clearly favours papers returned by several keywords with high expert scores.
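The scoring scheme described above can be sketched in a few lines; the worked example reproduces the score of Paper No. 1 from the text.

```python
# Sketch of the paper-ranking scheme: for each keyword a paper was
# returned for, multiply the keyword's average expert score by the
# paper's number of occurrences across platforms, then sum.

def paper_score(hits):
    """hits: list of (avg_expert_score, occurrences_across_platforms)."""
    return sum(score * occurrences for score, occurrences in hits)

# Worked example from the text (Paper No. 1):
# ["multilingual LLOD"] scored 9.17 and was found 3 times  -> 27.51
# ["multilingual information"] scored 4.17, found once     ->  4.17
total = paper_score([(9.17, 3), (4.17, 1)])
print(round(total, 2))  # 31.68
```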
The extensive search was supplemented with snowballing, i.e., exploring more recent publications that cite central works in our result corpus, as well as frequently cited older references that recur. In parallel, a reference repository of publications that this group of experts considered central to the topic was compiled. This reference repository serves as a gold standard to validate our semi-automated keyword-based search strategy: we evaluated to which degree the result corpus of the latter contains publications from the reference repository.
The top-rated papers from the Identification step were each manually annotated by two experts. A crucial and central qualifying question for the screening process was which linguistic description levels are addressed or described in each publication. Further criteria for this Screening step were the relevance of the publication to the topic of multilingual linguistic linked data and its thematic categorisation by representation, approach or standardisation. If one or two annotators marked a paper as “unsure”, i.e., not clearly central to this survey but possibly to be considered, a third expert decided on the publication’s relevance.
To distribute the final set that resulted from this initial screening among experts, we performed an annotation process with pre-defined categories based on title, abstract and keywords. Only if categorisation based on these three components was not possible did the full text have to be consulted at this stage. The categories for this final step were divided into generic and specific annotation tags, represented in Table 1, where the specific tag of linguistic description level had to be assigned to all publications.
Tags for expert annotation of result set
For generic tags, the category was only assigned if relevant for a given publication. For specific tags, each of the three categories and a respective value exemplified in Table 1 was assigned. This annotation with generic and specific tags provided the basis for clustering the result set, assigning each cluster a specific label. The clusters served to decide on the relevance of individual publications by comparison with other publications on the same topic, to perform targeted snowballing, and to ensure that experts could search for more recent publications on the specific topic, mitigating the risk of missing important contributions. Furthermore, it facilitated the distribution of the workload among the experts.
To decide on the eligibility of publications, each cluster was assigned to one, two or three of the experts of this work, depending on the size of the cluster. A cluster in our case is a grouping of papers based on their identical or similar tags. Very large clusters were assigned to three experts, very small clusters to only one expert. Some clusters that contained a considerable number of papers on a specific subtopic, e.g., OntoLex-Lemon, were further subdivided. Table 2 shows the types of labels and number of clusters, the number of papers contained in each cluster, and the number of experts that worked on each cluster. As can be seen in Table 2, some of the 16 experts were assigned to more than one cluster.
Types and numbers of clusters with number of publications per cluster and experts
This section describes our methods for identifying the final subset of publications to be included in this review. The first and foremost criteria for inclusion were that publications are:
directly related to multilingual linked data
published in English
peer-reviewed (guaranteed by the publication venue)
The explicit decision as to which publications to report was taken by the experts of the individual clusters, where specific papers were discussed with other experts if the decision was not clear. Snowballing, that is, checking the citations in our result set for important works and complementing the result set with additional, more recent publications, further increased the number of publications considered for this survey.
Inclusion was designed as a two-step process. In the first step, experts assigned to a specific topic, i.e., a cluster in our case, prepared a written summary of topic-specific publications, dividing the contents into the topics that now represent Sections 4 to 5 of this article for uniformity. In the second step, the individual sections of each cluster summary were synthesised into the sections of this article.

PRISMA 2020 flow diagram; expert involvement is indicated at each step.
The total number of papers for each stage of the survey methodology is represented in Fig. 1. In the Identification stage, we identified 41 keywords that were ranked by 6 experts according to their relevance. The Spearman correlation across all six expert rankings was 0.632, indicating a strong correlation. The keyword scores provided the basis for ranking the papers, adding up the scores of each keyword a paper was returned for. In total, the 41 keywords returned a list of 25,074 papers.
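For readers unfamiliar with the measure, Spearman's rank correlation for two rankings without ties is rho = 1 - 6·Σd²/(n(n²-1)), where d is the per-item rank difference. The following sketch applies it to two invented expert rankings of five keywords (the real rankings covered 41 keywords and six experts).

```python
# Spearman's rank correlation coefficient for two rankings without ties,
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
# The toy rankings below are invented for illustration.

def spearman(rank_a, rank_b):
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Two experts ranking five keywords (1 = most relevant):
print(spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```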
Given the number of people involved and the time available to annotate papers, we had to limit the result set to be annotated. To this end, after removing duplicates, the result set was ranked by keyword-based score and the top-ranked publications were inspected to determine a cutoff score. This cutoff turned out to be a score of 37, after which publications started to become less relevant to our topic, limiting the result set to be screened to 210 publications. For comparison, the top-ranked publication obtained a ranking score of 155.19. Manually screening and annotating this reduced result set further decreased the number to 110 publications after the screening phase (see Section 3.1.2), removing publications that were duplicates or not directly relevant. This manual annotation first involved assessing whether a paper was relevant (1), not relevant (0), or whether the annotator was unsure about its relevance (2). The inter-rater reliability for this rating resulted in a moderate kappa value of 0.495, mostly because one rater was often sure about a paper's relevance while the second annotator was unsure and assigned a 2. In cases where a score of 2 was assigned, a third annotator decided whether to include the publication or not. This detailed screening stage led to the exclusion of 14 more papers: 4 were superseded by newer publications by the same authors, 6 were closely related to other use cases, e.g., on BabelNet or OntoLex-Lemon, and 4 were finally deemed not closely related to linguistic description levels.
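The inter-rater reliability measure used here, Cohen's kappa, compares observed agreement with the agreement expected by chance. The sketch below computes it for two invented annotators over the three screening labels; the actual annotations behind the reported 0.495 are not reproduced.

```python
# Cohen's kappa for two annotators over the three labels used in the
# screening phase: 0 = not relevant, 1 = relevant, 2 = unsure.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is chance agreement from the marginal label frequencies.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Invented annotations for five papers:
a = [1, 1, 0, 2, 1]
b = [1, 0, 0, 2, 2]
print(round(cohen_kappa(a, b), 3))
```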
The size of the clusters varied between 4 and 25 publications; the smallest related to the tag LLOD infrastructure, the largest to the specific representation format and standard OntoLex and its predecessor Lemon [160], as represented in Table 2. Summaries of these clusters were prepared by experts and structured by the topics and sections of this article. Not all of these topics were covered by each cluster; e.g., the cluster on morphology did not explicitly address other linguistic description levels. The final distribution of papers by year of publication is shown in Fig. 2, which clearly shows that this has been a topic of continued interest over the past decade. A lower number of publications in the final year considered for this survey is to be expected due to the submission time of the first version of this article.

Distribution of included papers by year.
In terms of the gold standard comparison, of the 10 papers manually selected as highly relevant by experts, only 6 were included in our final result set. This confirms our intuition that this method should be extended by snowballing and further investigation of the individual linguistic description levels, which we performed where deemed necessary. The final number of papers included in this survey comprises 227 publications. We kept references to individual chapters of a monograph if these were part of our result set and referenced them accordingly in this work.
All publications surveyed and added by means of snowballing and exploring more recent publications are discussed in the following Sections 4 and 5. First, we present approaches specific to individual linguistic description levels. Second, resources, their uses and representation models are discussed. In Sections 6 and 7, we derive challenges from the survey analysis as well as our own professional experience and discuss a potential ideal ecosystem for LLOD with respect to multilingual data and linguistic description levels.
A summary per linguistic description level of the respective language-agnostic models and a representative resource along with its language. If a resource is available in multiple languages, some of them are indicated
In this section, we discuss the results of our literature analysis with respect to representation models along different linguistic description levels, also mentioning some examples of language resources to which such techniques were applied. An overview of the models and indicative resources per linguistic level can be found in Table 3. Subsequently, in Section 5, we review the types of language resources and their use in more detail. The considered linguistic description levels are the following:
Lexical Semantics
Syntax and Morphology
Pragmatics
Lexicography
Phonetics and Phonology
Translation and Terminology
Etymology and Diachronicity
One recurring and predominant model for representing linguistic information as linked data at different linguistic description levels is OntoLex-Lemon. Thus, several of the approaches covered in this section represent extensions of OntoLex-Lemon (see [59,164] for an overview of such extensions). It also occupies a central role as a representation mechanism in the integration of resources and services into complex language technology-processing pipelines [161]. Nevertheless, the objective of this section is to provide a general overview of approaches to describing different linguistic description levels within the context of multilingual linked data. This overview serves to show which levels have been well covered in the literature and which might require more attention, as well as to identify open challenges.
It should be noted that the majority of reviewed papers do not refer to specific linguistic description levels, but rather make generic references to “linguistic data”, “lexical data”, “language annotations”, “annotated corpora”, etc. Such generic references typically include several linguistic description levels that deal with written language, e.g., morphology, syntax, (lexical) semantics, etc. Bosque-Gil et al. [23] explicitly touch upon the representation of specific linguistic levels, i.e., phonetics and phonology, morphology, syntax, semantics, semiotics, and discourse, as well as specific branches of linguistics, i.e., historical linguistics, lexicography, typology and cross-linguistic studies, and terminology. Bosque-Gil et al. [23] observe that “phonetics and phonology remain two areas with relatively low coverage in the LLOD cloud”, as does dialogue structure. Our more comprehensive and more recent survey confirms this finding based on the coverage of description levels and the number of papers in the result set on these description levels. Additionally, we identified a low coverage for pragmatics. While we touch upon the modeling of linguistic data and different linguistic description levels in this and the following section, please consult Khan et al. [135] for a very comprehensive survey of the current state of the art in modelling LLOD.
Lexical semantics
Lexical semantics is the study of word meaning. Within the context of this article, we are interested in how word meaning in all its facets can be represented in LLOD. Several models for representing lexical data on the web have been defined, as depicted in Table 3. These models made it possible to link the semantic information described in existing ontologies with the linguistic information necessary to connect ontological concepts with their mentions in natural language data.
Of these models, OntoLex-Lemon surfaced predominantly in our result set (see Table 2), also in its preceding version, the Lexicon Model for Ontologies (
In the core model of OntoLex-Lemon, headwords are represented as lexical entries (
The original lemon model [160] advanced in the context of the W3C OntoLex community group,10
A model for describing lexical semantics preceding and extended by OntoLex-Lemon is SKOS [173]. It is an RDF vocabulary designed to represent concept schemes and provide lexical information for thesauri and other types of controlled vocabularies. Lexical meaning is represented as
One alternative approach to represent lexical semantics in our result set is Framester [99], a data hub focused on broadening the FrameNet coverage of linguistic information and formal homogeneous linking of lexical and factual resources. Building on Fillmore’s frame semantics [92] and Linguistic Linked Data principles, it acts as a hub between FrameNet, WordNet, VerbNet, BabelNet, DBpedia, DOLCE-Zero, and many other resources. It provides a two-layered (intentional-extensional) semantics for frames, semantic roles, semantic types, selectional restrictions, and other elements of lexical resources in OWL2. Any word or multiword can then evoke a frame, which can be a FrameNet frame or any other type of frame, such as a WordNet synset frame. While this approach allows for easy access via a SPARQL endpoint and a different representation model for lexical semantics, multilinguality is not explicitly considered and is only covered insofar as the interlinked resources are multilingual.
From the perspective of linked data, all approaches to representing lexical semantic information agree that the conceptual or meaning level should be kept separate from the string or word level. This is important since additional information might only apply to one of these levels, e.g., part-of-speech relates to the lexical representation rather than to the meaning of a word. A separation of meaning and form is particularly important for representing multilingual information, such as equivalent words or multiwords across languages that represent the same meaning but require different metadata descriptions.
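The form/meaning separation can be sketched as a toy data structure in which language-specific lexical entries point to a shared, language-independent concept. The names and structure below are illustrative only, loosely inspired by OntoLex-Lemon rather than reproducing its actual vocabulary.

```python
# Toy sketch of the separation of form and meaning: language-specific
# entries (word level) reference shared concept identifiers (meaning
# level). Identifiers and data are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    """Word level: language-specific form plus form-level metadata."""
    lemma: str
    lang: str
    pos: str  # part-of-speech belongs to the form, not the meaning
    senses: list = field(default_factory=list)  # shared concept identifiers

# A shared, language-independent concept (meaning level).
CAT_CONCEPT = "ex:concept/cat"

entries = [
    LexicalEntry("cat", "en", "noun", [CAT_CONCEPT]),
    LexicalEntry("gatto", "it", "noun", [CAT_CONCEPT]),
]

def translation_equivalents(entries, lemma, lang):
    """Entries in other languages that share at least one concept."""
    src = next(e for e in entries if e.lemma == lemma and e.lang == lang)
    return [e.lemma for e in entries
            if e.lang != lang and set(e.senses) & set(src.senses)]

print(translation_equivalents(entries, "cat", "en"))  # ['gatto']
```

Because equivalence is established through the shared concept rather than through direct word-to-word links, form-level metadata (part-of-speech, gender, etc.) can differ freely per language without affecting the meaning level.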
Syntax guides the composition of words and morphemes into larger units of phrases and sentences. Morphology studies the composition of words, where inflectional morphology is concerned with affixes that carry grammatical meaning to fit words into specific grammatical contexts, and derivational morphology relates to the formation of new words with changes in part of speech and lexical meaning. One common way to represent syntactic and morphological information in relation to textual data and corpora is by means of annotation metadata. A very comprehensive ontology formalising linguistic information in a machine-readable manner for 75 language varieties is provided by the Ontologies of Linguistic Annotation (OLiA) [54], which covers morphology, morphosyntax, phrase structure syntax, and dependency syntax. Recently, OLiA has been utilised in Annohub [2], a method to harvest existing annotation schemes to provide an RDF-based platform for linguistic research.
In OntoLex-Lemon, the syntactic behaviour of headwords in the lexicon, i.e., lexical entries, can be described by means of syntactic frames and the number and type of arguments a lexical entry requires [59]. For instance, verbs that follow a transitive frame require a syntactic subject and a direct object. Morphemes can be represented as different forms of a lexical entry, e.g. singular and plural forms. A very specific scenario for re-using OntoLex-Lemon to model morphological and syntactic information is provided by Loughnane et al. [155], who aim to represent annotations generated from language-learning content. As examples, the authors model a Spanish conjugation and an English syntax exercise as LLD.
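A transitive frame of this kind can be sketched as follows. The property and class names synsem:synBehavior, lexinfo:TransitiveFrame, lexinfo:subject, and lexinfo:directObject exist in the OntoLex synsem module and LexInfo; all entry IRIs are invented for illustration:

```python
# Sketch of a transitive syntactic frame in the spirit of the OntoLex
# synsem module and LexInfo.  Entry IRIs are invented for illustration.
triples = [
    (":read_en", "synsem:synBehavior", ":frame_read_tr"),
    (":frame_read_tr", "rdf:type", "lexinfo:TransitiveFrame"),
    (":frame_read_tr", "lexinfo:subject", ":read_arg0"),
    (":frame_read_tr", "lexinfo:directObject", ":read_arg1"),
]

ARG_PROPS = ("lexinfo:subject", "lexinfo:directObject",
             "lexinfo:indirectObject")

def required_arguments(entry):
    """Collect the syntactic arguments demanded by an entry's frame(s)."""
    frames = [o for s, p, o in triples
              if s == entry and p == "synsem:synBehavior"]
    return [o for s, p, o in triples if s in frames and p in ARG_PROPS]

print(len(required_arguments(":read_en")))  # 2: subject and direct object
```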
One phenomenon at the syntax-semantics interface that we decided to include in this section for the purpose of this overview is that of
Morphology remains an under-explored aspect of LLOD. With the systematic review, we identified papers that address morphology in lexical resources [139,140,203], in corpora [45,131,214], and in grammars [188], as well as general modelling challenges [139,142].
In all of these areas, a number of more recent publications have appeared, which we added after the systematic review. OntoLex-Lemon extensions for morphology initially focused on inflectional morphology and composition with limited support for derivational morphology. The Multilingual Morpheme Ontology (MMoOn) [138,139] has been designed in a bottom-up approach to provide an exhaustive vocabulary for morphological inventories, partly inspired by current standards, tools and resources as applied in language documentation and linguistic typology. Its feature inventory incorporates a large number of terminological resources that are of considerable size in their own right (ISOcat, OLiA, LexInfo), which is why it has grown into a relatively large vocabulary. MMoOn [142] focuses on decomposition of entries and related word forms as well as morphological patterns that are used to form lexical entries and word forms. To this end, an extension of OntoLex-Lemon by 13 classes and 11 properties has been proposed (Version 4.17 at the moment of writing), the most central ones being
Several additional features that should be addressed in the future are discussed, such as the ordering of morphs, which RDF does not natively support. Preliminary work in this direction is reported in Declerck et al. [77], which shows how the lexical representation and linking features of OntoLex-Lemon can be used to model morphological and ordering restrictions over the components of Multiword Expressions (MWEs), illustrated by examples from OdeNet, a German resource for lexical semantics. Because of the complexity of its vocabulary, MMoOn lacks wide application, but it has been driving the development of the OntoLex-Morph module [142]. While OntoLex-Morph does not provide the level of detail of MMoOn, it defines elementary and reusable data structures for representing morphology as LLOD, and MMoOn is expected to serve as an inventory of morphological features in this context. A desideratum in this regard is the wider application of the emerging OntoLex-Morph specifications to broad-scale morphological resources such as the UniMorph12
Pragmatics studies the contribution of context to meaning and utilization of language in social interactions as well as the relationship between interacting interlocutors. To represent pragmatic information as LLOD, Pareja-Lora [194] extends the OntoLingAnnot annotation framework for morphological, syntactic, semantic, and discourse phenomena by an ontological conceptualization of pragmatics. To this end, pragmatic units are introduced to annotate text and dialogues in a way that they can interact with the other linguistic description levels, since every linguistic unit can have a pragmatic projection. For instance,
In terms of discourse annotation, Chiarcos [39] proposes an extension of Ontologies of Linguistic Annotation (OLiA) [36] with a conceptualization of discourse features as found in major annotated corpora, e.g. Penn Discourse Treebank. To this end, the model introduces the classes
Another line of research that broadly falls within the scope of pragmatics is the computational modelling of rhetoric, style, and genre information by means of OWL ontologies [11,26,175,176,189]. At the moment, however, these efforts are primarily conducted in the context of literary studies and less frequently applied to develop multilingual applications, and are thus beyond the scope of this article.
In terms of real-world applications, chatbots operating on knowledge graphs and other structured data have been described, as well as human language interfaces to ontologies and the use of ontology lexicalization techniques (e.g. [68,120]). LINGVO [133], for instance, addresses the challenge of ranking knowledge graphs by their degree of multilinguality. While these technologies can benefit from and partially build on lexical data linked across multiple languages and thus have a multilingual dimension, the processing of discourse information is under-represented in this line of research. A notable exception is the development and practical application of an OWL/DL ontology of discourse relations in the context of an NLG system by Bärenfänger et al. [8]. This general line of research emerged from work on ontology-based parsing for symbolic natural language generation and deep syntactic parsing proposed around the same time [235,236], and is continued with limited intensity to this day [62,122,123,175,234]. Overall, however, the area suffers from a lack of publicly available data sources compliant with the LLOD format. Instead, discourse-related data continues to be published in resource-, domain- or community-specific formats (e.g. [191]).
In an effort to address this issue, Chiarcos and Ionov [46] propose the formalization of discourse markers, such as
Lexicography
From a practical perspective, lexicography refers to the compilation, writing, and editing of dictionaries and other types of lexical resources. From a theoretical perspective, it relates to the study of lexeme features, such as syntagmatic and paradigmatic behaviour. A lexeme is coarsely defined as a set of inflected variants of a word.
In recent years, a growing trend of publishing lexical resources, including dictionaries, as linked data on the web has been observed. Bosque-Gil et al. [20] discuss the benefits of representing a lexicon as linked data, both at the level of the macro-structure (internal and external reusability of the elements in the lexicon, independence of the order of appearance of lexical entries and senses in cross-references, compatible onomasiological and semasiological views, etc.) and at the level of the micro-structure (every lexicon element, i.e., lexical entry, sense, written form, etc., is a node in the graph, thus being a potential entry point into a LD dictionary). These and other advantages illustrate the difference between traditional electronic dictionaries, compiled with only human users in mind, and dictionaries created for both humans and computers, as is the case for linked data dictionaries. Some early works that used linked data to represent dictionary data comprise monolingual [141], bilingual [117], and multilingual [22] dictionaries, as well as diachronic [137], dialectal [79], and etymological ones [1].
Based on the experience of the works referred to above, Bosque-Gil et al. [21] identify a number of issues when converting information in a dictionary to OntoLex-Lemon, e.g. a headword may be associated with several parts of speech. Establishing translation relations between usage examples of words also turned out to be challenging. The authors go on to propose a Lexicography Module extending OntoLex-Lemon to resolve these issues. The specification of such a new module, called
There has been a close collaboration between the recently finished projects Prêt-à-LLOD15
Etymological information that provides details on word origins and histories is frequently a part of dictionaries. Thus, transforming dictionaries and lexical resources including etymological and diachronic information to LLD requires a means of adequately representing such information. Since OntoLex-Lemon is the predominant model for representing lexical information, Khan [136] proposed an OntoLex-Lemon Etymological Extension (lemonETY) by linking etymological elements to
In addition to word histories, it is important to enable a representation of historic languages and near-extinct languages with digital language equality and preservation of cultures in mind. Bellandi et al. [10] discuss how to represent a multilingual and multi-alphabetical Old Occitan medico-botanical lexicon in lemon and discuss an extension to multilingual settings, e.g. by extending
To truly assist in an inclusive approach to the digital preservation of culture and cultural heritage, linguistic linked data should be able to accommodate all types of linguistic representation, i.e., written, spoken, and signed. Sign languages have received very little attention in LLOD, with very few exceptions, e.g. Gennari et al. [101]. In this case, the topic goes beyond etymology and diachronicity, since the representation of sign languages as such already represents a blind spot. From a more etymological perspective, representing ancient signs, such as cuneiform signs, as LLOD should be considered. Homburg [128] proposes an extension of OntoLex-Lemon with paleocodes to this end, which, among other things, requires an SVG representation.
Phonetics and phonology
Phonetics studies the production and perception of speech sounds or equivalent representations, e.g. signs in sign language. Phonology investigates how speech sounds, or equivalent representations, form patterns in a specific language or across languages.
The Phonetics Information Base and Lexicon (PHOIBLE) [181,182] is a phonological typology resource that ports disparate segment inventory databases to linked data to make them linguistically and computationally interoperable. Additionally, knowledge about distinctive features is added. Thus, PHOIBLE provides a research platform for segments and distinctive features across languages. A simple RDF model was created to link segments and languages as well as features and segments, and to provide metadata for segment inventories.
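A minimal sketch of such a model, with invented IRIs standing in for PHOIBLE's actual ones, shows how linking segments to languages and features to segments enables cross-linguistic queries:

```python
# Sketch of PHOIBLE-style linking of languages, segments and distinctive
# features as plain triples.  All IRIs are illustrative, not PHOIBLE's own.
triples = [
    (":deu", ":hasSegment", ":segment_p"),
    (":deu", ":hasSegment", ":segment_y"),
    (":eng", ":hasSegment", ":segment_p"),
    (":segment_p", ":hasFeature", ":labial"),
    (":segment_p", ":hasFeature", ":voiceless"),
    (":segment_y", ":hasFeature", ":rounded"),
]

def languages_with_segment(seg):
    """Cross-linguistic query: which inventories contain a given segment?"""
    return sorted(s for s, p, o in triples if p == ":hasSegment" and o == seg)

def features_of(seg):
    """Typological query: which distinctive features does a segment carry?"""
    return sorted(o for s, p, o in triples if s == seg and p == ":hasFeature")

print(languages_with_segment(":segment_p"))  # [':deu', ':eng']
print(features_of(":segment_p"))             # [':labial', ':voiceless']
```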
Translation and terminology
Translation refers to the explicit representation of equivalent words, terms or longer sequences across languages that derive from a translation process. In contrast, terminology describes the generally multilingual representation of equivalent domain-specific single- or multi-word terms across languages. Terminologies can represent translated terms or terms derived from parallel or comparable corpora.
Vila-Suero et al. [231] follow a path similar to that of Labra et al. [144] in addressing multilingual LD and identify three levels of multilinguality in a resource: the resource itself might be multilingual, the vocabulary to describe the resource might be mono- or multilingual, and a target dataset for enriching and linking might be mono- or multilingual. A use case on geo.linkeddata.es from the Spanish National Institute of Geography with metadata in several local languages is presented. While considering the same aspects in which multilingualism plays a role as Labra et al. [144], the analysis is structured along the method proposed by Villazón-Terrazas et al. [232] for publishing LD: specification, modeling, generation, linking, publication, and exploitation.
Gracia et al. [116] propose an extension of lemon that builds on early work from Montiel-Ponsoda et al. [180] and introduces relations specific to modeling translations as linked data, such as
The DBnary dataset [213] draws on Wiktionary and provides vartrans relations for the subset of translations where source and target languages have their own lexicon, but introduced its own
León-Araúz and Faber [146] analyse the dynamic nature of terms and concepts from a pragmatic perspective and the challenges this raises for multilingual and cross-lingual settings. In terms of modelling, they utilise translation equivalents and context elements of OntoLex-Lemon. The main contribution is a detailed discussion of term variants, from orthographic to diatopic and multi-dimensional facets of concepts, as well as a detailed classification of terminological gaps and the translation relations required to handle these gaps. Such relations include canonical translations, generic-specific translations, extensional translations, communicative translations, etc.
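The idea of typed translation relations bridging terminological gaps can be sketched as follows. The relation names paraphrase the classification above rather than being normative vocabulary terms, and the Norwegian examples are invented for illustration:

```python
# Sketch of typed translation relations between senses, in the spirit of
# the classification of terminological gaps discussed above.  Relation
# names and sense IRIs are illustrative, not normative vocabulary.
translations = [
    (":sense_fjord_no", "canonicalTranslation", ":sense_fjord_en"),
    # No exact English equivalent exists: bridge the gap with a more
    # generic target instead of a canonical one-to-one translation.
    (":sense_utepils_no", "genericSpecificTranslation", ":sense_beer_en"),
]

def translation_targets(sense, relation=None):
    """Retrieve translation candidates, optionally filtered by relation type."""
    return [(rel, tgt) for src, rel, tgt in translations
            if src == sense and (relation is None or rel == relation)]

print(translation_targets(":sense_utepils_no"))
# the gap is bridged by a generic-specific rather than a canonical relation
```

Typing the relation, rather than recording a bare equivalence, preserves the information that a target term only approximates the source concept.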
Early approaches to porting terminological information to linked data include Federmann et al. [90], where the authors present a new approach to the automated acquisition of multilingual terms for labels of ontologies in the financial domain from stock exchange websites. This approach uses direct localisation/translation by searching candidate terms in various semi-structured multilingual web sources and repositories. Rule-based machine translation methods are used to extract terminology and work with under-resourced data extracted from multilingual websites. The final goal of this approach is to integrate the extracted terminology into Monnet [6] and TrendMiner [143] by transforming HTML into an XML-encoded multilingual terminology database or into the OntoLex-Lemon format. Multilingual terminologies available as LLOD, described in Lewis [151], include, among others, IATE, EuroVoc, and TAUS. More recently, Gracia [110] describes Terminesp,18
Terme-à-LLOD [80] is a method of porting TermBase eXchange (TBX) resources, specifically as a use case IATE,19
While focused on the interdisciplinary exchange of theoretical and empirical findings in language acquisition research, Pareja-Lora et al. [16] address the need to integrate such data not only across disciplines but also across languages. Thus, they identify the necessity to describe and integrate language resources across different linguistic description levels, e.g. phonological information, morphological markings, and syntactic differences, to perform cross-linguistic research. Cross-linguistic studies on language acquisition seek to identify commonalities and differences in developmental patterns across languages. The complexity of the data utilised for such studies goes beyond linguistic description levels and extends to methodological and research design information, provenance information (metadata), and multimedia representations of data (e.g. speech coding). All of these different dimensions should be captured and assimilated in order to allow for cross-resource analyses of research findings and data.
Two initiatives that have focused on representing language resources from different linguistic description levels, even though not directly related to LLOD but rather in the offline category of the language resource classification proposed by Lezcano et al. [152], are GrAF [130] and TEI [63]. Their LLOD counterparts are OntoLex-Lemon, Onto Media [132], MTE OLiA [42], and ISOcat20 (ISOcat as such has been discontinued as an online inventory and has been succeeded by DatCatInfo, a repository of data categories).
In this section, we discuss LLOD resources and their use as a multilingual and semantically interconnected linguistic data environment, which is useful in a number of tasks and application domains. For instance, LLOD resources have been applied in a range of Natural Language Processing (NLP) tasks, such as the evaluation of Framester on frame disambiguation and detection [100], AMUSE for semantic parsing in question answering [120], the use of Wiktionary for a shared task on morpheme segmentation [9] as well as entity linking [178], the utilization of Apertium in a task on translation inference across dictionaries [113], and cross-lingual information retrieval and linking [205]. A detailed overview of how (multilingual) knowledge graphs have been relevant for and used in NLP tasks is provided by Schneider et al. [212], ranging from entity alignment to text summarization. LLOD resources have also been beneficial to many application domains, such as cultural heritage [35,105], healthcare and medicine [111], administration and law [83], e-governance [159,221], media and journalism [219], language learning and education [155], cross-cultural business and commerce [90,229], disaster response and humanitarian aid [34], ecology and environment [4], and digital librarianship [69].
Over time, LLOD resources have become available in all shapes and sizes and have been classified into different schemes. For instance, language resources can be monolingual or multilingual and relate to different domains or be domain-agnostic. To provide a structured overview of resources and their different uses, we rely on the typology of language resources in the LLOD cloud21
When it comes to using these resources, in this article we distinguish between linguistic data usage and LLOD use. Linguistic data usage refers to the scenario where data contained in an LLOD resource are re-used for some specific purpose, without benefiting from the fact that these data have been modelled as linked data, e.g. collecting strings from an LLOD lexicon. LLOD use refers to cases that truly benefit from the LLOD representation of language data and the full potential of Semantic Web technologies. Our focus in this article is on the LLOD use rather than linguistic data usage.
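The distinction can be sketched with a toy example in which all IRIs are invented: harvesting strings ignores the graph, while LLOD use follows links from a lexicon into an external dataset.

```python
# Sketch contrasting "linguistic data usage" (harvesting strings from a
# resource) with "LLOD use" (exploiting links across datasets).
# All IRIs are illustrative, not taken from any actual resource.
lexicon = [
    (":cat_en", "ontolex:writtenRep", "cat@en"),
    (":cat_en", "ontolex:sense", ":cat_sense"),
    (":cat_sense", "ontolex:reference", "dbr:Cat"),
]
knowledge_base = [
    ("dbr:Cat", "rdf:type", "dbo:Mammal"),
]

# Linguistic data usage: strip out the strings, ignore the graph structure.
strings = [o.split("@")[0] for s, p, o in lexicon
           if p == "ontolex:writtenRep"]

# LLOD use: follow the sense and reference links into an external dataset.
def types_for_entry(entry):
    senses = [o for s, p, o in lexicon
              if s == entry and p == "ontolex:sense"]
    refs = [o for s, p, o in lexicon
            if s in senses and p == "ontolex:reference"]
    return [o for s, p, o in knowledge_base
            if s in refs and p == "rdf:type"]

print(strings)                     # just data: ['cat']
print(types_for_entry(":cat_en"))  # linked data: ['dbo:Mammal']
```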
In response to this, and specifically addressing the modelling of morphologically annotated corpora, Chiarcos and Ionov [45] introduced Ligt, an RDF vocabulary in accordance with classical interlinear glossed text (IGT). Based on established tools and formats such as FLEx and Toolbox [47], this is a minimal data model that allows encoding morphological segmentation, annotation, and hierarchical structuring on all levels of morphology. Because Ligt is a relatively novel contribution, it is not yet widely used, and it is primarily to be seen as a first step towards developing common specifications that address aspects of morphology in lexical resources and corpora (i.e., a synchronisation with OntoLex-Morph) on the one hand, and linguistic annotation in general (i.e., an extension or revision of Web Annotation or NIF to support morphological annotation) on the other hand.
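The kind of structure Ligt targets can be sketched with a toy IGT example; the data layout below is an illustrative simplification, not Ligt's normative vocabulary, and the segmentation is a surface-level simplification:

```python
# Sketch of interlinear glossed text (IGT): a word is segmented into
# morphs, each carrying a gloss.  The structure is an illustrative
# stand-in for Ligt's actual vocabulary.
igt = {
    "word": "Häuser",              # German 'houses' (simplified segmentation)
    "morphs": [
        {"form": "Häus", "gloss": "house"},
        {"form": "er",   "gloss": "PL"},
    ],
}

def gloss_line(word):
    """Render the classic two-tier IGT view from the structured data."""
    forms = "-".join(m["form"] for m in word["morphs"])
    glosses = "-".join(m["gloss"] for m in word["morphs"])
    return forms, glosses

print(gloss_line(igt))  # ('Häus-er', 'house-PL')
```

The point of an RDF encoding of this structure is that each morph becomes an addressable node that can be linked to lexicon entries or annotation ontologies, rather than remaining an opaque substring of a gloss line.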
One more recent example of converting annotations and primary data to the LLOD cloud is the conversion of the Tatar National Corpus “Tugan Tel” [185], making it possible to interlink the corpus with available Tatar linguistic resources, e.g. TatWordNet. In fact, an LLOD version of corpus data in general has the added benefit of providing interoperability with other linguistic resources, be they corpora or other types [37]. One example from our result set is the semantic annotation project Open Access Database ‘Adjective–Adverb Interfaces’ in Romance, which links different heterogeneous multilingual corpora annotated morpho-syntactically and semantically in TEI/XML enriched with RDF [198]. One work addressing corpus annotations with regard to discourse markers is Purificação et al. [215], who provide data in Bulgarian, Lithuanian, German, European Portuguese, Hebrew, Romanian, Polish, and Macedonian, with English as a pivot.
POWLA [38] is a general formalism for the interoperable representation of linguistic annotations through OWL/DL. In contrast to previous techniques in this area, POWLA is not restricted to a particular set of annotation layers; rather, it is meant to accommodate any kind of text-oriented annotation. The benefits of this type of representation are widely discussed, even for under-resourced languages (e.g. [200] for South African parallel corpora in our result set). Practical resources and applications in our result set are scarce, and corpora are still under-represented in the LLOD cloud in general. In particular, multilingual corpus annotation and the interlinking of multilingual corpus data remain underexplored areas of research and practice.
Other examples of development and use of LD-based dictionaries can be found in the K Dictionaries [24] and the Linking Latin project (LiLa) [158] initiatives, both of them early adopters of the
Language resources that provide elementary aspects of morphological information are manifold, as these aspects are already part of the OntoLex specification, but they primarily focus on morphosyntax and inflection. Racioppa and Declerck [203] show that LLOD technology makes it possible to seamlessly merge traditional lexical resources, such as multilingual WordNet(s), with independently developed computational morphologies for various languages, so that lexical entries can provide both sense information (from WordNet) and inflectional information (from language-specific morphologies). However, as specifications for the encoding of deeper morphological information in lexical resources are only emerging, only a limited set of lexical resources with rich morphological features currently exists, and these serve mainly as demonstrators of the respective vocabularies. As such, Klimek et al. [140] demonstrated the applicability of the Multilingual Morpheme Ontology (MMoOn) to encode morphological information for Hebrew.
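The kind of merge described by Racioppa and Declerck can be sketched as a join on the lemma; the dataset contents below are invented for illustration:

```python
# Sketch: merging sense information from a wordnet with inflectional
# information from an independently developed morphology, joined on the
# lemma.  All dataset contents are invented for illustration.
wordnet = {"go": {"synset": "wn:go.v.01"}}
morphology = {"go": {"past": "went", "participle": "gone"}}

def merged_entry(lemma):
    """One lexical entry carrying both sense and inflection information."""
    entry = {"lemma": lemma}
    entry.update(wordnet.get(lemma, {}))
    entry.update(morphology.get(lemma, {}))
    return entry

print(merged_entry("go"))
# {'lemma': 'go', 'synset': 'wn:go.v.01', 'past': 'went', 'participle': 'gone'}
```

In an LLOD setting the join key would be a shared IRI for the lexical entry rather than a bare lemma string, which avoids collisions between homographs.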
The original Princeton WordNet [91,174] has frequently acted as a hub connecting wordnets in other languages. However, such linking did not rely on stable identifiers, which led to broken references and other technical problems when new versions of WordNet appeared. To solve this and to increase interoperability, efforts were made to convert Princeton WordNet into linked data [162,228]. Further, linked data principles have been applied in the development of the Global WordNet Grid (GWG) [60].
In addition, there are IndoWordNet and EuroWordNet, which contain 76 individual wordnets in 47 languages.23 Even more wordnets are handled by the Global WordNet Association (globalwordnet.org).
Gillis-Webber [102] contributes to the important area of under-resourced languages by converting the English–Xhosa Dictionary for Nurses to RDF. This is particularly interesting since it considers the representation of click languages, which require characters not typically included in the Roman alphabet. Taking a dynamic perspective on language data, the work puts particular emphasis on the management of provenance and the related linked data generation.
An in-depth overview of the DBpedia knowledge base project is presented in Lehmann et al. [14,145]. DBpedia is a major interlinking LOD hub that extracts knowledge from more than 111 different language editions of Wikipedia. This knowledge base serves many purposes, and there are various applications and tools built around or applied to it. The DBpedia project consists of several important components, i.e., the knowledge extraction framework, DBpedia ontology, and DBpedia Live. The knowledge extraction framework applies various extractors for translating sections of Wikipedia pages to RDF statements. The extraction is based on the community-curated DBpedia ontology, consisting of more than 320 classes. DBpedia Live provides live synchronization with Wikipedia with only small delays of at most a few minutes. In Hellmann et al. [125] the authors present a declarative approach implemented in a comprehensive open-source framework based on DBpedia to extract lexical-semantic resources from Wiktionary.27
Steinberger et al. [220] present an overview of large-scale multilingual parallel language resources made publicly available by the European Commission (EC) and different European Union (EU) organisations with the aim to clarify what the similarities and differences between the various resources are and what they can be used for. The work focuses on 7 full-text corpora resources that cover all 24 official EU languages as well as a variety of non-EU languages: JRC-Acquis [223], DGT-Acquis and Digital Corpus of the European Parliament (DCEP) [119], the translation memories DGT-TM [222], ECDC-TM and EAC-TM, and the document collection accompanying the multi-label categorisation software JRC EuroVoc Indexer (JEX) [221]. These resources are made publicly and freely available online through the Europe Media Monitor (EMM) [219] family of applications developed by the Joint Research Centre (JRC) – EC’s in-house science service.
One resource in the category of knowledge bases is the Semantic Quran [214], a multilingual RDF representation of translations of the Quran. Building on an ontology specifically designed for this resource, the dataset encompasses 43 languages, including some of the most under-represented in the LLOD cloud, such as Arabic, Amharic, and Amazigh. The representation is compatible with NIF and eases application scenarios such as data retrieval for training NLP tools or linguistic research including morpho-syntactic aspects, thanks to the explicit representation of morpho-syntactic information.
Another endeavour to link a knowledge base with the Linked Data cloud is described in the project of integrating EcoLexicon, a multilingual (Spanish, English, German, Modern Greek, Russian, French, and Dutch) terminological knowledge base, into DBpedia and GeoNames. The project is based on ‘linking legacy systems (RDB stored information) with an ontological system’ [5]. Web technologies are also applied in the Digital Humanities, including their use in APIs, NoSQL databases, and database integration, as well as terminology management. Linked Open Data is increasingly applied in the digital humanities, e.g. in the form of prosopographical databases, gazetteers, and citation services, as well as in other projects and applications. The vocabularies created by the linked data movement, for example SKOS, CIDOC-CRM, and CTS, are broadly adopted in the digital humanities and used for terminology integration over distributed data collections. Metadata vocabularies in the GLAM sector provide data on galleries, libraries, archives, and museums; there is also LinkedGeoData. A project collecting, digitising, and tagging geolinguistic data of Cimbrian dialect varieties also adopted the LOD approach to make the dataset interoperable and available to other researchers and projects [82].
From the administrative and legal domain, a major LLOD resource is the multilingual EuroVoc vocabulary from the European Commission, published in SKOS [83]. A more comprehensive initiative to port legal language resources to the LLOD cloud and interlink them was proposed by Martín-Chozas et al. [159]. Their approach includes the porting of existing resources, such as the German Labour Law Thesaurus and JuriVoc, to RDF, as well as the creation of new resources drawing on automated term extraction and existing legal language corpora. Moreover, LOD has become relevant for the accessibility and transparency of government data publication worldwide. Researchers of the World Wide Web Consortium [64] have designed best practices for publishing and interlinking high-quality government data via RDF and SPARQL. It should also be stressed that the popular TEI data model used in the digital humanities can be made compatible with RDF. From a different angle, Gromann [118] presents a vision of joining Neural Language Models (NLM) and LLOD towards multilingual, transcultural, and multimodal information access. Different linguistic description levels are not considered explicitly; however, methods and application scenarios for all three dimensions are provided. In terms of the multilingual aspect, this work proposes uniting different application scenarios of Neural Machine Translation (NMT) and LLOD, e.g. translating LLD contents, learning structured knowledge with NMT, building reasoning on NMT, and NLM-based ontology alignment.
From a different perspective, in Lesnikova et al. [150] a method is proposed that employs the use of Machine Translation techniques (e.g., Bing Translator28
Another extensively used catalogue of linguistic categories is LexInfo [55]. It is primarily intended to be used in combination with OntoLex-Lemon, but can be used for any other purpose that requires stable, well-defined, and de-referenceable URIs to represent grammatical categories. LexInfo has been implemented as an OWL ontology and allows associating linguistic information with elements in an ontology at a great variety of levels of linguistic description and expressivity.
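The benefit of stable, de-referenceable category URIs can be sketched as follows; lexinfo:partOfSpeech and lexinfo:noun are real LexInfo terms, but the namespace IRI and the entry below are illustrative assumptions:

```python
# Sketch: referring to grammatical categories via stable IRIs rather than
# plain strings.  The namespace IRI and entry are illustrative assumptions;
# partOfSpeech and noun are LexInfo terms.
LEXINFO = "http://www.lexinfo.net/ontology/3.0/lexinfo#"

def category_iri(local_name):
    """Turn a category name into a stable, de-referenceable identifier."""
    return LEXINFO + local_name

annotation = (":bank_en", category_iri("partOfSpeech"), category_iri("noun"))
print(annotation[2])  # the same IRI for every resource that uses the category
```

Because every resource points at the same IRI instead of a locally defined string such as "n" or "NOUN", category information remains comparable across independently produced datasets.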
Another project converted the semantic resource Thompson Motif Index (TMI) of folk-literature into LLOD, based on porting lexical resources provided in Wiktionary to a standardised representation, with the aim to support ‘semi-automatic translation of TMI’ and ‘the automatic detection and semantic annotation of motifs in literary work, across genres and languages’ [177]. The multilingual value of this project is reflected in an attempt to enrich TMI, which contains labels in English only, with labels in other languages, namely German and Hungarian.
An additional model in our result set of publications is the Model for Language Annotation (MoLA) [106]. MoLA provides an RDF vocabulary for language annotation that permits the definition of custom language tags and their association with a time period and region. Furthermore, our result set contained the Cross-Linguistic Data Formats (CLDF), building on the CLLD project [95], which represent data types for language typologies. An example of a typological database modelled with CLDF is the representation of languages, or rather languoids, inspired by Glottolog, which models parameters that can be compared across languages, values of these parameters, and sources referring to the primary data collection [97]. It further specifies the CLDF modules, e.g. wordlists, parallel texts, etc., and CLDF components, e.g. cognates, functional equivalents, etc. This format has been applied to various resources, including a database of cross-linguistic colexifications in more than 3,000 language varieties with the objective to analyse cross-linguistic polysemies [206] and phylogenetic methods to analyse the ancestry of Sino-Tibetan [207].
The last available version of BabelNet as LLOD is 3.6, released in February 2016. Later updates of BabelNet (the latest being v5 at the time of writing) do not include updates of the linked data version.
Another resource that is not yet classified is the publication of Joint Research Centre (JRC)-Names resource as linked data using OntoLex to address the problem of identifying name variants of entities found in news media worldwide, within and across many languages [87]. The JRC-Names data originate from real-life multilingual texts, containing useful, complementary name variants.
Despite its rising popularity and the recognition of its usefulness by different disciplines, the LLOD infrastructure has some new [48,76] and old [114] challenges to overcome. As a result of our systematic study, and also based on our own experience, we analyse in this section a number of such challenges to be addressed in order to bring LLOD to its full potential for representing and linking multilingual language data across linguistic levels. Note, though, that some of these challenges are common to LD in general (e.g. sustainability); we nevertheless refer to them here because they are also crucial for the LLOD community. Other issues related to language resources or linguistic data in general but not specific to LD or LLOD (e.g. legal issues, ownership, data protection [157]) are out of the scope of this section.
Entry barriers to the technology
One of the central challenges revolves around enabling researchers and practitioners, who may not be familiar with the LLOD framework, to utilize it effectively. As with any emerging technology, LD presents a steep learning curve, requiring proficiency in RDF, OWL, SPARQL, and specific models such as OntoLex-Lemon. Furthermore, new adopters will need certain technical support to set up the appropriate infrastructure, which may vary depending on their needs, from simple storage of RDF dumps to fully-fledged triple stores with de-referenceable mechanisms.
Another challenge results from the sheer number of language resources that are available, which increases the complexity of interoperability issues. In fact, once a resource in the LLOD cloud is discovered, accessing and exploiting it is not always straightforward. Additionally, the presence of abandoned resources and broken links in the LLOD cloud can be a discouraging experience for newcomers.
To address these challenges, it is imperative not only to develop tools and standards and conduct research, but also to invest in education by means of training schools and courses. These educational activities are critical for the continued growth and advancement of the LLOD infrastructure and the expanding LLOD community. In that respect, ongoing research projects and networks, and the activities of several W3C community groups, are progressing in that direction. For instance, NexusLinguarum36
However, there is still a need for user-friendly visual interfaces and working environments for LLOD (frameworks such as VocBench [224] are a step in the right direction), as well as tools and infrastructures for an easier deployment of (linguistic) semantic data on the Web. Previous efforts like the
Researchers and practitioners who specialise in specific linguistic description levels and actively generate linguistic resources covering one or more of these levels are not necessarily LLOD-savvy. Lowering the LLOD entry barrier is therefore in the interest of the LLOD community as well as of these researchers and practitioners. For the former, it is important to increase coverage, especially of yet under-represented linguistic description levels such as phonetics and phonology, pragmatics, dialogue, sign languages, and diatopic representations. For the latter, it is of interest to maximise the re-use and interoperability of their often manually curated resources.
Ensuring the sustainable hosting of RDF data exposed as linked data on the Web is another critical challenge, not limited to LLOD but common to LOD in general. This challenge involves balancing the efforts of data providers, data consumers, data hosts, language resource providers, technology developers, and linked data application developers. As recently reported in several fora39
Data consumers may want content negotiation mechanisms and server-side infrastructure (a triple store plus SPARQL endpoints), which can be a burden on the host/provider.
Alternatively, the burden can be placed on data consumers, if they need to download and locally process RDF data dumps.
Focusing on the federation and queryability of linked data resources, the ideal scenario from the user's perspective is for the host to expose the data via a SPARQL endpoint, which a client can query directly without setting up local infrastructure. On the other hand, real-world infrastructures currently allow only the deposit of data
Linked data Fragments40
SPARQLer41
RDF-HDT is a community standard for binary compressed RDF data that can be queried directly by means of SPARQL [204]. HDT requires downloading the data, but does not require setting up a local SPARQL endpoint.
More powerful support and infrastructures are, however, still needed: something analogous to www.wordpress.org for websites, but for small linked data providers. Some steps in this direction are Databus,43
To lower the entry barrier to the LLOD cloud, a representation mechanism for linguistic data is crucial. While most linguistic description levels are well represented in the current landscape, some areas, such as phonetics and phonology, pragmatics, dialogue, sign languages, and diatopic representations, lack comprehensive LLOD models. These gaps present challenges not only for the LLOD community but also for researchers and practitioners specializing in these areas, for whom maximizing the reusability and interoperability of their manually curated linguistic resources is essential.
One level whose facets in linguistic research exceed what LLOD representations currently provide is phonetics and phonology. PHOIBLE 2.048
Another important aspect of representing linguistic data as linked data is the ease of moving across and between distinct description levels. Fortunately, interoperability is one of the key assets of the LLOD concept. One predominant approach of the LLOD community that becomes evident in this survey is the extension of existing representation models with dedicated modules for specific levels. For instance, numerous extensions to OntoLex-Lemon and OLiA provide a common base representation to which specific information can be linked, e.g. phonetic features and morpho-syntactic annotations across languages. Models with different theoretical underpinnings can equally and jointly be explored by means of their linked representation in the LLOD cloud. However, this brings us back to the ease of access to LLOD resources, a prerequisite for attracting a wide audience. Only then is it feasible to explore cross-disciplinary linguistic research in multiple natural languages.
When it comes to specific language resources, especially corpora, formalisms such as POWLA were proposed a decade ago, but still very little primary corpus data or corpus metadata has been published in the LLOD cloud. This raises the question of whether the virtues of querying, consistency checking, and linking such data, also to other types of resources and across languages, need to be extolled more explicitly, or whether the entry barriers to the LLOD cloud and/or its representation models are too high for providers of such data. Within the COST Action NexusLinguarum49
To conclude, lowering the entry barrier to LLOD is in the interest of both the LLOD community and these domain-specific researchers and practitioners. Expanding coverage, especially for under-represented linguistic description levels, is vital.
Metadata presents a challenge for a broad audience involved in linguistic research, language resource creation and curation, phonology, translation, and related fields, all of whom can benefit from improved metadata standards and linked data solutions. One notable issue when publishing LRs on the Web is that their metadata is scattered across different language repositories, which makes it difficult to ensure effective search procedures across repositories. Furthermore, different repositories adopt different standards, which makes data accessibility and linking problematic. There are also difficulties in harmonising metadata from different repositories in order to provide a single point of access for searching relevant language resources across repositories.
Linked data, in fact, provides suitable mechanisms to address such issues. In this regard, we advocate for an increased use of agreed vocabularies for LR metadata description, such as the Meta-Share OWL ontology [169]. An example of the use of the Meta-Share ontology can be found in the aforementioned LingHub service. Other types of metadata that might be of interest for the LLOD cloud are the Information Coding Classification (ICC) [70] and licensing information expressed in machine-understandable ways [230]. In order to overcome existing inconsistencies between different language resources, [81] propose a promising methodology for fixing and enriching metadata for the LOD Cloud and Annohub repositories.
Besides metadata for the description of language resources, metadata for the development of particular use cases in linguistics also poses interesting challenges. For instance, as reported by Blume et al. [15], the use of LOD for research on multilingualism, particularly on language acquisition, requires a set of very different metadata to characterise multilingual speakers that is currently not present in the LLOD cloud, accounting for psychological and sociological factors, the competence being evaluated, and the speaker's acquisition history, among many other features. In fact, means to represent information on discourse structures and discourse relations in a multilingual setting, and pragmatics in general, are currently poorly represented in LLOD, as are phonetics and phonology. One especially challenging aspect within the context of LLOD is that all these metadata need to be linked to the participant in a specific study rather than to a language resource or a data repository. Thereby, LLOD could support the development of meta-analysis studies, e.g. to analyse the development of a specific grammatical element across studies. Furthermore, as studies on translation inference in general and in relation to pragmatics have shown, the potential to query data inventories in a structured manner, with a specific research question in mind, across languages, and potentially even from a diachronic perspective, opens up entirely new research avenues for different linguistic branches. For phonology, for instance, such interlinking holds the potential to analyse speech patterns across a large number of languages and representation modes.
Cross-lingual linking
Cross-lingual linking enhances the efficiency and effectiveness of multilingual data integration and knowledge sharing. Thus, it is beneficial for Natural Language Processing (NLP) and Semantic Web researchers, cross-cultural studies, ontology development, benchmark creation, language resource provision, and language technology development, among others.
Interlinking multilingual resources is not straightforward: when entities are described in different natural languages, string similarity measures cannot be applied directly. This task poses several challenges [149]: (1) the structure of the graphs can be different, so structure-based techniques will not be of much help; and (2) even if the structures are similar to one another, the properties themselves and their values are expressed in different natural languages. In this regard, even if an NLP approach is adopted, the performance of the method may depend on the amount of text and the discriminative power of labels [147,148].
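The first obstacle, that plain string similarity breaks down across languages, can be seen with a toy sketch using only the Python standard library; the label pairs are invented examples, not drawn from any linked resource.

```python
from difflib import SequenceMatcher

def label_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Labels naming the same concept in different languages:
cross_lingual = [("dog", "Hund"), ("house", "casa")]
# Labels naming the same concept within one language:
monolingual = [("colour", "color")]

cross_scores = [label_similarity(a, b) for a, b in cross_lingual]
mono_scores = [label_similarity(a, b) for a, b in monolingual]
print(cross_scores, mono_scores)
```

Translation-equivalent labels score near zero while spelling variants within one language score high, which is why cross-lingual linking must fall back on translation, embeddings, or structural evidence rather than surface matching.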
From the perspective of conceptualisation, other issues arise in the linking task [115]: (a) conceptualisation mismatches due to language and cultural discrepancies; (b) conceptualisation mismatches due to the perspectives from which the same domain is approached; or even (c) different levels of granularity in the conceptualisation. Despite recent advancements in the field, all of these issues remain valid and leave room for further research.
Another notable challenge is the need for benchmarks to support the evaluation of methods and algorithms for cross-lingual linking in a Semantic Web context. Current efforts in that direction include the Multifarm [172] track, which is part of the periodic Ontology Alignment Evaluation Initiative (OAEI),50
See latest campaign description at
The main challenges that under-resourced languages face can be grouped into two categories [25]: technological barriers (e.g., lack of the large amounts of data needed to support current deep learning approaches) and cultural and socio-economic barriers (e.g., the low number of language resources hinders cultural heritage maintenance). There are a good number of ongoing efforts and initiatives aimed at the promotion of languages that are often under-resourced (see [25]). However,
There remain some open issues in the application of LD to under-resourced languages, though, such as the necessity of modelling morphologically very rich languages and the still low adoption of LLOD at the morphological level. A second notable issue, as pointed out by Gillis-Webber and Tittel [104], is the current limitation of language tags when dealing with very specific language variants or dialects. The latter is, however, not an LLOD-specific issue, but a broader one that involves the internationalisation of the Web at a larger scale. Nevertheless, potential solutions might come in linked-data-native ways, following the example of lines of work such as Lexvo.org [73], a database that provides information about languages, words, characters, and other human-language-related entities in a linked data format.
Another category of under-resourced languages that is important to consider is that of Sign Languages. Since Sign Languages require a multimodal representation, they provide a particularly interesting challenge for representation models. Because Sign Languages are not organised the same way as spoken languages, representing them might require elements beyond those of current formats for spoken and written languages. Furthermore, existing resources, e.g. the German Sign Language (DGS) corpus [201] and Sign Language of the Netherlands (NGT) [65], and their different transcription systems, e.g. HamNoSys [121], Signing Gesture Markup Language (SiGML) [238] and SignWriting [225], are incomplete. While they cover movements of the hands and body in images for a sign, information on mouthing or mouth movements is missing, among other types of information. Even if this information were available for many signs, there are only a few fully annotated corpora of a decent size. Within European projects, such as Intelligent Automatic Sign Language Translation (EASIER),52
Both the sign language used in the Netherlands (NGT) and the one used in Belgium (VGT) coexist with Dutch: while the spoken language is largely the same, the signed languages are really different languages.
Multilinguality plays a crucial role in enhancing access to linguistic data across various languages, making it a valuable asset for linguists, entities dedicated to language preservation and revitalization, multilingual communication organizations, language resource curators, and Semantic Web researchers. The Semantic Web in general, and linked data in particular, has been repeatedly identified as a core technology to overcome language barriers on the Web [114,218], since it has mechanisms to represent, traverse, and integrate data in different languages, mediated by a common ontological layer. However, the main question is whether LLOD has really helped in making the Semantic Web more multilingual. Studies indicate that the number of language tags used in the Semantic Web has increased, but the dominance of English has never ceased [81,109].
Comparing the LLOD cloud with the broader LOD cloud, one wonders whether LLOD is more "multilingual" than LOD in general. The current availability of linguistic data in the LLOD cloud in terms of languages needs a more systematic exploration. There is also a need to focus on the coverage and granularity of the available data (lexical entries, links to other languages through translation of common referents, availability of data from the different linguistic description levels, etc.). An "observatory" would be needed to measure the quality and evolution of linguistic data along such dimensions.
Towards an ideal ecosystem for LLOD
In a previous analysis, one decade ago, Gracia et al. [114] studied the challenges posed by the so-called Multilingual Web of Data and proposed a roadmap towards its full realisation. In a first stage, they proposed the development of new (lightweight) representation models along with simple techniques for ontology localisation, cross-lingual querying and linking. The idea was to ensure early adoption of LLOD and provide the required incentives for the development of more complex infrastructures in future stages. In a second stage, semantic search engines might index multilingual lexical information available on the Web and support answering ad hoc queries in any language. More complex models and services would be developed in this second stage, supporting cross-lingual natural language processing applications requiring deeper multilingual lexical knowledge. Finally, the third stage would be more user-centered, with people more motivated to provide multilingual lexical information. An ecosystem of services would be available for cross-language querying, on-demand translation, cross-lingual mappings, etc. Search engines might be able to process natural language questions in any language and adapt their result presentation to conventions of the linguistic and cultural community to which the user belongs.
As our literature analysis attests, there has been substantial progress in the field over the last ten years. However, this progress has not always moved in the direction predicted in the mentioned roadmap. Some goals have been accomplished, judging from the emergence of new models (e.g., lexicog [21]) and updated versions of other well-established ones (e.g., Lemon [164]), as well as the (still moderate) progress in cross-lingual link inference (e.g., the TIAD campaign [112]). However, the roadmap envisioned a more central role for the final Web user, who would be more aware of the incentives and rewards that publishing linguistic information as LD should bring. We are still far from that. Recent progress has been achieved mainly in academic contexts, for specialised studies with specialised linguistic data. This is not bad in itself, of course, and there are very successful stories in the application of LLOD for linguistic research (e.g., the LiLa57
In the rest of this section, we propose a new roadmap with the next steps that the community might take to address the challenges reported in Section 6, in order to attain an ecosystem of truly interoperable linguistic data on the Web, multilingual in nature, across different linguistic levels. These steps are not intended to be sequential and can overlap.
Step I. More robust and sustainable open infrastructures should be put in place to support small and medium-scale data providers who cannot afford their own hosting infrastructure. Since the technology already exists, this is a matter of promoting its adoption and carrying out new national and international LD projects with a clear focus on infrastructure development. In parallel, more educational efforts are needed to make the advantages of LLOD visible to a new generation of researchers and practitioners. While this step concerns LOD in general, it is of crucial importance for achieving a highly multilingual LLOD cloud, as this necessarily requires publishing many datasets of varying size and language coverage from many data publishers who cannot afford their own on-premise infrastructure.
Step II. New models, along with new systems for RDF generation and linking, will be developed to cover linguistic description levels currently under-represented in the LLOD cloud. This will enable truly cross-disciplinary linguistic research in multiple natural languages, at Web scale.
Step III. Development of an "observatory" to measure the quality and evolution of linguistic data on the Web along several dimensions (language, linguistic level, usage, etc.). Stable metadata models and repositories will be in place, with the ultimate aim of not only discovering relevant language resources, but actually accessing their data and enabling their direct re-use and interoperation. Metadata models are of tremendous importance in the Semantic Web and LOD in general. Their usage is, however, largely disregarded in the NLP community.58 Indeed, Ducel et al. [85] recently showed that around 32% of ACL research papers do not mention the language that is studied although they should.
Step IV. Massive population of the LLOD cloud with the maximum possible number of languages (thousands rather than hundreds) and resources. This will create a critical mass of data to be eventually exploited by final language applications, and should break the vicious circle in which lack of data is caused by lack of exploitation opportunities and vice versa.
Step V. Development of a fully fledged family of services for the easy upload and integration of multilingual linguistic data on the Web, language-independent access to and querying of linguistic data, and seamless integration of such data with NLP services and tools. This will also include user interfaces for browsing and editing linked data.
This systematic survey on the status of multilinguality and LLOD, built on the PRISMA method, aims to provide an overview of available representation models, resources, and approaches for and across different linguistic description levels, pointing out existing challenges and gaps. It contributes (i) a guide to the state of the art for researchers and practitioners interested in exposing their linguistic data as LLOD, with a focus on available approaches for specific linguistic description levels. Furthermore, it (ii) identifies open challenges and gaps in the support of specific linguistic description levels across multilingual LLOD resources. For the LLOD community, this survey presents a report on where to direct future joint efforts towards multilinguality and LLOD. Among the identified description levels, phonetics, phonology, pragmatics, and discourse structures have turned out to be the least explored, and correspondingly wanting in representation means. From a resource perspective, available formalisms have not necessarily resulted in a wide publication of linguistic data; e.g., corpora and typological databases are quite under-represented in the LLOD cloud. Finally, (iii) we present a solid basis for future best practices on how to represent, model, and link different linguistic description levels in a truly multilingual LLOD cloud. To this end, this article proposes an ideal ecosystem, that is, a step-by-step roadmap to linguistically rich multilingual LLOD, which addresses both general LLOD challenges and those particular to multilinguality.
The results of this article indicate that most individual description levels are well represented and that examples exist for most types of language resources; however, they also suggest that interoperability, the key asset of LLOD representation, should be more extensively explored for
One of the first and foremost challenges has been and still is
In terms of
Lastly, we have envisaged an ideal ecosystem for LLOD in the form of an open, multilingual and semantically interconnected linguistic data environment that facilitates access and interoperability, offering features that are universal, transdisciplinary, transnational, and translingual.
Footnotes
Acknowledgements
This article is based upon work from COST Action NexusLinguarum – European network for Web-centered linguistic data science (CA18209), supported by COST (European Cooperation in Science and Technology). It has been also partially supported by the Spanish project PID2020-113903RB-I00 (AEI/FEDER, UE), by DGA/FEDER, and by the
