Abstract
Since the inception of the Open Linguistics Working Group in 2010, there have been numerous efforts in transforming language resources into Linked Data. The research field of Linguistic Linked Data (LLD) has gained in importance, visibility and impact, with the Linguistic Linked Open Data (LLOD) cloud gathering nowadays over 200 resources. With this increasing growth, new challenges have emerged concerning particular domain and task applications, quality dimensions, and linguistic features to take into account. This special issue aims to review and summarize the progress and status of LLD research in recent years, as well as to offer an understanding of the challenges ahead of the field for the years to come. The papers in this issue indicate that there are still aspects to address for a wider community adoption of LLD, as well as a lack of resources for specific tasks and (interdisciplinary) domains. Likewise, the integration of LLD resources into Natural Language Processing (NLP) architectures and the search for long-term infrastructure solutions to host LLD resources continue to be essential points to which to attend in the foreseeable future of the research line.
Keywords
Introduction
Linguistic Linked Data (LLD) refers to the application of linked data principles to the representation and publication of linguistic resources, including lexica, dictionaries, corpora, terminologies, metadata repositories and linguistic ontologies. As is the case with data from other domains, the linked data paradigm in this setting allows language resources to be shared in a uniform and interoperable way that facilitates and enhances their discovery, integration and reuse. One of the communities acting as a main driver in this endeavour has been the Open Linguistics Working Group (OWLG) [3,8], gathered in the context of the Open Knowledge Foundation (OKFN).1
The work in this community, together with other European projects and initiatives (e.g. LIDER2) in the past decade, has led to the creation of the Linguistic Linked Open Data (LLOD) cloud, an LOD subset comprising linguistic resources [2,8].3 In 2016, the LLOD cloud had already grown by a factor of 4 since its first instantiation [8], and it was growing at 19.3% per year in the period 2018-2020 [3]. Nowadays, it amounts to more than 200 resources (on the publication of these LLOD cloud resources and their metadata, see di Buono et al., this volume). This continuous growth is to a large extent due to the efforts coming from projects and initiatives in the fields of computational linguistics, computer science, information technology, lexicography, and applied linguistics. H2020 projects such as Lynx,4
In its initial stage, the linguistic linked data line of research motivated the development and/or conversion of multiple LLD datasets and vocabularies to account for the representation needs of different types of resources. Models such as the Lexicon Model for Ontologies (lemon) [7] (and its successor OntoLex [10]), the NLP Interchange Format (NIF) [6], the Meta-Share Ontology [11] or the Ontologies for Linguistic Annotation (OLiA) [4] emerged. Likewise, datasets commonly used as a linking backbone were represented as linked data or newly developed (WordNet [9,13], BabelNet [12]), and linguistic linked data categories such as LexInfo [5] started to draw the attention of the community and slowly became a de facto standard to encode morphosyntactic categories. As of today, there are multiple vocabularies (see Fahad Khan et al. in this volume) and resources providing numerous options to link to relevant and supplementary data enriching the content of a given language resource.
With the rapid growth and the increasing interest in the use of linked data for NLP in this decade, new challenges have emerged concerning particular use cases, domain and task applications, quality dimensions, and linguistic features for which to account. Some of these aspects refer to the application of linked data principles in general and are not particularly tied to the linguistics domain, whereas others are concerned with the nature of the linguistic content of the resource.
Since the inception of the Open Linguistics Working Group in 2010, there have been many efforts in transforming language resources into Linked Data. The research field of Linguistic Linked Data (LLD) has gained in importance, visibility and impact. The editors have thus decided to publish this special issue on the latest advancements in LLD with two main goals:
reviewing and summarizing the progress and status of LLD research in recent years
developing an understanding of the challenges ahead of the field for the years to come
The call for papers for the special issue was distributed over relevant mailing lists. In response to our call, we received 13 relevant submissions. After a thorough reviewing process involving at least three reviewers per paper and at least two rounds of improvements, we decided to accept eight papers.
This issue includes papers on a range of topics such as models for linguistic data representation, metadata and its quality assessment, diachronicity, typology, and, closer to the NLP field, entity linking, terminology extraction, bilingual dictionary generation, and the detection of discourse relations.
The paper by Khan et al. provides an overview of the current state of ontologies and vocabularies used to describe linguistic resources as linked data. The main emphasis lies on understanding how these models can support the FAIR publication of language resources. The authors survey the main vocabularies for describing corpora (NIF, Open Annotation), lexica and dictionaries (OntoLex Lemon, SKOS), terminologies (MMOn ontology and PHOIBLE) as well as linguistic resource metadata (METASHARE Ontology, ISOcat, GOLD), in addition to vocabularies for describing typological datasets. In particular, the authors summarize the main recent developments of a selection of these models. Further, the paper provides an overview of projects that are currently contributing to the growth and development of standards for the LLOD. As main challenges for the future, the authors first emphasize the need for reliable infrastructure that supports the longer-term hosting and accessibility of resources. Second, they highlight that, as the complexity of the models developed for the description of language resources as LLD increases, it becomes a challenge to support the consistent and correct use of the vocabularies by end users (engineers, linguists, etc.). For this, ontology design patterns and templates that describe best (modelling) practices would be an important asset.
The paper by di Buono et al. is concerned with analyzing the current state and adoption of standards for the description of metadata. They analyse the LOD Cloud and Annohub9
In their contribution “Glottocodes: Identifiers Linking Families, Languages and Dialects to Comprehensive Reference Information”, Hammarstrom and Forkel present Glottocodes, an identification system for languoids (languages, dialects and language families) in Glottolog. Glottolog gathers language data and supporting bibliographic references, grouping data into specific levels (dialect, language, subfamily, etc.). This categorisation, however, might change over time, given the controversies on language vs. dialect status or as more knowledge on a particular languoid is recorded. The authors propose Glottocodes as a system of persistent identifiers for languoids which improves on the ISO 639-3 language identifiers. The resource has been conceived to support machine readability and stays independent of the level of linguistic abstraction (idiolect, dialect, language, family, etc.) and its potential changes. As such, one of the key challenges addressed by their proposal is the dynamic nature of languoids and hence the need for technological solutions that accommodate continuous changes and updates. The authors also address the topic of an optimal infrastructure to facilitate data curation and the involvement of the wider community. In the case of Glottolog, the authors turned to
Armaselu et al. present a thorough survey addressing jointly the detection of semantic change in multilingual diachronic corpora in NLP and its representation as LLOD. The study focuses on the generation of diachronic ontologies, and aims to provide a first step towards bridging the gap between NLP and LLOD from the perspective of humanities research. To do so, the authors propose a workflow for this interdisciplinary study, revisiting the works on semantic change from different theoretical frameworks, its analysis and representation in the Semantic Web context, its detection with different NLP approaches and tools, and lastly the generation of diachronic linked data resources and their subsequent publication. A major challenge faced by the authors concerns the interdisciplinary nature of the study itself, involving lines of work with different maturity levels. In relation to this, the need for a framework to foster collaboration, exchanges and communication across the various research lines is highlighted. The limitations in representing temporal and dynamic information as LLOD and the absence of guidelines to generate diachronic ontologies represent significant barriers for the adoption of LLOD. Finally, and common to other lines of work involving the humanities community and LLOD researchers, the need for methods to facilitate the publication and maintenance of linked data for non-Semantic Web experts is identified by the authors.
Özel et al. address the scarcity of multilingual resources for discourse analysis. A major contribution of their work is a pair of methods for discourse relation linking in the TED Multilingual Discourse Treebank (TED-MDB), one based on word alignment and the other on cross-lingual sentence embeddings. The TED-MDB is annotated with discourse relations between sentences in six different languages. However, the annotations for the different languages were performed independently of each other, which hinders the cross-lingual analysis of discourse connectives and motivates the authors’ relation linking task. Their results show that the cross-lingual sentence embeddings approach outperforms the word alignment one, which is negatively affected by incorrectly derived sentence alignments. Thanks to the extracted relations, the authors gain new insights into discourse structures present in the TED-MDB, and generate bilingual discourse connective lexica relevant for machine translation, discourse studies and language teaching. Discourse relation linking remains a challenging problem because the argument spans of relations vary across languages and multiple relations may overlap. In a broader sense, the availability of multilingual discourse resources (not necessarily linked data-based) remains low, although this work takes a step towards increasing it.
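The general idea behind embedding-based relation linking can be sketched as follows: each relation is represented by a vector, and a relation in one language is linked to the most similar relation in the other language if the similarity exceeds a threshold. This is only a minimal illustration and not the pipeline of Özel et al.; the three-dimensional vectors stand in for real cross-lingual sentence embeddings, and the identifiers, the function name `link_relations` and the threshold of 0.8 are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_relations(source, target, threshold=0.8):
    """Link each source relation to its most similar target relation.

    A link is only created if the best similarity reaches the threshold
    (a hypothetical value chosen for this toy example).
    """
    links = {}
    for s_id, s_vec in source.items():
        best_id, best_score = None, threshold
        for t_id, t_vec in target.items():
            score = cosine(s_vec, t_vec)
            if score >= best_score:
                best_id, best_score = t_id, score
        if best_id is not None:
            links[s_id] = best_id
    return links


# Invented vectors standing in for embeddings of annotated relations.
EN = {"en-1": [0.9, 0.1, 0.0], "en-2": [0.0, 0.2, 0.9]}
DE = {
    "de-1": [0.1, 0.1, 0.95],
    "de-2": [0.85, 0.2, 0.05],
    "de-3": [0.0, 1.0, 0.0],
}

links = link_relations(EN, DE)
print(links)
```

In a realistic setting the vectors would come from a multilingual sentence encoder, and overlapping relations or diverging argument spans would require additional handling that this sketch does not attempt.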
In their “Survey on English Entity Linking on Wikidata”, Cedric Moeller, Jens Lehmann and Ricardo Usbeck present the results of a survey on Entity Linking datasets and approaches in the context of Wikidata. The authors argue that the vast majority of Entity Linking approaches consider specific properties such as labels and descriptions, whereas the contextual information provided by the structure and links of Wikidata is rarely exploited. Furthermore, the survey reveals that most of the Entity Linking datasets are created as “mapped versions” of already existing datasets, i.e. they are not exclusively focused on Wikidata, and that time-variance and multilingualism are still poorly represented.
Over the last decade, a large number of dictionaries have been converted, linked and published as part of the LLOD cloud. The availability of linked dictionaries provides new opportunities for their exploitation. In the paper “Bilingual dictionary generation and enrichment via graph exploration”, Shashwat Goel, Jorge Gracia and Mikel L. Forcada propose a novel method that exploits the graph structure of existing bilingual linked dictionaries to infer new bilingual entries. The method has been applied and validated on the Apertium knowledge graph, producing new bilingual dictionaries with 70% of the size of the source Apertium dictionaries at a precision of 85%.
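To give a rough sense of how the graph structure of linked bilingual dictionaries can yield new entries, the following minimal sketch proposes a translation whenever two words are connected through a pivot word in a third language. This simple transitive heuristic is chosen purely for illustration and is not the method of Goel et al.; the toy entries are invented rather than taken from Apertium, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy bilingual entries as undirected edges between (language, word) nodes.
# The data is invented for illustration and is not taken from Apertium.
EDGES = [
    (("en", "house"), ("es", "casa")),
    (("es", "casa"), ("ca", "casa")),
    (("en", "dog"), ("es", "perro")),
    (("es", "perro"), ("ca", "gos")),
]


def build_graph(edges):
    """Build an adjacency map from a list of translation edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def infer_via_pivot(graph, source_lang, target_lang):
    """Propose source-target pairs linked through exactly one pivot node."""
    candidates = set()
    for node, neighbours in graph.items():
        if node[0] != source_lang:
            continue
        for pivot in neighbours:
            # The pivot must belong to a third language.
            if pivot[0] in (source_lang, target_lang):
                continue
            for other in graph[pivot]:
                # Skip pairs that already exist as direct entries.
                if other[0] == target_lang and other not in neighbours:
                    candidates.add((node[1], other[1]))
    return sorted(candidates)


graph = build_graph(EDGES)
inferred = infer_via_pivot(graph, "en", "ca")
print(inferred)
```

A purely transitive heuristic of this kind would propagate errors caused by polysemous pivot words, which is one reason why inferred entries need to be validated, for instance by measuring precision against a reference dictionary.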
Multilingual terminologies play an important role in many language technology solutions. Their creation typically requires a significant amount of human effort and, because they are available in different formats, their reuse is limited. In “TermitUp: Generation and Enrichment of Linked Terminologies”, Patricia Martín-Chozas, Karen Vázquez-Flores, Pablo Calleja, Elena Montiel-Ponsoda and Víctor Rodríguez-Doncel present TermitUp, a service for automated extraction of domain-specific terminologies and their enrichment with data from the Linguistic LOD cloud. The created terminologies are validated, linked with other terminological resources and published as part of the Linguistic LOD cloud.
The papers in the special issue clearly convey that the field of linguistic linked data (LLD) has certainly progressed and matured over the years. The field has seen a convergence in terms of ontologies / models used for the description of data and metadata and best practices have been identified. Yet, the special issue also shows that there are important challenges to address in the future.
Adoption. It is still difficult to discover and reuse LLD. As identified by Maria Pia di Buono et al., interoperable metadata standards are not yet fully available and used. This is a clear obstacle to the use of LLD datasets and a significant barrier to their practical adoption. The paper by Armaselu et al. also points out an important challenge regarding adoption, which emerges from the difficulty of advancing interdisciplinary approaches that combine lines of work with varying levels of maturity, and from the need to set up frameworks that facilitate exchange and collaboration across fields. In this regard, bringing LLD generation, publication and maintenance closer to non-experts in the Semantic Web remains an important line of research and dissemination.
Representation needs. Although the availability of LLD-related vocabularies and their coverage has significantly increased in the past years, further work and best practices are needed to address the representation needs of linguistic data relevant for areas under-represented in the LLOD cloud (e.g. as opposed to synchronic lexical semantics). This is the case of diachronic information, as pointed out by Armaselu et al. Also in relation to language as the object of study, the persistent identifiers proposed by Hammarstrom and Forkel (Glottocodes), which remain agnostic to changes in the description of languages, language families, dialects, etc. in Glottolog, serve as a reminder of the need for solutions that adapt to potential updates as more information on a given language is obtained. This is particularly relevant for languages that are under-resourced as of today.
Scarcity of resources. Closely tied to the previous point, the low number of cross-lingual discourse resources available, as pointed out by Özel et al., indicates that further work in that line would also lead to a greater balance in the availability of potential LLD resources and models covering different linguistic levels. This scarcity also holds for the availability of datasets for particular NLP tasks such as Entity Linking. As identified by Moeller et al., there is a lack of Entity Linking evaluation datasets which consider multilingualism and time-variance.
Integration into NLP architectures. A further challenge is to facilitate the integration of linguistic linked datasets into NLP architectures and systems so that systems can be easily ported to work on a new dataset. Again, this requires the consistent use of vocabularies and standards to describe the language resources at the content level, so that resources become directly pluggable into workflows.
Infrastructure. The need for a reliable infrastructure that supports the longer-term hosting and accessibility of resources has been called to attention by Khan et al. as well as by other works in the recent literature [1]. A challenge for the future is to identify requirements regarding the availability of LLD resources and develop (business) models to cover the costs incurred by the services required to maintain availability. A further challenge lies in creating incentives to foster involvement of the whole community in the curation and enrichment of data.
However, solutions that guarantee the availability of LLD resources in the long term are still to be thoroughly discussed and addressed as a key aspect for the future development of the LLOD cloud and the line of research as a whole. The involvement of the wider community in data curation and in feedback or request gathering through a suitable infrastructure is also a significant point to bear in mind, with suggested solutions such as the one presented by Hammarstrom and Forkel.
The Linguistic LOD cloud has been under development for over a decade now and we already see some works which exploit its potential, such as the work on bilingual dictionary creation by Shashwat Goel et al. and the domain-specific terminology extraction by Patricia Martín-Chozas et al. However, the potential of the LLOD cloud is enormous and we expect to see many more works in the near future which will exploit the datasets available as part of the LLOD cloud.
