Abstract
Since the inception of the Open Linguistics Working Group in 2010, there have been numerous efforts to transform language resources into Linked Data. The research field of Linguistic Linked Data (LLD) has gained in importance, visibility and impact, with the Linguistic Linked Open Data (LLOD) cloud now gathering over 200 resources. With this growth, new challenges have emerged concerning particular domain and task applications, quality dimensions, and linguistic features to take into account. This special issue aims to review and summarize the progress and status of LLD research in recent years, as well as to offer an understanding of the challenges ahead of the field for the years to come. The papers in this issue indicate that there are still aspects to address for a wider community adoption of LLD, as well as a lack of resources for specific tasks and (interdisciplinary) domains. Likewise, the integration of LLD resources into Natural Language Processing (NLP) architectures and the search for long-term infrastructure solutions to host LLD resources remain essential points to address in the foreseeable future of this line of research.
Introduction
Linguistic Linked Data (LLD) refers to the application of linked data principles to the representation and publication of linguistic resources, including lexica, dictionaries, corpora, terminologies, metadata repositories and linguistic ontologies. As is the case with data from other domains, the linked data paradigm in this setting allows language resources to be shared in a uniform and interoperable way that facilitates and enhances their discovery, integration and reuse. One of the communities acting as a main driver in this endeavour has been the
The work in this community, together with other European projects and initiatives (e.g. LIDER2
This continuous growth is to a large extent due to the efforts coming from projects and initiatives in the fields of computational linguistics, computer science, information technology, lexicography, and applied linguistics. H2020 projects such as Lynx,4
In its initial stage, the linguistic linked data line of research motivated the development and/or conversion of multiple LLD datasets and vocabularies to account for the representation needs of different types of resources. Models such as the Lexicon Model for Ontologies (
With the rapid growth and the increasing interest in the use of linked data for NLP in this decade, new challenges have emerged concerning particular use cases, domain and task applications, quality dimensions, and linguistic features for which to account. Some of these aspects refer to the application of linked data principles in general and are not particularly tied to the linguistics domain, whereas others are concerned with the nature of the linguistic content of the resource.
Since the inception of the Open Linguistics Working Group in 2010, there have been many efforts to transform language resources into Linked Data. The research field of Linguistic Linked Data (LLD) has gained in importance, visibility and impact. The editors have thus decided to publish a special issue on the latest advancements in LLD with two main goals:
reviewing and summarizing the progress and status of LLD research in recent years
developing an understanding of the challenges ahead of the field for the years to come
The call for papers for the special issue was distributed over relevant mailing lists. In response, we received 13 relevant submissions. After a thorough reviewing process involving at least three reviewers per paper and at least two rounds of improvements, we accepted eight papers.
This issue includes papers on a range of topics such as models for linguistic data representation, metadata and its quality assessment, diachronicity, typology, and, closer to the NLP field, entity linking, terminology extraction, bilingual dictionary generation, and the detection of discourse relations.
The paper by Kahn et al. provides an overview of the current state of ontologies and vocabularies used to describe linguistic resources as linked data. The main emphasis lies on understanding how these models can support the FAIR publication of language resources. The authors survey the main vocabularies for describing corpora (NIF, Open Annotation), lexica and dictionaries (OntoLex Lemon, SKOS), terminologies (MMOn ontology and PHOIBLE) as well as linguistic resource metadata (METASHARE Ontology, ISOcat, GOLD), in addition to vocabularies for describing typological datasets. In particular, they summarize the main recent developments of a selection of these models. Further, the paper provides an overview of projects that are currently contributing to the growth and development of standards for the LLOD. The authors identify two main challenges for the future. First, they emphasize the need for reliable infrastructure that supports the longer-term hosting and accessibility of resources. Second, they highlight that, as the complexity of the models developed for the description of language resources as LLD increases, it becomes harder to support the consistent and correct use of the vocabularies by end users (engineers, linguists, etc.). For this, ontology design patterns and templates that describe best (modelling) practices would be an important asset.
The paper by Pia di Buono is concerned with analyzing the current state and adoption of standards for the description of metadata. They analyse the LOD Cloud and Annohub9
In their contribution “Glottocodes: Identifiers Linking Families, Languages and Dialects to Comprehensive Reference Information”, Hammarstrom and Forkel present Glottocodes, an identification system for
Armaselu et al. present a thorough survey addressing jointly the detection of semantic change in multilingual diachronic corpora in NLP and its representation as LLOD. The study focuses on the generation of diachronic ontologies, and aims to provide a first step towards bridging the gap between NLP and LLOD from the perspective of humanities research. To do so, the authors propose a workflow for this interdisciplinary study, revisiting the works on semantic change from different theoretical frameworks, its analysis and representation in the Semantic Web context, its detection with different NLP approaches and tools, and lastly the generation of diachronic linked data resources and their subsequent publication. A major challenge faced by the authors concerns the interdisciplinary nature of the study itself, involving lines of work with different maturity levels. In relation to this, the need for a framework to foster collaboration, exchanges and communication across the various research lines is highlighted. The limitations in representing temporal and dynamic information as LLOD and the absence of guidelines to generate diachronic ontologies represent significant barriers for the adoption of LLOD. Finally, and common to other lines of work involving the humanities community and LLOD researchers, the need for methods to facilitate the publication and maintenance of linked data for non-Semantic Web experts is identified by the authors.
Özel et al. address the scarcity of multilingual resources for discourse analysis. A major contribution of their work consists of two methods for discourse relation linking in the TED Multilingual Discourse Treebank (TED-MDB), one based on word alignment and a second one on cross-lingual sentence embeddings. The TED-MDB is annotated with discourse relations between sentences in six different languages. However, the annotations in the different languages were performed independently from each other, which hinders the cross-lingual analysis of discourse connectives and motivates the authors’ relation linking task. Their results show that the cross-lingual sentence embeddings approach outperforms the word alignment one, which is negatively affected by incorrectly derived sentence alignments. Thanks to the extracted relations, the authors gain new insights into discourse structures present in the TED-MDB, and generate bilingual discourse connectivity lexica relevant for machine translation, discourse studies and language teaching. The discourse relation linking task is still a challenging problem because the argument spans of relations vary across languages and multiple relations may overlap. In a broader sense, the availability of multilingual discourse resources (not necessarily linked data-based) remains low, although this work takes a step towards increasing it.
In their “Survey on English Entity Linking on Wikidata”, Cedric Moeller, Jens Lehmann and Ricardo Usbeck present the results from a survey on Entity Linking datasets and approaches in the context of Wikidata. The authors argue that the vast majority of Entity Linking approaches consider specific properties such as labels and descriptions, whereas contextual information, namely the structure and links of Wikidata, is rarely exploited. Furthermore, the survey reveals that most Entity Linking datasets are created as “mapped versions” of already existing datasets, i.e. they are not exclusively focused on Wikidata, and that time-variance and multilingualism are still poorly represented.
Over the last decade, a large number of dictionaries have been converted, linked and published as part of the LLOD cloud. The availability of linked dictionaries provides new opportunities for their exploitation. In the paper “Bilingual dictionary generation and enrichment via graph exploration”, Shashwat Goel, Jorge Gracia and Mikel L. Forcada propose a novel method that exploits the graph structure of existing bilingual linked dictionaries to infer new bilingual entries. The method has been applied and validated on the Apertium knowledge graph, where it produced new bilingual dictionaries with 70% of the size of the source Apertium dictionaries at a precision of 85%.
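To make the idea of inferring entries from the graph structure of linked dictionaries more concrete, the following is a minimal sketch of a naive pivot-based inference over a toy translation graph. It is not the method proposed by Goel et al.; the function names (`build_graph`, `infer_entries`), the toy entries and the pivot-count score are all assumptions introduced here for illustration only.

```python
from collections import defaultdict

# Toy bilingual entries: each pair is one translation edge between
# (language, word) nodes. Illustrative data only, not taken from Apertium.
EDGES = [
    (("en", "house"), ("es", "casa")),
    (("es", "casa"), ("ca", "casa")),
    (("en", "house"), ("fr", "maison")),
    (("fr", "maison"), ("ca", "casa")),
    (("en", "bank"), ("es", "banco")),
    (("es", "banco"), ("ca", "banc")),
]


def build_graph(edges):
    """Build an undirected adjacency map from translation edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def infer_entries(graph, src_lang, tgt_lang):
    """Propose (src_word, tgt_word) pairs that share a pivot node.

    Returns a dict mapping each candidate pair that is not already
    directly linked to the number of distinct pivot nodes supporting it.
    """
    support = defaultdict(set)
    for pivot, neighbours in graph.items():
        # A pivot must belong to a third language.
        if pivot[0] in (src_lang, tgt_lang):
            continue
        sources = [n for n in neighbours if n[0] == src_lang]
        targets = [n for n in neighbours if n[0] == tgt_lang]
        for s in sources:
            for t in targets:
                # Skip pairs that already exist as a direct entry.
                if t not in graph.get(s, set()):
                    support[(s[1], t[1])].add(pivot)
    return {pair: len(pivots) for pair, pivots in support.items()}


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for pair, score in sorted(infer_entries(graph, "en", "ca").items()):
        print(pair, score)
```

In this sketch, a candidate pair supported by several pivots is treated as more plausible than one supported by a single pivot, so a threshold on the score would trade the size of the generated dictionary against its precision. Polysemous pivot words can still produce wrong candidates, which is one reason why more elaborate graph exploration strategies are needed in practice.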
Multilingual terminologies play an important role in many language technology solutions. Their creation typically requires a significant amount of human effort and, because they are available in different formats, their reuse is limited. In “TermitUp: Generation and Enrichment of Linked Terminologies”, Patricia Martín-Chozas, Karen Vázquez-Flores, Pablo Calleja, Elena Montiel-Ponsoda and Víctor Rodríguez-Doncel present TermitUp, a service for automated extraction of domain-specific terminologies and their enrichment with data from the Linguistic LOD cloud. The created terminologies are validated, linked with other terminological resources and published as part of the Linguistic LOD cloud.
The papers in the special issue clearly convey that the field of linguistic linked data (LLD) has certainly progressed and matured over the years. The field has seen a convergence in terms of ontologies / models used for the description of data and metadata and best practices have been identified. Yet, the special issue also shows that there are important challenges to address in the future.
In particular, solutions that guarantee the long-term availability of LLD resources are still to be thoroughly discussed and addressed as a key aspect for the future development of the LLOD cloud and the line of research as a whole. The involvement of the wider community in data curation, feedback or request gathering through a suitable infrastructure is also a significant point to bear in mind, with suggested solutions such as the one presented by Hammarstrom and Forkel.
The Linguistic LOD cloud has been under development for over a decade now and we already see some works that exploit its potential, such as the work on bilingual dictionary creation by Shashwat Goel et al. and the domain-specific terminology extraction by Patricia Martín-Chozas et al. However, the potential of the LLOD cloud is enormous, and we expect to see many more works in the near future that exploit the datasets available as part of the LLOD cloud.
