Abstract
This paper describes the Ontologies of Linguistic Annotation (OLiA) as one of the data sets currently available as part of the Linguistic Linked Open Data (LLOD) cloud. Within the LLOD cloud, the OLiA ontologies serve as a reference hub for annotation terminology for linguistic phenomena in a broad range of languages. They have been used to facilitate interoperability and information integration of linguistic annotations in corpora, NLP pipelines, and lexical-semantic resources, and they mediate their linking with multiple community-maintained terminology repositories.
Background
The heterogeneity of linguistic annotations has been recognized as a key problem limiting the interoperability and reusability of NLP tools and linguistic data collections. Several repositories of linguistic annotation terminology have been developed to facilitate annotation interoperability by means of a joint level of representation, or an ‘interlingua’, the most prominent probably being the General Ontology of Linguistic Description [12, GOLD] and the ISO TC37/SC4 Data Category Registry [20, ISOcat].
Still, these repositories are developed by different communities, and are thus not always compatible with each other, neither with respect to definitions nor technologies (e.g., there is no commonly agreed formalism to link linguistic annotations to terminology repositories).
The Ontologies of Linguistic Annotation (OLiA) have been developed to facilitate the development of applications that benefit from a well-defined terminological backbone even before the GOLD and ISOcat repositories have converged into a generally accepted repository of reference terminology. They introduce an intermediate level of representation between ISOcat, GOLD and other repositories of linguistic reference terminology, and they are interconnected with these resources. They provide means to formalize not only reference categories, but also annotation schemes and the way these are linked with reference categories.
Architecture
The OLiA ontologies were developed as part of an infrastructure for the sustainable maintenance of linguistic resources [28], and their primary fields of application include the formalization of annotation schemes and concept-based querying over heterogeneously annotated corpora [10,24].
In the OLiA architecture, four different types of ontologies are distinguished (cf. Fig. 1 for an example):
The Multiple For every Annotation Model, a Existing terminology repositories can be integrated as
The OLiA Reference Model specifies classes for linguistic categories (e.g.,
Conceptually, Annotation Models differ from the Reference Model in that they include not only concepts and properties, but also individuals: Individuals represent concrete tags, while classes represent abstract concepts similar to those of the Reference Model. Figure 1 gives an example for the individual
Data set description
The OLiA ontologies are available from
The OLiA ontologies cover different grammatical phenomena, including inflectional morphology, word classes, phrase and edge labels of different syntax annotations, as well as prototypes for discourse annotations (coreference, discourse relations, discourse structure and information structure). Annotations for lexical semantics are only covered to the extent that they are found in syntactic and morphosyntactic annotation schemes. Other aspects of lexical semantics are beyond the scope of OLiA. Existing reference resources for lexical semantics available in RDF include WordNet, VerbNet and FrameNet; their linking with OLiA is recommended as part of the lexicon model
Since their first presentation [3], the OLiA ontologies have been continuously extended. At the time of writing, the OLiA Reference Model distinguishes 280

Interpreting annotations in terms of the OLiA Reference Model.
As for morphological, morphosyntactic and syntactic annotations, the OLiA ontologies include 32 Annotation Models for about 70 different languages, including several multi-lingual annotation schemes, e.g., EAGLES [3] for 11 Western European languages, and MULTEXT/East [8] for 15 (mostly) Eastern European languages. As for non-(Indo-)European languages, the OLiA ontologies include morphosyntactic annotation schemes for languages of the Indian subcontinent, for Arabic, Basque, Chinese, Estonian, Finnish, Hausa, Hungarian and Turkish, as well as multi-lingual schemes applied to languages of Africa, the Americas, the Pacific and Australia. The OLiA ontologies also cover historical varieties, including Old High German, Old Norse and Old Tibetan. Additionally, 7 Annotation Models for different resources with discourse annotations have been developed.
External reference models currently linked to the OLiA Reference Model include GOLD [3], the OntoTag ontologies [2], an ontological remodeling of ISOcat [4], and the Typological Database System (TDS) ontologies [26]. Of these, only the TDS ontologies are currently available under an open (CC-BY) license,1
Actually, the developers are sympathetic to the idea of releasing this data under an open license (Helen Aristar-Dry for GOLD and Menzo Windhouwer for ISOcat, personal communication, June 2012).
In this context, the function of the OLiA Reference Model is not to provide a novel and independent view on linguistic terminology, but rather to serve as a stable intermediate representation between (ontological models of) annotation schemes and these terminology repositories. This allows any concept that can be expressed in terms of the OLiA Reference Model also to be interpreted in the context of ISOcat, GOLD, OntoTag or TDS. OLiA serves to aggregate annotation terminology as found in linguistic resources and provides a middle ground between these and the External Reference Models linked to it. We would like to emphasize that OLiA is not meant as a substitute for any of these repositories, but rather, that it serves to facilitate their further harmonization and interoperability, as they are maintained by different communities and remain for the foreseeable future in a continuous state of enrichment and specialization. Initial efforts towards their gradual convergence include the support of linking mechanisms to external knowledge bases in GOLD and ISOcat. Within a GOLD context, for example, OLiA may be referred to as a Community-of-Practice Extension for the NLP community. From the perspective of ISOcat, it may be seen as an ontological view on annotation terminology among the otherwise unstructured data categories. Along with ontologies for other ISOcat profiles, e.g., metadata [35], OLiA may provide a seed for populating RELcat [29], an ongoing effort to provide structured views on ISOcat data.
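The mediating role described above can be sketched in a few lines of Python. This is an illustration only: it uses plain dictionaries in place of OWL2/DL axioms, and all identifiers (the `ex:` annotation concept, the `ref:` concepts, and the `extA:`/`extB:` targets) are hypothetical placeholders, not actual OLiA, GOLD or ISOcat names.

```python
# Hypothetical linking tables; the real OLiA models are OWL2/DL ontologies.

# Linking Model: Annotation Model concept -> Reference Model concept
ANNOTATION_TO_REFERENCE = {
    "ex:DemAttrTag": "ref:DemonstrativeDeterminer",
}

# Reference Model taxonomy: concept -> direct superconcept
REFERENCE_SUPERCLASS = {
    "ref:DemonstrativeDeterminer": "ref:Determiner",
}

# Linking of Reference Model concepts to External Reference Models
REFERENCE_TO_EXTERNAL = {
    "ref:DemonstrativeDeterminer": {"extA:Demonstrative"},
    "ref:Determiner": {"extA:Determiner", "extB:determiner"},
}


def reference_concepts(annotation_concept):
    """Return the linked Reference Model concept and all its superconcepts."""
    concepts = []
    current = ANNOTATION_TO_REFERENCE.get(annotation_concept)
    while current is not None:
        concepts.append(current)
        current = REFERENCE_SUPERCLASS.get(current)
    return concepts


def external_interpretations(annotation_concept):
    """Collect every external concept reachable via the Reference Model."""
    found = set()
    for concept in reference_concepts(annotation_concept):
        found |= REFERENCE_TO_EXTERNAL.get(concept, set())
    return found


print(sorted(external_interpretations("ex:DemAttrTag")))
```

In this sketch, the annotation concept is linked only once, to the Reference Model. Its interpretation in the external repositories follows from the links that the Reference Model already carries.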
As compared to a direct linking between annotation models and these terminology repositories, the modular structure limits the number of linkings that need to be defined (if a new Annotation Model is linked to the Reference Model, it inherits its linking with ISOcat, GOLD, OntoTag and TDS), and also, it provides stability (GOLD and ISOcat are developed in community processes with occasional revisions), a clear and non-redundant taxonomical organization (similar to GOLD, TDS and OntoTag, but very different from the semi-structured ISOcat) and establishes interoperability between GOLD and ISOcat (that – despite ongoing harmonization efforts [19] – are maintained by different communities and developed independently). Using the OLiA Reference Model, it is thus possible to develop applications that are interoperable in terms of GOLD
Initially, the OLiA ontologies have been intended to serve a
In earlier
In a similar vein, OLiA can be employed in
Figure 1 illustrates how annotations can be mapped onto Reference Model concepts for the German phrase
These ontology-based descriptions are comparable across different corpora and/or NLP tools, across different languages, and even across different types of language resources: Recently, the OLiA ontologies have also been applied to represent grammatical specifications of machine-readable dictionaries, which are thus interoperable with OLiA-linked corpora [11,21]. Moreover, through the linking with External Reference Models, OLiA-linked resources are also interoperable with resources directly grounded in GOLD, ISOcat, etc.
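As a rough illustration of this comparability, the following Python sketch queries two toy corpora by Reference Model concept rather than by tag. The scheme names, tags, tokens and `ref:` concepts are all invented for the example, and the concept sets are assumed to already include superconcepts. In an actual setting, these would be inferred from the OWL2/DL ontologies.

```python
# Hypothetical Linking Models: each tag is mapped to the set of Reference
# Model concepts it instantiates (superconcepts already included).
LINKING = {
    "schemeA": {
        "DEM": {"ref:DemonstrativeDeterminer", "ref:Determiner"},
        "NN": {"ref:CommonNoun", "ref:Noun"},
    },
    "schemeB": {
        "DT": {"ref:Determiner"},
        "N": {"ref:Noun"},
    },
}

# Two toy corpora as (token, tag) pairs, each annotated with its own scheme.
CORPORA = {
    "schemeA": [("tok1", "DEM"), ("tok2", "NN")],
    "schemeB": [("tok3", "DT"), ("tok4", "N")],
}


def query(concept):
    """Return (scheme, token, tag) for every token whose tag maps to concept."""
    hits = []
    for scheme, tokens in CORPORA.items():
        for token, tag in tokens:
            if concept in LINKING[scheme].get(tag, set()):
                hits.append((scheme, token, tag))
    return hits


print(query("ref:Determiner"))
print(query("ref:DemonstrativeDeterminer"))
```

The first query matches tokens from both corpora, although their tags differ. The second matches only the corpus whose tagset is fine-grained enough to express the more specific concept.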
Using Semantic Web formalisms to represent corpora and annotations also provides us with the possibility to develop novel,
We see possible applications of this technology in situations where multiple, domain-specific NLP tools are available. In a monolingual setting, this may be the case where rule-based morphologies [34] or parsers [32] are to be combined with robust statistical part-of-speech taggers, whose coarse-grained tagsets cannot be trivially mapped onto the detailed annotations provided by deep, rule-based systems. Here, OLiA representations leverage tools with different granularity. We are currently experimenting with multilingual annotation projection, where annotations are projected from
This paper summarized the development of the OLiA ontologies since 2006, their current status, and a number of applications that have been developed on this basis.
The fundamental idea of the OLiA architecture is that annotation schemes are linked to community-maintained terminology repositories through an intermediate ‘Reference Model’, thereby minimizing the number of mappings necessary to establish interoperability of one annotation scheme with multiple terminology repositories. Further, annotation schemes and their linking to the Reference Model are formalized as separate OWL2/DL ontologies, so that interpretation-independent conceptualization (annotation documentation) and its interpretation in terms of the Reference Model (linking) are properly distinguished.
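The reduction in the number of mappings can be made concrete with a small calculation. The sketch below rests on a simplifying assumption that is not stated in the text, namely that every annotation scheme is to be made interoperable with every terminology repository. The function names are illustrative, and the figures (32 Annotation Models, 4 External Reference Models) are those given in the data set description.

```python
def direct_mappings(n_schemes, n_repositories):
    """Every annotation scheme is mapped to every repository separately."""
    return n_schemes * n_repositories


def hub_mappings(n_schemes, n_repositories):
    """Every scheme and every repository is linked once to the Reference Model."""
    return n_schemes + n_repositories


n_schemes = 32       # morphological, morphosyntactic and syntactic Annotation Models
n_repositories = 4   # GOLD, OntoTag, ISOcat, TDS

print(direct_mappings(n_schemes, n_repositories))  # 128
print(hub_mappings(n_schemes, n_repositories))     # 36
```

Under this assumption, adding one more annotation scheme costs a single new Linking Model in the hub design, rather than one mapping per repository.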
The OLiA ontologies differ from related approaches in that they focus on modeling annotation schemes and their linking with reference categories rather than merely providing reference categories. The differentiation of Annotation Models, the OLiA Reference Model and External Reference Models (community-maintained terminology repositories) represents increasing levels of abstraction, and, possibly, loss of information. However, no information about the original annotation is lost, and tools may choose the appropriate level of abstraction. Unlike a direct mapping approach as apparently favored by GOLD and ISOcat, OLiA makes it possible to recover information about sources of mismatches between Reference Model concepts and Annotation Model concepts, and its declarative linking supports inspection and refinement using standard RDF/OWL tools.
The relationship between annotations and reference concept is not only represented in a transparent way, but also, conceptual
Moreover, negation (¬) is available in OWL2/DL. This is of particular importance for the linking between External Reference Models and the OLiA Reference Model. For example, an
The physical separation of Linking Models from Annotation Models and Reference Model introduces a clear distinction between externally provided information and the ontology engineer’s interpretation. Annotation Models formalize annotation documentation, and the Reference Model is based on a generalization over a broad range of resources. However, there may be different terminological traditions involved, so that apparently similar concepts found in Reference Model and Annotation Model are in fact unrelated. If an incorrect identification nevertheless takes place, the linking can be inspected with existing ontology browsers and corrected independently of the interpretation-invariant Annotation Model and Reference Model. Furthermore, multiple
In ISOcat, the problem of conflicting interpretations of data categories is currently Providing a top-down perspective does not automatically disclose such inconsistencies, but the resulting dialog between tagset provider and ontology developer may facilitate their detection, as in the example given above.
In comparison to GOLD, OLiA is more focused on NLP and corpus interoperability, whereas GOLD originates from the language documentation community. Therefore, a number of data categories commonly assumed in NLP were not originally represented in GOLD. For example,
Conceptually, the OLiA ontologies are more closely related to the OntoTag ontologies [1], which were also applied to develop NLP applications on the basis of ontological representations of linguistic annotations [23]. The important differences are that the OntoTag ontologies consider only Iberian Romance languages (in particular Spanish), that they are partially designed with a top-down perspective (whereas the development of the OLiA Reference Model is guided by the annotation schemes it is applied to) and are thus richer in consistency constraints (which are, however, often language-specific), and that the OntoTag ontologies are not publicly available at the moment. Within the OLiA architecture, the morphosyntactic layer of the OntoTag ontologies is integrated as an External Reference Model [2].
The OLiA ontologies may play an important role in NLP, corpus and annotation interoperability in that they relate these activities to initiatives in different linguistic communities to establish reference repositories for linguistic annotation terminology, e.g., recent developments towards the creation of a Linguistic Linked Open Data (LLOD) cloud.
Acknowledgments
We would like to thank Menzo Windhouwer, Steve Cassidy and four anonymous reviewers for valuable feedback and comments on this paper and its immediate predecessor [7]; beyond this, we thank OLiA users, contributors and funders. OLiA was originally developed at the Collaborative Research Center (SFB) 441 “Linguistic Data Structures” (University of Tübingen, Germany) in the context of the project “Sustainability of Linguistic Resources” in cooperation with SFB 632 “Information Structure” (University of Potsdam, Humboldt-University Berlin, Germany) and SFB 538 “Multilingualism” (University of Hamburg, Germany) from 2006 to 2008. From 2007 to 2011, it was maintained and further developed at SFB 632 in the context of the project “Linguistic Data Base”. In 2012, the first author continued his research on OLiA in the context of a postdoctoral fellowship at the Information Sciences Institute of the University of Southern California funded by the German Academic Exchange Service (DAAD). The work of the second author was conducted in the context of the LOEWE cluster “Digital Humanities” at the Goethe-University Frankfurt (2011–2014).
In parts, this data set description is based on [7], shortened, updated and thoroughly revised.
