Abstract
Representing provenance information for data is of crucial importance for data reuse. This is in particular the case for language resources such as annotated corpora. NIF has been proposed as an RDF vocabulary to support the representation of text data together with annotations. However, NIF suffers from severe shortcomings with respect to its ability to represent provenance information. As a remedy to this, we present MOND, a new glue ontology that implements an interface between NIF and the PROV-O ontology to support the inclusion of provenance information into NIF annotated datasets. We first present an approach that reifies annotations and allows the attachment of any provenance metadata to annotations at arbitrary granularity. We show that this approach has an important drawback as it roughly doubles the size of the data. Building on this observation, we design the MOND glue ontology that implements a modular approach in which annotation metadata is not attached to single annotations but to modules that represent collections of annotations of the same type and origin. This yields a moderate increase in data size, while maintaining all the benefits of the first approach. We validate our approach on three use cases that represent prototypical needs in corpus work.
Get full access to this article
View all access options for this article.
