On the origin of annotations: A module-based approach to representing annotations in the Natural Language Processing Interchange Format (NIF)

Abstract

Representing provenance information is crucial for data reuse, in particular for language resources such as annotated corpora. The Natural Language Processing Interchange Format (NIF) has been proposed as an RDF vocabulary for representing text data together with its annotations. However, NIF has severe shortcomings in its ability to represent provenance information. As a remedy, we present MOND, a new glue ontology that implements an interface between NIF and the PROV-O ontology so that provenance information can be included in NIF-annotated datasets. We first present an approach that reifies annotations and allows arbitrary provenance metadata to be attached to annotations at arbitrary granularity. We show that this approach has an important drawback: it roughly doubles the size of the data. Building on this observation, we design the MOND glue ontology, which implements a modular approach in which annotation metadata is attached not to single annotations but to modules that represent collections of annotations of the same type and origin. This yields only a moderate increase in data size while retaining all the benefits of the first approach. We validate our approach on three use cases that represent prototypical needs in corpus work.
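The size trade-off between the two approaches can be illustrated with a toy count of provenance triples. The sketch below is not the MOND vocabulary or the paper's actual modelling: triples are plain Python tuples, every ex: name (ex:annotation0, ex:inModule, ex:posModule, ex:taggerRun1, ex:posTagger) is a hypothetical placeholder, and only the three prov: properties are taken from PROV-O. The counts are properties of this toy model, not the measurements reported in the paper.

```python
# Toy model: a triple is a (subject, predicate, object) tuple.
# All ex:* names are hypothetical placeholders, NOT terms from MOND or NIF.

# Provenance statements shared by every annotation produced in one tool run.
PROVENANCE = [
    ("prov:wasGeneratedBy", "ex:taggerRun1"),
    ("prov:wasAttributedTo", "ex:posTagger"),
    ("prov:generatedAtTime", "2015-01-01T00:00:00Z"),
]


def per_annotation(n):
    """Repeat every provenance statement on each of n annotation nodes."""
    triples = []
    for i in range(n):
        ann = f"ex:annotation{i}"
        triples += [(ann, pred, obj) for pred, obj in PROVENANCE]
    return triples


def per_module(n):
    """State provenance once on a module; each annotation only links to it."""
    module = "ex:posModule"
    triples = [(module, pred, obj) for pred, obj in PROVENANCE]
    for i in range(n):
        triples.append((f"ex:annotation{i}", "ex:inModule", module))
    return triples


n = 1000
per_annotation_count = len(per_annotation(n))  # grows as n * len(PROVENANCE)
per_module_count = len(per_module(n))          # grows as n + len(PROVENANCE)
print(per_annotation_count, per_module_count)
```

In this toy setting the per-annotation overhead scales with the number of provenance statements, while the module-based overhead is one link per annotation plus a fixed cost per module. How MOND actually links annotations to modules may differ from the single ex:inModule link assumed here.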
