Abstract
The Darwin Core vocabulary is widely used to transmit biodiversity data in the form of simple text files. In order to support expression of biodiversity data in the Resource Description Framework (RDF), a guide was created as a non-normative addition to the Darwin Core standard. This paper describes the major issues that were addressed in the creation of the guide, particularly problems related to adapting terms designed to have literal values for use with IRI references. By making it possible to express millions of existing records as RDF, the guide is an important step towards enabling the biodiversity informatics community to participate in broader Linked Data and Semantic Web efforts.
Introduction
Terms that are commonly used as predicates in Resource Description Framework1
Because so many data are already described using Darwin Core terms, there has been significant interest in adapting the DwC terms to describe biodiversity resources in RDF. Since the DwC terms are designated as IRIs, and because the normative term definitions are expressed in RDF/XML, it would seem trivial to use Darwin Core property terms as RDF predicates. However, the results of experimentation reported on the TDWG mailing list8
In 2011, an RDF/OWL Task Group was chartered by TDWG. In 2012 a team of writers began work on a Darwin Core RDF Guide to address the identified issues by providing a set of best practices and by creating some new Darwin Core terms intended specifically for use in RDF. The Guide [2] was completed in 2013 and reviewed by the Task Group, which recommended it for adoption. When adopted by TDWG in 2015, the RDF Guide became a non-normative part of the Darwin Core standard and joined existing guides that describe how to use Darwin Core terms in simple text files and XML.
Adapting existing metadata vocabularies and datasets for use in the Semantic Web is a current challenge [1,4,6]. This paper describes how the Task Group adapted a vocabulary that was not designed specifically for use in RDF so that its terms could be used as RDF predicates in a consistent manner. In Section 2 of this paper, we describe each of the major issues (Box 1) and how they were resolved. Section 3 describes future challenges and prospects for integrating Darwin Core-described data into the broader Semantic Web.
In this paper, IRIs are sometimes abbreviated as QNames using standard namespace prefixes, e.g.
Explaining the rationale to new users of RDF
The TDWG constituency consists primarily of biologists and data managers. Relatively few members of the organization are familiar with RDF, Linked Data, and the Semantic Web. Therefore, a key component of the RDF Guide is an explanation of the ways in which RDF differs from the more traditional data transfer systems with which data managers may be familiar. The introduction of the Guide (Guide Section 1) highlights several features of RDF that data managers need to consider when adapting their data for output as RDF. It discusses the importance of IRIs as resource identifiers (Guide Section 1.3.2) and references the best practices specified in the TDWG GUID Applicability Statement standard9
Each of the issues mentioned in the introduction of the Guide was identified as an important point of confusion in threads on the TDWG email list prior to the formation of the Task Group. Although brief, the summary of these issues in the introduction provides links to more extensive reference information for implementers who are not familiar with them.

Example Darwin Core records. Actual records would have additional fields.

Because the Darwin Core vocabulary was designed primarily to facilitate the transfer of text-based records from relatively flat database tables, definitions and comments for terms in the general namespace
Figure 2 shows an attempt to represent these data as RDF using only
The conflicting demands of flat, string-based tables and normalized, graph-based RDF create a problem when terms that were originally designed for use with text strings are co-opted for use with non-literal objects in RDF. This is a long-standing problem11

Use of Darwin Core geographic “convenience” properties and their corresponding dwciri: analog.
The Darwin Core RDF Guide (Guide Section 2.5) adopts the dual-term approach by creating a new Darwin Core namespace,
The Guide allows legacy string name data in the form of concatenated lists to continue to be exposed in RDF as a literal object of a
An advantage of the dual-term approach is that it allows large databases consisting of legacy string name data to be exposed immediately as RDF using
Darwin Core contains several collections of hierarchical terms designed to provide a set of text-based property/value pairs that will unambiguously specify a resource. For example, the terms
In the location term set, no single term value is sufficient to unambiguously place the location in its lowest level political subdivision, because there may be several low level political subdivisions having the same name that are contained within different upper level political subdivisions. Thus, each location record must provide values for the entire set of terms. In the context of a flat database structure, it is convenient to expose the full set of property/value pairs for a location since that would allow a user to query for locations in the database by specifying the particular values of interest for certain properties in the set (hence the name “convenience terms” for properties that are included in such sets to make searching convenient).
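The ambiguity described above can be illustrated with a minimal Python sketch. The field names mirror Darwin Core local names (country, stateProvince, county), but the records and the `match` helper are hypothetical and serve only to show why a single low-level value is not enough to identify a location.

```python
# Hypothetical flat location records; the values are illustrative only.
records = [
    {"country": "United States", "stateProvince": "Tennessee", "county": "Jackson"},
    {"country": "United States", "stateProvince": "Alabama", "county": "Jackson"},
]


def match(records, **criteria):
    """Return the records whose fields equal every supplied property/value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


# Querying on the lowest-level subdivision alone matches both records.
ambiguous = match(records, county="Jackson")

# Adding a value from a higher level of the hierarchy resolves the ambiguity.
specific = match(records, county="Jackson", stateProvince="Tennessee")

print(len(ambiguous), len(specific))
```

The same helper also shows the convenience aspect: a user can search on whichever combination of properties in the set happens to be of interest.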
It would be possible to define
Following this approach would alleviate the need for data providers to update their database each time there is a change in the upper levels of the hierarchy (change in spelling, reassignment of lower level resources to different upper level resources, reorganization of upper levels, etc.).
For each of the kinds of convenience terms in Darwin Core, a term has been defined in the
The general Darwin Core vocabulary includes a number of terms whose local name ends in “ID” (e.g.,
The first problem stems from a property assigned to all ID terms in their normative RDF. Each ID term is declared to be
A second problem is that a pre-existing understanding between the data provider and consumer is required to know which of several ID term fields that might be present in the record represents the identifier for the record (i.e., provides the identifier for the subject resource) and which ID term fields represent identifiers of linked resources (i.e., of object resources). For example, in Fig. 1.A it is not possible to know whether
Because of these problems associated with the use of ID terms in RDF, the Darwin Core RDF Guide states that ID terms should not be used as predicates in RDF triples. Instead, RDF best practices should be followed for specification of identifiers (discussed in Section 2.3.1 of this paper) and for the assignment of type (discussed in Section 2.3.2 of this paper). The linking function of ID terms must be served by object properties not defined by Darwin Core as discussed in Section 2.4.2 of this paper.
Associating an identifier with a subject resource
In non-RDF uses, Darwin Core is not strict about the identifiers that are used as values of its properties. Although globally unique identifiers are recommended, identifiers specific to a data set are allowed. There is also no requirement that globally unique identifiers be IRIs. Thus, Section 2.2 of the RDF Guide provides some guidelines for translating the various kinds of ID term values into RDF.
If the subject resource identifier is an IRI, that IRI is simply asserted as the subject of triples describing the subject resource. If the subject resource identifier is a non-IRI string, the string is presented as the literal value of a
In the past, TDWG has recommended the use of Life Science Identifiers (LSIDs) [9]. As IRIs, LSIDs may be the subjects of RDF triples. However, Recommendation 30 of the TDWG LSID Applicability Statement standard requires that “The description of all objects identified by an LSID must contain an owl:sameAs, owl:equivalentProperty or owl:equivalentClass statement expressing the equivalence between the object identifier in its standard form and its proxy version” [11]. The Darwin Core RDF Guide extends this recommendation to any non-HTTP IRI (i.e., including other varieties of URNs such as ARK, UUID, ISBN, etc.) by specifying that if possible, the subject resource should be identified by an HTTP-proxy version of the non-HTTP IRI, and that the non-HTTP IRI be the object of an
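The identifier-handling rules in this subsection can be sketched roughly in Python as follows. This is an illustration under stated assumptions, not the Guide's prescribed procedure: the prefix tests are a deliberately crude way to classify identifiers, the proxy prefix `http://proxy.example.org/` is hypothetical, the blank node used as the subject for non-IRI strings is an assumption, and the predicates (`owl:sameAs` for the link and whatever is passed as `literal_id_predicate`) are choices made for this sketch.

```python
def describe_subject(identifier, literal_id_predicate, proxy_prefix,
                     link_predicate="owl:sameAs"):
    """Return (subject, triples) for a record identifier.

    All three predicate/prefix arguments are assumptions of this sketch and
    are supplied by the caller; none is taken verbatim from the Guide.
    """
    if identifier.startswith(("http://", "https://")):
        # HTTP IRI: simply asserted as the subject; no extra triples needed.
        return identifier, []
    if identifier.startswith("urn:"):
        # Non-HTTP IRI: identify the subject by an HTTP proxy version and
        # link the proxy back to the original identifier.
        subject = proxy_prefix + identifier
        return subject, [(subject, link_predicate, identifier)]
    # Non-IRI string: expose it as a quoted literal on an assumed blank node.
    subject = "_:record"
    return subject, [(subject, literal_id_predicate, '"%s"' % identifier)]


# Hypothetical identifiers used only to exercise the three branches.
lsid_subject, lsid_triples = describe_subject(
    "urn:lsid:example.org:names:1", "dcterms:identifier",
    "http://proxy.example.org/")
local_subject, local_triples = describe_subject(
    "specimen-42", "dcterms:identifier", "http://proxy.example.org/")
```

Passing the predicates and proxy prefix as arguments keeps the sketch neutral about which terms and resolver a provider actually uses.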
Specifying the types of resources in a database record using the general Darwin Core vocabulary is complex and involves using the terms
Darwin Core also imports terms from Dublin Core that have range or domain declarations. The Guide draws attention to the fact that use of those terms also entails type relationships that may not be explicitly declared.
Linking to related resources
Because RDF is a graph-based model, one of its primary concerns is linking non-literal, IRI-identified or anonymous nodes using object properties. Although there are a variety of ways that Darwin Core properties in the
The first category corresponds to relationships defined by literal value terms in the general Darwin Core namespace
Association terms
In the second category, generic relationships were indicated by existing literal value Darwin Core terms. The general Darwin Core vocabulary included a number of “association terms” (terms whose local names begin with “associated”, e.g.
Linking instances of the Darwin Core classes
The other major category of Darwin Core terms whose purpose is to establish links to other non-literal resources is the category of ID terms (discussed in Section 2.3 of this paper). Since the ID terms cannot be used as RDF predicates, it might seem that it would have been a relatively simple task for the Guide to mint a set of object properties that could be used instead. However, minting such terms was hampered by the lack of a standard biodiversity domain model.
Example 4 in the Appendix is based on the data in Fig. 1 and illustrates the difficulty. Although the two tables in Fig. 1 imply the existence of instances of two classes (Occurrence and Location), the tables could actually be considered to contain information about instances of five classes:

Parts A through C of Example 4 show how the data in the tables can be serialized as RDF under several non-Darwin Core models.21 The models are compared at
Appendix Example 5 shows a SPARQL query designed to find occurrences recorded in Departamento de Puno, Peru by querying for its GeoNames IRI. Because the object properties used in the query are the Darwin-SW properties that link the Occurrence, Event, and Location classes, the query would be successful in finding the desired occurrences in any data serialized as in Example 4 part C. However, it would not find those same Occurrences if the data were serialized using the classes and object properties included in the TDWG Ontology (part A) or TaxonConcept ontology (part B).
It would be possible to merge graphs from providers that used different models and object properties, and then to compensate by writing complex queries. However, standardization and consistent use of object properties among providers would make data integration and querying much simpler. Creating a uniform set of object properties to link Darwin Core classes is contingent on the development of a consensus model for the biodiversity informatics domain, an effort that was beyond the scope of the Darwin Core RDF Guide. Work in this area is being actively pursued in the context of the Biological Collections Ontology (BCO) [13].
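The cost of inconsistent object properties can be illustrated with a small Python sketch over an in-memory set of triples. The predicates prefixed `a:` and `b:` are hypothetical stand-ins for two different models; they do not reproduce the actual properties of Darwin-SW, the TDWG Ontology, or the TaxonConcept ontology.

```python
# Merged graph from two hypothetical providers.
graph = {
    ("occ1", "a:atEvent", "evt1"),       # model A: occurrence -> event
    ("evt1", "a:locatedAt", "loc1"),     # model A: event -> location
    ("occ2", "b:hasLocation", "loc1"),   # model B: occurrence -> location
}


def objects(graph, subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def occurrences_model_a(graph, location):
    """Follow occurrence -> event -> location using only model A predicates."""
    return {
        s for s, p, o in graph
        if p == "a:atEvent" and location in objects(graph, o, "a:locatedAt")
    }


def occurrences_any_model(graph, location):
    """Cover both models; each additional model adds another query branch."""
    direct = {s for s, p, o in graph if p == "b:hasLocation" and o == location}
    return occurrences_model_a(graph, location) | direct


only_a = occurrences_model_a(graph, "loc1")
both = occurrences_any_model(graph, "loc1")
print(sorted(only_a), sorted(both))
```

A query written against one model's predicates silently misses the data expressed in the other, which is why a shared set of linking properties simplifies integration.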
Because of its wide acceptance, the Darwin Core vocabulary is an obvious source of predicates for description of biodiversity resources in RDF. However, because the original Darwin Core vocabulary was a general-purpose vocabulary intended to enable data transmission based primarily on simple tables of literal values, some of its properties could not be used as predicates in unmodified form, while the values of other properties did not unambiguously specify the real-world entities to which they refer. The Darwin Core RDF Guide defines new
Comparison of Figs 2 and 4 illustrates the advantages provided by using new
The lack of object properties to link instances of the main Darwin Core classes remains a major obstacle to effective expression of biodiversity data as RDF. However, the ability to consistently express many other relationships as RDF using Darwin Core properties will facilitate the development and testing of a consensus domain model for the biodiversity informatics community. This will in turn enable the community to add those missing object properties to the standard.27 Darwin Core is a continuously evolving standard. As such, this paper reflects the state of Darwin Core at the time of the ratification of the RDF Guide. To examine the current status of the standard, visit the Darwin Core documentation at
Footnotes
Acknowledgements
The authors would like to thank the members of the TDWG RDF/OWL Task Group for feedback on the draft Darwin Core RDF Guide. Anonymous reviewers of a previous draft of this paper provided very helpful suggestions for its improvement.
Steve Baskauf’s participation in the Semantics of Biodiversity Symposium at TDWG 2013 was supported by the Research Coordination Network for the Genomic Standards Consortium (RCN4GSC, NSF DBI-0840989) and the Scientific Observations Network (SONet, NSF #0753144, OCI-Interop).
