Abstract
Encyclopedia of Life (EOL) has developed TraitBank (
Introduction
While human knowledge of life on Earth is vast, there is no easy way to query all the information accumulated in hundreds of years of biodiversity research and documentation. Even simple questions like “which plants have yellow flowers?” or “what do sharks eat?” are impossible to answer with confidence.
Biologists have captured and managed information about morphology, behavior, life history, and ecological interactions in many different ways. Most of this information survives in the form of free text or data tables in published papers, if it survives at all [20]. Lately communities have started to annotate those papers [3], extract information from text [28,40], and build special-purpose databases of trait data, for example, TRY1
This lack of data standards impedes progress in the ecological, conservation, and phylogenetic research communities, who need effective ways to quickly discover and consume data in the coming era of data-intensive science (e.g., [17]–[19]). For example, marine environmental modelers need high-quality inputs about large numbers of species in order to understand current and historical distributions of species; how these distributions are impacted by environmental changes such as climate change, overharvesting, or invasive species; how biological communities function to provide ecosystem services; and what could happen to these services under future scenarios that change the composition of these communities. Such large-scale data have also been identified by DIVERSITAS6
This paper describes TraitBank®, a system designed by the Encyclopedia of Life (EOL) to acquire, organize and serve biodiversity attribute data on a global scale across the entire tree of life8
TraitBank mobilizes data from diverse sources including biodiversity databases (e.g., Global Biodiversity Information Facility (GBIF),9
In addition to traditional “trait data” like
TraitBank leverages EOL’s existing network of content partners and Content Creation Community [39] and employs the EOL relational database frameworks (providing advanced taxonomic names resolution) in combination with existing data standards and domain ontologies. Rather than developing a comprehensive semantic framework for the integration of trait data, TraitBank simply links data records to relevant ontologies and controlled vocabularies. These links improve the discoverability and queriability of the data and provide interoperability with other semantic resources, but more principled inference is left to end users. This lightweight semantic approach allows for the efficient management of a large and diverse data store and ensures scalability as the system grows.
TraitBank is designed for a wide audience that includes biodiversity researchers and information and data scientists as well as teachers, students, and the public. It provides both human- and machine-accessible query interfaces, and trait data are displayed on EOL taxon pages, making them readily accessible to the EOL user base of about 6 million unique users per year (data from 1 October 2013 to 30 September 2014).

Fig. 1. Data model and architecture for TraitBank/EOL. Elements are from Darwin Core except for the following extensions developed by EOL: Media (with Audubon Core), References (with BIBO), Associations (under development), and Agents. Only the most important properties are indicated. TraitBank elements may hold only pointers to elements managed in the EOL relational database management system (RDBMS), like taxon names and references.
To represent trait data, TraitBank uses and extends TDWG Darwin Core [43] (Fig. 1), the most widely used standard for exchange of biodiversity data. Darwin Core Archives are already the preferred method for sharing media, references, and taxonomic data with EOL. Other prominent initiatives like GBIF, OBIS, and the Atlas of Living Australia (ALA)15
Each TraitBank record is associated with an
The Darwin Core
Interactions among species (e.g., predator-prey relationships) are handled using a new
As with other content on EOL, provenance of TraitBank data is handled using rich attribution metadata via fields from Dublin Core18
Taxonomic names reconciliation is at the heart of any effort to integrate biodiversity information [33]. Since there is no comprehensive consensus classification for organism names, EOL maps each data record to names in multiple taxonomic hierarchies from several scientific providers. Synonyms, misspellings, ranks, and parent taxa are taken into account during the reconciliation process. Rather than attempt to fully capture these complex interactions semantically [15], TraitBank reflects data structures already developed to represent the multiple classifications managed in the EOL relational database [32].
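As a rough illustration of the reconciliation step described above, the following Python sketch maps a submitted name string to taxon identifiers in several hierarchies, taking synonyms and simple misspellings into account. The hierarchy names, identifiers, lookup tables, and the `reconcile` helper are all hypothetical; the sketch does not reflect EOL's actual algorithms or data structures, and it ignores ranks, parent taxa, and homonyms.

```python
import difflib

# Hypothetical lookup tables: accepted names per hierarchy (name -> taxon id)
HIERARCHIES = {
    "hierarchy_a": {"Panthera leo": "A:101", "Puma concolor": "A:102"},
    "hierarchy_b": {"Panthera leo": "B:7", "Puma concolor": "B:9"},
}
# Hypothetical synonym table (synonym -> accepted name)
SYNONYMS = {"Felis leo": "Panthera leo", "Felis concolor": "Puma concolor"}


def normalize(name):
    """Collapse whitespace and use 'Genus species' capitalization."""
    return " ".join(name.split()).capitalize()


def reconcile(name, cutoff=0.85):
    """Return {hierarchy: taxon id} for a name string, or {} if unmatched."""
    cleaned = normalize(name)
    known = set(SYNONYMS) | {n for names in HIERARCHIES.values() for n in names}
    if cleaned not in known:
        # Fall back to fuzzy matching to catch simple misspellings
        close = difflib.get_close_matches(cleaned, sorted(known), n=1, cutoff=cutoff)
        if not close:
            return {}
        cleaned = close[0]
    accepted = SYNONYMS.get(cleaned, cleaned)
    return {
        hierarchy: names[accepted]
        for hierarchy, names in HIERARCHIES.items()
        if accepted in names
    }
```

A single record can thus point to several classifications at once, which mirrors the idea of mapping each data record to names in multiple hierarchies rather than to one consensus taxonomy.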
Scientific names in TraitBank are designated with the Darwin Core property
Implementation
To ensure that TraitBank would meet the needs of the scientific community and to build a stakeholder base ready to use it, EOL convened workshops and an advisory panel early in the development process. Scientists who attended workshops sponsored by EOL’s Biodiversity Synthesis Center at the Field Museum over a period of four years provided high-level community requirements. A workshop in Washington, DC in September 2012 brought together more than twenty experts from biology and computer science, including semantics, to focus on the questions that could be addressed with a comprehensive, integrated trait repository and associated software and infrastructure requirements. Teleconferences with an 11-person panel of scientists and technologists drawn from the above workshops informed iterative design and development. Following the first production release of TraitBank in January 2014, further refinements to the technology were implemented on an as-needed basis, and the focus of the development team shifted to increasing the amount of content aggregated into TraitBank.
The initial data sets targeted for ingestion into TraitBank were chosen to quickly achieve broad taxonomic coverage for a number of commonly studied ecological and life history traits. In addition to iconic data sets like PanTHERIA [22], IUCN Redlist,20
TraitBank contents as of 27 January 2015 as retrieved from
Most TraitBank data are imported from other databases via PHP connectors or uploaded directly via Darwin Core Archive files.21
If a data set introduces new concepts (attributes, values or metadata) to TraitBank, the new terms and their definitions must be added to the TraitBank URI registry before the data can be harvested [11]. Each attribute is mapped to broad subject categories (Distribution, Physical Description, Ecology, Life History and Behavior, Evolution and Systematics, Physiology and Cell Biology, Molecular Biology and Genetics, Conservation, Relevance to Humans and Ecosystems, Notes, Names and Taxonomy, Database and Repository Coverage), and basic semantic relationships are entered into the system (see below). Attributes are also ranked based on their putative audience appeal, so that attributes of greater interest to EOL audiences can be displayed more prominently in the EOL interface (see below).
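The registration requirement can be pictured as a pre-harvest check. The Python sketch below is a minimal illustration under stated assumptions: the `Term` structure, the `example.org` URIs, the rank convention, and the record layout are invented for this example and are not the actual TraitBank registry schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    uri: str
    label: str
    definition: str
    category: str  # one of the broad subject categories, e.g. "Ecology"
    rank: int      # assumed convention: lower rank = shown more prominently


# Hypothetical registry keyed by URI
REGISTRY = {
    "http://example.org/terms/body_mass": Term(
        "http://example.org/terms/body_mass",
        "body mass",
        "Mass of the whole organism.",
        "Physical Description",
        1,
    ),
}


def unregistered_terms(records, registry):
    """Return the sorted URIs used in `records` that the registry lacks."""
    used = set()
    for record in records:
        used.add(record["attribute"])
        value = record.get("value")
        # Only URI-valued (categorical) values need a registry entry
        if isinstance(value, str) and value.startswith("http"):
            used.add(value)
    return sorted(used - set(registry))


def ready_to_harvest(records, registry):
    """A data set can be harvested only when every term is registered."""
    return not unregistered_terms(records, registry)
```

In this sketch a data set is blocked until `unregistered_terms` returns an empty list, which corresponds to adding the new terms and their definitions before harvesting.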
Some of the frequently referenced ontologies in TraitBank
If a provider supplies semantic annotations with their data, these mappings are preserved in TraitBank. However, only three TraitBank data partners, Environments-EOL [28], Global Biotic Interactions [36], and Polytraits [14], fall into this category. Most of the resources we aggregate are not “born semantic,” i.e., the data come to us with labels, some metadata, and sometimes an associated article explaining the rationale and methods of the study. In these cases, EOL staff analyze the meaning of each attribute and select formally defined semantic terms to represent them. Terms from ontologies under active development by engaged communities are preferred. These include Open Biological and Biomedical Ontologies (OBO) Foundry ontologies such as Molecular Function (GO),23
Not all concepts encountered in TraitBank data sets can be matched to terms in current ontologies or controlled vocabularies. Ontology coverage is still sparse, especially in the life history and ecology domains. EOL staff therefore regularly propose new terms for adoption into ontologies like PATO and CHEBI, and we are involved in efforts to extend the Relations Ontology (RO),31
Many traits are highly complex and require referencing of more than one class, potentially from multiple ontologies. Some new terms are therefore created through Term Genie,34
The goal is for new TraitBank terms to become part of the most relevant ontologies so that they can be managed by domain experts and readily discovered by users and semantic web developers. Since adding new terms to ontologies can often take a considerable amount of time, EOL creates provisional URIs while term requests are under review.
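One way to picture the provisional-URI workflow is a small lookup that replaces a provisional identifier once the requested term has been adopted by an ontology. The namespace, the mapping, and the helper names in this Python sketch are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical namespace for provisional terms
PROVISIONAL_PREFIX = "http://example.org/provisional/"

# Hypothetical mapping, filled in when an ontology adopts a requested term
ADOPTED = {
    "http://example.org/provisional/clutch_size": "http://example.org/ontology/TERM_0001",
}


def is_provisional(uri):
    """True if the URI still lives in the provisional namespace."""
    return uri.startswith(PROVISIONAL_PREFIX)


def current_uri(uri, adopted=ADOPTED):
    """Follow adoption mappings until no further replacement is known."""
    seen = set()
    while uri in adopted and uri not in seen:
        seen.add(uri)  # guard against accidental cycles in the mapping
        uri = adopted[uri]
    return uri
```

Terms whose requests are still under review simply resolve to themselves, so annotations remain usable while the community process runs its course.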
TraitBank terms, their definitions, and URIs are listed in the TraitBank Data Glossary37
For terms imported from ontologies and controlled vocabularies, the Data Glossary entry can serve as a backup when the original resource is moved or temporarily unavailable. If the definition of a term changes in the source ontology, the Data Glossary entry also serves as a record of the definition implied in the TraitBank annotation. Links to individual glossary entries can be generated based on URIs (e.g., the OBA URI for cell shape is
Because of the complexity of semantic reasoning and the challenges of reasoning across highly heterogeneous or web-scale data sets [34,41] the availability of semantic reasoning capabilities was limited in the first release of TraitBank, with the goal to add additional reasoning later as the system matures and as demand requires. However, conversion relationships of units (e.g., from
Data quality
The quality of the data represented in TraitBank is highly variable. Early in the planning process, we decided to aggregate not only tightly curated data but also data in need of review (e.g., data from citizen science and text mining projects) and data of questionable provenance (e.g., summary statistics without original sources). Such provisional data can make important contributions to the biodiversity knowledge base in cases where no data from scientific studies are available, where such data cannot be shared and reused freely, or where the expert-curated data are of limited scope. Feedback from stakeholders has since confirmed that, at least for some applications, provisional data are better than no data at all.
Data quality concerns may also extend to the accuracy of the semantic annotations in TraitBank. Most of these links are created by trained biologists, but not necessarily by domain experts. Also, when data sources provide only vague descriptions of attributes, values, and metadata, some conjecture is involved in selecting the appropriate semantic context.
Finally, taxonomic name reconciliation relies on algorithms that may yield suboptimal results if there are unresolved homonyms, unrecognized synonym relationships, contradictory taxonomic data from different providers, or undocumented lexical variants of taxon or author names. As a result, data records may sometimes not be associated with the most appropriate EOL taxon page.
TraitBank users in need of high quality data are advised to thoroughly check data sources, semantic annotations, and taxon mappings before employing the data in scientific analyses. The metadata needed to perform these assessments are provided alongside TraitBank records in all data delivery interfaces (see below).
Data search, download, and API
TraitBank data can be queried and downloaded through the EOL data search interface38
The EOL data search (Fig. 2) supports queries based on individual attributes. A generic search returns all TraitBank records for a given attribute like
Search results can be explored in the EOL interface, or they can be downloaded as a CSV (comma-separated values) file. The CSV format is easily parsed and can be imported into common spreadsheet applications for manual or semi-automatic processing. The downloaded file features comprehensive information about each data record. It includes the unique EOL identifier for the associated taxon along with its scientific name and a common name if available. Each data row specifies the attribute label (e.g.,
To support data-driven web-applications, a JSON-LD application programming interface (API),40

Fig. 2. The EOL data search interface for TraitBank, accessible at
TraitBank data are also displayed prominently on EOL taxon pages where they enrich the experience of millions of visitors each year. On many pages, these data fill important gaps by providing information that is not yet available in narrative form. Ubiquitous links to term definitions and data searches also encourage users to explore biodiversity data and give students and teachers easy access to sample data sets for instruction and projects.
The Overview tab, which is the information center of each EOL taxon page, features a sample of relevant data records. By default, these records are selected automatically based on global, dynamic attribute rankings. The principal criterion for these rankings is the relative level of interest expected in a general audience. For example, attributes like

Fig. 3. Part of a data tab of an EOL taxon page. Wood density is expanded to show rich metadata. Users can select info buttons (? icons) to access definitions of terms, URIs, and links to the glossary and data search interface.
A comprehensive presentation of TraitBank data is provided in the Data tab of EOL taxon pages. The default view of this tab shows a simple list of attribute labels, values, and data providers, ordered by subject (Distribution, Physical Description, Ecology, etc.). A dynamic user interface (Fig. 3) gives access to the metadata for each record as well as URIs and definitions for attributes and categorical data values. Access to curation and commenting tools (see below), the data glossary, and data search interface are also provided.
Most TraitBank data are at the level of species or subspecies. For select physical, ecological, and life history attributes, the EOL Data tabs for higher taxa (genera, families, etc.) also feature summaries of the data represented among the taxonomic children of the group. Maximum and minimum values are displayed along with record and taxa counts and a quick link to a data search that yields relevant records.
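The summaries for higher taxa amount to a simple aggregation over the taxonomic children of a group. The Python sketch below shows one way this could be computed; the tree, the record layout, and the numeric values are made up for illustration and are not taken from TraitBank.

```python
# Hypothetical parent -> children tree and made-up example records
CHILDREN = {
    "Felidae": ["Panthera", "Puma"],
    "Panthera": ["Panthera leo", "Panthera tigris"],
    "Puma": ["Puma concolor"],
}
RECORDS = [
    {"taxon": "Panthera leo", "attribute": "body mass", "value": 190.0},
    {"taxon": "Panthera leo", "attribute": "body mass", "value": 175.0},
    {"taxon": "Panthera tigris", "attribute": "body mass", "value": 220.0},
    {"taxon": "Puma concolor", "attribute": "body mass", "value": 60.0},
    {"taxon": "Puma concolor", "attribute": "litter size", "value": 3},
]


def descendants(taxon, children):
    """Collect all taxa below `taxon` in the tree."""
    found, stack = set(), [taxon]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def summarize(records, children, taxon, attribute):
    """Min/max plus record and taxa counts for one attribute, or None."""
    below = descendants(taxon, children)
    hits = [
        (r["taxon"], r["value"])
        for r in records
        if r["attribute"] == attribute and r["taxon"] in below
    ]
    if not hits:
        return None
    values = [value for _, value in hits]
    return {
        "min": min(values),
        "max": max(values),
        "records": len(hits),
        "taxa": len({name for name, _ in hits}),
    }
```

Calling `summarize(RECORDS, CHILDREN, "Felidae", "body mass")` aggregates over all three species in the example tree, while the same call for `"Panthera"` covers only its two species.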
Any registered EOL member can review TraitBank content and report problems by adding comments to individual data records. EOL Curators – individuals with validated professional credentials – have the power to remove incorrect or suspect TraitBank records from public view. Flagged records remain visible to other curators and can be restored if flagged in error. Currently, TraitBank data providers do not receive notifications of comments and curator actions, but this feature will soon be available on an opt-in basis. This will allow data providers to benefit from the quality control activities of the EOL community.
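The review rules described above can be summarized as a small permission model. The following Python sketch is only an illustration: the `DataRecord` class, the user dictionaries with `registered` and `curator` flags, and the function names are assumptions made for this example, not the EOL implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    record_id: str
    hidden: bool = False
    comments: list = field(default_factory=list)


def add_comment(record, user, text):
    """Any registered member may comment on a record."""
    if not user.get("registered"):
        raise PermissionError("only registered members can comment")
    record.comments.append((user["name"], text))


def set_hidden(record, user, hidden):
    """Only curators may hide a record or restore it to public view."""
    if not user.get("curator"):
        raise PermissionError("only curators can hide or restore records")
    record.hidden = hidden


def visible_to(record, user=None):
    """Hidden records stay visible to curators but not to other visitors."""
    return (not record.hidden) or bool(user and user.get("curator"))
```

Because hiding is a reversible flag rather than a deletion, a record removed in error can be restored by any curator.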
EOL curators also participate in the selection of data for the Overview tabs of individual taxon pages. This activity is particularly important to ensure that the most interesting and informative records are highlighted for taxa of interest to a wide audience.
Architecture and technology
TraitBank is built on the RDF triple store integrated into the open source edition of the OpenLink Virtuoso Universal Server.42
All code is available under an MIT open source license and is published to the EOL project on GitHub.45
The amount of available biodiversity information has transcended our ability to process and analyze it. TraitBank addresses this impediment with an efficient, pragmatic approach to trait data integration that bridges taxon-specific and technology-specific systems. By organizing distributed knowledge from diverse sources into a lightweight, scalable framework, we facilitate its retrieval and reuse for a variety of applications, ranging from large-scale synthetic analyses of biodiversity to linked data products like the Knowledge Graph46
TraitBank was released in January 2014 after private (September 2013) and public (October 2013) beta test releases, with each test followed by a survey. Informal demonstrations to communities at several conferences have also been used to gather feedback. Some of the most valuable insights about the needs of TraitBank users were gained during the EOL-NESCent-BHL research sprint [30]. This event, scheduled only a week after TraitBank’s public launch, brought together a diverse group of biologists and informaticians to tackle large-scale ecological and evolutionary questions with the aid of resources provided by EOL and the Biodiversity Heritage Library (BHL).47
Based on user feedback and observations of user behavior, new features were added to TraitBank (e.g., JSON-LD access on a taxon by taxon basis) and the data search and download functions have been revised. In addition, new data sets were imported to TraitBank in response to specific user requests.
Several improvements suggested by users are still in the planning stages. These include support for more complex data queries with multiple facets across traits, metadata, values, and taxa; improved presentation of results, including visualizations; an R interface for access to TraitBank data; and better performance of searches filtered by taxonomic group. Also, TraitBank’s geographic keyword vocabulary is not yet standardized. Most locations are currently stored as text strings, preventing reasoning on geographic distribution data. These records need to be mapped to gazetteers like GAZ,48
TraitBank fosters semantic interoperability both within and across domains by using URIs from ontologies that are also used in other systems. As the use of semantic technologies is already prevalent in genomics, morphology, ecology, and developmental biology communities, it makes sense to link newly exposed and annotated biodiversity trait information to these efforts. On the other hand, where existing ontologies do not yet capture knowledge adequately (e.g., missing terms, missing relations, missing definitions, complex taxonomic and nomenclatural semantics), our approach still allows progress in knowledge management and sharing in the most practical sense, even if not all elements of the system are interoperable.
Recent efforts to automate the description and measurement of organisms [3,6,23] accelerate the pace of data generation. While semantic annotation and open access publishing are likely to become an integral part of modern scientific workflows, standardization across data sets and domains remains in its infancy [12]. We expect that the semantic annotation of TraitBank resources will long remain a work in progress. The rapid growth and diversification of the corpus of data frequently requires the exploration of new subject areas. Even the annotation of existing data sets is often an iterative process as best practices develop in response to evolving needs for integration, new ontology resources, and feedback from domain and knowledge representation experts.
Impact on semantic community, data providers and research community
TraitBank is a starting point for the untangling of the vast riches compiled through centuries of biodiversity exploration. It will take time for it to mature into a comprehensive, consistent knowledge management platform that can supply highly curated, analysis-ready data products. Based on our experience so far, domain ontologies will have to become much more detailed if they are to be applied to the backlog of biodiversity data. Achieving the desired level of complexity without sacrificing interoperability will be an ongoing challenge. Because of its broad scope, TraitBank is in an ideal position to provide the stewards of many relevant domain ontologies with use cases that can help to optimize the development of their resources. We also anticipate that the prominent use of semantics in TraitBank will result in increased usage of ontologies in research applications.
TraitBank complements taxon or subject-specific trait databases by filling gaps (both in taxonomic and attribute space), by recruiting new types of data (e.g., from text-mining, citizen-science, and specimen data digitization efforts) and by integrating knowledge across the tree of life and multiple scientific domains. To promote progress in the aggregation of comprehensive data sets of particular interest to scientists and the public, EOL has funded projects like GloBi (Global Biotic Interactions) [36] and Environments-EOL [28]. For these communities and other ongoing projects like Polytraits and OBIS, TraitBank provides a live platform for distribution and re-use that exposes their data to broader audiences and promotes significant community curation. For legacy data providers, such as the authors of literature-derived data sets, TraitBank improves discoverability of data that otherwise would not be exposed to the Linked Open Data (LOD) community [5]. Once provisioned to TraitBank, data can be discovered and re-used for a wide range of use cases, from simple fact-finding to “big data” modeling studies. Through its association with the Encyclopedia of Life web site, TraitBank also brings awareness of data science and interoperability efforts to novel audiences. Some of these new data users may themselves become data providers, e.g., through participation in citizen science50
With TraitBank only a year old, it is somewhat premature to assess its impact on scientific research. The TraitBank data search interface has so far been accessed over 5,000 times, and more than 1,500 data packages have been downloaded. Also, papers citing TraitBank as a data source are starting to appear in the literature (e.g., [1,4,8,28,36,37,44]). Future development efforts will focus on improving TraitBank’s utility for research by improving the search interface, exposing the data in more advanced machine-readable formats, employing standardized data quality descriptors, replacing provisional EOL terms with community-managed terms, and exploring the best use of reasoning within the EOL-TraitBank framework.
Acknowledgements
Support for TraitBank was provided by the Alfred P. Sloan Foundation, the Smithsonian Institution, the Marine Biological Laboratory, and the John D. and Catherine T. MacArthur Foundation. The production hardware infrastructure for the EOL website was supported by the Harvard Faculty of Arts and Sciences (FAS) Sciences Division Research Computing Group and the Smithsonian Institution. The TraitBank development team wishes to specifically thank Dr. Jesse Ausubel for his support and for his commitment to the entire Encyclopedia of Life initiative.
