Abstract
The increasing and unprecedented publication rate in the biomedical field is a major bottleneck for knowledge discovery in the Life Sciences. The manual curation of facts from published scientific papers is slow and inefficient, and therefore new approaches are needed that can enable the automatic, scalable and reliable extraction of assertions. While the publication of scientific assertions and datasets on the Semantic Web is gaining traction, it also creates new challenges such as the proper representation of provenance and versioning. Here, we address these issues and describe our efforts to represent the DisGeNET database of human gene-disease associations as permanent, immutable, and provenance-rich digital objects called nanopublications. Our nanopublications are the first instance of a Linked Data model that ensures stable interlinking of the assertion and its metadata by Trusty URIs. As DisGeNET integrates manually curated as well as text-mined data of different origins, the semantic description of the evidence for each assertion is important to provide trust and allow evidence-based hypothesis generation. We also describe our steps to ensure high quality and demonstrate the utility of linking our data to other datasets on the emerging Semantic Web.
Introduction
To obtain a deeper understanding of the molecular mechanisms of diseases and to support drug development and healthcare research, biomedical researchers need to explore the current knowledge on the complex relationships between genes, proteins, gene variants, pathways, drugs, phenotypes and environmental factors. Scientific knowledge is mainly communicated and gathered as scholarly publications. To ease the access and exploitation of this knowledge, one of the main strategies is to manually extract and curate biomedical statements from the literature and structure them in databases. Due to the increasing size of literature repositories and the number of different dispersed and isolated databases, there are many efforts devoted to efficiently extract and provide the most up-to-date data in a way that can be integrated with existing datasets to facilitate knowledge discovery. These efforts include a) text mining approaches aimed at extracting relationships between biomedical entities from the literature [21], b) community-driven publication approaches based on wiki systems [10,11,14], c) the publication of existing databases to the Linked Open Data cloud (LOD)1
DisGeNET is a discovery platform developed in the Integrative Biomedical Informatics (IBI) group designed to enable research on the genetic basis of the pathophysiology of diseases. The platform offers one of the most comprehensive collections of knowledge on human gene-disease associations (GDAs) integrating over 380,000 associations between more than 16,000 genes and 13,000 diseases covering all disease areas (statistics from DisGeNET v2.1). These GDAs are collected from seven different public databases, which include human and animal model expert-curated databases. DisGeNET also includes GDAs extracted from MEDLINE by the BeFree [4] NLP-based approach. All these data are integrated, harmonized and made accessible for exploration and analysis through a Web interface, a Cytoscape plugin [1], and as an RDF linked dataset [19] with an open license.
The Semantic Web enables data integration and interoperability, but new challenges emerge when the data are used for the identification and evaluation of scientific hypotheses. These challenges include the tracking of provenance to understand the basis of an assertion and its relation to existing evidence, and the creation of unambiguous references to immutable scientific assertions. To overcome these issues we publish our DisGeNET GDAs as a new linked RDF dataset using the emerging nanopublication approach3
The linked dataset presented here is the first database published as nanopublications using Trusty URIs. Trusty URIs use cryptographic hash values to generate unique and stable identifiers based on their content. This makes digital artifacts identified using Trusty URIs permanent, immutable, and verifiable. Converting a dataset into nanopublications with Trusty URIs enables users and software agents to trace and interpret how a statement was produced, which supports reproducibility and reliability and makes the statement easier to cite. Another advantage is that both the scientific assertions and their metadata are interoperable, shareable, and reusable, and can be discovered by provenance-aware applications.
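The idea of a content-derived identifier can be illustrated with a minimal sketch. This is not the Trusty URI specification, which defines its own normalization and encoding rules; the snippet only assumes SHA-256 over a sorted list of statements, and the `http://example.org/np/` namespace and the `ex:` terms are hypothetical.

```python
import base64
import hashlib


def content_id(statements):
    """Return a URL-safe SHA-256 digest of a collection of statements.

    Sorting makes the digest independent of statement order. This is a
    simplified stand-in, not the actual Trusty URI algorithm.
    """
    canonical = "\n".join(sorted(statements)).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_uri(base, statements):
    """Append the content digest to a base namespace."""
    return base + content_id(statements)


def verify(uri, base, statements):
    """Check that the content still matches the identifier."""
    return uri == make_uri(base, statements)


BASE = "http://example.org/np/"  # hypothetical namespace
original = [
    "<ex:gda1> <ex:refersTo> <ex:gene1> .",
    "<ex:gda1> <ex:refersTo> <ex:disease1> .",
]
uri = make_uri(BASE, original)
tampered = original + ["<ex:gda1> <ex:refersTo> <ex:gene2> ."]

print(verify(uri, BASE, original))   # True: content matches the identifier
print(verify(uri, BASE, tampered))   # False: any change alters the digest
```

Because the identifier is computed from the content, a consumer can recompute it and detect any modification, which is the property that makes such artifacts verifiable and effectively immutable.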
In this paper, we present the DisGeNET Nanopublications linked dataset as an alternative way to disseminate the information contained in DisGeNET. The conversion into nanopublications presented in this paper aims to extend and complement the capabilities of the existing DisGeNET dataset in RDF. The goal is to foster the publication and discoverability of these assertions, to support the automated aggregation of their evidence levels, and to support the generation of evidence-based new hypotheses in the biomedical field.
The DisGeNET evidence classification and its coverage in the nanopublication dataset
In order to cover different aspects of gene-disease relations, DisGeNET GDA content is extracted from various types of sources ranging from structured databases on human and animal models to unstructured scientific literature. DisGeNET provides evidence-based classifications of the data according to the level and type of curation in the original databases, which enable users to rapidly assess the quality of a specific GDA. The DisGeNET evidence classes are: “CURATED” for human GDAs that are reviewed by experts, “PREDICTED” for human GDAs inferred from the GDA of an animal model that was reviewed by an expert, and “LITERATURE” for human GDAs that were automatically extracted from the literature by text mining methods (see DisGeNET coverage in Table 1). It is important to point out that DisGeNET not only aggregates GDA statements from different sources, but integrates them in a uniform way accompanied by contextual annotation. Specifically, the provenance and evidence are described in a fine-grained way for each statement so that they can be tracked after integration. DisGeNET GDA content is represented according to various structured data model conventions: as a relational database, as an RDF linked dataset, and now as a nanopublication linked dataset. In the following sections, we first describe the RDF linked dataset version of DisGeNET data, followed by a description of the DisGeNET nanopublication set and methods for querying it, concluding with possible applications and related work.
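The three evidence classes can be read as a simple decision rule. The sketch below encodes the definitions given above; the boolean flags `expert_reviewed`, `from_animal_model` and `text_mined` are hypothetical names introduced for illustration and are not fields of the DisGeNET schema.

```python
from enum import Enum


class EvidenceClass(Enum):
    CURATED = "CURATED"
    PREDICTED = "PREDICTED"
    LITERATURE = "LITERATURE"


def classify(expert_reviewed, from_animal_model, text_mined):
    """Map curation attributes of a source to a DisGeNET evidence class.

    The flags are illustrative assumptions, not DisGeNET field names.
    """
    if text_mined:
        # Automatically extracted from the literature.
        return EvidenceClass.LITERATURE
    if expert_reviewed and from_animal_model:
        # Human GDA inferred from an expert-reviewed animal model GDA.
        return EvidenceClass.PREDICTED
    if expert_reviewed:
        # Human GDA reviewed by experts.
        return EvidenceClass.CURATED
    raise ValueError("source does not fit any evidence class")


print(classify(expert_reviewed=True, from_animal_model=False, text_mined=False).value)
print(classify(expert_reviewed=True, from_animal_model=True, text_mined=False).value)
print(classify(expert_reviewed=False, from_animal_model=False, text_mined=True).value)
```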
RDF dataset description
There are three main components in the RDF dataset: GDA content, provenance description of the RDF dataset, and linksets to other linked datasets. Each of these components is described separately in the following sections.
GDA content
The RDF representation identifies genes by their NCBI Gene ID and diseases by their UMLS Concept Unique Identifier (CUI), and captures the biological type of the association. The gene and the disease additionally have different annotated attributes (see the schema at
A full provenance description of the RDF linked dataset is provided using the Vocabulary of Interlinked Datasets (VoID),6
DisGeNET data is linked to LOD in order to both enrich GDAs with annotations from external Semantic Web resources and expand the GDA content in the Semantic Web in a metadata-aware manner. The current version contains 4,962,315 links to LOD through projects such as Bio2RDF and Linked Life Data. All entities are linked using the same SKOS7

Fig. 1. DisGeNET nanopublication schema. See an example in RDF/TriG notation at our Web site
Table 2. Vocabulary prefixes, namespaces and topics used in DisGeNET nanopublications
In conformance with the nanopublication standard, our DisGeNET nanopublications consist of four named graphs: head, assertion, provenance and publication information (Fig. 1). The head graph defines the structure of the nanopublication by linking to the URIs of the other three graphs. The assertion graph contains the description of a single GDA assertion. The provenance graph includes provenance, evidence and attribution statements that were directly mapped from the VoID description of the RDF dataset. Finally, the publication information graph includes the metadata about the nanopublication itself. We also include in this graph a description of the general topic of the nanopublication to enhance discoverability. The general topic of our nanopublications is ‘Gene-Disease Association’, and each nanopublication is annotated using the Dublin Core vocabulary (DC)9
For modeling our nanopublications, we needed to determine what information to include, how to formally represent it, and which ontologies to use to best represent the semantics. To represent the GDAs in the assertion part of the nanopublications, we used the same triples and ontologies already present in the RDF version of our data. That is, we use SIO both to encode the type of association and to relate the associated disease and gene, and NCIt to encode the gene and disease biomedical entity types.
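The four-graph structure can be sketched as a list of quads in plain Python. The `np:` terms (`Nanopublication`, `hasAssertion`, `hasProvenance`, `hasPublicationInfo`) come from the nanopublication schema; everything under `http://example.org/` is a hypothetical placeholder and does not reproduce the SIO, NCIt, PROV-O or DC terms actually used in DisGeNET.

```python
from collections import defaultdict

NP = "http://www.nanopub.org/nschema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/np1#"  # hypothetical base URI


def build_nanopub(assertion, provenance, pubinfo):
    """Return the nanopublication as a list of (s, p, o, graph) quads."""
    nanopub = EX + "nanopub"
    head = EX + "Head"
    graphs = {
        EX + "assertion": assertion,
        EX + "provenance": provenance,
        EX + "pubinfo": pubinfo,
    }
    a, p, i = graphs  # graph URIs, in insertion order
    # The head graph links the nanopublication to its three other graphs.
    quads = [
        (nanopub, RDF_TYPE, NP + "Nanopublication", head),
        (nanopub, NP + "hasAssertion", a, head),
        (nanopub, NP + "hasProvenance", p, head),
        (nanopub, NP + "hasPublicationInfo", i, head),
    ]
    for graph_uri, triples in graphs.items():
        quads.extend((s, pred, o, graph_uri) for s, pred, o in triples)
    return quads


def group_by_graph(quads):
    """Group triples by the named graph they belong to."""
    grouped = defaultdict(list)
    for s, pred, o, g in quads:
        grouped[g].append((s, pred, o))
    return dict(grouped)


# Placeholder statements; the predicates are illustrative only.
quads = build_nanopub(
    assertion=[(EX + "gda", EX + "refersTo", EX + "gene"),
               (EX + "gda", EX + "refersTo", EX + "disease")],
    provenance=[(EX + "assertion", EX + "wasDerivedFrom", EX + "source")],
    pubinfo=[(EX + "nanopub", EX + "subject", EX + "GeneDiseaseAssociation")],
)
grouped = group_by_graph(quads)
print(sorted(g.rsplit("#", 1)[1] for g in grouped))
```

Keeping the assertion in its own graph is what allows provenance statements to refer to the assertion as a whole, by its graph URI, rather than to individual triples.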
One important step in the nanopublication modeling was to find appropriate vocabularies for the description of the provenance and metadata in the nanopublication graphs. To represent provenance information we mainly used the PROV Ontology (PROV-O),10
A summary of the Linked Data vocabularies used in DisGeNET nanopublications is shown in Table 2. We published our dereferenceable nanopublications on the Web with a human-readable list of the vocabularies. The SIO and ECO ontology concepts are deployed in our triple store so that they are both machine-readable, with their axioms stated explicitly to optimize GDA searches in our SPARQL endpoint, and human-readable in our Linked Data Faceted Browser.
Here we present the first version of the nanopublication model for DisGeNET data, which is based on the official nanopublications guidelines17
The
We present here the first release of DisGeNET published as nanopublications, which corresponds to version v2.1.0.0. The dataset consists of 940,034 nanopublications, representing the same number of scientific assertions for 381,056 different GDAs with their detailed provenance, levels of evidence and publication information descriptions. In total this represents 3,760,136 annotated RDF nanopublication graphs. Specifically, the dataset is composed of 31,961,156 quads, i.e. RDF triples with their graph (or “context”) added as the fourth member in the tuple (Subject, Predicate, Object, Context), everything being serialized using the TriG syntax.
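The reported counts can be cross-checked with simple arithmetic. The figures below are taken from the text; the observation about quads per nanopublication is derived here and is not a statement made by the dataset description.

```python
NANOPUBS = 940_034
GRAPHS_PER_NANOPUB = 4       # head, assertion, provenance, publication info
REPORTED_GRAPHS = 3_760_136
REPORTED_QUADS = 31_961_156

# Four named graphs per nanopublication should give the reported graph count.
graphs = NANOPUBS * GRAPHS_PER_NANOPUB
print(graphs == REPORTED_GRAPHS)        # True

# Average number of quads per nanopublication.
quads_per_nanopub, remainder = divmod(REPORTED_QUADS, NANOPUBS)
print(quads_per_nanopub, remainder)     # 34 0
```

The quad count divides evenly into 34 quads per nanopublication, which is consistent with a fixed template being applied to every GDA assertion, although the totals alone do not prove it.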

An example of SPARQL query.
The generation of our nanopublications started from the relational database whose data are used to produce the RDF linked dataset. This RDF dataset is generated by in-house scripts that prepare data for the D2RQ platform,21
The nanopublication dataset will be updated in conjunction with its parent relational and RDF versions. Two major updates per year are envisioned for the relational database and, consequently, for the RDF Linked Data distribution; these updates may also affect the nanopublication content. In addition, the maintenance of the nanopublication dataset may itself require new versions. In each major revision of DisGeNET we include more data sources in the database to increase the coverage of the current knowledge on GDAs, as well as new annotations that add value to the data. For example, in the last update we included new and popular expert-curated datasets (RGD, CTD mouse and CTD rat) and our literature-mined BeFree dataset, as well as new annotations describing the level of evidence of each data source. These new evidence codes are highlighted in Table 1. The versioning for nanopublications consists of keeping track of the provenance of both the RDF version and the relational version of DisGeNET data from which the RDF is derived. Thus, the nanopublication version information is a composite of the version of the relational database (v2.1), the version of the RDF dataset (v2.1.0) and the version of the nanopublication (v0). Finally, as an indication of the interest in nanopublishing DisGeNET GDAs, we note that from the day we made the dataset available in the download section of our Web site (October 13th, 2014) until January 26th, 2015, the nanopublication dataset was downloaded 52 times while the RDF was downloaded 43 times.
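The composite version scheme can be sketched as follows. This assumes, based only on the example values in the text (v2.1, v2.1.0 and v0 giving v2.1.0.0), that each layer appends one revision number to the version of its parent; the function and parameter names are hypothetical.

```python
def compose_version(relational, rdf_revision, nanopub_revision):
    """Build the nanopublication version from its parent versions.

    Assumption: the RDF version extends the relational version with one
    revision number, and the nanopublication version extends the RDF
    version with one more.
    """
    rdf_version = f"{relational}.{rdf_revision}"
    return f"{rdf_version}.{nanopub_revision}"


print(compose_version("v2.1", 0, 0))  # v2.1.0.0
```

Under this reading, a re-release of the nanopublications alone would change only the last component, while an update of the relational database would change the leading ones.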
With the aim to show the questions that can be answered by our nanopublication implementation, we use the following question as an example:
Applications
We aim to incorporate the DisGeNET GDA collection in knowledge discovery projects such as the Open PHACTS Discovery platform [12]. The Open Pharmacological Concepts Triple Store project (Open PHACTS) has developed a powerful cloud-based platform for open access data following a Semantic Web approach that allows scientists to draw on diverse databases to answer many questions relating to drug discovery. The new version of the platform will integrate and provide access to additional datasets such as WikiPathways [14], neXtProt, which was also recently converted into nanopublications, and DisGeNET.
Related work
A variety of datasets represented as nanopublications have recently been published. Beck et al. [2] provided a nanopublication dataset on comprehensive genome-wide association studies (GWAS) to organize and annotate the complex spectrum of observed human GWAS phenotypes for reuse and interchange, to assist with cross-species genotype and phenotype comparisons, and to integrate GWAS data into the Linked Data Web. This work underlined the importance of including appropriate provenance and context information to avoid confusion to data consumers since their GWAS nanopublications are simply items of data not yet validated, i.e. not established facts. Chichester et al. [6] explored the use of neXtProt nanopublications to obtain new insights based on restricted levels of evidence related to sequence variation, expression, and regulation of human proteins important for precision medicine. Mina et al. [17] used the nanopublication model to expose different assertions generated by a Taverna workflow analysis applied to the investigation of the relation between Huntington’s Disease genes and epigenetic regulation. They showed the potential of the model to provide metadata from a computational analysis, which will enable reproducibility and increase trust in the assertions. They also showed that the nanopublication model enables the connection of this information to the Research Object model [3]. In Sernadela et al. [23], the authors explored the nanopublication integration of large collections of annotated associations between drugs and their adverse events extracted by data mining techniques and applied to pharmacovigilance, and present three interoperable data exchange interfaces. Recently, two novel examples of using nanopublications to track and aggregate evidence for future applications on knowledge discovery and evidence-based decision making processes have appeared. 
The Repurposing Drugs with Semantics (ReDrugS) framework [16], based on a systems biology approach, represents biological and chemical entity interactions contained in databases as nanopublications including descriptions of the experimental methods used to derive the assertions. By creating consensus assertions, they assign a combined probability of truth inferred from those experimental methods. They showcased their approach by searching and discovering new drug-gene associations for drug repurposing based on statistical aggregation of confidence. Another relevant pioneering work [22] proposes to track the provenance information in diagnostic databases and diagnostic processes by a nanopublication approach, with the goal to enable accurate and evidence-based clinical decision making.
Summary
We have created a nanopublication-based linked dataset that provides 940,034 nanopublications on scientific statements of human GDAs. These GDAs, identified by Trusty URIs, are machine-interpretable, immutable, permanent, and verifiable, which promotes data citation and stable references. Each GDA statement has its own provenance description providing attribution, creation time, and further context of its creation to confer trust. Each GDA is classified as “CURATED”, “PREDICTED”, or “LITERATURE” to categorize the evidence of the statement based on the type of assertion and curation made in the original databases. We have enriched the provenance annotation by stating the type of curation of the assertion, and classified the nanopublications by their level of evidence. DisGeNET nanopublications include metadata annotations about the general topic of the nanopublications, i.e. ‘Gene-Disease Association’, semantically described by SIO to ease their discoverability in the Semantic Web. With an illustrative use case we show how our nanopublications can be used to explore GDAs and how they can be integrated with relationships published in other LOD sources, which permits data integration across domains.
The publication of our DisGeNET nanopublications on the Web of Data will enable a large-scale interconnection of statements about genes and diseases and will allow users to explore them based on evidence. This is essential for knowledge discovery, and our approach can help to get a better picture of the molecular basis of pathological conditions.
Acknowledgements
We thank Dr. Mark Thompson (Leiden University Medical Center) and Dr. Jesse Van Dam (Wageningen UR) for sharing their expertise in the modeling of nanopublications. We also thank the reviewers for their helpful comments. The research leading to these results has received support from the Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (PI13/00082 and CP10/00524) and from the Innovative Medicines Initiative Joint Undertaking under grant agreements n° 115002 [eTOX] and n° 115191 [Open PHACTS], resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution. Laura I. Furlong received support from Instituto de Salud Carlos III-Fondo Europeo de Desarrollo Regional (CP10/00524). The Research Programme on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB).
