A systematic analysis of term reuse and term overlap across biomedical ontologies

Abstract

Reusing ontologies and their terms is a principle and best practice that most ontology development methodologies strongly encourage. Reuse comes with the promise of supporting semantic interoperability and reducing engineering costs. In this paper, we present a descriptive study of the current extent of term reuse and overlap among biomedical ontologies. We use the corpus of biomedical ontologies stored in the BioPortal repository, and analyze different types of reuse and overlap constructs. While we find an approximate term overlap of 25–31%, term reuse is below 9%, with most ontologies reusing fewer than 5% of their terms from a small set of popular ontologies. Clustering analysis shows that the terms reused by a common set of ontologies have >90% semantic similarity, hinting that ontology developers tend to reuse terms that are sibling or parent–child nodes. We validate this finding by analyzing the logs generated from a Protégé plugin that enables developers to reuse terms from BioPortal. We find that most reuse constructs were 2-level subtrees on the higher levels of the class hierarchy. We developed a Web application that visualizes reuse dependencies and overlap among ontologies, and that proposes similar terms from BioPortal for a term of interest. We also identified a set of error patterns that indicate that ontology developers did intend to reuse terms from other ontologies, but that they were using different and sometimes incorrect representations. Our results indicate the need for semi-automated tools that augment term reuse in the ontology engineering process through personalized recommendations.

Keywords

Descriptive study, ontologies, biomedical domain, term reuse, term overlap, composite mappings, visualization

1. Reuse in biomedical ontologies

The biomedical research community has been one of the earliest adopters of ontologies to tackle the challenges of efficient knowledge organization, optimized information retrieval and effective annotation of datasets. Researchers have used ontologies for various purposes such as knowledge management, semantic search, data annotation, data integration, exchange, decision support and reasoning [5,32]. For example, i) the National Cancer Institute Thesaurus (NCIT) has been used as a reference terminology for cancer data [35], ii) the Gene Ontology (GO) has been ubiquitously used for enrichment analysis on gene sets obtained from microarray experiments [3], and iii) the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) has been used for the electronic exchange of clinical health information [37].

Over the years, ontology development has become a reuse-centric process [34,38]. All methodologies strongly encourage reuse while building new ontologies, be it at the level of an ontology, or at the level of individual terms [2,7]. In the literature, we may find two areas that benefit from reuse: i) ontology engineering, in which experts can reuse already existing ontology structures, and thus reduce the engineering costs; and ii) ontology application, in which reuse supports the semantic interoperability among different datasets and applications. For example, the 11th revision of the International Classification of Diseases (ICD-11) reuses terms from SNOMED CT to support its use in electronic health records [31,41]; while federated search engines benefit from reuse by being able to query multiple, heterogeneous knowledge sources without the need for extensive ontology alignment [19].

Several large, collaborative efforts are trying to streamline the development of interoperable, logically well-formed and accurate biomedical ontologies. They deal with ontological term overlap and reuse in different ways. For example, one of the key aims of the Open Biological and Biomedical Ontologies (OBO) Foundry [36] is to create a set of orthogonal ontologies by: i) defining each term in exactly one ontology, and referring to it in other ontologies using its Internationalised Resource Identifier (IRI), or ii) using the xref mechanism to create references between similar terms in different ontologies [28]. Another prominent example is the Unified Medical Language System (UMLS) [4], which uses the notion of a Concept Unique Identifier (CUI) to map terms with similar meaning in different terminologies a posteriori. Figure 1 shows examples for the different types of reuse (IRI, CUI and xref) employed by various ontology development projects.

Fig. 1.

Types of Reuse: (a) CUI reuse: Diabetes Mellitus terms in SNOMED CT and ICD-9CM are mapped to the same CUI, (b) IRI reuse: RNA Binding defined in the GO ontology is reused in GEXO ontology using the same IRI; xref reuse: the latter term is reused in the GRO Ontology via a xref annotation.

For the purpose of this work, we define a term to be a class in an ontology. A term usually has a preferred label, other labels, synonyms, and other properties. We define as term reuse the situation in which the same term is present in two or more ontologies either by the direct use of the same IRI, or via explicit references (xref) and mappings (CUI). We further classify reuse into: (1) reuse of an ontology, through the import mechanism available in OWL [43], meaning that the entire source ontology is imported into the target ontology; and (2) reuse of terms from one source ontology in another. In many cases, experts reuse not only one term from one ontology, but rather subsets of terms from multiple ontologies (e.g., subtrees). We define as term overlap the situation in which two terms are similar, when compared using their labels or synonyms. If we subtract the reused terms from the set of all overlapping terms (term overlap minus term reuse), we get a set of terms that could potentially have been reused, but have not been in practice. We call this set the overlap–reuse gap. Ideally, we should try to minimize this gap.

For this research, we use the entire set of biomedical ontologies stored in BioPortal [45], an open content repository of biomedical ontologies and terminologies. The key contributions of this research can be described as follows:

We provide a systematic study of the current state of reuse and overlap across biomedical ontologies.

We propose and implement a new approach to determine term overlap across ontologies using composite mappings.

We develop a clustering method to help identify patterns of reuse using semantic similarity among ontology terms, and validate the results using the BioPortal Import Plugin logs.

We implement a Web application that can search for similar and reused terms in BioPortal ontologies, and that can visualize reuse dependencies and overlap among ontologies.

We discuss the state and challenges of reuse in biomedical ontologies.

All results of this paper, as well as all developed visualization tools, are available online at: http://onto-apps.stanford.edu.

The paper is structured as follows: Section 2 describes the related work to this research. Section 3 presents the methods that we used for our descriptive study. Section 4 details the results of applying the research methods, and then we discuss our findings in Section 5.

2. Related work

2.1. Benefits and challenges of reuse

Ontology reuse is recommended in the methodologies and guidelines outlined by several engineering groups as a means to develop modular, interoperable, accurate and cost-effective ontologies [10,26,38]. Bontas et al. [6] provide several real-world use cases for the benefits of ontology reuse in biomedicine and eRecruitment. By empirically analyzing methodologies, methods and tools currently used, Simperl et al. [34] identify the research and development challenges for ontological knowledge reuse to become a feasible alternative to other ontology-development strategies. In essence, reuse can be increased through the development of pragmatic methods and semi-automated tools that optimally exploit human and computational intelligence for reusing ontologies through a context- and task-sensitive approach [34]. Ontology modularization techniques (i.e., extracting parts of an ontology using some structural or logical properties) are also an important factor in supporting reuse. Researchers have undertaken comprehensive studies of existing modularization techniques [11,29].

2.2. Tools to support reuse

There are only a few tools that support term reuse in biomedical ontologies. OntoFox [46] is a Web-based application that allows users to retrieve terms, selected properties, and annotations from the source ontologies, using MIREOT principles [9]. The BioPortal Import Plugin [23,24] is an extension of the Protégé ontology editor [27] that allows the import of terms, their properties and class subtrees from BioPortal ontologies. The MIREOT Protégé Plugin [15] and DOG4DAG [44] are also Protégé plugins that support importing terms from external ontologies. ProtégéLov [12] allows reuse of terms from the Linked Open Vocabularies repository [42] using owl:equivalentClass and rdfs:subClassOf axioms. All these tools require users to know in advance which ontologies contain their term of interest.

2.3. Previous analyses of reuse and overlap

Matentzoglu et al. [22] provide a method to analyze the overlap between automatically-downloaded OWL ontologies from the Web. Ontologies with 90% overlap or containment relations were considered similar. Poveda et al. [30] analyzed the landscape of reuse in the ontologies referenced in Linked Open Data (LOD). The results indicate that over 40% of the terms are reused from other vocabularies, 67% of which are reused by imports, and the rest by referencing the term IRI.

In 2010, a systematic analysis of the member and candidate ontologies in the OBO Foundry indicated that the OBO Foundry had made significant progress over a period of two years towards the goal of orthogonality [14]. However, term overlap (the percentage of similar terms between the OBO Foundry ontologies) also increased [14].

Five years later, we conducted a study [20] to investigate the level of reuse across all the biomedical ontologies stored in BioPortal [45]. Both these studies carried out simple lexical comparisons of the term labels to determine term overlap. Even though effective, this naive method tends to leave out terms that represent the same concept but have lexically-different term labels (e.g., “Cardiac Muscle” and “Myocardium”). For the three types of reuse observed in the biomedical domain (IRI, xref and CUI, Fig. 1), we estimated term reuse using a simple metric (Eq. (1)).

$Reuse = \frac{\text{unique reused terms}}{\text{total terms}} \qquad (1)$

We found term reuse to be 3.1%, 3.9% and 4.1% for the three reuse types respectively, whereas we found a term overlap of 14.4%. We also found that most ontologies reuse less than 5% of their terms, and that these terms are reused from a small set of popular ontologies only. We presented some use cases in which the developers reused terms with different, and often incorrect, representations.

In this paper, we will extend this research by providing a new approach to determine term overlap, a better metric to estimate term overlap and reuse, and a deeper understanding of how ontology developers reuse terms.

3. Methods

Fig. 2.

Workflow of all the steps required to estimate the average term reuse and overlap statistics across the BioPortal ontologies, as well as the clustering and BioPortal Import Plugin log analysis used to detect reuse patterns. The steps of the workflow are: (1) Ontology Pre-processing, (2) Term Reuse, (3) Term Overlap, (4) Clustering, and (5) Log Analysis.

For our descriptive study, we employed several methods that aim to: (i) estimate the level of term reuse and term overlap across biomedical ontologies, (ii) extract reuse patterns from BioPortal ontologies, and (iii) extract reuse patterns from time-stamped BioPortal Import Plugin logs. These methods are inspired by text mining, graph theory and unsupervised learning. We make the results available through interactive visualizations and a search application (http://onto-apps.stanford.edu). Figure 2 describes the workflow of our methodology and the methods used at each step. The structure of this section follows the numbered steps of the workflow.

3.1. Datasets

We used two datasets for our study: (i) a dump of BioPortal ontologies to analyze term reuse (Step 2) and overlap (Step 3), as well as to perform the clustering (Step 4); and (ii) the logs of the BioPortal Import Plugin to analyze the patterns of reuse in user ontologies (Step 5).

3.1.1. BioPortal ontologies

We obtained a triplestore dump of the BioPortal ontologies in N-triples format that contained 509 distinct ontologies as of January 1, 2015. This dump did not contain some ontologies that were deprecated or merged with existing ontologies, or added to BioPortal after January 1, 2015. After removing ontological views (i.e. $O_{1} \subseteq O_{2}$ ), we were left with 377 distinct biomedical ontologies (Fig. 2, Step 1). These ontologies include 8 OBO Foundry member ontologies (GO, CHEBI, PATO, OBI, ZFA, XAO, PR and PO), 105 OBO Foundry candidate ontologies (e.g., OGMS, HP) and 31 UMLS Terminologies (e.g., SNOMED CT, ICD-9).

3.1.2. BioPortal Import Plugin logs

Listing 1.

An anonymized excerpt of the BioPortal Import Plugin Logs

The BioPortal Import Plugin, an extension to the Protégé ontology editor, allows users to import terms and subtrees from BioPortal ontologies into their own ontology [23,24]. The plugin invokes the BioPortal REST API to search the BioPortal ontologies, and also to import terms.

We obtained the logs of REST calls that the plugin made to BioPortal. The logs are time- and IP-stamped, and span the period from 26 September 2011 to 14 May 2013 (∼20 months). Listing 1 shows an excerpt of these logs.

Even though we did not have access to the user ontologies into which these imports were performed, these logs were an important source of information of terms that were reused together in user ontologies. We used these logs to identify patterns of reuse (Fig. 2, Step 5).

3.2. Identifying term reuse

Fig. 3.

Cartoon representations of the (a) $Reuse$, (b) $Overlap : L E R O G$ and (c) $Overlap - Reuse : L E G - {Reuse}$ modules. In (a), terms A and E are defined in two ontologies using the same IRI. The green, dotted arrow in the $Reuse$ module is an xref mapping from $E \to A$, whereas the green, bidirectional arrow means that the terms G and H are mapped to the same CUI. In (b) and (c), the two disjoint components $T_{1}$ and $T_{2}$ are composed of the terms $\{A, B, C, D, E\}$ and $\{F, G, H\}$ respectively. The darkened path $C \to A \to D$ represents a sample composite mapping, formed by different edge types.

For the purpose of this work, we define as term reuse the situation in which the same term is present in two or more ontologies, either by the direct use of the same IRI, via explicit xref references, or via CUI mappings.

To identify term reuse (Fig. 2, Step 2), we used the BioPortal corpus (Section 3.1.1), and defined three reuse constructs:

IRI – two terms share the same IRI,

xref – two terms are linked through the xref annotation [28], and

CUI – two terms are mapped to the same UMLS CUI.

We iterated over all the axioms in each of the 377 BioPortal ontologies to extract class term IRIs, their labels, synonyms, xref links and UMLS CUI mappings, when available. From the 5,718,275 class terms, we used the three constructs (same IRI, xref annotation, and CUI mapping) to extract the set of terms that satisfy any of the three reuse criteria (Fig. 1). For the first two reuse types (IRI and xref), we identified the source ontology for each term using a heuristic approach described previously [20]. UMLS CUI reuse was excluded from this step, as we could not identify the source ontology for a CUI. For each ontology, we calculated:

The percentage of terms reused using the first two constructs from other ontologies (IRI and xref),

The total number of ontologies reused from,

The percentage of terms reused by other ontologies,

The total number of other ontologies reusing terms,

CUI-mapped terms among other ontologies,

Reuse among all distinct pairs of ontologies.

Using these metrics, we determined those ontologies that reused the maximum number of terms from other ontologies, and also those ontologies whose terms were reused the most.

Table 1

An example of a composite mapping. A column represents the term shown in the header. The content of a column contains different labels (preferred labels, synonyms, etc.) associated to the term. An arrow indicates that a label of a term is mapped to the label of another term. This example shows how we can map Term A defined in $O_{1}$ to Term C defined in $O_{3}$ using a composite mapping

Term A ( $O_{1}$ )	Term B ( $O_{2}$ )	Term C ( $O_{3}$ )
Heart Muscle →	Muscle of Heart	Myocardium
	Cardiac Muscle →	Cardiac Muscle

We generated a graph $G$ , where the terms identified through IRIs represent nodes (Fig. 3a). The number of ontologies in which the term is reused is represented as an attribute of each node. An xref annotation is shown as a unidirectional arrow, whereas all terms mapped to the same CUI are interlinked with each other using bidirectional arrows. A component of a graph is a subgraph in which any two nodes (terms) are connected to each other by paths, and the subgraph is connected to no additional node in the main graph. Due to the nature of these ontological terms (generally distinct for a given ontology), we produced a graph composed of different, disjoint components (e.g., $T_{1}$ , $T_{2}$ and $T_{3}$ are different components in Fig. 3a). This graph can be divided based on the type of the edges, and thus yields three modules corresponding to our three reuse constructs:

IRI reuse module – the graph module containing only IRI edges (an undirected edge links two terms with the same IRI),

xref reuse module – the graph module containing only xref edges (a directed edge links the source term and the referenced term via xref), and

CUI reuse module – the graph module containing only CUI edges (an edge links two terms that are mapped to the same CUI).
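
The split into modules and components can be sketched as follows. This is a minimal Python illustration and not the implementation used in the study: the term occurrences (`O1:A`, `O2:A`, ...), the edge list and the helper names `components` and `module` are all hypothetical, and edge direction is ignored when components are computed.

```python
from collections import defaultdict

def components(edges):
    """Group nodes into connected components, ignoring edge direction."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())

def module(typed_edges, construct):
    """Keep only the edges of one reuse construct (IRI, xref or CUI)."""
    return [(a, b) for a, b, kind in typed_edges if kind == construct]

# Hypothetical typed edges between term occurrences: (term, term, construct).
typed_edges = [
    ("O1:A", "O2:A", "IRI"),   # same IRI used in two ontologies
    ("O3:E", "O1:A", "xref"),  # xref annotation from E to A
    ("O4:G", "O5:H", "CUI"),   # both terms mapped to the same CUI
]

iri_components = components(module(typed_edges, "IRI"))
all_components = components([(a, b) for a, b, _ in typed_edges])
```

In this toy graph the IRI module has one component of two terms, while the full graph has two disjoint components.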

Table 2

Sources for labels and synonyms to generate composite mappings

Set	Source
$L$	skos:prefLabel, rdfs:label, dc:title
$S_{E}$	OBO:hasExactSynonym, skos:altLabel
$S_{R}$	OBO:hasRelatedSynonym, OBO:IAO_0000118
$S_{O}$	OBO:hasNarrowSynonym, OBO:hasBroadSynonym, under IAO:000015, rdfs:comment, skos:definition

For each reuse module, we calculated the term reuse across all biomedical ontologies using the equation given below, where $N$ represents the total number of terms extracted (5,718,275) and $M_{r}$ is a reuse module composed of $k$ components $\{T_{1}, T_{2}, \dots, T_{k}\}$. Each component $T_{j}$ is formed from $n_{j}$ terms, i.e. $T_{j} = \{t_{1 j}, t_{2 j}, \dots, t_{n_{j} j}\}$. The number of terms in a component $T_{j}$ must satisfy $1 < n_{j} < N$ (i.e., components with a single term are not allowed). All terms in one component are reused forms of the same term. We calculate term reuse for each of the three different reuse modules:

$Reuse = \frac{\left(\sum_{j \mid T_{j} \in M_{r}} n_{j}\right) - k}{N} \qquad (2)$

The above equation serves as a better metric to estimate term reuse as compared to the previous metric (Eq. (1)). The equation calculates the percentage of terms in BioPortal that are not unique, but are reused, unlike the previous metric, which did not include the count of reused versions of a term.
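
As a worked illustration of Eq. (2), the Python sketch below evaluates the metric on a toy module and on the term and component counts reported for the $L E G$ overlap module in Table 4, which is computed with the same formula. The function names and the toy component sizes are assumptions made for this example.

```python
def reuse_eq2(component_sizes, total_terms):
    """Eq. (2): (sum of n_j over all components - number of components k) / N."""
    sizes = [n for n in component_sizes if n > 1]  # single-term components are not allowed
    return (sum(sizes) - len(sizes)) / total_terms

def reuse_from_totals(terms_in_components, k, total_terms):
    """Same formula, starting from the aggregate counts listed in Table 4."""
    return (terms_in_components - k) / total_terms

# Toy module: three components with 2, 3 and 4 terms among N = 100 terms.
toy = reuse_eq2([2, 3, 4], 100)  # (9 - 3) / 100

# L E G row of Table 4: 2,485,478 terms in 759,571 components, N = 5,718,275.
leg = reuse_from_totals(2_485_478, 759_571, 5_718_275)

print(f"{toy:.2%}  {leg:.2%}")
```

Each component contributes its $n_{j} - 1$ additional copies of a term, so the toy module yields 6% and the Table 4 counts reproduce the reported 30.18%.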

3.3. Detecting term overlap through composite mappings

For the purpose of this work, we define term overlap as the situation in which two terms are similar, when compared using their labels or synonyms. To detect term overlap (Fig. 2, Step 3), we use the BioPortal corpus (described in Section 3.1.1).

In our initial approach [20], we normalized the term labels by converting them to lowercase and then removing all non-alphanumeric characters. We performed naïve string matching to determine the potential term overlap. However, we realized that the terms with labels such as “Cardiac Muscle”, “Heart Muscle”, “Muscle of Heart” and “Myocardium” would be treated as separate terms in this approach, when these terms are the same and should be treated as term overlap.
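
The normalization step of this initial approach can be illustrated with a short Python sketch. The function name and the hyphenated label variant are hypothetical, and restricting the match to ASCII letters and digits is an assumption.

```python
import re

def normalize(label):
    """Lowercase a label and drop every non-alphanumeric character (ASCII only)."""
    return re.sub(r"[^a-z0-9]", "", label.lower())

labels = ["Cardiac Muscle", "cardiac-muscle", "Heart Muscle", "Myocardium"]
keys = {label: normalize(label) for label in labels}

# Labels that differ only in case or punctuation collapse to the same key,
# but lexically different labels for the same concept do not.
same_key = keys["Cardiac Muscle"] == keys["cardiac-muscle"]
missed = keys["Heart Muscle"] != keys["Myocardium"]
```

The second comparison shows the limitation: exact matching on normalized strings cannot relate “Heart Muscle” to “Myocardium”.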

To overcome this limitation, we considered using composite mappings in the current approach. Given a mapping from $A \to B$ and from $B \to C$ , where terms $A \in O_{1}$ , $B \in O_{2}$ and $C \in O_{3}$ and $O_{1}, O_{2}, O_{3}$ are different ontologies, a mapping from $A \to C$ is called a composite mapping [39]. This approach, which leverages transitivity of terms, has been used in the past to match unstructured vocabularies using a background ontology, where $O_{2}$ is a background ontology [1]. An example of such a composite mapping is shown in Table 1.
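
The idea can be sketched in Python using the three terms of Table 1. This is only an illustration under simplifying assumptions: the direct mappings are given by hand and treated as undirected, and the helper name `composite_path` is hypothetical.

```python
from collections import defaultdict, deque

TERM_A, TERM_B, TERM_C = ("Term A", "O1"), ("Term B", "O2"), ("Term C", "O3")

# Direct mappings taken from Table 1.
direct = [
    (TERM_A, TERM_B),  # "Heart Muscle" ~ "Muscle of Heart"
    (TERM_B, TERM_C),  # "Cardiac Muscle" = "Cardiac Muscle"
]

def composite_path(mappings, start, goal):
    """Return the shortest chain of mapped terms from start to goal, or None."""
    graph = defaultdict(set)
    for a, b in mappings:
        graph[a].add(b)
        graph[b].add(a)
    queue, previous = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

path = composite_path(direct, TERM_A, TERM_C)
```

Here `path` runs from Term A through Term B to Term C, i.e. the composite mapping $A \to C$ is found even though A and C share no label.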

We extended this notion to generate graphs of such composite mappings ( $M$ ) between different terms across all BioPortal ontologies, without predefining any particular ontology as a background ontology. We extracted preferred labels ( $L$ ), exact synonyms ( $S_{E}$ ), related synonyms ( $S_{R}$ ), and other synonyms ( $S_{O}$ ) from the sources listed in Table 2.

We normalized the labels and synonyms by first removing a set of 126 common English stop words (e.g. “of”), and then converting them to count vectors. We calculated cosine similarities between each pair of these string vectors and established a mapping if the similarity was >95%. Due to the size and relatively reduced importance of $S_{O}$, we also considered bi-gram phrases of words in similarity calculations.
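
A minimal Python sketch of this matching step is given below. The short `STOP_WORDS` set is a stand-in for the 126-word list, which is not reproduced here. Lowercasing and whitespace tokenization are assumptions, and the bi-gram variant used for $S_{O}$ is not covered.

```python
import math
from collections import Counter

STOP_WORDS = {"of", "the", "a", "an", "in"}  # stand-in for the 126-word list

def count_vector(label):
    """Word-count vector of a label after stop-word removal."""
    words = [w for w in label.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def is_mapped(a, b, threshold=0.95):
    """Establish a mapping when the similarity exceeds the threshold."""
    return cosine(count_vector(a), count_vector(b)) > threshold

word_order_match = is_mapped("Heart Muscle", "Muscle of Heart")
no_shared_words = is_mapped("Heart Muscle", "Myocardium")
```

Count vectors ignore word order, so “Heart Muscle” and “Muscle of Heart” are mapped once “of” is removed, whereas “Heart Muscle” and “Myocardium” are only linked indirectly through a composite mapping.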

We generated 5 different overlap modules from different combinations of composite mappings:

$L G : \{\forall m \in L L\}$

$L E G : \{\forall m \in L L \cup L S_{E} \cup S_{E} S_{E}\}$

$L E R G : \{\forall m \in L L \cup L S_{E} \cup L S_{R} \cup S_{E} S_{E} \cup S_{E} S_{R} \cup S_{R} S_{R}\}$

$L E R O G : \{\forall m \in L L \cup L S_{E} \cup L S_{R} \cup L S_{O}\}$

$X G : \{\forall m \in M\}$

The $L G$ overlap module contains only the mappings performed using the properties from the $L$ set defined in Table 2 (that is, skos:prefLabel, rdfs:label, dc:title). In addition to the label–label mappings, the $L E G$ overlap module also includes the label–exact synonym and exact synonym–exact synonym mappings.

The final $X G$ overlap module contains all the composite mappings in $M$ . We removed the edges that were present in the three reuse modules from $L E G$ (i.e., overlapping terms that were already reused), to find the overlap–reuse gap. This new module is called $L E G - {Reuse}$ , where ${Reuse} = {IRI} \cup {xref} \cup {CUI}$ . The $L E R O G$ overlap module and $L E G - {Reuse}$ module are shown in Fig. 3b and c.

In the next step, we identified those terms that had the same source ontology and identifier, but a different IRI representation, and no explicit mappings (e.g., OBO:owlapi/fma#FMA_31396 was used instead of OBO:FMA_31396). Such situations show that ontology developers intended to reuse a term, but they used different, and sometimes incorrect term representations. These situations do not represent actual reuse, and we marked such cases as intent for reuse (Section 5). We removed any interconnecting edges between terms that show an intent for reuse in $L E G - {Reuse}$ to generate the final module $L E G - {Reuse, Intent}$ .
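
One simple way to flag such candidates is to compare the trailing identifier of each IRI, as in the Python sketch below. It covers only the identifier comparison, not the check for missing explicit mappings. The helper names and the third IRI are illustrative assumptions, and this is not necessarily how the study detected these cases.

```python
import re
from collections import defaultdict

def local_identifier(iri):
    """Return the trailing fragment of an IRI, e.g. 'FMA_31396'."""
    return re.split(r"[#/:]", iri)[-1]

def intent_for_reuse(iris):
    """Group distinct IRIs that end in the same local identifier."""
    groups = defaultdict(set)
    for iri in iris:
        groups[local_identifier(iri)].add(iri)
    return {ident: sorted(group) for ident, group in groups.items() if len(group) > 1}

iris = [
    "OBO:FMA_31396",
    "OBO:owlapi/fma#FMA_31396",  # same identifier, different IRI representation
    "OBO:EX_0000001",            # unrelated, made-up identifier for contrast
]
candidates = intent_for_reuse(iris)
```

Only the two FMA representations are grouped together. A real pipeline would also need to confirm that the source ontology matches and that no xref or CUI mapping already links the terms.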

We calculated term overlap for each overlap module using the metric described in Eq. (2), where all nodes (terms $t_{i j}$ ) in $T_{j}$ (connected component of composite mappings) of the overlap module $M_{o}$ can be considered singular (Fig. 3).

For each of the five overlap modules, we conducted an empirical analysis of the composition of the term labels of 100 randomly selected components to determine the threshold of the maximum distance (mapping hops) between two leaf nodes for which a component $T_{j}$ can still be considered ‘pure’ (i.e., it contains terms that can still be considered similar). We found this maximum distance to lie between 8 and 10 mapping hops, depending on the overlap module.

We called the components that have mappings exceeding the maximum distance Hybrid Components. These components are “hybrid” because they contain terms that are likely not similar to each other, usually because of a faulty mapping. In essence, the hybrid components can also be broken down into smaller components that are joined by one incorrect edge caused by a faulty mapping. Term nodes in these smaller components may be similar to each other. In the example from Table 3, term $t_{3}$ has a faulty synonym Intercalated disk that links two smaller, relevant components $T_{1 a}$ and $T_{1 b}$ creating a hybrid component $T_{1}$ .

Table 3

An example of a hybrid component $T_{1}$, composed of terms $\{t_{i} \mid i = 1, 2, \dots, 7\}$. $T_{1}$ can be broken into two smaller, relevant components $T_{1 a}$ and $T_{1 b}$ that are connected by an incorrect mapping caused by a synonym of term $t_{3}$

Component ( $T_{1 a}$ )	Component ( $T_{1 b}$ )
$t_{1}$ Myocardium	$t_{4}$ Intercalated Disk
$t_{2}$ Cardiac Muscle	$t_{5}$ Intercalated-Disc
$t_{3}$ Heart Muscle	$t_{6}$ Discus Intercalatus
$t_{3}$ (Intercalated disc) →	$t_{7}$ Intercalated Disc

We calculated another term overlap estimate, which we called Non-hybrid Term Overlap, by excluding hybrid components from consideration in our metric. By excluding hybrid components altogether from this estimate, we set a lower bound on our estimated term overlap.
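
The hybrid test can be sketched in Python as a check on the longest shortest path (diameter) of a component. The sketch measures distances between all pairs of nodes rather than only between leaf nodes, uses 10 hops as an example threshold from the reported range of 8 to 10, and builds two made-up chain-shaped components. All names are hypothetical.

```python
from collections import deque

def eccentricity(graph, start):
    """Largest number of mapping hops from start to any reachable node (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())

def diameter(graph):
    """Maximum distance between any two nodes of a connected component."""
    return max(eccentricity(graph, node) for node in graph)

def is_hybrid(graph, max_hops=10):
    """A component is treated as hybrid when its diameter exceeds the threshold."""
    return diameter(graph) > max_hops

def chain(n):
    """Made-up component in which n terms are mapped one after another."""
    graph = {i: set() for i in range(n)}
    for i in range(n - 1):
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    return graph

short_component = chain(4)   # 3 hops end to end
long_component = chain(12)   # 11 hops end to end
```

With this threshold the short chain counts as ‘pure’, while the long chain is flagged as hybrid and would be left out of the non-hybrid estimate.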

3.4. Clustering to detect patterns of reuse

One goal of this work is to investigate whether the reuse within biomedical ontologies occurs in certain patterns that can be identified algorithmically. To this end, in Step 4 of our workflow (Fig. 2), we used a two-phase clustering approach on the IRI reuse module that we defined in Section 3.2. As a reminder, the IRI reuse module contains only IRI edges, which link terms that share the same IRI.

We excluded the $CUI$ and xref reuse modules from this analysis, as $CUI$ mappings and xref annotations are generally established a posteriori in the engineering process.

Using the terms in the $IRI$ reuse module, we generated a term–ontology matrix. The rows contain the terms that have been reused at least once (i.e., the term appears in at least 2 ontologies with the same IRI), and the columns contain the ontology in which the term appears. Whether a term exists in an ontology or not was indicated as 1 or 0 respectively, resulting in a very large, sparse, binary matrix.
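
Building such a matrix can be sketched in a few lines of Python. The ontology names, the `ex:` term IRIs and the function name are made up for this example. A dense list of lists is used only for readability, and a sparse representation would be needed at BioPortal scale.

```python
def term_ontology_matrix(ontology_terms):
    """Build a binary matrix of reused terms (rows) by ontologies (columns).

    ontology_terms maps an ontology name to the set of term IRIs it contains.
    """
    ontologies = sorted(ontology_terms)
    counts = {}
    for terms in ontology_terms.values():
        for iri in terms:
            counts[iri] = counts.get(iri, 0) + 1
    # Keep only terms that appear in at least two ontologies with the same IRI.
    reused = sorted(iri for iri, c in counts.items() if c >= 2)
    matrix = [[1 if iri in ontology_terms[o] else 0 for o in ontologies] for iri in reused]
    return reused, ontologies, matrix

ontology_terms = {
    "O1": {"ex:A", "ex:B", "ex:C"},
    "O2": {"ex:A", "ex:B"},
    "O3": {"ex:A", "ex:D"},
}
rows, cols, X = term_ontology_matrix(ontology_terms)
```

Terms `ex:C` and `ex:D` occur in a single ontology each and therefore do not appear as rows.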

As our term–ontology matrix $X$ is categorical ($n$ terms, $m$ ontologies), we used a k-modes algorithm [18] over 100 simulations with different values of $k$ to partition the terms into $k$ large, disjoint clusters. The initial step is similar to that of the k-means algorithm: $k$ unique terms are selected as cluster centroids $Z = \{Z_{1}, Z_{2}, \dots, Z_{k}\}$. The k-modes algorithm assigns a term $X_{i}$ to the cluster whose centroid $Z_{l}$ has the minimum distance $d (X_{i}, Z_{l})$ to it (Eq. (4)). The function $\delta (x_{j}, z_{j})$ (Eq. (3)) checks whether the term and the cluster centroid are both present in, or both absent from, an ontology $O_{j}$, in which case $\delta (x_{j}, z_{j}) = 0$. After each term is assigned to a cluster, new centroids are generated for each cluster based on the modes of the values for each ontology $O_{j}$ (i.e., if most terms in a cluster $Z_{l}$ are present in $O_{j}$, then $z_{l, j} = 1$). We iterated over these steps until the cluster centroids were stable. Over the 100 simulations, we chose the value of $k$ that gave a desirable measure of cluster compactness (minimum spread of each cluster) and separation (maximum distance between cluster centroids).

$\delta (x_{j}, z_{j}) = \begin{cases} 0 & \text{if } x_{j} = z_{j} \\ 1 & \text{if } x_{j} \neq z_{j} \end{cases} \qquad (3)$

$d (X_{i}, Z_{l}) = \sum_{j = 1}^{m} \delta (x_{i, j}, z_{l, j}) \qquad (4)$
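
The assignment and update steps built on Eqs. (3) and (4) can be sketched in plain Python as follows. This is a simplified illustration and not the implementation of [18] used in the study: the four binary term rows are made up, ties in the mode resolve to 0, ties in the assignment go to the lowest cluster index, and an empty cluster keeps its previous centroid.

```python
def delta(x, z):
    """Eq. (3): 0 if the two binary values agree, 1 otherwise."""
    return 0 if x == z else 1

def distance(term, centroid):
    """Eq. (4): number of ontologies on which term and centroid disagree."""
    return sum(delta(x, z) for x, z in zip(term, centroid))

def assign(terms, centroids):
    """Index of the closest centroid for each term."""
    return [min(range(len(centroids)), key=lambda idx: distance(t, centroids[idx]))
            for t in terms]

def update(terms, labels, centroids):
    """New centroid per cluster: the per-ontology mode (strict majority gives 1)."""
    new_centroids = []
    for idx, old in enumerate(centroids):
        members = [t for t, lab in zip(terms, labels) if lab == idx]
        if not members:
            new_centroids.append(list(old))  # keep the centroid of an empty cluster
            continue
        new_centroids.append([1 if 2 * sum(col) > len(members) else 0
                              for col in zip(*members)])
    return new_centroids

def k_modes(terms, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop changing."""
    for _ in range(max_iter):
        labels = assign(terms, centroids)
        new_centroids = update(terms, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Made-up term rows over four ontologies; two terms serve as initial centroids.
terms = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels, centroids = k_modes(terms, [terms[0], terms[2]])
```

On this toy input the first two terms end up in one cluster and the last two in the other, and the centroids stabilize after the second pass.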

For each pair of terms in each cluster, we computed a similarity score as follows:

$Sim (A, B) = \omega_{1} \left(\frac{| O_{A} \cap O_{B} |^{2}}{| O_{A} \cup O_{B} |}\right) + \omega_{2} \left(\frac{| S P_{A} \cap S P_{B} |}{| S P_{A} \cup S P_{B} |}\right) \qquad (5)$

In the equation above, $O_{A} \cap O_{B}$ indicates the set of common ontologies between terms A and B, $S P_{A} = \{x \mid x \supseteq A\}$ is the set of super terms of A, and $S P_{A} \cap S P_{B}$ indicates the set of common super terms of A and B.

As can be seen, the similarity measure is a weighted combination of the common-ontology score and the Jaccard semantic similarity. We set $\omega_{1} > \omega_{2}$, as we want to discern how ontology developers reused terms based on the set of ontologies in which these terms co-occur. We consider the proportion of shared terms to reduce the impact of owl:Thing and other upper-level ontology terms, which would be reused in many ontologies.
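
For concreteness, Eq. (5) can be evaluated with a few lines of Python. The weights 0.7 and 0.3 are arbitrary example values that merely satisfy $\omega_{1} > \omega_{2}$, and the ontology and super-term sets are made up. The guards against empty sets are an addition of this sketch.

```python
def similarity(onts_a, onts_b, supers_a, supers_b, w1=0.7, w2=0.3):
    """Eq. (5): weighted co-occurrence score plus Jaccard over super terms."""
    union_onts = onts_a | onts_b
    union_supers = supers_a | supers_b
    co_occurrence = len(onts_a & onts_b) ** 2 / len(union_onts) if union_onts else 0.0
    jaccard = len(supers_a & supers_b) / len(union_supers) if union_supers else 0.0
    return w1 * co_occurrence + w2 * jaccard

# Two sibling terms: same super terms, two of three ontologies in common.
sim_siblings = similarity({"O1", "O2", "O3"}, {"O1", "O2"}, {"P", "Q"}, {"P", "Q"})

# Two terms that share neither an ontology nor a super term.
sim_unrelated = similarity({"O1"}, {"O2"}, {"P"}, {"R"})
```

Because the intersection is squared, the first component is not bounded by 1, so the score is best read as a relative affinity between term pairs rather than a normalized similarity.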

We generated a term–term affinity matrix A, where $A_{i j} ⩾ 0$ represents the similarity between the terms i and j. We used Spectral Clustering [25] over this matrix to further partition each large cluster. This method uses the eigenvectors associated with the largest eigenvalues of the similarity matrix to perform dimensionality reduction before using k-means clustering in the reduced space. We performed 100 simulations with different values of $ω_{1}$ and $ω_{2}$ to isolate sub-clusters that are composed of terms from one source ontology only. Based on the current state of tools that support reuse, as well as the mental processing of the ontology developers, we assume that terms or groups of terms reused together in one session originate from the same source ontology.

3.5. Analyzing BioPortal Import Plugin logs

In Step 5 of our workflow (Fig. 2), we analyzed the logs generated by the BioPortal Import Plugin (see Section 3.1.2). We used this analysis for two purposes: (1) to gain knowledge of other reuse patterns that occur in user ontologies, and (2) to validate whether the insights generated from our clustering analysis are accurate.

The entries in the BioPortal logs are generated as the user performs operations in the user interface of the plugin. For example, if the user searches for a term in a BioPortal ontology using the plugin, the log will record a line corresponding to the search REST call made to BioPortal (see Listing 1). An import operation in the plugin would trigger other REST calls.

Fig. 4.

Histogram depicting the number of ontologies that reuse a given percentage (%) of terms from other ontologies in their current versions by the same IRI or xref annotation. Most ontologies reuse fewer than 5% of their terms.

As we do not have access to the user ontologies into which the BioPortal terms have been imported, the only sources we have are the time- and IP-stamped BioPortal call logs. Therefore, we had to reverse-engineer these logs to find out the actions that the users have taken in the user interface, and to identify which BioPortal terms are being reused (i.e., imported) together.

We documented the algorithm we used to reverse-engineer the logs in the additional online materials (http://onto-apps.stanford.edu).

As a result of running the reverse-engineering algorithm on BioPortal logs, we obtained term sets that have been reused (i.e., imported) together in user ontologies. Then, we mapped the extracted terms to existing terms in the current version of the source BioPortal ontology to find the overall depth of tree imports and the location of these terms and subtrees. We used this information as an additional source of reuse patterns, and also to validate the hypotheses made from clustering analysis (Section 3.4).
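
The actual reverse-engineering algorithm is documented in the online materials and is not reproduced here. The Python sketch below only illustrates the general idea of grouping time- and IP-stamped calls into sets of terms imported together. The record layout `(ip, timestamp, term)`, the sample records and the 30-minute gap are all assumptions made for this example.

```python
from collections import defaultdict

def sessions(records, max_gap_seconds=1800):
    """Group (ip, timestamp, term) records into per-IP sessions split at long gaps."""
    by_ip = defaultdict(list)
    for ip, timestamp, term in records:
        by_ip[ip].append((timestamp, term))
    result = []
    for ip in sorted(by_ip):
        current, last_timestamp = [], None
        for timestamp, term in sorted(by_ip[ip]):
            if last_timestamp is not None and timestamp - last_timestamp > max_gap_seconds:
                result.append((ip, current))  # a long pause closes the session
                current = []
            current.append(term)
            last_timestamp = timestamp
        result.append((ip, current))
    return result

# Made-up records: anonymized IP, seconds since the first call, imported term.
records = [
    ("ip1", 0, "T1"),
    ("ip1", 60, "T2"),
    ("ip1", 10_000, "T3"),
    ("ip2", 30, "T9"),
]
grouped = sessions(records)
```

Terms `T1` and `T2` fall into one session, while `T3` starts a new one because it follows a pause longer than the assumed gap.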

4. Results

We now present the results of each of the methods that compose our workflow (Fig. 2), described previously in Section 3.

4.1. Reuse

Previously, we found that most ontologies reuse less than 5% of the total terms in their current versions, using either the same IRI or xref annotations [20]. Out of 377 BioPortal ontologies, 156 did not reuse any term using the IRI construct, and 315 did not reuse through xref. Moreover, ontologies reused terms from a small set of popular ontologies only. More than 250 ontologies have no terms reused. Figure 4 shows histograms of the percentage of terms that are reused by other ontologies. We also observed that there are 20 ontologies that reuse between 95% and 100% of their total terms. These ontologies are developed by reusing combinations of multiple ontologies (e.g., CCONT reuses terms from EFO, NCBITAXON, ORDO, and 19 other ontologies).

Using our $CUI$ construct, we found that: i) popular UMLS terminologies such as ICD10CM (ICD10 – Clinical Modification), LOINC (Logical Observation Identifiers Names and Codes), HL7 (Health Level Seven Reference Implementation Model, Version 3) and MESH (Medical Subject Headings) are composed primarily of unshared, unique terms, ii) procedural terminologies such as HCPCS (Healthcare Common Procedure Coding System), CPT (Current Procedural Terminology) and ICD10PCS (ICD10 – Procedure Coding System) have very few terms mapped to the same CUI, and iii) several new terms were introduced in ICD10CM during the migration from ICD9CM, potentially impacting reuse [20].

Fig. 5.

Top 16 ontologies whose terms are reused the most through $IRI$ and xref constructs. Number of ontologies reusing (#) and percentage (%) of terms reused with respect to the terms in their current version.

Table 4

Term overlap (actual and hybrid-adjusted) estimated for different overlap modules composed of different mappings

Row #	Overlap Module	Terms #	Components #	Term Overlap (TO)	Hybrid Components # (Terms #)	Non-hybrid TO
1	$L G$	2,230,636	781,007	25.39%	10 (1,119)	25.37%
2	$L E G$	2,485,478	759,571	30.18%	1,187 (279,635)	25.31%
3	$L E R G$	2,565,928	755,816	31.65%	725 (361,120)	25.35%
4	$L E R O G$	2,475,905	744,314	30.28%	868 (289,090)	25.24%
5	$X G$	2,620,032	746,993	32.75%	270 (431,831)	25.21%
6	$L E G - {Reuse}$	1,789,407	553,114	21.62%	182 (195,139)	18.21%
7	$L E G - {Reuse, Intent}$	1,232,149	284,499	16.57%	178 (192,475)	13.21%

The 16 ontologies whose terms are reused the most from the first 2 constructs (IRI and xref) are shown in Fig. 5. The plot indicates the number of ontologies (#) that reuse terms from a given ontology as dots, and the percentage of terms (%) that are reused with respect to the number of terms in their current version as bars. For example, 95.2% of the total terms in the current version of GO are reused using the same IRI by 74 ontologies. Also, 3.7% of the total GO terms are xref-linked in 37 ontologies.

It is easily noticeable that most of these are popular or upper-level ontologies, some of which have more than 100% of their terms reused (e.g., we found 101 different versions of Basic Formal Ontology – BFO IRIs, whereas the current version only has 39 terms). As we have discussed [20], this anomaly is due to the fact that ontology developers tend to reuse terms with different versions, notations, or namespaces, that are sometimes incorrect and have no explicit mappings to the original term. We do not consider this case as reuse, but rather an intent for reuse, and we discuss it in Section 5.

Using the updated metric described in Section 3.2, we found term reuse to be 6.63% for the $IRI$ reuse module, 5.98% for the xref reuse module, and 8.39% for the $CUI$ reuse module.

4.2. Overlap

4.2.1. Term overlap

In our previous work [20], we determined term overlap using a naive approach. We found a total of 2,023,854 terms sharing 752,177 unique labels across the BioPortal ontologies. Using the new metrics described in Section 3.3, we can calculate this naive term overlap to be $22.23 %$ . In addition, the new metrics allowed us to compute more precise overlap statistics that we show in Table 4.

The $L G$ module is the most similar to our previous naive term overlap method, as this module contains only mappings $\forall m \in L L$ (label–label mappings). However, there is a substantial increase in the level of the term overlap from $22.23 %$ to $25.37 %$ (non-hybrid term overlap).

Fig. 6.

30% term overlap among different BioPortal ontologies. For simplicity, only the OBO Foundry member and candidate ontologies (blue squares), UMLS terminologies (red circles), and a few popular ontologies in BioPortal (green octagons) are shown here.

Once we also include the other types of mappings using synonyms (rows 2–5 in Table 4), the term overlap gradually increases up to $32.75 %$ , although the number of hybrid components also increases. It is noteworthy that the non-hybrid term overlap remains nearly identical to the term overlap of the $L G$ module (≈25%).

Rows 6 and 7 in Table 4 show that after removing all three reuse modules (cf. Section 3.3), the term overlap decreases to the range ( $18.21 %$ , $21.62 %$ ). On evaluating the $L E G - {Reuse, Intent}$ module, we find that the term overlap drops further to $(13.21 %, 16.57 %)$ . Obviously, this term overlap statistic captures only the intent for reuse rather than actual reuse.

4.2.2. Ontology overlap

As a next step, we investigate how the term overlap reflects on ontology overlap. Therefore, we mapped the nodes in the $L E G - {Reuse}$ module to their respective ontologies, and created an edge between all the pairs of ontologies, if there existed an edge between the nodes (i.e., $\forall e = (n_{1}, n_{2})$ s.t. $e \in L E G - {Reuse}$ , $n_{1} \in \{O_{1}, O_{2}\}$ , $n_{2} \in \{O_{3}\} \Rightarrow \{e(O_{1}, O_{3}), e(O_{2}, O_{3})\}$ ). After removing all the terms and aggregating all edges between two ontology nodes to a single edge with a weight $w = \sum e$ , we have an undirected ontological overlap graph with edges depicting the term overlap between two ontologies.
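The aggregation step above can be illustrated with a minimal sketch. It assumes the overlap module is available as a list of term-level edges plus a lookup from each term node to the set of ontologies that contain it; the function name, the input shapes, and the choice to skip edges within a single ontology are assumptions made for illustration, not the implementation used in this study.

```python
from collections import Counter
from itertools import product


def ontology_overlap_graph(term_edges, term_ontologies):
    """Aggregate term-level overlap edges into weighted ontology-level edges.

    term_edges: iterable of (n1, n2) term pairs from an overlap module.
    term_ontologies: dict mapping a term node to the set of ontologies containing it.
    Returns a Counter keyed by an unordered ontology pair (frozenset) with weight w.
    """
    weights = Counter()
    for n1, n2 in term_edges:
        # Every ontology of n1 is linked to every ontology of n2.
        for o1, o2 in product(term_ontologies[n1], term_ontologies[n2]):
            if o1 != o2:  # assumption: overlap within one ontology is ignored
                weights[frozenset((o1, o2))] += 1
    return weights


# Toy data mirroring the formula: n1 belongs to {O1, O2}, n2 belongs to {O3}.
term_ontologies = {"n1": {"O1", "O2"}, "n2": {"O3"}, "n3": {"O1"}}
term_edges = [("n1", "n2"), ("n3", "n2")]
graph = ontology_overlap_graph(term_edges, term_ontologies)
for pair, w in sorted(graph.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), w)
```

Keying edges by an unordered pair keeps the graph undirected; a directed, percentage-based view such as the one in Fig. 6 would additionally need the number of terms in each ontology to normalize the weights.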

We generated a directed sub-graph (Fig. 6) between those ontologies that have more than $30 %$ term overlap with respect to any one of the connected ontologies. Note that, for simplicity, Fig. 6 only includes the OBO Foundry member and candidate ontologies (blue squares), UMLS terminologies (red circles), and a few popular ontologies in BioPortal (green octagons). Including all the ontologies in this graph would create an indecipherable visualization. The interactive visualization is available in the online materials (http://onto-apps.stanford.edu).

Figure 6 shows that there is substantial overlap among ontologies generated independently through the OBO Foundry and UMLS methodologies. The overlap between BFO and the OBO Foundry candidate ontologies is caused by the fact that the candidate ontologies import BFO as their upper-level ontology, but they use different (incorrect) IRI representations. It is also noteworthy to see that the UMLS terminologies for adverse events, namely World Health Organization Adverse Reaction Terminology (WHO-ART), Coding Symbols for a Thesaurus of Adverse Reaction Terms (COSTART), and the Medical Dictionary for Regulatory Activities (MEDDRA), have substantial term overlap. The lower region of the graph shows several anatomical ontologies (CARO, UBERON, XAO, TAO, FMA, MA, TGMA, etc.), in which term overlap is obvious (similar anatomical features), but is debatable – most terms represent anatomical parts that may not be necessarily equivalent, as they belong in different organisms. Finally, the top-right corner shows the overlap between the RxNorm Vocabulary and the Drug Ontology (DRON). These results and the intent for reuse are described in detail in Section 5.

4.3. Clustering

The first step of our two-phase clustering approach was to use a k-modes algorithm over simulations for $k = 2 \to 100$ . We computed cluster compactness and separation by computing the cosine distance between the set of ontologies in one cluster against another. The desired cluster compactness and separation value was found to be at $k = 6$ , after which we would have overlapping clusters, or clusters with single terms.
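As an illustration of the distance used here, the sketch below computes the cosine distance between two sets of ontologies by treating each set as a binary indicator vector, in which case the cosine similarity reduces to $|A \cap B| / \sqrt{|A| \cdot |B|}$. How the per-cluster ontology sets were assembled, and how the distances were combined into compactness and separation scores, is not shown; the function name and the example sets are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sets viewed as binary indicator vectors."""
    if not a or not b:
        return 1.0  # convention chosen here: an empty set is maximally distant
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical ontology sets for two clusters.
cluster_a = {"GO", "CL", "CCO"}
cluster_b = {"GO", "CL", "FYPO"}
print(round(cosine_distance(cluster_a, cluster_b), 4))
```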

The primary ontological composition of the clusters was determined from the ontologies common among terms in a cluster, and is shown in Table 5. It should be noted that $IRI$ reuse was rarely found in UMLS terminologies with the exception of NCBITAXON, NCIT, and SNOMED CT. The primary ontological composition of the terms in the large clusters either consists of: i) ontologies that frequently reuse terms from one major source ontology (e.g. CHEBI, GO, NCIT, DOID) in that cluster, or ii) one main ontology that reuses terms from multiple other ontologies and exhibits >90% reuse, e.g. CCONT.

Table 5
Primary ontological composition of the clusters

Cluster	Ontologies
Cluster 1	HINO, BIOMODELS, CHEBI, CCO, DRON, BDO
Cluster 2	GO, NIFSTD, GO-EXT, FYPO, CCO, NIGO, CL
Cluster 3	GWAS_EFO_SKOS, EFO, EFOGWAS, CCONT, CLO
Cluster 4	SYN, CSEO, SOPHARM, SNPO, IFAR, NCIT
Cluster 5	PHENOSCAPE-EXT, UBERON, NIFSTD, CL, CLO
Cluster 6	NIFSTD, ERO, DOID, CLO, NIFCELL, NIFDYS

Fig. 7.

Proportion of term pairs with semantic similarity in a given range for each sub-cluster.

Fig. 8.

BioPortal Import Plugin Log Analysis: Few ontologies that are reused the most through the BioPortal Import Plugin are shown — FMA, ICD10PCS, NCIT and SNOMED CT. The lower plot indicates the total number of sessions observed, the total number of single terms imported, the total number of structures imported, and the total number of terms imported in log scale. The upper plot indicates the content imported from each ontology spanning across its depth. Each structure imported is represented as a translucent polygon, whereas the single terms are grouped as circular shapes for each level.

We computed an affinity matrix among all pairs of terms in a given cluster using weights $ω_{1} = 0.85, ω_{2} = 0.15$ . These values were again generated after a set of 100 simulations, so that most of these sub-clusters are generally composed of individual source ontologies.

After executing spectral clustering using the affinity matrix, we divided all the term pairs in each sub-cluster into 2 bins, based on their Jaccard semantic similarity measure (<0.9 in Bin 1, and >0.9 in Bin 2). We plotted the proportion of term pairs in each bin for each cluster. Cluster 4 is shown in Fig. 7. In Cluster 4, more than 70% of the term pairs in any given sub-cluster have a semantic similarity in the range 0.9–1.0, indicating that these are either sibling terms or that one term is the direct superclass of the other. Generally, we found this to be the case for all the large clusters of the first kind. This finding likely indicates that ontology developers reusing terms from one main source ontology tend to reuse hierarchical subtrees mainly composed of terms with parent–child or sibling relations. This was less evident in the second kind of large clusters, where the proportion ranged between 30% and 60% of term pairs.

We mapped these sub-clusters to their location in the source ontology. We found that most of these 2-level substructures are located in the higher or upper-middle levels of the ontology. Hence, developers reuse terms from the higher levels in the ontological hierarchy of a small set of popular ontologies, and seldom reuse leaf nodes.

4.4. BioPortal Import Plugin log analysis

We found a total of 3,538 distinct IP addresses originating from 90 different countries, from which ontology developers used the BioPortal Import Plugin to search and reuse terms from BioPortal ontologies. We were able to isolate 5,755 individual terms and 2,139 ontological subtrees imported from 40 different ontologies in 516 distinct sessions. For a given IP address, we define a session as a time period with no break longer than 1 hour between two consecutive REST API calls. We found a total of 195,894 terms that users imported using the plugin. Out of these, we were able to map 193,601 terms to terms in the current versions of the BioPortal ontologies. The remaining terms were either deprecated, or were terms such as owl:Thing and time#datetimedescription that do not have a designated source ontology.
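The session definition can be made concrete with a short sketch. It assumes each log entry has already been reduced to an (IP address, timestamp) pair; the function name, the input shape, and the example IP addresses are hypothetical, and the actual reverse-engineering algorithm is the one documented in the online materials.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(calls, max_gap=timedelta(hours=1)):
    """Group (ip, timestamp) calls into per-IP sessions.

    A new session starts whenever two consecutive calls from the same IP
    are separated by more than max_gap.
    """
    by_ip = defaultdict(list)
    for ip, ts in calls:
        by_ip[ip].append(ts)

    sessions = []
    for ip, stamps in by_ip.items():
        stamps.sort()
        current = [stamps[0]]
        for prev, ts in zip(stamps, stamps[1:]):
            if ts - prev > max_gap:
                sessions.append((ip, current))
                current = []
            current.append(ts)
        sessions.append((ip, current))
    return sessions


# Hypothetical log entries.
calls = [
    ("10.0.0.1", datetime(2015, 3, 5, 9, 0)),
    ("10.0.0.1", datetime(2015, 3, 5, 9, 30)),
    ("10.0.0.1", datetime(2015, 3, 5, 11, 0)),  # 90 minutes later: new session
    ("10.0.0.2", datetime(2015, 3, 5, 9, 10)),
]
sessions = split_sessions(calls)
print(len(sessions))
```

Identifying a user by IP address alone is a simplification: several developers behind one address would be merged into a single session.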

The top 10 ontologies by number of sessions were SNOMEDCT, NCIT, BFO, ABA-AMB, FMA, GO, RCD, AMINO-ACID, HP and IAO, whereas the top 10 by number of imported terms were ICD10PCS, SNOMEDCT, NCIT, ICD9CM, LOINC, BIRNLEX, ABA-AMB, FMA, RCD and SHR.

The ontologies that were reused the most through the plugin, either by number of sessions or by number of terms, are shown in Fig. 8. The total number of sessions observed, total number of single term imports, total number of structures imported, and total number of terms imported are shown as a bar plot. The structure of the content imported from each source ontology is shown across the depth of the ontology: the imported structures are shown as translucent blue polygons, and the terms imported (either single or as a group) are shown as circular constructs, grouped according to the level. The depth of each ontology was retrieved from the BioPortal repository. The width of a structure on each level indicates the number of terms imported on that level in log scale. The radius of a circular construct represents the total number of terms on that level. For clarity, we have only shown 4 ontologies – FMA, ICD10PCS, NCIT and SNOMEDCT. The website (http://onto-apps.stanford.edu) contains interactive versions of these plots with 16 different ontologies.

In general, we found that more people reuse terms from OBO Foundry ontologies (higher number of sessions detected) than from UMLS terminologies when using the BioPortal Import Plugin, with the exception of NCIT and SNOMED CT. However, users who import from UMLS terminologies tend to reuse a larger number of terms, in the form of complete hierarchical structures, during a single import session.

In the cases of ICD10PCS and ICD9CM, we found that the users reuse the entire hierarchy of these ontologies starting from the root node, into their target ontology. We observed the same pattern also in the case of the BFO, but it is expected as it is an upper level ontology. In almost all the other cases, we found that the ontology developers simply reuse terms from the higher or upper-middle levels in an ontological hierarchy, and the lower leaf nodes and structures are seldom reused. This reuse pattern can be seen in the FMA ontology in Fig. 8. We found the same reuse pattern in GO, CHEBI, NCBITAXON and LOINC (http://onto-apps.stanford.edu). As is clearly evident from the SNOMED CT and NCIT, most ontology developers generally import 2-level sub-trees composed of parent–child and sibling terms. These structures are represented as triangular polygons of similar dimensions along the midline of the respective visualizations in Fig. 8 with a higher opacity than other structures.

4.5. Reuse and overlap visualization on the Web

One of the contributions of our work is a general-purpose visualization of reuse and overlap among biomedical ontologies that employs the reuse and overlap modules, which we generated as part of this work. The Web application also allows users to search for similar terms by providing any string or an IRI as an input. In case of a string, the application matches the name to the set of the most similar terms that have it as a label or a synonym. We believe such an application is of general interest, and we make it available to the community through our website (http://onto-apps.stanford.edu/).

The application performs a depth-first search against the $X G$ module, and returns all composite mappings in which each matched term is a node. The results are displayed in a tabular or a force-directed network layout. The interactive force-directed network visualization allows users to explore reuse dependencies and overlap among BioPortal ontologies. Our website also provides access to the module graphs, and the analysis results of the BioPortal Import Plugin logs.
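The lookup can be sketched as an iterative depth-first search over an adjacency map. The sketch assumes that the composite mapping of a term corresponds to the set of nodes reachable from it in the module graph; the term identifiers and the dictionary representation are hypothetical and do not reflect how the Web application stores the $X G$ module.

```python
def composite_mapping(graph, start):
    """Return every node reachable from start using an iterative depth-first search.

    graph: dict mapping a term node to the set of terms it is mapped to
    (each undirected edge is stored in both directions).
    """
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in graph.get(node, ()) if n not in seen)
    return seen


# Hypothetical module graph: three linked terms and one isolated term.
graph = {
    "A:heart_muscle": {"B:cardiac_muscle"},
    "B:cardiac_muscle": {"A:heart_muscle", "C:myocardium"},
    "C:myocardium": {"B:cardiac_muscle"},
    "D:spleen": set(),
}
print(sorted(composite_mapping(graph, "A:heart_muscle")))
```

An explicit stack avoids hitting Python's recursion limit on long mapping chains, which matters for components that contain many thousands of terms.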

5. Discussion

5.1. Term reuse

As seen in Fig. 4, we are seeing the full spectrum of reuse from 0–100%, but in general, reuse is fairly low. Not only do most ontologies in BioPortal never reuse terms, their terms are also never reused by other ontologies, which is contrary to the reference-application paradigm considered in the ontology engineering process. However, we did find some ontologies that are approaching complete reuse. For example, the Mental Functioning Ontology (MF) [17], reuses 91.33% of its terms from 6 different ontologies. Our clustering analysis shows that not only single terms are reused, but also entire hierarchical structures of the source ontologies are reused. Ontology engineers need semi-automated tools to support both cases.

Generally, well-established ontologies and controlled terminologies do not reuse terms from other ontologies. Usually, these ontologies are built by large organizations (e.g., NCI, WHO, IHTSDO). Some of these organizations are making concerted efforts to take advantage of reuse. For example, ICD-11 and SNOMED CT are trying to define a common core ontology to be reused by both [31]. Such collaborations may generate a set of best practices for ontology reuse in the future.

Table 6
Different kinds of IRI representations observed in BioPortal ontologies and BioPortal Import Plugin logs

Type	Source	Representation	Few Observed Examples
Versions	BFO	www.ifomis.org/bfo/1.1*	(AERO) Adverse Event Reporting Ontology
Versions	BFO	www.ifomis.org/bfo/1.0	(SAO) Subcellular Anatomy Ontology
Versions	NCIT	NCIT:C53037*	(NCIT) National Cancer Institute Thesaurus
Versions	NCIT	NCIT:Cerebral_Vein	(CSEO) Cigarette Smoke Exposure Ontology
Notations	FMA	OBO:FMA_31396*	(VO) Vaccine Ontology
Notations	FMA	OBO:owlapi/fma#FMA_31396	(BIOMODELS) BioModels Ontology
Notations	FMA	OBO:owl/FMA#FMA_31396	(EP) Cardiac Electrophysiology Ontology
Notations	FMA	OBO:fma#Cartilage_of_inferior…	BioPortal Import Plugin Logs
Namespaces	BFO	www.ifomis.org/bfo/	(ADO) Alzheimer’s Disease Ontology
Namespaces	BFO	purl.obolibrary.org/obo/BFO_	(IDO) Infectious Disease Ontology
Namespaces	SNOMED CT	ihtsdo.org/snomedct	(SNOMED CF) SNOMED Clinical Findings
Namespaces	SNOMED CT	purl.bioontology.org/ontology/SNOMEDCT	(IFAR) Fanconi Anemia Ontology
Namespaces	FMA	sig.uw.edu/fma#	(BDO) Bone Dysplasia Ontology
Namespaces	FMA	purl.obolibrary.org/obo/FMA_	(SDO) Sleep Domain Ontology

(*) marks the recommended representation(s)

Through the empirical analysis of the BioPortal Import Plugin logs, as well as the generated clusters and overlap modules, we found some reuse patterns that show that ontology developers have the intention to reuse terms. Essentially, these are IRI patterns that generally have the same identifier and source ontology, but that are reused from different versions of the source ontology, or represented using different notations or namespaces. These patterns cannot be considered as term reuse, as the IRIs use different, and often incorrect, representations for the same terms, and no explicit CUI or xref mappings were found. Hence, the advantages of term reuse cannot be realized. By using the correct IRI representation, the term overlap could be reduced substantially. We summarize these IRI patterns in Table 6, and provide a few examples for each. We also indicate the recommended representation, where possible.

We found several cases, in which an ontology reuses the same terms from different ontologies, and these terms are not linked by a reuse construct. For example, the BioModels Ontology (BIOMODELS) reuses the same terms from two different ontologies: i) Hepatic Oval Stem Cell from Cell Ontology (CL) and Foundational Model of Anatomy (FMA), and ii) Xanthopore from CL and Gene Ontology (GO). Even if these terms are likely equivalent, there is no reuse construct that links them.

Based on the observations from this study that show only modest reuse among biomedical ontologies, we believe that ontology engineers would benefit from better guidelines, along with improved tools, to increase term reuse.

5.2. Term overlap

In 2010, a systematic analysis of all the OBO Foundry ontologies outlined consistent term overlap, yet minimum term reuse, and commented on the limitations and challenges to achieve orthogonality [14]. Five years later, we extended this analysis and estimated term reuse and overlap over the entire continuum of biomedical ontologies (including UMLS terminologies) in the BioPortal repository. We found that we are still very far from achieving desirable term reuse [20]. Most ontologies exhibit considerably less than 5% reuse or no reuse through any constructs, and generally reuse terms from only a small set of ontologies.

The OBO Foundry mandates reuse by candidate ontologies from the member ontologies under its orthogonality aim. However, there is still substantial term overlap present among biomedical ontologies, including OBO Foundry ontologies.

In our previous analysis, we used a conservative approach to determine term overlap. As a result, lexically-different terms that may be similar, and can be categorized under term overlap, were considered different. Using our approach of tokenization and removal of stop words, we were able to map terms with labels such as “Muscle of Heart” and “Heart Muscle”, whereas, through different overlap modules of composite mappings from preferred labels and synonyms, we were able to link “Heart Muscle”, “Cardiac Muscle”, “Myocardium”, and also terms in other languages such as “Myocarde”@FR and “Herzmuskel”@DE. The estimated term overlap through these overlap modules ranges from 25% to 31.5%.
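The effect of tokenization and stop-word removal can be shown with a small sketch that reduces each label to an order-independent bag of tokens, which is equivalent to comparing count vectors. The stop-word list and the function name are illustrative assumptions; the list actually used in this study is not reproduced here.

```python
import re

# Illustrative stop-word list (assumption); the study's own list is not shown.
STOP_WORDS = frozenset({"of", "the", "was", "and", "in"})


def label_key(label):
    """Reduce a label to a sorted tuple of lower-cased, non-stop-word tokens."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return tuple(sorted(t for t in tokens if t not in STOP_WORDS))


print(label_key("Muscle of Heart") == label_key("Heart Muscle"))
print(label_key("WAS Gene") == label_key("Gene"))
```

Both comparisons evaluate to true: the first is the intended match, while the second reproduces the false grouping of “WAS Gene” with “Gene” that arises once “was” is treated as a stop word.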

Our approach for detecting overlap has certain limitations.

Terms with labels such as “Second phalange of the third finger” and “Third phalange of the second finger”, and also “WAS Gene” (Wiskott-Aldrich syndrome) and “Gene” will be grouped together – due to count vectors and the exclusion of the stop word “was” respectively.

Lexically-similar terms in different ontologies may represent different concepts (e.g., anatomical concepts like spleen between Zebrafish Anatomy (ZFA) and Xenopus Anatomy (XAO)).

Some biomedical ontologies use different classes for the same concept to show evolutionary or developmental stages (e.g. Myocardium in Human Development Anatomy, Timed (EHDA) and Abstract (EHDAA) ontologies). We group these classes under term overlap, but they may be different.

Some ontologies may instantiate a synonym relation between terms that can actually have an “is part of” relation. This choice can lead to false composite mappings (e.g. Cranium has the synonyms Skull in the Teleost Anatomy Ontology (TAO)).

Some ontologies use chemical formulas as synonyms. Terms with the same chemical formula may be stereoisomeric molecules or completely different compounds (e.g., (+)-Menthofuran and Safranal ( $C_{10} H_{14} O$ )). This challenge has also been seen during alignment of different biomedical vocabularies for federated search, where Aspirin and Acetylsalicylic acid are the same but L-Glucose and D-Glucose are not the same [16].

Hence, the term overlap estimates should be interpreted cautiously, and can serve as an upper bound to the actual term overlap. Overlapping nodes that are at a path distance of more than 2 edges are generally different, especially if the edges $e \notin {L L, L S_{E}}$ . To bring these estimates closer to actual overlap, we introduced the concept of bigram similarity for $e \in S_{O} S_{O}$ and hybrid components, and the resultant term overlap is closer to the one derived from the $L G$ module.

5.3. Clustering

One of the key challenges that we encountered while clustering was the fact that we were dealing with a large number of terms (compared to the features), resulting in a large $n \times m$ matrix where $n \gg m$ . Also, as the initial matrix consisted only of the IRI-reused term–ontology pairs, each of which is reused on average by 2–3 ontologies, we had a very sparse binary matrix. There are various methods to deal with such large, multi-dimensional matrices, ranging from MapReduce [8] to simple candidate generation [21]. Our two-phase approach allowed us to divide the term–ontology pairs into large distinct clusters of terms shared between some common group of ontologies. We could then also include the semantic hierarchy of these terms in the different shared ontologies for a subsequent spectral clustering. We believe that our similarity equation can be extended to incorporate other features such as co-occurrence of these terms in PubMed annotations, and our generated term–term affinity matrix can be used in an item-based collaborative filtering method to generate recommendations for reuse.

From clustering, we claim the following hypotheses: i) ontology developers reuse hierarchical subtrees along with single terms, ii) the proportion of term pairs that have parent–child or sibling relations can be very high, especially if the reuse occurs from one main source ontology and iii) these terms are located on higher levels or upper-middle levels of ontological depth.

5.4. BioPortal Import Plugin log analysis

As was observed from our term reuse analysis across BioPortal ontologies, ontology developers only import terms from a small set of popular ontologies in BioPortal using the BioPortal Import Plugin. From our analysis of the logs, it is apparent that: i) ontology engineers have imported hierarchical subtrees of varying depths along with single terms, ii) the most common reuse structures are 2-level structures – parent–child structures (triangles with a higher opacity in Fig. 8), and iii) these structures and terms are located in the higher and upper-middle levels of the ontological hierarchy.

Hence, we can say that the claims made from our clustering analysis (Section 5.3) are validated through our BioPortal Import Plugin log analysis. As future work, we plan to do a more formal validation of this finding. Moreover, for some ontologies that were common to both of our analyses (e.g. NCIT, GO and FMA), we found a substantial similarity between some sub-clusters and the reuse structures extracted from the logs (results online). The similarity ranged between 70% and 100% for NCIT structures. This similarity suggests either that the ontologies developed using the BioPortal Import Plugin were saved back to the BioPortal repository, or that there are recurring patterns in some ontologies that are reused frequently in different ontologies.

From this validation, we can postulate that our approach used for the two-phase clustering process, using the similarity equation and the term–term affinity matrices, accurately captures the thought process of the ontology engineer, when she reuses terms, and it can be coupled with the BioPortal Import Plugin to provide reuse recommendations in the future. The clustering only used the terms in the IRI reuse module, and might be biased towards OBO Foundry ontologies, and not generate enough UMLS recommendations (as they are seldom reused using the same IRI). Hence, our initial term–ontology matrix and the similarity equation will need to be extended to deal with this bias.

5.5. Future work

All ontology development methodologies encourage reuse with several advantages, such as cost reduction, quality control, semantic interoperability, EHR mining and query federation, cited in favor of reuse [6,19,31,41]. However, our extensive analysis suggests that ontology developers do intend to reuse terms, but often, they are not able to do so correctly. Converting the intent for reuse into actual reuse can help increase term reuse, and reduce term overlap (Section 4.2).

We plan to provide personalized reuse recommendations for ontology developers through a WebProtégé plugin (http://webprotege.stanford.edu) [40]. The plugin will use our term–term affinity matrix (Section 3.4) and an item-based collaborative filtering method [33] to generate personalized recommendations for ontology developers, based on their target ontology and the engineering task at hand. These recommendations will be provided through a visual recommendation plugin built inside WebProtégé, where ontology developers can drag and select their terms of interest for reuse. This plugin may also keep developers informed, when the representation of the term in the source ontology changes.

We believe that our Web application will allow ontology developers to search for similar terms in other ontologies, while our visualization of overlap and reuse dependencies may guide developers to reuse terms in their own ontology based on the structure of ontologies in related domains. Our composite mappings approach may serve as a complement to the existing BioPortal mappings, which are currently generated through naive string matching algorithms [13]. We also plan to develop a term–centric visualization that summarizes everything known about a particular term in BioPortal, and presents it to developers and domain experts through an interactive interface. Our hope is that this visualization will enable ontology developers to serendipitously discover and reuse existing knowledge.

6. Conclusion

We estimated the level of reuse and overlap in a corpus of 377 ontologies from the BioPortal repository. We developed novel methods for detecting reuse and overlap in biomedical ontologies. Our findings show a term overlap of approximately 25.31–30.18%, and term reuse of less than 9%. Most ontologies reuse fewer than 5% of their terms from a small set of popular ontologies, with terms from several ontologies never being reused. We found strong indications that users actually intended to reuse terms, but in many cases they used incorrect representations. We also identified common error patterns in term reuse. Our hope is that the results of this work may be used to develop better guidelines and tool support with the aim to enhance reuse, and minimize overlap among biomedical ontologies.

Acknowledgments

The authors acknowledge Manuel Salvadores for providing a triplestore dump of BioPortal ontologies, and other members of the Protégé Group and the National Center for Biomedical Ontology for their input. This work is supported in part by grants GM086587 and GM103316 from the US National Institutes of Health.

References

1. Aleksovski et al., Matching unstructured vocabularies using a background ontology, in: Managing Knowledge in a World of Networks, Springer, 2006, pp. 182–197. doi:10.1007/11891451_18.

2. C.Y. Alexander, Methods in biomedical ontology, Journal of Biomedical Informatics 39(3) (2006), 252–266. doi:10.1016/j.jbi.2005.11.006.

3. Ashburner, C.A. Ball, J.A. Blake, Botstein, Butler, J.M. Cherry, A.P. Davis, Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, Issel-Tarver, Kasarskis, Lewis, J.C. Matese, J.E. Richardson, Ringwald, G.M. Rubin and Sherlock, Gene Ontology: Tool for the unification of biology, Nature Genetics 25(1) (2000), 25–29. doi:10.1038/75556.

4. Bodenreider, The Unified Medical Language System (UMLS): Integrating biomedical terminology, Nucleic Acids Research 32(suppl 1) (2004), D267–D270. doi:10.1093/nar/gkh061.

5. Bodenreider, Biomedical ontologies in action: Role in knowledge management, data integration and decision support, in: Yearbook of Medical Informatics (2008), p. 67. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2592252/.

6. E.P. Bontas, Mochol and Tolksdorf, Case studies on ontology reuse, in: Proc. of the 5th International Conference on Knowledge Management (I-KNOW 05), Graz, Austria, June 29–July 1, 2005, Tochtermann and Maurer, eds, 2005. http://www.inf.fu-berlin.de/users/mochol/papers/i-KNOW05.pdf.

7. Ó. Corcho, Fernández-López and Gómez-Pérez, Methodologies, tools and languages for building ontologies: Where is their meeting point?, Data & Knowledge Engineering 46(1) (2003), 41–64. doi:10.1016/S0169-023X(02)00195-7.

8. R.L.F. Cordeiro, Traina Jr., A.J.M. Traina, López, Kang and Faloutsos, Clustering very large multi-dimensional datasets with mapreduce, in: Proc. of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21–24, 2011, Apté, Ghosh and Smyth, eds, ACM, 2011, pp. 690–698. doi:10.1145/2020408.2020516.

9. Courtot, Gibson, A.L. Lister, Malone, Schober, R.R. Brinkman and Ruttenberg, MIREOT: The minimum information to reference an external ontology term, Applied Ontology 6(1) (2011), 23–33. doi:10.3233/AO-2011-0087.

10. Cristani and Cuel, A survey on ontology creation methodologies, International Journal on Semantic Web and Information Systems 1(2) (2005), 49–69. doi:10.4018/jswis.2005040103.

11. d’Aquin, Schlicht, Stuckenschmidt and Sabou, Criteria and evaluation for ontology modularization techniques, in: Modular Ontologies: Concepts, Theories and Techniques for Knowledge Modularization, Stuckenschmidt, Parent and Spaccapietra, eds, Lecture Notes in Computer Science, Vol. 5445, Springer, 2009, pp. 67–89. doi:10.1007/978-3-642-01907-4_4.

12. Garcia-Santa, Atemezing and Villazon-Terrazas, Protege LOV Plugin, 2015, http://boris.villazon.terrazas.name/projects/prolov/index.html. (Accessed March 05, 2015.)

13. Ghazvinian, N.F. Noy and M.A. Musen, Creating mappings for ontologies in biomedicine: Simple methods work, in: AMIA 2009, American Medical Informatics Association Annual Symposium, San Francisco, CA, USA, November 14–18, 2009, 2009, pp. 198. http://knowledge.amia.org/amia-55142-a2009a-1.626575/t-002-1.627282/f-001-1.627283/a-039-1.627287/a-040-1.627284.

14.

Ghazvinian,

N.F.

Noy and

M.A.

Musen, How orthogonal are the OBO foundry ontologies?, Journal of Biomedical Semantics 2(Suppl 2:S2) (2011). doi:10.1186/2041-1480-2-S2-S2.

15.

Hanna,

Cheng,

Crow,

R.A.

Hall,

Liu,

Pendurthi,

Schmidt,

S.F.

Jennings,

Brochhausen and

W.R.

Hogan, Simplifying MIREOT: A MIREOT Protégé plugin, in: Proc. of the ISWC 2012 Posters & Demonstrations Track, Boston, USA, November 11–15, 2012,

Glimm and

Huynh, eds, CEUR Workshop Proceedings, Vol. 914, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-914/paper_48.pdf.

16.

Hasnain,

M.R.

Kamdar,

Hasapis,

Zeginis,

C.N.

WarrenJr.,

H.F.

Deus,

Ntalaperas,

K.A.

Tarabanis,

Mehdi and

Decker, Linked biomedical dataspace: Lessons learned integrating data for drug discovery, in: The Semantic Web – ISWC 2014 – Proc. of the 13th International Semantic Web Conference, Part I, Riva del Garda, Italy, October 19–23, 2014,

Mika,

Tudorache,

Bernstein,

Welty,

C.A.

Knoblock,

Vrandecic,

P.T.

Groth,

N.F.

Noy,

Janowicz and

C.A.

Goble, eds, Lecture Notes in Computer Science, Vol. 8796, Springer, 2014, pp. 114–130. doi:10.1007/978-3-319-11964-9_8.

17.

Hastings,

Ceusters,

Jensen,

Mulligan and

Smith, Representing mental functioning: Ontologies for mental health and disease, in: Proc. Towards an Ontology of Mental Functioning Workshop, ICBO 2012, 3rd International Conference on Biomedical Ontology, Medical University of Graz, 2012. http://kr-med.org/icbofois2012/proceedings/ICBOFOIS2012Workshops/ICBO2012MFO/SingleFiles/icbo-2012_MFO_Hastings_1.pdf.

18.

Huang, Clustering large data sets with mixed numeric and categorical values, in: Proc. of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 1997, pp. 21–34. doi:10.1.1.94.9984.

19.

M.R.

Kamdar,

Zeginis,

Hasnain,

Decker and

H.F.

Deus, ReVeaLD: A user-driven domain-specific interactive search platform for biomedical research, Journal of Biomedical Informatics 47 (2014), 112–130. doi:10.1016/j.jbi.2013.10.001.

20.

M.R.

Kamdar,

Tudorache and

M.A.

Musen, Investigating term reuse and overlap in biomedical ontologies, in: Proc. of the International Conference on Biomedical Ontology, ICBO 2015, Lisbon, Portugal, July 27–30, 2015,

F.M.

Couto and

Hastings, eds, CEUR Workshop Proceedings, Vol. 1515, CEUR-WS.org, 2015. http://ceur-ws.org/Vol-1515/regular9.pdf .

21.

Kuramochi and

Karypis, Frequent subgraph discovery, in: Proc. of the 2001 IEEE International Conference on Data Mining, 29 November–2 December 2001, San Jose, California, USA,

Cercone,

T.Y.

Lin and

Wu, eds, IEEE Computer Society, 2001, pp. 313–320. doi:10.1109/ICDM.2001.989534.

22.

Matentzoglu,

Bail and

Parsia, A snapshot of the OWL web, in: The Semantic Web – ISWC 2013 – Proc. of the 12th International Semantic Web Conference, Part I, Sydney, NSW, Australia, October 21–25, 2013,

Alani,

Kagal,

Fokoue,

P.T.

Groth,

Biemann,

J.X.

Parreira,

Aroyo,

N.F.

Noy,

Welty and

Janowicz, eds, Lecture Notes in Computer Science, Vol. 8218, Springer, 2013, pp. 331–346. doi:10.1007/978-3-642-41335-3_21.

23.

Nair

et al., The BioPortal Import Plugin for Protégé, in: Proc. of the 2nd International Conference on Biomedical Ontology, Vol. 833, CEUR-WS, 2011.

24.

Nair and

Tudorache, BioPortal Import Plugin, 2011. http://protegewiki.stanford.edu/wiki/BioPortal_Import_Plugin. (Accessed March 01, 2015.)

25.

A.Y.

Ng,

M.I.

Jordan and

Weiss, On spectral clustering: Analysis and an algorithm, in: Advances in Neural Information Processing Systems 14, Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3–8, 2001, Vancouver, British Columbia, Canada,

T.G.

Dietterich,

Becker and

Ghahramani, eds, MIT Press, 2001, pp. 849–856.

26.

N.F.

Noy and

D.L.

McGuinness, Ontology development 101: A guide to creating your first ontology, Stanford KSL technical report KSL-01-05 and Stanford Medical Informatics technical report SMI-2001-0880, 2001.

27.

N.F.

Noy,

Sintek,

Decker,

Crubézy,

R.W.

Fergerson and

M.A.

Musen, Creating semantic web contents with protégé-2000, IEEE Intelligent Systems 16(2) (2001), 60–71. doi:10.1109/5254.920601.

28.

OBOFoundry, Inter-ontology Links, 2011. http://wiki.obofoundry.org/wiki/index.php/Mappings, http://goo.gl/OSrSjP. (Accessed March 01, 2015.)

29.

Pathak,

T.M.

Johnson and

C.G.

Chute, Survey of modular ontology techniques and their applications in the biomedical domain, Integrated Computer-Aided Engineering 16(3) (2009), 225–242. doi:10.3233/ICA-2009-0315.

30.

Poveda Villalón,

M.C.

Suárez-Figueroa and

Gómez-Pérez, The landscape of ontology reuse in linked data, in: 1st International Workshop on Ontology Engineering in a Data-driven World (OEDW 2012) at the 18th International Conference on Knowledge Engineering and Knowledge Management, Galway, Ireland, 9th October 2012, 2012. http://granvia.dia.fi.upm.es/oedw2012/tl_files/oedw2012/material/povedaEtAl-OEDW2012CRvFinal.pdf.

31.

J.M.

Rodrigues,

Schulz,

A.L.

Rector,

K.A.

Spackman,

Üstün,

C.G.

Chute,

V.D.

Mea,

Millar and

K.B.

Persson, Sharing ontology between ICD 11 and SNOMED CT will enable seamless re-use and semantic interoperability, in: MEDINFO 2013 – Proc. of the 14th World Congress on Medical and Health Informatics, 20–13 August 2013, Copenhagen, Denmark,

C.U.

Lehmann,

Ammenwerth and

Nøhr, eds, Studies in Health Technology and Informatics, Vol. 192, IOS Press, 2013, pp. 343–346. doi:10.3233/978-1-61499-289-9-343.

32.

D.L.

Rubin,

Shah and

N.F.

Noy, Biomedical ontologies: A functional perspective, Briefings in Bioinformatics 9(1) (2008), 75–90. doi:10.1093/bib/bbm059.

33.

B.M.

Sarwar,

Karypis,

J.A.

Konstan and

Riedl, Item-based collaborative filtering recommendation algorithms, in: Proc. of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1–5, 2001,

V.Y.

Shen,

Saito,

M.R.

Lyu and

M.E.

Zurko, eds, ACM, 2001, pp. 285–295. doi:10.1145/371920.372071.

34.

E.P.B.

Simperl, Reusing ontologies on the Semantic Web: A feasibility study, Data & Knowledge Engineering 68(10) (2009), 905–925. doi:10.1016/j.datak.2009.02.002.

35.

Sioutos,

de Coronado,

M.W.

Haber,

F.W.

Hartel,

Shaiu and

L.W.

Wright, NCI thesaurus: A semantic model integrating cancer-related clinical and molecular information, Journal of Biomedical Informatics 40(1) (2007), 30–43. doi:10.1016/j.jbi.2006.02.013.

36.

Smith,

Ashburner,

Rosse,

Bard,

Bug,

Ceusters,

L.J.

Goldberg,

Eilbeck,

Ireland,

C.J.

Mungall,

Leontis,

Rocca-Serra,

Ruttenberg,

S.-A.

Sansone,

R.H.

Scheuermann,

Shah,

P.L.

Whetzel and

Lewis, The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration, Nature Biotechnology 25(11) (2007), 1251–1255. doi:10.1038/nbt1346.

37.

M.Q.

Stearns,

Price,

K.A.

Spackman and

A.Y.

Wang, SNOMED clinical terms: Overview of the development process and project status, in: AMIA 2001, American Medical Informatics Association Annual Symposium, Washington, DC, USA, November 3–7, 2001, AMIA, 2001, p. 662. http://knowledge.amia.org/amia-55142-a2001a-1.597057/t-001-1.599654/f-001-1.599655/a-133-1.599740/a-134-1.599737 .

38.

M.C.

Suárez-Figueroa, NeOn Methodology for building ontology networks: Specification, scheduling and reuse, PhD thesis, Informatica, 2010.

39.

Tordai,

Ghazvinian,

van Ossenbruggen,

M.A.

Musen and

N.F.

Noy, Lost in translation? Empirical analysis of mapping compositions for large ontologies, in: Proc. of the 5th International Workshop on Ontology Matching (OM-2010), Shanghai, China, November 7, 2010,

Shvaiko,

Euzenat,

Giunchiglia,

Stuckenschmidt,

Mao and

I.F.

Cruz, eds, CEUR Workshop Proceedings, Vol. 7, CEUR-WS.org, 2010, p. 689. http://ceur-ws.org/Vol-689/om2010_Tpaper2.pdf .

40.

Tudorache,

Vendetti and

N.F.

Noy, Web-Protege: A lightweight OWL ontology editor for the web, in: Proc. of the Fifth OWLED Workshop on OWL: Experiences and Directions, Collocated with the 7th International Semantic Web Conference (ISWC-2008), Karlsruhe, Germany, October 26–27, 2008,

Dolbear,

Ruttenberg and

Sattler, eds, CEUR Workshop Proceedings, Vol. 432, CEUR-WS.org, 2008. http://ceur-ws.org/Vol-432/owled2008eu_submission_40.pdf .

41.

Tudorache,

S.M.

Falconer,

N.F.

Noy,

Nyulas,

T.B.

Üstün,

M.D.

Storey and

M.A.

Musen, Ontology development for the masses: Creating ICD-11 in WebProtégé, in: Knowledge Engineering and Management by the Masses – Proc. of the 17th International Conference, EKAW 2010, Lisbon, Portugal, October 11–15, 2010,

Cimiano and

H.S.

Pinto, eds, Lecture Notes in Computer Science, Vol. 6317, Springer, 2010, pp. 74–89. doi:10.1007/978-3-642-16438-5_6.

42.

P.-Y.

Vandenbussche and

Vatant, Linked Open Vocabularies (LOV), 2012. http://lov.okfn.org/dataset/lov. (Accessed October 09, 2015.)

43.

W3C OWL Working Group, OWL 2 Web Ontology Language Document Overview, 2nd edn, W3C Recommendation, 11, December 2012. Available at http://www.w3.org/TR/owl2-overview/.

44.

Wächter,

Fabian and

Schroeder, DOG4DAG: Semi-automated ontology generation in obo-edit and protégé, in: Proc. of the 4th International Workshop on Semantic Web Applications and Tools for the Life Sciences, SWAT4LS 2011, London, United Kingdom, December 07–09, 2011,

Paschke,

Burger,

Romano,

M.S.

Marshall and

Splendiani, eds, ACM, 2011, pp. 119–120. doi:10.1145/2166896.2166926.

45.

P.L.

Whetzel,

N.F.

Noy,

N.H.

Shah,

P.R.

Alexander,

Nyulas,

Tudorache and

M.A.

Musen, BioPortal: Enhanced functionality via new Web services from the National center for Biomedical ontology to access and use ontologies in software applications, Nucleic Acids Research 39(suppl 2) (2011), W541–W545. doi:10.1093/nar/gkr469.

46.

Xiang,

Courtot,

R.R.

Brinkman,

Ruttenberg and

He, OntoFox: Web-based support for ontology reuse, BMC Research Notes 3(1) (2010), 175. doi:10.1186/1756-0500-3-175.

Component ($T_{1a}$)		Component ($T_{1b}$)	
$t_{1}$	Myocardium	$t_{4}$	Intercalated Disk
$t_{2}$	Cardiac Muscle	$t_{5}$	Intercalated-Disc
$t_{3}$	Heart Muscle	$t_{6}$	Discus Intercalatus
$t_{3}$	(Intercalated disc) →	$t_{7}$	Intercalated Disc