Abstract
Irrespective of data size and complexity, query and exploration tools for accessing data resources remain a central linkage for human–data interaction. A fundamental barrier in making query interfaces easier to use, ultimately as easy as online shopping, is the lack of faceted, interactive capabilities. We propose to repurpose existing ontologies by transforming them into nested facet systems (NFS) to support human–data interaction. Two basic issues need to be addressed for this to happen: one is that the structure and quality of ontologies need to be examined and elevated for the purpose of NFS; the second is that mappings from data-source specific metadata to a corresponding NFS need to be developed to support this new generation of NFS-enabled web-interfaces. The purpose of this paper is to introduce the concept of NFS and outline opportunities involved in using ontologies as NFS for querying and exploring data, especially in the biomedical domain.
Introduction
When it comes to exploring and accessing biomedical data, a question often asked is: “Why can’t it be as easy as shopping on Amazon?”
To answer this question, we need to identify the core technologies that made the online shopping experience “pleasant,” and then apply a similar strategy to exploring and accessing biomedical data, big or small. Among the many drivers of online shopping [24], faceted search [23,37] is perhaps one of the most ubiquitously applied information-retrieval techniques. Indeed, studies show that faceted search can enhance user experience in a variety of settings [10,18,21,27,47].
Semantic labeling is the missing link between an entity (such as consumer goods for online shopping or study subjects in a clinical data warehouse) and ways to identify and access it through means such as a web-based user interface. This is well articulated in a recent article by Balog [5] and in information organization as tags for folder and menu hierarchies [6,46,48].
Semantic labeling enables facets such as size, color, make, and price to be annotated for entities such as shoes in an online store. Faceted organization and presentation of product metadata is the key mechanism that allows consumers to quickly narrow down from millions of products to items of interest using such simple facets. The entities for biomedical data, however, are highly complex, and there is no corresponding small set of semantic labels to support faceted search. For example, clinical data, captured as a part of patient care, are highly complex and include demographics, medical history, lab reports, diagnoses, medications, and discharge summaries.
Biomedical ontologies are suitable as semantic labels for biomedical entities. However, these ontologies, intended to model and capture concepts and their relations in the biomedical domain, are broad and complex. For example, SNOMED CT [17], the largest clinical terminology used worldwide, contains over 300,000 concepts and over 1.5 million relations. The National Cancer Institute thesaurus (NCIt) [15,32], on the other hand, is a biomedical terminology managed by NCI Enterprise Vocabulary Services, containing more than 140,000 concepts related to cancer. Such size and complexity raise basic questions about their potential role as facets for web-based user interfaces: What, if any, structural transformations are needed for ontologies to play the role of facets for information retrieval? Is it feasible to have ontologies play the role of facets? What desirable properties are required for ontologies to support facet-oriented user interaction? How should the performance of this approach be measured and evaluated?
In this paper we propose the concept of a nested facet system (NFS), outline a strategy to transform existing ontologies into NFSs to support human–data interaction, and identify exemplar research questions related to the use of NFSs to enhance user experience in human–data interaction. Unlike traditional faceted search, the intended users of interfaces supported by NFS are those equipped with some level of knowledge in specific domains. Our motivation for NFS is to facilitate information retrieval in such specific domains, but an NFS can also be readily implemented as a navigation interface for the corresponding underlying ontology [45].
Nested facet system
A facet is a semantic label of an entity along multiple possible axes or dimensions. Facets correspond to properties of the entity of interest. For example, online vendors use facets to label their products using readily available information about their type, brand, and price, and support the consumer shopping experience through faceted search [37].
A nested facet, or higher-order facet, is a facet that includes a (finite) collection of other facets as its components. In this context, traditional facets are primitive facets, those that are not made of other facets. A nested facet system is a set of nested facets (we call them facets from now on) with a taxonomy relation (i.e., subclass, subsumption, or hierarchical relation) among them.
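The distinction between primitive and nested facets can be illustrated with a small data structure. The following is a minimal sketch only: the class name `Facet`, its fields, and the shoe-store facets are hypothetical, and the sketch models just the "components" aspect of the definition, not the taxonomy relation among facets.

```python
from dataclasses import dataclass, field


@dataclass
class Facet:
    """A facet; it is primitive when it has no component facets."""
    name: str
    components: list["Facet"] = field(default_factory=list)

    def is_primitive(self) -> bool:
        return not self.components

    def primitives(self) -> list[str]:
        """Names of all primitive facets reachable from this facet."""
        if self.is_primitive():
            return [self.name]
        found: list[str] = []
        for sub in self.components:
            for name in sub.primitives():
                if name not in found:  # keep first occurrence only
                    found.append(name)
        return found


# Hypothetical facets for an online shoe store (illustration only).
size, color, price = Facet("size"), Facet("color"), Facet("price")
appearance = Facet("appearance", [size, color])   # a nested facet
shoe = Facet("shoe", [appearance, price])         # nested two levels deep

print(shoe.primitives())
```

Here `shoe` is a nested facet whose components are themselves a nested facet (`appearance`) and a primitive facet (`price`); unfolding it eventually reaches only primitive facets.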
A nested facet system Each element The head of any refinement is not a part of the body of the same refinement.
With respect to each refinement
The intuition for a refinement
Each NFS
To endow NFS’ with their intended meaning, we treat facets as generalized semantic labels as follows. Given a set of entities E, a facet p with value space
When
If
We call
If
Note that the notation ⊢ is deliberately suggestive of a potential connection with “Information Systems” [31,40], part of domain theory [2] as a mathematical foundation for programming languages [38]. There also appears to be a potential formal connection to the notion of disjunctive information systems [39,40].
Biomedical ontologies serve as the semantic scaffolding for us to fully capitalize on the transformative opportunities of the increasingly large amounts of digital data produced by the biomedical research enterprise. For example, BioPortal [28], the world’s most comprehensive repository, contains over 600 ontologies and over 7 billion concepts that have been used to support a wide spectrum of scientific projects. Biomedical ontologies provide the basis for scientific rigor during the process of data collection, annotation, management, analysis, and sharing in biomedicine. They not only serve as metadata standards, but also play a vital role in down-stream systems as a declarative knowledge source [7]. For example, SNOMED CT [17], the most comprehensive and precise clinical health terminology product in the world, facilitates the clear exchange of health information in Electronic Health Records (EHRs), leading to higher quality, consistency and safety in healthcare delivery [14,33].
Ontological systems are not designed a priori as nested facet systems. But what if we attempt to reuse them as facets to support user interfaces? An intuitive idea is to leverage the hierarchical or is-a relation, the structural backbone of most ontologies, and simply treat Ontological Concepts as Facets.
For a given ontology such as SNOMED CT, we can treat each concept c as a facet p, and build a nested facet system by letting
For this (very reasonable) intuition to work, the following questions must be answered:
Does this construction obey the soundness property mentioned at the end of the previous section?
Does this construction obey the completeness property mentioned at the end of the previous section?
Intuitively, soundness means that all items below each facet are relevant to the facet. Completeness means that any items or facets relevant to a specific facet are already contained in and accessible through the facet. The soundness and completeness properties of an NFS directly affect query performance in terms of precision and recall: incomplete facets reduce recall, while unsound facets reduce precision. The top of Fig. 1 contains an incomplete facet, in that concept node 5, taken as a facet, misses the sub-facet represented by concept node 6.

Fig. 1. Two example NCI Thesaurus fragments. Above: a fragment containing a bug. Below: the fragment with the bug fixed by redirecting node 5 as a direct parent of node 6.
Interestingly, similar properties of soundness and completeness have been studied in the area called Ontology Quality Research (OQR [11]), encompassing ontology quality auditing, assurance, and evaluation [3,50]. For example, an OQR method can identify a missing is-a relation (incompleteness) in the top fragment of Fig. 1 and automatically suggest the addition of the is-a edge shown in the lower part of Fig. 1. The addition of this is-a edge makes the facet represented by node 5 “more complete” because it now includes node 6 as a sub-facet (as it should). The goal of OQR is to develop methods and tools to detect [41,42], identify [13], and address [1,9,49] quality issues in ontologies. This is a particularly important area in the biomedical domain because of the significance, scope, complexity, manual involvement, and evolving nature of biomedical ontologies, which are intended to serve as terminology standards as well as to codify knowledge at the same time.
Identifying quality issues in ontologies, such as unsoundness or incompleteness, is a task similar to finding bugs in software. Just as there is no single “recipe” to catch and fix all software bugs, no single method is expected to address all ontology quality issues at once. Similarly, for NFS, a single method that ensures, and allows us to formally prove, soundness and completeness is unlikely. Instead, we expect the development of methods to “improve” the soundness and completeness of NFSs derived from ontological systems, leading to meaningful enhancement of the performance of NFS for information retrieval tasks.
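The effect of a missing is-a edge on recall can be made concrete with a toy example. The sketch below is illustrative only: the hierarchy, the entity annotations, and the function names are hypothetical, and only the node ids 5 and 6 echo Fig. 1. It treats each concept as a facet whose scope is the concept together with its is-a descendants, which is one plausible reading of the construction described above.

```python
from collections import defaultdict


def descendants(children, root):
    """Return root plus every concept reachable through is-a child edges."""
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def retrieve(children, annotations, facet):
    """Entities annotated with the facet concept or any of its sub-facets."""
    scope = descendants(children, facet)
    return {e for e, concept in annotations.items() if concept in scope}


def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)


# Toy hierarchy (parent -> children); node 6 is wrongly not below node 5.
children = defaultdict(set, {1: {5, 6}, 5: {7}})
annotations = {"e1": 5, "e2": 6, "e3": 7, "e4": 1}   # entity -> concept
relevant_to_5 = {"e1", "e2", "e3"}   # ground truth, assuming 6 is-a 5

before = retrieve(children, annotations, 5)
children[5].add(6)                   # add the missing is-a edge
after = retrieve(children, annotations, 5)

print(recall(before, relevant_to_5), recall(after, relevant_to_5))
```

Before the edge is added, the entity annotated with node 6 is unreachable from facet 5, so recall for that facet is 2/3; after the fix it is 1.0, while precision is unaffected in this toy case.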
In the following sections we discuss such questions in more depth using biomedical ontologies and clinical data resources as examples, and provide use cases to demonstrate the feasibility and work involved to implement this approach.
An array of biomedical datasets in the context of human health exists, but there is a general lack of faceted interfaces to facilitate data exploration and information retrieval. In most cases, ontological systems have already been used for annotating or labeling the backend data, but their interface roles have not been fully exploited. This state of affairs represents a ripe and rich setting for developing and implementing NFS to facilitate cohort discovery and sub-group analysis. This section provides a brief synopsis of these data resources and the associated ontological systems as an illustration of a targeted application area for NFS.
Clinical data warehouse
The set of entities E for clinical data consists of patients. Clinical data from EHRs are critical for analyses to improve health care delivery. Clinical data warehouses are EHR data made available for research. Examples include i2b2 data warehouses [20,25], PCORnet – the National Patient-Centered Clinical Research Network [19], and the Observational Health Data Sciences and Informatics (OHDSI) research network [22] with an open, community data standard called the Observational Medical Outcomes Partnership (OMOP) Common Data Model. SNOMED CT is a common ontological component of all these data sources.
Health claims data
Health claims data (also called administrative data), such as Cerner Health Facts, IBM Market Analytics, and Optum Health Data and Analytics, are those collected for the purpose of health insurance claims. They include information at the patient encounter level regarding diagnoses, treatments, and billed and paid amounts. This is a valuable data source for research aimed at driving improvements in population health to address issues related to cost, quality, and outcomes. The use of administrative data can complement EHR data by providing a regional or national scale view. Because of the health claims context, the main vocabularies for health claims data cover diagnoses (ICD-9 and ICD-10), procedure codes (CPT), and medications (RxNorm).
Clinical data and health claims data are domain-agnostic: they cover the entire spectrum of disorders and disease domains. Domain-specific data resources, in contrast, are those that cover a single medical specialty, but with greater depth. We highlight several such resources next.
The National Sleep Research Resource – NSRR
The gold standard for sleep diagnosis is polysomnography (PSG), which monitors physiological processes including electroencephalogram (EEG – brain waves), electromyogram (EMG – muscle tone), and electro-oculogram (EOG – eye movements). The recorded polysomnograms provide comprehensive data about biophysical changes that occur during sleep and characterize the association between sleep and other public health related problems. The NSRR [16,44] is a retrospectively annotated repository of 30,000 overnight sleep recordings. The NSRR offers free and open web access to a large, de-identified, well-annotated national repository of sleep data, including PSGs that are linked to risk factor and outcome data for participants in major NIH studies. Since its launch in 2014, 282 TB of data have been shared by over 3,000 users around the world through the NSRR portal sleepdata.org.
NSRR uses the Sleep Domain Ontology [4] as the canonical vocabulary for cross-study data mapping.
The Center for SUDEP Research – CSR
The Center for Sudden Unexpected Death in Epilepsy (SUDEP) Research [8] manages another domain-specific clinical research data resource. The CSR has prospectively collected high-grade multimodal data, including high-resolution electroencephalographic signals, research-grade brain MRI, and biochemical and DNA samples, together with detailed phenotypic data for more than 3,000 epilepsy patients. Similar to NSRR, a disease-specific ontology called the Epilepsy and Seizure Ontology [30] has been created as a part of the CSR informatics infrastructure process.
Cancer registries
For cancer research, the US National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) program [34] coordinates a collection of state-based SEER registries. These state-centered cancer registries receive data about new cancer cases from healthcare facilities and physicians within the state. Typically, five aspects of data are captured: patient data, case data, follow-up, therapy data, and pathology reports. Patient data consist of variables covering patient-related information such as demographics, race, ethnicity, smoking, and clinical trial participation. Case data capture variables for diagnosis, morphology, staging, biomarkers, and other categories. Follow-up information contains variables including follow-up physician, date of last contact, survival status, and cancer status. Therapy data record variables with information on surgery, chemotherapy, radiation, and other treatment modalities.
In general, SEER registries are considered to be among the most accurate and complete population-based cancer registries in the world, and they include the stage of cancer at the time of diagnosis and patient survival data. Cancer registries use the NAACCR data dictionary [26] for variable definitions, and this dictionary is only partially mapped to NCIt. This is where work on primitive facets is needed in order to use NCIt as an NFS.
Implementation strategy
The following steps are typically involved in developing an NFS-based query engine for a data source (see Fig. 2 for a functional architecture).
Identify or develop a domain ontology covering the conceptual scope of the data source. If multiple ontologies are used, ontology merging is a necessary step in developing such a domain ontology.
Construct a mapping from the data dictionary for the data source to concepts of the domain ontology.
Convert the domain ontology to an NFS and implement an NFS-based query interface by systematically extracting the “refinement” structure of nested facets from the hierarchical relationships of the ontology, following the method given in Section 3.
Implement an appropriate query optimization strategy dedicated to the data source as a database. Transformation to a NoSQL database such as MongoDB may be desirable depending on the data source.
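To make the mapping and conversion steps more tangible, the sketch below walks through a toy version of them. Everything in it is hypothetical: the variable names, the concept labels, the tiny single-parent hierarchy, and the record layout are assumptions chosen for illustration, and real ontologies are generally directed acyclic graphs with multiple parents rather than trees.

```python
# Step 2: hypothetical data-dictionary variables mapped to ontology concepts.
variable_to_concept = {
    "dx_osa": "Obstructive sleep apnea",
    "dx_csa": "Central sleep apnea",
    "dx_insomnia": "Insomnia",
}

# Step 3: is-a hierarchy of a hypothetical domain ontology (child -> parent).
is_a = {
    "Obstructive sleep apnea": "Sleep apnea",
    "Central sleep apnea": "Sleep apnea",
    "Sleep apnea": "Sleep disorder",
    "Insomnia": "Sleep disorder",
}


def ancestors_or_self(concept):
    """Walk is-a edges upward, yielding the concept and each ancestor."""
    while concept is not None:
        yield concept
        concept = is_a.get(concept)


def variables_under(facet):
    """Data-dictionary variables whose mapped concept falls under the facet."""
    return sorted(v for v, c in variable_to_concept.items()
                  if facet in ancestors_or_self(c))


def cohort(records, facet):
    """Ids of records with a true value for any variable under the facet."""
    wanted = variables_under(facet)
    return {r["id"] for r in records if any(r.get(v) for v in wanted)}


records = [
    {"id": 1, "dx_osa": True},
    {"id": 2, "dx_insomnia": True},
    {"id": 3, "dx_csa": True, "dx_insomnia": True},
]

print(variables_under("Sleep apnea"))
print(cohort(records, "Sleep apnea"))
```

Selecting the higher-level facet "Sleep apnea" expands, through the mapping, to the two variables annotated with its sub-concepts, so the user does not need to know the data dictionary's variable names.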

Fig. 2. High-level functional architecture of an NFS-based system.
Model-View-Controller [29], a well-established and popular web-based application development paradigm, is a suitable approach for developing an NFS-based system, particularly for the clinical informatics domain [36].
For disease-specific domains such as sleep and epilepsy, we have developed NFS query interfaces such as x-search [12] and the Multi-Modality Epilepsy Data Capture and Integration System (MEDCIS [43]). x-search is a cross-cohort query and exploration system that enables researchers to query patient cohort counts across a growing number of completed, NIH-funded studies in the NSRR. x-search is publicly available at https://x-search.net, covering over 26,000 unique subjects. The canonical data dictionary, the Sleep Domain Ontology [4], covers over 900 common data elements across a dozen cohort studies in NSRR. x-search has received over 2,300 queries by users from 16 countries since its initial launch [12]. For epilepsy, the MEDCIS interface uses a dedicated Epilepsy and Seizure Ontology [30] to drive an NFS-based query interface. MEDCIS is the main query interface for CSR data, integrating curated multi-modality clinical data of 2,000 epilepsy patients from 8 medical centers.
Based on our experience, benefits of an NFS-based query interface include:
It provides an intuitive interface for users to navigate to a specific concept of interest and specify the corresponding query criterion in a menu-driven, templated style.
The same Boolean query can be constructed more efficiently, usually taking only about half the time needed with alternative interfaces that do not use NFS.
A query optimization strategy can be readily implemented by precomputing queries corresponding to primitive facets and ordering the query execution sequence based on the result sizes for primitive facets.
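The query optimization idea in the last item can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the systems above: the record layout, the facet names, and the functions `precompute` and `conjunctive_query` are hypothetical, and only conjunctive (AND) queries over primitive facets are handled.

```python
def precompute(records, primitive_facets):
    """Map each primitive facet name to the set of matching record ids."""
    return {name: {r["id"] for r in records if test(r)}
            for name, test in primitive_facets.items()}


def conjunctive_query(index, facet_names):
    """AND the facets together, intersecting the smallest result sets first."""
    ordered = sorted(facet_names, key=lambda n: len(index[n]))
    result = set(index[ordered[0]])      # copy so the index is not mutated
    for name in ordered[1:]:
        if not result:                   # nothing left to narrow down
            break
        result &= index[name]
    return result


# Hypothetical patient records and primitive facets.
records = [
    {"id": 1, "age": 67, "sex": "F", "smoker": True},
    {"id": 2, "age": 45, "sex": "M", "smoker": False},
    {"id": 3, "age": 71, "sex": "F", "smoker": False},
    {"id": 4, "age": 58, "sex": "F", "smoker": True},
]
primitive_facets = {
    "female": lambda r: r["sex"] == "F",
    "age>=65": lambda r: r["age"] >= 65,
    "smoker": lambda r: r["smoker"],
}
index = precompute(records, primitive_facets)

print(conjunctive_query(index, ["female", "age>=65", "smoker"]))
```

Starting from the smallest precomputed result set keeps every intermediate intersection small, and the early exit avoids further work once the running result is empty.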
Such benefits have been studied in the clinical data warehouse setting [35] but we also encountered challenges that seem to be typical in developing an NFS-based query interface:
There is no clear and efficient way to guarantee the soundness and completeness properties of NFS in general. For example, even though SNOMED CT and NCIt satisfy the soundness and completeness properties “for the most part” using the NFS refinements specified in Section 3, enough facet instances exist where such properties are violated [41]. Such violations affect the soundness and completeness properties of facets, leading to reduced precision and recall for query interfaces using NFS. Interestingly, non-lattice auditing methods can precisely identify and potentially fix such issues [1,9,13,42,49].
Primitive facets are not always specified and ready for use. For example, for Cancer Registries, the common data dictionary exists (i.e. NAACCR), but not all of its variables have been structurally mapped to appropriate NCIt terms both in value type and value range. Effort is needed to construct such a mapping (once only, though) before data dictionary variables can be used as primitive facets.
When a domain ontology is large and deep (e.g. SNOMED CT), interface response can be sluggish if the hierarchical (sub-facet) interface widget rendering algorithm is not optimized.
Conclusion
We outlined a general approach for constructing nested facet systems from ontologies. We highlighted use cases for clinical data, and discussed progress and remaining challenges. Given the importance of faceted search, our proposed approach deserves further study. Efforts in developing experimental interfaces supporting NFS will be highly desirable and impactful for accessing biomedical data for research.
Acknowledgements
This work was supported in part by US National Cancer Institute under award R21CA231904 and US National Science Foundation under awards IIS1931134 and ACI1626364.
