Abstract
This article, written by researchers studying metadata and standards, represents a fresh perspective on the challenges of electronic health records (EHRs) and serves as a primer for big data researchers new to health-related issues. Primarily, we argue for the importance of the systematic adoption of standards in EHR data and metadata as a way of promoting big data research and benefiting patients. EHRs have the potential to include a vast amount of longitudinal health data, and metadata provides the formal structures to govern that data. In the United States, electronic medical records (EMRs) are part of the larger EHR. EHR data is submitted by a variety of clinical data providers and potentially by the patients themselves. Because data input practices are not necessarily standardized, and because of the multiplicity of current standards, basic interoperability in EHRs is hindered. Some of the issues with EHR interoperability stem from the complexities of the data they include, which can be both structured and unstructured. A number of controlled vocabularies are available to data providers. The continuity of care document standard will provide interoperability in the United States between the EMR and the larger EHR, potentially making data input by providers directly available to other providers. The data involved is nonetheless messy. In particular, the use of competing vocabularies such as the Systematized Nomenclature of Medicine–Clinical Terms, MEDCIN, and locally created vocabularies inhibits large-scale interoperability for structured portions of the records, and unstructured portions, although potentially not machine readable, remain essential. Once EMRs for patients are brought together as EHRs, the EHRs must be managed and stored. Adequate documentation should be created and maintained to assure the secure and accurate use of EHR data. There are currently a few notable international standards initiatives for EHRs. 
Organizations such as Health Level Seven International and Clinical Data Interchange Standards Consortium are developing and overseeing implementation of interoperability standards. Denmark and Singapore are two countries that have successfully implemented national EHR systems. Future work in electronic health information initiatives should underscore the importance of standards and reinforce interoperability of EHRs for big data research and for the sake of patients.
Introduction
EHRs contain a diversity of information that has the potential to be shared among healthcare providers, medical researchers, and public health academics. However, the fields available in these records are not consistent, and those fields' values are not consistently supplied. Furthermore, the data that is provided may not make use of controlled vocabularies or other coding mechanisms that would make it machine readable and potentially interoperable. In the context of EHRs, we define interoperability as the ability to share EHR data and metadata between a variety of medical systems and EHR storage solutions. We consider the descriptive information in EHRs to be data. Metadata can be defined simply as data about data or, in this case, the structure and rules governing the data in the EHRs. Completeness, correctness, and consistency/coherency in metadata are measurement indicators tightly linked to metadata quality 5 and data interoperability. If the records, fields, and data of EHRs could be standardized and the data consistently supplied, 6 EHRs could be more useful in large-scale clinical decision making, potentially leading to increased patient safety and to the detection of, and intervention in, dangerous patterns.
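As a rough illustration of how the completeness indicator might be measured, the following Python sketch reports, for each expected field, the fraction of records that actually supply a value. The field names and sample records are hypothetical and are not taken from any EHR standard.

```python
from collections import Counter

# Hypothetical field names, for illustration only (not from any EHR standard).
EXPECTED_FIELDS = ["patient_id", "birth_date", "diagnosis_code", "medication"]


def field_completeness(records, expected_fields=EXPECTED_FIELDS):
    """Return the fraction of records that supply a non-empty value for each field."""
    if not records:
        return {field: 0.0 for field in expected_fields}
    filled = Counter()
    for record in records:
        for field in expected_fields:
            value = record.get(field)
            # Treat missing keys, None, and blank strings as "not supplied".
            if value is not None and str(value).strip() != "":
                filled[field] += 1
    return {field: filled[field] / len(records) for field in expected_fields}


# Invented sample data: the second record omits two fields.
sample_records = [
    {"patient_id": "p1", "birth_date": "1970-01-01", "diagnosis_code": "X1", "medication": ""},
    {"patient_id": "p2", "birth_date": None, "diagnosis_code": "X1"},
]
completeness = field_completeness(sample_records)
```

A measure like this addresses only completeness; correctness and consistency would require comparison against reference values or vocabularies.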
In this article, we present a brief assessment of the current state of EHR standards for data and metadata; this content, in the aggregate, can be considered big data. To improve patient care while making research possible, the healthcare community must examine current EHR data and metadata standards and determine how to implement them consistently.
The Anatomy of an EHR
EHRs represent one way to capture personal health data. Terminology used to describe EHRs has varied, and functionality of systems supporting EHRs is not uniform. 7 Generally though, EHRs include electronic medical record (EMR) content from individual providers, such as quantitative data (e.g., laboratory test results, vital signs), qualitative data (e.g., clinical notes), and transactional data (e.g., prescription information). 8 Patients' medical histories, encompassing diagnoses and past treatments, can also be included in EMRs. Although the terms EHR and EMR have been used interchangeably, the U.S. government defines them separately. 9 Because information from all clinicians involved in a patient's care is recorded, EMRs that form the EHR provide a rich source for analysis. 10
EMRs are essentially used only within an individual healthcare provider's office and record the patient's history only in the context of that provider's practice. For example, patients' EMRs at a cardiologist's office will only have information about their visits with that particular cardiologist and would not include any additional information from, for example, their dietician. The omission of the broader context of a patient's care in effect makes EMRs not much better than the paper charts traditionally used in most Western healthcare practices. EHRs, which contain the EMRs, can provide a broader view of the patient's overall health. EHRs are intended to be shared between healthcare providers and even be accessible to patients. For examples of the kinds of data included in the EHR and the data provider, see Table 1. Although the ultimate goal of EHRs is to provide shareable information, several impediments to making this a simple reality currently exist.
The volume, velocity, and variety of EHR data pose a management problem for researchers. For the research community to make effective use of EHRs, the data in them must be interoperable. Furthermore, standards to facilitate preservation and storage within the United States alone will need to be created, adjusted, and/or determined.
Complexities of Data in EHRs
EHRs contain both structured and unstructured data, and much of the structured data can be entered using standardized or controlled vocabularies that promote interoperability. 11 Controlled vocabularies, often called front-end vocabularies, can provide consistent terms to use to represent straightforward ideas. For example, prescription information can be entered using the controlled vocabulary RxNorm, “a normalized naming system for generic and branded drugs” produced by the National Library of Medicine. 12 By providing a single, consistent way to name pharmaceuticals, confusion surrounding trade and generic names and changes in packaging conventions from country to country can be mitigated, and potential drug interactions can be avoided. Radiology information can be captured using devices in which Digital Imaging and Communications in Medicine (DICOM) (http://medical.nema.org/) has been implemented. DICOM “defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use,” yielding a set of codes that can be applied; it is the International Organization for Standardization (ISO) standard 12052. 13 See Table 2 for examples of controlled vocabularies used in EHRs.
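The idea of resolving trade and generic names to one normalized entry can be sketched in a few lines of Python. This is a toy lookup under stated assumptions, not the RxNorm data model or API: the table is hand-written, and the concept identifiers are placeholders rather than real RxNorm codes.

```python
# Toy lookup table. The concept identifiers are placeholders, NOT real RxNorm codes.
NAME_TO_CONCEPT = {
    "tylenol": "DRUG-0001",        # a brand name for acetaminophen
    "acetaminophen": "DRUG-0001",  # generic name used in the United States
    "paracetamol": "DRUG-0001",    # generic name for the same drug in many other countries
    "advil": "DRUG-0002",          # a brand name for ibuprofen
    "ibuprofen": "DRUG-0002",
}


def normalize_drug_name(name):
    """Map a typed drug name to a single concept identifier, or None if unknown."""
    return NAME_TO_CONCEPT.get(name.strip().lower())


def same_drug(name_a, name_b):
    """Return True only when both names resolve to the same known concept."""
    concept_a = normalize_drug_name(name_a)
    concept_b = normalize_drug_name(name_b)
    return concept_a is not None and concept_a == concept_b
```

With such a mapping, a record that says "Tylenol" and another that says "paracetamol" can be recognized as referring to the same drug, which is the precondition for checks such as duplicate-therapy or interaction screening.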
Within the United States, the Continuity of Care Document (CCD) standard is being examined as one way to facilitate digital interoperability. 14 The CCD is an XML standard approved by the Department of Health and Human Services to meet the meaningful use requirement outlined in the HITECH Act. The intention is for the CCD to structure the EHR as a whole, with the EHR's individual elements input and coded in their own controlled vocabularies. Consistency in the metadata fields available is a first big step toward the creation of quality, interoperable records. Although the promise of CCD is great, D'Amore et al. note the persistent challenges that prevent full implementation. 14 For example, data is often entered inconsistently, and even the structure of CCDs is inconsistent from one vendor-supplied EHR to the next. 15
Clinical narratives typically make up the largest portion of EHRs but are the most difficult to structure. As a result, they can also be the most difficult to analyze and make interoperable. To facilitate interoperability and provide some structure to this information, medical terms used in clinical narratives can be mapped to controlled vocabularies, in some cases programmatically using natural language processing (NLP) tools. 11 Vreeman and Richoz suggest that a structured template for documentation is one place to begin for physiotherapists, who use unstructured clinical narratives extensively. 16 Text mining tools, such as MedLEE and MetaMap, can recognize clinical terms within EHRs written in English and map them to controlled vocabularies. 17 Further advances in NLP tools are likely and will continue to affect such mapping.
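To make the notion of mapping narrative terms to a controlled vocabulary concrete, here is a deliberately naive dictionary-lookup sketch in Python. It is not how MedLEE or MetaMap work; the term table is invented, and the codes are placeholders rather than SNOMED CT identifiers.

```python
import re

# Hypothetical term-to-code table; codes are placeholders, not SNOMED CT identifiers.
TERM_TO_CODE = {
    "chest pain": "TERM-001",
    "shortness of breath": "TERM-002",
    "hypertension": "TERM-003",
}


def map_terms(narrative, term_to_code=TERM_TO_CODE):
    """Return (term, code, start offset) for each known term found in the narrative."""
    matches = []
    lowered = narrative.lower()
    for term, code in term_to_code.items():
        # Word boundaries avoid matching a term inside a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, lowered):
            matches.append((term, code, match.start()))
    # Report matches in the order they appear in the text.
    return sorted(matches, key=lambda item: item[2])


note = "Patient reports chest pain and shortness of breath. History of hypertension."
found = map_terms(note)
```

A lookup like this ignores negation ("denies chest pain"), misspellings, and abbreviations, which is part of why dedicated NLP tools are needed for real clinical text.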
Until this technology improves, terms from controlled vocabularies will likely need to be input into EMRs manually by the caregiver. One example of a commonly known and multilingual front-end vocabulary (also called health terminology or nomenclature) for clinical narratives is the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT), a comprehensive terminology owned and maintained by the International Health Terminology Standards Development Organisation and available through the United States' National Library of Medicine Unified Medical Language System® (UMLS®). 18 Vreeman and Richoz note that SNOMED CT provides a “solid core terminology” for EMRs in the United States because of its broad coverage. 16 The vocabulary MEDCIN is an additional option, as is the creation of a new, system-specific vocabulary. 3
On the other hand, too much structure to the data within EHRs can be counterproductive. Clinical narratives are a rich source of information, and not all of the data found in them can (or should) be put into standardized language. Gardner discusses some of the types of data that should remain unstructured and why. For example, a change in medication can be documented in a structured field, but the reasons for doing so should remain in natural language. 19 Attempting to structure this information by creating a checklist or drop-down menu to document the myriad possible reasons a patient requires a change in medication would be unwieldy and unhelpful. The unstructured information in clinical narratives provides context to and complements the structured, quantitative data that exists elsewhere in the record. 20
Furthermore, simply because data is unstructured does not mean that it cannot be analyzed programmatically. Roque et al. discuss the viability of “mining” the unstructured areas of electronic patient records to identify disease patterns and comorbidities. Their study occurred in a Danish psychiatric hospital, and terms were mapped to the International Classification of Diseases, 10th Revision (ICD-10). 17 LePendu et al. note the potential of mining clinical narratives for the purpose of “pharmacovigilance” (i.e., identifying potential adverse reactions or drug interactions). 21 These examples illustrate how unstructured data can be mined for rich analysis by healthcare providers and public health researchers. 20 Although structured data and metadata undeniably have their place, certain types of data need to remain uncontrolled. Therefore, metadata schemas and data standards for EHRs must retain an element for free-form text (or “dirty data,” as Gardner calls it). 19
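One way to picture a record that keeps both kinds of data is a structure with coded fields alongside a free-form text element. The Python sketch below uses the medication-change example; the class, field names, and code value are all hypothetical and do not come from any EHR schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class MedicationChange:
    # Structured fields: suited to controlled vocabularies and machine processing.
    drug_code: str        # placeholder identifier, not a real vocabulary code
    old_dose_mg: float
    new_dose_mg: float
    # Unstructured field: the clinician's reasoning stays in natural language.
    reason_note: str = ""


change = MedicationChange(
    drug_code="DRUG-0002",
    old_dose_mg=200.0,
    new_dose_mg=400.0,
    reason_note="Patient reports pain is not controlled at the current dose.",
)

# Both the coded fields and the free text survive serialization for exchange.
serialized = json.dumps(asdict(change))
```

The design choice here is simply that the free-text element is part of the schema rather than an afterthought, so it travels with the structured fields wherever the record is shared.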
Ensuring Interoperability of EHRs
Healthcare providers should easily be able to view EHRs so that they can access the most current information about their patients. Individual vendors often advocate for adoption of proprietary controlled vocabularies for interoperability within one system; however, this strategy is not desirable if data will be shared across different healthcare systems and for a variety of purposes. We believe that, in the long term, EHRs are usable as big data for medical and public health researchers; however, several kinds of data need to be in place for this to be feasible. Patient metadata, or descriptive information about the patient (including demographics, pharmaceuticals, diagnosis, etc.), needs to be recorded in a way that is interoperable. Additionally, administrative metadata needs to be included for the records to be understandable outside of the context in which they were created. Administrative metadata is data describing the EMR; it can include information about the controlled vocabularies and standards used, necessary information to ensure patient privacy, and specifics of the EHR's authenticity and creation.
The ability to generate administrative metadata automatically for EHRs is desirable and necessary to provide full context for researchers. Greenberg et al. note that scientific data are being created at a faster rate than metadata can organize them. 22 When certain kinds of metadata do not already exist, some can be extracted if the document is predictably structured. 5 If EMRs were structured identically, were completed consistently, and used the same vocabularies, manipulating them and automatically generating metadata could be a possibility. As of right now, however, these records are not uniform. If automatic metadata creation from EHRs could be accomplished, it would be an important development for creating shareable, usable data.
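If records were predictably structured, deriving some administrative metadata could be largely mechanical. The Python sketch below assumes a record is available as a plain dictionary; the function name, the metadata keys, and the use of a SHA-256 checksum as a simple integrity marker are all illustrative assumptions, not part of any EHR standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_admin_metadata(record, vocabularies, source_system):
    """Derive illustrative administrative metadata for a record held as a plain dict."""
    # Sorting keys gives a stable serialization, so equal content yields equal checksums.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return {
        "source_system": source_system,
        "vocabularies_used": sorted(vocabularies),
        # Fields that actually carry a value (None and empty strings are skipped).
        "fields_present": sorted(k for k, v in record.items() if v not in (None, "")),
        "checksum_sha256": hashlib.sha256(canonical).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }


# Invented example record and labels.
metadata = build_admin_metadata(
    {"diagnosis_code": "X1", "medication": "", "note": "Follow up in two weeks."},
    vocabularies={"SNOMED CT", "RxNorm"},
    source_system="clinic-a",
)
```

The sketch only works because the input has a known shape, which is exactly the uniformity that current records lack.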
Having reliable, consistent access to data across systems is another issue. The systems in which EHRs are housed are inconsistent, creating another potential barrier to interoperability. Carter notes that there are no widely accepted schema standards for EHR databases, which makes moving information between various EMR products difficult. 7 The complexity of some EMR systems means that data about an individual is housed in the database for the provider. 7 As an example, this means that all physician-created data will remain in the physician's database, and while his/her clinical notes will be visible to the pharmacist, they will not necessarily be housed in the same database as the pharmacist's data about medication prescribed even in a single, unified (hospital) system.
Creating and implementing standards for EHRs and their systems can be difficult even at an institutional level. Carter discusses the necessity of data-level integration for true EHR functionality. 7 All systems within an institution must share a data-encoding scheme, and there must be a way to move data from the various systems in an institution to a central system (often called a clinical data repository). Institutions must decide on and implement an encoding scheme internally before it is even possible to think about sharing data between different institutions for research purposes. Carter cites the lack of “widely accepted data standards” as one major impediment to system interoperability and therefore as a barrier to more widespread adoption of EHRs, particularly for hospitals. Once patient data can be extracted in a meaningful and systematic way from well-defined, standards-based records in consistently created databases, research becomes increasingly possible.
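The step of moving data from departmental systems into a central repository under one encoding scheme can be sketched as a code translation pass. In this Python example the system names, local codes, and shared codes are all invented for illustration; real institutions would use their chosen standard vocabulary in place of the placeholder shared codes.

```python
# Hypothetical local-to-shared code maps for two departmental systems.
LOCAL_TO_SHARED = {
    "lab_system": {"GLU": "SHARED-GLUCOSE", "K": "SHARED-POTASSIUM"},
    "ward_system": {"glucose_test": "SHARED-GLUCOSE"},
}


def to_repository(entries, maps=LOCAL_TO_SHARED):
    """Translate local codes to the shared scheme; return (loaded, rejected) lists."""
    loaded, rejected = [], []
    for entry in entries:
        shared_code = maps.get(entry["system"], {}).get(entry["code"])
        if shared_code is None:
            # Unknown system or unmapped code: hold back rather than guess.
            rejected.append(entry)
        else:
            loaded.append(
                {"code": shared_code, "value": entry["value"], "source": entry["system"]}
            )
    return loaded, rejected


# Invented entries: the last one uses a code with no mapping.
entries = [
    {"system": "lab_system", "code": "GLU", "value": 5.4},
    {"system": "ward_system", "code": "glucose_test", "value": 5.6},
    {"system": "ward_system", "code": "bp", "value": "120/80"},
]
loaded, rejected = to_repository(entries)
```

The rejected list makes the cost of missing mappings visible: every unmapped local code is data that cannot yet be shared or pooled for research.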
Storage Considerations for EHRs
Interoperability concerns do not begin and end with data structure and input. Where to store these large data sets is another question remaining to be answered. The scientific community has been addressing small data needs and metadata issues for as long as it has been generating data. Marcial and Hemminger examined the metadata in scientific data repositories available online and found that policies and documentation were quite varied. 23 Many repositories did not employ a metadata standard at all, and most of the metadata was generally descriptive (rather than administrative or for preservation). When developing their own repositories for big data sets, healthcare community members must take note of this and provide adequate documentation for metadata standards and allow for a variety of metadata types. 23
Standards Initiatives for EHRs
There are currently a few notable international standards initiatives for EHRs. The European Medical Information Framework (EMIF) aims to create a standard medical information framework to share and link medical data across diverse information resources, including databases and EHRs. EMIF will initially focus on research questions surrounding obesity and dementia. 24 It will be interesting to see how the creation of a common information framework might be implemented and transferred to the United States. International organizations such as Health Level Seven International (HL7), Clinical Data Interchange Standards Consortium (CDISC), and the proposed Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data 25 are developing and overseeing implementation of interoperability standards in certain related sectors, namely, clinical sectors. 11 HL7 CDA, a clinical encoding standard that specifies the structure of medical documents to facilitate interoperability for medical information exchange, can be used as a template to generate clinical documents. 26 Vreeman and Richoz note that HL7 Application Protocol for Electronic Data Exchange in Healthcare Environments is one of the most widely implemented healthcare standards worldwide. 16
International standards initiatives
Several countries in Europe and Asia have been successful in adopting EHRs to support healthcare and could serve as examples to the United States. In 2011, Hodge identified 23 countries with a national EHR initiative, citing Denmark as a global leader. 27 In general, smaller countries with more agile infrastructures have been more easily able to adopt EHRs. 11 For example, in Denmark, MedCom (http://medcom.dk/wm109991) is a Danish initiative to “contribute to the development, testing, dissemination and quality assurance of electronic communication and information in the healthcare sector.” 28 MedCom is a member of Healthcare Interoperability Testing and Conformance Harmonization (www.hitch-project.eu/), a European initiative. In Asia, by 2012 Singapore's public health sector was using an EHR system. 29 “MOH has created a national electronic health record (NEHR) system for Singapore's public health sector. The NEHR system is available to all public healthcare institutions (comprising 8 restructured hospitals, 8 specialist centers, and 18 polyclinics), five community hospitals, two nursing homes, a hospice, selected general practitioners (GPs), and users from the Agency for Integrated Care (AIC).” 30 Japan has been actively promoting IT in healthcare since 2001, with Fujitsu having been recruited to provide an EMR structure in support of the i-Japan Strategy 2015. 31 In 2012, Sinha et al. examined the systems in use in a dozen countries in their case studies section, indicating that there is much international interest in the mechanics of EHRs, especially in Asia but also in Europe and North America. 32 The appeal of EHRs is clear, both in the United States and internationally, and it is possible and necessary for U.S. researchers to examine the systems their international counterparts have already implemented and to do their best to ensure international interoperability of data between systems.
Future Study
Making data and its metadata interoperable among different repositories is a compelling area for study in light of the data being housed and that which has the potential to be housed using EHRs. Expansion of existing standards initiatives, such as HL7 CDA and CDISC, will further the interoperability discussion as countries work to finalize the standards that undergird their national electronic healthcare initiatives. Developing easily understood standards and adhering to them will be essential to sharing data with the research community. In their discussion of the scientific data repository Dryad, Greenberg et al. emphasize focusing on long-term goals for metadata architecture. 22 Such a focus will be essential in creating metadata schemas and repositories for health data in the future, particularly because of the sheer number of purposes for which such data could be used. We suggest that an international study group be formed to investigate the worldwide standardization of EHRs that include mechanisms to protect privacy, to allow patients to opt in and opt out, to permit patients to view and contribute to their own EHRs, and to allow researchers and doctors to collect the kinds of data they need to analyze medical trends.
Conclusions
EHRs are an area that has shown itself to be worthy of study, and healthcare practitioners and institutions could open up a range of possibilities for scientific research by giving interoperability considerations for EHRs the attention they are due. Big data uses in healthcare research are wide ranging. Future analysis of EHRs will continue to allow researchers and clinicians to identify patterns of disease, infestations, and interactions, and begin to move toward cures. Determining what information within EHRs should be structured (or not) with controlled vocabularies will be an important step. Metadata creation is essential to ensuring data is shareable. The scientific community has embraced digital repositories to store much of their work, and while this may be an attractive option for healthcare big data, several logistical questions remain unanswered. Another major concern surrounding metadata creation for EHRs, although outside the scope of this article, is privacy. Although there is not currently enough information available about how metadata can play a role in this future (or what that role might look like), it must be an integral part of the big data in healthcare discussion.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
