Abstract
Veracity, one of the five V’s used to describe big data, has received attention when it comes to using electronic medical record data for research purposes. In this perspective article, we discuss the idea of data veracity and associated concepts as they relate to the use of electronic medical record data and administrative data in research. We discuss the idea that electronic medical record data are “good enough” for clinical practice and, as such, are “good enough” for certain applications. We then propose three primary issues to attend to when establishing data veracity: data provenance, cross validation, and context.
Introduction
Data science, through the use of big data, is purported to hold great opportunities for healthcare in patient care as well as research. 1 At the same time, data science approaches represent a “sea change” for researchers who have been trained to maintain high levels of rigor in data collection. By definition, big data can be “messy” because they are collected for reasons other than research. For example, administrative claims data are collected for payment purposes but have been widely used in health services research for many years. Similarly, electronic medical record (EMR) data are designed for clinical care but also have been widely used in research. Multiple concerns have been raised regarding the veracity of EMR data and the implications for their use in research. Following is a discussion of these concerns. First, we review the concept of data veracity. Then, we propose that administrative and EMR data are “good enough” and discuss ways of improving data veracity.
On veracity
Data scientists have identified a series of characteristics that represent big data, commonly known as the V words: volume, velocity, and variety, 2 a list that has recently been expanded to also include value and veracity. 3 Of particular interest is veracity, which is defined as “uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations.” 4 While not identifying veracity per se, Kitchin and McArdle 2 identify “error” for big data, including sampling and non-sampling errors as well as new, as-yet-undefined types of errors. Other groups refer to issues of data quality, including “accuracy, believability, reputation, objectivity, factuality, consistency, freedom from bias, correctness, and unambiguousness.” 5 Bellazzi 6 refers to veracity as the uncertainty of the data and notes that some data sources may provide a low signal-to-noise ratio rather than the preferred high signal-to-noise ratio. Lukoianova and Rubin 7 identify “objectivity, truthfulness and credibility” as the three main theoretical dimensions of veracity. They provide conceptual and operational definitions of each and then develop an index to rate the veracity of textual data. However, in healthcare, the data are not only textual but also numeric, images, and pre-selected pull-down choices (e.g. from an EMR). While a standard definition for data veracity has yet to be proposed, we can generally consider it as relating to the accuracy and fidelity of a particular datum or data.
Administrative and EMR data are “good enough”
Revisiting two common forms of big data in healthcare, administrative records and EMR data, we propose that the data available are “good enough” for billing and administrative purposes and, more importantly, for clinical decisions. When establishing the veracity of EMR data for use in research, one need consider little more than the fact that the data in the EMR are the same data that were collected, recorded (although potentially errantly), and used to make clinical decisions. It is an interesting juxtaposition that EMR data are acceptable for use in clinical decision-making, but not for use in research (research that will in turn guide clinical decision-making) or for other secondary purposes such as driving real-time clinical decision support systems. Why so? Are there significant differences in the ramifications of using EMR data for different purposes, or are we diving into the waters of research exceptionalism? 8–13
Consider the lack of applicability of a large swath of research findings in clinical practice. Why is this the case? Moving beyond the standard discussions of strict inclusion and exclusion criteria, might it be that in the attempt to maintain fidelity to study procedures and reduce measurement bias, we are simultaneously measuring ourselves right out of real-world clinical applicability? Stated differently, strict measurement protocols (take, for instance, blood pressure measurement) usually bear little resemblance to how measurements are made in the clinical setting. Blood pressure measurement in a tightly controlled trial is administered in nearly the same way, by the same people, with the same equipment, and with calibration of that equipment. Alternatively, in clinical practice, such as during a hospital stay, repeated measurements like blood pressure are conducted differently, by different people, at different times, using different equipment which may not be calibrated. It is no wonder, then, that when we try to implement findings in clinical settings, the effect found in the trial is reduced or all but disappears. So why do we hold in such high regard study findings that do not translate, yet question the veracity of the very data that are recorded and actually used in clinical care?
The lack of clinical trial translation into practice was recognized decades ago by Cochrane 14 and has become more problematic as we make the push toward developing a learning health system. Veracity in using EMR data for research is analogous to the discussions about the differences between randomized clinical trials (where the investigator controls the delivery and timing of the intervention(s)) and pragmatic clinical trials (where the investigator observes and measures but does not control to the same extent the delivery and timing of the intervention). The primary criticism of pragmatic trials is that the interventions and subsequent administration of care are not controlled so that it is not clear what part of a multi-factorial intervention is really effective. Yet, the findings of a pragmatic trial are immediately applicable to the setting they were generated in and have a greater likelihood of applying to other similar settings as well.
Similar considerations apply to the use of administrative data for research. The quality and financial challenges of the US health system are well described and drive the Triple Aim. 15 For most quality measures, whether used for payment or for performance improvement (including public reporting of outcomes), the underlying data are derived from an EMR, billing data, or both. Under the current value-based purchasing initiatives of the Medicare program, the two blend, in that both EMR and billing data may be used. Finally, the administrative data used for billing and payment have been used for many years in health services research, with the appropriate caveats 16 for what they can and cannot be used for.
Assessing if data are “good enough”
If we accept the premise that EMR data are “good enough” to make clinical decisions and, in turn, are “good enough” to drive research used to inform clinical practice, then we must attend to achieving the highest levels of data veracity possible. Similarly, if the data used to make payment determinations and publicly report outcomes are “good enough” for these purposes, data veracity is equally important. In the following, we discuss three primary issues related to understanding if data from EMR and administrative sources are “good enough”: data provenance, cross validation, and context.
Data provenance
First, the data scientist needs to either know, or have someone on the investigative team who has, a deep understanding of the data elements, including which data elements are credible (data provenance). Data provenance, or knowing the meaning and context under which data were collected, 17 is the critical first step in developing an understanding of the data one is working with. Such deep knowledge may be particularly important for administrative data used for payment, in that there are often multiple fields but only certain ones are used, or some are more credible than others. For example, diagnoses recorded upon hospital admission are often tentative until the actual diagnosis has been confirmed through further evaluation and testing. Thus, most health services researchers use the hospital discharge administrative form, and not the admission form, for confirmation of the reasons for the hospitalization. Similarly, laboratory values recorded in most helicopter and ground transport EMRs reflect the most recent laboratory results from the sending hospital (or the place the patient was picked up from), not results acquired during transport. Finally, administrative data used for billing may be sufficient only for administrative purposes, as the processes for completing the bill may meet only billing criteria and not be useful for research.
We must acknowledge that final diagnosis decisions can be influenced by potential impacts on reimbursement or other local contextual considerations that shape billing and coding practices, and we must consider the impact those practices may have on data veracity. The effect of biased data entry on data veracity is most prominent in diagnosis and procedure codes. Alternatively, the strict regulations guiding laboratory testing and resulting, and the dispensing of medications, make these processes and the recording of their final results more rigorous, contributing to a higher degree of certainty in the recorded data. These considerations provide critical support for the need to establish rigorous data provenance.
Cross validation
Second, the credibility of the data should be analyzed, when possible, with cross validation, which increases confidence in the data. For example, the incidence of errant vital sign entry in EMRs has been documented. 18 It is rather easy to transpose numbers when entering heart rate or blood pressure, and if left unchecked, this can result in permanent storage of an errant entry (e.g. recording a heart rate of 49 instead of 94). In practice, the errant entry would likely trigger a sequence of events to either confirm or refute that entry and, if confirmed, act upon it. The same should be done when using data secondarily in research: values outside the expected range, or otherwise inconsistent with the record (say, the sudden appearance of a medication related to hepatitis when no history prior to or after the medication entry supports that diagnosis), should be cross validated. Similarly, cross validation is needed when assessing functional status in the Outcome and Assessment Information Set (OASIS), the standardized data collection instrument for home healthcare in the United States, which contains functional status items related to ambulation/locomotion and transferring. 19 A logical cross check would be to compare item M1850 for transferring with item M1860 for ambulation/locomotion for consistency, such that a patient identified as bedbound in transferring should not be rated as independent in ambulation/locomotion.
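The two checks just described can be sketched as simple validation rules. The following is a minimal illustration rather than a production algorithm: the heart-rate check compares each reading against the median of the patient’s other readings to spot likely digit transpositions, and the OASIS check applies the transferring/ambulation consistency rule. The function names, tolerance value, and numeric response thresholds are illustrative assumptions.

```python
from statistics import median


def flag_possible_transpositions(readings, tolerance=20):
    """Flag heart-rate entries that deviate sharply from the patient's
    other readings but whose digit-reversed value fits the trend
    (e.g. 49 recorded in a run of values near 94)."""
    flags = []
    for i, value in enumerate(readings):
        others = readings[:i] + readings[i + 1:]
        if not others:
            continue
        typical = median(others)
        reversed_value = int(str(value)[::-1])  # 49 -> 94
        if abs(value - typical) > tolerance and abs(reversed_value - typical) <= tolerance:
            flags.append((i, value, reversed_value))
    return flags


def oasis_mobility_conflict(m1850_transfer, m1860_ambulation):
    """Logical cross check: a patient rated bedbound on transferring (M1850)
    should not simultaneously be rated independent on ambulation/locomotion
    (M1860). Higher response codes indicate greater dependence; the exact
    thresholds used here are illustrative."""
    BEDBOUND_TRANSFER = 4
    INDEPENDENT_AMBULATION = 0
    return m1850_transfer >= BEDBOUND_TRANSFER and m1860_ambulation == INDEPENDENT_AMBULATION
```

A flagged entry would then be routed for confirmation or correction, mirroring the clinical workflow described above, rather than being silently altered.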
Context
As with anything else, context matters, and in the case of secondary use of data, it is a significant consideration. The obvious challenge in the secondary use of data is that the available data may not meet your needs; in the case of reusing EMR data, a lot of time is dedicated to pre-processing, which includes identifying the raw data elements 20 that provide the data needed to answer the research questions or to provide the informational foundation for a clinical decision support system. There are multiple considerations that should be addressed contextually to improve the veracity of the data used.
The first consideration is identifying the appropriate time frame for data collection. For example, a decision support system that uses EMR data in real time for inpatients would require data only from inpatient admissions at that particular hospital, a rather narrow inclusion of data that is technically easier to accomplish. Alternatively, a home health decision support system would require data from the hospital inpatient encounter leading to the home health admission as well as data from other home health encounters, creating a potentially massive technical challenge in abstracting data from multiple hospitals with potentially different EMR systems and capabilities. An additional consideration emerges if the context of interest spans episodes of care and care settings that include periods of well-being and illness, raising the issue of stationarity: the need to address the influence of time and acute illness on measurement biases, and how those biases affect the ability to include and analyze data. 21
The second consideration is the intended use or application. For example, using EMR data to determine the feasibility and accuracy of natural language processing (NLP) techniques out of purely scientific interest is very different from using EMR data to make clinical decisions through a decision support system. For the former, the investigators would expect that there would be terms in the EMR that are not included in the NLP lexicon. This would not reflect how “good” the EMR data were as much as the limitations of the NLP system or the complexities of clinician charting.
Conversely, making clinical decisions based on EMR data demands a higher degree of accuracy. For instance, developing care paths requires the use of current, validated evidence, because the primary aim of a care path is to reduce practice variation and direct the required testing and procedures for a particular problem. Due to the prescriptive nature of care paths, EMR data may not be the best primary source of data to support their initial development. Alternatively, a real-time clinical decision support system aimed at providing patient-centered care recommendations could, and ideally would, operate primarily off EMR data. A decision support system that incorporates an individual patient’s EMR data in real time, while comparing that patient’s current state with those of other similar patients, can provide robust prognostication of the clinical trajectory and care recommendations based on previous patients’ progression through that very system. Of particular importance is the fact that these decision support systems provide recommendations, not prescriptions for care. This is very different from incorporating findings from studies produced external to that particular setting and is the essence of what a learning health system would look like.
A current effort that encompasses the previous principles in concert is the work being done on developing electronic phenotypes. An electronic phenotype, or computable phenotype, “is a clinical condition, characteristic, or set of clinical features that can be determined solely from the data in EHRs and ancillary data sources and does not require chart review or interpretation by a clinician.” 22 Developing an electronic phenotype consists of identifying the appropriate data elements and value sets that produce reliable and valid identification of which patients exhibit the true phenotype and which do not, in a manner similar to that described above. While much effort has been focused on developing valid electronic phenotypes, phenotypes are primarily concerned only with demographic, diagnostic (ICD-9/10), and procedural codes. However, the basic principles of data provenance, cross validation, and context extend well beyond accurately identifying patients to include all data types necessary for the intended application.
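To make the idea of a computable phenotype concrete, the sketch below applies a simple rule of the kind used in rule-based phenotype definitions (e.g. requiring qualifying diagnosis codes on multiple dates, or one code plus a qualifying medication). The value sets, rule thresholds, and record structure are hypothetical; published phenotype definitions enumerate their value sets exhaustively.

```python
# Hypothetical value sets -- real phenotype definitions enumerate these
# exhaustively; the codes below are examples only.
T2DM_DX_CODES = {"E11.9", "E11.65", "250.00"}   # ICD-10/ICD-9 examples
T2DM_MEDICATIONS = {"metformin", "glipizide"}


def meets_t2dm_phenotype(record):
    """Rule-based computable phenotype sketch: a patient qualifies with
    qualifying diagnosis codes on at least two distinct dates, or with one
    qualifying code plus a qualifying medication. `record` is an assumed
    structure: {'diagnoses': [(code, date), ...], 'medications': {name, ...}}."""
    dx_dates = {date for code, date in record["diagnoses"] if code in T2DM_DX_CODES}
    meds = {m.lower() for m in record["medications"]} & T2DM_MEDICATIONS
    return len(dx_dates) >= 2 or (len(dx_dates) == 1 and bool(meds))
```

In practice, each rule component would itself be subject to the provenance and cross-validation considerations discussed above before the phenotype could be trusted.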
When the data are not “good enough”
Despite the many uses that big data, whether clinical or administrative, offers for research, there are also circumstances where big data may fail to meet research requirements. Rare diseases and conditions, for example, may not have sufficient numbers of patients, even in very large datasets, to allow the research question to be answered. Depending on the structure of the data and the sources, it may not be possible to determine whether individuals are duplicated within the same data source. For example, a person using two different healthcare systems may show up in a repository of pooled data as two people. Unit of analysis also matters: using hospital unit–level data will “miss” the complexities of some patients by rolling patient-level data up to the hospital unit level. For example, a big data study of pressure injury development may have only hospital unit–level data, which would be inappropriate for determining individual patient-level predictors of pressure injuries.
Furthermore, one cannot apply the results of a decision support system driven by a single hospital or health system’s EMR to other settings. As previously described, context matters, and the unique nature and richness of the EMR data from one hospital or health system will not be replicated in a hospital or different health system even within the same city. 23
Concerns about deductive disclosure also arise: even “anonymized” datasets may contain sufficient information to allow for identification of individuals, which would ethically preclude the use of big data for research without at least review by an Institutional Review Board. For example, researchers using publicly available genomic data were able to link records to individuals by their surnames, using publicly available information from the Internet, 24 for 12 percent of American men in one genomic database. Similar identifications are possible with big data that contain sufficient information to determine health and illness conditions that can then be paired with publicly available data from social media sites and property information sites.
Current approaches to data quality assessment
Conceptual
The level of detail and the repeated measures at the individual level that EMR data provide can be leveraged in conjunction with publicly available, nationally representative administrative datasets. This makes it possible to assess the external generalizability of local findings, or whether certain clinical phenotypes or clinical trajectories hold beyond the local context. One can start with the population database(s), which are usually administrative in nature and lack granular patient-level data. Starting with this top–down approach is useful in identifying subgroups of patients that experience similar responses to care or that cluster into clinically meaningful groups warranting further investigation. Then, using the characteristics of the subgroups identified in the national cohort, one can identify patients in the local EMR data, enabling in-depth inter- and intra-individual analyses. Alternatively, the process can proceed via a bottom–up approach that answers clinical questions using EMR data and then transitions to population-based datasets to see whether the findings identified locally replicate nationally and, if not, what the similarities and differences are. Leveraging both local and state/national databases provides the ability to identify patterns of characteristics or care that hold across settings and contribute to truly generalizable knowledge that is often lacking.
Additionally, there are multiple approaches to assessing data quality in EMRs. The most recent and comprehensive approach is presented by Kahn et al. 25 This comprehensive framework unifies the concepts and associated terminologies to guide data quality assessment of EMR for secondary purposes, establishing the foundation to continue to develop common assessment approaches and reporting requirements. Other more narrowly focused frameworks include Reimer et al.’s 26 framework for assessing data quality in data repositories that contain longitudinal data from multiple data domains and sources, and Holve et al.’s 27 framework to assess data quality when conducting comparative effectiveness research.
Technological
There are existing technological tools that can help to automate the assessment of data veracity. Individual approaches applied to EMR data within healthcare include: process mining, a method that assesses data quality by mapping the chronological time/date stamping of longitudinal data within the EMR; 28 developing ontologies to characterize data quality across different organizations to enable sharing quality metrics; 29 text mining tools based on NLP that identify specific words or combinations of words to aid in verification of data quality and completeness, 30 and that, when paired with standardized terminologies, can also correct misspelled words in unstructured text fields; and probabilistic data quality control methods that leverage information and geometric theory to conduct simultaneous assessment of the multisource and temporal variability present in multisite data collection repositories. 31
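To make the process-mining item concrete, the sketch below checks whether recorded timestamps respect an expected clinical ordering of encounter events; contradictory orderings (e.g. a discharge stamped before the admission it belongs to) are flagged as quality problems. The event names, expected ordering, and input format are assumptions for illustration, not a published method.

```python
from datetime import datetime


def chronology_violations(events):
    """Flag event pairs within one encounter whose timestamps contradict
    the expected clinical ordering. `events` is an assumed format:
    a list of (event_name, iso_timestamp) tuples."""
    expected_order = {"admission": 0, "lab_result": 1, "discharge": 2}  # illustrative
    stamped = [(name, datetime.fromisoformat(ts))
               for name, ts in events if name in expected_order]
    violations = []
    for name_a, t_a in stamped:
        for name_b, t_b in stamped:
            # name_a should precede name_b, yet was stamped later
            if expected_order[name_a] < expected_order[name_b] and t_a > t_b:
                violations.append((name_a, name_b))
    return violations
```

Real process-mining tools generalize this idea by discovering the expected event ordering from the data itself rather than hard-coding it.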
Other tools from outside of healthcare also merit consideration. For example, the field of auditing describes computer-assisted audit tools and techniques (CAATTs) for a variety of tests, including the use of test data, where the auditor runs a known artificial dataset through the system, as well as CAATTs that use actual company data to support more frequent and automated analyses in the audit process. 32 In some cases, the audit process generates results that the auditor can use to support their findings. Kiesow et al. indicate that the test of veracity can be done by examining the completeness and accuracy of the CAATTs.
Automatic error correction has been used with optical character recognition (OCR) for many years, with seminal work in the field by Damerau. 33 Work in economics provides other examples of automatic error correction, in which three kinds of errors were identified: data format errors, incorrect entries, and missing values. Using a set of queries, the researchers could automatically find and correct the errors in a large and growing dataset on public spending in Greece. 34
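The query-based approach just described can be sketched as a set of automated checks over the three error classes named: format errors, incorrect entries, and missing values. The field names and rules below are illustrative assumptions, not the queries used in the cited study.

```python
import re


def audit_rows(rows):
    """Scan rows (dicts) for three error classes: missing values,
    format errors, and incorrect entries. Fields and rules are illustrative."""
    date_pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")
    errors = []
    for i, row in enumerate(rows):
        # Missing values: required fields that are empty or absent.
        for field in ("date", "amount"):
            if row.get(field) in (None, ""):
                errors.append((i, field, "missing_value"))
        # Format errors: a present date that does not match the expected pattern.
        date = row.get("date")
        if date and not date_pattern.match(date):
            errors.append((i, "date", "format_error"))
        # Incorrect entries: a numeric amount that violates a domain rule.
        amount = row.get("amount")
        if isinstance(amount, (int, float)) and amount < 0:
            errors.append((i, "amount", "incorrect_entry"))
    return errors
```

Once flagged, deterministic fixes (e.g. reformatting a date) can be applied automatically, while ambiguous cases are routed for manual review.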
More recently, another approach developed outside of healthcare that may be adapted and applied to assessing EMR data quality is the application of reputation tools. Basically, an automated, 35 or semi-automated, 36 tool tracks online activity such as blog postings or Twitter feeds to identify and assess the relevance of newly posted material related to a given company or individual. These approaches may be applied within the healthcare domain to assess data quality.
Conclusion
Data science and big data have much to offer healthcare, and the projections and expectations of the impact of data science approaches on patient care are real. However, concerns about the veracity of the data need to be based on the particularities of the study. “Good enough” data for clinical care and billing should be “good enough” for applications that intend to apply the results in fashions similar to those in which the source data were originally collected. Careful attention on the part of researchers to how well the data meet the objectives of the research, to the selection criteria for the patients, to unit-of-analysis issues, and to the research questions or hypotheses will strengthen the veracity of the data use. Additional efforts are necessary to extend the process of electronic phenotyping to the other data types necessary to support the development of decision support and a learning health system. For studies that have an impact, whether real or potential, on patient care and outcomes, a more stringent approach is necessary, requiring team members with deep insight and appropriate data that can be cross validated and applied in the proper context.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
