Abstract
The target trial framework has emerged as a powerful tool for addressing causal questions in clinical practice and public health. In the healthcare sector, where decision-making is increasingly data-driven, transactional databases, such as electronic health records (EHR) and insurance claims, offer untapped potential for answering complex causal questions. This narrative review explores the potential of integrating the target trial framework with real-world data to enhance healthcare decision-making processes. We outline essential elements of the target trial framework, identify pertinent challenges in data quality, privacy concerns, and methodological limitations, and propose solutions to overcome these obstacles and optimize the framework’s application.
Background
The pursuit of effective healthcare interventions and informed decision-making are enduring challenges in the field of medicine. Robust, evidence-based answers to causal questions are paramount in guiding clinical practice and public health policy.1 The target trial framework has emerged as a useful tool for addressing causal questions in a way that avoids common pitfalls and self-inflicted biases, particularly with respect to observational data.2–5 Within the expansive landscape of healthcare, the burgeoning adoption of transactional data sources such as electronic health records (EHR) and insurance claims has yielded a trove of data amenable to causal inference analyses. The application of the target trial framework to these data presents an opportunity to augment our understanding of healthcare interventions and outcomes. The purpose of this narrative review was to explore the potential of integrating the target trial framework with real-world data to enhance healthcare decision-making processes.
Search strategy
For this narrative review, a comprehensive literature search of PubMed/Medline and Google Scholar was conducted to identify relevant studies. Search terms included: ‘causal inference’; ‘target trial’; ‘electronic health records’; ‘electronic medical records’; ‘insurance claims data’. Boolean operators (AND, OR) were used to combine these terms and variations of them, ensuring comprehensive retrieval of articles pertaining to the application of the target trial framework with EHR and/or insurance claims data.
Causal analyses of observational data
Causal inference
Research and causal inquiries are crucial in public health, as they aim to identify the underlying causes of health outcomes and to inform the most effective interventions.1 Understanding the comparative efficacy or safety of different health interventions helps in developing evidence-based policies, programs, and interventions, with the ultimate goal of preventing and controlling diseases, reducing disparities, and promoting health equity.1,6 This is particularly important for guiding policy definitions and strategies in any organization directly or indirectly responsible for a population, such as government ministries, funding entities, or service-providing institutions. These decisions are often informed by randomized controlled trials (RCT), which are considered the gold standard for evaluating causal questions.2 In addition, other types of analyses, such as health technology assessments, cost-effectiveness studies, or budget impact analyses, can provide useful data and supplement results from RCT. However, conducting an RCT may not always be feasible due to cost constraints, ethical considerations, or the need for timely data.1 In these circumstances, analysis of observational data from transactional sources, such as EHR, may provide an alternative to help determine causal relationships.2,5,7,8 Nevertheless, the use of observational data (specifically EHR data) comes with its own set of challenges that must be addressed before valid conclusions can be drawn.
The process of causal question analysis
The process of formulating and addressing causal questions involves formulating an estimand, selecting an estimator, and applying them to data to obtain an estimate that, under certain assumptions, may be interpreted as causal.9 Under the potential outcomes framework, discerning a causal relationship involves comparing outcomes for the same individual across different levels of exposure.10 In other words, the analysis entails contrasting the expected outcome for a patient exposed to treatment A with that for the same patient had she been exposed to treatment B. This is inherently challenging because, by definition, one of the potential outcomes remains unobserved in the real world and is therefore counterfactual. Consequently, causal inference is typically framed as a missing data problem, as we commonly lack data on one of the potential outcomes.11
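To make this missing data framing concrete, the following minimal sketch uses hypothetical toy data (all values and names are illustrative): each individual has two potential outcomes, but the observed data reveal only the one corresponding to the treatment actually received.

```python
import numpy as np

# Hypothetical potential outcomes for five individuals.
# In reality, only one of the two is ever observed per person.
y1 = np.array([1, 0, 1, 1, 0])  # outcome had the person received treatment A
y0 = np.array([0, 0, 1, 0, 0])  # outcome had the same person received treatment B
a = np.array([1, 0, 1, 0, 1])   # treatment actually received

# The observed outcome reveals only the potential outcome under the
# received treatment; the other remains counterfactual (i.e., missing).
y_obs = np.where(a == 1, y1, y0)

# The individual causal effect y1 - y0 is therefore never identifiable
# from y_obs alone: one of its two terms is always missing.
```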
Identifiability assumptions serve as a means to bridge the gap between potential outcomes and actual outcomes. To assist in making this leap, epidemiologists use three assumptions: exchangeability, positivity, and consistency.11 The exchangeability assumption requires that individuals with and without the exposure exhibit, on average, an equivalent risk for the outcome.12 This equivalence enables the unexposed group to function as a substitute for the counterfactual outcomes of the exposed group, and vice versa. Positivity requires that every individual has a nonzero probability of receiving each level of exposure, and consistency requires that the observed outcome under a given exposure corresponds to the potential outcome under that exposure.11
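Of the three assumptions, positivity is the most directly checkable in data: every covariate stratum must have a nonzero probability of receiving each exposure level. A minimal sketch with hypothetical toy data (binary covariate L, binary treatment A; values are illustrative):

```python
import numpy as np

# Toy data: binary covariate L and treatment A.
L = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A = np.array([0, 1, 0, 1, 1, 1, 0, 1])

# Positivity requires 0 < P(A=1 | L=l) < 1 in every covariate stratum;
# here the stratum-specific probabilities are estimated empirically.
for l in np.unique(L):
    p = A[L == l].mean()
    assert 0 < p < 1, f"positivity violated in stratum L={l}"
    print(f"P(A=1 | L={l}) = {p:.2f}")
```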
Estimand, estimator and estimate
The estimand represents the causal effect of interest and can be defined in different ways depending on the research question and available data. It encompasses specifying the population, the intervention, the variable of interest or outcome, the summary measure to be used, and the handling of intercurrent events.9,15
The estimator is a mathematical function that takes the observed data and produces an estimate of the target estimand.15 Common estimators include regression models, propensity score matching, and inverse probability weighting. The choice of estimator depends on the research question, the data, and the assumptions made about the causal relationships between variables. These assumptions are typically encoded in Directed Acyclic Graphs (DAG), a tool that allows users to represent the set of assumptions about the relationships between variables.16 Particularly in observational studies, DAG help define how variables will be treated within the statistical model to suppress or minimize potential biases in the data. This process relies on expert knowledge.
Lastly, the estimate is the numerical value obtained when applying the estimator to the data, which, under the identifiability assumptions described above (i.e., exchangeability, positivity, and consistency), the researcher may interpret as a causal effect.11
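As an illustration of the estimand-estimator-estimate distinction, the sketch below applies an inverse probability weighting estimator to hypothetical toy data with a single binary confounder L (all values are illustrative) to obtain an estimate of the average treatment effect:

```python
import numpy as np

# Toy data with confounding by a binary covariate L.
L = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A = np.array([0, 0, 0, 1, 0, 1, 1, 1])
Y = np.array([0, 0, 1, 1, 0, 1, 1, 1], dtype=float)

# Estimator: inverse probability weighting, with the propensity
# P(A=1 | L) estimated nonparametrically within strata of L.
ps = np.array([A[L == l].mean() for l in L])     # propensity of each subject
w = np.where(A == 1, 1 / ps, 1 / (1 - ps))       # IP weights

# Hajek-style weighted means estimate E[Y(1)] and E[Y(0)].
ey1 = np.average(Y[A == 1], weights=w[A == 1])
ey0 = np.average(Y[A == 0], weights=w[A == 0])
ate = ey1 - ey0   # estimate: 5/6, vs the confounded naive contrast of 0.75
```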
The target trial framework
The target trial
Causal observational analyses are prone to biases due both to the non-controlled nature of the data and to common design flaws. These can be overcome by designing a hypothetical randomized trial (the target trial). Target trial emulation is a two-step process. The first step articulates the causal question in the form of the protocol of a hypothetical RCT. The second step emulates the components of that target trial using observational data.18,19 This two-step process is iterative, particularly in the case of already collected observational data, because the data necessary to emulate the target trial may not be present in the data source. In such cases, it may be necessary to return to step 1 and reformulate the target trial. Investigators must then ask whether the trial that can be emulated with the available data is still of interest. The target trial framework can be used to design studies that address a wide range of research questions, including ones related to drug safety, treatment effectiveness, or disease prognosis.20–40 However, only pragmatic target trials can be emulated, because real-world data cannot be used to emulate a placebo-controlled trial or one that employs a blinded design.2
Components of the target trial protocol and their emulation using transactional data sources
The target trial protocol must include several key components essential for designing and conducting a clinical study: eligibility criteria; treatment strategies; assignment procedures; follow-up; outcome(s); causal contrast; and data analysis plan.2–5,7 Herein, we describe each component of the target trial and provide some considerations regarding its emulation using observational data.
The eligibility criteria should specify the prerequisites for participant inclusion. Emulating this component of the target trial with observational data involves finding eligible individuals in healthcare databases. Adequate methods of electronic phenotyping, structuring clinical notes through natural language processing (NLP) techniques, and inclusion of complementary data sources can help mitigate the risk of misclassification bias at this stage.41 Another challenge when working with healthcare databases is identifying regular users of a particular healthcare system. For this reason, an eligibility criterion requiring a minimum time or number of interactions with the system is often included.7
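As a sketch of how such an eligibility criterion might be operationalized, the following code uses hypothetical encounter data, and the one-year look-back window is an assumed rule for illustration, not a standard. It selects patients whose first recorded encounter precedes the index date by at least the required window:

```python
from datetime import date, timedelta

# Hypothetical encounter records: patient id -> list of visit dates.
encounters = {
    "p1": [date(2019, 1, 5), date(2019, 6, 1), date(2020, 3, 2)],
    "p2": [date(2020, 2, 28)],
    "p3": [date(2018, 11, 1), date(2020, 1, 15)],
}

index_date = date(2020, 3, 1)
min_history = timedelta(days=365)  # assumed rule: >= 1 year of prior interaction

# A patient is eligible if their first recorded encounter precedes the
# index date by at least the required look-back window.
eligible = sorted(
    pid for pid, visits in encounters.items()
    if min(visits) <= index_date - min_history
)
# "p2" has no baseline history in the system and is excluded.
```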
The treatment strategies component describes the interventions to be compared in the target trial. These can be either point (i.e., one-time) or sustained (i.e., long-term) interventions.42 Sustained interventions can be further classified as static or dynamic. Static strategies are fully defined at baseline (e.g., take statins during the entire follow-up period). Conversely, in dynamic strategies the treatment depends on the patient’s evolving characteristics, such as response or side-effects (e.g., take statins during the entire follow-up unless a contraindication develops, in which case interrupt treatment). Emulating a target trial that compares dynamic treatment strategies requires adequate data not only on the treatment received but also on the circumstances under which patients are ‘excused’ from following the assigned treatment (e.g., in the case of statin treatment, myopathy or hepatic impairment).43
The assignment component details the method of random allocation of patients to therapeutic strategies, noting that participants are aware of their assignment given the absence of blinding in pragmatic trials. Emulating this component using observational data requires adjustment for baseline covariates to control for confounding.11 The minimum set of covariates required to adjust for confounding should be chosen using a causal DAG. To emulate random assignment, high-quality data must be available for all covariates included in the DAG.11,44 The absence of information on a relevant variable should prompt the research team to reconsider the usefulness of the study or the need to reformulate the target trial.45
The follow-up component in the target trial is defined by: (i) its start (time zero) and (ii) its end (which depends on the occurrence, or not, of the primary outcome, loss to follow-up, or the administrative end of follow-up). Careful consideration of how intercurrent events will be handled may be relevant at this stage (e.g., whether or not to treat death as a censoring event).8,46 Emulation of time zero using EHR data must apply the same criteria as the target trial; time zero can be defined as the first moment when an individual meets all eligibility criteria.7 When considering the end of follow-up, defining criteria for loss to follow-up can be challenging with EHR data. A frequent choice is to use the lack of appointments or other utilization of the healthcare system for a significant amount of time.48
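These two follow-up decisions can be sketched as follows, using hypothetical data for a single patient; the criterion dates and the 365-day utilization-gap rule are assumptions for illustration only:

```python
from datetime import date, timedelta

# Hypothetical record: dates on which each eligibility criterion was
# first met, and all encounter dates for one patient.
criteria_met = {"diagnosis": date(2019, 4, 1), "age_threshold": date(2019, 9, 15)}
encounters = [date(2019, 9, 20), date(2020, 1, 10), date(2021, 8, 1)]

# Time zero: the first moment the individual meets ALL eligibility criteria,
# i.e., the latest of the dates on which each criterion was first met.
time_zero = max(criteria_met.values())

# Loss to follow-up (assumed rule): a gap of more than 365 days between
# consecutive encounters censors the patient at the last visit before the gap.
gap_limit = timedelta(days=365)
censor_date = None
for prev, nxt in zip(encounters, encounters[1:]):
    if nxt - prev > gap_limit:
        censor_date = prev
        break
```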
The outcomes section should focus on defining the clinical outcomes of interest. Emulating this component using EHR data presents two fundamental challenges. First, an appropriate electronic phenotyping strategy is needed, together with precise information about the timing of the event. Information about the event and its timing is commonly recorded in clinical notes, necessitating the use of NLP techniques to extract relationships between entities and temporal references from the text (especially for diseases with insidious onset, such as dementia).41,49 Second, recording events independently of exposure is crucial to avoid information bias,50 because certain exposures may only be investigated and recorded at the time of an event’s occurrence.51 Therefore, it is necessary to assess the timing of the recording of exposures and events, not just the timing of their occurrence.52
The causal contrast of interest section should describe intention-to-treat and/or per-protocol effects. The per-protocol effect is the effect had everybody adhered to their assigned treatment strategy. The intention-to-treat effect is the effect of being randomized to a particular treatment at baseline, regardless of whether the treatment is actually used or not during follow-up. Whenever possible, when emulating the target trial using EHR data, we would consider both estimands.
Finally, the data analysis component should outline analytical strategies to estimate the above-mentioned causal contrasts. The intention-to-treat analysis in randomized trials typically involves a Kaplan-Meier estimator to non-parametrically estimate the risk of the outcome at each time point. To adjust for potential selection bias due to loss to follow-up, individuals can be assigned time-varying inverse probability weights, with the weights depending on baseline and post-baseline variables. The per-protocol analysis is the same as the intention-to-treat analysis, except that participants are censored (i.e., their data stream is interrupted) if, and when, they deviate from their assigned treatment strategy. To adjust for the potential selection bias induced by this censoring, each individual receives a time-varying inverse probability (IP) weight. When emulating the target trial using EHR data, the statistical analyses for the intention-to-treat and per-protocol effects are identical to those described above, with the sole exception that, to estimate the intention-to-treat analogue, adjustment for baseline confounding is also needed.
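The non-parametric risk estimation described above can be sketched as a bare-bones Kaplan-Meier implementation accepting optional IP weights; this is illustrative only (toy data), not a substitute for validated survival analysis libraries:

```python
import numpy as np

def km_risk(times, events, weights=None):
    """Nonparametric (optionally IP-weighted) Kaplan-Meier risk estimate.

    times   : follow-up time for each subject
    events  : 1 if the outcome occurred at `times`, 0 if censored
    weights : optional inverse probability weights (default: unweighted)
    Returns (event_times, cumulative risk = 1 - survival) at each event time.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    w = np.ones_like(times) if weights is None else np.asarray(weights, float)

    surv = 1.0
    out_t, out_r = [], []
    for t in np.unique(times[events == 1]):
        at_risk = w[times >= t].sum()              # weighted risk set at t
        d = w[(times == t) & (events == 1)].sum()  # weighted events at t
        surv *= 1 - d / at_risk
        out_t.append(t)
        out_r.append(1 - surv)
    return out_t, out_r

# Toy example: five subjects, events at times 2, 3, and 5.
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
t, risk = km_risk(times, events)
```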
Using EHR data for causal inference
Opportunities
Transactional data, particularly EHR, are comprehensive and digitized repositories of patient health information and offer a wealth of clinical, demographic, and administrative data, making them an invaluable resource for investigating causal relationships. 53 The level of detail provided by EHR, particularly in the clinical domain, is a clear advantage over insurance claims data. 53
One of the primary domains where the fusion of transactional data sources and causal inference has found extensive application is comparative effectiveness research. Examples include direct comparisons between different vaccine products and regimens,20–24 as well as estimation of the effects of glucose-lowering drugs on diverse health outcomes.25–27 In addition, these data have been used in the assessment of the impact of various monoclonal antibodies on asthma,28 evaluation of the influence of multiple antiviral agents on hospital admissions in SARS-CoV-2 patients,29–31 and comparison of benzodiazepine regimens in individuals with post-traumatic stress disorder (PTSD).32 Furthermore, investigations within this domain have also encompassed analysis of dosing and treatment timing.33 Additionally, non-pharmacological interventions and cancer prevention strategies have been investigated in several studies,35–37 and this approach has been used in numerous studies analyzing drug safety.22,25,38–40 Using data from EHR and insurance claims can facilitate examination of health trajectories and treatment outcomes for stigmatized conditions such as mental health issues,54–56 or under-represented groups such as the elderly.57–59 These conditions often present challenges in recruitment for primary studies. The extensive sample sizes provided by EHR data facilitate such investigations, offering valuable insights into these under-represented groups.
The ability to examine uncommon exposures, population subgroups, outcomes, and their interactions transcends the constraints of small sample sizes that often typify conventional clinical trials.60 Importantly, the use of transactional data for causal inference has facilitated the discovery of disease subgroups and analysis of treatment effect heterogeneity among these groups, a feature invaluable to the concept of precision medicine.61–65 Furthermore, EHR provide a multifaceted trove of longitudinal and temporally rich data.60 This temporal granularity is of paramount importance for causal inference, as it enables the tracking of individual patients over time, capturing dynamic changes in exposures, interventions, and outcomes. The longitudinal nature of EHR data enables researchers to discern temporal sequences and better approximate causal ordering. For example, longitudinal data represented in EHR have facilitated the study of sustained statin therapy in the prevention of primary cardiovascular events,66 as well as research on the impact of statins on mortality among patients already diagnosed with cardiovascular disease.67
Challenges
Although transactional data yield many advantages, challenges arise because the data are collected from secondary sources.68 In contrast to traditional research datasets, transactional data are recorded for clinical and administrative purposes rather than for research, which affects their completeness, structure, and consistency.
Another challenge of transactional data is the subjectivity of the data recording process: often, only data considered relevant by the consulting healthcare professional are recorded.70 Information may be recorded inconsistently and remain in free text within medical notes.72,73 This is a particular problem for variables related to education, symptom severity, functional status, and social determinants of health.74 Furthermore, using EHR data for epidemiological studies may raise privacy and ethical issues.75 EHR contain sensitive personal health information, so data security is paramount. Coding errors or subjectivity in EHR data can also lead to misclassification bias. Performing manual chart review, establishing clear data collection protocols, and using standardized data extraction tools may enhance data objectivity and reliability.73
While EHR data typically offer a robust representation of individuals who have access to healthcare services, they are considerably less representative of those lacking such access. This discrepancy arises from the non-random nature of healthcare accessibility. For instance, in health systems where access is contingent upon employment status, EHR data inherently exclude the unemployed, thereby failing to capture their health experiences. Consequently, this limitation impedes efforts to mitigate healthcare disparities between those with and without access to the system. One study describing such differential access validated EHR coverage for several healthcare organizations in Colorado, USA.76 Moreover, it is important to recognize that disparities also exist within groups that do have access to healthcare services.
Diverse strategies have been implemented to lessen the shortcomings of transactional data. For example, generalizability concerns have been addressed by comparing data from EHR with data from population statistics such as Census estimates.77 This approach, although useful, is not without limitations.78 Leveraging large, diverse EHR networks, if available, can also help alleviate this issue.79 In addition, sensitivity analyses focused on missing data, and integration of external data sources, may serve as effective strategies to mitigate potential biases.48,78 These analyses should be guided by DAG, which can assist in identifying which variables should be collected and included in those analyses. However, it is important to acknowledge that these methods necessitate specific assumptions regarding the ‘missingness’ of the data and the differences among ‘missing completely at random’, ‘missing at random’, and ‘missing not at random’.80
Coding errors or subjectivity in data recorded within EHR can also be a form of information bias. Establishing clear data collection protocols and employing standardized data extraction tools can enhance data objectivity and reliability. The current standard for building a clinical data warehouse uses electronic phenotyping algorithms validated by subject-matter experts within extract-transform-load (ETL) processes that include NLP methods, in addition to a common data model such as the Observational Medical Outcomes Partnership (OMOP) model managed by the Observational Health Data Sciences and Informatics (OHDSI) group.41,49,73,81,82
Interpreting results: navigating through potential biases
In observational studies, especially those leveraging transactional data sources, the quest to draw causal inferences is fraught with challenges.11 The identifiability assumptions required for causal inference are seldom fully met. This fundamental limitation necessitates careful interpretation of results, with awareness of potential biases. Importantly, even with rigorous methodology, residual confounding can never be entirely ruled out in observational studies. Researchers should discuss specific sources of potential bias in their studies and provide a general assessment of the possible magnitude and direction of these biases. Furthermore, formal causal sensitivity analyses can help clarify the sensitivity of the results to potential biases. These analyses involve systematically varying key assumptions and observing the impact on the study’s conclusions.83
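One widely used formal sensitivity analysis, although not named above, is the E-value of VanderWeele and Ding: the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both the treatment and the outcome to fully explain away an observed risk ratio. A minimal sketch:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio `rr`.

    Uses the closed form E = RR + sqrt(RR * (RR - 1)); for protective
    estimates (RR < 1) the reciprocal is taken first, by convention.
    """
    rr = 1 / rr if rr < 1 else rr
    return rr + math.sqrt(rr * (rr - 1))

# An observed RR of 2.0 yields an E-value of 2 + sqrt(2), i.e. about 3.41:
# an unmeasured confounder would need associations of that magnitude with
# both treatment and outcome to fully explain away the estimate.
```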
Another potential source of bias is misclassification. Eligibility criteria, exposures, outcomes, and/or confounders may be measured with error in EHR data. A frequent problem occurs when outcomes are ascertained differently in those with and without the exposure (e.g., exposed individuals interact more frequently with the healthcare system and are thus more likely to receive a diagnosis). Performing manual chart review, refining phenotypes used in EHR, and using standardized data extraction tools may enhance data objectivity and reliability. 73
Despite these challenges, observational studies do provide valuable insights into therapies, particularly in situations where RCT are not feasible. However, the influence of potential biases should not be underestimated. Accordingly, several researchers have proposed a framework for interpreting observational studies in the context of decision-making and policy development. 84
Future directions and implications
The learning health system
The concept of a learning health system, where real-world data continually informs and improves clinical practice, is a promising approach to evidence-informed decision making. 85 Transactional data sources serve as the lifeblood for these learning health systems, facilitating the seamless integration of evidence generation and clinical care. Combined with recent advancements in observational causal inference, the gap between research and practice can be reduced. 86 The learning health system framework can help healthcare providers adapt and refine strategies based on the most current and reliable causal evidence, ultimately leading to better patient outcomes. 85
Collaborative efforts for data sharing
Collaborative initiatives for data sharing are paramount in advancing causal inference using transactional data in healthcare. These endeavors foster collective access to diverse datasets from multiple institutions and healthcare systems. This not only enhances dataset representativeness, but also allows for findings to be validated across varied populations and settings. 87 By pooling resources, researchers can conduct large-scale, multi-site studies, leading to more robust and generalizable causal conclusions. Importantly, ethical considerations and stringent data privacy safeguards remain crucial, ensuring that patient confidentiality is upheld throughout this collaborative pursuit of knowledge.
Value-based healthcare
The shift towards value-based healthcare models will further accentuate the importance of transactional data sources for causal inference. With a focus on delivering high-quality care while optimizing costs, value-based healthcare demands a meticulous understanding of which interventions yield the most favorable outcomes. 88 Transactional data, augmented by causal inference methodologies, will enable healthcare systems to identify and implement interventions that offer the best value. This approach will not only benefit patients by ensuring they receive the most effective treatments, but also contribute to the sustainability of healthcare systems.
Conclusions
The integration of the target trial framework with transactional data sources, particularly EHR, represents a significant advancement in the field of causal inference in healthcare. This review has provided a comprehensive examination of the application of this framework, emphasizing its potential to inform crucial decision-making processes. As the adoption of transactional systems continues to grow, so too does the potential for leveraging these data sources for causal inference. By embracing opportunities presented by the target trial framework in conjunction with transactional data, we are poised to make significant strides in advancing our understanding of causal relationships in healthcare, ultimately leading to more effective and personalized patient care.
