Abstract
The target trial framework has emerged as a powerful tool for addressing causal questions in clinical practice and public health. In the healthcare sector, where decision-making is increasingly data-driven, transactional databases, such as electronic health records (EHR) and insurance claims, offer untapped potential for answering complex causal questions. This narrative review explores the potential of integrating the target trial framework with real-world data to enhance healthcare decision-making processes. We outline essential elements of the target trial framework, identify pertinent challenges in data quality, privacy concerns, and methodological limitations, and propose solutions to overcome these obstacles and optimize the framework’s application.
Background
The pursuit of effective healthcare interventions and informed decision-making are enduring challenges in the field of medicine. Robust, evidence-based answers to causal questions are paramount in guiding clinical practice and public health policy.1 The target trial framework has emerged as a useful tool for addressing causal questions in a way that avoids common pitfalls and self-inflicted biases, particularly with respect to observational data.2–5 Within the expansive landscape of healthcare, the burgeoning adoption of transactional data sources such as electronic health records (EHR) and insurance claims has yielded a trove of data amenable to causal inference analyses. The application of the target trial framework to these data presents an opportunity to augment our understanding of healthcare interventions and outcomes. The purpose of this narrative review was to explore the potential of integrating the target trial framework with real-world data to enhance healthcare decision-making processes.
Search strategy
For this narrative review, a comprehensive literature search of PubMed/Medline and Google Scholar was conducted to identify relevant studies. Search terms included: ‘causal inference’; ‘target trial’; ‘electronic health records’; ‘electronic medical records’; ‘insurance claims data’. Boolean operators (AND, OR) were used to combine these terms and variations of them, ensuring comprehensive retrieval of articles pertaining to the application of the target trial framework with EHR and/or insurance claims data.
Causal analyses of observational data
Causal inference
Research and causal inquiries are crucial in public health, as they aim to identify the underlying causes of health outcomes and to inform the most effective interventions.1 Understanding the comparative efficacy or safety of different health interventions helps in developing evidence-based policies, programs, and interventions, with the ultimate goal of preventing and controlling diseases, reducing disparities, and promoting health equity.1,6 This is particularly important for guiding policy definitions and strategies in any organization directly or indirectly responsible for a population, such as government ministries, funding entities, or service-providing institutions. These decisions are often informed by randomized controlled trials (RCT), which are considered the gold standard for evaluating causal questions.2 In addition, other types of analyses, such as health technology assessments, cost-effectiveness studies, or budget impact analyses, can provide useful data and supplement results from RCT. However, conducting an RCT may not always be feasible due to cost constraints, ethical considerations, or the need for timely data.1 In these circumstances, analysis of observational data from transactional sources, such as EHR, may provide an alternative to help determine causal relationships.2,5,7,8 Nevertheless, the use of observational data (specifically EHR data) comes with its own set of challenges that must be addressed before valid conclusions can be drawn.
The process of causal question analysis
The process of formulating and addressing causal questions involves formulating an estimand, selecting an estimator, and applying them to data to obtain an estimate that, under certain assumptions, may be interpreted as causal.9 Under the potential outcomes framework, discerning a causal relationship involves comparing outcomes for the same individual across different levels of exposure.10 In other words, the analysis entails contrasting the expected outcome for a patient exposed to treatment A with that for the same patient had she been exposed to treatment B. This is inherently challenging because, by definition, one of the potential outcomes remains unobserved in the real world and is therefore counterfactual. Consequently, causal inference is typically framed as a missing data problem, as we commonly lack data on one of the potential outcomes.11
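To make this missing data framing concrete, the following minimal sketch uses hypothetical toy data (all values and names are illustrative): each individual has two potential outcomes, but the observed data reveal only the one corresponding to the treatment actually received.

```python
import numpy as np

# Hypothetical potential outcomes for five individuals.
# In reality, only one of the two is ever observed per person.
y1 = np.array([1, 0, 1, 1, 0])  # outcome had the person received treatment A
y0 = np.array([0, 0, 1, 0, 0])  # outcome had the same person received treatment B
a = np.array([1, 0, 1, 0, 1])   # treatment actually received

# The observed outcome reveals only the potential outcome under the
# received treatment; the other remains counterfactual (i.e., missing).
y_obs = np.where(a == 1, y1, y0)

# The individual causal effect y1 - y0 is therefore never identifiable
# from y_obs alone: one of its two terms is always missing.
```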
Identifiability assumptions serve as a means to bridge the gap between potential outcomes and actual outcomes. To assist in making this leap, epidemiologists use three assumptions: exchangeability, positivity, and consistency.11 The exchangeability assumption requires that individuals with and without the exposure exhibit, on average, an equivalent risk for the outcome.12 This equivalence enables the unexposed group to function as a substitute for the counterfactual outcomes of the exposed group, and vice versa. Positivity requires that every individual has a nonzero probability of receiving each level of exposure, and consistency requires that the observed outcome under a given exposure corresponds to the potential outcome under that exposure.11
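Of the three assumptions, positivity is the most directly checkable in data: every covariate stratum must have a nonzero probability of receiving each exposure level. A minimal sketch with hypothetical toy data (binary covariate L, binary treatment A; values are illustrative):

```python
import numpy as np

# Toy data: binary covariate L and treatment A.
L = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A = np.array([0, 1, 0, 1, 1, 1, 0, 1])

# Positivity requires 0 < P(A=1 | L=l) < 1 in every covariate stratum;
# here the stratum-specific probabilities are estimated empirically.
for l in np.unique(L):
    p = A[L == l].mean()
    assert 0 < p < 1, f"positivity violated in stratum L={l}"
    print(f"P(A=1 | L={l}) = {p:.2f}")
```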
Estimand, estimator and estimate
The estimand represents the causal effect of interest and can be defined in different ways depending on the research question and available data. It encompasses specifying the population, the intervention, the variable of interest or outcome, the summary measure to be used, and the handling of intercurrent events.9,15
The estimator is a mathematical function that takes the observed data and produces an estimate of the target estimand.15 Common estimators include regression models, propensity score matching, and inverse probability weighting. The choice of estimator depends on the research question, the data, and the assumptions made about the causal relationships between variables. These assumptions are typically encoded in Directed Acyclic Graphs (DAG), a tool that allows users to represent the set of assumptions about the relationships between variables.16 Particularly in observational studies, DAG help define how variables will be treated within the statistical model to suppress or minimize potential biases in the data. This process relies on expert knowledge.
Lastly, the estimate is the numerical value obtained when applying the estimator to the data, which, under the identifiability assumptions described above (i.e., exchangeability, positivity, and consistency), the researcher may interpret as a causal effect.11
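As an illustration of the estimand-estimator-estimate distinction, the sketch below applies an inverse probability weighting estimator to hypothetical toy data with a single binary confounder L (all values are illustrative) to obtain an estimate of the average treatment effect:

```python
import numpy as np

# Toy data with confounding by a binary covariate L.
L = np.array([0, 0, 0, 0, 1, 1, 1, 1])
A = np.array([0, 0, 0, 1, 0, 1, 1, 1])
Y = np.array([0, 0, 1, 1, 0, 1, 1, 1], dtype=float)

# Estimator: inverse probability weighting, with the propensity
# P(A=1 | L) estimated nonparametrically within strata of L.
ps = np.array([A[L == l].mean() for l in L])     # propensity of each subject
w = np.where(A == 1, 1 / ps, 1 / (1 - ps))       # IP weights

# Hajek-style weighted means estimate E[Y(1)] and E[Y(0)].
ey1 = np.average(Y[A == 1], weights=w[A == 1])
ey0 = np.average(Y[A == 0], weights=w[A == 0])
ate = ey1 - ey0   # estimate: 5/6, vs the confounded naive contrast of 0.75
```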
The target trial framework
The target trial
Causal observational analyses are prone to biases due both to the non-controlled nature of the data and to common design flaws. These can be overcome by designing a hypothetical randomized trial (the target trial). Target trial emulation is a two-step process. The first step articulates the causal question in the form of the protocol of a hypothetical RCT. The second step emulates the components of that target trial using observational data.18,19 This two-step process is iterative, particularly in the case of already collected observational data, because the data necessary to emulate the target trial may not be present in the data source. In such cases, it may be necessary to return to step 1 and reformulate the target trial. Investigators must then ask whether the trial that can be emulated with the available data is still of interest. The target trial framework can be used to design studies that address a wide range of research questions, including ones related to drug safety, treatment effectiveness, or disease prognosis.20–40 However, only pragmatic target trials can be emulated, because real-world data cannot be used to emulate a placebo-controlled trial or one that employs a blinded design.2
Components of the target trial protocol and their emulation using transactional data sources
The target trial protocol must include several key components essential for designing and conducting a clinical study: eligibility criteria; treatment strategies; assignment procedures; follow-up; outcome(s); causal contrast; and data analysis plan.2–5,7 Herein, we describe each component of the target trial and provide some considerations regarding its emulation using observational data.
The eligibility criteria should specify the prerequisites for participant inclusion. Emulating this component of the target trial with observational data involves finding eligible individuals in healthcare databases. Adequate methods of electronic phenotyping, structuring clinical notes through natural language processing (NLP) techniques, and inclusion of complementary data sources can help mitigate the risk of misclassification bias at this stage.41 Another challenge when working with healthcare databases is identifying regular users of a particular healthcare system. For this reason, an eligibility criterion requiring a minimum time or number of interactions with the system is often included.7
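As a sketch of how such an eligibility criterion might be operationalized, the following code uses hypothetical encounter data, and the one-year look-back window is an assumed rule for illustration, not a standard. It selects patients whose first recorded encounter precedes the index date by at least the required window:

```python
from datetime import date, timedelta

# Hypothetical encounter records: patient id -> list of visit dates.
encounters = {
    "p1": [date(2019, 1, 5), date(2019, 6, 1), date(2020, 3, 2)],
    "p2": [date(2020, 2, 28)],
    "p3": [date(2018, 11, 1), date(2020, 1, 15)],
}

index_date = date(2020, 3, 1)
min_history = timedelta(days=365)  # assumed rule: >= 1 year of prior interaction

# A patient is eligible if their first recorded encounter precedes the
# index date by at least the required look-back window.
eligible = sorted(
    pid for pid, visits in encounters.items()
    if min(visits) <= index_date - min_history
)
# "p2" has no baseline history in the system and is excluded.
```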
The treatment strategies component describes the interventions to be compared in the target trial. These can be either point (i.e., one-time) or sustained (i.e., long-term) interventions.42 Sustained interventions can be further classified as static or dynamic. Static strategies are fully defined at baseline (e.g., take statins during the entire follow-up period). Conversely, in dynamic strategies the treatment depends on the patient’s evolving characteristics, such as response or side-effects (e.g., take statins during the entire follow-up unless a contraindication develops, in which case interrupt treatment). Emulating a target trial that compares dynamic treatment strategies requires adequate data not only on the treatment received but also on the circumstances under which patients are ‘excused’ from following the assigned treatment (e.g., in the case of statin treatment, myopathy or hepatic impairment).43
The assignment component details the method of random allocation of patients to therapeutic strategies, noting that participants are aware of their assignment given the absence of blinding in pragmatic trials. Emulating this component using observational data requires adjustment for baseline covariates to control for confounding.11 The minimum set of covariates required to adjust for confounding should be chosen using a causal DAG. To emulate random assignment, high-quality data must be available for all covariates included in the DAG.11,44 The absence of information on a relevant variable should prompt the research team to reconsider the usefulness of the study or the need to reformulate the target trial.45
The follow-up component in the target trial is defined by: (i) its start (time zero) and (ii) its end (which depends on the occurrence, or not, of the primary outcome, loss to follow-up, or the administrative end of follow-up). Careful consideration of how intercurrent events will be handled may be relevant at this stage (e.g., whether or not to treat death as a censoring event).8,46 Emulation of time zero using EHR data must apply the same criteria as the target trial; time zero can be defined as the first moment when an individual meets all eligibility criteria.7 When considering the end of follow-up, defining criteria for loss to follow-up can be challenging with EHR data. A frequent choice is to use the lack of appointments or other utilization of the healthcare system for a significant amount of time.48
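These two follow-up decisions can be sketched as follows, using hypothetical data for a single patient; the criterion dates and the 365-day utilization-gap rule are assumptions for illustration only:

```python
from datetime import date, timedelta

# Hypothetical record: dates on which each eligibility criterion was
# first met, and all encounter dates for one patient.
criteria_met = {"diagnosis": date(2019, 4, 1), "age_threshold": date(2019, 9, 15)}
encounters = [date(2019, 9, 20), date(2020, 1, 10), date(2021, 8, 1)]

# Time zero: the first moment the individual meets ALL eligibility criteria,
# i.e., the latest of the dates on which each criterion was first met.
time_zero = max(criteria_met.values())

# Loss to follow-up (assumed rule): a gap of more than 365 days between
# consecutive encounters censors the patient at the last visit before the gap.
gap_limit = timedelta(days=365)
censor_date = None
for prev, nxt in zip(encounters, encounters[1:]):
    if nxt - prev > gap_limit:
        censor_date = prev
        break
```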
The outcomes section should focus on defining the clinical outcomes of interest. Emulating this component using EHR data presents two fundamental challenges. First, an appropriate electronic phenotyping strategy is needed, together with precise information about the timing of the event. Information about the event and its timing is commonly recorded in clinical notes, necessitating the use of NLP techniques to extract relationships between entities and temporal references from the text (especially for diseases with insidious onset, such as dementia).41,49 Second, recording events independently of exposure is crucial to avoid information bias,50 because certain exposures may only be investigated and recorded at the time of an event’s occurrence.51 Therefore, it is necessary to assess the timing of the recording of exposures and events, not just the timing of their occurrence.52
The causal contrast of interest section should describe intention-to-treat and/or per-protocol effects. The per-protocol effect is the effect had everybody adhered to their assigned treatment strategy. The intention-to-treat effect is the effect of being randomized to a particular treatment at baseline, regardless of whether the treatment is actually used or not during follow-up. Whenever possible, when emulating the target trial using EHR data, we would consider both estimands.
Finally, the data analysis component should outline analytical strategies to estimate the above-mentioned causal contrasts. The intention-to-treat analysis in randomized trials typically involves a Kaplan-Meier estimator to non-parametrically estimate the risk of the outcome at each time point. To adjust for potential selection bias due to loss to follow-up, individuals can be assigned time-varying inverse probability weights, with the weights depending on baseline and post-baseline variables. The per-protocol analysis is the same as the intention-to-treat analysis, except that participants are censored (i.e., their data stream is interrupted) if, and when, they deviate from their assigned treatment strategy. To adjust for the potential selection bias induced by this censoring, each individual receives a time-varying inverse probability (IP) weight. When emulating the target trial using EHR data, the statistical analyses for the intention-to-treat and per-protocol effects are identical to those described above, with the sole exception that, to estimate the intention-to-treat analogue, adjustment for baseline confounding is also needed.
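The non-parametric risk estimation described above can be sketched as a bare-bones Kaplan-Meier implementation accepting optional IP weights; this is illustrative only (toy data), not a substitute for validated survival analysis libraries:

```python
import numpy as np

def km_risk(times, events, weights=None):
    """Nonparametric (optionally IP-weighted) Kaplan-Meier risk estimate.

    times   : follow-up time for each subject
    events  : 1 if the outcome occurred at `times`, 0 if censored
    weights : optional inverse probability weights (default: unweighted)
    Returns (event_times, cumulative risk = 1 - survival) at each event time.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    w = np.ones_like(times) if weights is None else np.asarray(weights, float)

    surv = 1.0
    out_t, out_r = [], []
    for t in np.unique(times[events == 1]):
        at_risk = w[times >= t].sum()              # weighted risk set at t
        d = w[(times == t) & (events == 1)].sum()  # weighted events at t
        surv *= 1 - d / at_risk
        out_t.append(t)
        out_r.append(1 - surv)
    return out_t, out_r

# Toy example: five subjects, events at times 2, 3, and 5.
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
t, risk = km_risk(times, events)
```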
Using EHR data for causal inference
Opportunities
Transactional data, particularly EHR, are comprehensive and digitized repositories of patient health information and offer a wealth of clinical, demographic, and administrative data, making them an invaluable resource for investigating causal relationships. 53 The level of detail provided by EHR, particularly in the clinical domain, is a clear advantage over insurance claims data. 53
One of the primary domains where the fusion of transactional data sources and causal inference has found extensive application is comparative effectiveness research. Examples include direct comparisons between different vaccine products and regimens,20–24 as well as estimation of the effects of glucose-lowering drugs on diverse health outcomes.25–27 In addition, these data have been used in the assessment of the impact of various monoclonal antibodies on asthma,28 evaluation of the influence of multiple antiviral agents on hospital admissions in SARS-CoV-2 patients,29–31 and comparison of benzodiazepine regimens in individuals with post-traumatic stress disorder (PTSD).32 Furthermore, investigations within this domain have also encompassed analysis of dosing and treatment timing.33 Additionally, non-pharmacological interventions and cancer prevention strategies have been investigated in several studies,35–37 and this approach has been used in numerous studies analyzing drug safety.22,25,38–40 Using data from EHR and insurance claims can facilitate examination of health trajectories and treatment outcomes for stigmatized conditions such as mental health issues,54–56 or under-represented groups such as the elderly.57–59 These conditions often present challenges in recruitment for primary studies. The extensive sample sizes provided by EHR data facilitate such investigations, offering valuable insights into these under-represented groups.
The ability to examine uncommon exposures, population subgroups, outcomes, and their interactions transcends the constraints of small sample sizes that often typify conventional clinical trials.60 Importantly, the use of transactional data for causal inference has facilitated the discovery of disease subgroups and analysis of treatment effect heterogeneity among these groups, a feature invaluable to the concept of precision medicine.61–65 Furthermore, EHR provide a multifaceted trove of longitudinal and temporally rich data.60 This temporal granularity is of paramount importance for causal inference, as it enables the tracking of individual patients over time, capturing dynamic changes in exposures, interventions, and outcomes. The longitudinal nature of EHR data enables researchers to discern temporal sequences and better approximate causal ordering. For example, longitudinal data represented in EHR have facilitated the study of sustained statin therapy in the prevention of primary cardiovascular events,66 as well as research on the impact of statins on mortality among patients already diagnosed with cardiovascular disease.67
Challenges
Although transactional data yield many advantages, challenges arise because the data are collected from secondary sources.68 In contrast to traditional research datasets, transactional data are recorded for clinical and administrative purposes rather than for research, which affects their completeness, structure, and consistency.
Another challenge of transactional data is the subjectivity of the data recording process: often, only data considered relevant by the consulting healthcare professional are recorded.70 Information may be recorded inconsistently and remain in free text within medical notes.72,73 This is a particular problem for variables related to education, symptom severity, functional status, and social determinants of health.74 Furthermore, using EHR data for epidemiological studies may raise privacy and ethical issues.75 EHR contain sensitive personal health information, so data security is paramount. Coding errors or subjectivity in EHR data can also lead to misclassification bias. Performing manual chart review, establishing clear data collection protocols, and using standardized data extraction tools may enhance data objectivity and reliability.73
While EHR data typically offer a robust representation of individuals who have access to healthcare services, they are considerably less representative of those lacking such access. This discrepancy arises from the non-random nature of healthcare accessibility. For instance, in health systems where access is contingent upon employment status, EHR data inherently exclude the unemployed, thereby failing to capture their health experiences. Consequently, this limitation impedes efforts to mitigate healthcare disparities between those with and without access to the system. One study describing such differential access validated EHR coverage for several healthcare organizations in Colorado, USA.76 Moreover, it is important to recognize that disparities also exist within groups that do have access to healthcare services.
Diverse strategies have been implemented to lessen the shortcomings of transactional data. For example, generalizability concerns have been addressed by comparing data from EHR with data from population statistics such as Census estimates.77 This approach, although useful, is not without limitations.78 Leveraging large, diverse EHR networks, if available, can also help alleviate this issue.79 In addition, sensitivity analyses focused on missing data, and integration of external data sources, may serve as effective strategies to mitigate potential biases.48,78 These analyses should be guided by DAG, which can assist in identifying which variables should be collected and included in those analyses. However, it is important to acknowledge that these methods necessitate specific assumptions regarding the ‘missingness’ of the data and the differences among ‘missing completely at random’, ‘missing at random’, and ‘missing not at random’.80
Coding errors or subjectivity in data recorded within EHR can also be a form of information bias. Establishing clear data collection protocols and employing standardized data extraction tools can enhance data objectivity and reliability. The current standard for building a clinical data warehouse uses electronic phenotyping algorithms validated by subject-matter experts within extract-transform-load (ETL) processes that include NLP methods, in addition to a common data model such as the Observational Medical Outcomes Partnership (OMOP) model managed by the Observational Health Data Sciences and Informatics (OHDSI) group.41,49,73,81,82
Interpreting results: navigating through potential biases
In observational studies, especially those leveraging transactional data sources, the quest to draw causal inferences is fraught with challenges.11 The identifiability assumptions required for causal inference are seldom fully met. This fundamental limitation necessitates careful interpretation of results, with awareness of potential biases. Importantly, even with rigorous methodology, residual confounding can never be entirely ruled out in observational studies. Researchers should discuss specific sources of potential bias in their studies and provide a general assessment of the possible magnitude and direction of these biases. Furthermore, formal causal sensitivity analyses can help clarify the sensitivity of the results to potential biases. These analyses involve systematically varying key assumptions and observing the impact on the study’s conclusions.83
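One widely used formal sensitivity analysis, although not named above, is the E-value of VanderWeele and Ding: the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both the treatment and the outcome to fully explain away an observed risk ratio. A minimal sketch:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio `rr`.

    Uses the closed form E = RR + sqrt(RR * (RR - 1)); for protective
    estimates (RR < 1) the reciprocal is taken first, by convention.
    """
    rr = 1 / rr if rr < 1 else rr
    return rr + math.sqrt(rr * (rr - 1))

# An observed RR of 2.0 yields an E-value of 2 + sqrt(2), i.e. about 3.41:
# an unmeasured confounder would need associations of that magnitude with
# both treatment and outcome to fully explain away the estimate.
```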
Another potential source of bias is misclassification. Eligibility criteria, exposures, outcomes, and/or confounders may be measured with error in EHR data. A frequent problem occurs when outcomes are ascertained differently in those with and without the exposure (e.g., exposed individuals interact more frequently with the healthcare system and are thus more likely to receive a diagnosis). Performing manual chart review, refining phenotypes used in EHR, and using standardized data extraction tools may enhance data objectivity and reliability. 73
Despite these challenges, observational studies do provide valuable insights into therapies, particularly in situations where RCT are not feasible. However, the influence of potential biases should not be underestimated. Accordingly, several researchers have proposed a framework for interpreting observational studies in the context of decision-making and policy development. 84
Future directions and implications
The learning health system
The concept of a learning health system, where real-world data continually informs and improves clinical practice, is a promising approach to evidence-informed decision making. 85 Transactional data sources serve as the lifeblood for these learning health systems, facilitating the seamless integration of evidence generation and clinical care. Combined with recent advancements in observational causal inference, the gap between research and practice can be reduced. 86 The learning health system framework can help healthcare providers adapt and refine strategies based on the most current and reliable causal evidence, ultimately leading to better patient outcomes. 85
Collaborative efforts for data sharing
Collaborative initiatives for data sharing are paramount in advancing causal inference using transactional data in healthcare. These endeavors foster collective access to diverse datasets from multiple institutions and healthcare systems. This not only enhances dataset representativeness, but also allows for findings to be validated across varied populations and settings. 87 By pooling resources, researchers can conduct large-scale, multi-site studies, leading to more robust and generalizable causal conclusions. Importantly, ethical considerations and stringent data privacy safeguards remain crucial, ensuring that patient confidentiality is upheld throughout this collaborative pursuit of knowledge.
Value-based healthcare
The shift towards value-based healthcare models will further accentuate the importance of transactional data sources for causal inference. With a focus on delivering high-quality care while optimizing costs, value-based healthcare demands a meticulous understanding of which interventions yield the most favorable outcomes. 88 Transactional data, augmented by causal inference methodologies, will enable healthcare systems to identify and implement interventions that offer the best value. This approach will not only benefit patients by ensuring they receive the most effective treatments, but also contribute to the sustainability of healthcare systems.
Conclusions
The integration of the target trial framework with transactional data sources, particularly EHR, represents a significant advancement in the field of causal inference in healthcare. This review has provided a comprehensive examination of the application of this framework, emphasizing its potential to inform crucial decision-making processes. As the adoption of transactional systems continues to grow, so too does the potential for leveraging these data sources for causal inference. By embracing opportunities presented by the target trial framework in conjunction with transactional data, we are poised to make significant strides in advancing our understanding of causal relationships in healthcare, ultimately leading to more effective and personalized patient care.
