
Abstract

Sequential trial emulation (STE) is an approach to estimating causal treatment effects by emulating a sequence of target trials from observational data. In STE, inverse probability weighting is commonly utilised to address time-varying confounding and/or dependent censoring. Then structural models for potential outcomes are applied to the weighted data to estimate treatment effects. For inference, the simple sandwich variance estimator is popular but conservative, while nonparametric bootstrap is computationally expensive, and a more efficient alternative, linearised estimating function (LEF) bootstrap, has not been adapted to STE. We evaluated the performance of various methods for constructing confidence intervals (CIs) of marginal risk differences in STE with survival outcomes by comparing the coverage of CIs based on nonparametric/LEF bootstrap, jackknife, and the sandwich variance estimator through simulations. LEF bootstrap CIs demonstrated better coverage than nonparametric bootstrap CIs and sandwich-variance-estimator-based CIs with small/moderate sample sizes, low event rates and low treatment prevalence, which were the motivating scenarios for STE. They were less affected by treatment group imbalance and faster to compute than nonparametric bootstrap CIs. With large sample sizes and medium/high event rates, the sandwich-variance-estimator-based CIs had the best coverage and were the fastest to compute. These findings offer guidance in constructing CIs in causal survival analysis using STE.

Keywords

Causal inference, confidence intervals, inverse probability weighting, marginal structural models, target trial emulation, survival analysis

1. Introduction

1.1. Target trial emulation with survival outcomes

Target trial emulation (TTE) has become a popular approach for causal inference using observational longitudinal data.^1,2 The goal of TTE is to estimate and make inferences about causal treatment effects that are comparable to those that would be obtained from a target randomised controlled trial (RCT).^1,3 TTE can be helpful when it is not possible to conduct this RCT because of time, budget and ethical constraints. Hernán and Robins (2016) have proposed a formal framework for TTE,¹ which highlights the need to specify the target trial’s protocol, that is, the protocol of the RCT that would have ideally been conducted, in order to guide the design and analysis of the emulated trial using data extracted from observational databases such as disease registries or electronic health records. The protocol should include the eligibility criteria, treatment strategies, assignment procedures, outcome(s) of interest, follow-up periods, causal contrast of interest and an analysis plan; see some step-by-step guides to TTE in Hernán et al.,⁴ Matthews et al.,⁵ and Maringe et al.⁶

In TTE there are various sources of bias that must be addressed. Firstly, unlike in an RCT, non-random assignment of treatment at baseline must be accounted for when estimating the causal effect (intention-to-treat or per-protocol) of a treatment in an emulated trial. Secondly, similar to an RCT, it is necessary to account for censoring caused by loss to follow-up in an emulated trial. Thirdly, when the per-protocol effect is the causal effect of interest, it is also necessary to handle non-adherence to assigned treatments. To address these issues in TTE with survival outcomes, a useful approach is to fit a marginal structural Cox model (MSCM) using inverse probability weighting (IPW),^7–13 after first artificially censoring the patients’ follow-up at the time of treatment non-adherence.^14–16 Baseline confounders are included in this MSCM as covariates to adjust for the non-random treatment assignment at baseline using regression. The inverse probability weights are the product of two sets of time-varying weights: One to address selection bias from censoring due to loss to follow-up, and one to address selection bias from the artificial censoring due to treatment non-adherence. Counterfactual hazard ratios can be estimated from the fitted MSCM using weighted data. A modification of this approach is to discretise the survival time and replace the MSCM with a pooled logistic regression model.^9,10 Provided that the probability of failure between the discrete times is small, this pooled logistic model well approximates the MSCM.¹⁷

The counterfactual hazard ratio has been criticised as lacking a causal interpretation, and it has been proposed that other estimands be used instead, for example, the marginal risk difference (MRD).^18–20 The MRD over time can be estimated by first using the counterfactual hazard ratio estimates from a marginal structural model (MSM) together with an estimate of the baseline hazard to predict the survival probabilities of the patients in the emulated trial under two scenarios: When all are treated and when none are treated. For each scenario, the predicted survival probabilities are averaged over all enrolled patients. The estimate of the MRD is then calculated as the difference between these two averages.^16,20 This estimator is consistent, provided that the MSM for the survival outcome is correctly specified.

1.2. Constructing confidence intervals in sequential trial emulation

A potential problem when emulating a trial is that the number of treated and/or untreated patients eligible for inclusion in a trial that begins at any given time may be small. This can be addressed by sequential trial emulation (STE),^14,21 which takes advantage of the fact that patients may meet the eligibility criteria for the target trial multiple times during their follow-up in an observational database. In STE, a sequence of target trials is emulated, each starting at a different time. The data from these sequential trials are pooled and analysed to produce an overall estimate of the treatment effect.²⁰ This approach was first proposed by Hernán et al.¹⁴ and Gran et al.²¹ as a simple way to improve the efficiency of treatment effect estimation relative to emulating a single trial. There have been several applications of STE; see Keogh et al.²⁰ for a list.

Despite the increasing popularity of TTE, there is a lack of research on different methods for constructing confidence intervals (CIs) of treatment effects in STE. The sandwich variance estimator, bootstrap or jackknife can be used to obtain a variance estimate of the parameter estimates in an MSM by accounting for correlations induced by the same patient being eligible for multiple trials. In causal survival analysis with IPW to adjust for baseline confounding of point treatments,^22,23 the sandwich variance estimator of Lin and Wei²⁴ is frequently used. However, this estimator does not account for the uncertainty due to weight estimation, and can consequently overestimate the true variance.^22,23 More complex sandwich variance estimators that account for this uncertainty are available. Shu et al.²³ proposed a variance estimator for the hazard ratio of a point treatment in an MSCM with IPW used for baseline confounding only. Enders et al.²⁵ developed a sandwich variance estimator for the hazard ratios in an MSCM when IPW is used to deal with treatment switching and censoring due to loss to follow-up. They found no substantial differences in the performance of the simple sandwich variance estimator and their sandwich variance estimator, and the latter performed comparatively poorly in scenarios with small sample sizes and many confounders. No off-the-shelf software has implemented the variance estimator by Enders et al.²⁵

Bootstrap has been recommended as an alternative to the sandwich variance estimator because it accounts for uncertainty in weight estimation. In the simple setting where IPW is used to estimate the effect of a point treatment on a survival outcome, Austin (2016) found that bootstrap CIs performed better than the sandwich-variance-estimator-based CIs when the sample size was moderate (1000).²² In the setting with continuous and binary outcomes, Austin (2022) found that when sample sizes were small (250 or 500) to moderate (1000), bootstrap resulted in more accurate estimates of standard errors than sandwich variance estimators. However, bootstrap CIs did not achieve nominal coverage when the sample sizes were small to moderate and the treatment prevalence was either very low or very high.²⁶ Mao et al. (2018) observed similar results when constructing CIs of hazard ratios for a binary point treatment using IPW: with small (500) and moderate (1000) sample sizes and strong associations between confounders and treatment assignment, both the sandwich variance estimator and bootstrap resulted in under-coverage of CIs.²⁷ For the longitudinal setting with binary time-varying treatments, Seaman and Keogh (2023) found that with moderate (1000) sample sizes, bootstrap and the sandwich variance estimator both led to slight under-coverage of CIs for hazard ratios in an MSCM, with no notable difference between the two methods.²⁸ With small (250, 500) sample sizes, the coverage of the CIs deteriorated, but bootstrap CIs had coverage closer to the nominal level than the sandwich-estimator-based CIs.²⁸

Jackknife resampling has been used in TTE to construct CIs of hazard ratio and risk difference (see Serdarevic et al.²⁹ and Virtanen et al.³⁰ for recent examples). Gran et al. (2010) also used jackknife to construct the CI of the hazard ratio of a binary treatment in the STE setting because in their analysis bootstrap led to non-convergence problems due to the large number of covariates used.²¹ Jackknife could be advantageous when the sample size is small because it is computationally faster than bootstrap and it is less likely to lead to non-convergence problems since only one patient’s data are left out in each jackknife sample.

The works mentioned above all focussed on variance estimation and CIs for counterfactual hazard ratios, which are often chosen as the estimand in the literature of TTE with survival outcomes.^14,31 While these methods were not developed specifically for TTE with survival outcomes, they could be easily applied to such a setting. However, in the more complex setting of STE, less attention appears to have been paid to the development and evaluation of CI construction methods. There is a lack of research comparing the nonparametric bootstrap and the sandwich variance estimator for constructing CIs of the MRD in various settings of STE. It is also desirable to develop CI methods that are computationally more efficient than nonparametric bootstrap.

1.3. The contribution of this article

To fill this gap, we carry out an extensive simulation study to compare different methods for constructing a CI of the MRD in STE with a survival outcome. The first method uses the sandwich variance estimator that ignores the uncertainty caused by weight estimation. The second method is nonparametric bootstrap. This has the drawback of being computationally expensive. For this reason, the third method that we investigate is the computationally less intensive linearised estimating function (LEF) bootstrap. Hu and Kalbfleisch³² first developed the estimating function bootstrap approach. In settings with cross-sectional/longitudinal survey data with design weights, Rao and Tausi³³ and Binder et al.³⁴ proposed the LEF bootstrap to improve computational efficiency and to avoid ill-conditioned matrices when fitting logistic models to bootstrap samples. We develop two forms of the LEF bootstrap for the STE setting. The fourth method is jackknife resampling. In our simulation study, we consider scenarios with varying sample sizes, treatment prevalence, outcome event rates, and strength of time-varying confounding. Our results provide some guidance to practitioners on which methods could perform better in different settings.

The article is organised as follows. In Section 2 we introduce the HIV Epidemiology Research Study (HERS) data as a motivating example and describe a protocol of STE based on the HERS data. Section 3 describes the notation, causal estimand, causal assumptions and MRD estimation procedure in STE. In Section 4, we describe the CI construction methods that we compare in this article, including our proposed LEF bootstrap CIs. Section 5 presents our simulation study. In Section 6, we apply STE to the HERS data. We conclude in Section 7 with a discussion.

2. HERS: A motivating example

The HERS included 1310 women with, or at high risk of, HIV infection at four sites (Baltimore, Detroit, New York, Providence) enrolled between 1993 and 1995 and followed up to 2000.³⁵ The HERS had 12 approximately six-monthly scheduled visits, where clinical, behavioural, and sociological outcomes and (self-reported) treatment were recorded.

Following Ko et al.³⁵ and Yiu and Su,³⁶ we aim to estimate the causal effect of (self-reported) Highly Active AntiRetroviral Treatment (HAART) on all-cause mortality among HIV-infected patients in the HERS cohort. Clinical and demographic variables related to treatment assignment and disease progression were available, including CD4 cell count, HIV viral load, self-reported HIV symptoms, race, and the site in which a patient was enrolled. Following Yiu and Su,³⁶ we treat visit 8 in 1996 as the baseline of the observational cohort, as HAART was more widely used and recorded in the HERS by then. There were 584 women assessed at visit 8. Time of death during follow-up was recorded exactly and there were 24 deaths in total. Some patients were also lost to follow-up, with 179 patients assessed at the last follow-up visit 12. Yiu and Su (2022) conducted their analysis with standard MSMs with IPW by defining the time-varying treatment as ordinal with 3 levels: ‘no treatment’, ‘antiretroviral therapy other than HAART’, and ‘HAART’.³⁶ In our analysis we consider a binary treatment: ‘HAART’ versus ‘no HAART’. A hypothetical RCT (the target trial) to estimate the per-protocol effect of HAART (vs. other or no treatment) on all-cause mortality could be emulated using the HERS data. The target trial protocol can be found in Table 14 of the Supplemental Materials.

As mentioned in Section 1, a practical problem for TTE is that if we only emulate a single trial, the number of patients who initiate (i.e. start to receive) the treatment at baseline and the number of outcome events among them could be small. In the HERS, only 76 patients initiated the HAART when baseline is defined as visit 8. By emulating a sequence of target trials and combining their analyses, more efficient estimates of treatment effects can be obtained. For example, an additional 62 women initiated HAART at visit 9 of the HERS, and so would be in the treatment arm of an emulated trial with baseline defined as visit 9.

In Section 6 we emulate 5 sequential trials from the HERS data labelled from 0 to 4 with sequential enrolment periods so that the trials start at visits 8, 9, 10, 11 and 12, respectively. The trial protocol, and more specifically the eligibility criteria, remain the same across all 5 trials, which in our example means that patients must have no prior use of HAART before the baseline of the trial. The study horizons differ: trial 0 has 4 follow-up assessments at visits 9–12; trial 1 has 3 follow-up assessments at visits 10–12; and so on. Trial 4 only has a baseline assessment at visit 12 and no further follow-up. This approach means that we can use data from patients who started receiving HAART later in the HERS cohort. Table 1 presents a tabulation of the HERS data prepared for STE, where we note that the total number of patients in the treatment arm is increased from 76 (in a single trial with baseline at visit 8) to 234 by using the STE approach. A patient can be eligible for multiple trials. For example, a patient who had not been receiving HAART at visits 8 and 9 but started to receive HAART from visit 10 will be eligible as a member of the control arm in trials 0 and 1, and as a member of the treatment arm in trial 2. Moreover, this patient’s follow-up in trials 0 and 1 will be artificially censored at visit 10. Figure 1 of the Supplemental Materials provides a schematic illustration of the STE approach.
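As a rough illustration of the expansion described above, the following Python sketch assigns a single patient to sequential trials from a visit-by-visit HAART history. The function name, the dictionary keys and the example history are hypothetical, and the sketch ignores death and loss to follow-up, which a real analysis must also handle.

```python
def expand_to_sequential_trials(treatment_by_visit, first_visit=8):
    """Expand one patient's treatment history into per-trial records.

    treatment_by_visit: 0/1 HAART indicators, one per cohort visit,
    starting at `first_visit`. Hypothetical helper, for illustration only.
    """
    trials = []
    ever_treated_before = False
    for m, arm in enumerate(treatment_by_visit):
        # Eligibility: no HAART use before the baseline of trial m.
        if not ever_treated_before:
            visits, censored_at = [], None
            for k, a in enumerate(treatment_by_visit[m:]):
                if a != arm:
                    # Deviation from the baseline arm: artificial censoring.
                    censored_at = first_visit + m + k
                    break
                visits.append(first_visit + m + k)
            trials.append({"trial": m, "arm": arm, "visits": visits,
                           "artificially_censored_at": censored_at})
        ever_treated_before = ever_treated_before or arm == 1
    return trials


# Patient untreated at visits 8 and 9, on HAART from visit 10 onwards.
for record in expand_to_sequential_trials([0, 0, 1, 1, 1]):
    print(record)
```

For this history the patient contributes to the control arm of trials 0 and 1, is artificially censored at visit 10 in both, and enters the treatment arm of trial 2.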

Figure 1.

Empirical coverage of the CIs in the scenarios with low event rates. Bootstrap: CIs constructed by nonparametric bootstrap. LEF both: CIs constructed by applying Approach 2 of LEF bootstrap. LEF outcome: CIs constructed by applying Approach 1 of LEF bootstrap. Jackknife Wald: CIs constructed by applying Approach 1 of jackknife resampling; Jackknife MVN: CIs constructed by applying Approach 2 of jackknife resampling; Sandwich: CIs based on the sandwich variance estimator. Note that the results for applying Approaches 1 and 2 of LEF bootstrap were very similar so that the purple and green lines overlapped. CI: confidence interval; LEF: linearised estimating function.

Table 1.

Data tabulation of the HERS data prepared for a per-protocol analysis in STE.

Treatment	Outcome	Censoring	Trial 0	Trial 1	Trial 2	Trial 3	Trial 4
0	0	0	390	314	249	175	155
0	0	1	14	16	15	48	0
0	1	0	11	8	5	5	4
1	0	0	73	52	49	25	19
1	0	1	1	5	0	1	0
1	1	0	2	3	1	0	1
Total eligible in trial			491	398	319	254	179

Note: The numbers in a column represent the number of patients enrolled in an emulated sequential trial by their assigned treatment strategies, outcome and censoring status by the end of the emulated trial. Treatment: assigned treatment strategies; 0, never treated with HAART; 1, always treated with HAART. Outcome: indicator of all-cause mortality. Censoring: indicator of censoring due to loss to follow-up. HERS: HIV Epidemiology Research Study; HAART: Highly Active AntiRetroviral Treatment.

3. Estimation of the per-protocol effect in sequentially emulated trials

3.1. Setting and notation

Consider an observational study in which $n$ patients are followed up from time $t_{0}$ until the earliest of the event of interest, loss to follow-up, and the end of the study. For each patient, time-independent variables are measured at time $t_{0}$ , and time-varying variables are measured at regular times $t_{0} < t_{1} < \dots < t_{n_{v} - 1}$ during follow-up, where $t_{n_{v}}$ denotes the time of the end of the study ( $t_{n_{v} - 1} < t_{n_{v}}$ ) and $n_{v}$ is therefore the maximum number of study visits before $t_{n_{v}}$ . Data on each patient are assumed to be independent and identically distributed. Data from this study will be used to create data for a sequence of $n_{v}$ trials. The $m$ th sequential trial (i.e. trial $m$ , $m = 0, \dots, n_{v} - 1$ ) begins at time $t_{m}$ , includes patients who are eligible for enrolment at this time, and ends at time $t_{n_{v}}$ . Hence, within this trial, the time-varying variables are measured at times $t_{m}, \dots, t_{n_{v} - 1}$ . We shall refer to these $n_{v} - m$ measurement times as the trial visits for trial $m$ . For example, trial visit 0 in trial $m$ takes place at time $t_{m}$ and trial visit $n_{v} - m - 1$ takes place at time $t_{n_{v} - 1}$ . For each of the $n$ patients in the observational study, we define the following variables.

$E_{m}$ is an indicator of whether the patient is eligible ( $E_{m} = 1$ ) or not ( $E_{m} = 0$ ) for trial $m$ .

$Y_{m, k} = 1$ ( $k = 0, 1, \dots$ ) if the patient experiences the event of interest in time interval $[t_{m}, t_{m + k + 1})$ , and $Y_{m, k} = 0$ otherwise.

$V$ denotes the patient’s vector of time-independent covariates measured at time $t_{0}$ .

$A_{m, k}$ and $L_{m, k}$ denote, respectively, the patient’s treatment and time-varying covariates measured at time $t_{m + k}$ .

$C_{m, k}$ is an indicator that the patient is censored due to loss to follow-up in the interval $[t_{m + k}, t_{m + k + 1})$ . So, if $C_{m, k} = 1$ then $Y_{m, k + 1}, Y_{m, k + 2}, \dots, Y_{m + 1, k}, Y_{m + 1, k + 1}, \dots$ are not observed.

$Y_{m, k}$ , $A_{m, k}$ , $L_{m, k}$ and $C_{m, k}$ will serve as, respectively, the outcome indicator, the binary treatment indicator, the time-varying covariates and the censoring indicator at trial visit $k$ ( $k = 0, 1, \dots$ ) in trial $m$ ( $m = 0, \dots, n_{v} - 1$ ). Also, we shall use the overbar to denote variable history, for example, ${\bar{A}}_{m, k} = (A_{m, 0}, \dots, A_{m, k})$ denotes the patient’s history of treatment up to trial visit $k$ in trial $m$ , and define $\bar{0} = (0, \dots, 0)$ and $\bar{1} = (1, \dots, 1)$ . We assume the temporal ordering $(L_{m, k}, A_{m, k}, Y_{m, k}, C_{m, k})$ within $[t_{m + k}, t_{m + k + 1})$ , $\forall m, k$ .

For a patient enrolled in trial $m$ (i.e. with $E_{m} = 1$) and for ${\bar{a}}_{k}$ equal to either $\bar{0}$ or $\bar{1}$, we define the potential variable $Y_{m, k}^{{\bar{a}}_{k}}$ to be a binary indicator that the patient would have experienced the event of interest during the time interval $[t_{m}, t_{m + k + 1})$ if he/she/they had, possibly contrary to fact, received treatment $a \in \{0, 1\}$ since the baseline of trial $m$, that is, from time $t_{m}$ up to time $t_{m + k}$. Analogously, $L_{m, k}^{{\bar{a}}_{k}}$ denotes the potential time-varying covariates the patient would have if he/she/they had received this treatment since $t_{m}$ up to time $t_{m + k}$. Note that $Y_{m, k}^{{\bar{a}}_{k}}$ and $L_{m, k}^{{\bar{a}}_{k}}$ are not defined for patients ineligible for trial $m$, that is, patients with $E_{m} = 0$. We shall omit the explicit conditioning on $E_{m} = 1$ when describing the causal estimand in the next section.

3.2. Causal estimand and assumptions

We define the per-protocol effect in trial $m$ in terms of the MRD. The MRD at trial visit $k$ in trial $m$ is the difference between the marginal cumulative incidence at time $t_{m + k}$ if, possibly contrary to fact, all patients in the population eligible for trial $m$ were always treated from time $t_{m}$ up to time $t_{m + k}$ and the marginal cumulative incidence if, possibly contrary to fact, they were not treated at all from time $t_{m}$ up to time $t_{m + k}$ .²⁰ That is,

$$\mathrm{MRD}_{m}(k) = \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{1}} = 1) - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{0}} = 1) \tag{1}$$

Identification of (1) requires the following assumptions.

No Interference.

For $i \neq j$, patient $i$’s received treatment has no effect on patient $j$’s potential outcomes,¹⁵ that is, a patient’s event time is not affected by other patients’ treatments.

Consistency.

$\forall m, k, {\bar{Y}}_{m, k} = {\bar{Y}}_{m, k}^{{\bar{A}}_{m, k}}$ and ${\bar{L}}_{m, k} = {\bar{L}}_{m, k}^{{\bar{A}}_{m, k}}$ , meaning that the observed outcomes ${\bar{Y}}_{m, k}$ and covariates ${\bar{L}}_{m, k}$ are equal to their potential outcomes and covariates, under the treatment assignment which they actually received, for every trial visit $k$ in trial $m$ .^15,37

Positivity of treatment assignment, treatment adherence and censoring.

$$\begin{aligned} \forall m, &\quad \Pr\{A_{m, 0} = a \mid V, L_{m, 0}\} > 0; \\ \forall m, k\ (k > 0), &\quad \Pr\{A_{m, k} = a \mid {\bar{A}}_{m, k - 1} = {\bar{a}}_{k - 1}, V, {\bar{L}}_{m, k}, Y_{m, k - 1} = 0, C_{m, k - 1} = 0\} > 0; \\ \forall m, k, &\quad \Pr\{C_{m, k} = 0 \mid C_{m, k - 1} = 0, {\bar{A}}_{m, k} = {\bar{a}}_{k}, V, {\bar{L}}_{m, k}, Y_{m, k - 1} = 0\} > 0 \end{aligned}$$

for $a = 0, 1$; that is, a patient has non-zero probability of being assigned to either treatment at trial baseline, adhering to the treatment assigned and remaining in the study at all times conditional on the observed histories of treatment and covariates.³⁷

Sequentially ignorable treatment assignment.

$\forall m, k, Y_{m, k}^{{\bar{a}}_{k}} \perp\!\!\!\perp A_{m, k} \mid V, {\bar{L}}_{m, k}, {\bar{A}}_{m, k - 1} = {\bar{a}}_{k - 1}, Y_{m, k - 1} = 0, C_{m, k - 1} = 0$ for $a = 0, 1$, that is, at a given time, conditional on past treatment assignment and covariate history, there is no unmeasured confounding between the potential outcome and the current treatment received.¹⁵

Sequentially ignorable loss to follow up.

$\forall m, k, {\bar{C}}_{m, k} \perp\!\!\!\perp Y_{m, k + 1}^{{\bar{a}}_{k}}, Y_{m, k + 2}^{{\bar{a}}_{k}}, \dots \mid {\bar{C}}_{m, k - 1} = 0, Y_{m, k} = 0, {\bar{A}}_{m, k} = {\bar{a}}_{k}, V, {\bar{L}}_{m, k}$, for $a = 0, 1$.¹⁵ In other words, at a given time, a patient’s probability of being under follow-up does not depend on their future risk of event, conditional on the treatment and covariate history up to that time.

Equation (1) can be written equivalently as

$$\begin{aligned} \mathrm{MRD}_{m}(k) & = \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{0}} = 0) - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{1}} = 0) \\ & = E_{V, L_{m, 0}} \{\Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{0}} = 0 \mid V, L_{m, 0}) - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{1}} = 0 \mid V, L_{m, 0})\}. \end{aligned} \tag{2}$$
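For readers who want the intermediate step, the first line of (2) follows from (1) by the complement rule alone, because each potential outcome indicator is binary. Written out (a worked restatement, not an additional assumption):

$$\begin{aligned} \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{1}} = 1) - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{0}} = 1) & = \{1 - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{1}} = 0)\} - \{1 - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{0}} = 0)\} \\ & = \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{0}} = 0) - \Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{1}} = 0). \end{aligned}$$

The second line of (2) then follows from the law of total expectation over the baseline covariates $(V, L_{m, 0})$.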

The counterfactual survival probabilities in (2) can be written in terms of counterfactual discrete-time hazards as follows:

$$\Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{a}} = 0) = E_{V, L_{m, 0}} \left[\{1 - \Pr(Y_{m, 0}^{a_{0} = a} = 1 \mid V, L_{m, 0})\} \prod_{j = 1}^{k} \{1 - \Pr(Y_{m, j}^{{\bar{a}}_{j} = \bar{a}} = 1 \mid Y_{m, j - 1}^{{\bar{a}}_{j - 1} = \bar{a}} = 0, V, L_{m, 0})\}\right], \quad a \in \{0, 1\}.$$
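The order of operations in this expression matters: the product over visits is taken within each covariate profile and only then averaged over profiles. A small Python sketch, using made-up hazards for two equally likely covariate profiles, suggests how much the answer can change if the hazards are averaged first. All names and numbers here are illustrative assumptions.

```python
def survival_from_hazards(hazards):
    """Probability of surviving all visits, given discrete-time hazards."""
    surv = 1.0
    for h in hazards:
        surv *= 1.0 - h
    return surv


# Hypothetical conditional hazards at trial visits 0 and 1 for two
# equally likely baseline covariate profiles.
profile_hazards = {"low risk": [0.01, 0.01], "high risk": [0.20, 0.20]}
n_profiles = len(profile_hazards)

# Correct order: product within profile, then average over profiles.
marginal_survival = sum(
    survival_from_hazards(h) for h in profile_hazards.values()
) / n_profiles

# Incorrect shortcut: average the hazards first, then take the product.
mean_hazards = [
    sum(h[j] for h in profile_hazards.values()) / n_profiles for j in range(2)
]
naive_survival = survival_from_hazards(mean_hazards)

print(round(marginal_survival, 6), round(naive_survival, 6))
```

In this toy case the two quantities differ by roughly 0.009, which is not negligible when the risk differences of interest are themselves small.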

3.3. MSM with IPW

We assume the following MSM in the form of a pooled logistic model with regression parameters $β$ :

$$\operatorname{logit}\{\Pr(Y_{m, k}^{{\bar{a}}_{k} = \bar{a}} = 1 \mid Y_{m, k - 1}^{{\bar{a}}_{k - 1} = \bar{a}} = 0, V, L_{m, 0})\} = β_{0}(m) + β_{1}(k) + β_{2} \cdot a + β_{3}^{T} V + β_{4}^{T} L_{m, 0}, \quad a \in \{0, 1\}, \tag{3}$$

where $β_{0}(m)$ is a trial-specific intercept and $β_{1}(k)$ is the baseline hazard at trial visit $k$. This MSM could be fitted separately in each emulated trial. However, a combined analysis can be more efficient and may be necessary when the number of treated patients in some trials is small. This involves making modelling assumptions about how the MSM parameters vary across trials. For example, Danaei et al. (2013) allowed a trial-specific intercept term $β_{0}(m)$ but assumed that the coefficient for treatment $β_{2}$ was the same in all trials,³¹ thus allowing for borrowing strength across trials for treatment effect estimation. Parametric forms or splines can be used for $β_{0}(m)$ and $β_{1}(k)$. In (3) it is also assumed that the baseline hazard and the coefficients of $V$ and $L_{m, 0}$ do not vary across trials. Interactions between trials, trial follow-up visits, treatment and baseline covariates can be included as well. Keogh et al. (2023) discussed the MSM specification in STE and recommended that formal tests be performed for the inclusion of any covariate interaction.²⁰

Given the baseline covariates $V$ and $L_{m, 0}$ , we assume no unmeasured confounding of baseline treatment assignment. Thus if all patients adhere to the treatment assigned and no censoring occurs, the MSM parameters can be estimated by fitting the pooled logistic model in (3) to the observed data in the emulated trials. In practice, not all eligible patients for trial $m$ would adhere to their assigned treatments. We artificially censor patients’ follow-up in trial $m$ at the time at which they cease to adhere to the treatment $A_{m, 0}$ received at the baseline of trial $m$ .^14,21 We use IPW as mentioned in Section 1 to account for the artificial censoring and censoring due to loss-to-follow-up.

To address artificial censoring due to treatment switching and censoring due to loss to follow-up, we calculate each patient’s stabilised inverse probability of treatment weight $s w_{m, k}^{A}$ and stabilised inverse probability of censoring weight (IPCW) $s w_{m, k}^{C}$ at trial visit $k$ in trial $m$. The formulae for these weights are provided in Section 2 of the Supplemental Materials. Each patient’s stabilised inverse probability of treatment and censoring weight (SIPTCW)^7,15,16,31 at trial visit $k$ in trial $m$ is therefore $s w_{m, k}^{A C} = s w_{m, k}^{A} \times s w_{m, k}^{C}$.

We follow the method of Danaei et al.,³¹ who fitted logistic models to the treatment and censoring data from the original observational study to estimate the conditional probabilities used for calculating SIPTCWs. They used observed treatment and censoring data of each patient from the visits that correspond to the baselines of the eligible trials until the trial visits where the patient stopped adhering to the assigned treatments or the last trial visits. If patients were eligible for multiple trials, duplicates of the treatment and censoring data within patients were discarded.

We pool the observed data from the $n_{v}$ trials and fit the MSM in (3) to the pooled, artificially censored and weighted data to obtain a point estimate $\hat{β}$ of the MSM parameters $β$ in (3).

3.4. Estimating the causal estimand

The MRD at trial visit $k$ in trial $m$ can be estimated using the parameter estimates $\hat{β}$ from the MSM by the empirical standardisation formula:

$$\begin{aligned} \widehat{\mathrm{MRD}}_{m}(k) = {} & \frac{1}{n_{m}} \sum_{i = 1}^{n} E_{m, i} \prod_{j = 0}^{k} \left[1 - \operatorname{logit}^{-1}\{μ(j, m, a = 0, V_{i}, L_{m, 0, i}; \hat{β})\}\right] \\ & - \frac{1}{n_{m}} \sum_{i = 1}^{n} E_{m, i} \prod_{j = 0}^{k} \left[1 - \operatorname{logit}^{-1}\{μ(j, m, a = 1, V_{i}, L_{m, 0, i}; \hat{β})\}\right], \end{aligned} \tag{4}$$

where $i$ indexes the patient in the original observational study, $n_{m} = \sum_{i = 1}^{n} E_{m, i}$ is the total number of patients enrolled in trial $m$, $\operatorname{logit}^{-1}(\cdot) = \exp(\cdot) / \{1 + \exp(\cdot)\}$ and $μ(j, m, a, V, L_{m, 0}; \hat{β}) = {\hat{β}}_{0}(m) + {\hat{β}}_{1}(j) + {\hat{β}}_{2} \cdot a + {\hat{β}}_{3}^{T} V + {\hat{β}}_{4}^{T} L_{m, 0}$ for $j = 0, \dots, k$. $\widehat{\mathrm{MRD}}_{m}(k)$ is an estimate of the MRD at trial visit $k$ in the population of patients eligible for trial $m$. Alternatively, we could standardise to a population characterised by a different distribution of $(V, L_{m, 0})$.

To summarise, the MRD in (2) is estimated by the following steps:

1) Estimate the SIPTCWs using data from the observational study.

2) Expand the observational data by assigning patients to eligible sequential trials and artificially censor patients’ trial follow-up when they were no longer adhering to the treatment assigned at the baseline of each sequential trial.

3) Estimate the MSM for the sequential trials using the expanded, artificially censored data with the estimated weights in Step 1.

4) Create two datasets containing patients from a target population with their baseline covariates data, setting their treatment assignment to either treatment arm, and calculate the estimated counterfactual survival probabilities of each patient in each of these two datasets.

5) Estimate the MRD by averaging the survival probabilities in each of the two datasets and taking the difference between these two averages, as is done in equation (4).
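The last two steps, that is, predicting counterfactual survival under each arm and differencing the averages as in equation (4), can be sketched in Python as follows. This is a minimal illustration, not the TrialEmulation implementation: the function names, the scalar covariates and every coefficient value are assumptions made up for the example.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_predictor(j, m, a, v, l0, beta):
    """mu(j, m, a, V, L_{m,0}; beta) from model (3), scalar V and L_{m,0}."""
    return (beta["trial"][m] + beta["visit"][j] + beta["treat"] * a
            + beta["v"] * v + beta["l0"] * l0)


def estimate_mrd(k, m, patients, beta):
    """Equation (4): mean survival under a=0 minus mean survival under a=1.

    patients: list of (V, L_{m,0}) pairs for patients eligible for trial m.
    """
    def mean_survival(a):
        total = 0.0
        for v, l0 in patients:
            surv = 1.0
            for j in range(k + 1):
                surv *= 1.0 - expit(linear_predictor(j, m, a, v, l0, beta))
            total += surv
        return total / len(patients)

    return mean_survival(0) - mean_survival(1)


# Hypothetical fitted coefficients and baseline covariates.
beta_hat = {"trial": [-3.0], "visit": [0.0, 0.1, 0.2],
            "treat": -0.5, "v": 0.3, "l0": 0.4}
eligible = [(0, 0.0), (1, 0.5), (0, -0.2), (1, 1.0)]

mrd_by_visit = [estimate_mrd(k, 0, eligible, beta_hat) for k in range(3)]
print(mrd_by_visit)
```

With a negative treatment coefficient the estimated MRD is negative at every visit, since treatment lowers the cumulative incidence. Passing a different list of covariate pairs standardises to a different target population.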

In this article, we use the R package TrialEmulation³⁸ (see Section 3 of Supplemental Materials) to implement these steps.

4. Constructing CIs of the MRD

In this section, we describe the methods for constructing CIs of the MRD based on the simple sandwich variance estimator, nonparametric bootstrap, LEF bootstrap and jackknife resampling.

4.1. Sandwich variance estimator

The simple sandwich variance estimator accounts for the inverse probability weights and the correlation induced by patients being eligible for multiple trials.^9,31 Specifically, for the pooled logistic regression model in (3), the simple sandwich variance estimator of $\hat{β}$ is

$$\hat{Σ} = \left\{\sum_{i = 1}^{n} \left.\frac{\partial U_{i}(β)}{\partial β^{T}}\right|_{β = \hat{β}}\right\}^{-1} \left\{\sum_{i = 1}^{n} U_{i}(\hat{β}) U_{i}(\hat{β})^{T}\right\} \left\{\sum_{i = 1}^{n} \left.\frac{\partial U_{i}(β)}{\partial β^{T}}\right|_{β = \hat{β}}\right\}^{-1},$$

where $U_{i}(\hat{β})$ is the weighted score function of the pooled logistic model evaluated at $\hat{β}$ for patient $i$.

We follow the parametric bootstrap algorithm of Mandel (2013) to construct simulation-based CIs of the MRD as follows:³⁹

1)
Obtain the parameter estimate $\hat{β}$ and the sandwich variance estimate $\hat{Σ}$ of the MSM in (3).
2)
Draw an i.i.d. sample $β^{(1)}, \dots, β^{(S)}$ of size $S$ (say $S = 500$ ) from the multivariate normal (MVN) distribution with mean $\hat{β}$ and variance $\hat{Σ}$ .
3)
For each vector $β^{(s)}$ ( $s = 1, \dots, S$ ), estimate the MRD at each trial visit by setting $β^{(s)}$ as the MSM parameters.
4)
Use the $2.5$ th and $97.5$ th percentiles of these $S$ MRD estimates at each trial visit as the lower and upper bounds of the 95% CI.

This is currently the only CI method implemented in the TrialEmulation package.
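
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used by TrialEmulation: the function names are hypothetical, `functional` stands for whatever maps MSM parameters to the MRD at a given visit, and the toy one-visit MSM and the numerical values of $\hat{β}$ and $\hat{Σ}$ in the usage note are made up for demonstration.

```python
import math
import random
import statistics


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_percentile_ci(beta_hat, sigma_hat, functional, n_draws=500, seed=1):
    """Percentile CI of functional(beta) with beta ~ MVN(beta_hat, sigma_hat)."""
    rng = random.Random(seed)
    low = cholesky(sigma_hat)
    d = len(beta_hat)
    values = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # beta_s = beta_hat + L z has covariance L L^T = sigma_hat
        beta_s = [beta_hat[i] + sum(low[i][j] * z[j] for j in range(i + 1))
                  for i in range(d)]
        values.append(functional(beta_s))
    cuts = statistics.quantiles(values, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_mrd(beta):
    """Hypothetical one-visit MSM: hazard = expit(b0 + b1 * a)."""
    return expit(beta[0] + beta[1]) - expit(beta[0])
```

A call such as `mvn_percentile_ci([-3.0, -0.5], [[0.04, -0.02], [-0.02, 0.09]], toy_mrd)` returns the lower and upper bounds of the interval. If $\hat{Σ}$ is not positive definite, the Cholesky step fails, which corresponds to the kind of CI construction failure recorded in the simulation study.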
4.2. Nonparametric bootstrap with the pivot method

We use the non-Studentized pivot method⁴⁰ to construct CIs based on nonparametric bootstrap. Specifically, we follow the steps described below:

1)
Draw $B$ bootstrap samples from the observational data, treating the $n$ patients as the resampling units.
2)
For each bootstrap sample $b$ ( $b = 1, \dots, B$ ), obtain the bootstrap parameter estimate ${\hat{β}}^{(b)}$ and estimate the MRD at trial visit $k$ in trial $m$ , ${MRD}_{m} (k)$ , using the method in Section 3.
3)
Define the lower and upper bounds of the 95% CI for the MRD at each trial visit $k$ in trial $m$ as, respectively, $2 {\hat{MRD}}_{m} (k) - {\hat{MRD}}_{m} (k)_{(0.975)}$ and $2 {\hat{MRD}}_{m} (k) - {\hat{MRD}}_{m} (k)_{(0.025)}$ , where ${\hat{MRD}}_{m} (k)$ is the point estimate of the MRD at trial visit $k$ in trial $m$ estimated from the original dataset, and ${\hat{MRD}}_{m} (k)_{(0.025)}$ , ${\hat{MRD}}_{m} (k)_{(0.975)}$ are the $2.5$ th and $97.5$ th percentiles of the $B$ bootstrap MRD estimates at trial visit $k$ in trial $m$ , respectively.
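
As a minimal sketch of Step 3, the pivot interval reflects the bootstrap percentiles around the point estimate. The function name below is hypothetical, and the percentile definition used by `statistics.quantiles` may differ slightly from the one used in the simulations.

```python
import statistics


def pivot_ci(point_estimate, bootstrap_estimates):
    """Non-Studentized pivot 95% CI: (2*est - q_0.975, 2*est - q_0.025)."""
    cuts = statistics.quantiles(bootstrap_estimates, n=40)
    q025, q975 = cuts[0], cuts[-1]
    return 2.0 * point_estimate - q975, 2.0 * point_estimate - q025
```

Note that the upper percentile determines the lower bound and vice versa. When the bootstrap distribution is symmetric about the point estimate, the pivot and percentile intervals coincide; when the bootstrap estimates sit mostly above the point estimate, the pivot interval shifts downwards.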

4.3. LEF bootstrap

The main advantages of LEF bootstrap over nonparametric bootstrap are reduced computational time and fewer non-convergence issues.^32,34 In terms of computational time, unlike the nonparametric bootstrap, which involves fitting a regression model to each bootstrap sample, LEF bootstrap requires fitting this model only once to the original dataset. This can reduce computational time considerably, especially when iterative procedures are used to fit the model. In terms of non-convergence issues, Binder et al. (2004) found that when using nonparametric bootstrap for logistic regression it was possible to have several bootstrap samples for which the parameter estimation algorithm would not converge, due to ill-conditioned matrices that were not invertible.³⁴ For both reasons, LEF bootstrap may have advantages in our STE setting, where logistic regression models are used both for estimating the SIPTCWs and for fitting the MSM. We now explain how LEF bootstrap works in the general situation where the goal is to construct a CI for some function of a parameter vector $θ$ . In Section 4.3.1, we shall describe how to apply this general method to the specific setting of STE.

Let $U (θ)$ denote the estimating function for $θ$ (note that $U$ here is different from $U_{i}$ in Section 4.1). Let $U^{org} (θ)$ denote the sum of $U (θ)$ over the $n$ patients in the original dataset. Then $U^{org} (θ) = 0$ is the estimating equation for $θ$ based on the original dataset. Let $\hat{θ}$ denote the estimate of $θ$ obtained by solving this equation. Now suppose that $B$ bootstrap samples have been generated by resampling with replacement from this original dataset. Let $U^{(b)} (θ) = 0$ denote the estimating equation for $θ$ based on the $b$ th bootstrap sample, and let ${\hat{θ}}^{(b)}$ denote the corresponding estimate of $θ$ ( $b = 1, \dots, B$ ). If we apply Taylor linearisation to the function $U^{(b)} (θ)$ around $\hat{θ}$ , we obtain

U^{(b)} (θ) \approx U^{(b)} (\hat{θ}) + \left\{ \frac{\partial U^{(b)} (θ)}{\partial θ} \right\}_{θ = \hat{θ}} (θ - \hat{θ}) .

From this and the fact that $U^{(b)} ({\hat{θ}}^{(b)}) = 0$ by definition, we obtain

0 \approx U^{(b)} (\hat{θ}) + \left\{ \frac{\partial U^{(b)} (θ)}{\partial θ} \right\}_{θ = \hat{θ}} ({\hat{θ}}^{(b)} - \hat{θ}) .

Rearranging the terms, we get

{\hat{θ}}^{(b)} \approx \hat{θ} - \left\{ \frac{\partial U^{(b)} (θ)}{\partial θ} \right\}_{θ = \hat{θ}}^{- 1} U^{(b)} (\hat{θ}) .

Replacing the matrix $\{ \partial U^{(b)} (θ) / \partial θ \}_{θ = \hat{θ}}^{- 1}$ with $\{ \partial U^{org} (θ) / \partial θ \}_{θ = \hat{θ}}^{- 1}$ , we obtain the following approximation of ${\hat{θ}}^{(b)}$ :

{\hat{θ}}_{LEF}^{(b)} \approx \hat{θ} - \left\{ \frac{\partial U^{org} (θ)}{\partial θ} \right\}_{θ = \hat{θ}}^{- 1} U^{(b)} (\hat{θ}) = \hat{θ} + vcov (\hat{θ}) U^{(b)} (\hat{θ}),

(5)

where $vcov (\hat{θ})$ is the model-based variance matrix based on the original dataset.
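
To make equation (5) concrete, consider the simplest possible case: an unweighted, intercept-only logistic model, so that $θ$ is the log-odds of an event. The score is then $U (θ) = \sum_{i} \{ y_{i} - {logit}^{- 1} (θ) \}$ , its derivative at $\hat{θ}$ is $- n \hat{p} (1 - \hat{p})$ , and $vcov (\hat{θ}) = 1 / \{ n \hat{p} (1 - \hat{p}) \}$ . The Python sketch below applies the one-step update under these assumptions; the function names are hypothetical and the example is far simpler than the weighted pooled logistic MSM used in STE.

```python
import math
import random


def lef_step(theta_hat, vcov, score_b):
    """One-step LEF update of equation (5): theta_hat + vcov * U_b(theta_hat)."""
    return theta_hat + vcov * score_b


def lef_bootstrap_intercept(y, n_boot=500, seed=1):
    """LEF bootstrap estimates of the log-odds for a 0/1 outcome vector y."""
    n = len(y)
    p_hat = sum(y) / n                         # solves the original estimating equation
    theta_hat = math.log(p_hat / (1.0 - p_hat))
    vcov = 1.0 / (n * p_hat * (1.0 - p_hat))   # model fitted once, on the original data
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [y[rng.randrange(n)] for _ in range(n)]
        # Score of the bootstrap sample evaluated at the ORIGINAL estimate;
        # no model is refitted to the bootstrap sample.
        score_b = sum(yi - p_hat for yi in sample)
        estimates.append(lef_step(theta_hat, vcov, score_b))
    return theta_hat, estimates
```

A bootstrap sample containing more events than expected has a positive score and is therefore assigned a log-odds above $\hat{θ}$ , as it should be. In this scalar case the update is the first-order Taylor expansion of the logit of the bootstrap event proportion around $\hat{p}$ .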

We propose the following two ways of using LEF bootstrap to construct CIs for the MRD.

4.3.1. Approach 1: LEF bootstrap for the MSM parameters

This approach applies Taylor linearisation only to the estimating function of the pooled logistic regression for the MSM in (3), with the SIPTCWs first being estimated from each bootstrap sample by fitting the corresponding logistic models to that bootstrap sample, as is done in nonparametric bootstrap. We shall use $w_{m, k, i}$ to denote the estimated SIPTCW for patient $i$ ( $i = 1, \dots, n$ ) in the original dataset at trial visit $k$ in trial $m$ (if patient $i$ is not enrolled in trial $m$ , that is $E_{m, i} = 0$ , then $w_{m, k, i} = 0$ ). Analogously, $w_{m, k, i}^{(b)}$ will denote the estimated SIPTCW for patient $i$ at trial visit $k$ in trial $m$ in the $b$ th bootstrap sample. The procedure is as follows.

1)
Estimate $w_{m, k, i}$ using the original dataset. Using $w_{m, k, i}$ , fit the weighted pooled logistic model to the original dataset and obtain the point estimate $\hat{β}$ of the MSM parameters.
2)
Create $B$ bootstrap samples using the $n$ patients as resampling units. Using the $b$ th bootstrap sample, estimate $w_{m, k, i}^{(b)}$ for the patients in the $b$ th bootstrap sample ( $b = 1, \dots, B$ ).
3)
Calculate the approximate bootstrap parameter estimate ${\hat{β}}_{LEF}^{(b)}$ from the $b$ th bootstrap sample using the weights $w_{m, k, i}^{(b)}$ according to the formula in equation (5).

For our case with a pooled logistic regression for the MSM in (3), the derivative of the weighted score is minus the weighted information matrix, so this formula can be written as
$\begin{aligned} {\hat{β}}_{LEF}^{(b)} = \hat{β} + & \left( \sum_{m = 0}^{n_{v} - 1} \sum_{i = 1}^{n} E_{m, i} \sum_{k = 0}^{q_{m, i} - 1} w_{m, k, i} \, {logit}^{- 1} \{ μ_{m, k, i} (\hat{β}) \} [1 - {logit}^{- 1} \{ μ_{m, k, i} (\hat{β}) \}] X_{m, k, i} X_{m, k, i}^{T} \right)^{- 1} \\ & \times \sum_{m = 0}^{n_{v} - 1} \sum_{i = 1}^{n} E_{m, i}^{(b)} \sum_{k = 0}^{q_{m, i} - 1} w_{m, k, i}^{(b)} X_{m, k, i}^{(b)} [Y_{m, k, i}^{(b)} - {logit}^{- 1} \{ μ_{m, k, i}^{(b)} (\hat{β}) \}], \end{aligned}$
(6)
where $E_{m, i}$ and $E_{m, i}^{(b)}$ are the eligibility indicators of patient $i$ for trial $m$ in the original dataset and in the $b$ th bootstrap sample, respectively; $q_{m, i}$ is the total number of trial visits made by patient $i$ eligible in trial $m$ before being artificially censored, being lost to follow-up, experiencing the outcome event or reaching the end of trial $m$ ; ${logit}^{- 1} \{ μ_{m, k, i} (\hat{β}) \}$ and ${logit}^{- 1} \{ μ_{m, k, i}^{(b)} (\hat{β}) \}$ are the estimated discrete-time hazards at trial visit $k$ for patient $i$ in trial $m$ evaluated at $\hat{β}$ using the original dataset and the $b$ th bootstrap sample, respectively; $X_{m, k, i}$ and $X_{m, k, i}^{(b)}$ are the design vectors for the discrete-time hazard MSM at trial visit $k$ for patient $i$ in trial $m$ in the original dataset and in the $b$ th bootstrap sample, respectively; $Y_{m, k, i}^{(b)}$ is the observed outcome indicator at trial visit $k$ for patient $i$ in trial $m$ in the $b$ th bootstrap sample.
4)
For the $b$ th bootstrap sample, estimate the MRD at each trial visit by setting ${\hat{β}}_{LEF}^{(b)}$ as the MSM parameters.
5)
Construct the pivot CI using the $2.5$ th and $97.5$ th percentiles of the $B$ LEF bootstrap estimates of the MRD at each trial visit, as in Section 4.2.

4.3.2. Approach 2: LEF bootstrap for the model parameters for SIPTCWs and the MSM parameters

In the second approach, the Taylor linearisation is applied to the estimating functions of both the models for estimating SIPTCWs and the pooled logistic regression for the MSM (3). Approach 2 should be even more computationally efficient than Approach 1, because it avoids fitting the models for the SIPTCWs to each bootstrap sample. Moreover, Approach 2 could be useful when there are many covariates (relative to the sample size) in the models for the SIPTCWs, in which case ill-conditioned matrices could arise when fitting these models to some of the bootstrap samples.

Because multiple models have to be fitted to obtain the SIPTCWs, we explain the steps for implementing the LEF bootstrap using the model for the denominator term of the stabilised inverse probability of treatment weights (IPTWs, see equation (1) of the Supplemental Materials) given ${\bar{A}}_{m, k - 1} = \bar{1}$ as an illustration. Implementing the LEF bootstrap is analogous for all other models involved in equations (1) and (2) of the Supplemental Materials.

Let $p_{m, k} = Pr (A_{m, k} = 1 ∣ {\bar{A}}_{m, k - 1} = \bar{1}, V, L_{m, k}, E_{m} = 1, Y_{m, k - 1} = 0, C_{m, k - 1} = 0)$ , the conditional probability that the patient remains treated at trial visit $k$ in trial $m$ given they received treatment up to trial visit $k - 1$ , conditional on their observed variables up to trial visit $k$ . Suppose that a logistic regression model is assumed for $p_{m, k}$ ,

logit (p_{m, k}) = Z_{m, k}^{T} γ,

(7)

where $Z_{m, k}$ is the design vector and $γ$ is the regression parameter vector. Note that $Z_{m, k}$ only contains rows for patients who were always treated in trial $m$ up until the previous trial visit $k - 1$ , that is, ${\bar{A}}_{m, k - 1} = \bar{1}$ . The procedure is as follows.

1)
Fit the weighted pooled logistic model to the original dataset and obtain the point estimate $\hat{β}$ of the MSM parameters, using the estimated weights $w_{m, k, i}$ based on the original dataset.
2)
Create $B$ bootstrap samples with patients as resampling units, and let $s_{i}^{(b)}$ denote the number of times that patient $i$ is sampled in the $b$ th bootstrap sample ( $b = 1, \dots, B$ ). Note that $s_{i}^{(b)} = 0$ if patient $i$ is not sampled in the $b$ th bootstrap sample.
3)
Calculate the LEF bootstrap parameter estimates of the models for estimating SIPTCWs for each bootstrap sample according to (5). For example, the LEF bootstrap estimates ${\hat{γ}}^{(b)}$ of $γ$ in (7) can be obtained by

\begin{aligned} {\hat{γ}}^{(b)} = \hat{γ} + & \left[ \sum_{m = 0}^{n_{v} - 1} \sum_{i = 1}^{n} E_{m, i} \sum_{k = 0}^{q_{m, i} - 1} 1_{\{ {\bar{A}}_{m, k - 1, i} = \bar{1} \}} \, p_{m, k, i} (\hat{γ}) \{ 1 - p_{m, k, i} (\hat{γ}) \} Z_{m, k, i} Z_{m, k, i}^{T} \right]^{- 1} \\ & \times \sum_{m = 0}^{n_{v} - 1} \sum_{i = 1}^{n} s_{i}^{(b)} E_{m, i} \sum_{k = 0}^{q_{m, i} - 1} 1_{\{ {\bar{A}}_{m, k - 1, i} = \bar{1} \}} Z_{m, k, i} \{ A_{m, k, i} - p_{m, k, i} (\hat{γ}) \}, \end{aligned}

(8)

where $1_{\{ {\bar{A}}_{m, k - 1, i} = \bar{1} \}}$ is the indicator for whether patient $i$ has always been treated up to trial visit $k - 1$ in trial $m$ ; $p_{m, k, i} (\hat{γ})$ is the estimated probability of treatment adherence given previous treatment at trial visit $k$ for patient $i$ in trial $m$ evaluated at $\hat{γ}$ ; $Z_{m, k, i}$ is the corresponding design vector for the logistic regression model (7); and $A_{m, k, i}$ is the observed treatment indicator at trial visit $k$ for patient $i$ in trial $m$ .
4)
Based on the LEF bootstrap parameter estimates for the models for estimating SIPTCWs, calculate a new set of weights ${\tilde{w}}_{m, k, i}^{(b)}$ for the $b$ th bootstrap sample.
5)
Construct the pivot CI using Steps (3)–(5) in Approach 1 in Section 4.3.1, replacing $w_{m, k, i}^{(b)}$ with ${\tilde{w}}_{m, k, i}^{(b)}$ in (6).

4.4. Jackknife

We use jackknife resampling to construct two types of jackknife-based CIs: (1) using a jackknife estimate of the MRD standard error,⁴¹ and (2) using the jackknife variance estimator^42,43 of $\hat{β}$ . The resampling units are the patients.

4.4.1. Approach 1: Wald-type CI using the jackknife estimate of the MRD standard error

We use the jackknife estimator of the standard error of the MRD estimate and a Normal approximation to obtain CIs as follows:

1)
Obtain the parameter estimate $\hat{β}$ of the MSM in (3) and the MRD estimate ${\hat{MRD}}_{m} (k)$ at each trial visit $k$ in trial $m$ using the method in Section 3.
2)
For $i = 1, \dots, n$ , obtain parameter estimates ${\hat{β}}^{(- i)}$ and ${\hat{MRD}}_{m}^{(- i)} (k)$ using the method in Section 3 after leaving out data from patient $i$ .
3)
Obtain the jackknife standard error estimate ${\hat{SE}}_{m}^{J} (k)$ of the MRD at trial visit $k$ in trial $m$ :
${\hat{SE}}_{m}^{J} (k) = {[\frac{n - 1}{n} \sum_{i = 1}^{n} {({\hat{MRD}}_{m}^{(- i)} (k) - {\hat{MRD}}_{m} (k))}^{2}]}^{1 / 2}$
(9)
4)
Define the lower and upper bounds of the 95% CI for the MRD at trial visit $k$ in trial $m$ as, respectively, ${\hat{MRD}}_{m} (k) - 1.96 \cdot {\hat{SE}}_{m}^{J} (k)$ and ${\hat{MRD}}_{m} (k) + 1.96 \cdot {\hat{SE}}_{m}^{J} (k) .$
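
The following Python sketch illustrates equation (9) and the Wald-type interval for a generic scalar estimator. It is an illustration only: `estimator` is a hypothetical stand-in for the full procedure of Section 3 (weight estimation, MSM fitting and MRD prediction), and `data` is a list with one element per patient.

```python
import math
import statistics


def jackknife_se(data, estimator):
    """Jackknife standard error as in equation (9), leaving out one unit at a time."""
    n = len(data)
    full = estimator(data)
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    return math.sqrt((n - 1) / n * sum((e - full) ** 2 for e in leave_one_out))


def jackknife_wald_ci(data, estimator, z=1.96):
    """Wald-type 95% CI: estimate -/+ 1.96 * jackknife standard error."""
    est = estimator(data)
    se = jackknife_se(data, estimator)
    return est - z * se, est + z * se
```

When `estimator` is the sample mean (for example `statistics.fmean`), this jackknife standard error reduces exactly to the usual $s / \sqrt{n}$ , which provides a simple check of the formula. The procedure requires $n$ refits of the estimator, which is why it becomes expensive for large $n$ .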

4.4.2. Approach 2: MVN sampling using the jackknife variance estimator of the MSM parameters

We use the jackknife variance estimator^42,43 of $\hat{β}$ to construct simulation-based CIs of the MRD in a manner similar to the sandwich-variance-estimator-based CIs as follows:

1)
Obtain the parameter estimate $\hat{β}$ of the MSM in (3).
2)
For $i = 1, \dots, n$ , obtain parameter estimates ${\hat{β}}^{(- i)}$ using the method in Section 3 after leaving out data from patient $i$ .
3)
Obtain the jackknife variance estimate $\hat{V_{J}}$ of the MSM parameters:
$\hat{V_{J}} = \frac{1}{n (n - 1)} \sum_{i = 1}^{n} ({\tilde{β}}_{i} - \bar{β}) ({\tilde{β}}_{i} - \bar{β})^{T}$
(10)
where ${\tilde{β}}_{i} = n \hat{β} - (n - 1) {\hat{β}}^{(- i)}$ for $i = 1, \dots, n$ and $\bar{β} = n^{- 1} \sum_{i = 1}^{n} {\tilde{β}}_{i}$ .
4)
Construct a 95% CI by repeating Steps (2)–(4) of Section 4.1 and replacing the sandwich variance matrix estimate $\hat{Σ}$ with $\hat{V_{J}}$ .
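
Equation (10) can be sketched in Python for a generic vector-valued estimator as follows. The function name is hypothetical, and `estimator` stands in for the procedure that returns the MSM parameter estimates from a list of patients.

```python
def jackknife_covariance(data, estimator):
    """Jackknife variance matrix of equation (10), built from pseudo-values."""
    n = len(data)
    full = estimator(data)
    d = len(full)
    pseudo = []
    for i in range(n):
        loo = estimator(data[:i] + data[i + 1:])
        # Pseudo-value: n * beta_hat - (n - 1) * beta_hat^(-i)
        pseudo.append([n * full[j] - (n - 1) * loo[j] for j in range(d)])
    centre = [sum(p[j] for p in pseudo) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for p in pseudo:
        for r in range(d):
            for c in range(d):
                cov[r][c] += (p[r] - centre[r]) * (p[c] - centre[c])
    scale = 1.0 / (n * (n - 1))
    return [[scale * v for v in row] for row in cov]
```

For an estimator that returns column means, the pseudo-values are the observations themselves, so the result equals the sample covariance matrix divided by $n$ . The resulting matrix then takes the place of $\hat{Σ}$ in the MVN sampling of Section 4.1.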

5. Simulation study

We conducted an extensive simulation study to compare the performance of CIs obtained using the sandwich variance estimator, nonparametric bootstrap, the proposed LEF bootstrap and jackknife methods. For simplicity, we assumed there was no loss to follow-up.

5.1. Study setup

5.1.1. Data generating mechanism

We used the algorithm described in Young and Tchetgen Tchetgen³⁷ to simulate data. This algorithm ensures that previous treatments affect time-varying variables (confounders) that are associated with both current treatment and hazard of the outcome event. The data generating mechanism is described in Table 2.

Table 2.
Summary of data generating mechanism of the simulation study.

Data simulation setting specifications	$n$ : number of patients; $n_{v} = 5$ : number of visits; $t_{j} = 0, \dots, n_{v} - 1$ : visit time for visit $j$ ; $α_{a}$ : intercept in the treatment model, representing the baseline rate of treatment initiation; $α_{c}$ : coefficient that describes the strength of confounding due to time-varying variable $X_{1, t_{j}}$ ; $α_{y}$ : intercept term in the discrete-time hazard model, representing the baseline hazard
Time-varying confounder	$X_{1, t_{j}} \sim N (Z_{t_{j}} - 0.3 A_{t_{j - 1}}, 1)$ , where $A_{t_{- 1}} \equiv 0$ and $Z_{t_{j}} \sim N (0, 1)$
Time-invariant confounder	$X_{2} \sim N (0, 1)$
Treatment	$logit \{ Pr (A_{t_{j}} = 1 ∣ A_{t_{j - 1}}, X_{1, t_{j}}, X_{2}, Y_{t_{j - 1}} = 0) \} = α_{a} + 0.05 A_{t_{j - 1}} + α_{c} X_{1, t_{j}} + 0.2 X_{2}$ , where $Y_{t_{- 1}} \equiv 0$
Discrete-time hazard of the outcome event	$logit \{ Pr (Y_{t_{j}} = 1 ∣ A_{t_{j}}, X_{1, t_{j}}, X_{2}, Y_{t_{j - 1}} = 0) \} = α_{y} - 0.5 A_{t_{j}} + α_{c} X_{1, t_{j}} + X_{2}$
Trial eligibility	$E_{t_{j}} = 1$ if the patient has not received treatment before $t_{j}$ and has not experienced the outcome event before $t_{j}$ ; $E_{t_{j}} = 0$ otherwise
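
The mechanism in Table 2 can be sketched for a single patient in Python as follows. This is a simplified illustration, not the simulation code used in the study: the function and field names are hypothetical, the second argument of the normal distributions is read as a variance of 1, and follow-up is assumed to stop at the first outcome event.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_patient(rng, alpha_a, alpha_c, alpha_y, n_visits=5):
    """Simulate one patient's visit records following Table 2."""
    x2 = rng.gauss(0.0, 1.0)              # time-invariant confounder
    a_prev = 0                            # A at t_{-1} is defined as 0
    ever_treated = False
    rows = []
    for j in range(n_visits):
        z = rng.gauss(0.0, 1.0)
        x1 = rng.gauss(z - 0.3 * a_prev, 1.0)   # time-varying confounder
        # Eligible if never treated before t_j; being event-free before t_j
        # holds automatically because follow-up stops at the first event.
        eligible = int(not ever_treated)
        p_treat = expit(alpha_a + 0.05 * a_prev + alpha_c * x1 + 0.2 * x2)
        a = int(rng.random() < p_treat)
        hazard = expit(alpha_y - 0.5 * a + alpha_c * x1 + x2)
        y = int(rng.random() < hazard)
        rows.append({"visit": j, "x1": x1, "x2": x2, "a": a,
                     "y": y, "eligible": eligible})
        if y == 1:
            break
        ever_treated = ever_treated or a == 1
        a_prev = a
    return rows
```

For example, `[simulate_patient(random.Random(i), 0.0, 0.5, -3.8) for i in range(1000)]` would generate a cohort under medium treatment prevalence, moderate confounding and a medium event rate.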

5.1.2. Monte Carlo simulation settings

We considered three settings with different outcome event rates, by setting the baseline hazards $α_{y}$ as $- 4.7$ , $- 3.8$ and $- 3$ to allow low (5–6.5%), medium (10–14%) and high (20–25%) percentages of patients experiencing the event during follow-up in the simulated data, respectively. In total, we investigated 81 scenarios for generating the simulated data using the mechanism in Table 2, by considering the combinations of the specifications presented in Table 3. We varied the number of patients, confounding strength of a time-varying confounder and treatment prevalence (i.e. percentage of patients who ever received treatment during follow-up). By varying the intercept term of the treatment model $α_{a}$ , we could generate a low (25–30%), medium (50–60%) and high (75–80%) treatment prevalence in the simulated data. For each scenario, we generated 1000 simulated datasets.

Table 3.
Summary of the specifications for the 81 scenarios considered in the simulations.

Outcome event rate	Sample size	Confounding strength	Treatment prevalence
Low: $α_{y} = - 4.7$	Small: $n = 200$	$α_{c} = 0.1$	Low: $α_{a} = - 1$
Medium: $α_{y} = - 3.8$	Medium: $n = 1000$	$α_{c} = 0.5$	Medium: $α_{a} = 0$
High: $α_{y} = - 3$	Large: $n = 5000$	$α_{c} = 0.9$	High: $α_{a} = 1$

5.1.3. Estimation and inference

For each simulated dataset, we emulated 5 trials ( $m = 0, \dots, 4$ ), with trial 0 including trial visits $k = 0, \dots, 4$ , trial 1 including trial visits $k = 0, \dots, 3$ , and so on. Our estimands of interest are the MRDs at trial visits $k = 0, \dots, 4$ for patients eligible in trial $0$ . We chose trial 0 patients as the target population because they had the longest follow-up, which allowed us to assess the methods for estimation and inference of the MRDs at later visits.

Correctly specified logistic regression models were fitted to the simulated data to estimate the denominator terms of the stabilised IPTWs in equation (1) of the Supplemental Materials, while the numerator terms were estimated by fitting an intercept-only logistic model stratified by treatments received at the immediately previous visit $A_{t_{j - 1}}$ .

Then the following weighted pooled logistic model was fitted to the artificially censored data in the five sequential trials to estimate the counterfactual discrete-time hazard:

logit \{ Pr (Y_{m, k}^{{\bar{a}}_{k} = \bar{a}} = 1 ∣ Y_{m, k - 1}^{{\bar{a}}_{k - 1} = \bar{a}} = 0, X_{1, m, 0}, X_{2}) \} = β_{0, k} + β_{1, k} \cdot a + β_{2, k} X_{1, m, 0} + β_{3, k} X_{2}

(11)

where $X_{1, m, 0}$ is the value of the time-varying confounder at the baseline of trial $m$ ( $m = 0, \dots, 4$ ) and $β_{0, k}$ , $β_{1, k}$ , $β_{2, k}$ , $β_{3, k}$ are regression coefficients that vary by trial visit ( $k = 0, \dots, 4$ ) but are assumed to be the same across trials. Note that, while the MSM correctly specifies the counterfactual discrete-time hazard for the baseline visits ( $k = 0$ ), we were not able to correctly specify the counterfactual discrete-time hazards at later visits ( $k \geq 1$ ) via pooled logistic regression models. This was due to the non-collapsibility of the logistic regression model used in the data-generating mechanism.²⁰ It was to minimise this misspecification (and potential bias caused by it) that we chose the model in (11), which is a rich model that allows the coefficients of treatment and confounders to vary by trial visit.

We applied the methods for constructing 95% CIs of the MRD in Section 4 to each simulated dataset, where we set $S = 500$ for the CIs based on the sandwich variance estimator and $B = 500$ for the CIs based on the bootstrap methods. We applied the two jackknife methods only when $n = 200$ , because it would be computationally inefficient relative to bootstrap to use the jackknife CI methods for larger sample sizes. The jackknife method is a linear approximation of the bootstrap,⁴⁴ and so we would not expect jackknife CIs to perform any better than bootstrap with larger sample sizes. A pseudo-code algorithm for the simulation study is presented in Section 4 of the Supplemental Materials.

5.1.4. True values

True values of the MRDs for each simulation scenario were obtained by generating data for a very large randomised controlled trial, as proposed by Keogh et al.²⁰ The true marginal risks in trial 0 when all patients were always treated or all patients were never treated were approximated by Kaplan–Meier estimates from two extremely large datasets ( $n = 1{,}000{,}000$ ). For the first dataset, the treatment strategy was set to ‘always treated’ (by setting $A_{0, k} = 1$ for $k = 0, \dots, 4$ ). For the second, it was set to ‘never treated’ (by setting $A_{0, k} = 0$ for $k = 0, \dots, 4$ ).

Additionally, for each simulation scenario, we fitted the same weight estimation models and MSM in Section 5.1.3 to multiple simulated datasets, each with 200,000 patients. We then averaged the resulting MRD estimates to compute ‘pseudo-true’ MRDs induced by the misspecified MSM. The absolute differences between the ‘pseudo-true’ MRDs and the corresponding Kaplan–Meier estimates of the true MRDs were all less than 0.001. More details about these ‘pseudo-true’ values are provided in Section 5.4 of the Supplemental Materials.

5.1.5. Performance measures

Empirical coverage of the CIs was calculated for each trial visit in each simulation scenario. This was done by dividing the number of times that a CI of the MRD contained the true MRD by 1000 minus the number of times there was an error in the CI construction. Such errors include issues with the sandwich variance estimation or jackknife variance estimation that led to non-positive definite variance matrices. We also considered bias-eliminated CI coverage to adjust for the impact of bias in the coverage results.⁴⁵ This involved calculating the proportion of the 1000 CIs that contained the sample mean of the 1000 MRD estimates for each scenario.

Morris et al.⁴⁵ state four potential reasons for CI under- or over-coverage when examining simulation results: (i) MRD estimation bias, (ii) the standard error estimates from a CI method underestimate the empirical standard deviation (SD) of the MRD estimator, (iii) the distribution of the MRD estimates is not normal and CIs have been constructed assuming normality, and (iv) the standard error estimates from a CI method are too variable. Therefore, for each simulation scenario, we calculated the empirical bias of the MRD estimates at each trial visit to check reason (i) for CI under-coverage. For each CI method and each simulated dataset, we calculated the SD of the MRD estimates (from resampling), denoted as $\hat{SE}$ . This provided us with 1000 standard error estimates for each CI method. We then computed $\frac{\hat{SE}}{{SD}_{\hat{MRD}}}$ , the ratio of the standard error estimate for each CI method to the SD of the MRD empirical distribution ${SD}_{\hat{MRD}}$ , where ${SD}_{\hat{MRD}}$ was obtained by taking the SD of the 1000 MRD point estimates across simulations. We referred to this ratio as the ‘Standard Error (SE) ratio’. The summary statistics of the SE ratio across the 1000 simulations allowed us to assess reasons (ii) and (iv) for potential CI under-coverage or over-coverage.
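
The SE ratio and its summary can be sketched in Python as follows. The function names are hypothetical; `se_estimates` holds one standard error estimate per simulated dataset for a given CI method, and `point_estimates` holds the corresponding MRD point estimates.

```python
import statistics


def se_ratios(se_estimates, point_estimates):
    """SE ratio per simulated dataset: SE estimate / empirical SD of the MRD estimates."""
    sd_mrd = statistics.stdev(point_estimates)
    return [se / sd_mrd for se in se_estimates]


def summarise_ratios(ratios):
    """Mean and first/third quartiles of the SE ratios."""
    q1, _, q3 = statistics.quantiles(ratios, n=4)
    return {"mean": statistics.fmean(ratios), "q1": q1, "q3": q3}
```

A mean ratio below 1 indicates that the CI method underestimates the variability of the MRD estimator (reason (ii)), while a wide interquartile range indicates that its standard error estimates are themselves unstable (reason (iv)).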

We recorded the number of times that each CI method encountered an error (a ‘construction failure’). We also calculated the Monte Carlo standard error of the empirical coverage (due to using a finite number of simulated datasets):⁴⁵

{SE}_{MC} = \sqrt{\frac{\hat{Coverage} (1 - \hat{Coverage})}{n_{sim}}},

(12)

where $\hat{Coverage}$ is the empirical coverage and $n_{sim}$ is the number of simulations performed ( $n_{sim} = 1000$ in our case).
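
As a minimal Python sketch of the empirical coverage and of equation (12): the function names are hypothetical, a failed CI construction is represented by `None`, and interval endpoints are assumed to count as covering the true value.

```python
import math


def empirical_coverage(intervals, true_value):
    """Share of successfully constructed CIs that contain the true value."""
    valid = [ci for ci in intervals if ci is not None]   # drop construction failures
    hits = sum(1 for lower, upper in valid if lower <= true_value <= upper)
    return hits / len(valid)


def monte_carlo_se(coverage, n_sim):
    """Monte Carlo standard error of an empirical coverage, equation (12)."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)
```

With $n_{sim} = 1000$ and a coverage of 0.95, `monte_carlo_se(0.95, 1000)` is approximately 0.0069, so empirical coverage between roughly 93.6% and 96.4% is compatible with nominal coverage up to Monte Carlo error.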

Finally, we reported the relative computation time that it took to construct CIs based on nonparametric bootstrap, LEF bootstrap and jackknife compared to CIs based on the sandwich variance estimator.

5.1.6. Computational resources

The simulations were conducted using R (version 4.1.3).⁴⁶ Point estimation and CI construction based on the sandwich variance estimator were performed using the R package TrialEmulation³⁸ (version 0.0.3.9), and some of the key functions in the R package are summarised in Section 3 of the Supplemental Materials. We used packages doParallel and doRNG to parallelise the simulation, which took 12 hours to finish by using 67 cores of the research institution’s high-performance computing cluster.

5.2. Results

In this section, we focus on the discussion of the results in low event rate scenarios; results for medium and high event rate scenarios can be found in Section 5 of the Supplemental Materials. The reason for this focus is two-fold: (1) The HERS data example had low event rates; (2) the CI methods performed the worst in low event rate scenarios and yet we still observed interesting patterns for their relative performance, which were not drastically different from those in medium and high event rate scenarios.

Figure 1 shows the empirical coverage rates of the CIs when the event rate was low, using the true values by Kaplan–Meier estimates; see Figures 2 and 3 of the Supplemental Materials for results with medium and high event rates. Differences between these coverage rates and the corresponding bias-eliminated coverage rates were negligible (see Figures 4 to 6 of the Supplemental Materials). We also calculated the coverage rates using the ‘pseudo-true’ MRDs induced by the misspecified MSM. The differences between the coverage rates based on the ‘pseudo-true’ MRDs (Figures 7 to 9 of the Supplemental Materials) and those based on the true MRDs calculated using Kaplan–Meier estimates (Figure 1 of the main text and Figures 2 and 3 in the Supplemental Materials) were negligible. Thus we focus on the simulation results based on the true MRDs calculated using Kaplan–Meier estimates in the following discussions.

We will initially examine the bias of the MRD estimates, followed by the SE ratio results for the CI methods, to explain the under-coverage or over-coverage of the CIs and the relative performance of these CI methods in various simulation scenarios. Afterwards, we will discuss computation time for the CI methods. Results for Monte Carlo standard errors of the CI coverage estimates and CI construction failure rates can be found in Sections 5.7 and 5.8 of the Supplemental Materials. Since we observe very little difference in the performance of the two LEF bootstrap CI methods, we will refer to them as one when discussing the results. We refer to the CIs constructed using the sandwich variance estimator as ‘Sandwich CIs’, those using Approach 1 from jackknife resampling as ‘Jackknife Wald CIs’ and those using Approach 2 as ‘Jackknife Multivariate Normal (MVN) CIs’.

5.2.1. Empirical bias and its impact on CI coverage

Figure 2 presents the empirical biases of the MRD estimates.

Figure 2.

Empirical biases of the marginal risk difference (MRD) estimates in various simulation scenarios.

Minimal biases were observed at earlier visits ( $k < 2$ ), which approached zero as sample sizes increased. The increasing absolute bias at later visits ( $k > 2$ ) could be explained by the increasing data sparsity and treatment arm imbalances at later visits. Very few events occurred after trial visit $2$ in most scenarios, even in those with large sample sizes ( $n = 5000$ ) (see Table 3 of Supplemental Materials for an example).

The absolute biases in low or high treatment prevalence scenarios were larger than those in medium treatment prevalence scenarios. This likely stemmed from data sparsity issues at later trial visits that were aggravated by treatment arm imbalance caused by low or high treatment prevalence. Table 1 of the Supplemental Materials shows an example of this imbalance.

No clear patterns were observed for the impact of confounding strength and outcome event rate on the bias. The direction and magnitude of the bias appear to differ by the combinations of these scenarios.

These bias results explain the general trends of CI coverage in Figure 1. As the sample size increased, the CI coverage at earlier visits approached the nominal 95% level, which was not surprising given that all the methods rely on asymptotic approximations. The increasing absolute bias at later visits ( $k > 2$ ) coincided with the deterioration in CI coverage that we observed across all simulation scenarios at later visits (reason (i) of under-coverage according to Morris et al.⁴⁵).

In scenarios with low treatment prevalence ( $α_{a} = - 1$ ), CI coverage did not achieve the nominal 95% level at later visits ( $k > 2$ ) even with larger sample sizes. Similarly, in scenarios with high treatment prevalence ( $α_{a} = 1$ ), nominal coverage was rarely achieved, except at the baseline visit ( $k = 0$ ), and coverage decayed considerably at later visits for Sandwich CIs, nonparametric and LEF bootstrap CIs.

We note that the poor coverage in low or high treatment prevalence scenarios does not appear to originate from practical violations of the positivity assumption. Table 12 in Section 5.5 of the Supplemental Materials presents summary statistics of the estimated IPTWs stratified by assigned treatments from a large data set with $n = 50{,}000$ for each simulation scenario. We chose $n = 50{,}000$ to obtain an empirical distribution of the IPTWs, allowing for possible extreme values sampled from the covariate distributions. The estimated weights were not very large ( $< 50$ ) and their summary statistics were similar between treatment arms for all scenarios, probably because the models for weight estimation were correctly specified and the true values of the parameters were not too extreme. This suggests that there was no practical violation of the positivity assumption manifested by the instability of the weight estimation.

We observed minimal impact of confounding strength on CI coverage. Increasing the event rate improved the CI coverage for all methods most prominently in scenarios with larger sample size ( $n = 1000, 5000$ ) and medium treatment prevalence, as shown in Figures 2 and 3 of the Supplemental Materials.

The empirical SD and the root-mean squared error (MSE) of the MRD estimates largely followed the patterns of the bias (see Figures 10 and 11 of the Supplemental Materials): scenarios with larger biases also exhibited larger SDs and MSEs.

5.2.2. SE ratio and its impact on CI coverage

Figures 3 and 4 present the summary statistics of the SE ratio, $\frac{\hat{SE}}{{SD}_{\hat{MRD}}}$ , for each CI method in scenarios with low event rates and small sample size ( $n = 200$ ); see Figures 12 to 18 of the Supplemental Materials for scenarios with low event rates and medium/large sample sizes and for scenarios with medium and high event rates.

Figure 3.

Ratio of the estimated standard error to empirical standard deviation of the MRD estimator (SE ratio) in low event rate and small sample size scenarios. The dots represent the averages of the ratio, with the bottom and top of the bar being the first and third quartile of this ratio, respectively. Bootstrap: CIs constructed by nonparametric bootstrap. LEF both: CIs constructed by applying Approach 2 of LEF bootstrap. LEF outcome: CIs constructed by applying Approach 1 of LEF bootstrap. Jackknife Wald: CIs constructed by applying Approach 1 of jackknife resampling; Sandwich: CIs based on the sandwich variance estimator. MRD: marginal risk difference; SE: standard error; CI: confidence interval; LEF: linearised estimating function.

Figure 4.

We found that the SE ratio results largely explained the differences in CI coverage of the various CI methods. The comparative performance of the CI methods in terms of coverage generally followed the patterns of the SE ratios: when the average SE ratios were above 1, we observed over-coverage; when the average SE ratios were about 1, we observed close to nominal coverage; when the average SE ratios were below 1, we observed under-coverage. Higher variability of the SE ratios also coincided with lower CI coverage. Notably, in scenarios with high event rate, large sample size, medium treatment prevalence and weak confounding, all CI methods achieved close-to-nominal coverage, which corresponded to all CI methods having SE ratios close to 1 on average and with low variability.

Low event rate and small sample size scenarios. From Figure 3, we can see that the SE ratios for nonparametric bootstrap CIs were on average lower than 1 at later visits, and they were also more variable, especially with high treatment prevalence. This is consistent with the coverage results in Figure 1, where we expect nonparametric bootstrap CIs to have low coverage if the variability of the MRD estimator was underestimated and the variability estimates were highly variable across simulations.⁴⁵ The nonparametric bootstrap CIs had lower coverage compared to LEF bootstrap CIs in scenarios with low/medium treatment prevalence, and lower coverage than Sandwich CIs when treatment prevalence was high. This performance is mirrored in the SE ratio results for nonparametric bootstrap CIs in such scenarios. It was likely that the data sparsity issues at later visits increased the instability of parameter estimation in bootstrap samples and thus resulted in the lower SE ratios and large variability of SE ratios for nonparametric bootstrap CIs. In contrast, LEF bootstrap reduced instability of parameter estimation in bootstrap samples by design, and thus led to smaller variability of the SE ratios at later visits. Similar findings for nonparametric bootstrap CIs were observed by Austin.²⁶

In Figure 3, the SE ratios for Jackknife Wald CIs were also lower than 1 at later visits, but in low and high treatment prevalence scenarios, they were less variable compared to those for nonparametric bootstrap CIs, which might partly explain the better coverage of Jackknife Wald CIs in these scenarios. However, in Figure 4, the SE ratios for Jackknife MVN CIs were much larger than 1 in all scenarios with small sample sizes, which may explain the over-coverage or closer-to-nominal coverage for Jackknife MVN CIs in some scenarios.

In Figure 3, the SE ratios for LEF bootstrap CIs and Sandwich CIs were larger than 1 at early visits ( $k < 2$ ), but they gradually dropped below 1 at later visits. When the treatment prevalence was high ( $α_{a} = 1$ ), their SE ratios were much lower but less variable than those of nonparametric bootstrap CIs at the later visits. LEF bootstrap CIs also provided better coverage than Sandwich CIs at later visits in small sample size scenarios, possibly because the SE ratios for LEF bootstrap CIs were on average closer to 1 or less variable than the Sandwich CIs’ SE ratios. In such scenarios, Sandwich CIs might have suffered from finite-sample bias of variance matrix estimation that affected the MVN sampling for the MRD estimation, as well as the construction failures due to non-positive definite variance matrices (see Table 13 of the Supplemental Materials).

Low event rate and medium/large sample size scenarios. In Figure 12 of the Supplemental Materials, with larger sample sizes ( $n = 1000, 5000$ ), the SE ratio results for all CI methods were improved in low/medium treatment prevalence scenarios, which was reflected correspondingly by the better coverage. The SE ratios of the LEF bootstrap CIs were above 1 at most visits in low/medium treatment prevalence scenarios, which could explain some of the CI over-coverage we observed and the better coverage compared with Sandwich CIs. With high treatment prevalence, the SE ratios of the LEF bootstrap CIs at later visits approached 1 on average as sample sizes increased, but they became more variable as well. This might explain the corresponding decreased coverage of LEF bootstrap CIs, which was now similar to the coverage of nonparametric bootstrap CIs. Similar patterns in high treatment prevalence scenarios were also observed for the SE ratios of Sandwich CIs. However, they were slightly less variable than those of LEF bootstrap CIs and closer to 1 than those of nonparametric bootstrap CIs, which might be due to the much reduced number of construction failures for Sandwich CIs in large sample size and high treatment prevalence scenarios. Overall, this resulted in better coverage of Sandwich CIs in these scenarios.

Medium/high event rate scenarios. From Figures 13 to 18 in the Supplemental Materials, increasing the outcome event rate also improved the SE ratio results for all CI methods, with the SE ratios approaching 1 and becoming less variable. This is consistent with the coverage results in Figures 2 and 3 in the Supplemental Materials, where an increase of outcome event rate resulted in the various CI methods converging to similar closer-to-nominal coverage. However, in scenarios with large sample sizes and low or high treatment prevalence, we still observed SE ratios to be lower than 1 and highly variable at later visits. Together with the larger empirical bias, this could explain the low coverage at later visits for all CI methods in these scenarios.

With increased event rates, we observed that differences in CI coverage among the methods were primarily due to SE ratio variability. In most scenarios with medium/high event rates and medium/large sample sizes, the average SE ratios of Sandwich CI and LEF bootstrap CIs were very similar but the SE ratios of Sandwich CIs tended to be less variable, which might lead to their better coverage. With medium/large sample sizes, nonparametric bootstrap CIs and LEF bootstrap CIs had similar SE ratios, resulting in similar coverage, whereas with small sample sizes, the SE ratios of nonparametric bootstrap CIs tended to be much more variable than those for LEF bootstrap CIs particularly at later visits, leading to worse coverage.

We also observed close-to-nominal coverage for Jackknife Wald CIs in scenarios with small sample size, low treatment prevalence and medium/high outcome event rate, which could be explained by smaller variability in their SE ratios with increased event rates. In such scenarios, SE ratios of Jackknife Wald CIs were also closer to 1 for all visits, compared to the low event rate scenario. The SE ratios of Jackknife MVN CIs were still larger than 1 but became less variable. This is consistent with their slight over-coverage or close-to-nominal coverage in scenarios with small sample size, low/medium treatment prevalence and medium/high outcome event rates.

Confounding strength scenarios. Similarly to the empirical bias results, the confounding strength appeared to have minimal impact on the SE ratio results. This also translated to minimal impact of confounding strength on CI coverage.

5.2.3. Computation time

Table 4 summarises the average computation time (across 1000 simulated data sets and the scenarios of treatment prevalence and confounding strength) for constructing a CI of the MRD using each method, relative to the time for constructing a Sandwich CI. On average, nonparametric bootstrap CIs took about 2.4–4.5 times longer to compute than Sandwich CIs, and this relative cost rose steeply with sample size (from about 2.4 times at $n = 200$ to about 4.5 times at $n = 5000$ ). LEF bootstrap CIs, using either Approach 1 or Approach 2, took about 1.8–3.1 times longer to compute than Sandwich CIs.

Table 4.
Summary of average computation time of the CI construction methods in simulations with the sandwich-variance-estimator-based CIs as the reference.

Outcome event rate	Sample size	Bootstrap	LEF outcome	LEF both	Jackknife Wald	Jackknife MVN	Sandwich
Low	200	2.44	2.05	1.84	1.56	2.78	1.00
	1000	2.80	2.10	1.93			1.00
	5000	4.52	3.08	2.86			1.00
Medium	200	2.39	2.03	1.83	1.54	2.76	1.00
	1000	2.75	2.12	1.95			1.00
	5000	4.35	3.11	2.85			1.00
High	200	2.40	2.05	1.85	1.52	2.73	1.00
	1000	2.72	2.14	1.97			1.00
	5000	4.11	3.07	2.83			1.00

Bootstrap: CIs constructed by nonparametric bootstrap. LEF both: CIs constructed by applying Approach 2 of LEF bootstrap. LEF outcome: CIs constructed by applying Approach 1 of LEF bootstrap. Jackknife Wald: CIs constructed by applying Approach 1 of jackknife resampling. Jackknife MVN: CIs constructed by applying Approach 2 of jackknife resampling. Sandwich: CIs based on the sandwich variance estimator. CI: confidence interval; LEF: linearised estimating function; MVN: multivariate normal.

For small sample sizes ( $n = 200$ ), Jackknife Wald CIs had slightly shorter computation times than LEF bootstrap CIs. Jackknife MVN CIs took almost 3 times longer to construct than Sandwich CIs, most likely because the former method required two sampling steps (the jackknife resampling step and the MVN sampling step), while Sandwich CIs only involved the MVN sampling step.
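To illustrate the jackknife resampling step, the sketch below builds a generic leave-one-out jackknife Wald CI for a crude risk difference. It is a textbook construction under simplifying assumptions (no weighting, a single time point) and is not the authors' Approach 1; the names `jackknife_wald_ci` and `risk_difference` and the toy data are hypothetical.

```python
import math


def jackknife_wald_ci(data, estimator, z=1.959964):
    """Leave-one-out jackknife SE and Wald CI for estimator(data)."""
    n = len(data)
    theta_hat = estimator(data)
    # Re-estimate n times, each time leaving out one patient.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    # Standard jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    se = math.sqrt(variance)
    return theta_hat - z * se, theta_hat + z * se, se


def risk_difference(patients):
    """Crude risk difference from (treated, event) pairs; a stand-in for the MRD estimator."""
    treated = [y for a, y in patients if a == 1]
    control = [y for a, y in patients if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data: 10 treated patients with 3 events, 10 untreated patients with 5 events.
patients = [(1, 1)] * 3 + [(1, 0)] * 7 + [(0, 1)] * 5 + [(0, 0)] * 5
lower, upper, se = jackknife_wald_ci(patients, risk_difference)
print(lower, upper, se)
```

The estimator is refitted once per patient, so the number of refits equals the sample size, whereas a bootstrap uses a fixed number of resamples.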

The gain in computational efficiency from LEF bootstrap is notable: with large sample sizes ( $n = 5000$ ), LEF bootstrap was on average 1.6 times faster than nonparametric bootstrap. However, the sandwich-variance-estimator-based method remains the fastest for CI construction, especially if it is not possible to parallelise the computation of the bootstrap-based CIs.
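The speed-up of LEF bootstrap over nonparametric bootstrap can be approximated from Table 4 by dividing the relative times at $n = 5000$. The sketch below does this arithmetic; it assumes that ratios of times expressed relative to the Sandwich CI approximate ratios of the underlying average times, which need not be exactly how the figure quoted in the text was obtained.

```python
# Relative computation times at n = 5000, copied from Table 4 (Sandwich CI = 1.00).
TABLE4_N5000 = {
    "Low": {"Bootstrap": 4.52, "LEF outcome": 3.08, "LEF both": 2.86},
    "Medium": {"Bootstrap": 4.35, "LEF outcome": 3.11, "LEF both": 2.85},
    "High": {"Bootstrap": 4.11, "LEF outcome": 3.07, "LEF both": 2.83},
}


def speedup_over_bootstrap(row, method):
    """How many times faster `method` is than nonparametric bootstrap in one table row."""
    return row["Bootstrap"] / row[method]


speedups = {
    (event_rate, method): speedup_over_bootstrap(row, method)
    for event_rate, row in TABLE4_N5000.items()
    for method in ("LEF outcome", "LEF both")
}
for (event_rate, method), value in speedups.items():
    print(f"{event_rate:6s} {method:11s} {value:.2f}")
```

Under this approximation the implied speed-ups lie between roughly 1.3 and 1.6, with the largest value for LEF both at low event rates.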

5.2.4. Summary

Our simulation results suggest that LEF bootstrap CIs provided better coverage than Sandwich CIs in scenarios with small/medium sample sizes, low/medium outcome event rates and low/medium treatment prevalence. This might be attributed to the finite-sample bias of the sandwich variance estimator and the high frequency of construction failures for Sandwich CIs in such scenarios. The performance of nonparametric bootstrap CIs was considerably affected by treatment arm imbalance and data sparsity, particularly in scenarios with low event rates and small/medium sample sizes. In these scenarios, the LEF bootstrap method not only performed better but was also more computationally efficient than nonparametric bootstrap. While Jackknife Wald CIs achieved nominal coverage with small sample sizes and low treatment prevalence, their performance deteriorated sharply with medium/high treatment prevalence. With medium/high outcome event rates and low/medium treatment prevalence, Jackknife MVN CIs also achieved close-to-nominal coverage, possibly due to their overall conservativeness. Due to data sparsity and finite-sample bias, all methods performed poorly when treatment prevalence was high, and Jackknife MVN CIs faced numerous construction failures, making them impractical for use.

Since STE is particularly useful for data scenarios with small numbers of patients initiating treatments at any given time (low treatment prevalence) and with low event rates, LEF bootstrap offers a useful alternative to the sandwich variance estimator and the nonparametric bootstrap for CI construction, especially in small/medium sample sizes. Although Jackknife Wald CIs and Jackknife MVN CIs also provided good coverage in some scenarios, they become computationally inefficient as the number of patients exceeds the number of bootstrap samples. For large sample sizes and medium/high event rates, Sandwich CIs are computationally more efficient than the bootstrap-based CI methods. A summary of recommended CI methods is provided in Table 5. Overall, our investigation underscores the importance of considering sample size, outcome event rate, and treatment prevalence when selecting a CI construction method in STE.

Table 5.
Summary of recommended procedures for constructing CIs based on experimental criteria in the simulation study.

Experimental criteria	Recommended procedure	Rationale
Small/medium sample sizes and low/medium event rates and low/medium treatment prevalence	LEF Bootstrap CIs	Sandwich CIs suffer from finite sample bias of the sandwich variance estimator and construction failures; nonparametric bootstrap CIs suffer from treatment arm imbalances and data sparsity; LEF bootstrap CIs outperformed both and were computationally efficient.
Large sample size and medium/high event rates	Sandwich CIs	Performs better and computationally more efficient than resampling-based CI methods
Small sample size and low treatment prevalence	Jackknife Wald CIs	Achieved close-to-nominal coverage
Small sample size and medium/high event rate and low/medium treatment prevalence	Jackknife MVN CIs	Achieved close-to-nominal coverage

CI: confidence interval; LEF: linearised estimating function; MVN: multivariate normal.

6. Application to the HERS data

In this section, we applied the methods described in Sections 3 and 4 to the HERS data. The treatment process for the denominator term of the stabilised IPTWs was modelled by two logistic regressions stratified by treatment received at the immediately previous visit $A_{t_{j - 1}}$ ( $j = 1, \dots, 4$ ). We included the following covariates: CD4 count (after square root transformation and standardisation) and HIV viral load (after $\log_{10}$ transformation and standardisation) measured at the previous two visits (since HAART treatment was a self-reported status between the last visit and the current visit), HIV symptoms at the previous visit, ethnicity (Caucasian, Black, other) and study sites. The numerator term of the stabilised IPTWs was estimated as the marginal probability of receiving HAART in the stratum defined by previous treatment $A_{t_{j - 1}}$ .
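As a rough illustration of how stabilised IPTWs accumulate over visits, the sketch below multiplies, visit by visit, the ratio of the numerator probability of the observed treatment to the denominator probability. It assumes the fitted probabilities have already been obtained from the stratified models; the function name `stabilised_iptw` and the numbers are hypothetical, and the exact weight definition used in the article is given in its earlier sections, which are not reproduced here.

```python
def stabilised_iptw(treatments, p_num, p_den):
    """Cumulative stabilised weights for one patient.

    treatments: observed treatment (0/1) at each visit.
    p_num: marginal Pr(A = 1) in the stratum defined by previous treatment.
    p_den: covariate-conditional Pr(A = 1) from the stratum-specific logistic model.
    """
    weights = []
    running = 1.0
    for a, pn, pd in zip(treatments, p_num, p_den):
        # Probability of the treatment actually received under each model.
        numerator = pn if a == 1 else 1.0 - pn
        denominator = pd if a == 1 else 1.0 - pd
        running *= numerator / denominator
        weights.append(running)
    return weights


# Hypothetical fitted probabilities for a patient who starts treatment at the third visit.
weights = stabilised_iptw(
    treatments=[0, 0, 1],
    p_num=[0.2, 0.2, 0.2],
    p_den=[0.1, 0.3, 0.4],
)
print(weights)
```

Denominator probabilities close to 0 or 1 for the treatment received would produce very large weights, which is why summaries of the estimated weights are inspected for positivity problems.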

Similarly, the censoring process for the denominator term of the stabilised IPCWs (see equation (2) of the Supplemental Materials) was modelled by two logistic regressions stratified by the previous treatment $A_{t_{j - 1}}$ . The following covariates were included: CD4 count, HIV viral load and HIV symptoms measured at the current and previous visits, ethnicity and study sites. The numerator term of the stabilised IPCWs was estimated by the marginal probabilities of remaining in the HERS cohort, stratified by the previous treatment $A_{t_{j - 1}}$ . Table 15 in the Supplemental Materials presents the summary statistics of the estimated inverse probability of treatment and censoring weights by treatment arms in the HERS data analysis, which suggests no practical violation of the positivity assumption of treatment adherence and censoring in the HERS data.

The pooled logistic regression for the MSM included an intercept term, main effect of treatment, main effects of CD4 count and HIV viral load at the previous two visits before the trial baseline, main effects of ethnicity and study sites, and the interaction between treatment and CD4 count at the visit before trial baseline. A summary of the fitted MSM is provided in Table 6.

Table 6.
Results for the fitted MSM for the sequentially emulated trials using the HERS data. Robust standard error: Standard errors based on the sandwich variance estimate.

	Estimate	Robust standard error
Intercept	−5.1510	1.1250
Assigned treatment: HAART vs. non-HAART	−0.0002	0.5528
CD4 count at 1 visit before trial baseline	0.1725	0.3719
CD4 count at 2 visits before trial baseline	−0.2988	0.3705
Viral load at 1 visit before trial baseline	0.0213	0.3518
Viral load at 2 visits before trial baseline	0.6119	0.2714
Site 2 vs. Site 1	0.5002	1.1399
Site 3 vs. Site 1	0.0646	0.9345
Caucasian vs. Black	−0.2009	0.7596
Other ethnicity vs. Black	0.9969	1.0755
Interaction between assigned treatment and CD4 count at 1 visit before trial baseline	−0.0272	0.5507

MSM: marginal structural model; HERS: HIV Epidemiology Research Study; HAART: Highly Active AntiRetroviral Treatment.

We used 500 bootstrap samples for nonparametric and LEF bootstrap CIs, and drew a sample of size $S = 500$ of MSM parameters for constructing Sandwich CIs.
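The idea of drawing MSM parameters to construct a Sandwich CI can be sketched as follows: sample parameter vectors from a multivariate normal distribution centred at the estimates with the sandwich variance matrix, evaluate the quantity of interest at each draw, and take percentiles. This is a minimal sketch under stated assumptions: the two-parameter logistic model, the numerical values, the percentile rule and the function names are all hypothetical and are not the HERS results or the authors' implementation.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor; raises if the matrix is not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = matrix[i][i] - partial
                if diag <= 0.0:
                    raise ValueError("variance matrix is not positive definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def mvn_percentile_ci(theta_hat, cov, g, n_draws=500, level=0.95, seed=1):
    """Percentile CI for g(theta) from draws theta ~ MVN(theta_hat, cov)."""
    rng = random.Random(seed)
    lower = cholesky(cov)
    p = len(theta_hat)
    values = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        theta = [theta_hat[i] + sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        values.append(g(theta))
    values.sort()
    alpha = 1.0 - level
    lo_index = math.floor(alpha / 2 * (n_draws - 1))
    hi_index = math.ceil((1 - alpha / 2) * (n_draws - 1))
    return values[lo_index], values[hi_index]


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def risk_difference(theta):
    """Toy risk difference from an intercept and a treatment coefficient."""
    intercept, treatment = theta
    return expit(intercept + treatment) - expit(intercept)


# Hypothetical estimates and variance matrix, for illustration only.
theta_hat = [-3.0, -0.5]
cov = [[0.04, -0.01], [-0.01, 0.09]]
ci = mvn_percentile_ci(theta_hat, cov, risk_difference)
print(ci)
```

In this sketch, a variance matrix that is not positive definite makes the Cholesky step fail, which is one way a construction failure of this type of CI can arise.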

Figure 5 presents the estimated MRD, which is the difference in risk of all-cause mortality when all patients eligible for trial 0 adhere to the treatment strategies assigned (HAART vs. no HAART). Figure 5 also includes pointwise 95% CIs of the MRD at each trial visit obtained by the four methods described in Section 4. The results are not statistically significant, given that all four CIs include zero.

Figure 5.

Estimates and pointwise 95% CIs of the MRD of all-cause mortality under HAART treatment strategy and the non-HAART strategy for patients enrolled in the first emulated trial (trial 0) of the HERS data. Bootstrap: CIs constructed by nonparametric bootstrap. LEF both: CIs constructed by applying Approach 2 of LEF bootstrap. LEF outcome: CIs constructed by applying Approach 1 of LEF bootstrap. Sandwich: CIs based on the sandwich variance estimator. CI: confidence interval; MRD: marginal risk difference; HAART: Highly Active AntiRetroviral Treatment; LEF: linearised estimating function; HERS: HIV Epidemiology Research Study.

7. Discussion

In this article, we focussed on the application of STE to estimate and make inferences about the per-protocol effect of treatments on a survival outcome in terms of counterfactual MRDs over time. We conducted a simulation study to compare the relative performance of four CI construction methods for the MRD, based on the sandwich variance estimator, nonparametric bootstrap, LEF bootstrap (which previously had not been extended to STE) and jackknife resampling. In scenarios with small/medium sample sizes, low/medium event rates and low/medium treatment prevalence, we observed relatively better coverage for LEF bootstrap CIs than for nonparametric bootstrap and Sandwich CIs. These results align with previous findings on the limitations of the sandwich variance estimator when the sample size is small.^47,48 Since STE is particularly useful in scenarios with small sample sizes, low event rates and low treatment prevalence, the LEF bootstrap offers a valuable alternative to the sandwich variance estimator and the nonparametric bootstrap for constructing CIs of the MRD. With large sample sizes and medium/high event rates, Sandwich CIs exhibited relatively better performance in our simulations and were also computationally more efficient; one could therefore choose to use them in such scenarios. Our simulation results also highlighted how data sparsity issues, which are inherent when implementing STE, can greatly affect CI performance, meaning that one should carefully consider the features of one's data when choosing a CI method.

Although dependent censoring was not included in the data generating mechanism of our simulation study, we expect that the relative performance of CIs constructed using LEF bootstrap and the sandwich variance estimator observed in our simulation study would also hold if there were dependent censoring and IPCWs were used to handle it. Inference would still face the same issues, but with the additional uncertainty caused by the estimation of IPCWs and possible exacerbation of the finite-sample bias as a result of loss of information.

We primarily focussed on pooled logistic regressions for fitting MSMs and did not explore alternative survival time models, such as additive hazards models.²⁰ Additive hazards models can be used in moderate-to-high event rate settings, while in low event rate settings they might result in negative hazard estimates. Inference procedures for STE based on additive hazards models warrant further research.

We have used parametric models for estimation of the SIPTCWs. Recent developments in nonparametric data-adaptive methods have led to their widespread use in biomedical research. This is partly because these models have become more accessible due to increased automation in their implementation across programming languages. Because the consistency of the weighted estimators of the MSM parameters relies on correct specification of treatment and censoring models, data-adaptive methods are attractive. However, the subsequent inference can be challenging.^49–54

Throughout this article, it was assumed that there were no unmeasured confounders. However, in practice, this assumption is unlikely to hold, as it is challenging to identify and measure all potential confounders in observational databases. Instrumental variable approaches and sensitivity analysis have been proposed to deal with such unmeasured confounding for point treatments. Recently, Tan (2023) proposed a sensitivity analysis approach in general longitudinal settings.⁵⁵ In the specific setting of STE, sensitivity analysis methods for unmeasured confounding warrant further research.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802251356594 for Inference procedures in sequential trial emulation with survival outcomes: Comparing confidence intervals based on the sandwich variance estimator, bootstrap and jackknife by Juliette M Limozin, Shaun R Seaman and Li Su in Medical Research

Footnotes

Acknowledgements

The authors would like to thank Dr Brian Tom and Dr Pantelis Samartsidis for helpful comments and suggestions.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Data from the HERS were collected under grant U64-CCU10675 from the U.S. Centers for Disease Control and Prevention. This work is supported by the U.K. Medical Research Council [grants MC_UU_00002/15, MC_UU_00040/03 and MC_UU_00040/05] and the MRC Biostatistics Unit Core Studentship.

Declaration of conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

Permission to access the HERS data can be obtained through the U.S. Centers for Disease Control and Prevention. Process to access the database and contact person can be found at .

Other statements

All R scripts are available at .

ORCID iDs

Juliette M Limozin

Shaun R Seaman

Li Su

Supplemental material

Supplemental material for this article is available online.

References

1. Hernán, Robins. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol 2016; 183: 758–764.
2. Hansford, Cashin, Jones, et al. Reporting of observational studies explicitly aiming to emulate randomized trials: A systematic review. JAMA Network Open 2023; 6: e2336023.
3. Matthews, Szummer, Dahabreh, et al. Comparing effect estimates in randomized trials and observational studies from the same population: An application to percutaneous coronary intervention. J Am Heart Assoc 2021; 10: e020357.
4. Hernán, Wang, Leaf. Target trial emulation. JAMA 2022; 328: 2446.
5. Matthews, Danaei, Islam, et al. Target trial emulation: applying principles of randomised trials to observational studies. BMJ 2022; 378: e071108. DOI: https://doi.org/10.1136/bmj-2022-071108.
6. Maringe, Benitez Majano, Exarchakou, et al. Reflection on modern methods: Trial emulation in the presence of immortal-time bias. Assessing the benefit of major surgery for elderly lung cancer patients using observational data. Int J Epidemiol 2020; 49: 1719–1729. DOI: https://doi.org/10.1093/ije/dyaa057.
7. Robins, Hernán, Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology 2000; 11: 550–560.
8. Robins, Finkelstein. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics 2000; 56: 779–788.
9. Hernán, Brumback, Robins. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 11: 561–570.
10. Hernán, Brumback, Robins. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. J Am Stat Assoc 2001; 96: 440–448.
11. Cain, Cole. Inverse probability-of-censoring weights for the correction of time-varying noncompliance in the effect of randomized highly active antiretroviral therapy on incident AIDS or death. Stat Med 2009; 28: 1725–1738.
12. Seaman, White. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 2013; 22: 278–295.
13. Clare, Dobbins, Mattick. Causal models adjusting for time-varying confounding: a systematic review of the literature. Int J Epidemiol 2019; 48: 254–265.
14. Hernán, Alonso, Logan, et al. Observational studies analyzed like randomized experiments. Epidemiology 2008; 19: 766–779.
15. Daniel, Cousens, De Stavola, et al. Methods for dealing with time-dependent confounding. Stat Med 2013; 32: 1584–1618.
16. Murray, Caniglia, Petito. Causal survival analysis: A guide to estimating intention-to-treat and per-protocol effects from randomized clinical trials with non-adherence. Res Methods Med Health Sci 2021; 2: 39–49.
17. D’Agostino, Lee, Belanger, et al. Relation of pooled logistic regression to time dependent Cox regression analysis: the Framingham Heart Study. Stat Med 1990; 9: 1501–1515.
18. Hernán. The hazards of hazard ratios. Epidemiology 2010; 21: 13–15.
19. Buchanan, Hudgens, Cole, et al. Worth the weight: Using inverse probability weighted Cox models in AIDS research. AIDS Res Hum Retroviruses 2014; 30: 1170–1177.
20. Keogh, Gran, Seaman, et al. Causal inference in survival analysis using longitudinal observational data: Sequential trials and marginal structural models. Stat Med 2023; 42: 2191–2225.
21. Gran, Røysland, Wolbers, et al. A sequential Cox approach for estimating the causal effect of treatment in the presence of time-dependent confounding applied to data from the Swiss HIV cohort study. Stat Med 2010; 29: 2757–2768.
22. Austin. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Stat Med 2016; 35: 5642–5655.
23. Shu, Young, Toh, et al. Variance estimation in inverse probability weighted Cox models. Biometrics 2021; 77: 1101–1117.
24. Lin, Wei. The robust inference for the Cox proportional hazards model. J Am Stat Assoc 1989; 84: 1074–1078.
25. Enders, Engel, Linder, et al. Robust versus consistent variance estimators in marginal structural Cox models. Stat Med 2018; 37: 3455–3470.
26. Austin. Bootstrap vs asymptotic variance estimation when using propensity score weighting with continuous and binary outcomes. Stat Med 2022; 41: 4426–4443.
27. Mao, Yang, et al. On the propensity score weighting analysis with survival outcome: Estimands, estimation, and inference. Stat Med 2018; 37: 3745–3763.
28. Seaman, Keogh. Simulating data from marginal structural models for a survival time outcome, 2023. DOI: https://doi.org/10.48550/arXiv.2309.05025.
29. Serdarevic, Cvitanovich, MacDonald, et al. Emergency department bridge model and health services use among patients with opioid use disorder. Ann Emerg Med 2023; 82: 694–704.
30. Virtanen, Lagerberg, Takami Lageborn, et al. Antidepressant use and risk of manic episodes in children and adolescents with unipolar depression. JAMA Psychiatry 2024; 81: 25–33.
31. Danaei, Rodríguez LAG, Cantero, et al. Observational data for comparative effectiveness research: An emulation of randomised trials of statins and primary prevention of coronary heart disease. Stat Methods Med Res 2013; 22: 70–96.
32. Kalbfleisch. The estimating function bootstrap. Can J Stat 2000; 28: 449–481.
33. Rao JNK, Tausi. Estimating function jackknife variance estimators under stratified multistage sampling. Commun Stat - Theory Methods 2004; 33: 2087–2095.
34. Binder, Kovacevic, Roberts. Design-based methods for survey data: alternative uses of estimating functions. In: Proceedings of the survey research methods section, 2004, pp.3301–3312. American Research Association.
35. Hogan, Mayer. Estimating causal treatment effects from longitudinal HIV natural history studies using marginal structural models. Biometrics 2003; 59: 152–162.
36. Yiu. Joint calibrated estimation of inverse probability of treatment and censoring weights for marginal structural models. Biometrics 2022; 78: 115–127.
37. Young, Tchetgen Tchetgen. Simulation from a known Cox MSM using standard parametric models for the g-formula. Stat Med 2014; 33: 1001–1014.
38. Rezvani, Gravestock. TrialEmulation: Causal analysis of observational time-to-event data, 2023.
39. Mandel. Simulation-based confidence intervals for functions with complicated derivatives. Am Stat 2013; 67: 76–81.
40. Carpenter, Bithell. Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians. Stat Med 2000; 19: 1141–1164.
41. Efron, Tibshirani. An Introduction to the Bootstrap. New York: Chapman and Hall/CRC, 1994. ISBN 9780429246593. DOI: https://doi.org/10.1201/9780429246593.
42. Lipsitz, Laird, Harrington. Using the jackknife to estimate the variance of regression estimators from repeated measures studies. Commun Stat - Theory Methods 1990; 19: 821–845.
43. Friedl, Stampfer. Jackknife resampling. In: Encyclopedia of Environmetrics. John Wiley & Sons, Ltd, 2006. ISBN 978-0-470-05733-9. DOI: https://doi.org/10.1002/9780470057339.vaj001.
44. Efron. Bootstrap methods: Another look at the jackknife. Ann Stat 1979; 7: 1–26.
45. Morris, White, Crowther. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.
46. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023.
47. Pan, Wall. Small-sample adjustments in using the sandwich variance estimator in generalized estimating equations. Stat Med 2002; 21: 1429–1441.
48. Rogers, Stoner. Assessment of a modified sandwich estimator for generalized estimating equations with application to opioid poisoning in MIMIC-IV ICU patients. Stats 2021; 4: 650–664.
49. Schuler, Rose. Targeted maximum likelihood estimation for causal inference in observational studies. Am J Epidemiol 2017; 185: 65–73.
50. Chernozhukov, Chetverikov, Demirer, et al. Double/debiased machine learning for treatment and causal parameters, 2017. DOI: https://doi.org/10.48550/arXiv.1608.00060.
51. Cai, van der Laan. Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (LASSO) estimator. Int J Biostat 2020; 16: 20170070. DOI: https://doi.org/10.1515/ijb-2017-0070.
52. Petersen, Schwab, Gruber, et al. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. J Causal Inference 2014; 2: 147–185.
53. Zheng, Petersen, van der Laan. Doubly robust and efficient estimation of marginal structural models for the hazard function. Int J Biostat 2016; 12: 233–252.
54. Ertefaie, Hejazi, van der Laan. Nonparametric inverse probability weighted estimators based on the highly adaptive lasso. Biometrics 2023; 79: 1029–1041.
55. Tan. Sensitivity models and bounds under sequential unmeasured confounding in longitudinal studies, 2023. DOI: https://doi.org/10.48550/arXiv.2308.15725.


Data simulation setting specifications	$n$: number of patients
	$n_{v} = 5$: number of visits
	$t_{j} = 0, \dots, n_{v} - 1$: visit time for visit $j$
	$\alpha_{a}$: intercept in the treatment model, representing the baseline rate of treatment initiation
	$\alpha_{c}$: coefficient that describes the strength of confounding due to the time-varying variable $X_{1, t_{j}}$
	$\alpha_{y}$: intercept in the discrete-time hazard model, representing the baseline hazard

Time-varying confounder	$X_{1, t_{j}} \sim N(Z_{t_{j}} - 0.3 A_{t_{j - 1}}, 1)$, where $A_{t_{-1}} \equiv 0$ and $Z_{t_{j}} \sim N(0, 1)$.
Time-invariant confounder	$X_{2} \sim N(0, 1)$.

Treatment	$\operatorname{logit}\{\Pr(A_{t_{j}} = 1 \mid A_{t_{j - 1}}, X_{1, t_{j}}, X_{2}, Y_{t_{j - 1}} = 0)\} = \alpha_{a} + 0.05 A_{t_{j - 1}} + \alpha_{c} X_{1, t_{j}} + 0.2 X_{2}$, where $Y_{t_{-1}} \equiv 0$.
Discrete-time hazard of the outcome event	$\operatorname{logit}\{\Pr(Y_{t_{j}} = 1 \mid A_{t_{j}}, X_{1, t_{j}}, X_{2}, Y_{t_{j - 1}} = 0)\} = \alpha_{y} - 0.5 A_{t_{j}} + \alpha_{c} X_{1, t_{j}} + X_{2}$.

Trial eligibility	$E_{t_{j}} = 1$ if the patient has not received treatment before $t_{j}$ and has not experienced the outcome event before $t_{j}$; $E_{t_{j}} = 0$ otherwise.
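
The specification above can be read as a visit-by-visit simulation loop. The following is a minimal Python sketch of that loop, not the authors' implementation (the article reports using R). The function names `expit`, `simulate_patient` and `simulate_cohort`, the row layout, and the default values of `alpha_a`, `alpha_c` and `alpha_y` are all assumptions made for illustration; the table does not fix the values of these three parameters.

```python
import math
import random


def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_patient(rng, alpha_a, alpha_c, alpha_y, n_visits=5):
    """Simulate one patient's visits until the outcome event or the last visit."""
    x2 = rng.gauss(0.0, 1.0)      # time-invariant confounder X_2 ~ N(0, 1)
    a_prev = 0                    # A_{t_{-1}} is defined as 0
    ever_treated = False
    rows = []
    for j in range(n_visits):
        # The loop stops at the first event, so reaching visit j implies the
        # patient is event-free; eligibility then only requires no prior treatment.
        eligible = int(not ever_treated)
        z = rng.gauss(0.0, 1.0)                 # Z_{t_j} ~ N(0, 1)
        x1 = rng.gauss(z - 0.3 * a_prev, 1.0)   # X_{1,t_j} ~ N(Z - 0.3 A_prev, 1)
        p_treat = expit(alpha_a + 0.05 * a_prev + alpha_c * x1 + 0.2 * x2)
        a = int(rng.random() < p_treat)
        p_event = expit(alpha_y - 0.5 * a + alpha_c * x1 + x2)
        y = int(rng.random() < p_event)
        rows.append({"visit": j, "eligible": eligible, "x1": x1, "x2": x2,
                     "treated": a, "event": y})
        if y == 1:
            break                 # no further visits after the outcome event
        a_prev = a
        ever_treated = ever_treated or a == 1
    return rows


def simulate_cohort(n, seed=0, alpha_a=-1.0, alpha_c=0.5, alpha_y=-3.0):
    """Simulate n patients; the alpha defaults are illustrative placeholders only."""
    rng = random.Random(seed)
    return [simulate_patient(rng, alpha_a, alpha_c, alpha_y) for _ in range(n)]


if __name__ == "__main__":
    cohort = simulate_cohort(200, seed=1)
    n_events = sum(rows[-1]["event"] for rows in cohort)
    print(f"patients: {len(cohort)}, patients with an event: {n_events}")
```

In this sketch, lowering `alpha_y` reduces the event rate and lowering `alpha_a` reduces treatment prevalence, which is how the low-event-rate and low-prevalence scenarios would be reached under these assumptions.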

Outcome event rate	Sample size	Bootstrap	LEF outcome	LEF both	Jackknife Wald	Jackknife MVN	Sandwich
Low	200	2.44	2.05	1.84	1.56	2.78	1.00
	1000	2.80	2.10	1.93			1.00
	5000	4.52	3.08	2.86			1.00
Medium	200	2.39	2.03	1.83	1.54	2.76	1.00
	1000	2.75	2.12	1.95			1.00
	5000	4.35	3.11	2.85			1.00
High	200	2.40	2.05	1.85	1.52	2.73	1.00
	1000	2.72	2.14	1.97			1.00
	5000	4.11	3.07	2.83			1.00

Inference procedures in sequential trial emulation with survival outcomes: Comparing confidence intervals based on the sandwich variance estimator, bootstrap and jackknife


Supplemental Material

sj-pdf-1-smm-10.1177_09622802251356594 - Supplemental material for Inference procedures in sequential trial emulation with survival outcomes: Comparing confidence intervals based on the sandwich variance estimator, bootstrap and jackknife
