Sage Journals: Discover world-class research

Abstract

Background:

Stepped wedge designs are a specific type of cluster randomised trial design that requires all clusters to start and end the study at the same time. Batched stepped wedge designs are a variant that allows clusters to commence the study in batches instead of all at once. Since different batches of clusters may have different characteristics, it is possible that the treatment effect differs across batches. Here, we explore different approaches for accommodating this treatment effect heterogeneity when analysing data from a batched stepped wedge design where each participant provides a single outcome measurement: linear mixed models with a random coefficient for the intervention for each batch of clusters with estimation via restricted maximum likelihood; and meta-analysis approaches that pool the batch-specific treatment effect estimates using fixed- or random-effects meta-analysis approaches. In many batched stepped wedge designs, the number of batches will be low (e.g. 2); it is unclear what impact the small number of batches may have on bias and confidence interval coverage of these approaches.

Methods:

We simulated continuous outcome data from a range of batched stepped wedge designs with two and five batches, different within-cluster correlation structures (exchangeable, block-exchangeable), with and without period effects shared across batches, and with and without treatment effect heterogeneity across batches. Using the ‘lme4’ and ‘meta’ packages in R 4.4.1, we analysed each simulated data set using both linear mixed effects models (with and without treatment effect heterogeneity) and a range of meta-analysis approaches. We assessed the relative performance of all methods to estimate the true treatment effect using rates of convergence, bias, confidence interval coverage, and empirical standard error.

Results:

Treatment effect estimates from all meta-analysis approaches were unbiased. When period effects were incorrectly specified, treatment effects from linear mixed effects models were biased; when correctly specified, treatment effect estimates from these models were unbiased. Confidence interval coverage was similar for all random-effects meta-analysis approaches and close to the nominal level. For fixed-effect meta-analysis and linear mixed models, coverage depended on whether treatment effect heterogeneity was present.

Conclusion:

When batched stepped wedge trials include a small number of batches (two and five batches), linear mixed models provide unbiased estimates of treatment effects when correctly specified, but exhibit under-coverage of confidence intervals when treatment effect heterogeneity is present. Random-effects meta-analysis approaches provide unbiased estimates and nominal confidence interval coverage whether or not period or treatment effects vary across batches, but can yield excessively wide confidence intervals. When batched stepped wedge trials include small numbers of batches, we recommend the use of random-effects meta-analysis approaches for the analysis of continuous outcomes.

Keywords

Cluster randomised trial linear mixed effect model meta-analysis simulation study stepped wedge trial treatment effect heterogeneity

Background

Cluster randomised trials randomise groups (clusters) of individuals (e.g. hospitals, schools, and villages) rather than individual participants.¹ Stepped wedge trials are particular variants of cluster randomised trials, where all clusters start implementing the control condition, and at pre-specified times clusters switch to the intervention condition, in a randomised order, until all clusters have crossed over to the intervention (see Figure 1).² Stepped wedge designs can reduce the number of clusters required to assess interventions compared with the standard cluster randomised design: the treatment effect estimator combines between-cluster and within-cluster comparisons.³ However, stepped wedge designs require that all clusters start and end the study at the same time. Consequently, all clusters must be recruited with ethics approvals and data collection procedures in place before the trial can start.^2,4,5

Figure 1.

Top row: an example of a stepped wedge trial design with four periods and three sequences of treatment allocation (0 indicates control period; 1 indicates intervention period); multiple clusters can be randomised to each sequence. Middle and bottom rows: examples of batched stepped wedge trial designs with three batches, four periods, three sequences, and differing degrees of period overlap. Design 1 has no overlap between batches; Design 2 has one period of overlap between batches.

The batched stepped wedge design allows clusters to commence the study in batches, allowing for staggered cluster recruitment (examples in Figure 1).⁵ If one batch starts before the previous batch has finished, it is said that there is ‘overlap between batches’ (e.g. there is one period of overlap between successive batches in the bottom row of Figure 1). Kasza et al⁵ showed that when a linear mixed model (LMM) is assumed for the outcome, batched stepped wedge design power remains unaffected by delays in batch commencement, provided separate period effects are included for each batch in the outcome regression model. However, this assumes a constant treatment effect across all batches – which may not hold depending on the context. For example, if different batches correspond to different geographical locations, or if there are characteristics that may influence the ability of a cluster to fully implement the intervention, it may be expected that the treatment effect differs across batches. Treatment effect heterogeneity could be incorporated by including a random effect of treatment for each batch (other models are possible).⁶ However, when the number of batches is small, it is unclear whether the restricted maximum likelihood estimator for the treatment effect will yield nominal levels of coverage, even if the number of clusters is large.

An alternative approach to analyse the results from a batched stepped wedge trial (targeting the same parameters) is to use meta-analysis, which combines the results of multiple studies into summary measures.⁷ A two-stage meta-analytic approach for batched stepped wedge trial analysis would first analyse the data from each batch separately and then combine these results using standard meta-analytic techniques. An example of a batched stepped wedge study adopting this approach is the EAGLE trial, which assessed if a standardised quality improvement intervention could reduce anastomotic leak in patients undergoing right colectomy.⁸ The EAGLE trial randomised 355 hospitals from 64 countries to the three sequences of an incomplete variant of a stepped wedge design in 16 batches.⁸ To estimate the treatment effect, the EAGLE trial applied a meta-analytic approach: fitting models to the data from each batch, and then combining these estimates using random-effects meta-analysis.^8,9

In situations where the assumption of a constant treatment effect across all batches may be questionable, meta-analysis may yield less biased estimates of the treatment effect (defined below) than standard LMMs; however, it is unclear in which circumstances meta-analysis may be beneficial. In this article, we conduct a simulation study to investigate when a meta-analysis approach may yield less biased and more efficient estimation of the treatment effect than the LMM approach when considering a continuous outcome from a batched stepped wedge design. We generated continuous outcomes from a range of batched stepped wedge designs assuming a range of within-cluster correlation structures and scenarios (when period effects do and do not differ across batches, and when the treatment effect does and does not randomly vary across batches). Data sets were analysed using LMMs and meta-analysis procedures. We also apply the various methods to analyse a specific trial.

Methods

Models

We assume that each participant provides a single continuous outcome measurement, represented by $Y_{bkti}$ , for participant $i = 1, \dots, m$ in period $t = 1, \dots, T$ from cluster $k = 1, \dots, K$ in batch $b = 1, \dots, B$ . Here, $T$ is the number of periods in a single batch of the design, constant across batches. $X_{bkt}$ is a binary indicator for the treatment group (1 indicates intervention; 0 indicates control) for cluster $k$ in period $t$ in batch $b$ ; $θ$ is the average of the treatment effects from a hypothetical population of batches of clusters.

We consider two within-cluster correlation structures: exchangeable and block-exchangeable. These two structures encode the within-period intracluster correlations (ICCs; the correlation between two observations in the same cluster in the same period) and cluster autocorrelations (CACs; the ratio of the correlation between two observations in the same cluster in adjacent periods to the ICC; these are equal to 1 for the exchangeable model) assumed in the model. Table 1 provides the ICCs and CACs for the models we consider.

Model 1: Exchangeable within-cluster correlation structure

Model 1 is given by¹⁰

\begin{matrix} Y_{bkti} = β_{bt} + (θ + η_{b}) X_{bkt} + α_{bk} \\ + ε_{bkti}, ε_{bkti} ~ N (0, σ_{ε}^{2}), α_{bk} ~ N (0, σ_{α}^{2}), \\ \begin{matrix} η_{b} ~ N (0, σ_{η}^{2}), \end{matrix} \end{matrix}

(1)

where $β_{bt}$ is the fixed period effect corresponding to batch $b$ and period $t$ ; $α_{bk}$ is the random effect corresponding to cluster $k$ in batch $b$ ; $η_{b}$ is the random coefficient of the treatment effect for batch $b$ ; $ε_{bkti}$ are the subject-specific random errors.

Table 1.

Intracluster correlations (ICC) and cluster autocorrelations (CAC) for Models 1 and 2. In the presence of treatment effect heterogeneity ( $σ_{η}^{2} \neq 0$ ), these depend on the cluster-period treatment allocations. Dependence of observations within the same batch but in different clusters is also induced (not shown). The intracluster correlation (ICC) measures the correlation between two participants within the same cluster (and batch) and time period and may differ depending on whether the cluster is under the control or intervention condition. The cluster autocorrelation (CAC) is defined as the ratio of the correlation between two observations from the same cluster in adjacent periods to the ICC. CAC = 1 implies the same within-cluster correlation between any participants in an adjacent period and those in the same period. Under an exchangeable correlation structure, this condition holds for all adjacent pairs of periods. The value of the CAC, depending on the numerator ( $Corr (Y_{bkti}, Y_{bk (t + 1) j})$ ), may vary across three scenarios: first, control in both periods ( $X_{bkt} = X_{bk (t + 1)} = 0$ ); second, intervention in both periods ( $X_{bkt} = X_{bk (t + 1)} = 1$ ); third, control in the first period and intervention in the adjacent period ( $X_{bkt} = 0, X_{bk (t + 1)} = 1$ ).

	ICC $Corr (Y_{bkti}, Y_{bktj})$ $X_{bkt} = 0$	CAC $\frac{Corr (Y_{bkti}, Y_{bk (t + 1) j})}{Corr (Y_{bkti}, Y_{bktj})}$ $X_{bkt} = X_{bk (t + 1)} = 0$	ICC $Corr (Y_{bkti}, Y_{bktj})$ $X_{bkt} = 1$	CAC $\frac{Corr (Y_{bkti}, Y_{bk (t + 1) j})}{Corr (Y_{bkti}, Y_{bktj})}$ $X_{bkt} = X_{bk (t + 1)} = 1$	CAC $\frac{Corr (Y_{bkti}, Y_{bk (t + 1) j})}{Corr (Y_{bkti}, Y_{bktj})}$ $X_{bkt} = 0, X_{bk (t + 1)} = 1$
1. Exchangeable model	$\frac{σ_{α}^{2}}{σ_{α}^{2} + σ_{ε}^{2}}$	1	$\frac{σ_{η}^{2} + σ_{α}^{2}}{σ_{η}^{2} + σ_{α}^{2} + σ_{ε}^{2}}$	1	$\frac{\sqrt{σ_{α}^{2} + σ_{ε}^{2}}}{\sqrt{σ_{η}^{2} + σ_{α}^{2} + σ_{ε}^{2}}}$
2. Block-exchangeable model	$\frac{σ_{α}^{2} + σ_{γ}^{2}}{σ_{α}^{2} + σ_{γ}^{2} + σ_{ε}^{2}}$	$\frac{σ_{α}^{2}}{σ_{α}^{2} + σ_{γ}^{2}}$	$\frac{σ_{η}^{2} + σ_{α}^{2} + σ_{γ}^{2}}{σ_{η}^{2} + σ_{α}^{2} + σ_{γ}^{2} + σ_{ε}^{2}}$	$\frac{σ_{η}^{2} + σ_{α}^{2}}{σ_{η}^{2} + σ_{α}^{2} + σ_{γ}^{2}}$	$\frac{σ_{α}^{2} \sqrt{σ_{α}^{2} + σ_{γ}^{2} + σ_{ε}^{2}}}{(σ_{α}^{2} + σ_{γ}^{2}) \sqrt{σ_{η}^{2} + σ_{α}^{2} + σ_{γ}^{2} + σ_{ε}^{2}}}$

The cluster random effect, $α_{bk}$ , accounts for clustering. The inclusion of $η_{b}$ allows the treatment effect to vary by batch. If $σ_{η}^{2} = 0$ , there is no treatment effect heterogeneity across batches.

We assume separate period effects for each batch. However, when generating and analysing data sets, if there is overlap between batches, we will also consider variants where period effects are shared across batches.

Model 2: Block-exchangeable within-cluster correlation structure

Model 2 is given by

\begin{matrix} Y_{bkti} = β_{bt} + (θ + η_{b}) X_{bkt} + α_{bk} + γ_{bkt} \\ + ε_{bkti}, ε_{bkti} ~ N (0, σ_{ε}^{2}), α_{bk} ~ N (0, σ_{α}^{2}), \\ \begin{matrix} γ_{bkt} ~ N (0, σ_{γ}^{2}), η_{b} ~ N (0, σ_{η}^{2}) . \end{matrix} \end{matrix}

(2)

This extends Model 1, by including the cluster-period random effect, $γ_{bkt}$ ,^11,12 allowing the correlation between observations from the same cluster but different periods to be less than those within the same period. When $σ_{γ}^{2} = 0$ , this model is equivalent to Model 1.

Later, we also consider LMMs with a discrete-time decay within-cluster correlation structure (in the Disinvestment example section). Details about the discrete-time decay within-cluster correlation structure are provided in Section S4 of the Supplemental Material.

Meta-analysis approaches

Meta-analysis combines the results of the analysis of multiple studies into summary measures. There are two common meta-analysis models: fixed-effect and random-effects.⁷ These should not be confused with fixed and random effects (i.e. mixed effects) models (e.g. Models 1 and 2) commonly used to analyse cluster randomised trials. In the context of batched stepped wedge designs, each batch constitutes a separate study, with treatment effects estimated separately for each batch and then combined. The target parameters in these approaches parallel those from LMMs in the previous section.

The fixed-effect meta-analysis model assumes there is one true treatment effect shared across batches, $θ; if {\hat{θ}}_{b}$ is the estimated treatment effect for batch $b$ , the difference in the estimates between studies is assumed to be only due to estimation error. That is, ${\hat{θ}}_{b} = θ + ϵ_{b}$ ; $ϵ_{b}$ is the error associated with the estimation of $θ$ in batch $b = 1, \dots, B$ .

The random-effects meta-analysis model instead assumes there is a distribution of true treatment effects across batches with mean $θ$ : ${\hat{θ}}_{b} = (θ + τ_{b}) + ϵ_{b}$ , where the difference in the treatment effect between batches is represented through the random batch effect, $τ_{b}$ , assumed to follow a normal distribution with mean zero and some between-batch variance $σ_{τ}^{2}$ ; $ϵ_{b}$ is once again the random error.

To analyse batched stepped wedge data using meta-analysis, the LMMs applied to each batch would be as in Models 1 and 2, with the exclusion of $η_{b}$ and the inclusion of separate period effects for each batch. When random-effect meta-analysis is applied, the between-study variance ( $σ_{τ}^{2}$ ) must be estimated; numerous methods are available to estimate this and construct the 95% confidence interval for $θ$ .^13,14 The approaches we applied for between-study variance estimation and 95% confidence interval construction are described below.

Simulation study design

We aimed to compare the bias and statistical efficiency of LMM and meta-analysis approaches in estimating the treatment effect $θ$ when the treatment effect does and does not vary across batches in a batched stepped wedge design.

Data generation

We set the total variance equal to 1. Setting $σ_{η}^{2} = 0.025$ implies that 95% of the true intervention effects for batches would be expected to lie between $\pm 1.96 \sqrt{0.025} = \pm 0.31$ of $θ$ .

We conducted a factorial simulation study, and for every combination of parameters in Table 2 (2,304 combinations), we generated 500 data sets. With 500 data sets for each combination, the simulation study could be completed in a timely fashion, and the Monte Carlo standard error on an estimated coverage of 95% is <1%.

Table 2.

The parameters used when generating data for the simulation study.

Parameter	Meaning	Values
$T$	Number of periods in each stepped wedge design	4, 6
$K$	Number of clusters randomised to each sequence	2, 5
$m$	Number of participants per cluster-period	20, 100
$B$	Number of batches	2, 5
$N_{olap}$	Number of periods of overlap between successive batches	1, $T$ − 2
$θ$	Effect size	0, 0.2
$σ_{η}^{2}$	Variance of effect across batches	0, 0.025
$β_{bt}$	Period effects	1. Shared across batches ( $β_{t}$ ): given periods $t = 1, 2, \dots, T * B$ , $β_{t} = 0.1 * t$ for all $b$ ; 2. Separate period effects for each batch ( $β_{bt}$ ): ${β_{bt}}^{~} N (0, 1)$ that would have random differences between periods within batches and across batches.
$corrstruct$	Within-cluster correlation structure	Exchangeable or block-exchangeable. When CAC = 1, this is equivalent to an exchangeable within-cluster correlation structure
$ICC$	Intracluster correlation	0.01, 0.05, 0.2
$CAC$	Cluster autocorrelation	1, 0.9, 0.5

Estimand and analysis methods

The estimand was the average treatment effect across the population of batches, represented by the parameter $θ$ in both the LMM and meta-analysis models.

Each simulated data set was analysed by fitting LMMs via restricted maximum likelihood estimation (REML) and using meta-analysis approaches. For each of these approaches, only models with the within-cluster correlation structure matching the data-generating model were fit.

For each simulated data set, four LMMs were fit:

1. Model A: Shared period effects; no treatment effect heterogeneity across batches ( $σ_{η}^{2} = 0$ ).

2. Model B: Separate period effects; no treatment effect heterogeneity across batches ( $σ_{η}^{2} = 0$ ).

3. Model C: Shared period effects; treatment effect heterogeneity across batches ( $σ_{η}^{2} \neq 0$ ).

4. Model D: Separate period effects; treatment effect heterogeneity across batches ( $σ_{η}^{2} \neq 0$ ).

When applying the meta-analysis methods, LMMs were fitted to data from each batch separately, and the results pooled using the fixed-effect or random-effects meta-analysis. To observe the behaviour of multiple specifications for random-effects meta-analysis, we applied every possible combination of methods available in the R ‘meta’ package:^15–19

• Between study variance estimation: REML estimator; Sidik–Jonkman (SJ) estimator.

• 95% confidence interval (CI) construction: Wald-type CI; Hartung and Knapp and Sidik–Jonkman (HK) method; Kenward–Roger (KR) method.

• For the HK 95% CI method, we also considered variations of ad hoc adjustments available in the meta package, referred to in the package as: no ad hoc adjustment; HK ad hoc adjustment; IQWiG ad hoc adjustment. The details of the ad hoc approaches are provided in Section 1 of the Supplemental Material.

Analyses were conducted in R 4.4.1,²⁰ with LMMs fit using the ‘lme4’ package²¹ and meta-analysis via the ‘meta’ package.¹⁹ Results are displayed using nested loop plots constructed with the ‘looplot’ package.²² For the Disinvestment example, the LMMs with discrete-time decay within-cluster correlation structure were fit using the ‘glmmTMB’ package.²³ Code to replicate our simulation study is available at https://github.com/DixonOrlando/BatchSW-MA.git.

Performance measures

To compare method performance, we calculated: bias of the estimate of the treatment effect, 95% CI coverage, empirical standard error, and rates of convergence.²⁴ Monte Carlo standard errors were calculated for each performance measure.²⁴ Non-convergence was identified by convergence warnings or errors printed by the lme4 command, with non-convergence indicated by warnings or errors indicating non-convergence. For the meta-analysis approaches, we defined non-convergence as occurring if none of the LMMs fitted for each batch converged. Some random-effects meta-analysis CI approaches have been found to yield very wide CIs.^25–28 To give some indication of when CIs obtained for meta-analysis approaches were very wide, we also calculated the percentage of simulations for which the selected random-effects meta-analysis (described below) produced a CI wider than the extremes of the confidence intervals for the estimates obtained from each individual batch. Schulz et al.²⁸ stated such estimates should be considered ‘noninformative’.

Results

None of the considered performance measures depended on the effect size $θ$ , so we only present results for $θ = 0$ . We present results when data were generated with separate period effects for each batch in the main article, and provide results when data were generated with shared period effects across batches in the Supplemental Material.

Meta-analysis results

To maintain simplicity when reporting results, we present results for the fixed-effect meta-analysis and one random-effects meta-analysis approach in the main article, with results for all other random-effects meta-analysis approaches in Section S1 of the Supplemental Material. Across all random-effects meta-analysis approaches, bias and empirical standard error results were indistinguishable (bias: Supplemental Figures S1–S2 and S7–S8; empirical standard error: Supplemental Figures S5–S6 and S11–S12). Coverage depended on CI construction method, but for each CI construction method, both between-study variance estimation methods (REML and SJ) produced similar results (Supplemental Figures S3–S4 and S9–S10). Among all CI construction methods, HK with no ad hoc adjustment produced the best results across all performance metrics. Hence, this specification was chosen to represent the random-effects meta-analysis, with REML between-study variance estimation.

Non-convergence

Supplemental Figures S13–S14 display the rates of non-convergence for LMMs, indicating relatively high rates of non-convergence of the LMM approaches (up to 40%). Non-convergence rates were generally highest when more complex models were fit. While for some meta-analysis approaches one or more individual linear mixed models failed to converge, we only consider instances where none of the models fit to each batch converged as instances of non-convergence. This means that in some rare situations, where only an estimate from a single batch was available, no meta-analysis was undertaken. Using this definition, the meta-analysis approaches had no instances of non-convergence. Additional discussion of the rates of convergence is in Section S2 of the Supplemental material.

All results below are based on replicates with model convergence.

Bias

Figures 2 and Supplemental Figures S15 and S16 show bias for all models when the data-generating models assumed separate and shared period effects, respectively. Meta-analysis approaches were always unbiased, and the LMM approach produced unbiased estimates when period effects were correctly parameterised: in Figure 2, where data-generating models included separate period effects per batch, LMMs A and C incorrectly parameterised the period effects, and only these models showed biased results across most scenarios. Omitting true treatment effect heterogeneity did not impact bias.

Figure 2.

Nested loop plot displaying bias for Models A–D, the fixed-effect and random-effects meta-analysis approaches, when data were generated with separate period effects and null effect size. The simulation results are arranged based on the order of the loops. Loops in a nested loop plot represent the simulation parameters. The loops are ordered from the bottom to the top of the plot: the order of the loops starts from observations per cluster period, then treatment effect heterogeneity, CAC, ICC, periods (T), and finally clusters (K). Models A and B assume no treatment effect heterogeneity across batches; Models C and D allow for treatment effect heterogeneity across batches. Models A and C assume shared period effects across batches; Models B and D allow for separate period effects across batches. The overall pattern repeats across all possible combinations of the number of batches (horizontal grid) and number of overlaps (vertical grid); to simplify the results displayed in the figure, the scenarios are restricted to only one combination of number of batches and overlap. For more details about other scenarios, refer to Supplemental Figure S16.

Coverage

Figure 3 and Supplemental Figures S18 and S19 show coverage when data-generating models assumed separate and shared period effects, respectively. Figure 3 does not include coverage results for Models A and C (models with shared period effects), since these were shown to be biased for these scenarios. Overall, only the selected random-effects meta-analysis method achieved nominal coverage rates in almost all scenarios. When there was no treatment effect heterogeneity, all models had close to nominal coverage. When treatment effect heterogeneity was present, only Model D (and Model C when data were generated with shared period effects across batches) achieved coverage closest to 95%, although neither reached nominal levels. For Models D and C, coverage improved when the number of batches was higher (5 vs 2 batches), but nominal levels were still not attained. Not using small-sample corrections to account for a small number of clusters may have contributed to the under-coverage, but the number of batches had a much greater influence: increasing the number of clusters did not improve coverage, whereas increasing the number of batches did. This suggests that, for LMMs, correct specification of treatment effect heterogeneity is not enough to achieve nominal coverage; an adequate number of batches are also required, suggesting potential small-sample issues. Overparameterisation of period effects did not appear to impact the coverage of the LMMs.

Figure 3.

Nested loop plot displaying coverage of 95% confidence intervals for models with unbiased treatment effect estimates, when data were generated with separate period effects, null effect size, and number of overlapping periods equal to one. The simulation results are arranged based on the order of the loops. Loops in a nested loop plot represent the simulation parameters. The loops are ordered from the bottom to the top of the plot: the order of the loops starts from observations per cluster period, then treatment effect heterogeneity, CAC, ICC, periods (T), and finally clusters (K). Model B assumes no treatment effect heterogeneity across batches; Model D allows for treatment effect heterogeneity across batches. Models B and D allow for separate period effects across batches. The restriction for number of overlaps was performed because the results for coverage were similar across different numbers of overlaps. For complete results for all the scenarios, refer to Supplemental Figure S18.

The selected random-effects meta-analysis method tended to produce very wide confidence intervals when the number of batches was small (two batches), and this issue was exacerbated by the presence of treatment effect heterogeneity (Supplemental Figures S22–S25).

Empirical standard error

Empirical standard error results are in Supplemental Figures S20 and S21. Empirical standard errors reduce as the sample size increases and when treatment effect heterogeneity was not present. All methods were similar when data were generated with separate period effects across batches. When data were generated with shared period effects across batches, Models A and C (with shared period effects) had the lowest empirical standard error. The ICC affected empirical standard errors, but the effect depended on CAC: when CAC was low, higher ICCs increased empirical standard errors; as CAC rose, the ICC had less impact.

Disinvestment trial example

We consider data from a stepped wedge trial conducted in Dandenong and Footscray Hospitals in Australia.²⁹ This trial assessed whether ceasing routinely provided health services on weekends (i.e. ‘disinvestment’ from services) led to inferior outcomes for patients. Although not labelled a batched stepped wedge trial, each hospital can be considered a separate batch: randomisation of wards to sequences was stratified by hospital, with Footscray starting after Dandenong. Figure S26 displays the trial schematic with the number of participants in each ward in each period of the study (14,834 observations in total: Dandenong, 8,120 and Footscray, 6,714).

Here, we analyse patient length of stay using the LMMs and meta-analysis procedures with three within-cluster correlation structures. Supplemental Tables S2 and S3 provide the estimates of the ICCs and CACs obtained when applying Models B and D. Figure S27 displays the results of the application of each method, indicating the similarity of the estimated treatment effect across approaches, with much wider confidence intervals associated with random-effects meta-analysis. Table 3 presents Model D (separate period effects; treatment effect heterogeneity) results: estimates for each batch separately and their combination using random-effects meta-analysis with REML between-study variance estimation and confidence intervals constructed using HK with no ad hoc adjustment.

Table 3.

Individual hospital results before combining using the meta-analysis approach, results from Model D, and random-effects meta-analysis with ‘REML’ between-study variance estimation method and HK with no ad hoc adjustment CI construction.

Within-cluster correlation structure	Dandenong hospital: estimate (standard error); (95% CI)	Footscray hospital: estimate (standard error); (95% CI)	Model D estimates (95% CI)	Random-effects meta-analysis estimates (95% CI)
Exchangeable	0.906 (0.047); (0.814, 0.998)	0.290 (0.037); (0.218, 0.362)	0.60 (−0.001, 1.20)	0.59 (−3.32, 4.50)
Block-exchangeable	0.935 (0.067); (0.804, 1.066)	0.345 (0.102); (0.144, 0.545)	0.64 (0.03, 1.26)	0.67 (−3.06, 4.40)
Discrete-time decay	0.964 (0.073); (0.821, 1.107)	0.635 (0.107); (0.425, 0.846)	0.76 (0.32, 1.20)	0.83 (−1.23, 2.89)

Model D results indicate that disinvestment from weekend services led to an increase in mean length of stay of around 0.7 days, and the data support a mean difference from around −0.001 to 1.26 days; however, only one correlation structure has a confidence interval including 0. The meta-analysis results have wider confidence intervals for the difference in length of stay than those for Model D. This exemplifies the simulation study results where the width of confidence intervals associated with fitting a standard LMM will likely be narrower in settings such as this, while the random-effects meta-analysis approach is likely to yield wider confidence intervals. Hence, the possibility of an important increase, decrease, or no important changes in mean length of stay after disinvestment from weekend services cannot be excluded.

Conclusion

The batched stepped wedge design was only recently described, and the most appropriate statistical analysis methods for this design are not yet well-understood. Previous recommendations for analysis of batched stepped wedge trials did not consider the possibility of differing treatment effects across batches.

We have provided an initial investigation into the application of meta-analysis techniques when treatment effects vary by batch. Through a simulation study, we compared the bias and confidence interval coverage of various meta-analysis approaches to those of the LMMs. We found that the random-effects meta-analysis with either ‘REML’ or ‘SJ’ between-study variance estimation methods and confidence intervals constructed using the Hartung–Knapp–SJ approach with no ad hoc adjustment performed best among all considered approaches. For all considered scenarios, this method produced unbiased estimates and nominal coverage rates. Other meta-analysis approaches did not perform as well. For the LMMs considered, bias and empirical standard error were similar to the meta-analysis approach when models were correctly specified, but coverage depended on correctly specifying treatment effect heterogeneity and the number of batches. ‘Correctly specified’ here includes allowing for differing period effects across batches: we, therefore, recommend always allowing period effects to vary across batches when analysing batched stepped wedge trials with LMMs. Furthermore, LMM approaches had high non-convergence rate, especially as the fitted model became more complex (more period effects and/or more complex within-cluster correlation structure).

Random-effects meta-analysis approaches are designed to combine results from (relatively) small numbers of studies with varied treatment effects.⁷ Hence, we may expect random-effects meta-analysis approaches to perform well when there are ≥5 batches; however, as shown, CIs can be excessively wide when there are two batches (as has been pointed out).²⁷ Poor performance of LMMs is likely due to the small number of batches in our simulations (2 or 5). A small number of batches are not unusual: of the six trials that have cited the batched stepped wedge paper to date, all but EAGLE included≤4 batches.^30–35 It is well-known that LMMs exhibit poor performance in the analysis of data from stepped wedge designs when the number of clusters is small, and small-sample corrections are recommended.^36–40 Here, we have shown that issues associated with the use of LMMs with treatment effect heterogeneity across batches may also be subject to small-sample issues, where small sample here means a small number of batches. Additional simulations with 30 batches confirmed this finding, with coverage rates much closer to 95%. Given the higher rates of non-convergence for Models C and D (including treatment effect heterogeneity), and when a higher number of period effects were fitted, we recommend some caution in interpreting results associated with those models.

Our study has limitations. First, we did not consider data generated from the model with a discrete-time decay within-cluster correlation structure: primarily due to computational limitations. Given that our focus was on correct specification of the within-cluster correlation structure, we expect that the discrete-time decay model would yield similar results to the two correlation structure models examined in this study; we may observe higher rates of non-convergence. The trial designs we considered included scenarios with total clusters <50, and we did not apply any small-sample corrections when fitting LMMs – doing so may have improved coverage for scenarios with smaller number of clusters.³⁸ Nor did we consider the small number of batches issue: further research is required to understand adjustment of standard errors and CI widths. Our measure for identifying ‘excessively wide’ confidence intervals provided an indicator of this, but this measure is somewhat crude: wide confidence intervals may be necessary to capture variability across batches. Alternative models that include random intercepts for batches, in addition to random intervention coefficients, or models with separate fixed treatment effects for each batch, could be considered. Another limitation of this work is that we have not considered the implications of these approaches in designing batched stepped wedge trials: what are the implications for sample size and statistical power? Finally, we only considered continuous outcomes: binary or count outcomes are common in stepped wedge designs.⁴¹ These are possible directions for future research.

Here, we have focused on treatment effects varying across the batches of batched stepped wedge designs, with an average treatment effect that does not vary with time or exposure duration. However, the important issue of time-varying treatment effects in stepped wedge designs has been the subject of much recent work. For example, Kenny et al.⁴² highlighted the vulnerabilities of stepped wedge designs to the assumption of a constant treatment effect, with alternative models and treatment effects that accommodate time-varying effects presented by Maleyeff et al.⁴³ and Lee et al.⁴⁴ The impact of time-varying treatment effects on the analysis of batched stepped wedge designs requires further investigation.

We found that random-effects meta-analysis with either the ‘REML’ or ‘SJ’ estimation method for between-study variability, and HK confidence interval construction with no ad hoc adjustment performed the best among all approaches when analysing continuous outcomes from batched stepped wedge designs. Hence, we recommend using random-effects meta-analysis with these specifications; particularly when batch characteristics may be related to the outcome and treatment effects. Even in the absence of treatment effect heterogeneity, this approach performs well. However, caution is required when combining results of only two batches, as confidence intervals may be very wide. We have not considered correlation structure misspecification, but we would not expect meta-analysis to exacerbate the known issues of such misspecification,⁴⁵ and recommend pre-specification of sensitivity analyses under different assumptions. While LMMs with separate period effects for each batch and treatment effect heterogeneity across batches can perform similarly to random-effects meta-analysis, given the greater risk of non-convergence, we recommend pre-specifying back-up analysis methods when this model is to be fit.

Supplemental Material

sj-docx-1-ctj-10.1177_17407745261429957 – Supplemental material for Comparing meta-analysis and linear mixed model-based approaches for the analysis of continuous outcomes from batched stepped wedge trials

Supplemental material, sj-docx-1-ctj-10.1177_17407745261429957 for Comparing meta-analysis and linear mixed model-based approaches for the analysis of continuous outcomes from batched stepped wedge trials by Dixon Orlando, Rhys Bowden, Kelsey L Grantham, Joanne E McKenzie, Andrew B Forbes and Jessica Kasza in Clinical Trials

Footnotes

Declaration of conflicting interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: J.K. is an associate editor of Clinical Trials.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by a National Health and Medical Research Council (NHMRC) Investigator Grant (GNT2033380) awarded to J.K. J.E.M. is supported by a National Health and Medical Research Council (NHMRC) Investigator Grant (GNT2009612).

ORCID iDs

Dixon Orlando

Rhys Bowden

Kelsey L Grantham

Joanne E McKenzie

Andrew B Forbes

Jessica Kasza

Supplemental material

Supplemental material for this article is available online.

References

Giraudeau

Weijer

Eldridge

, et al. Why and when should we cluster randomize. J Epidemiol Popul Health 2024; 72(1): 202197.

Hemming

Haines

Chilton

, et al. The stepped wedge cluster randomised trial: rationale, design, analysis, and reporting. BMJ 2015; 350: h391.

Matthews

JNS

Forbes

AB.

Stepped wedge designs: insights from a design of experiments perspective. Stat Med 2017; 36(24): 3772–3790.

Hemming

Copas

Forbes

, et al. What type of cluster randomized trial for which setting? J Epidemiol Popul Health 2024; 72(1): 202195.

Kasza

Bowden

Hooper

, et al. The batched stepped wedge design: a design robust to delays in cluster recruitment. Stat Med 2022; 41(18): 3627–3641.

Hemming

Taljaard

Forbes

Modeling clustering and treatment effect heterogeneity in parallel and stepped-wedge cluster randomized trials. Stat Med 2018; 37(6): 883–898.

McKenzie

Beller

Forbes

AB.

Introduction to systematic reviews and meta-analysis. Respirology 2016; 21(4): 626–637.

ESCP EAGLE Safe Anastomosis Collaborative NIHR Global Health Unit in Surgery. Evaluation of a quality improvement intervention to reduce anastomotic leak following right colectomy (EAGLE): pragmatic, batched stepped-wedge, cluster-randomized trial in 64 countries. Br J Surg 2024; 111(1): znad370.

Riley

Ensor

Hattle

, et al. Two-stage or not two-stage? That is the question for IPD meta-analysis projects. Res Synth Methods 2023; 14(6): 903–910.

10.

Hussey

Hughes

JP.

Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 2007; 28(2): 182–191.

11.

Girling

Hemming

Statistical efficiency and optimal design for stepped cluster studies under linear mixed effects models. Stat Med 2016; 35(13): 2149–2166.

12.

Hooper

Teerenstra

de Hoop

, et al. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Stat Med 2016; 35(26): 4718–4728.

13.

Veroniki

Jackson

Bender

, et al. Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis. Res Synth Methods 2019; 10(1): 23–43.

14.

Veroniki

Jackson

Viechtbauer

, et al. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res Synth Methods 2016; 7(1): 55–79.

15.

Langan

Higgins

JPT

Jackson

, et al. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res Synth Methods 2019; 10(1): 83–98.

16.

Partlett

Riley

RD.

Random effects meta-analysis: coverage performance of 95% confidence and prediction intervals following REML estimation. Stat Med 2017; 36(2): 301–317.

17.

Knapp

Hartung

Improved tests for a random effects meta-regression with a single covariate. Stat Med 2003; 22(17): 2693–2710.

18.

Institute for Quality and Efficiency in Health Care. General methods: Version 6.1. Institute for Quality and Efficiency in Health Care (IQWiG). Cologne: Institute for Quality and Efficiency in Health Care, 2022.

19.

Balduzzi

Rücker

Schwarzer

How to perform a meta-analysis with R: a practical tutorial. Evid Based Ment Health 2019; 22(4): 153–160.

20.

R Core Team: a language and environment for statistical computing. 4.4.1 2024, https://www.gbif.org/tool/81287/r-a-language-and-environment-for-statistical-computing

21.

Bates

Mächler

Bolker

, et al. Fitting linear mixed-effects models using lme4. J Stat Softw 2015; 67(1): 1–48.

22.

Kammer

looplot: an R package for create nested loop plots, 2025, https://github.com/matherealize/looplot

23.

Brooks

Kristensen

van Benthem

, et al. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J 2017; 9: 378–400.

24.

Morris

White

Crowther

MJ.

Using simulation studies to evaluate statistical methods. Stat Med 2019; 38(11): 2074–2102.

25.

Bender

Friede

Koch

, et al. Methods for evidence synthesis in the case of very few studies. Res Synth Methods 2018; 9(3): 382–392.

26.

Röver

Knapp

Friede

Hartung-Knapp-Sidik-Jonkman approach and its modification for random-effects meta-analysis with few studies. BMC Med Res Methodol 2015; 15: 99.

27.

Gonnermann

Framke

Großhennig

, et al. No solution yet for combining two independent studies in the presence of heterogeneity. Stat Med 2015; 34(16): 2476–2480.

28.

Schulz

Schürmann

Skipka

, et al. Performing meta-analyses with very few studies. Methods Mol Biol 2022; 2345: 91–102.

29.

Haines

Bowles

Mitchell

, et al. Impact of disinvestment from weekend allied health services across acute medical and surgical wards: 2 stepped-wedge cluster randomised controlled trials. PLoS Med 2017; 14(10): e1002412.

30.

Sepucha

Elmariah

Valentine

, et al. The IMproving treatment decisions for Patients with AortiC stenosis Through Shared Decision Making (IMPACT SDM) study: study protocol for a cluster randomized stepped wedge trial. Trials 2024; 25(1): 820.

31.

Boyle

Sanders

, et al. The thriving kids and parents schools project: protocol of an incomplete stepped wedged cluster randomised trial evaluating the effectiveness of a Triple P seminar series. BMC Public Health 2023; 23(1): 2021.

32.

Song

, et al. The third intensive care bundle with blood pressure reduction in acute cerebral haemorrhage trial (INTERACT3): an international, stepped wedge cluster randomised controlled trial. Lancet 2023; 402(10395): 27–40.

33.

Pereira

Noritomi

Dos Santos

, et al. Effect of Tele-ICU on clinical outcomes of critically Ill patients: the TELESCOPE randomized clinical trial. JAMA 2024; 332(21): 1798–1807.

34.

Kasza

Recasens

, et al. SCANPatient: study protocol for a multi-centre, batched, stepped wedge, comparative effectiveness, randomised clinical trial of synoptic reporting of computerised tomography (CT) scans assessing cancers of the pancreas. Trials 2024; 25(1): 388.

35.

Van Houtven

Coffman

Decosimo

, et al. A stepped wedge cluster randomized trial to evaluate the effectiveness of a multisite family caregiver skills training program. Health Serv Res 2024; 59(6): e14326.

36.

Kahan

Forbes

Ali

, et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials 2016; 17(1): 438.

37.

Leyrat

Morgan

Leurent

, et al. Cluster randomized trials with a small number of clusters: which analyses should be used? Int J Epidemiol 2018; 47(1): 321–331.

38.

Grantham

Kasza

Heritier

, et al. Evaluating the performance of Bayesian and restricted maximum likelihood estimation for stepped wedge cluster randomized trials with a small number of clusters. BMC Med Res Methodol 2022; 22(1): 112.

39.

Barker

D’Este

Campbell

, et al. Minimum number of clusters and comparison of analysis methods for cross sectional stepped wedge cluster randomised trials with binary outcomes: a simulation study. Trials 2017; 18(1): 119.

40.

Lee

Cheung

YB.

The fixed-effects model for robust analysis of stepped-wedge cluster trials with a small number of clusters and continuous outcomes: a simulation study. Trials 2024; 25(1): 718.

41.

Nevins

Ryan

Davis- Plourde

, et al. Adherence to key recommendations for design and analysis of stepped-wedge cluster randomized trials: a review of trials published 2016–2022. Clin Trials 2023; 21(2): 199–210.

42.

Kenny

Voldal

Xia

, et al. Analysis of stepped wedge cluster randomized trials in the presence of a time-varying treatment effect. Stat Med 2022; 41(22): 4311–4339.

43.

Maleyeff

Haneuse

, et al. Assessing exposure-time treatment effect heterogeneity in stepped-wedge cluster randomized trials. Biometrics 2023; 79(3): 2551–2564.

44.

Lee

Turner

Kenny

. Analysis of stepped-wedge cluster randomized trials when treatment effects vary by exposure time or calendar time. Stat Med 2025; 44(20-22): e70256.

45.

Kasza

Forbes

AB.

Inference for the treatment effect in multiple-period cluster randomised trials when random effect correlation structure is mispecified. Stat Methods Med Res 2019; 28(10-11): 3112–3122.