Abstract

Linear mixed models are commonly used in analyzing stepped-wedge cluster randomized trials. A key consideration for analyzing a stepped-wedge cluster randomized trial is accounting for the potentially complex correlation structure, which can be achieved by specifying random-effects. The simplest random effects structure is random intercept but more complex structures such as random cluster-by-period, discrete-time decay, and more recently, the random intervention structure, have been proposed. Specifying appropriate random effects in practice can be challenging: assuming more complex correlation structures may be reasonable but they are vulnerable to computational challenges. To circumvent these challenges, robust variance estimators may be applied to linear mixed models to provide consistent estimators of standard errors of fixed effect parameters in the presence of random-effects misspecification. However, there has been no empirical investigation of robust variance estimators for stepped-wedge cluster randomized trials. In this article, we review six robust variance estimators (both standard and small-sample bias-corrected robust variance estimators) that are available for linear mixed models in R, and then describe a comprehensive simulation study to examine the performance of these robust variance estimators for stepped-wedge cluster randomized trials with a continuous outcome under different data generators. For each data generator, we investigate whether the use of a robust variance estimator with either the random intercept model or the random cluster-by-period model is sufficient to provide valid statistical inference for fixed effect parameters, when these working models are subject to random-effect misspecification. Our results indicate that the random intercept and random cluster-by-period models with robust variance estimators performed adequately. 
The CR3 robust variance estimator (approximate jackknife), coupled with a degrees-of-freedom correction equal to the number of clusters minus two, consistently gave the best coverage results, but could be slightly conservative when the number of clusters was below 16. We summarize the implications of our results for the linear mixed model analysis of stepped-wedge cluster randomized trials and offer some practical recommendations on the choice of the analytic model.

Keywords

Correlation structure, mixed-effects model, model misspecification, sandwich variance estimator, small-sample correction, degrees of freedom

1 Introduction

The cluster randomized trial (CRT) is a commonly used design for evaluating the effectiveness of cluster-level interventions or when there are substantial risks of contamination of the intervention within clusters.¹ In CRTs, individuals who are members of the same group (such as a neighborhood, hospital, or school) are randomized as an entire unit to either the control or the intervention arms. A major implication of cluster randomization is that individuals from the same cluster tend to be more similar than individuals from different clusters. This similarity is measured by the intracluster correlation coefficient (ICC).² In recent years, a novel type of longitudinal CRT design, referred to as the stepped-wedge cluster randomized trial (SW-CRT) has become increasingly popular.^3,4 In the basic version of this design (Figure 1), all clusters commence the trial in the control condition and conclude in the intervention condition, with clusters randomized to one of several treatment sequences defined by the time of transition from the control condition to the intervention condition. Depending on the sampling structure, a SW-CRT can be either cross-sectional (a different set of individuals is included in each observation period), closed cohort (each individual contributes to all observation periods), or open cohort.³ Each observation period (e.g. each cell in Figure 1) is also referred to as a cluster-period.

Figure 1.

Illustration of a stepped wedge design, with five sequences, six periods.

A commonly used analytical model for SW-CRTs is the linear mixed model,⁵ which accounts for the ICC through the specification of random effects. The specification of random effects induces particular within-cluster correlation structures for the marginal distribution of the outcome vector for each cluster. Three popular forms of correlation structures induced by the specification of random effects are the exchangeable (EXCH),⁴ nested exchangeable (NE),^6,7 and discrete-time decay (DTD) structures.⁸ Sample size and power calculation software are available for SW-CRTs.^9–12 The choice of random-effects structure, and hence of correlation structure, could impact both sample size calculations and statistical inference.^5,13,14 The exchangeable model arises from a single random intercept for clusters and implies a common ICC both within and across periods.² The nested exchangeable model arises from a model with a random intercept and an additional random cluster-by-period interaction effect. This correlation structure allows researchers to distinguish a within-period ICC (correlation between measurements from two individuals in the same cluster-period) and a constant between-period ICC (correlation between measurements from two individuals in the same cluster but different periods).⁶ The ratio of the between-period ICC to the within-period ICC is defined as the cluster autocorrelation coefficient (CAC) and is generally smaller than one (i.e. when the variance component for the random cluster-by-period interaction is greater than zero).⁶ Finally, the DTD model assumes that the between-period ICC decays exponentially by period and arises from a first-order autoregressive (AR(1)) random-effects specification.⁸ In this model, the CAC measures the decay rate per period.⁸ In addition to these three types of correlation structures, it has also been recommended to allow for cluster-treatment effect heterogeneity, that is, to allow the treatment effect to vary across clusters by including a random cluster-by-intervention interaction effect.^15,16

The existing literature has demonstrated the potential impact of random-effects misspecification in linear mixed models and generalized linear mixed models. Frequently, the random effects are assumed to follow normal (Gaussian) distributions, which can be incorrectly specified when their true distributional forms deviate from normality.¹⁷ Litière et al.¹⁸ showed that under an individually randomized design, error distribution misspecifications can either over- or under-estimate power and could substantially inflate the type I error rate. In SW-CRTs, rather than error distribution misspecification, random-effects misspecification can also occur when certain random effects are omitted (e.g. assuming an exchangeable correlation structure when a decay parameter is needed).¹⁹ In this article, we focus on this latter type of misspecification and empirically explore methods that can maintain the validity of inference under such misspecification. In practice, investigators are often faced with an array of modeling choices without clear guidance as to which random effects or correlation structures they should choose. When the true random effects component is unknown, the final choice may depend heavily on subject-matter expertise and/or subjective judgment. Hui et al.²⁰ empirically demonstrated the sensitivity of linear mixed model inference to the random effects choices with clustered data. Voldal et al.²¹ examined the implications of misspecified correlation structures in SW-CRTs (with linear mixed models) and showed that misspecification affects both type I error rates and the efficiency of the treatment effect estimator. 
Although some guidance has recently been provided on the selection of a correlation structure using information criteria for continuous outcomes, adequate selection performance requires a large number of clusters, and the validity of post-selection inference in SW-CRTs has not been sufficiently investigated.²² In fact, since the true correlation structure for any application is always unknown, it is also impossible to assess the impact of correlation structure misspecification for any given application.^23,24 In addition, even if researchers believe a more complex model is appropriate, it may not be possible to fit a more complex model, as SW-CRTs often have a small number of clusters^25,26 and computational challenges may arise.

A potential solution to misspecification has been provided under the generalized estimating equations (GEEs) framework,²⁷ where researchers may utilize the robust variance estimator (RVE), sometimes referred to as the sandwich variance estimator, to consistently estimate the sampling variance of the treatment effect estimator in an assumed marginal model for the mean of the outcome.^28–30 Although RVEs have been implemented in mixed-effects models and previous studies have hinted that RVEs might be used to maintain the validity of inferences under misspecified random-effects structures,^21,31 the performance of RVEs in mixed-effects models has not been empirically examined in the context of SW-CRTs, and there has been no recommended practice for their use to date. Given the popularity of the mixed-effects model in SW-CRTs,^5,32 we investigated the performance of RVEs under different types of misspecified random-effects structures. Furthermore, as SW-CRTs often have a limited number of clusters, and several modified versions of RVEs with small-sample corrections have been proposed, we also explored the performance of small-sample corrections that are currently available in standard software for mixed-effects models—in particular, the R software package. Therefore, our objectives were to conduct an extensive simulation study to address the following emerging questions:

First, does the use of RVEs in linear mixed models lead to valid inferences for the estimation of the intervention effect when the random-effects structure is misspecified, even in the most severe form of misspecification, that is, with a simple exchangeable correlation?

Second, if small-sample corrections are required to maintain statistical properties with a small number of clusters, which available small-sample corrections in linear mixed model perform the best when using RVEs under random-effects misspecification?

The remainder of the article is organized as follows. In Section 2, we review the RVEs that are currently available in the R package “clubSandwich” for mixed-effects models. In Section 3, we present the results of a simulation study to investigate the performance of these RVEs for SW-CRT across a range of scenarios. In Section 4, we demonstrate the use of RVEs in a real data example. Section 5 concludes with a discussion.

2 Robust variance estimators and small-sample corrections in linear mixed models

2.1 Model and notations

We review some general notation and linear mixed model formulations in the context of the multilevel data structure. Commonly used linear mixed models in SW-CRTs are special cases of this general model formulation.⁵ Let $Y_{i}$ be the vector of outcomes for individuals from the $i$-th cluster $(i = 1, \dots, I)$ and let $X_{i}$ be the design matrix of corresponding covariates. The matrix $Z_{i}$ is the design matrix of random effects for each cluster. A generic linear mixed model can be written as

$Y_{i} = X_{i} β + Z_{i} u_{i} + ϵ_{i}$

(1)

where $β$ is the vector of fixed-effects regression parameters, $u_{i}$ follows the multivariate normal distribution $N_{q}(0, R)$, where $q$ is the dimension of the random-effects parameters, and $R$ is a $q \times q$ random-effects variance matrix. In a SW-CRT, $β$ typically includes the time effects (secular trend) and intervention effect parameters. Specific examples of the random-effects structure parameterized by $R$ will be provided in Section 3. In addition, $ϵ_{i} = (ϵ_{i1}, \dots, ϵ_{in_{i}})^{'}$ is a vector of random errors for all individuals in the $i$-th cluster, and is assumed to follow a normal distribution $N_{n_{i}}(0, σ_{ϵ}^{2} I_{n_{i}})$, where $n_{i}$ is the cluster size (number of observations made in the $i$-th cluster across all periods), $σ_{ϵ}^{2} = Var(ϵ_{ij})$ and $I_{n_{i}}$ is the $n_{i} \times n_{i}$ identity matrix. The total variance for each cluster then can be denoted as

$V_{i} = Var(Y_{i}) = Z_{i} R Z_{i}^{'} + σ_{ϵ}^{2} I_{n_{i}}$

(2)

If the variance components are known, the estimated fixed effects can be obtained from the generalized least squares formula:

$\hat{β} = (X^{'} V^{-1} X)^{-1} X^{'} V^{-1} Y$

(3)

where $V$ is a block-diagonal variance matrix with $V_{i}$ on the diagonal and zero on the off-diagonal, $X$ is the joint design matrix across all clusters (obtained by stacking $X_{i}$ over all clusters), and $Y = (Y_{1}^{'}, \dots, Y_{I}^{'})^{'}$ is the vector of all outcomes observed in the trial. Based on the work by Liang and Zeger,²⁸ the cluster RVE for a linear mixed model can be written as follows:^33–35

$Var(\hat{β}) = (X^{'} V^{-1} X)^{-1} (\sum_{i = 1}^{I} X_{i}^{'} V_{i}^{-1} Σ_{i} V_{i}^{-1} X_{i}) (X^{'} V^{-1} X)^{-1}$

(4)

where the true covariance matrix $Σ_{i} = Var(Y_{i} - X_{i} β) = Var(r_{i})$ is unknown and must be estimated from the data. In particular, with a small number of clusters, there is a tendency for $X_{i} \hat{β}$ to be too close to the observation $Y_{i}$, making the estimated residuals $\hat{r}_{i}$ too close to zero, hence biasing variance estimates downward.³⁶ To address this small sample bias, the estimate of $Σ_{i}$ in (4) needs to be inflated to reflect the correct uncertainty of the regression parameter estimates. There are several variations of (4) that have been developed to address small sample bias. In our study, we particularly focus on corrections that are currently available in the “clubSandwich” package in R, which targets the cluster RVE under mixed-effects models.³⁷ The available RVEs in this package can be written in the form:

$\widehat{Var}(\hat{β}) = (X^{'} V^{-1} X)^{-1} (\sum_{i = 1}^{I} X_{i}^{'} V_{i}^{-1} A_{i} \hat{r}_{i} \hat{r}_{i}^{'} A_{i}^{'} V_{i}^{-1} X_{i}) (X^{'} V^{-1} X)^{-1}$

(5)

where $\hat{r}_{i}$ is the residual vector in cluster $i$, and $A_{i}$ is an adjustment matrix. A standard RVE (denoted CR0) is to let $A_{i} = I$. For multilevel regression models in general, the standard RVE may lead to inflated type I error if the number of independent units is fewer than 50.³⁸ Similar findings were discussed in the literature for GEE analysis of SW-CRTs,^27,39–41 but little evidence has been provided for mixed-model analysis of SW-CRTs. A recent review of 160 published SW-CRTs found that 70% of studies used linear or generalized linear mixed models as the primary analytical approach even though the median number of clusters was only 11.²⁶ Therefore, small sample corrections to the standard RVEs are often required when implementing linear mixed models in practice and we will review available small sample corrections in the next section.

2.2 Simple small sample corrections

A simple small sample correction is to adjust the CR0 variance estimator by different degrees of freedom (DoF) corrections. In “clubSandwich,” there are three types of DoF corrections: (1) correct the standard RVE by $I / (I - 1)$ (denoted as CR1); (2) correct the standard RVE by $I / (I - P)$ , where P is the number of fixed-effects covariate parameters estimated in the model (denoted as CR1P); or (3) correct the standard RVE by $(I (N - 1)) / [(I - 1) (N - P)]$ , where N is the total number of observations (denoted as CR1S).
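As a rough numerical illustration of how these three multipliers differ, the sketch below evaluates them for a hypothetical trial. The function name and the example values of I, P, and N are illustrative assumptions; they are not settings taken from “clubSandwich” or from the simulation study.

```python
def dof_multipliers(n_clusters, n_params, n_obs):
    """Return the scalar factors applied to CR0 under CR1, CR1P, and CR1S."""
    I, P, N = n_clusters, n_params, n_obs
    if I <= P:
        raise ValueError("CR1P requires more clusters than fixed-effects parameters")
    return {
        "CR1": I / (I - 1),
        "CR1P": I / (I - P),
        "CR1S": (I * (N - 1)) / ((I - 1) * (N - P)),
    }


# Hypothetical example: 16 clusters, 6 fixed-effects parameters
# (say 5 period effects and 1 intervention effect), 800 observations.
factors = dof_multipliers(n_clusters=16, n_params=6, n_obs=800)
for name, value in factors.items():
    print(f"{name}: {value:.4f}")
```

In this hypothetical setting CR1P inflates the variance much more than CR1 or CR1S, because P is a sizeable fraction of I while N is large relative to P.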

2.3 Bias-reduced linearization and approximation of jackknife estimator

In addition to the DoF correction, a common correction is to use resampling methods, such as jackknife resampling estimators.^42–44 Mancl and DeRouen³⁶ proposed a bias-reduced correction by deriving an approximation for the bias of ${\hat{r}}_{i} {\hat{r}}_{i}^{'}$ with a first-order Taylor series expansion of the residuals. Bell and McCaffrey⁴⁵ showed that this correction closely approximated the jackknife estimator under an unweighted linear model. This correction, denoted as CR3, replaces $A_{i} r_{i} r_{i}^{'} A_{i}^{'}$ by

$A_{i} r_{i} r_{i}^{'} A_{i}^{'} = [I_{i} - H_{i}]^{-1} r_{i} r_{i}^{'} [I_{i} - H_{i}]^{-1}$

(6)

where $H_{i} = X_{i} (X^{'} V^{-1} X)^{-1} X_{i}^{'} V_{i}^{-1}$ is the cluster-leverage matrix. This approach is also known as the Mancl and DeRouen³⁶ variance correction in the GEE literature and has been found to be an effective small-sample method (but sometimes conservative due to over-correction) for GEE estimators in SW-CRTs. Finally, Bell and McCaffrey found that CR3 tends to over-correct the bias of CR0, and thus a bias-reduced linearization (BRL) method was proposed. This correction (denoted as CR2) is close to CR3:

$A_{i} r_{i} r_{i}^{'} A_{i}^{'} = [I_{i} - H_{i}]^{-1/2} r_{i} r_{i}^{'} [I_{i} - H_{i}]^{-1/2}$

(7)

Apart from the terms in (6) and (7) being parameterized specifically for linear mixed models, the CR2 approach is also known as the Kauermann and Carroll variance correction in the GEE literature,⁴⁶ and it has been found to maintain the nominal type I error rate for GEE analyses of small SW-CRTs under different multilevel working correlation structures (but mostly under the correct specification of the working correlation structure).^39–41,47 To investigate the potential for these variance estimators to yield valid inferences under misspecified linear mixed models, we conducted a simulation study to compare their finite-sample operating characteristics. For ease of reference, a summary of the RVEs and their features is provided in Table 1.

Table 1.

Summary of robust variance estimators used in our simulations.

Robust variance estimator	Description
CR0	Standard robust variance estimator
CR1	Simple degrees of freedom (DoF) correction of CR0 in the form $I / (I - 1)$
CR1P	Simple DoF correction of CR0 in the form $I / (I - P)$ , where P is the number of covariate parameters estimated in the model
CR1S	Simple DoF correction of CR0 in the form $(I (N - 1)) / [(I - 1) (N - P)]$ , where N is the number of observations
CR2	Correction based on “bias-reduced linearization”
CR3	Closely approximates the leave-one-cluster-out jackknife resampling estimator
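
To make the role of the adjustment matrix $A_{i}$ concrete, the following sketch evaluates CR0, CR2, and CR3 in a deliberately simplified special case: an intercept-only mean model with a working independence structure ($V = I$). In that case $H_{i} = 1 1^{'} / N$, the vector of ones is an eigenvector of $I_{i} - H_{i}$ with eigenvalue $1 - n_{i}/N$, and the adjusted cluster score has a closed form. The simplification, the function name, and the toy data are assumptions made for illustration only; this is not how “clubSandwich” computes these estimators for a fitted mixed model.

```python
def cluster_robust_variances(clusters):
    """Return CR0, CR2, and CR3 variances of the overall mean.

    `clusters` is a list of lists of outcomes, one inner list per cluster.
    Assumes an intercept-only model with working independence (V = I), so
    1' (I - H_i)^(-p) r_i = (1 - n_i / N)^(-p) * sum(r_i).
    """
    n_total = sum(len(y) for y in clusters)
    grand_mean = sum(sum(y) for y in clusters) / n_total
    out = {}
    for name, power in (("CR0", 0.0), ("CR2", 0.5), ("CR3", 1.0)):
        meat = 0.0
        for y in clusters:
            resid_sum = sum(v - grand_mean for v in y)
            shrink = 1.0 - len(y) / n_total  # eigenvalue of I - H_i along 1
            meat += (resid_sum / shrink ** power) ** 2
        out[name] = meat / n_total ** 2  # the "bread" is 1/N on each side
    return out


# Toy data: 4 clusters of 3 observations each (hypothetical values).
toy = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 1.0, 2.0], [5.0, 5.0, 8.0]]
variances = cluster_robust_variances(toy)
for name, value in variances.items():
    print(f"{name}: {value:.6f}")
```

With equal cluster sizes, CR2 in this special case equals CR0 multiplied by $I / (I - 1)$ (the CR1 factor), and CR3 equals CR0 multiplied by $(I / (I - 1))^{2}$, which is consistent with CR3 being the more conservative of the two.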

3 Simulation study

In this section, we report the results of a simulation study to determine whether utilizing RVEs can help to maintain valid inference when the correlation structure has been incorrectly specified in SW-CRTs and to identify which specific RVE performs best under different scenarios, including with a limited number of clusters.

3.1 Trial design

We considered the standard SW-CRT design with cross-sectional measurement, assuming I clusters, S sequences, and $J = S + 1$ periods of equal length (Figure 1). We assumed balanced designs such that $I / S$ clusters are included per sequence, and equal cluster-period sizes K.
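
The treatment indicator implied by this balanced design can be laid out as a cluster-by-period matrix. The sketch below is a minimal illustration; the function name is hypothetical, and the random allocation of clusters to sequences is omitted (rows are simply listed in sequence order).

```python
def stepped_wedge_design(n_clusters, n_sequences):
    """Return rows X[i][j] with 0 = control and 1 = intervention."""
    if n_clusters % n_sequences != 0:
        raise ValueError("a balanced design needs I divisible by S")
    n_periods = n_sequences + 1          # J = S + 1
    per_sequence = n_clusters // n_sequences
    design = []
    for seq in range(1, n_sequences + 1):
        # Sequence `seq` switches to the intervention at period seq + 1.
        row = [1 if period > seq else 0 for period in range(1, n_periods + 1)]
        design.extend(list(row) for _ in range(per_sequence))
    return design


design = stepped_wedge_design(n_clusters=8, n_sequences=4)
for row in design:
    print(row)
```

Every cluster starts in the control condition (first column all zeros) and ends in the intervention condition (last column all ones), as in Figure 1.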

3.2 Data-generating process

We assumed a continuous outcome and generated data under two different models: (1) the discrete-time decay with random intervention (DTD-RI) model and (2) the nested exchangeable with random intervention (NE-RI) model. Both data generators share the same mean model (i.e. fixed effects) and only differ in their random effect structures. While we explicitly write out each data generator below, we refer to Li et al.⁵ for a unified model representation that includes these two data generators as special cases with different specifications of the random-effects structure.

(1)
DTD-RI model: The DTD-RI model can be written as
$Y_{i j k} = u + β_{j} + (θ + v_{i}) X_{i j} + γ_{i j} + ε_{i j k}$

where $Y_{ijk}$ denotes the outcome of the $k$-th individual from the $i$-th cluster and $j$-th period; $β_{j}$ denotes the fixed time effect ($β_{j} = j$, where $j = 1, 2, \dots, J$); $u$ is the overall mean under the control condition; $θ$ denotes the time-invariant treatment effect; and $X_{ij}$ is a binary treatment indicator for the $i$-th cluster in the $j$-th period (taking values 1 or 0 depending on the treatment condition). The individual error term $ε_{ijk}$ is assumed to follow $N(0, τ_{ε}^{2})$. The random treatment effect $v_{i}$ is assumed to follow $N(0, τ_{v}^{2})$ and measures the departure of the treatment effect in each cluster from the overall mean $θ$. The vector of random cluster-by-period effects is assumed to follow $γ_{i} = (γ_{i1}, \dots, γ_{iJ})^{'} \sim N(0, τ_{γ}^{2} \tilde{Z})$. We particularly focus on $\tilde{Z}$ with a first-order autoregressive (AR(1)) structure,⁸ that is
$\tilde{Z} = [\begin{matrix} 1 & r_{12} & r_{13} & \dots & r_{1 J} \\ r_{12} & 1 & r_{23} & \dots & r_{2 J} \\ r_{13} & r_{23} & 1 & \dots & r_{3 J} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ r_{J 1} & r_{J 2} & r_{J 3} & \dots & 1 \end{matrix}] = [\begin{matrix} 1 & r & r^{2} & \dots & r^{J - 1} \\ r & 1 & r & \dots & r^{J - 2} \\ r^{2} & r & 1 & \dots & r^{J - 3} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ r^{J - 1} & r^{J - 2} & r^{J - 3} & \dots & 1 \end{matrix}]$
where, in the absence of the random intervention effect, $r$ in the second matrix represents the common amount of between-period correlation decay per period.
(2)
NE-RI model: The NE-RI model can be written as
$Y_{i j k} = u + α_{i} + β_{j} + (θ + v_{i}) X_{i j} + γ_{i j} + ε_{i j k}$

This model has an additional random intercept for the cluster, denoted as $α_{i}$, which follows $N(0, τ_{α}^{2})$. The term $γ_{ij}$ still denotes the random cluster-period effect but now is assumed to follow $N(0, τ_{γ}^{2})$. The two random effects in this model are assumed to be independent, inducing a nested exchangeable correlation structure for the outcome observations within the same cluster.
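
The difference between the two data generators is easiest to see in the between-period correlation of the cluster-level random effects under the control condition. The sketch below builds the AR(1) matrix $\tilde{Z}$ of the DTD-RI model and the corresponding constant-correlation pattern implied by $α_{i} + γ_{ij}$ in the NE-RI model. Function names and the numerical values are illustrative assumptions.

```python
def ar1_matrix(n_periods, r):
    """J x J matrix whose (j, l) entry is r ** |j - l| (DTD structure)."""
    return [[r ** abs(j - l) for l in range(n_periods)] for j in range(n_periods)]


def nested_exchangeable_matrix(n_periods, tau_alpha_sq, tau_gamma_sq):
    """Correlation of alpha_i + gamma_ij across periods (NE structure)."""
    cac = tau_alpha_sq / (tau_alpha_sq + tau_gamma_sq)
    return [[1.0 if j == l else cac for l in range(n_periods)]
            for j in range(n_periods)]


# Hypothetical values: 5 periods, decay r = 0.8, and NE variance
# components chosen so that the constant CAC is also 0.8.
dtd = ar1_matrix(5, 0.8)
ne = nested_exchangeable_matrix(5, tau_alpha_sq=0.04, tau_gamma_sq=0.01)
print("DTD, first row:", [round(v, 4) for v in dtd[0]])
print("NE,  first row:", [round(v, 4) for v in ne[0]])
```

Under the DTD structure the correlation keeps shrinking with the distance between periods, whereas under the NE structure it stays at the same value for every pair of distinct periods.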

The definitions for the ICCs under both models are summarized in Table 2; also see Ouyang et al.¹⁴ for a more detailed exposition of the definition of ICCs in simpler models without the random intervention effect. Notice that, due to the inclusion of random intervention effects, the definitions of the within- and between-period ICCs are different under control and intervention conditions. Furthermore, the interpretation of the CAC under the control condition is different for models (1) and (2). In the nested exchangeable model, the CAC is fixed regardless of the distance between periods, whereas, in the exponential decay model, the CAC is the fixed decay rate per period. The original definition of the CAC, namely the ratio of the between- to within-period ICCs, is still applicable under both intervention and control conditions. We emphasize that an extra term for the random intervention effect must be added to the ICC definition under the intervention condition, and the inclusion of a random intervention effect induces between-arm heterogeneity.

Table 2.
Summary of within-period and between-period intracluster correlation coefficients under data generating models.

Data generating model	Parameters	Under control condition	Under intervention condition
Random intervention with nested exchangeable	Within-period ICC	$\frac{τ_{α}^{2} + τ_{γ}^{2}}{τ_{α}^{2} + τ_{γ}^{2} + τ_{ε}^{2}}$	$\frac{τ_{α}^{2} + τ_{γ}^{2} + τ_{v}^{2}}{τ_{α}^{2} + τ_{γ}^{2} + τ_{v}^{2} + τ_{ε}^{2}}$
Random intervention with nested exchangeable	Between-period ICC	$\frac{τ_{α}^{2}}{τ_{α}^{2} + τ_{γ}^{2} + τ_{ε}^{2}}$	$\frac{τ_{α}^{2} + τ_{v}^{2}}{τ_{α}^{2} + τ_{γ}^{2} + τ_{v}^{2} + τ_{ε}^{2}}$
Random intervention with nested exchangeable	CAC	$\frac{τ_{α}^{2}}{τ_{α}^{2} + τ_{γ}^{2}}$	$\frac{τ_{α}^{2} + τ_{v}^{2}}{τ_{α}^{2} + τ_{γ}^{2} + τ_{v}^{2}}$
Random intervention with discrete time decay (assuming an AR(1) $\tilde{Z}$ matrix)	Within-period ICC	$\frac{τ_{γ}^{2}}{τ_{γ}^{2} + τ_{ε}^{2}}$	$\frac{τ_{γ}^{2} + τ_{v}^{2}}{τ_{γ}^{2} + τ_{v}^{2} + τ_{ε}^{2}}$
Random intervention with discrete time decay (assuming an AR(1) $\tilde{Z}$ matrix)	CAC (between periods $j$ and $l$, where $j \neq l$)	$r^{|j - l|}$	$\frac{τ_{γ}^{2} r^{|j - l|} + τ_{v}^{2}}{τ_{γ}^{2} + τ_{v}^{2}}$
Random intervention with discrete time decay (assuming an AR(1) $\tilde{Z}$ matrix)	Between-period ICC (between periods $j$ and $l$, where $j \neq l$)	$\frac{τ_{γ}^{2} r^{|j - l|}}{τ_{γ}^{2} + τ_{ε}^{2}}$	$\frac{τ_{γ}^{2} r^{|j - l|} + τ_{v}^{2}}{τ_{γ}^{2} + τ_{v}^{2} + τ_{ε}^{2}}$

ICC: intracluster correlation coefficient; CAC: cluster autocorrelation coefficient; $r_{jl}$: the decay parameter (CAC) between period $j$ and period $l$ assuming a general $\tilde{Z}$ matrix in the DTD-RI model; $r$: the decay parameter (CAC) between adjacent periods assuming an AR(1) $\tilde{Z}$ matrix in the DTD-RI model; DTD-RI: discrete-time decay with random intervention model.

3.3 Simulation parameters

We chose simulation parameters (i.e. number of clusters, correlation parameters, and cluster sizes) based on published literature. The assumed values are summarized in Table 3. For example, Nevins et al.^26,48 conducted a review of 160 published SW-CRTs and found that the median number of clusters was 11, with first and third quartiles of 8 and 18. Furthermore, Korevaar et al.⁴⁹ presented a database of estimated correlation parameters from reanalyzing multiple SW-CRTs and found that the ICC (depending on the correlation structure) tended to range between 0.01 and 0.1 and that a typical CAC value was 0.8. We therefore generated data with $I$ = 8, 16, and 32 clusters, $S$ = 4 and 8 sequences, and cluster-period sizes $K$ = 10 or 100. We primarily focused on evaluating the validity of the Wald test (type I error rate); thus, we assumed the time-invariant treatment effect (scaled treatment effect relative to the error variance) is zero and fixed the standard deviation of the individual error term ($τ_{ε}$) at 1. We considered three combinations of within-period ICCs under the control ($ρ_{0}$) and intervention conditions ($ρ_{1}$): (0.01, 0.05), (0.05, 0.10), and (0.05, 0.15). The corresponding standard deviations of the random intervention effect ($τ_{v}$) are 0.21, 0.24, and 0.35. The CAC under the control condition was assumed to be 0.8, and the corresponding values of CAC under the intervention condition can then be calculated using Table 2.
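
The stated values of $τ_{v}$ can be recovered from the within-period ICC definitions in Table 2 with $τ_{ε} = 1$. The sketch below does this for the NE-RI parameterization and also evaluates the CAC under the intervention condition; the function name and the returned dictionary keys are illustrative assumptions.

```python
import math


def variance_components(icc_control, icc_intervention, cac_control=0.8,
                        error_var=1.0):
    """Back out NE-RI variance components from the within-period ICCs."""
    # Control: icc0 = c / (c + error_var), with c = tau_alpha^2 + tau_gamma^2.
    cluster_var = icc_control * error_var / (1.0 - icc_control)
    # Intervention: icc1 = (c + tau_v^2) / (c + tau_v^2 + error_var).
    total_cluster_var = icc_intervention * error_var / (1.0 - icc_intervention)
    tau_v_sq = total_cluster_var - cluster_var
    # Control CAC = tau_alpha^2 / (tau_alpha^2 + tau_gamma^2).
    tau_alpha_sq = cac_control * cluster_var
    tau_gamma_sq = cluster_var - tau_alpha_sq
    cac_intervention = (tau_alpha_sq + tau_v_sq) / total_cluster_var
    return {
        "tau_v": math.sqrt(tau_v_sq),
        "tau_alpha_sq": tau_alpha_sq,
        "tau_gamma_sq": tau_gamma_sq,
        "cac_intervention": cac_intervention,
    }


results = {}
for icc0, icc1 in [(0.01, 0.05), (0.05, 0.10), (0.05, 0.15)]:
    results[(icc0, icc1)] = variance_components(icc0, icc1)
    comp = results[(icc0, icc1)]
    print(f"ICCs ({icc0}, {icc1}): tau_v = {comp['tau_v']:.2f}, "
          f"CAC under intervention = {comp['cac_intervention']:.3f}")
```

Because the random intervention effect adds variance that is shared across all periods, the CAC under the intervention condition is larger than the value of 0.8 assumed under the control condition.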

Table 3.
Range of trial configuration and simulation parameter values under random intervention with nested exchangeable correlation structure model.

Parameter	Values
Number of clusters (I)	8, 16, 32
Number of sequences (S)	4, 8
Cluster-period sizes (K)	10, 100
Effect size	0
Within-period ICC in control and intervention group	(0.01, 0.05), (0.05, 0.10), (0.05, 0.15)
CAC	0.8
Standard deviation of individual error term	1

ICC: intracluster correlation coefficient; CAC: cluster autocorrelation.

3.4 Data analysis and performance measures

Each simulated dataset was analyzed with (1) the correctly specified model (either DTD-RI or NE-RI), and then with two misspecified models: (2) EXCH and (3) NE, each with and without RVE. We estimated six different RVEs (see Table 1) under each mixed-effects model (described in Section 2.4) using the R package “clubSandwich” (version 0.5.10). In total, for each simulated dataset, there were 15 fitted models. All models were fitted using “lmerTest” in R⁵⁰ with the exception of the DTD-RI model, which was fitted using “glmmTMB” in R.⁵¹ All models were fitted with their default non-linear optimizers. As an ad hoc analysis, the NE-RI models were also fitted using bound optimization by quadratic approximation (“bobyqa”) in an attempt to improve the convergence rate.⁵²

The definitions of the performance measures of interest are shown in Table 4. We first calculated the bias of the estimated time-invariant treatment effects. Then, we recorded the empirical (Monte Carlo) standard error (SE) of the estimated treatment effect as well as the average SE (from models with and without RVEs) and the percentage error in the averaged model-based SE (SE estimate from a model) relative to the empirical SE. The coverage of 95% confidence intervals (CIs) for the treatment effect was assessed over repeated data generations.

Table 4.
Summary of performance measures.

Performance measures	Definition	Estimates
Bias	$E[\hat{θ}] - θ$	$\frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} \hat{θ}_{i} - θ$
Coverage	$P(\hat{θ}_{low} < θ < \hat{θ}_{upp})$	$\frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} I(\hat{θ}_{low,i} < θ < \hat{θ}_{upp,i})$
Average estimated SE	$E[\sqrt{\widehat{Var}(\hat{θ})}]$	$\frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} \sqrt{\widehat{Var}(\hat{θ}_{i})}$
Empirical SE	$\sqrt{Var(\hat{θ})}$	$\sqrt{\frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} (\hat{θ}_{i} - \bar{θ})^{2}}$
Percentage error of average estimated SE to empirical SE	$100 (\frac{E[\sqrt{\widehat{Var}(\hat{θ})}]}{\sqrt{Var(\hat{θ})}} - 1)$	$100 (\frac{\frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} \sqrt{\widehat{Var}(\hat{θ}_{i})}}{\sqrt{\frac{1}{n_{sim}} \sum_{i = 1}^{n_{sim}} (\hat{θ}_{i} - \bar{θ})^{2}}} - 1)$

SE: standard error.
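
The estimates in Table 4 translate directly into a few lines of code. The sketch below is a minimal illustration with made-up inputs; the function name, the toy estimates, and the use of a normal critical value of 1.96 to form the toy intervals are assumptions for the example only.

```python
import math


def performance_measures(estimates, std_errors, ci_lower, ci_upper, true_theta):
    """Compute the Table 4 performance measures from per-replicate results."""
    n_sim = len(estimates)
    mean_est = sum(estimates) / n_sim
    bias = mean_est - true_theta
    covered = [lo < true_theta < hi for lo, hi in zip(ci_lower, ci_upper)]
    coverage = sum(covered) / n_sim
    avg_se = sum(std_errors) / n_sim
    # Empirical SE uses the 1 / n_sim divisor, as in Table 4.
    emp_se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n_sim)
    pct_error = 100.0 * (avg_se / emp_se - 1.0)
    return {"bias": bias, "coverage": coverage, "avg_se": avg_se,
            "emp_se": emp_se, "pct_error": pct_error}


# Toy inputs (hypothetical): five replicates, true effect of zero.
est = [-0.2, 0.1, 0.0, 0.5, -0.4]
se = [0.2] * 5
lower = [e - 1.96 * s for e, s in zip(est, se)]
upper = [e + 1.96 * s for e, s in zip(est, se)]
measures = performance_measures(est, se, lower, upper, true_theta=0.0)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

A negative percentage error, as in this toy example, indicates that the estimated SEs are on average smaller than the empirical variability of the estimator.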

For each trial configuration, we simulated 2000 datasets. This allowed us to control the Monte Carlo standard error (MCSE) to within $\pm 0.5\%$ for a coverage probability of 95%.⁵³ Convergence is a major concern when fitting complex mixed-effects models even under correct model specifications. Previous simulation studies have shown a considerable proportion of non-convergence, especially when the number of clusters is small.²⁷ We therefore also recorded the prevalence of non-convergence for all models. The data generation and analyses were conducted on the Compute Canada High-Performance Computing Cluster with R version 4.0.2.
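
The MCSE figure quoted above follows from the usual binomial standard error of an estimated proportion, $\sqrt{p (1 - p) / n_{sim}}$. A minimal check, with a hypothetical function name:

```python
import math


def coverage_mcse(coverage, n_sim):
    """Monte Carlo standard error of an estimated coverage probability."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)


mcse = coverage_mcse(0.95, 2000)
print(f"MCSE = {100 * mcse:.2f} percentage points")
```

With 2000 replicates the MCSE is just under half a percentage point, in line with the stated bound.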

3.5 Degree-of-freedom corrections

With a small number of clusters, studies have shown the need to correct the asymptotic normal approximation for Wald tests. For all our assessments of inferential properties (e.g. 95% CI coverage), we used a $t$ -test with the DoF equal to the number of clusters ( $I$ ) minus two. This choice of DoF was originally proposed by Ford and Westgate for GEE estimators,⁵⁴ and has subsequently been shown empirically to work well in several other studies that focused on GEE estimators of SW-CRTs.^41,47,55 In addition, the DoF $I - 2$ is also easier to implement in sample size calculations compared to other DoF corrections such as Satterthwaite, which are complex and data-dependent.⁵⁶ However, for general interest, we have also provided the results for the Satterthwaite DoF correction in the Supplemental Material.⁵⁷ Please note that Satterthwaite DoF under the DTD-RI model is currently not available in R; therefore, the normal approximation was used under that specific model (this, however, was not an issue for the $I - 2$ DoF).
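
To give a sense of how much the $I - 2$ degrees of freedom widen a 95% CI relative to the normal critical value of about 1.96, the sketch below computes the two-sided $t$ critical values for the numbers of clusters used in the simulations. It uses plain numerical integration and bisection so that it needs no statistical library; the function names and the numerical settings are illustrative assumptions.

```python
import math


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0 via Simpson's rule on [0, x] (steps must be even)."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(t):
        return coef * (1 + t * t / df) ** (-(df + 1) / 2)

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + total * h / 3


def t_critical(df, level=0.95):
    """Two-sided critical value of the t distribution, found by bisection."""
    target = 1 - (1 - level) / 2
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


critical = {I: t_critical(I - 2) for I in (8, 16, 32)}
for I, value in critical.items():
    print(f"I = {I:2d}, DoF = {I - 2:2d}, critical value = {value:.3f}")
```

The critical value decreases toward the normal value as the number of clusters grows, so the correction matters most for the smallest trials.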

3.6 Simulation results

3.6.1 Bias of the estimated treatment effects

Not surprisingly, since we used linear mixed models with correctly specified models for the conditional mean outcome (fixed-effects component), the estimated treatment effects were unbiased across all trial configurations, even if the random effects structure was misspecified.^21,58 The details are presented in Supplemental Table S1.

3.6.2 Coverage probabilities of 95% confidence intervals

3.6.2.1 Data generated from the DTD-RI model

We first present simulation results with data generated from the DTD-RI model. The SEs of estimated treatment effects from the different fitted models varied substantially. As expected, misspecified random-effects structures (e.g. EXCH and NE) without RVEs produced smaller (but biased) average model-based SEs, which led to coverage probabilities below the nominal 95% level. In Figure 2, we show the coverage probabilities of 95% CIs obtained from the EXCH and NE models with and without RVE, and from the true models (based on model-based SE). Each quarter-block of Figure 2 presents the 95% CI coverage for various combinations of within-period ICCs, number of sequences, and cluster-period sizes, together with Monte Carlo bounds around 95% coverage. For example, the first quarter of Figure 2 (the top-right four mini-panels) presents comparisons between models with and without RVEs, first under exchangeable (EXCH) and then under nested exchangeable (NE) correlation structures. We only present results for ICC combinations of (0.01, 0.05) and (0.05, 0.15). The results for the ICC combination (0.05, 0.10) are presented in Supplemental Table S2. The results show that the misspecified models with model-based SE estimators were generally not able to reach 95% coverage. Both the EXCH and NE models without RVEs had similar coverage when $K = 10$ , but the NE models had much better coverage, especially when $K = 100$ . With a model-based SE estimator, the performance of the misspecified models was worse when the within-period ICCs were larger (this is more obvious for EXCH) or when the number of sequences was larger (i.e. fewer clusters were allocated to each sequence). The coverage probabilities of the misspecified model with a model-based SE estimator were lower with larger cluster-period sizes; thus, increasing the cluster-period sizes did not improve the validity of inferences.

Figure 2.

Coverage probability of 95% confidence intervals for the treatment effect under both the true model and misspecified models with or without RVE when data were generated under DTD-RI. CR0 is the standard robust variance estimator, and CR3 closely approximates the leave-one-cluster-out jackknife resampling estimator. The three dashed lines represent the 95% $\pm$ 2*MCSE.

To investigate the performance of RVEs in finite samples, Figure 2 also presents the coverage probabilities under CR0 (standard RVE) and CR3 (the RVE of approximating the leave-one-cluster-out jackknife variance estimator), respectively, compared to coverage under the true model. We chose to focus on the presentation of results for these two RVEs because CR0 is the standard RVE, and CR3 had the best performance among all RVEs we investigated. In general, the coverage probabilities tended to be closer to 95% when using RVEs than without RVEs. CR0 did not always reach a nominal 95% coverage, even with 32 clusters. Therefore, small sample corrections are generally required to maintain the validity of misspecified linear mixed models in SW-CRTs. Corrections with a simple DoF (CR1 and CR1S) led to coverage probabilities much lower than the nominal 95%. The bias-reduced linearization (CR2) had better coverage and produced decent coverage probabilities in some cases. However, it generally had underestimated SEs, leading to likely under-coverage with 16 clusters. CR1P consistently led to coverage probabilities over 96% and overestimated SEs (Supplemental Tables S2 and S3). It is important to note that CR1P (with correction $I / (I - P)$ ) was not defined when the number of clusters and sequences were both 8 (i.e. I = 8, $P$ = 9). CR3 had reasonable performance, leading to coverage probabilities generally closer to the nominal level and similar to that of the true model. When the number of clusters was small (e.g. eight clusters), the coverage probabilities (based on the model-based SE) under the true model exceeded the nominal level. While the same observation applied to CR3, the coverage probabilities tended to be closer to 95% in general. Satterthwaite DoF usually led to under-coverage when the number of clusters was either 8 or 16 (Supplemental Table S4). 
We found the coverage probabilities (under Satterthwaite DoF) could be as low as 91% when the number of clusters was 8. Even with 16 clusters, we still observed scenarios with 93% coverage. In general, when we fixed other parameters, the within-period ICCs and the number of sequences had a limited impact on coverage for the RVE methods, although we saw slightly better coverage when the cluster-period sizes were larger.
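The reason CR1P is undefined for some configurations can be seen from its multiplier $I / (I - P)$ alone. The sketch below is a minimal Python illustration; the CR1 multiplier $I / (I - 1)$ is assumed here to follow the usual clubSandwich convention, the function names are hypothetical, and the CR0 standard error of 0.5 is a made-up value used only to show the scaling.

```python
import math

def cr1_factor(n_clusters):
    """Assumed CR1 multiplier applied to the CR0 variance: I / (I - 1)."""
    return n_clusters / (n_clusters - 1)

def cr1p_factor(n_clusters, n_params):
    """CR1P multiplier I / (I - P); undefined (None) when I <= P."""
    if n_clusters <= n_params:
        return None  # the multiplier would be negative or infinite
    return n_clusters / (n_clusters - n_params)

def corrected_se(se_cr0, factor):
    """Scale a CR0 standard error by the square root of a variance multiplier."""
    return None if factor is None else se_cr0 * math.sqrt(factor)

se_cr0 = 0.5  # hypothetical CR0 standard error, for illustration only
for n_clusters, n_params in [(8, 9), (16, 9), (32, 9)]:
    se_cr1 = corrected_se(se_cr0, cr1_factor(n_clusters))
    se_cr1p = corrected_se(se_cr0, cr1p_factor(n_clusters, n_params))
    print(n_clusters, n_params, round(se_cr1, 3), se_cr1p)
```

With $I = 8$ and $P = 9$ the multiplier would be negative, so no variance estimate exists. With $I = 16$ and $P = 9$ the variance is multiplied by 16/7, which is in line with the overestimated SEs and coverage above 96% reported for CR1P.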

3.6.2.2 Data generated from NE-RI model

The results when the data were generated from the NE-RI model are similar to those from DTD-RI. In Figure 3, we present the coverage probabilities of 95% confidence intervals obtained from the misspecified EXCH and NE models with and without RVE, and from the true NE-RI models. Since the NE-RI model is the NE model with an added random intervention effect, the performance of the NE model with the model-based SE estimator depends on the variability of the random intervention effect across clusters (the NE model under the ICC combination of (0.05, 0.15) shows worse performance due to the larger $τ_{v}$ value).

Figure 3.

Coverage probability of 95% confidence intervals for the treatment effect under both the true model and misspecified models with or without RVE when data were generated under NE-RI. CR0 is the standard robust variance estimator, and CR3 closely approximates the leave-one-cluster-out jackknife resampling estimator. The three dashed lines represent the 95% $\pm$ 2*MCSE.

3.6.3 Percentage error of model-based SE to empirical SE

3.6.3.1 DTD-RI model

Supplemental Table S3 summarizes the percentage errors of model-based SE to empirical SE under all models. When the data were analyzed using misspecified models without RVE, the standard errors were often underestimated by over 60%. Relative errors decreased substantially when RVEs were used. In Supplemental Figure S1, we compare the percentage error of model-based SEs from CR0 and CR3 to the true models. Across simulation scenarios, the patterns were similar to those presented for coverage probabilities: the correction by CR0 was too small, and standard errors were underestimated by over 24%. CR3 had the smallest relative errors compared to other types of RVEs, although the error could still be as high as 8% when the number of clusters was only eight. The overestimate of SEs indicates a potential loss in efficiency after applying RVEs. When the number of clusters was larger (e.g. 16 or 32), the relative error was usually within 5% and could be as low as 1%. Comparing across all RVEs, CR3 had relative errors closest to the relative errors obtained by fitting the true models (based on the model-based SE).
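For readers who wish to reproduce this performance measure, the percentage error can be sketched as the ratio of the average estimated SE to the empirical SE of the point estimates across simulation replicates, minus one. The following Python snippet is a minimal illustration with made-up inputs (four replicates chosen so that the empirical SE is exactly 1); the function name `percent_error` is hypothetical.

```python
import math

def percent_error(se_estimates, point_estimates):
    """Percentage error of the average estimated SE relative to the empirical SE."""
    n = len(point_estimates)
    mean_est = sum(point_estimates) / n
    # Empirical SE: standard deviation of the point estimates (divisor n_sim)
    empirical_se = math.sqrt(sum((t - mean_est) ** 2 for t in point_estimates) / n)
    avg_se = sum(se_estimates) / len(se_estimates)
    return 100 * (avg_se / empirical_se - 1)

# Toy replicates (not simulation output): the estimates have empirical SE 1,
# while the reported SEs average 0.4, i.e. a 60% underestimate.
theta_hat = [-1.0, 1.0, -1.0, 1.0]
se_hat = [0.4, 0.4, 0.4, 0.4]
print(f"{percent_error(se_hat, theta_hat):.1f}%")
```

A negative value indicates that the estimated SE understates the true sampling variability, as for the misspecified models without RVE; a positive value indicates overestimation, as for CR3 with eight clusters.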

3.6.3.2 NE-RI model

We observed similar results when the data were generated from the NE-RI model. In Supplemental Figure S2, we present the percentage error of model-based SE to empirical SE for both true model and misspecified models with CR0 and CR3.

3.6.4 Prevalence of non-convergence

In our simulations, we excluded results when models did not converge. Table 5 summarizes the percentage of non-convergence for each trial configuration by model. With all other parameters held fixed, the pattern differed by model: for the NE-RI model under the default optimizer, non-convergence tended to increase with the cluster-period size K and with the number of clusters, whereas for the DTD-RI model it decreased as either K or the number of clusters increased. The proportion of non-convergence was as high as 39% for DTD-RI when the number of clusters and cluster-period sizes were both small (in the most extreme scenario we investigated). Under the default optimizer (“nloptwrap”), non-convergence for the NE-RI models was as high as 15%. However, with bound optimization by quadratic approximation (“bobyqa” optimizer), the non-convergence rate was close to zero, suggesting that this is a promising optimization routine for fitting more complex linear mixed models in SW-CRTs. As the “nloptwrap” option by default also wraps an implementation of the BOBYQA algorithm, the better convergence rate may be explained by the different optimization tolerance settings used by “bobyqa.”

Table 5.
Percentage of non-convergence for fitting the true model across 2000 simulations.

I	S	WPICC (control, intervention)	K	NE-RI^a	NE-RI^b	DTD-RI
8	4	(0.01, 0.05)	10	0.3%	0.0%	39.4%
8	4	(0.01, 0.05)	100	3.2%	0.2%	23.1%
8	4	(0.05, 0.10)	10	0.9%	0.2%	30.8%
8	4	(0.05, 0.10)	100	1.9%	0.1%	1.7%
8	4	(0.05, 0.15)	10	1.0%	0.0%	30.4%
8	4	(0.05, 0.15)	100	1.4%	0.2%	1.7%
8	8	(0.01, 0.05)	10	1.2%	0.2%	27.0%
8	8	(0.01, 0.05)	100	4.5%	0.1%	7.9%
8	8	(0.05, 0.10)	10	1.8%	0.2%	12.8%
8	8	(0.05, 0.10)	100	3.7%	0.0%	0.1%
8	8	(0.05, 0.15)	10	1.2%	0.1%	13.5%
8	8	(0.05, 0.15)	100	3.4%	0.0%	0.0%
16	4	(0.01, 0.05)	10	1.2%	0.0%	34.6%
16	4	(0.01, 0.05)	100	4.8%	0.1%	12.5%
16	4	(0.05, 0.10)	10	1.3%	0.2%	20.4%
16	4	(0.05, 0.10)	100	2.6%	0.0%	0.0%
16	4	(0.05, 0.15)	10	1.1%	0.0%	21.1%
16	4	(0.05, 0.15)	100	2.5%	0.0%	0.0%
16	8	(0.01, 0.05)	10	1.6%	0.1%	21.0%
16	8	(0.01, 0.05)	100	8.2%	0.1%	0.9%
16	8	(0.05, 0.10)	10	2.6%	0.1%	5.4%
16	8	(0.05, 0.10)	100	6.1%	0.0%	0.0%
16	8	(0.05, 0.15)	10	1.6%	0.1%	5.7%
16	8	(0.05, 0.15)	100	6.6%	0.0%	0.0%
32	4	(0.01, 0.05)	10	1.8%	0.1%	31.9%
32	4	(0.01, 0.05)	100	7.7%	0.1%	4.6%
32	4	(0.05, 0.10)	10	1.4%	0.2%	10.9%
32	4	(0.05, 0.10)	100	4.9%	0.0%	0.0%
32	4	(0.05, 0.15)	10	1.1%	0.2%	12.3%
32	4	(0.05, 0.15)	100	5.9%	0.0%	0.0%
32	8	(0.01, 0.05)	10	2.7%	0.1%	17.9%
32	8	(0.01, 0.05)	100	14.5%	0.0%	0.1%
32	8	(0.05, 0.10)	10	3.4%	0.0%	1.3%
32	8	(0.05, 0.10)	100	12.7%	0.0%	0.0%
32	8	(0.05, 0.15)	10	2.7%	0.1%	1.4%
32	8	(0.05, 0.15)	100	11.3%	0.0%	0.0%

I: number of clusters; S: number of sequences; WPICC: within-period ICC; K: cluster-period size; DTD-RI: discrete-time decay + random intervention model; NE-RI: nested exchangeable + random intervention model.

^a Linear mixed effect model with optimizer “nloptwrap.”

^b Linear mixed effect model with optimizer “bobyqa.”

3.6.5 Additional evaluations

As suggested by reviewers, we conducted additional analysis with selected scenarios to evaluate the statistical power of linear mixed models with RVEs. In addition, we compared the performance of the linear mixed model with RVE with alternative robust analytical methods (GEE and permutation test) that researchers proposed in the literature.

3.6.5.1 Statistical power

As an additional set of evaluations, we compared power when using linear mixed models with CR3 and the true model under selected scenarios. Specifically, we considered 6 scenarios for each data generator (12 scenarios in total), including a combination of various factors such as the number of clusters (8 and 32), sequences (4), and ICC values ((0.01, 0.05), (0.05, 0.10), (0.05, 0.15)). The results are presented in Supplemental Table S5. We primarily focus on CR3 because it has comparable coverage probabilities to the true model and the size of the test is generally well maintained to the nominal level. We observed that the loss of statistical power in the RVE (CR3) model is minimal when the number of clusters is 32 and is less than 10% when the number of clusters is only eight.

3.6.5.2 Comparison with GEE and permutation tests

Both GEE and permutation tests have been used in stepped wedge trials for robust inference and to protect against potential random-effects misspecification.^54,59 Under the 12 scenarios described in the previous section, we conducted additional simulations (Supplemental Table S6), which showed that bias and coverage probabilities under linear mixed models (CR2 and CR3) were very similar to those under GEE with the Kauermann and Carroll (KC) and Mancl and DeRouen (MD) variance estimators.

Similarly, we compared the coverage probabilities under linear mixed models (CR2 and CR3) with the permutation test. We implemented the permutation test for SW-CRTs suggested by Ren et al.⁶⁰ The test statistic was defined as the estimated treatment effect under the fitted linear mixed effect model. The permutation test was implemented as follows. We first estimated the test statistic from the observed allocation. Then, by randomly “shuffling” the randomized treatment sequences among all clusters, we generated 1000 permutations from all possible randomized allocations. The permutation distribution of the test statistic was obtained by estimating the treatment effects from the permuted datasets. We rejected the null hypothesis if the test statistic computed from the observed dataset was either smaller than the 2.5th percentile or larger than the 97.5th percentile of the permutation distribution. The whole process was repeated with 500 simulated datasets. The results, presented in Supplemental Table S7, show that the permutation test consistently gave inflated type I errors. These results are expected as we were investigating the performance of the permutation test when the data-generating process includes a random treatment effect; therefore, in each simulated dataset, only the weak null holds but not the sharp null. While the permutation test should still be valid as advertised with respect to a sharp null, it can carry an inflated type I error rate when only the weak null holds.
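The permutation procedure described above can be sketched as follows. This is a minimal Python illustration and not the implementation used in the study: to keep it self-contained, the test statistic is a simple average of within-period differences in cluster-period means, which stands in for the treatment effect estimated from a fitted linear mixed model. The toy data (eight clusters, four sequences, five periods, no noise, true effect of 5) and all function names are hypothetical.

```python
import random

def sw_design(sequences, n_periods):
    """Treatment indicators: a cluster in sequence s is treated from period s on
    (periods are 0-indexed, so period 0 is an all-control baseline)."""
    return [[1 if j >= s else 0 for j in range(n_periods)] for s in sequences]

def within_period_effect(y, x):
    """Stand-in statistic: mean over mixed periods of (treated - control) means."""
    diffs = []
    for j in range(len(y[0])):
        treated = [y[i][j] for i in range(len(y)) if x[i][j] == 1]
        control = [y[i][j] for i in range(len(y)) if x[i][j] == 0]
        if treated and control:  # skip periods where all clusters share one arm
            diffs.append(sum(treated) / len(treated) - sum(control) / len(control))
    return sum(diffs) / len(diffs)

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def permutation_test(y, sequences, n_periods, n_perm=1000, seed=1):
    """Shuffle sequences among clusters and apply the two-sided percentile rule."""
    rng = random.Random(seed)
    observed = within_period_effect(y, sw_design(sequences, n_periods))
    perm_stats = []
    for _ in range(n_perm):
        shuffled = sequences[:]
        rng.shuffle(shuffled)  # re-randomize sequences; outcomes stay fixed
        perm_stats.append(within_period_effect(y, sw_design(shuffled, n_periods)))
    perm_stats.sort()
    reject = (observed < percentile(perm_stats, 2.5)
              or observed > percentile(perm_stats, 97.5))
    return observed, reject

# Toy data: period trend of 0.1 per period plus a treatment effect of 5
sequences = [1, 1, 2, 2, 3, 3, 4, 4]
n_periods = 5
design = sw_design(sequences, n_periods)
y = [[0.1 * j + 5.0 * design[i][j] for j in range(n_periods)]
     for i in range(len(sequences))]
observed, reject = permutation_test(y, sequences, n_periods)
print(observed, reject)
```

In a simulation study, this test would be applied to each simulated dataset and the rejection proportion recorded; replacing `within_period_effect` with a model-based estimate recovers the structure of the procedure described in the text.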

4 Illustrative example

This section presents a real example to demonstrate the application of the different models considered in this article. OXTEXT-7 was an open cohort SW-CRT involving 11 community mental health teams in the Oxford Health NHS Foundation Trust to assess the effectiveness of a program: “Feeling Well with TrueColours” (an intervention originally used for individuals with bipolar disorder) in producing better health outcomes for participants in their care.^61,62 Teams were assigned at random to receive the intervention throughout 16 months. The Health of the Nation Outcome Scales (HoNOS) total score (continuous) was the primary outcome and was measured monthly, primarily on different people. There were 4595 observations with an average cluster period size of 26. Since the expected churn rate is close to 1 (0.91), for simplicity, we treated this trial as a cross-sectional design in the analysis.⁶³

Table 6 shows the estimated time-invariant treatment effect, model-based SE, and 95% CI for the treatment effect for OXTEXT-7 under all models investigated in the simulations. For models without RVEs, the NE model produced higher SEs. Even though the point estimates slightly differed across the two models, all 95% CIs included zero. Clearly, SEs under EXCH and NE models without RVE were smaller than with any RVE. In this case, it might suggest some degree of random-effect structures misspecification, and hence overly optimistic model-based SE estimates. The RVE results were generally similar across working model specifications for this application. When adding the RVEs, the SE correction for small samples resembled what we saw in our simulations. Not surprisingly, the standard error produced by CR0 was the smallest, but our simulations show that this choice could lead to under-coverage with a small number of clusters, and CR3 is the recommended method. RVEs were generally larger in the NE model (which may be explained by the larger number of random cluster-by-time interactions required by the NE model); however, the CIs were generally similar. When we fitted the RI model, the model-based SE was still much smaller than that under the RVE. This may suggest that the true data generator deviates from the RI model, and based on our simulation results, the RVE, in theory, should be consistent regardless of the random-effects structure specification (assuming the correct fixed-effects specification such that the true treatment effect is time-invariant) and serves as an objective reflection of the uncertainty in the treatment effect estimates. In this example, the DTD-RI model failed to converge. Since the number of periods was larger than the number of clusters, some corrections (e.g. CR1P) would not be available if we treated the period as a categorical variable. 
One could treat the period as a continuous variable in the analytic model to obtain CR1P, and the impact of this change, particularly for the RVE, is worthy of future investigation.^40,64,65

Table 6.
Example OXTEXT-7 trial analyzed with each method investigated.

Model	Estimated time-invariant treatment effect	Estimated SE	95% confidence interval
Exchangeable^b	0.066	0.402	(−0.843, 0.975)
Exchangeable + CR0	0.066	0.651	(−1.407, 1.539)
Exchangeable + CR1	0.066	0.683	(−1.479, 1.611)
Exchangeable + CR1P^a	0.066	–	–
Exchangeable + CR1S	0.066	0.684	(−1.481, 1.613)
Exchangeable + CR2	0.066	0.706	(−1.531, 1.663)
Exchangeable + CR3	0.066	0.767	(−1.669, 1.801)

Nested exchangeable^b	0.238	0.487	(−0.864, 1.340)
Nested exchangeable + CR0	0.238	0.711	(−1.370, 1.846)
Nested exchangeable + CR1	0.238	0.746	(−1.450, 1.926)
Nested exchangeable + CR1P^a	0.238	–	–
Nested exchangeable + CR1S	0.238	0.747	(−1.452, 1.928)
Nested exchangeable + CR2	0.238	0.762	(−1.486, 1.962)
Nested exchangeable + CR3	0.238	0.820	(−1.617, 2.093)

NE-RI^b	0.322	0.504	(−0.818, 1.462)
DTD-RI	Did not converge

NE-RI: nested exchangeable + random intervention model; DTD-RI: discrete-time decay + random intervention model; RVE: robust variance estimator; SE: standard error.

^a CR1P is not available as the number of parameters in the model is greater than the number of clusters.

^b For the model without RVE, the model-based SE was presented.
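The intervals reported in Table 6 appear to be consistent with a $t$ reference distribution with $I - 2 = 9$ DoF (11 teams), as used in the simulations. The following minimal Python sketch recomputes three of them from the reported estimates and SEs; the critical value is hard-coded from standard $t$ tables, and the function name `t_interval` is illustrative.

```python
# 97.5th percentile of the t distribution with 9 degrees of freedom
# (I - 2 with I = 11 clusters), taken from standard t tables.
T_CRIT_DF9 = 2.262157

def t_interval(estimate, se, t_crit=T_CRIT_DF9):
    """Two-sided 95% interval: estimate +/- t_crit * se, rounded to 3 decimals."""
    half_width = t_crit * se
    return (round(estimate - half_width, 3), round(estimate + half_width, 3))

# (estimate, SE) pairs as reported in Table 6
rows = {
    "Exchangeable (model-based SE)": (0.066, 0.402),
    "Exchangeable + CR3": (0.066, 0.767),
    "Nested exchangeable + CR3": (0.238, 0.820),
}
for label, (estimate, se) in rows.items():
    print(label, t_interval(estimate, se))
```

For these rows the recomputed limits agree with the tabulated ones to three decimals, for example (−1.669, 1.801) for the exchangeable model with CR3.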

One caveat in our illustrative application is that we have assumed that the true treatment effect structure is constant and time-invariant. Although this is the primary focus of our simulation study, this simple treatment effect structure might not hold true for OXTEXT-7, as implied in the reanalysis of Nickless et al.⁶² We wish to reiterate that the use of RVEs in linear mixed models may not address the systematic bias that may arise from the incorrect specification of the treatment effect structure (such as specifying a time-invariant treatment effect in the presence of a true exposure-time-specific treatment effect).^66,67 We leave a comprehensive examination of small sample variance corrections under the general time-on-treatment linear mixed model for future research.

5 Discussion

5.1 Summary of main findings

We examined by simulation whether the use of RVEs in linear mixed models can help maintain valid statistical inference under random-effects structure misspecification when the true data are generated under either DTD-RI or NE-RI. Furthermore, we examined the performance of RVEs under five small-sample corrections (available in the R package “clubSandwich”) when the number of clusters is limited, which is common in SW-CRTs. We confirmed, as expected with linear mixed models, that misspecification of the correlation structure generally does not introduce bias to the estimated treatment effects. However, random effects misspecification substantially affected the standard errors of estimated treatment effects. This was reflected in large relative errors of model SEs and under-coverage. When the linear mixed model was misspecified, model-based SEs generally underestimated the (true) Monte Carlo SE by 60%. The coverage probability of 95% confidence intervals for the estimated treatment effect dropped to around 50% in misspecified models without RVEs. Similar results have been noticed in previous studies. For example, Kasza and Forbes²⁴ showed analytically that fitting a simpler correlation structure would lead to an underestimated model-based SE. Our simulation results also confirmed results by Bowden et al.²³ and Voldal et al.²¹ that failing to include random treatment effects would lead to an underestimated (model-based) variance of the treatment effect causing lower coverage (or equivalently, inflation of type I error rates). Beyond confirming these findings, our simulations provide compelling evidence that adding RVEs generally improved the validity of statistical inference for a misspecified linear mixed model. This has not been thoroughly discussed in previous literature in the context of SW-CRTs. 
It is important to notice that even a simple exchangeable model with RVEs (with small-sample correction) has acceptable performance with respect to bias and coverage in SW-CRTs. Adding extra random effects (e.g. using an NE model) did not appreciably improve the model performance in terms of bias and coverage. Regardless of the true data generation mechanism, the performance of different types of RVEs was very similar. Finally, we also wish to point out that although we have only considered DTD-RI and NE-RI as two data generators, our findings should be generalizable to other types of linear mixed data generators as the RVE, in theory, converges to the true variance of the treatment effect estimator under arbitrary random-effects misspecification (assuming the same fixed-effect model).

In the stepped wedge design literature, it is well-known under the GEE framework that the standard RVE could underestimate the variance when the number of clusters is small.⁶⁸ Unfortunately, this has not been as widely appreciated for linear mixed models and has led to reservations about adopting linear mixed models with a simple random-effects structure for data analysis. Our study has empirically demonstrated that the RVE can help maintain the validity of inference under a potentially misspecified linear mixed model, and to some extent, provides a path forward in practice when it might be challenging to correctly specify a complicated random-effects structure for SW-CRTs. In our simulations, the number of clusters was the most important factor that affected the performance of RVEs. The relative error of the model SE produced by using standard RVE (CR0) without small sample correction could still be as high as −22%, and the coverage probability could be as low as 83%. In fact, with fewer than 32 clusters, we found that small-sample corrections were essential. Among five types of small-sample correction methods (CR1, CR1P, CR1S, CR2, and CR3), when a trial has a relatively large number of clusters (e.g. 32), all types of small-sample corrections had similar performance. With a small number of clusters, CR2 performed reasonably well and was clearly better than the others but exhibited some mild under-coverage. Using CR3 generally led to similar performance to the true model, small relative errors of model-based SEs, and coverage probabilities close to 95%. In our simulations, we found that models with CR3 (DoF = $I - 2$ ) performed adequately but tended to have conservative coverage probabilities when the number of clusters was only 8. 
This observation is consistent with prior findings in the GEE literature.⁵⁴ The use of Satterthwaite DoF seems to be more prevalent in practice for linear mixed models, but we found that it generally led to coverage probabilities lower than the nominal level. Although the DoF correction $I - 2$ has previously been implemented in a GEE framework with RVE, we have shown that it is equally applicable to linear mixed models with RVEs, and indeed we have empirically demonstrated its appropriate performance in our simulations with a small number of clusters.

5.2 Recommendations and limitations

To the best of our knowledge, our study is the first to comprehensively examine the performance of misspecified random-effects structures in linear mixed models with RVEs in the context of SW-CRTs. When the true correlation structure is unknown, we found that a random intercept model with standard RVE (CR0) is usually sufficient to achieve valid inference with at least 32 clusters. A small-sample correction (e.g. CR3) is highly recommended for cases with fewer than 32 clusters. With the number of clusters minus two DoF correction, CR3 achieved appropriate 95% CI coverage with as few as eight clusters, at the expense of a slight overestimation of SEs. Putting our results in the context of the SW-CRT literature, we notice that our findings are slightly different from those obtained from previous simulations of SW-CRTs with GEE estimators. To give a few examples, Li et al.⁴¹ and Li⁴⁷ have found that the Kauermann and Carroll variance estimator (the equivalent of CR2 under GEE inference), coupled with the DoF $I - 2$ , can maintain nominal type I error rates for GEE analysis of small SW-CRTs; Ford and Westgate⁵⁴ found that an average of the Kauermann and Carroll SE and Mancl and DeRouen SE (the equivalent of CR3 under GEE inference) estimators, coupled with the DoF $I - 2$ , can maintain nominal type I error rates for GEE analysis of small SW-CRTs. However, apart from the fact that these prior studies considered a different estimator than ours, they have mostly investigated the performance of RVE under correct specification or approximately correct specification of the within-cluster correlation structure (as opposed to misspecified random-effects structures). In our simulations, CR3 coupled with a $t (I - 2)$ reference distribution is recommended as it maintained the best small-sample inference for misspecified linear mixed models in SW-CRTs.

Historically, GEE is known to be more robust than linear mixed models but usually requires larger numbers of clusters. Our findings empirically demonstrate that linear mixed models may be equally—if not more—competitive than GEE in providing robust analysis of small SW-CRTs. In fact, for continuous outcomes, the performance of GEE and linear mixed models is very similar when the marginal model (fixed effects) is correctly specified: their differences lie in the estimation of nuisance parameters such as the variance components and the correlation parameters. A potential advantage of linear mixed models is that it is generally easier to specify and estimate more complicated random-effects structures, whereas substantial conceptual difficulty and computational burden may be introduced to the formulation and estimation of an equally complex working correlation structure with multiple parameters in GEE.⁴¹ In the analysis of SW-CRTs, there is a potential efficiency advantage in specifying a more complicated correlation structure, especially when the fitted model with RVE is closer to the true model (as also shown in our simulation Supplemental Table S5). For example, when the true model is NE-RI, the nested exchangeable model with RVE has slightly higher power than the exchangeable model with RVE. To specify the former structure beyond simple exchangeability, the linear mixed model is arguably much more accessible and may have a computational advantage over GEE when the cluster size is large. 
On the other hand, GEE inference may enjoy other advantages such as directly reporting the correlation parameters on the natural scale of the outcomes, and may be particularly preferred for the analysis of binary and count outcomes in SW-CRTs with a non-identity link function.^39–41,56 In a pair of recent systematic reviews of 160 published SW-CRTs by Nevins et al.,^26,48 70% (112/160) adopted a mixed-effects model in the primary analysis of the primary outcome, likely due to software accessibility and flexibility in specifying the random-effects structure. However, although not examined in the previous systematic review, our experience is that the RVE is infrequently used in practice as it tends to return anti-conservative inferences with a small number of clusters (i.e. the standard RVE tends to have lower than nominal coverage). Therefore, an important contribution of our study is promoting awareness and empirically demonstrating the use of bias-corrected RVE as a reliable tool for robust inference in misspecified linear mixed models in SW-CRTs.

Previous studies have indicated that design-based inference, such as the permutation test, is a powerful tool to provide robust inference under model misspecification. With a limited number of clusters, one clear advantage of the permutation test is that the inferences are exact such that the type I error rate can often be controlled at the nominal level without the need for small sample corrections. For example, Wang and De Gruttola⁶⁹ and Ji et al.⁷⁰ have considered testing the null hypothesis based on identifying the permutational distribution of the Wald statistic estimated from the simple exchangeable and the nested exchangeable random-effects model. In the same spirit of the use of RVE in our article, they both demonstrated that the specification of the random-effects has little impact on the type I error rate but only affected the power of the permutation test. Alternative test statistics with fewer modeling assumptions in SW-CRTs have subsequently been discussed by Thompson et al.,⁷¹ Kennedy-Shaffer et al.,⁷² Hughes et al.,⁵⁹ under the principle of conditional exchangeability (which often serves as the basis for permutation inference). Zhang and Zhao⁷³ further carefully distinguished between the permutation test and the randomization test, and illustrated the practical importance of this distinction with a SW-CRT example. More recently, Ren et al.⁶⁰ have found that the permutation test may not protect against all types of random effect misspecification. Specifically, they found that when the true model includes a random intervention effect, but the fitted model did not account for that, the type I error rate of the permutation test based on a Wald statistic was inflated. This matches our additional simulation results and is mainly because the treatment sequence affects both the mean and variance structures, leading to violations of conditional exchangeability even under the null based on the working model with a random intervention effect. 
In this scenario, however, the use of RVEs based on a misspecified linear mixed model can successfully preserve the validity of inference, and therefore could be preferred. Finally, it is worth noting that constructing permutation-based confidence intervals can require substantial computation because it involves inverting the permutation test. As an improvement, Rabideau and Wang⁷⁴ recently studied more efficient search algorithms for obtaining permutation confidence intervals for SW-CRTs. In comparison, a confidence interval based on an RVE is simple to construct and is computed essentially instantaneously. Table 7 provides a concise summary of the potential advantages and caveats of using RVEs in mixed-effects models, RVEs in GEE, and permutation tests to provide robust inference in SW-CRTs.

Table 7.

A concise comparison of methods regarding their features for robust analyses of SW-CRTs under model misspecification.

Methods	Potential advantages	Caveats
Mixed-effects models with RVE	• It is convenient to specify more complicated random-effects structures to account for multiple sources of intracluster correlations. • The implementation is available in standard software.	• For a logistic link function and binary outcomes, the interpretation of the fixed-effects parameter may change according to different specifications of the random-effects structure within a generalized linear mixed model. • The use of RVE requires small sample correction with a small number of clusters.
GEE models with RVE	• The estimation of the marginal mean structure and the correlation structure is based on separate estimating equations, and misspecification of the correlation structure does not affect the interpretation of the marginal mean parameters, regardless of the type of the outcome and link function. • GEE with a simple exchangeable correlation structure is implemented in standard software.	• Recently developed software for implementing GEE with more complicated correlation structures in SW-CRTs may be limited to certain two-parameter or three-parameter correlation structures. • When specifying a more complicated correlation structure, it is likely that one needs to provide the design matrix for the correlation estimating equations, which requires careful coding. • The computation is often intensive with a complex correlation structure when the cluster size becomes large. • The use of RVE requires small sample correction with a small number of clusters.
Permutation inference	• Under conditional exchangeability, the inference is exact and does not require small sample correction even when there is a small number of clusters.	• Estimating the permutation distribution of the test statistic may require more computation due to repeatedly fitting multilevel models under permuted treatment assignments. • Estimation of the permutation confidence interval is not trivial and requires substantial computation from a search algorithm.

RVE: robust variance estimator; SW-CRTs: stepped-wedge cluster randomized trials; GEE: generalized estimating equation.
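The permutation approach summarized in Table 7 can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any cited study: the test statistic is the least-squares treatment coefficient from a model with period indicators (rather than a Wald statistic from a mixed model), sequence labels are shuffled uniformly across clusters, and the names `treatment_estimate` and `permutation_test` are hypothetical.

```python
import numpy as np


def treatment_estimate(y, cluster, period, seq_of_cluster):
    """Least-squares treatment coefficient, adjusting for period indicators."""
    n_periods = int(period.max()) + 1
    treat = (period > seq_of_cluster[cluster]).astype(float)
    P = (period[:, None] == np.arange(n_periods)[None, :]).astype(float)
    X = np.column_stack([P, treat])
    return np.linalg.lstsq(X, y, rcond=None)[0][-1]


def permutation_test(y, cluster, period, seq_of_cluster, n_perm=1000, seed=0):
    """Two-sided Monte Carlo permutation p-value for the treatment effect."""
    rng = np.random.default_rng(seed)
    observed = treatment_estimate(y, cluster, period, seq_of_cluster)
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(seq_of_cluster)   # reassign sequences to clusters
        stat = treatment_estimate(y, cluster, period, shuffled)
        exceed += abs(stat) >= abs(observed)
    # Adding 1 to numerator and denominator keeps the Monte Carlo p-value above zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Illustrative null data (assumed values): 8 clusters, 5 periods, 10 per cluster-period.
n_clusters, n_periods, m = 8, 5, 10
seq_of_cluster = np.arange(n_clusters) % (n_periods - 1)
cluster = np.repeat(np.arange(n_clusters), n_periods * m)
period = np.tile(np.repeat(np.arange(n_periods), m), n_clusters)

rng = np.random.default_rng(7)
alpha = rng.normal(0.0, 0.5, size=n_clusters)
y = 0.1 * period + alpha[cluster] + rng.normal(0.0, 1.0, size=len(cluster))

estimate, p_value = permutation_test(y, cluster, period, seq_of_cluster, n_perm=500)
print(f"estimate: {estimate:.3f}, permutation p-value: {p_value:.3f}")
```

Each permutation requires refitting the model, which is the computational cost noted in Table 7. A confidence interval would additionally require repeating this test over a grid or search of hypothesized effect values.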

In practice, we still maintain the usual standpoint that the random-effects structure in SW-CRTs should be specified carefully, for example because doing so yields the intra-cluster correlation coefficients that are essential for future trial planning,¹⁴ or because the treatment effect estimator may be more efficient when the specified random-effects structure approximates the truth. Nevertheless, RVEs are recommended when a more complex model cannot be fit or fails to converge, or when researchers are strongly concerned about model misspecification. It is also recommended to fit a model with an RVE as a sensitivity analysis even if researchers have a strong belief about the true correlation structure; in such a case, one would expect the results to be similar between the model-based variance estimator and the RVE if the number of clusters is large enough. We also conjecture that, because treatment effects in other types of multiple-period cluster randomized trial designs are often estimated via linear mixed models, our results can be generalized to such other types of CRTs (such as longitudinal parallel-arm CRTs and cluster randomized crossover trials).

Non-convergence is another practical issue when the fitted linear mixed model involves a complicated random-effects structure. We observed a non-convergence rate of more than 30% for the DTD-RI model when the trial was small. In practice, the DTD-RI model is still encouraged if the investigators believe it reflects the true random-effects correlation structure. In case of convergence failure, investigators could change the default optimizer or follow the troubleshooting documentation included in the “glmmTMB” package. SAS/STAT users may follow some of the strategies described in an instruction document by Kiernan et al.⁷⁵

Our study has several limitations. First, although we have demonstrated that appropriate inference for treatment effects can be obtained from misspecified random-effects structures in linear mixed models by using the RVE, we wish to point out that the estimates of the random-effects variance components from misspecified models will no longer be valid. If the goal is to obtain valid ICC estimates (e.g. to inform future sample size calculations), then the interpretation of the variance component and ICC estimates from a misspecified working model can become challenging.^21,23,76 An exception is discussed by Kasza et al.,⁷⁶ where one may obtain decay parameter estimates from variance component estimates under a misspecified exchangeable random-effects model. Second, we only considered continuous outcomes as RVEs for non-Gaussian outcomes are currently not available across all software packages. For example, the “clubSandwich” R package does not support non-Gaussian outcomes. Even though SAS GLIMMIX allows for RVE, the estimation routine for a generalized linear mixed model proceeds by linearization and should be considered as pseudo-GEE. However, a more conceptual challenge with using misspecified generalized linear mixed models with binary outcomes, and particularly with a logit link, is that the treatment effect coefficient may carry a different interpretation when the random-effects structure is misspecified, due to non-collapsibility.⁷⁷ Future studies should be carried out to reconcile the challenges in using misspecified generalized linear mixed models in SW-CRTs. Despite all these complexities, the linear mixed models we considered can still be used to estimate the treatment effect on the risk difference scale for binary outcomes, in which case the RVE becomes essential as it can simultaneously account for a misspecified variance function and potentially misspecified random-effects structure. 
Third, when considering the models used to generate the data, we assumed the DTD-RI was the most complex model. In fact, more complex correlation structures, such as Toeplitz and unstructured, have been discussed previously.³¹ Although RVE is in theory valid under more complex linear mixed data generators, its performance under such scenarios was not examined in this study. Fourth, in this study, we still found that when the number of clusters is small (e.g. eight), RVE with small-sample correction tended to have slightly inflated type I error rates. Under the GEE framework, alternative small-sample corrected RVEs (such as that due to Fay and Graubard⁷⁸) have been studied, but under relatively restrictive data-generating processes.^27,79–82 However, such a bias-reduced estimator has not been widely implemented for linear mixed models and its performance with a potentially misspecified mixed model is unclear. Currently, the Fay and Graubard⁷⁸ variance estimator for linear mixed models is only available in SAS PROC GLIMMIX (which fits mixed models via a GEE-like routine rather than a maximum likelihood routine). Future studies may consider evaluating this additional RVE for linear mixed models fitted in SAS PROC GLIMMIX. Lastly, we only discussed cross-sectional designs. The cohort design requires additional random effects for repeated measures on each individual. We expect that our results can be extended to cohort designs, but future studies should investigate the performance of each RVE when additional random effects are specified for repeated measurements.
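The non-collapsibility issue raised in the second limitation can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the study: it takes a logistic model with a single normally distributed random intercept, uses illustrative parameter values, and introduces a hypothetical helper `marginal_log_odds_ratio` that averages the conditional probabilities over the random intercept by Gauss-Hermite quadrature.

```python
import numpy as np


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def marginal_log_odds_ratio(beta0, beta1, tau, n_nodes=40):
    """Population-averaged log odds ratio implied by a random-intercept logistic model.

    beta1 is the cluster-specific (conditional) log odds ratio and tau is the
    standard deviation of the normal random intercept.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * tau * nodes          # random-intercept values at the nodes
    w = weights / np.sqrt(np.pi)            # weights for a standard normal average
    p0 = np.sum(w * expit(beta0 + b))       # marginal risk under control
    p1 = np.sum(w * expit(beta0 + beta1 + b))  # marginal risk under intervention
    return np.log(p1 / (1.0 - p1)) - np.log(p0 / (1.0 - p0))


# Illustrative values (assumed): conditional log odds ratio of 1.
for tau in (0.0, 1.0, 2.0):
    print(f"tau = {tau:.1f}: marginal log OR = {marginal_log_odds_ratio(-0.5, 1.0, tau):.3f}")
```

With no random-intercept variance the marginal and conditional log odds ratios coincide, and they diverge as the variance grows. This is why the meaning of the treatment coefficient in a logistic mixed model depends on the random-effects structure that is assumed, whereas no such discrepancy arises for the identity link used in linear mixed models.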

Supplemental Material

Supplemental material (files sj-xlsx-1-smm-10.1177_09622802241248382 and sj-pdf-2-smm-10.1177_09622802241248382) for Maintaining the validity of inference from linear mixed models in stepped-wedge cluster randomized trials under misspecified random-effects structures by Yongdong Ouyang, Monica Taljaard, Andrew B Forbes and Fan Li in Statistical Methods in Medical Research

Footnotes

Acknowledgements

This research was enabled in part by support provided by Compute Ontario (computeontario.ca) and the Digital Research Alliance of Canada (alliancecan.ca).

Data availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study. Some code used for simulation is available at

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: MT and FL are supported by the National Institute of Aging (NIA) of the National Institutes of Health (NIH) under Award Number U54AG063546, which funds NIA Imbedded Pragmatic Alzheimer's Disease and AD-Related Dementias Clinical Trials Collaboratory (NIA IMPACT Collaboratory). Research in this article was also partially supported by a Patient-Centered Outcomes Research Institute Award® (PCORI® Award ME-2022C2-27676, to FL). YO is funded by the Canadian Institutes of Health Research Health System Impact Postdoc Fellowship. The statements presented in this article are solely the responsibility of the authors and do not necessarily represent the official views of NIH, PCORI®, or its Board of Governors or Methodology Committee.

ORCID iDs

Yongdong Ouyang

Fan Li

Supplemental material

Supplemental material for this article is available online.

References

1. Donner. Design and analysis of cluster randomization trials in health research. London: Arnold, 2000.

2. Donner, Koval. Design considerations in the estimation of intraclass correlation. Ann Hum Genet 1982; 46: 271–277.

3. Brown, Lilford. The stepped wedge trial design: a systematic review. BMC Med Res Methodol 2006; 6: 54.

4. Hussey, Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 2007; 28: 182–191.

5. Hughes, Hemming, et al. Mixed-effects models for the design and analysis of stepped wedge cluster randomized trials: an overview. Stat Methods Med Res 2021; 30: 612–639.

6. Hooper, Teerenstra, de Hoop, et al. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Stat Med 2016; 35: 4718–4728.

7. Girling, Hemming. Statistical efficiency and optimal design for stepped cluster studies under linear mixed effects models. Stat Med 2016; 35: 2149–2166.

8. Kasza, Hemming, Hooper, et al. Impact of non-uniform correlation structure on sample size and power in multiple-period cluster randomised trials. Stat Methods Med Res 2019; 28: 703–716.

9. Hemming, Kasza, Hooper, et al. A tutorial on sample size calculation for multiple-period cluster randomized parallel, cross-over and stepped-wedge trials using the shiny CRT calculator. Int J Epidemiol. Epub ahead of print 22 February 2020. DOI: 10.1093/ije/dyz237.

10. Ouyang, Karim, et al. CRTpowerdist: an R package to calculate attained power and construct the power distribution for cross-sectional stepped-wedge and parallel cluster randomized trials. Comput Methods Programs Biomed 2021; 208: 106255.

11. Chen, Zhou, et al. Swdpwr: a SAS macro and an R package for power calculations in stepped wedge cluster randomized trials. Comput Methods Programs Biomed 2022; 213: 106522.

12. Voldal, Hakhu, Xia, et al. swCRTdesign: an R package for stepped wedge trial design and analysis. Comput Methods Programs Biomed 2020; 196: 105514.

13. Ouyang, Preisser, et al. Sample size calculators for planning stepped-wedge cluster randomized trials: a review and comparison. Int J Epidemiol 2022; dyac123: 2000–2013.

14. Ouyang, Hemming, et al. Estimating intra-cluster correlation coefficients for planning longitudinal cluster randomized trials: a tutorial. Int J Epidemiol 2023; dyad062: 1634–1647.

15. Hughes, Granston, Heagerty. Current issues in the design and analysis of stepped wedge trials. Contemp Clin Trials 2015; 45: 55–60.

16. Hemming, Taljaard, Forbes. Analysis of cluster randomised stepped wedge trials with repeated cross-sectional samples. Trials 2017; 18: 101.

17. Drikvandi, Verbeke, Molenberghs. Diagnosing misspecification of the random-effects distribution in mixed models. Biometrics 2017; 73: 63–71.

18. Litière, Alonso, Molenberghs. Type I and type II error under random-effects misspecification in generalized linear mixed models. Biometrics 2007; 63: 1038–1044.

19. Heagerty, Kurland. Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika 2001; 88: 973–985.

20. Hui FKC, Müller, Welsh. Random effects misspecification can have severe consequences for random effects inference in linear mixed models. Int Stat Rev 2021; 89: 186–206.

21. Voldal, Xia, Kenny, et al. Model misspecification in stepped wedge trials: random effects for time or treatment. Stat Med 2022; 41: 1751–1766.

22. Rezaei-Darzi, Kasza, Forbes, et al. Use of information criteria for selecting a correlation structure for longitudinal cluster randomised trials. Clin Trials 2022; 19: 316–325.

23. Bowden, Forbes, Kasza. Inference for the treatment effect in longitudinal cluster randomized trials when treatment effect heterogeneity is ignored. Stat Methods Med Res 2021; 30: 2503–2525.

24. Kasza, Forbes. Inference for the treatment effect in multiple-period cluster randomised trials when random effect correlation structure is misspecified. Stat Methods Med Res 2019; 28: 3112–3122.

25. Nevins, Nicholls, Ouyang, et al. Reporting of and explanations for under-recruitment and over-recruitment in pragmatic trials: a secondary analysis of a database of primary trial reports published from 2014 to 2019. BMJ Open 2022; 12: e067656.

26. Nevins, Davis-Plourde, Macedo, et al. A scoping review described diversity in methods of randomization and reporting of baseline balance in stepped-wedge cluster randomized trials. J Clin Epidemiol. Epub ahead of print 15 March 2023. DOI: 10.1016/j.jclinepi.2023.03.010.

27. Thompson, Hemming, Forbes, et al. Comparison of small-sample standard-error corrections for generalised estimating equations in stepped wedge cluster randomised trials with a binary outcome: a simulation study. Stat Methods Med Res 2021; 30: 425–439.

28. Liang K-Y, Zeger. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.

29. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. Proc Fifth Berkeley Symp Math Stat Probab Vol 1 Stat 1967; 5: 221–234.

30. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 1980; 48: 817–838.

31. Ouyang, Kulkarni, Protopopoff, et al. Accounting for complex intracluster correlations in longitudinal cluster randomized trials: a case study in malaria vector control. BMC Med Res Methodol 2023; 23: 64.

32. Barker, McElduff, D’Este, et al. Stepped wedge cluster randomised trials: a review of the statistical methodology used and available. BMC Med Res Methodol 2016; 16: 69.

33. Huang, Wiedermann, Zhang. Accounting for heteroskedasticity resulting from between-group differences in multilevel models. Multivar Behav Res 2022; 0: 1–21.

34. Cheong, Fotiu, Raudenbush. Efficiency and robustness of alternative estimators for two- and three-level models: the case of NAEP. J Educ Behav Stat 2001; 26: 411–429.

35. Pustejovsky, Tipton. Small-sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. J Bus Econ Stat 2018; 36: 672–683.

36. Mancl, DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics 2001; 57: 126–134.

37. Pustejovsky. clubSandwich: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections, https://CRAN.R-project.org/package=clubSandwich (2022, accessed 6 November 2022).

38. Maas CJM, Hox. The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Comput Stat Data Anal 2004; 46: 427–440.

39. Turner, Preisser. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics 2018; 74: 1450–1458.

40. Zhang, Preisser, Turner, et al. A general method for calculating power for GEE analysis of complete and incomplete stepped wedge cluster randomized trials. Stat Methods Med Res 2023; 32: 71–87.

41. Rathouz, et al. Marginal modeling of cluster-period means and intraclass correlations in stepped wedge designs with binary outcomes. Biostatistics 2022; 23: 772–788.

42. Paik. Repeated measurement analysis for nonnormal data in small samples. Commun Stat - Simul Comput 1988; 17: 1155–1171.

43. Piedmonte, Williams. Small sample validity of latent variable models for correlated binary data. Commun Stat - Simul Comput 1994; 23: 243–269.

44. Lipsitz, Laird, Harrington. Using the jackknife to estimate the variance of regression estimators from repeated measures studies. Commun Stat - Theory Methods 1990; 19: 821–845.

45. Bell, McCaffrey. Bias reduction in standard errors for linear regression with multi-stage samples. Stat Can, https://www150.statcan.gc.ca/n1/pub/12-001-x/2002002/article/9058-eng.pdf (2002, accessed 8 November 2022).

46. Kauermann, Carroll. A note on the efficiency of sandwich covariance matrix estimation. J Am Stat Assoc 2001; 96: 1387–1396.

47. Design and analysis considerations for cohort stepped wedge cluster randomized trials with a decay correlation structure. Stat Med 2020; 39: 438–455.

48. Nevins, Ryan, Davis-Plourde, et al. Adherence to key recommendations for design and analysis of stepped-wedge cluster randomized trials: a review of trials published 2016-2022. Clin Trials Lond Engl 2024; 21: 199–210.

49. Korevaar, Kasza, Taljaard, et al. Intra-cluster correlations from the CLustered OUtcome dataset bank to inform the design of longitudinal cluster trials. Clin Trials Lond Engl 2021; 18: 529–540.

50. Kuznetsova, Brockhoff, Christensen RHB, et al. lmerTest: Tests in Linear Mixed Effects Models, https://CRAN.R-project.org/package=lmerTest (2020, accessed 15 November 2022).

51. Brooks, Bolker, Kristensen, et al. glmmTMB: Generalized Linear Mixed Models using Template Model Builder, https://CRAN.R-project.org/package=glmmTMB (2022, accessed 15 November 2022).

52. Powell. The BOBYQA algorithm for bound constrained optimization without derivatives, https://www.semanticscholar.org/paper/The-BOBYQA-algorithm-for-bound-constrained-without-Powell/0d2edc46f81f9a0b0b62937507ad977b46729f64 (2009, accessed 26 November 2023).

53. Morris, White, Crowther. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.

54. Ford, Westgate. Maintaining the validity of inference in small-sample stepped wedge cluster randomized trials with binary outcomes when using generalized estimating equations. Stat Med 2020; 39: 2779–2792.

55. Davis-Plourde, Taljaard. Sample size considerations for stepped wedge designs with subclusters. Biometrics. Epub ahead of print 31 October 2021. DOI: 10.1111/biom.13596.

56. Tian, Preisser, Esserman, et al. Impact of unequal cluster sizes for GEE analyses of stepped wedge cluster randomized trials with binary outcomes. Biom J Biom Z 2022; 64: 419–439.

57. Satterthwaite. An approximate distribution of estimates of variance components. Biom Bull 1946; 2: 110–114.

58. Molenberghs, Verbeke. Linear Mixed Models for Longitudinal Data. New York, NY: Springer. Epub ahead of print 2000. DOI: 10.1007/978-1-4419-0300-6.

59. Hughes, Heagerty, Xia, et al. Robust inference for the stepped wedge design. Biometrics 2020; 76: 119–130.

60. Ren, Hughes, Heagerty. A simulation study of statistical approaches to data analysis in the stepped wedge design. Stat Biosci 2020; 12: 399–415.

61. Bilderbeck, Price, Hinds, et al. OXTEXT: The development and evaluation of a remote monitoring and management service for people with bipolar disorder and other psychiatric disorders. NIHR Report for Programme Grants for Applied Research Programme (Reference Number RP-PG-0108-10087), https://www.journalslibrary.nihr.ac.uk/programmes/pgfar/RP-PG-0108-10087/#/ (2015).

62. Nickless, Voysey, Geddes, et al. Mixed effects approach to the analysis of the stepped wedge cluster randomised trial—Investigating the confounding effect of time through simulation. PLoS ONE 13. Epub ahead of print 13 December 2018. DOI: 10.1371/journal.pone.0208876.

63. Kasza, Hooper, Copas, et al. Sample size and power calculations for open cohort longitudinal cluster randomized trials. Stat Med. Epub ahead of print 4 March 2020. DOI: 10.1002/sim.8519.

64. Ouyang, Karim, Gustafson, et al. Explaining the variation in the attained power of a stepped-wedge trial with unequal cluster sizes. BMC Med Res Methodol 2020; 20: 166.

65. Grantham, Forbes, Heritier, et al. Time parameterizations in cluster randomized trial planning. Am Stat 2020; 74: 184–189.

66. Kenny, Voldal, Xia, et al. Analysis of stepped wedge cluster randomized trials in the presence of a time-varying treatment effect. Stat Med 2022; 41: 4311–4339.

67. Maleyeff, Haneuse, et al. Assessing exposure-time treatment effect heterogeneity in stepped-wedge cluster randomized trials. Biometrics 2023; 79: 2551–2564.

68. Emrich, Piedmonte. On some small sample properties of generalized estimating equation estimates for multivariate dichotomous outcomes. J Stat Comput Simul 1992; 41: 19–29.

69. Wang, De Gruttola. The use of permutation tests for the analysis of parallel and stepped-wedge cluster-randomized trials. Stat Med 2017; 36: 2831–2843.

70. Fink, Robyn, et al. Randomization inference for stepped-wedge cluster-randomized trials: an application to community-based health insurance. Ann Appl Stat 2017; 11: 1–20.

71. Thompson, Davey, Fielding, et al. Robust analysis of stepped wedge trials using cluster-level summaries within periods. Stat Med 2018; 37: 2487–2500.

72. Kennedy-Shaffer, De Gruttola, Lipsitch. Novel methods for the analysis of stepped wedge cluster randomized trials. Stat Med 2020; 39: 815–844.

73. Zhang, Zhao. What is a randomization test? J Am Stat Assoc 2023; 0: 1–15.

74. Rabideau, Wang. Randomization-based inference for a marginal treatment effect in stepped wedge cluster randomized trials. Stat Med 2021; 40: 4442–4456.

75. Kiernan, Tao, Gibbs. Tips and Strategies for Mixed Modeling with SAS/STAT® Procedures, http://support.sas.com/resources/papers/proceedings12/332-2012.pdf (2012).

76. Kasza, Bowden, Ouyang, et al. Does it decay? Obtaining decaying correlation parameter values from previously analysed cluster randomised trials. Stat Methods Med Res 2023; 32: 2123–2134.

77. Thompson, Fielding, Davey, et al. Bias and inference from misspecified mixed-effect models in stepped wedge trial analysis. Stat Med 2017; 36: 3670–3682.

78. Fay, Graubard. Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics 2001; 57: 1198–1206.

79. Dahmen, Ziegler. Generalized estimating equations in controlled clinical trials: hypotheses testing. Biom J 2004; 46: 214–232.

80. Preisser, Qaqish, et al. A comparison of two bias-corrected covariance estimators for generalized estimating equations. Biometrics 2007; 63: 935–941.

81. Scott, deCamp, Juraska, et al. Finite-sample corrected generalized estimating equation of population average treatment effects in stepped wedge cluster randomized trials. Stat Methods Med Res 2017; 26: 583–597.

82. Preisser, Qaqish. Finite sample adjustments in estimating equations and covariance estimators for intracluster correlations. Stat Med 2008; 27: 5764–5785.
