Abstract
Permutation tests are useful in stepped-wedge trials to provide robust statistical tests of intervention-effect estimates. However, the
1 Introduction
Permutation tests are a commonly used nonparametric statistical technique for calculating p-values without making distributional assumptions (Pitman 1937; Eden and Yates 1933; Fisher 1971). In individually randomized trials, they are also valued because they provide exact p-values and confidence intervals and do not rely on large-sample approximations (Ernst 2004); the
While the benefits of permutation tests hold for more complex randomized designs, such as stepped-wedge cluster-randomized trials (SW-CRTs),

Schematics of an SW-CRT. White = time in control condition. Gray = time in intervention condition.
Here we introduce a new command,
2 Technical details
The
2.1 Permutation tests with individual randomization
Details of the permutation test can be found elsewhere, such as in Good (2005); here, we briefly summarize.
In an individually randomized trial, we have a sample of observations, half of which were collected under a control condition and half under an intervention condition. We want to know whether the control and intervention conditions result in different distributions of outcomes. If there is truly no difference between the two conditions, then the assignment of observations to each condition is arbitrary, and an intervention effect can be estimated for any assignment of the observations to the control and intervention conditions. By repeating this process for each unique assignment of observations to conditions, we obtain the exact distribution of the intervention-effect estimator under the null hypothesis of no effect. The p-value, defined as the probability of obtaining an estimate the same as or more extreme than that observed if there is no intervention effect, is then given by the proportion of permuted intervention effects that are the same as or more extreme than that observed.
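The enumeration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up, the estimator is taken to be a difference in means, and the function name `exact_permutation_pvalue` is hypothetical and not part of any command discussed here.

```python
from itertools import combinations

def exact_permutation_pvalue(outcomes, treated):
    """Two-sided exact permutation p-value for a difference in means.

    outcomes: list of numeric outcomes, one per observation.
    treated:  indices of the observations assigned to the intervention.
    """
    n, k = len(outcomes), len(treated)
    total = sum(outcomes)

    def diff_in_means(idx):
        s = sum(outcomes[i] for i in idx)
        return s / k - (total - s) / (n - k)

    observed = diff_in_means(treated)
    # Enumerate every way of assigning k of the n observations to intervention.
    effects = [diff_in_means(idx) for idx in combinations(range(n), k)]
    # Count assignments whose estimate is the same as or more extreme than observed
    # (small tolerance guards against floating-point ties).
    extreme = sum(abs(e) >= abs(observed) - 1e-12 for e in effects)
    return observed, extreme / len(effects)

# Made-up data: eight observations, the last four under intervention.
outcomes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
treated = [4, 5, 6, 7]
observed, p_value = exact_permutation_pvalue(outcomes, treated)
print(observed, p_value)
```

With eight observations there are only 70 possible assignments, so full enumeration is cheap; the number of assignments grows very quickly with sample size.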
Monte Carlo permutations
This process is computationally simplified by randomly sampling a number of permutations from all possible permutations, with or without replacement, a process known as Monte Carlo permutations (Good 2005). Because the permutations are sampled at random, the calculated p-value may differ when the process is repeated with a different set of permutations.
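A minimal sketch of Monte Carlo permutations follows, again with made-up data and a difference in means as an assumed estimator. Assignments are drawn independently, so the same assignment may be drawn more than once (sampling with replacement). The function name and its arguments are hypothetical.

```python
import random

def monte_carlo_pvalue(outcomes, n_treated, observed, n_perm=1000, seed=1):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)  # fixing the seed fixes the set of permutations
    n = len(outcomes)
    total = sum(outcomes)
    extreme = 0
    for _ in range(n_perm):
        idx = rng.sample(range(n), n_treated)  # one random assignment
        s = sum(outcomes[i] for i in idx)
        effect = s / n_treated - (total - s) / (n - n_treated)
        if abs(effect) >= abs(observed) - 1e-12:
            extreme += 1
    return extreme / n_perm

# Made-up data: the first ten observations are under intervention.
outcomes = [3, 5, 4, 6, 8, 7, 5, 9, 6, 10, 2, 4, 5, 3, 6, 4, 7, 5, 3, 6]
observed = sum(outcomes[:10]) / 10 - sum(outcomes[10:]) / 10

p_first = monte_carlo_pvalue(outcomes, 10, observed, seed=1)
p_repeat = monte_carlo_pvalue(outcomes, 10, observed, seed=1)   # identical to p_first
p_other = monte_carlo_pvalue(outcomes, 10, observed, seed=2)    # may differ slightly
print(p_first, p_repeat, p_other)
```

Rerunning with the same seed reproduces the p-value, whereas a different seed draws a different set of permutations and may give a slightly different value.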
Constructing confidence intervals
A confidence interval contains the hypothesized intervention effects that are not rejected at the α level, so the confidence limits are the boundaries beyond which hypothesized intervention effects lead to two-sided p-values less than α. One way to identify the confidence limits is to test several hypothesized intervention effects to see whether the p-value is larger or smaller than α.
A hypothesized intervention effect of θ = θ_A is tested by first subtracting θ_A from observations collected in the intervention condition and then running the permutation test as described above to get a p-value (Good 2005; Rigdon and Hudgens 2015). The random-number seed should be set to the same seed as in the original analysis so that one set of permutations is used throughout the analysis, allowing the confidence intervals and p-value to coincide with one another.
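The shift-and-test procedure can be sketched as follows. This is an illustration only: the data are made up, the estimator is assumed to be a difference in means on an additive scale, candidate values are taken from a coarse grid rather than searched adaptively, and values with p ≥ α are treated as not rejected (a convention chosen for this sketch). The name `shifted_pvalue` is hypothetical.

```python
import random

def shifted_pvalue(outcomes, treated, theta_a, n_perm=1000, seed=1):
    """Monte Carlo p-value for the null hypothesis that the effect equals theta_a."""
    # Subtract theta_a from observations collected under the intervention.
    shifted = [y - theta_a if t else y for y, t in zip(outcomes, treated)]
    n, k = len(shifted), sum(treated)
    total = sum(shifted)

    def diff(idx):
        s = sum(shifted[i] for i in idx)
        return s / k - (total - s) / (n - k)

    observed = diff([i for i, t in enumerate(treated) if t])
    rng = random.Random(seed)  # same seed, so the same permutations for every theta_a
    extreme = sum(
        abs(diff(rng.sample(range(n), k))) >= abs(observed) - 1e-12
        for _ in range(n_perm)
    )
    return extreme / n_perm

# Made-up data: the first ten observations are under intervention (estimate 1.8).
outcomes = [3, 5, 4, 6, 8, 7, 5, 9, 6, 10, 2, 4, 5, 3, 6, 4, 7, 5, 3, 6]
treated = [True] * 10 + [False] * 10

alpha = 0.05
grid = [i / 10 for i in range(-10, 51)]  # hypothesized effects from -1.0 to 5.0
accepted = [th for th in grid if shifted_pvalue(outcomes, treated, th) >= alpha]
ci = (min(accepted), max(accepted))      # limits to the resolution of the grid
print(ci)
```

Testing the point estimate itself as the hypothesized effect gives a shifted estimate of zero and hence a p-value of 1, so the point estimate always lies inside the interval.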
2.2 Extending permutation tests to SW-CRTs
Two assumptions are required for permutation tests to be valid.
First, permutation tests test equivalence of distributions between the conditions. This means they will return a small p-value if either the means or the variances of outcomes differ. The effect of an intervention varying between observations is an example of the latter.
Second, permutation tests assume exchangeability of observations. This means that any assignment of observations to the conditions is equally likely. In the context of SW-CRTs, exchangeability holds for the assignment of clusters to sequences but will not hold at the individual observation level. Clusters should therefore be permuted between sequences. The
2.3 Selecting an intervention-effect estimator for a stepped-wedge trial
Permutation tests provide a p-value and confidence intervals for a given intervention-effect estimator. A key design feature of all SW-CRTs is that the intervention effect is confounded with time. Therefore, the chosen estimator must account for this confounding either by adjusting for period effects or by conditioning on periods.
To adjust for period effects, generalized linear models or generalized linear mixed models can be used (Bellan et al. 2015; Ji et al. 2017; Wang and De Gruttola 2017).
To condition on the period, the analysis can be conducted within each period with resulting within-period estimates combined as a weighted average. More details of this method, also known as a vertical analysis, are given in Thompson et al. (2018). Any analysis that can be used for a parallel CRT could be used within each period; for example, Thompson et al. (2018) suggested using a cluster-level analysis in each period.
The overall intervention effect can be estimated more precisely by using appropriate weights for each period (Hayes and Moulton 2009). Periods can be weighted by the imbalance in the number of clusters in the control and intervention conditions (Matthews and Forbes 2017) or by the precision of within-period estimates (Thompson et al. 2018).
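To make the within-period (vertical) approach concrete, the sketch below combines a cluster-level difference in means within each period with a permutation of clusters between sequences. It is a simplified illustration, not the command's implementation: the data are made up, and the imbalance weight used here, the inverse of 1/s_0j + 1/s_1j, is an assumption chosen for the sketch rather than a formula quoted from the cited papers. All function names are hypothetical.

```python
import random

def vertical_estimate(means, start):
    """Weighted average of within-period differences in cluster-level summaries.

    means[c][j]: outcome summary for cluster c in period j.
    start[c]:    first period in which cluster c is in the intervention condition.
    """
    num = den = 0.0
    for j in range(len(means[0])):
        ctrl = [m[j] for m, s in zip(means, start) if j < s]
        intv = [m[j] for m, s in zip(means, start) if j >= s]
        if not ctrl or not intv:
            continue  # all clusters in one condition: no within-period contrast
        diff = sum(intv) / len(intv) - sum(ctrl) / len(ctrl)
        weight = 1.0 / (1.0 / len(ctrl) + 1.0 / len(intv))  # assumed imbalance weight
        num += weight * diff
        den += weight
    return num / den

def sw_permutation_pvalue(means, start, n_perm=1000, seed=1):
    """Permute clusters between sequences and recompute the vertical estimate."""
    rng = random.Random(seed)
    observed = vertical_estimate(means, start)
    extreme = 0
    for _ in range(n_perm):
        shuffled = start[:]
        rng.shuffle(shuffled)  # reassign clusters to sequences; the data stay fixed
        if abs(vertical_estimate(means, shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_perm

# Made-up design: six clusters, four periods, two clusters crossing over per step.
start = [1, 1, 2, 2, 3, 3]
base = [0.50, 0.60, 0.55, 0.65, 0.50, 0.60]   # cluster levels
period = [0.00, 0.02, 0.04, 0.06]             # secular trend
means = [[base[c] + period[j] + (0.1 if j >= start[c] else 0.0)
          for j in range(4)] for c in range(6)]

observed, p_value = sw_permutation_pvalue(means, start)
print(observed, p_value)
```

Because every comparison is made within a period, the secular trend in the made-up data cancels out of each within-period difference, and the first and last periods contribute nothing because all clusters are in the same condition.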
3 The swpermute command
The
3.1 Data requirements
The
The data should be in long format, with observations in each period given in different rows of the data.
3.2 Syntax
The syntax of the
exp specifies the result to be collected from results stored by the execution of command. Examples are
3.3 Options
Main
Options
Within-period analysis
where s_{0j} and s_{1j} are the numbers of clusters in the control condition and the intervention condition, respectively, in period j. This is the default and is recommended if the total variance is not expected to vary between periods (Matthews and Forbes 2017).
exp2_j is assumed to be the variance of the estimate from the jth period. This specification is suggested by Thompson et al. (2018) when the variance of the outcome is expected to vary between periods.
Reporting
3.4 Stored results
Scalars
Matrices
3.5 The dialog box
The
Running these commands from within Stata will install only the dialog box for the current session of Stata. To install the menus permanently, place the above commands into your
4 Examples
To demonstrate the use of
We will focus on a secondary outcome of the proportion of patients with a bacterial confirmation of their TB diagnosis (Trajman et al. 2015). These examples use real trial data, but the data cannot be provided with the command. Instead, a simulated dataset is included that closely mimics the characteristics of these trial data but will not reproduce these example results.
The trial included 14 laboratories (clusters). At initiation of the study, all laboratories were using sputum smear microscopy to diagnose TB. Following a month of baseline data collection, the Xpert test was rolled out to two randomly assigned laboratories each month (that is, within seven months, all laboratories were using the Xpert test). The dataset contains 3,924 patients; their diagnoses were recorded as either clinical (with a negative test or no test done) or bacterially confirmed. The Xpert test was used to diagnose 2,147 (55%) patients. Across both trial arms, 2,833 (72%) had a confirmed TB diagnosis.
The output below describes the dataset:
Each row gives the diagnosis type,
We will explore two analyses with permutation tests: the first will use a generalized linear mixed model with a permutation test, and the second will demonstrate a within-period analysis.
4.1 Example 1: Generalized linear mixed model
A mixed-effects logistic regression, adjusting for period effects as a fixed categorical variable and with a random intercept for cluster, can be used in combination with a permutation test to analyze this trial, as shown below.
The table below this matrix gives the results of the permutation test. First, we see the intervention-effect estimate observed in the data (
The intervention-effect estimate is the value estimated by the
Only 2/1000 permutations gave a result the same as or more extreme than that observed, giving the p-value 2/1000 = 0.002 shown in column 5 (
The last two columns give a two-sided 95% confidence interval for the p-value that indicates the level of uncertainty around the p-value from the random selection of permutations. In this example, the interpretation of the p-value does not substantively change for values within this interval. Where interpretation would be altered for different values within the interval, the analysis should be rerun with more permutations.
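The uncertainty in a Monte Carlo p-value can be illustrated by treating the number of extreme permutations as a binomial count. The sketch below computes an exact (Clopper-Pearson) binomial interval for 2 extreme results out of 1,000 permutations; the choice of interval is an assumption of this sketch, because the text above does not state which method the command uses, and the function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_binomial_ci(c, n, level=0.95):
    """Clopper-Pearson interval for a proportion c/n, found by bisection."""
    alpha = 1 - level

    def solve(f, target):
        # f is decreasing in p on [0, 1]; find p with f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X <= c - 1 | p) = 1 - alpha/2.
    lower = 0.0 if c == 0 else solve(lambda p: binom_cdf(c - 1, n, p), 1 - alpha / 2)
    # Upper limit solves P(X <= c | p) = alpha/2.
    upper = 1.0 if c == n else solve(lambda p: binom_cdf(c, n, p), alpha / 2)
    return lower, upper

lower, upper = exact_binomial_ci(2, 1000)
print(lower, upper)
```

Increasing the number of permutations narrows this interval, which is why rerunning with more permutations is the remedy when the interval straddles a decision threshold.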
4.2 Example 2: Within-period analysis and generating confidence intervals
In our second example, we will use a within-period analysis to calculate the difference in the risk (the proportion) of a confirmed diagnosis using a cluster-level analysis within each period, and we show how to construct confidence intervals.
First, we calculate the proportion of confirmed diagnoses in each cluster period by collapsing the data. We run
We are warned that study month 1 and study month 8 are not included in this analysis. All clusters are in the same condition during these periods, so an intervention effect cannot be calculated.
The command displays a list of effect estimates and weights for each period in the study. The greatest weight is given to study month 6 despite the imbalance in clusters in the control and intervention conditions. This is because there was less variability in the cluster-level outcomes during this period, leading to a lower variance for the estimated intervention effect.
The observed value in the table of results is the weighted average of these period-specific estimates. The percentage of patients with a confirmed diagnosis was 10.5 percentage points higher among patients diagnosed with the Xpert test than among patients diagnosed with smear microscopy, and there is some evidence against the intervention having no effect (p = 0.05).
Next, we demonstrate the construction of 95% confidence intervals. The initial estimate of the confidence interval boundaries can be found by assuming that the intervention-effect estimate follows a normal distribution and using the p-value to estimate a standard error as follows:
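One way to carry out this normal-approximation calculation is sketched below. It assumes that the standard error is recovered by dividing the estimate by the standard normal quantile that corresponds to the two-sided p-value; the function name is hypothetical, and the inputs are the rounded values reported above (estimate 0.105, p = 0.05).

```python
from statistics import NormalDist

def initial_ci_bounds(estimate, p_value, level=0.95):
    """Starting values for the confidence limits under a normal approximation.

    Requires 0 < p_value < 1.
    """
    z = NormalDist()
    # Two-sided p-value -> |z statistic| -> implied standard error.
    se = abs(estimate) / z.inv_cdf(1 - p_value / 2)
    half_width = z.inv_cdf(1 - (1 - level) / 2) * se
    return estimate - half_width, estimate + half_width

lower, upper = initial_ci_bounds(0.105, 0.05)
print(round(lower, 3), round(upper, 3))
```

With these rounded inputs, the starting values are approximately 0.00 and 0.21; they serve only as first guesses to be checked with permutation tests.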
Permutation tests are conducted to test these initial values. An example is shown below for the initial proposed upper boundary; the dialog boxes shown in the appendix replicate this example.
Depending on the p-value, the proposed boundary value is either increased or decreased until the boundary value with p > 0.05 is identified. To identify the lower boundary in our example, the initial estimate was 0.00, which gives p = 0.0500, so we tested a null value of −0.001 to see whether a smaller value also fell within the 95% confidence interval. This has p = 0.048, so the lower boundary is 0.00. For the upper boundary, the initial estimate of 0.211 gave p = 0.026, well outside the 95% confidence interval. A null of 0.20 gave p = 0.045, a null of 0.197 gave p = 0.054, a null of 0.198 gave p = 0.051, and lastly a null of 0.199 gave p = 0.047. The largest value within the 95% confidence interval is therefore 0.198, or 19.8%, and our 95% confidence interval is [0.0%, 19.8%].
5 Concluding remarks
However,
7 Programs and supplemental materials
Supplemental Material, st0577 for Permutation tests for stepped-wedge cluster-randomized trials by Jennifer Thompson, Calum Davey, Richard Hayes, James Hargreaves and Katherine Fielding in The Stata Journal
6 Acknowledgments
This work was cofunded by the UK Medical Research Council (MRC) (MR/L004933/1- P27) and jointly funded by the MRC and the UK Department for International Development (DFID) under the MRC/DFID Concordat agreement and is also part of the EDCTP2 program supported by the European Union (MR/K012126/1) and (MR/R010161/1).
We thank Professor Anete Trajman, Dr. Betina Durovni, Dr. Valeria Saraceni, Professor Frank Cobelens, and Dr. Susan van den Hof for allowing us to use data from their study to demonstrate the command.
7 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
A Appendix: Dialog boxes to run the example in section 4.2
References