Abstract
We introduce two commands,
Keywords
1 Introduction
Randomized controlled trials are the gold standard for testing whether a new treatment is better than the current standard of care. Multiarm multistage (MAMS) trial designs are efficient adaptive designs that have been proposed to speed up the evaluation of new therapies and improve success rates in identifying effective ones (Parmar et al. 2008). The MAMS design achieves this goal with two main components: The multiarm aspect allows multiple experimental arms to be compared with a common control (which is generally taken as the current standard of care) in one trial, and the multistage aspect allows interim analyses before the planned end of the study. This enables us to cease recruitment early to potentially inefficacious experimental arms or to stop early for overwhelming efficacy. This allows multiple research questions to be efficiently answered under the same protocol.
Royston, Parmar, and Qian (2003) and Royston et al. (2011) developed a MAMS design for trials with time-to-event outcomes that uses an intermediate (I) outcome at interim stages. This increases the efficiency of the MAMS design further because it allows for the earlier stopping of treatment arms for lack of benefit at interim stages while maintaining a low probability of false negatives (that is, 1 − power). In this framework, the information on the I outcome accrues at the same or a faster rate than information for the definitive (D) or primary outcome of the trial. The I outcome should be on the causal pathway to D, but it does not necessarily have to be a surrogate outcome (Parmar et al. 2008). If there is no effect of treatment on I, then it is highly desirable that the same holds for D; otherwise, there is an increased risk of wrongly stopping a study early for lack of benefit. Choodari-Oskooei et al. (2022) give an extensive account of Royston, Parmar, and Qian (2003) and Royston et al.’s (2011) MAMS designs and discuss their underlying principles.
Examples of intermediate and primary outcomes are progression-free (or disease-free) survival and overall survival for many cancer trials, CD4 count and disease-specific survival for HIV trials, or culture status (a binary marker for whether a patient has tuberculosis) and patient relapse (binary) in tuberculosis trials. When one uses an I outcome, each of the experimental arms is compared pairwise with the control arm on the I outcome. In the absence of an obvious choice for I, a rational choice of I might be D itself earlier in time. In this article, the MAMS designs that use the I outcome for the lack-of-benefit analysis at the interim looks are denoted by I ≠ D. Designs that use the same primary outcome at the interim looks are denoted by I = D. Throughout, we use the acronym MAMS to refer to the multiarm multistage design described by Royston, Parmar, and Qian (2003) and Royston et al. (2011).
Binary (or dichotomous) outcomes are widely used in many clinical studies. The MAMS design has been extended to binary outcomes with the risk difference as the primary outcome measure (see Bratton, Phillips, and Parmar [2013]) and can easily be extended to designs with the log odds-ratio as the primary outcome measure (Abery and Todd 2018). It is one of the few adaptive designs being deployed both in several trials and across a range of diseases, including trials in COVID-19, cancer, tuberculosis, and surgery. One example is the MAMS ROSSINI 2 trial in surgical site infection (SSI), which is used in this article as an example and for illustration (ROSSINI 2 2023).
The purpose of this article is twofold. First, it introduces two commands,
The structure of this article is as follows. Section 2 presents the specification of the MAMS design with binary outcomes. It also describes a class of efficient admissible MAMS designs in section 2.3 and introduces a flexible family of α functions to allow for a larger set of such designs to be found. Sections 3 and 4 present the
2 MAMS designs with binary outcomes
This section presents the specification of the MAMS design with binary outcomes. For a MAMS trial with K experimental arms and J stages, parameters πjk and πj0 are the risks of developing the outcome of interest at stage j in experimental arm k and the control arm, respectively. The treatment effect is the difference in risks, that is, a reduction in an unfavorable event rate, and is measured by θjk = πjk − πj0, where j = 1,…, J and k = 1,…, K. For simplicity, because we assume that all K pairwise comparisons have the same design parameters (that is, all have the same stagewise design significance level αj and power ωj), we remove the subscript k from the notation of the design parameters.
Without loss of generality, assume that a negative value of θjk indicates a beneficial effect of treatment k. In trials with K experimental arms, where a set of K null hypotheses are tested at each stage j, the null and alternative hypotheses are
for some prespecified (design) null effects
At each stage, we define the design significance level α = (α1,…, αJ) and power ω = (ω1,…, ωJ) for testing each pairwise comparison. Let Zjk be the z test statistic comparing experimental arm k against the control arm at stage j, where Zjk follows a standard normal distribution, Zjk ∼ N(0, 1), under the null hypothesis. Note that all the cumulative data from previous stages are used in the calculation of each z test statistic. In other words, the pairwise analyses at each stage include all the individuals that were included in the analyses of previous stages. The joint distribution of the z test statistics therefore follows a multivariate normal distribution with the location parameter as the J × K matrix of the standardized mean treatment effects and the corresponding covariance matrix (Σ) between the J × K test statistics. Note that the Fisher’s (observed) information (Vjk) contained in
If pjk ≥ αj, the result for the pairwise comparison of experimental arm k against the control arm crosses the jth interim lack-of-benefit stopping rule; therefore, recruitment to that experimental arm can be stopped for lack of benefit.
If pjk < αj, continue recruitment in experimental arm k and the control arm and move to the next stage.
At the final analysis J, the treatment effect is estimated on the primary (D) outcome for each experimental arm and includes all the randomized individuals from previous stages in comparison k. As a result, one of two conclusions can be made:
If pJk ≤ αJ, reject the null hypothesis corresponding to the definitive outcome and claim efficacy.
If pJk > αJ, the corresponding null hypothesis cannot be rejected.
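The stagewise decision rules above can be sketched in a few lines of Python. This is only an illustration and is not part of the commands described in this article: the function name `stage_decision` and its arguments are hypothetical, the one-sided p-values are assumed to have been computed already from the cumulative data, and the intermediate significance level in the example is chosen for illustration.

```python
def stage_decision(p_value, alpha, stage, n_stages):
    """Return the action for one pairwise comparison at a given stage.

    p_value  : one-sided p-value for arm k versus control at this stage
    alpha    : list of stagewise design significance levels (alpha_1..alpha_J)
    stage    : current stage j (1-based)
    n_stages : total number of stages J
    """
    if len(alpha) != n_stages:
        raise ValueError("need one significance level per stage")
    a_j = alpha[stage - 1]
    if stage < n_stages:
        # Interim stage: only the lack-of-benefit rule applies.
        return "stop for lack of benefit" if p_value >= a_j else "continue"
    # Final stage: test the definitive (D) outcome.
    return "reject H0" if p_value <= a_j else "do not reject H0"


# Illustrative three-stage design (hypothetical values).
alpha = [0.5, 0.25, 0.05]
print(stage_decision(0.55, alpha, 1, 3))   # p >= alpha_1 at the first interim look
print(stage_decision(0.10, alpha, 2, 3))   # p < alpha_2 at the second interim look
print(stage_decision(0.003, alpha, 3, 3))  # final analysis
```

The sketch treats each pairwise comparison separately, which mirrors the assumption that all K comparisons share the same stagewise design parameters.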
2.1 Steps to design a MAMS trial with a binary outcome
The MAMS design requires the specification of the following design parameters to calculate the sample size and trial duration for each stage (j): the (stagewise) design power (ωj) and significance levels (αj); the allocation ratio, which is the number of randomized individuals in each experimental arm for every individual that is randomized to the control arm (A); the target effect size under the null (
Choose the number of experimental (E) arms, K, and stages, J.
Choose the definitive D outcome and (optionally, in I ≠ D designs) the I outcome.
Choose the null values for the underlying treatment effect, θ, for example, the difference in risks on the definitive and (in I ≠ D designs) intermediate outcomes.
Choose the minimum clinically relevant target treatment-effect size,
Choose the control-arm event rate.
Choose the allocation ratio A (E:C), the number of patients allocated to each experimental arm for every patient allocated to the control arm. For a fixed sample (one-stage) multiarm trial, the optimal allocation ratio (that is, the one that minimizes the sample size for a fixed power) is approximately
In I ≠ D designs, choose an estimate of the probability of experiencing the definitive (final) outcome given the patient has had the intermediate outcome, that is, the positive predictive value (PPV), for the control arm and for experimental arms. This allows us to estimate the correlation between the treatment effect on the intermediate outcome and that of the definitive outcome to calculate the overall pairwise power. An estimate of the PPV can be obtained using data from previous trials, through expert opinion, or both; more information is included in Bratton, Phillips, and Parmar (2013). In the ROSSINI 2 design, the same outcome was used at interim stages (that is, I = D design), so this was not required; see appendix A in the online supplemental material for a trial example with I ≠ D design.
Choose the accrual rate per stage (and optionally, loss to follow-up) to calculate the trial timelines. The
Choose a one-sided design significance level for lack of benefit and the target power for each stage (αj, ωj). The chosen values for αj and ωj are used to calculate the required sample sizes for each stage.
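To illustrate how a stagewise pair (αj, ωj) translates into a sample size, the sketch below uses the textbook unpooled-variance formula for a one-sided comparison of two proportions on the risk-difference scale. This is an assumption made for illustration; it is not necessarily the formula implemented in the commands described in this article. The function name and arguments are hypothetical. The control-arm event rate of 0.15 and target difference of −0.05 are the ROSSINI 2 design values quoted later in the article; the significance level and power in the example are illustrative.

```python
from math import ceil
from statistics import NormalDist


def control_arm_n(alpha, power, p_control, theta1, theta0=0.0, ratio=1.0):
    """Control-arm sample size for one pairwise comparison at one stage.

    alpha     : one-sided significance level for this stage
    power     : target power for this stage
    p_control : control-arm event rate
    theta1    : target risk difference (negative = beneficial)
    theta0    : risk difference under the null hypothesis
    ratio     : allocation ratio A (experimental : control)
    """
    z = NormalDist().inv_cdf
    p_exp = p_control + theta1  # experimental-arm risk under the target effect
    # Variance of the estimated risk difference, multiplied by the control-arm n.
    var_unit = p_control * (1 - p_control) + p_exp * (1 - p_exp) / ratio
    n = (z(1 - alpha) + z(power)) ** 2 * var_unit / (theta1 - theta0) ** 2
    return ceil(n)


# Illustrative stage: alpha_j = 0.025, omega_j = 0.90, equal allocation.
n_control = control_arm_n(0.025, 0.90, p_control=0.15, theta1=-0.05)
print("control arm:", n_control, "each experimental arm:", n_control)
```

With an allocation ratio A different from 1, each experimental arm would need roughly A times the control-arm number returned by the function.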
The
2.2 Type I error rate and power
In trials with lack-of-benefit interim stopping boundaries, a type I error is made only if the null hypothesis for the D outcome is rejected in the final-stage analysis. In designs with J stages and stopping boundaries for lack of benefit, Royston et al. (2011) showed that, in I = D designs, the overall pairwise type I error rate, α, and power, ω, for each of the K pairwise comparisons are calculated from
where ΦJ is the J-dimensional multivariate normal distribution function with correlation matrix
In I ≠ D designs, the calculation of α in (1) is made under the assumption that H0 is true for both I and D. However, in this case the type I error rate is maximized when the experimental treatment is highly or infinitely effective on I but the null hypothesis is true for D. Therefore, the maximum pairwise type I error rate, αmax, is equal to the final-stage significance level, αJ (Bratton et al. 2016).
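The pairwise type I error rate of an I = D design can also be approximated by simulation. The sketch below is a Monte Carlo illustration for a two-stage design, not the analytic calculation used by the commands. It assumes that the two z statistics are standard normal under the null hypothesis with a user-supplied correlation `rho`; the value 0.7 is purely illustrative, because the correlation matrix is not reproduced here. A type I error is counted only when the comparison passes the interim lack-of-benefit rule and is then rejected at the final stage.

```python
import random
from math import sqrt
from statistics import NormalDist


def pairwise_type1_mc(alpha1, alpha2, rho, n_sim=200_000, seed=1):
    """Monte Carlo estimate of the overall pairwise type I error rate
    for a two-stage I = D design with a lack-of-benefit interim rule."""
    rng = random.Random(seed)
    z1_cut = NormalDist().inv_cdf(alpha1)  # continue if Z1 < z1_cut (p1 < alpha1)
    z2_cut = NormalDist().inv_cdf(alpha2)  # reject if Z2 <= z2_cut (p2 <= alpha2)
    hits = 0
    for _ in range(n_sim):
        z1 = rng.gauss(0.0, 1.0)
        # Build Z2 with correlation rho to Z1 (assumed value, see lead-in).
        z2 = rho * z1 + sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        if z1 < z1_cut and z2 <= z2_cut:
            hits += 1
    return hits / n_sim


est = pairwise_type1_mc(alpha1=0.5, alpha2=0.025, rho=0.7)
print("estimated overall pairwise type I error rate:", est)
```

Because rejection requires passing the interim rule first, the overall rate cannot exceed the final-stage level αJ, apart from simulation noise in the estimate.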
In multiarm trials, there are multiple ways to commit a type I error. In some trials such as the ROSSINI 2 trial, it is required to control the overall familywise error rate (FWER) at a prespecified level, usually at 2.5% (one sided). The FWER is the probability of incorrectly rejecting the null hypothesis for the primary outcome for at least one of the experimental arms from a set of comparisons in a multiarm trial. The FWER is maximized under the global null hypothesis,
2.3 Admissible MAMS designs
In Royston, Parmar, and Qian (2003) and Royston et al.’s (2011) framework, a MAMS design is constructed by specifying a one-sided significance level and power for the pairwise comparisons at each stage of the study along with the minimum target treatment effect for the outcome of interest in that stage and the allocation ratio for the trial. Given these design parameters, the sample size required for each analysis is calculated. The (one-sided) design significance levels act as the stopping boundaries for lack of benefit. Previous MAMS trials such as the STAMPEDE trial (Sydes et al. 2012) have used the recommendations given by Royston et al. (2011) to choose the stagewise significance levels and powers.
Royston et al. (2011) suggested using high power in the intermediate stages (for example, 0.95) and also the final stage (for example, 0.90) to ensure high overall power for the trial. They also suggested using a descending geometric sequence such as αj = 0.5^j for the significance levels in the intermediate stages. However, this approach is problematic for two main reasons. First, it may not result in a “feasible” design with the desired overall operating characteristics. To achieve this, a time-consuming trial-and-error approach is required in which users must continually tweak the stagewise (design) operating characteristics until a feasible design with the desired overall operating characteristics is found. Second, there are likely to be many feasible designs for any pair of overall operating characteristics, some requiring smaller sample sizes than others. Therefore, the chosen design may not be the most efficient or optimal for a particular true treatment effect. Thus, the most efficient feasible MAMS design for a particular study is unlikely to be found if this approach is used for trial design.
To address these difficulties, Bratton (2015) developed a systematic grid-search procedure over the stagewise significance levels and power to find a large set of feasible designs, that is, designs with a particular (prespecified) overall type I error rate and power. The procedure then selects the most efficient feasible designs, called admissible MAMS designs, using an optimality criterion proposed by Jung et al. (2004), which is a weighted sum of the expected sample size under the global null hypothesis, E(N|H0), and the hypothesis in which all arms are effective, E(N|H1):
Feasible designs that minimize (2) for some q ∊ [0, 1] are called admissible. Note that the user chooses q based on the prior beliefs about the effectiveness of the treatment under study. Special cases are the null-optimal design with q = 0, which minimizes the expected sample size under the global null hypothesis, that is, E(N|H0), and the minimax design with q = 1, which minimizes E(N|H1). However, other admissible designs that minimize a more balanced weighting of the two measures exist. Jung et al. (2004) found that these “balanced” admissible designs are often much more appealing in practice because they usually possess similar desirable properties to the null-optimal or minimax designs but do not have such large maximum or expected sample sizes, respectively. The parameter q could encompass the prior beliefs about the effectiveness of the experimental treatment regimens used in each research arm of the trial or the relative importance of the expected sample sizes under the global null or alternative hypotheses. Designs that minimize the loss function for a wider range of values of q are likely to be more desirable because they are admissible for a wider range of prior beliefs or scenarios. Hence, it is important to find the admissible designs for all values of q so that those that cover the broadest range of opinions can be found. The final choice of design will therefore depend on prior beliefs about the effectiveness of the treatment under study, the relative importance of the maximum and expected sample sizes to the investigators, or both.
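The selection of admissible designs can be sketched as a search over a grid of q values. The sketch assumes that the loss takes the form (1 − q)E(N|H0) + qE(N|H1), which matches the special cases q = 0 and q = 1 described above; the exact form of (2) is not reproduced here. The design labels and expected sample sizes are hypothetical and serve only to illustrate the mechanics.

```python
def admissible_designs(designs, n_grid=101):
    """Map each admissible design to the grid values of q at which it
    minimizes the assumed loss (1 - q) * E(N|H0) + q * E(N|H1)."""
    result = {}
    for i in range(n_grid):
        q = i / (n_grid - 1)
        best = min(designs, key=lambda d: (1 - q) * d["en_h0"] + q * d["en_h1"])
        result.setdefault(best["name"], []).append(q)
    return result


# Hypothetical feasible designs with their expected sample sizes.
designs = [
    {"name": "A", "en_h0": 1000, "en_h1": 1600},
    {"name": "B", "en_h0": 1020, "en_h1": 1450},
    {"name": "C", "en_h0": 1150, "en_h1": 1400},
    {"name": "D", "en_h0": 1200, "en_h1": 1500},  # worse than C on both measures
]

for name, qs in admissible_designs(designs).items():
    print(name, "is admissible for q between", min(qs), "and", max(qs))
```

In this toy example, design A is selected at q = 0, design C at q = 1, and design B over a middle range of q, while design D is never selected because another design is better on both measures.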
2.3.1 α functions to find design significance levels
In two-stage settings, a simple grid-search procedure can be used to search over all four stagewise design parameters (α1, α2, ω1, and ω2) to find feasible designs. For designs with more than two stages, the addition of an extra two parameters for each additional stage drastically increases the search time, rendering a full grid search impractical. To increase search speed, one should apply some constraints to limit the number of parameters to search over without significantly reducing the number of feasible designs found. In particular, to limit the number of design significance level parameters to search over and to ensure that the stagewise significance levels decrease with each stage as suggested by Royston et al. (2011), one can use a monotonically decreasing function to automatically determine the parameters that are not included in the search.
An “α function” similar to that proposed by Royston et al. (2011) that determines the significance levels for the intermediate stages given the significance level for the first stage is
To find a range of feasible designs using this function, one can search over various values of α1 with the final-stage significance level, αJ, chosen such that the desired type I error rate is achieved. However, very few sets of significance levels will be searched over using this function, so few, if any, feasible designs are likely to be found. Bratton (2015) introduced (4) as an alternative, and more flexible, family of functions that pass through specified values of α1 and αJ and require the definition of a parameter 0 ≤ r ≤ 1 as follows:
By performing a grid search over α1 and αJ, one can use this function to automatically determine the significance levels for stages j = 2,…, J − 1 for a range of prespecified values of r. The search time will therefore be longer than it is when using (3). However, more feasible designs are likely to be found. The shapes of both of the above α functions are shown in figure 1 for J = 3, 4, and 5 stages, α1 = 0.5, αJ = 0.05 and, for (4) only, r = 0 (linear), 0.5, and 1. The stagewise significance levels corresponding to each function are shown in table 1 with intermediate significance levels rounded in units of 0.01 for practical reasons.

Figure 1. Examples of α functions generated using (3) (“Royston’s function”) and (4) for r = 0, 0.5, and 1; J = 3, 4, and 5 stages; α1 = 0.5; and αJ = 0.05.
Table 1. Stagewise significance levels obtained from the α functions shown in figure 1 for three-, four-, and five-stage designs
Figure 1 shows that as r increases, the α functions in (4) become more curved. This causes the significance level to decrease more rapidly during the initial stages, thus increasing their sample size and duration (except for the first stage, whose duration is determined by the fixed value α1). The functions then level off, so the number of patients recruited in the later stages will decrease. From table 1, it appears that using a value of r greater than 1 for many stages (for example, J = 5) will result in negligibly small decrements in the significance levels between later stages, thus making them too small. On the other hand, α functions that curve in the opposite direction will have very short early intermediate stages, while later stages will be lengthy. Such designs are likely to be impractical and inefficient in practice. Thus, only values of r between 0 and 1 are considered (Bratton 2015). Table 1 also shows that for three or four stages, the significance levels found using (3) almost coincide with a set found using (4). In the five-stage example, the decrease in the significance level between the penultimate (α4 = 0.06) and final stages (α5 = 0.05) using Royston’s function is too small and unlikely to result in a practical design. The search procedure uses the same (high) power in all intermediate stages and a different lower power at the final stage.
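The behavior of Royston’s function in the five-stage example can be reproduced with a short sketch. It assumes that (3) takes the geometric form αj = α1^j for the stages before the last, with αJ supplied separately; this form is consistent with the sequence αj = 0.5^j and with the value α4 = 0.06 quoted above, but it is an assumption. The function name `royston_alphas` is hypothetical. The more flexible family (4) is not sketched because its exact form is not reproduced here.

```python
def royston_alphas(alpha1, alpha_final, n_stages):
    """Stagewise significance levels under an assumed geometric alpha function:
    alpha_j = alpha1 ** j for j = 1, ..., J - 1, with alpha_J supplied."""
    if n_stages < 2:
        raise ValueError("need at least two stages")
    intermediate = [alpha1 ** j for j in range(1, n_stages)]
    return intermediate + [alpha_final]


for n_stages in (3, 4, 5):
    levels = royston_alphas(0.5, 0.05, n_stages)
    print(n_stages, "stages:", [round(a, 4) for a in levels])
```

For five stages, the penultimate level is 0.5^4 = 0.0625, which rounds to the 0.06 quoted above and sits very close to the final level of 0.05.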
The
3 The nstagebin command
The syntax for
3.1 Syntax
Note that the number of values given in each numlist must equal the number of stages in the trial as specified in the
3.2 Options
3.3 Dialog box
The

Screenshot of the first tab of the

Screenshot of the second tab of the

Screenshot of the third tab of the

Screenshot of the final tab of the
In the first tab (
4 The nstagebinopt command
The syntax for
4.1 Syntax
4.2 Options
5 Example: Application to the ROSSINI 2 trial
This section presents the outputs from the
5.1 ROSSINI 2 MAMS trial
The reduction of SSI using several novel interventions (ROSSINI 2) trial [NCT03838575] is a phase III MAMS trial investigating in-theater interventions to reduce SSI. The composite binary outcome of SSI up to 30 days is the definitive outcome that is used at both the interim and the final stages of this trial, that is, an I = D MAMS design. In this eight-arm, three-stage MAMS trial, three interventions (skin prep, drape, and sponge) are being tested, with patients being randomized to receive none (control arm), one, or any combination of these interventions; that is, there are seven experimental arms in total. The primary outcome measure θ is the absolute difference in the proportion of patients reporting SSI up to 30 days after surgery between each of the experimental arms and that of the control arm. The same primary outcome measure is used at all stages for analysis and dropping of arms. No formal stopping rule for early evidence of efficacy has been specified at the design stage of the trial. The trial design also allowed for treatment selection. For simplicity, we disregard this aspect of the trial design and consider it as a standard MAMS trial. Table 2 shows the stagewise design parameters for the ROSSINI 2 trial.
Table 2. Design specification for the eight-arm three-stage ROSSINI 2 MAMS trial. The target effect size is a 5% absolute reduction in the SSI event rate in each of the seven experimental arms from the control-arm event rate of 15%; see section 5.3 for more details.
In this trial, the overall FWER was strongly controlled at 0.025 (one-sided) to account for multiplicity as a result of multiple pairwise comparisons. The overall pairwise power is 0.85. The
5.2 nstagebinopt output
In multiarm trials,
In the design stage of the ROSSINI 2 trial, the control-arm SSI rate was assumed to be 0.15, that is, specified using the
The output from the
The results indicate that the design that is admissible for q ∊ [0.10, 0.65] has an expected sample size of 4,683 patients, which is just 25 patients higher than the null-optimal design with 4,658 patients. However, this admissible design has a much smaller E(N|H1) than that of the null-optimal design. Overall, this design is the preferred choice. So the chosen stagewise significance levels and powers are used in the

The expected sample sizes for the four different admissible designs presented in section 5.2—see
5.3 nstagebin output
This section presents the
6 Conclusions
This article presented the
In noninferiority (NI) designs, however, the same analysis method that was assumed at the design stage should be applied. Otherwise, the type I and II error rates might not be controlled at the prespecified levels because changing the analysis scale in NI designs requires redefining the NI margin (Li et al. 2022). Appendix B of the online supplementary material includes an example NI trial design and the corresponding
There are limitations within the MAMS framework. First, in designs with a binary intermediate outcome, the designs assume that the probability of experiencing the definitive (binary) outcome given that a patient has had the intermediate outcome (the PPV) is the same for the control arm and for the experimental arms. This is a reasonable assumption to make and is often the case under the null hypothesis. Second, the MAMS framework and the corresponding
Finally, we validated the stagewise sample sizes from
We hope that the commands and this article can facilitate the uptake and implementation of MAMS designs and help to optimize MAMS designs with binary outcomes.
Supplemental Material, sj-zip-1-stj-10.1177_1536867X231196295 for Facilities for optimizing and designing multiarm multistage (MAMS) randomized controlled trials with binary outcomes by Babak Choodari-Oskooei, Daniel J. Bratton and Mahesh K. B. Parmar in The Stata Journal
7 Acknowledgments
We are grateful to Professor Stephen Jenkins, the editor, and an external reviewer for useful comments and suggestions on the earlier draft of this manuscript, which have improved the article markedly. We also thank Professor Ian White for helpful comments on the earlier draft of this manuscript. This work was supported by the Medical Research Council (MRC) grant numbers MC_UU_00004_09 and MC_UU_123023_29.
8 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
For the latest version of the
References