Abstract
Background
The multi-arm multi-stage framework uses intermediate outcomes to assess lack-of-benefit of research arms at interim stages in randomised trials with time-to-event outcomes. However, the design lacks formal methods to evaluate early evidence of overwhelming efficacy on the definitive outcome measure. We explore the operating characteristics of this extension to the multi-arm multi-stage design and how to control the pairwise and familywise type I error rate. Using real examples and the updated nstage program, we demonstrate how such a design can be developed in practice.
Methods
We used the Dunnett approach for assessing treatment arms when conducting comprehensive simulation studies to evaluate the familywise error rate, with and without interim efficacy looks on the definitive outcome measure, at the same time as the planned lack-of-benefit interim analyses on the intermediate outcome measure. We studied the effect of the timing of interim analyses, allocation ratio, lack-of-benefit boundaries, efficacy rule, number of stages and research arms on the operating characteristics of the design when efficacy stopping boundaries are incorporated. Methods for controlling the familywise error rate with efficacy looks were also addressed.
Results
Incorporating Haybittle–Peto stopping boundaries on the definitive outcome at the interim analyses will not inflate the familywise error rate in a multi-arm design with two stages. However, this rule is conservative; in general, more liberal stopping boundaries can be used with minimal impact on the familywise error rate. Efficacy bounds in trials with three or more stages using an intermediate outcome may inflate the familywise error rate, but we show how to maintain strong control.
Conclusion
The multi-arm multi-stage design allows stopping for both lack-of-benefit on the intermediate outcome and efficacy on the definitive outcome at the interim stages. We provide guidelines on how to control the familywise error rate when efficacy boundaries are implemented in practice.
Keywords
Introduction
The multi-arm multi-stage (MAMS) adaptive clinical trial design developed by Royston et al.1,2 has many practical advantages when evaluating treatments, such as increased efficiencies in time and patients required, and a greater probability of success than a traditional parallel-group, single-stage design.3
Interim stages are used to identify early evidence of lack-of-benefit of each research arm over the control arm. The MAMS framework utilises an intermediate outcome measure for these interim assessments.
Efficacy stopping boundaries can be implemented as a means of assessing interim data as they accumulate, to identify treatment arms showing overwhelming efficacy over the course of the trial. Data monitoring committees may recommend terminating a trial before its planned end so that the results can be reported, or submitted for regulatory approval, earlier than planned. Permitting early stopping for efficacy would further increase the efficiency of the MAMS design, by minimising the number of patients exposed to inferior treatment regimens and decreasing the time for effective treatments to reach patients. Popular stopping boundaries are the Haybittle–Peto rule,5 the O'Brien–Fleming rule6 and other approaches utilising an alpha-spending function.7,8
Multiple testing in MAMS trials may increase the risk of a type I error.9 In a two-arm design, this error rate is known as the pairwise error rate (PWER); in a multi-arm setting, it is generally referred to as the familywise error rate (FWER): the probability of at least one ineffective research arm being recommended at an interim stage or at the end of the trial. Whether the FWER should be controlled in a MAMS trial is a decision to be made on a case-by-case basis. However, it may be important to calculate its value even in trials that do not require strong control of the FWER.10
As far as we are aware, no alternative MAMS trial design formally assesses lack-of-benefit on an intermediate outcome and efficacy on the definitive outcome simultaneously at interim analyses for time-to-event data. For this reason, this extension to the existing framework of Royston and colleagues,11,12 and the development of the associated nstage software, will provide the necessary evidence required by regulatory agencies to allow interim efficacy guidelines to be incorporated into MAMS designs and allow trials to measure and control the impact on the operating characteristics of the design.
This article explores this design extension via a simulation study, to quantify the extent to which the error rates are affected by formal interim efficacy looks according to different design parameters. We also illustrate how the FWER can be controlled in practice by modifying the design specification, using real MAMS trials as examples.
MAMS in practice: the STAMPEDE trial
Table 1 illustrates how the MAMS proposal has been applied to a clinical trial evaluating systemic therapies in prostate cancer.13
STAMPEDE was initially designed as a six-arm, four-stage trial, using the composite intermediate outcome measure of failure-free survival (FFS) for assessing lack-of-benefit at interim stages, and a definitive outcome of overall survival (OS) at the final analysis for efficacy. Table 1 shows the design specification for the original treatment comparisons at each stage: the outcome measure, target hazard ratio (HR) under the alternative hypothesis for the research arms (HR1), power and one-sided significance level.
Design specification for the six-arm four-stage STAMPEDE trial.
HR: hazard ratio; FFS: failure-free survival; OS: overall survival.
Methods
The MAMS design
For a K-arm, J-stage trial, one-sided significance levels are specified for stages 1 to J to compare each of the research arms against the control arm on the intermediate outcome at interim analyses and the definitive outcome at the final analysis. No formal comparisons are made between the research arms. The design targets high pairwise power at interim stages (e.g. 95%) to increase the probability of continuing with promising research arms.2 For the chosen power, the stagewise significance levels form a boundary for lack-of-benefit, since rejection of the null hypothesis at an interim analysis indicates that the arm continues recruitment to the subsequent stage.
The timing of interim analyses is driven by the number of intermediate outcome events observed in the control arm for trials with time-to-event outcomes and is determined by how liberal or conservative the significance levels are. Large p-values indicate early interim analyses, requiring only a small number of events. More conservative boundaries, with smaller p-values, trigger relatively later interim analyses when more events have been accrued. At each interim analysis, research arms demonstrating lack-of-benefit on the intermediate outcome may be dropped from the subsequent stages, optimising resources in the ongoing trial. By allowing the specification of an efficacy boundary, recruitment can also be terminated early to the research arms demonstrating overwhelming evidence of efficacy on the definitive outcome at an interim analysis. Detailed guidelines for designing a MAMS trial are provided in Supplemental Appendix A.
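To illustrate how the choice of stagewise significance level drives the timing of an interim analysis, the number of events needed for a pairwise comparison can be approximated with Schoenfeld's formula. The sketch below is our illustration only, not the exact nstage calculation; the helper name and parameter values are assumptions.

```python
import math
from statistics import NormalDist

def schoenfeld_events(alpha_j, power, hr_alt, alloc_ratio=1.0):
    """Approximate events required for one pairwise comparison tested at
    one-sided level alpha_j with the given power, under a target hazard
    ratio hr_alt (Schoenfeld's approximation; illustrative only)."""
    z_a = NormalDist().inv_cdf(1 - alpha_j)
    z_b = NormalDist().inv_cdf(power)
    p_c = 1 / (1 + alloc_ratio)            # proportion allocated to control
    p_r = alloc_ratio / (1 + alloc_ratio)  # proportion allocated to research arm
    total = (z_a + z_b) ** 2 / (p_c * p_r * math.log(hr_alt) ** 2)
    control = total * p_c                  # interim timing is driven by these
    return total, control
```

Under these assumptions, a liberal stage-1 level such as 0.5 requires far fewer events than a conservative level such as 0.025, which is why liberal lack-of-benefit boundaries trigger earlier interim analyses.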
In the MAMS design, correlation is induced between the estimated treatment effects of pairwise comparisons in two ways: first due to the shared control arm and second due to the shared or correlated outcome measures across stages. In the case of STAMPEDE, the intermediate and definitive outcome measures were strongly correlated due to FFS being a composite measure of OS (see Supplemental Appendix C), but the source of the correlation may differ for alternative outcome measures.
Type I error rate
In the MAMS setting, type I errors can only be made on decisions based on the definitive outcome. The PWER for comparison k is defined as the probability of a type I error made on comparison k, while the FWER is the probability of a type I error made on any pairwise comparison.
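The relationship between the PWER and the FWER can be illustrated with a minimal Monte Carlo sketch of a single-stage K-arm comparison, in which each pairwise z-statistic shares a common control-arm component (giving correlation 0.5 between comparisons under equal allocation). This simplified single-stage setup is ours for illustration; it is not the paper's full multi-stage simulation.

```python
import random
from statistics import NormalDist

def simulate_fwer(k_arms=3, alpha=0.025, n_sims=200_000, seed=1):
    """Monte Carlo estimate of the PWER and FWER under the global null
    for a single-stage K-arm comparison with a shared control arm."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
    any_err = pair_err = 0
    for _ in range(n_sims):
        z_control = rng.gauss(0, 1)
        # z_k = (research_k - control) / sqrt(2) has unit variance under the null,
        # and any two comparisons are correlated 0.5 through z_control
        zs = [(rng.gauss(0, 1) - z_control) / 2 ** 0.5 for _ in range(k_arms)]
        if any(z > crit for z in zs):        # at least one false rejection
            any_err += 1
        if zs[0] > crit:                     # false rejection on comparison 1
            pair_err += 1
    return pair_err / n_sims, any_err / n_sims
```

The positive correlation induced by the shared control arm makes the FWER smaller than the Bonferroni-style sum of the pairwise levels, which is the effect the Dunnett approach exploits.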
For a design where
where
To calculate the FWER, the union of all events leading to a type I error is considered. The probability also depends on whether the trial continues with the remaining arms or is terminated when a research arm crosses the efficacy boundary. In the former case, when
In cases where an intermediate outcome measure is used for assessing lack-of-benefit at interim
Power
The power of a clinical trial is the probability that an effective treatment is identified by the final analysis. In the MAMS setting, assuming binding boundaries, the power is conditional on the treatment arm passing all interim stages prior to rejection of the null hypothesis, without being dropped for lack-of-benefit. Three different definitions of power can be calculated in multi-arm trials: per-pair, any-pair and all-pair power.18
Per-pair power is the probability of detecting a treatment effect in a particular arm. Any-pair power is the probability of detecting at least one true treatment effect among several arms, and all-pair power is the probability of detecting every true treatment effect from all pairwise comparisons. The measures are calculated under the global alternative hypothesis: the assumption that all research arms are effective. The three measures of power will be identical in a two-arm trial,19 but when considering a multi-arm design the power measure of interest may depend on the objective of the trial. When efficacy bounds are implemented, per-pair power can be evaluated using a generalised form of equation (1) under the alternative hypothesis
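The three power measures can be estimated by simulation in a simplified single-stage K-arm setup with a shared control arm. This sketch is ours for illustration; the mean shift `delta` of each pairwise z-statistic under the alternative (about 3.24, corresponding to roughly 90% per-pair power at one-sided 0.025) is an assumed value.

```python
import random
from statistics import NormalDist

def power_measures(k_arms=3, alpha=0.025, delta=3.24, n_sims=100_000, seed=7):
    """Per-pair, any-pair and all-pair power under the global alternative
    for a single-stage K-arm design (illustrative sketch only)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)
    per = any_ = all_ = 0
    for _ in range(n_sims):
        zc = rng.gauss(0, 1)
        # each pairwise z-statistic is shifted by delta under H1
        zs = [delta + (rng.gauss(0, 1) - zc) / 2 ** 0.5 for _ in range(k_arms)]
        hits = [z > crit for z in zs]
        per += hits[0]      # effect detected in a particular arm
        any_ += any(hits)   # at least one effect detected
        all_ += all(hits)   # every effect detected
    return per / n_sims, any_ / n_sims, all_ / n_sims
```

As expected, all-pair power is the smallest of the three and any-pair power the largest, with per-pair power in between.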
Simulation study
Treatment arm–level data were simulated for 3 million trials. The type I error measure of interest was the PWER for two-arm scenarios and the FWER for multi-arm settings. Multi-arm scenarios considered the three measures of power defined in section 'Power'. Operating characteristics were evaluated empirically from the simulation results and were also compared against analytical solutions for the two-arm scenarios.
We explored the impact of implementing efficacy stopping rules on the type I error and power under different plausible design specifications which may be implemented in a MAMS trial, as described below. A separate stopping rule20 was assumed for all the results presented, with operating characteristics calculated assuming that the trial continues with the remaining research arms if any one arm is dropped early for efficacy. However, we also investigated the impact of terminating the whole trial after this occurrence, since in some cases it may be unethical to continue; this approach to stopping early for efficacy has been termed a simultaneous stopping rule.
Simulations under an
Definition of simulation parameters
Efficacy stopping rule
The form of the efficacy stopping rule will determine how stringent the boundaries
Assuming survival outcomes, only beneficial treatment effects were considered (i.e. HR < 1) so the lack-of-benefit thresholds serve as an upper boundary and the efficacy thresholds as a lower boundary. The direction of these may differ for alternative outcomes.
The Haybittle–Peto guideline5 uses the same threshold at stages 1 to J − 1.
The Haybittle–Peto guideline was used as a default rule when investigating other design parameters, since it is unaffected by the timing of the stages.
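The contrasting shapes of the two common boundary families can be sketched as follows. We use the classical O'Brien–Fleming-type form z_j = z_final/√t_j, where t_j is the information fraction; this ignores the exact alpha-spending adjustment, so the numbers are illustrative of the shape rather than design values, and the rule used in the paper's simulations may differ.

```python
from statistics import NormalDist

def efficacy_boundaries(info_fracs, alpha_final=0.025, rule="obf"):
    """One-sided efficacy-boundary p-values at each interim analysis.
    'hp'  : Haybittle-Peto rule, constant p = 0.0005 at interim looks.
    'obf' : classical O'Brien-Fleming-type shape, z_j = z_final / sqrt(t_j)."""
    nd = NormalDist()
    z_final = nd.inv_cdf(1 - alpha_final)
    out = []
    for t in info_fracs:
        if rule == "hp":
            out.append(0.0005)
        else:
            # stricter threshold at early looks, relaxing as t -> 1
            out.append(1 - nd.cdf(z_final / t ** 0.5))
    return out
```

The O'Brien–Fleming-type boundary is very stringent at early looks and relaxes towards the final-stage level, whereas Haybittle–Peto stays constant across the interim stages, which is why it is unaffected by the timing of the stages.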
Other design parameters
Table 2 shows the range of values used in the simulation study, for the parameters known to have an influence on the operating characteristics of the MAMS design. The times at which the interim analyses are to be conducted are dictated by the stagewise significance levels for assessing lack-of-benefit. A large
Simulation parameter values.
We used one-sided lack-of-benefit boundaries of
Strong control of the FWER
Controlling the FWER in the strong sense limits its value under any underlying treatment effects on the I- and D-outcomes. The maximum FWER is calculated by assuming non-binding lack-of-benefit boundaries, such that all research arms pass all interim stages. The actual FWER of the trial can be no larger than this maximum, so controlling the maximum guarantees control of the FWER itself.
A program was written using linear interpolation to determine the final-stage significance level
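Finding a final-stage significance level that restores the target maximum FWER is, in essence, a one-dimensional root find on a function that increases with the significance level. The sketch below substitutes a plain bisection for the linear interpolation described above, and an analytic stand-in for the FWER evaluation; both are our simplifications, not the actual program.

```python
def calibrate_final_alpha(fwer_fn, target=0.025, lo=1e-5, hi=0.05, tol=1e-5):
    """Bisection search for the final-stage significance level whose
    maximum FWER equals the target, assuming fwer_fn(alpha) is
    increasing in alpha (illustrative stand-in for the actual program,
    which interpolates over simulated FWERs)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fwer_fn(mid) > target:
            hi = mid     # boundary too liberal: tighten
        else:
            lo = mid     # boundary conservative enough: relax
    return (lo + hi) / 2

# Stand-in FWER for three independent comparisons (Sidak-style);
# in practice fwer_fn would be the simulated maximum FWER of the design.
alpha_final = calibrate_final_alpha(lambda a: 1 - (1 - a) ** 3)
```

In practice each evaluation of fwer_fn is a simulation of the full design, so a method needing few evaluations, such as interpolation or bisection, is preferable.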
We applied this method to two MAMS trials utilising an intermediate outcome measure (i.e.
Results
Simulation results
Two-arm designs
Our simulations indicate that, in a two-arm two-stage design, the inclusion of the Haybittle–Peto efficacy rule at the interim stage has a minimal impact on the PWER under any configuration of the timing of the interim analysis, the value of the final-stage significance level and the design allocation ratio. See Supplemental Appendix E for further details of these results. The extent of inflation of the FWER is determined by the choice of efficacy stopping boundary and whether an intermediate outcome is used (see Table 3). While non-binding lack-of-benefit boundaries increase the absolute FWER, the relative inflation is no larger than that under binding boundaries, so the assumed approach does not affect the interpretation of the results presented.
Impact of the choice of efficacy boundary (EB)
Implementing the Haybittle–Peto rule in a three-stage design (
For a design where I = D and
An O’Brien–Fleming type rule inflates the FWER the most by 17% when I = D, due to the liberal p-values required at the first two stages to declare efficacy (e.g.
MAMS designs
Table 4 shows the impact of increasing the number of pairwise comparisons and stages when I = D and
Impact of the number of stages and arms on the FWER with Haybittle–Peto efficacy boundary (EB; p = 0.0005) (all SEs < 0.0002; lack-of-benefit boundaries as described in text; allocation ratio = 1 (for alternative allocation ratios in two-stage designs, see Supplemental Appendix E)).
FWER: familywise error rate.
The relative inflation increases with the number of stages in the trial, as the number of opportunities to drop arms early for efficacy increases. However, the inflation when I = D is arguably negligible at less than 2%, and the maximum FWER inflation remains below 5% when
Extending the design to MAMS settings does not materially change the results observed from the two-arm two-stage simulations. While the absolute FWER naturally increases, there is no impact on the relative effect of incorporating efficacy looks with more research arms, and the relative inflation when increasing stages remains constant with any number of arms.
In accordance with Table 3, an O’Brien–Fleming type rule implemented in a MAMS design inflates the FWER by up to 17% when I = D, but no inflation of the maximum FWER is observed when
The three power measures are almost unaffected by the implementation of efficacy boundaries for all possible design configurations. The induced between-arm correlation due to the common control arm is found to increase all-pair power, compared to a design with independent treatment arms, and (negligibly) decrease any-pair power.
Under a simultaneous stopping rule, the FWER is the same as under a separate stopping rule, regardless of whether the trial terminates early. Since the FWER measures the probability of at least one type I error under the global null, type I errors that would be made after an arm is dropped for efficacy do not increase the FWER. Simulations found that the PWER decreases marginally (e.g. by 0.001 for a four-stage design with four arms).
Example: implementing efficacy boundaries in MAMS trials
The operating characteristics for the example MAMS trials STAMPEDE and ICON5 are shown in Table 5, for the original design specifications and with each of the three efficacy stopping rules. Both trials show some inflation of the type I error when efficacy bounds are hypothetically incorporated, owing to the use of an intermediate outcome, reflecting the theoretical results of the simulation study. We also demonstrate how to control the FWER in these trials for such stopping rules.
Impact on operating characteristics of STAMPEDE and ICON5 when controlling the FWER at 2.5% with the addition of efficacy boundaries (EBs). The designs with no EBs assessed non-binding lack-of-benefit only at interim analyses (ICON5:
FWER: familywise error rate; HP: Haybittle–Peto; OBF: O’Brien–Fleming; PWER: pairwise error rate.
ICON5:
The actual FWER in both trials differed due to the research arms being dropped, as described in the text.
The two-stage ICON5 trial, when retrospectively designed with the Haybittle–Peto stopping rule, would require the final-stage significance level
For the original STAMPEDE design, the trial would be vulnerable to greater inflation than ICON5 when incorporating efficacy bounds at interim for the definitive outcome, due to the additional two stages in the design. A total of 19 (3%) additional control arm events would be required to control the maximum FWER at 2.5% when using a Haybittle–Peto efficacy stopping rule compared to a design only assessing lack-of-benefit, reducing
Discussion
In this article, we have demonstrated how efficacy stopping rules can be incorporated into MAMS designs under the framework of Royston et al. We have also addressed concerns about how the operating characteristics would be affected by early assessments for efficacy on the definitive outcome. There is no consensus on the circumstances under which the FWER should be controlled.10,23 However, we have demonstrated how to control the FWER in practice if required, using the original four-stage STAMPEDE trial design as an example, by modifying the final-stage significance level, thereby increasing the number of patients and the length of the trial. Control of the PWER could be achieved using the same methods by specifying the trials as two-arm designs.
In summary, our findings suggest that (binding) lack-of-benefit stopping rules will generally decrease the type I error rates and, marginally, the power. In contrast, efficacy stopping boundaries have the potential to increase the type I error rate with no impact on power. The simulation results indicate that the extent of this increase primarily depends on the shape and p-value thresholds of the stopping rule used. They also show that in two-stage designs the inflation remains below 2% for varying configurations of the allocation ratio, number of research arms and timing of the analyses. Designs with three or more stages may see greater inflation of the FWER when
When choosing an efficacy stopping boundary, for a three-stage design the Haybittle–Peto rule was not observed to inflate the FWER but can be conservative. When

Choosing an efficacy stopping rule based on the design and willingness to modify
A fundamental aspect of the design is that the timing of interim analyses is driven by the accrual of control arm events on the intermediate outcome. At the design stage, it should be considered whether it is too early to assess efficacy at the interim stages based on the number of events expected on the definitive outcome. If data from previous trials are available, a judgement can easily be made on whether or not to implement efficacy boundaries; otherwise, a sensitivity analysis can be made under different assumptions for the distribution of I- and D-outcomes. Royston et al.2 recommend the significance level for lack-of-benefit at stage 1 be no larger than
Because interim efficacy assessments involve repeated hypothesis testing, they may introduce some small bias in the point estimates for the arms dropped early. Choodari-Oskooei et al.24 demonstrated how bias in point estimates for arms dropped for lack-of-benefit is reduced by following up patients until the planned end of the trial. We expect a similar result with efficacy boundaries, but this should be formally explored.
The choice and definition of error rates depend on the research question and the design of a MAMS trial. There are at least three possible approaches should a pairwise comparison for a research arm cross an efficacy boundary: (1) stop the trial and cease recruitment to all arms; (2) continue with the remaining research arms to make the final decision based on the totality of evidence; and (3) add the efficacious regimen to the remaining arms and continue with combination therapies in both control and remaining research arms (e.g. the approach taken in STAMPEDE).4 Note that the third approach is only appropriate where the original research arms include the control arm regimen. The results in this article cover the first two approaches (focusing on the second), but the framework can also handle the third, since pairwise comparisons are only made between the research and control arms on patients recruited contemporaneously. Some alternative MAMS designs adopt the first approach, where it may be of interest to stop the entire trial as soon as an effective regimen is identified, such as in dose-ranging trials. Examples are the MAMS design proposed by Magirr et al.15 using the MAMS package in R, and the EAST6 software (http://www.cytel.com/software/east), though neither could accommodate intermediate measures for time-to-event outcomes at the time of submission.
We have updated the nstage program and its help documentation in Stata to support the use of efficacy stopping rules in MAMS trial designs, including an option to search for boundaries which preserve the FWER at the desired level assuming non-binding lack-of-benefit boundaries.25 The PWER, FWER and the three power measures described are evaluated by simulation in the program. See Supplemental Appendix D for the relevant commands.
Efficacy stopping rules can easily be implemented for alternative outcome measures in MAMS designs, such as binary or continuous outcomes, using the same principles applied here. The impact on the FWER can be investigated by following the same simulation procedure in nstage 12 to evaluate the FWER.
Supplemental Material
Supplemental material for 'Assessing the impact of efficacy stopping rules on the error rates under the multi-arm multi-stage framework' by Alexandra Blenkinsop, Mahesh KB Parmar and Babak Choodari-Oskooei, in Clinical Trials.
Footnotes
Acknowledgements
We would like to thank Prof. Patrick Royston, Prof. Cyrus Mehta and Matt Sydes for their helpful comments on an earlier version of this manuscript. We also thank three anonymous reviewers and the associate editor for their detailed comments.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
This work was supported by the Medical Research Council (grant no. MC_UU_12302329).
Supplemental material
Supplemental material for this article is available online.
References
