Abstract
Background/Aims:
Multi-arm, multi-stage trials frequently include a standard-care arm to which all interventions are compared. This may increase costs and hinder comparisons among the experimental arms. Furthermore, the appropriate standard care may not be evident, particularly when there is large variation in standard practice. Thus, we aimed to develop an adaptive clinical trial that drops ineffective interventions following an interim analysis and then selects the best intervention at the final stage, without requiring a standard-care arm.
Methods:
We used Bayesian methods to develop a multi-arm, two-stage adaptive trial and evaluated two different methods for ranking interventions at both the interim and final analyses: the probability that each intervention is optimal (P_best) and the surface under the cumulative ranking curve (SUCRA). The proposed trial design determines the maximum sample size for each intervention using the Average Length Criterion. The interim analysis takes place at approximately half the pre-specified maximum sample size and drops interventions for futility if either P_best or the SUCRA falls below a pre-specified threshold. The final analysis compares all remaining interventions at the maximum sample size and concludes superiority based on either P_best or the SUCRA. The two ranking methods were compared across 12 scenarios that vary the number of interventions and the assumed differences between them. The thresholds for futility and superiority were chosen to control the type 1 error, and the predictive power and expected sample size were then evaluated across scenarios. Finally, we designed a trial comparing three interventions that aim to reduce anxiety for children undergoing laceration repair in the emergency department, known as the Anxiolysis for Laceration Repair in Children (ALICE) trial.
Results:
As the number of interventions increases, the SUCRA results in a higher predictive power compared with P_best. Using P_best results in a lower expected sample size when there is an effective intervention. Using the Average Length Criterion, the ALICE trial has a maximum sample size of 100 patients per arm. This sample size results in an 86% and 85% predictive power using P_best and the SUCRA, respectively. Thus, we chose P_best as the ranking method for the ALICE trial.
Conclusion:
Bayesian ranking methods can be used in multi-arm, multi-stage trials with no clear control intervention. When more interventions are included, the SUCRA results in higher power than P_best. Future work should consider whether other ranking methods may also be relevant for clinical trial design.
Introduction
Novel interventions are frequently evaluated in two-arm clinical trials, which compare the intervention against either a placebo or a standard-of-care control.1 However, head-to-head comparisons of different effective interventions are less frequent, particularly when multiple interventions are developed concurrently.2 Multi-arm studies, which compare a relatively large number of interventions, can be a more efficient method for performing these comparisons,3 particularly if they stop recruitment to less effective interventions early.4 This is a type of adaptive trial, in which the number of interventions still enrolling is based on accumulating study data.5,6
Typically, multi-arm adaptive trials perform all analyses in a pairwise fashion against a common control7,8 to determine whether each of the interventions is superior to the control. However, this framework of pairwise comparisons is neither economical nor ethical9–12 and restricts comparisons among the experimental interventions, which may be the true target of the proposed trial.13 This is especially crucial when there is no clear consensus on the standard-of-care intervention, for example, when there is large variation in practice across institutions or novel interventions have been developed concurrently.14 In two-arm trials, the designation of a 'control' arm between two active comparators is not crucial, but in multi-arm trials, it is not obvious how the multiple arms should be compared.10
Thus, we aimed to design a multi-arm trial to identify the optimal intervention from a set of active comparators, equivalent to a phase III study. This trial is most relevant to settings where a range of interventions are used, with different interventions favoured by different sites, and there is limited evidence on which intervention offers superior performance. In this setting, we can assume that the investigators are comparing the efficacy of interventions for which safety is well-established and hope to identify less effective interventions early. To achieve this, our proposed trial uses a two-stage design where low-ranked interventions are dropped at the first stage, and the remaining interventions are assessed to determine which, if any, is optimal. We implemented this design in a Bayesian framework and made decisions using the rank of each intervention.15 The Bayesian framework is well-suited to this design as posterior ranks can be easily computed and it naturally fits within an adaptive design framework.16,17
This study compared two methods to rank the interventions. The first method calculates the probability that an intervention is better than all other interventions (P_best). The second method uses the surface under the cumulative ranking curve (SUCRA), which summarizes the entire ranking distribution. To our knowledge, these ranking methods have only been compared in the context of network meta-analysis.18 Thus, our study is the first to compare different ranking methods for decision-making in multi-arm trial designs. We compared the ability of the SUCRA and P_best to correctly identify the optimal intervention across a range of scenarios.
Methods
Multi-arm multi-stage trials
Multi-arm multi-stage (MAMS) trials evaluate multiple interventions and use interim analyses to determine whether trial arms should be dropped or continue to the next stage.4,27 Frequentist MAMS trials have been well-defined and use repeated statistical tests to determine whether interventions should be dropped.4,28,29 In contrast, Bayesian MAMS trials determine whether interventions should be discontinued or declared superior using predefined decision rules based on estimates from the posterior distribution of the parameters of interest.4 Some MAMS trials will continue until a conclusion has been reached,30 while others pre-specify a maximum sample size and number of stages.4
In our design, we considered two stages and applied different decision rules for the interim and final analyses. During the interim analysis, we decided which interventions were promising enough to proceed to the final analysis stage by determining whether their efficacy exceeded a given stopping boundary.31 The final analysis aimed to identify the optimal intervention among those that had not been stopped and took place when all participants had been recruited to those interventions. Thus, our proposed trial requires a pre-specified maximum recruitment level for each intervention and the boundaries to (1) stop an intervention at an interim analysis and (2) declare superiority at the final analysis. These key trial design components can then be chosen to minimize the sample size while ensuring that the trial meets its predefined target(s).31
Our trial is a two-stage design; Figure 1 provides a pictorial representation of the decision-making process. At the interim analysis, which takes place at approximately half the maximum sample size, interventions whose ranking falls below the futility threshold are dropped. If all interventions except one are dropped, the remaining intervention is declared superior and the trial stops early; otherwise, the remaining interventions continue to the final analysis at the maximum sample size.

Figure 1. A pictorial representation of the decision-making in the proposed two-stage trial design. The first and second scenarios proceed to the final analysis stage, where at least two interventions are evaluated at the maximum sample size. Superiority is declared at the interim analysis stage in the final scenario, so the trial is terminated early and no interventions continue to the maximum sample size.
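As a minimal sketch of these decision rules, the following Python code (our own illustration; the function names and the boundaries eps_fut and eps_sup, which stand in for the futility and superiority thresholds, are not taken from the original design) applies the interim and final rules to a vector of ranking statistics such as P_best or SUCRA values.

    import numpy as np

    def interim_decision(rank_stat, eps_fut):
        """Return the indices of arms that continue past the interim look.

        rank_stat: one ranking statistic (P_best or SUCRA) per active arm.
        Arms whose statistic falls below the futility boundary are dropped;
        if only one arm survives, it is declared superior and the trial
        stops early (the final scenario in Figure 1).
        """
        return np.where(np.asarray(rank_stat) >= eps_fut)[0]

    def final_decision(rank_stat, eps_sup):
        """Declare superiority if exactly one arm exceeds the boundary."""
        winners = np.where(np.asarray(rank_stat) >= eps_sup)[0]
        return winners[0] if winners.size == 1 else None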
Implementing our proposed design
Before presenting the methods to select the maximum sample size, we introduce the statistical model. We assume that the outcome \(y_{ij}\) for participant \(j\) receiving intervention \(i\) follows a normal distribution,

\[ y_{ij} \mid \mu_i, \tau_i \sim N\left(\mu_i, \tau_i^{-1}\right), \quad (1) \]

where \(\mu_i\) and \(\tau_i\) are the mean and precision of the outcome for intervention \(i\). We place a conjugate normal-gamma prior on \((\mu_i, \tau_i)\),

\[ \tau_i \sim \text{Gamma}(\alpha_i, \beta_i), \qquad \mu_i \mid \tau_i \sim N\left(\mu_{0i}, (n_{0i}\tau_i)^{-1}\right), \quad (2) \]

where \(\mu_{0i}\), \(n_{0i}\), \(\alpha_i\), and \(\beta_i\) are the prior parameters for intervention \(i\).
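Because the model is conjugate, the posterior parameters are available in closed form. The following Python sketch (our own notation, matching equations (1) and (2)) performs the standard normal-gamma update for a single arm.

    import numpy as np

    def update_normal_gamma(y, mu0, n0, a, b):
        """Standard closed-form normal-gamma posterior update for one arm.

        y: observed outcomes; (mu0, n0, a, b): prior parameters, i.e.
        tau ~ Gamma(a, b) and mu | tau ~ N(mu0, 1 / (n0 * tau)).
        Returns the posterior parameters (mu_n, n_n, a_n, b_n).
        """
        y = np.asarray(y)
        n, ybar = y.size, y.mean()
        ss = np.sum((y - ybar) ** 2)        # within-sample sum of squares
        n_n = n0 + n
        mu_n = (n0 * mu0 + n * ybar) / n_n  # precision-weighted mean
        a_n = a + n / 2
        b_n = b + 0.5 * ss + 0.5 * n0 * n * (ybar - mu0) ** 2 / n_n
        return mu_n, n_n, a_n, b_n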
Determining the maximum sample size
The proposed trial design requires specifying the maximum sample size for each trial arm. We proposed that the maximum sample size is derived using the Average Length Criterion (ALC), a Bayesian method for sample size determination.25 The ALC controls the average length of the posterior credible interval for the parameters of interest, typically the treatment effect but, in our design, the mean outcome for each intervention.
To adapt the ALC to a multi-arm study, we computed the length of the longest posterior credible interval across the intervention means and averaged this longest length over data sets simulated from the design prior. In this study, we selected the maximum sample size as the smallest sample size for which this average longest length fell below a pre-specified target length.
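A sketch of this search in Python is given below (our own illustration, reusing update_normal_gamma from the sketch above; the helper names and the 95% interval level are our assumptions). It exploits the fact that, under the normal-gamma model, the marginal posterior of each mean is a symmetric Student-t distribution, so the equal-tailed interval coincides with the high-density interval.

    import numpy as np
    from scipy import stats

    def draw_prior_predictive(n, prior, rng):
        """Simulate n outcomes for one arm from its prior-predictive."""
        mu0, n0, a, b = prior
        tau = rng.gamma(a, 1 / b)                    # precision draw
        mu = rng.normal(mu0, 1 / np.sqrt(n0 * tau))  # mean given precision
        return rng.normal(mu, 1 / np.sqrt(tau), size=n)

    def interval_length(y, prior, level=0.95):
        """Length of the posterior credible interval for one arm's mean."""
        mu_n, n_n, a_n, b_n = update_normal_gamma(y, *prior)
        scale = np.sqrt(b_n / (a_n * n_n))           # Student-t scale
        t_crit = stats.t.ppf(0.5 + level / 2, df=2 * a_n)
        return 2 * t_crit * scale

    def alc_sample_size(candidates, priors, target, n_sims=10_000, seed=None):
        """Smallest per-arm n whose average longest interval is below target."""
        rng = np.random.default_rng(seed)
        for n in candidates:                         # assumed sorted ascending
            longest = [
                max(interval_length(draw_prior_predictive(n, p, rng), p)
                    for p in priors)
                for _ in range(n_sims)
            ]
            if np.mean(longest) <= target:
                return n
        return None

For the ALICE trial described below, candidates would be the grid of sample sizes from 70 to 140 in increments of 10.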
Determining the futility and superiority thresholds
We selected the futility and superiority thresholds to control the type 1 error of the trial at a pre-specified level. To evaluate the type 1 error, we accounted for the prior uncertainty in the parameters by simulating trials from the prior-predictive distribution under the assumption that no intervention is superior to the others.
Bayesian predictive power
Once the thresholds were selected, we computed the Bayesian predictive power, defined as the probability that the trial concludes superiority of a single intervention in one of two ways: (1) at the interim analysis, all interventions except one meet the futility criterion; or (2) at the final analysis, a single intervention meets the superiority criterion.
Predictive power can also be computed by evaluating the probability of detecting superiority for a specific intervention, but this power is restricted by the prior probability that the intervention is optimal, whereas our proposed definition can reach 1.
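Under these definitions, the operating characteristics can be estimated by simulating complete two-stage trials. The sketch below (our own illustration, reusing the helpers from the previous sketches; rank_fn is a placeholder for either ranking method) estimates the probability of declaring a single winner; run with priors encoding equal means, the same routine yields the type 1 error used to calibrate the thresholds.

    def posterior_mean_draws(y, prior, n_draws, rng):
        """Monte Carlo draws of one arm's mean from its posterior."""
        mu_n, n_n, a_n, b_n = update_normal_gamma(y, *prior)
        tau = rng.gamma(a_n, 1 / b_n, size=n_draws)
        return rng.normal(mu_n, 1 / np.sqrt(n_n * tau))

    def predictive_power(priors, n_int, n_max, rank_fn, eps_fut, eps_sup,
                         n_trials=10_000, n_draws=10_000, seed=None):
        """Proportion of simulated trials declaring a single winner."""
        rng = np.random.default_rng(seed)
        wins = 0
        for _ in range(n_trials):
            data = [draw_prior_predictive(n_max, p, rng) for p in priors]
            # Interim look at roughly half the maximum sample size
            interim = rank_fn([posterior_mean_draws(d[:n_int], p, n_draws, rng)
                               for d, p in zip(data, priors)])
            keep = [i for i, s in enumerate(interim) if s >= eps_fut]
            if len(keep) == 0:        # all arms dropped; no winner
                continue
            if len(keep) == 1:        # superiority concluded at interim
                wins += 1
                continue
            # Final analysis on the arms continuing to the maximum size
            final = rank_fn([posterior_mean_draws(data[i], priors[i],
                                                  n_draws, rng)
                             for i in keep])
            wins += int(sum(s >= eps_sup for s in final) == 1)
        return wins / n_trials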
Evaluating the ranking methods
We performed a simulation study to compare two different ranking methods for decision-making in our trial, reported using the aims, data-generating mechanisms, estimands, methods, and performance measures (ADEMP) framework.35
Aims
We aimed to compare the use of P_best and the SUCRA for decision-making in the proposed multi-arm, two-stage trial design.
Data-generating process
The data were simulated from the prior-predictive distribution of the normal-gamma conjugate model (see equations (1) and (2)). The parameters of the normal-gamma conjugate model were chosen to mimic the design of the Anxiolysis for Laceration Repair in Children (ALICE) trial, described below, and varied across scenarios in the number of interventions and the assumed differences between their prior means.
Estimands
Depending on the underlying assumptions for the prior means of the interventions, the key estimand is either the type 1 error, defined as the probability that a trial declares superiority of any intervention when all interventions are assumed equally effective, or the predictive power, defined as the probability that the trial declares superiority of a single intervention when one intervention is assumed to be superior.
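In symbols (using our notation from equations (1) and (2)), writing \(D\) for the event that the trial declares a single intervention superior at either stage, these estimands can be written as:

\[ \alpha = \Pr\left(D \mid \mu_1 = \cdots = \mu_K\right), \qquad \text{power} = \Pr\left(D \mid \text{one } \mu_i \text{ is assumed superior}\right), \]

where, in both cases, the probability is taken over the relevant prior-predictive distribution of the data.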
Methods
We compared two ranking methods for drawing conclusions of superiority and futility during the trial. The rankings are computed using the posterior mean for each intervention. The first method was P_best, the probability that an intervention has the best posterior mean among all interventions, estimated through simulation from the posterior distributions with 10,000 simulations. The second method was the SUCRA, which numerically summarizes the entire ranking distribution. To define the SUCRA, let \(r_{ik}\) denote the posterior probability that intervention \(i\) has rank \(k\) among the \(K\) interventions, where rank 1 is best. The SUCRA for intervention \(i\) is then

\[ \text{SUCRA}_i = \frac{1}{K-1} \sum_{k=1}^{K-1} \sum_{l=1}^{k} r_{il}, \]

that is, the scaled area under the cumulative ranking curve for intervention \(i\).
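A compact Python sketch of both statistics from Monte Carlo draws of the arm means is given below (our own illustration; lower outcomes are taken as better, as for the OSBD-R distress score used in the ALICE trial).

    import numpy as np

    def p_best(draws):
        """P_best: probability each arm has the best (lowest) mean.

        draws: list of equal-length arrays of posterior draws of the
        arm means, one array per intervention.
        """
        d = np.column_stack(draws)                 # draws x arms
        best = np.argmin(d, axis=1)                # best arm in each draw
        return np.bincount(best, minlength=d.shape[1]) / d.shape[0]

    def sucra(draws):
        """SUCRA: scaled area under each arm's cumulative ranking curve."""
        d = np.column_stack(draws)
        n, k = d.shape
        # ranks[s, i] = rank of arm i in draw s (1 = best, i.e. lowest mean)
        ranks = np.argsort(np.argsort(d, axis=1), axis=1) + 1
        # r[i, j] = posterior probability that arm i attains rank j + 1
        r = np.stack([np.bincount(ranks[:, i] - 1, minlength=k)
                      for i in range(k)]) / n
        cum = np.cumsum(r, axis=1)                 # cumulative ranking probs
        return cum[:, :-1].sum(axis=1) / (k - 1)   # SUCRA per arm

With three arms, for example, p_best returns a length-3 vector summing to 1, and an arm that is certain to be ranked first has a SUCRA of 1.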
To ensure comparability between the two ranking methods, the thresholds for futility and superiority were selected separately for each method so that both controlled the type 1 error at the same level.
Performance measures
We used 10,000 simulated trials in each scenario, which guarantees that the estimated quantities have a 95% probability of being within 0.002 of the reported value. We estimate the type 1 error and predictive power as the proportion of simulations in which the relevant criteria are met. We report the chosen thresholds, the type 1 error, the predictive power, and the expected sample size for each scenario.
Designing the ALICE trial
The ALICE trial is a phase III, multi-centre, single-blinded, randomized, three-arm, adaptive trial that aims to compare three anxiolytic agents, intranasal midazolam (INM), intranasal dexmedetomidine (IND), and nitrous oxide (N2O), for children undergoing laceration repair in the emergency department.
The ALICE trial will enrol children aged between 1 and 13 years who present to the emergency department with a single laceration requiring simple interrupted sutures alone. The child or caregiver must also desire anxiolysis for the repair. The primary outcome is a weighted mean anxiolysis score, measured using the Observational Scale of Behavioral Distress – Revised (OSBD-R),39 which ranges from 0 (no distress) to 23.5 (maximal distress).
Determining the priors
The prior information for the ALICE trial was extracted from the published literature and from a study undertaken by our team.24 The priors for the mean OSBD-R scores for INM, IND, and N2O are presented in Table 1.
Table 1. The prior parameters used for the ALICE trial.
N2O: nitrous oxide; INM: intranasal midazolam; IND: intranasal dexmedetomidine.
Sample size determination
The sample size for the ALICE trial was selected using the ALC, conditional on the priors in Table 1. We simulated 10,000 data sets, conditional on the design prior, for eight different sample sizes from 70 to 140, in increments of 10. For each data set, we obtained the length of the longest 95% high-density posterior credible interval for the mean OSBD-R score and then averaged these lengths across the simulated data sets.
Designing the ALICE trial
We evaluated both ranking methods to determine the optimal design for the ALICE trial. The optimal thresholds for futility and superiority were chosen to control the type 1 error, and the resulting predictive power and expected sample size were compared between the two ranking methods.
Results
Evaluating the ranking methods
Tables 2 and 3 display the results of our simulation study of the two ranking methods. We report the results for the optimal thresholds, chosen to control the type 1 error in each scenario.
Table 2. The type 1 error and expected sample size (ESS) under the null hypothesis, obtained for four different scenarios that vary the number of interventions in the adaptive trial.
ESS: expected sample size; SUCRA: surface under the cumulative ranking curve.
Table 3. The predictive power, expected sample size (ESS), and probability of incorrectly identifying a superior treatment under the alternative hypothesis, obtained from 12 simulated cases varying the number of interventions and the incremental difference between the prior means of the interventions.
ESS: expected sample size; SUCRA: surface under the cumulative ranking curve.
Ranking interventions using P_best results in a lower expected sample size when there is an effective intervention, whereas the SUCRA achieves a higher predictive power as the number of interventions increases.
The expected sample size does not change substantially as the difference between the outcomes increases. We believe this is because the precision of the estimates is the same across the different scenarios and the relatively small sample sizes ensure that interventions are retained in the trial.
The ALICE trial
Table 4 displays the average longest 95% high-density posterior credible interval length for the mean of the OSBD-R score. A maximum sample size of 100 per arm was chosen for the ALICE trial, as the corresponding average longest interval length of 0.66 is close to the pre-specified target length.
Table 4. The range of sample sizes and the corresponding average longest 95% high-density posterior credible interval lengths for the mean OSBD-R score.
Based on these results, and given the similar predictive power of the two methods at this sample size (86% for P_best and 85% for the SUCRA), we selected P_best as the ranking method for the ALICE trial, as it yields a lower expected sample size.
Discussion
This study evaluated two methods for ranking interventions to make decisions in a Bayesian, multi-arm, two-stage, adaptive trial across 12 scenarios. Broadly, this study showed that both methods can control the type 1 error, with the SUCRA providing a higher predictive power as the number of interventions increases and P_best yielding a lower expected sample size when an effective intervention exists.
We then used our novel design for the ALICE trial, a randomized trial to determine the optimal anxiolytic agent among three interventions to reduce distress for children undergoing laceration repair. Due to substantial practice variation, it was not obvious which intervention should be considered as the common comparator, necessitating the use of treatment rankings to make trial conclusions. This design is useful for trials where a placebo or clear standard of care is not available and the interventions have previously been shown to be effective. Examples include variation in clinical practice where head-to-head trials have not been performed, novel interventions developed at the same time by different teams or companies, common off-label use of drugs (for example, in paediatrics, where trials are lacking), and comparisons of non-drug interventions such as different implementation methods. A key challenge of the proposed trial design is choosing the design and analysis priors. We extracted these from previous literature but, in some settings, absolute outcome values may not be available, for example, if only relative treatment effects are reported, which would make this method challenging to implement. Another limitation of the proposed method was the use of conjugate distributions, which limited the models we could consider. For example, we could have considered a pooled precision across the different interventions, but this would have created an infeasible computational burden for our simulation study.
Furthermore, the simulation study could be expanded to consider additional ranking methods, which is an important avenue for future research. In particular, using only ranking metrics, rather than the absolute effects of interventions, may violate consistency, as different metrics may provide different treatment hierarchies.18 Moreover, the efficacy that we evaluate using the ranking metric may not be clinically significant. A future extension of this design could incorporate the effect size and a minimum clinically important difference in the superiority criteria. This has been suggested for network meta-analysis19 and could be extended to our trial design.
Conclusion
In multi-arm clinical trials with no obvious control, Bayesian methods for ranking interventions can be used to identify the optimal intervention from a set of effective alternatives. In trials with a small number of interventions, the probability that the treatment is superior provides high predictive power, while for larger numbers of interventions, the SUCRA offers increased predictive power. Our results showed that both ranking metrics can provide valid, powerful trials with different operating characteristics. Thus, we suggest that investigators carefully consider their trial design and the appropriate ranking method before the trial.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.