
Abstract

Background/Aims:

Bayesian designs for clinical trials using assurance to choose the sample size have been proposed in various trial contexts. Assurance allows for the incorporation of uncertainty on both the treatment effect and nuisance parameters into the sample size calculation. In the case of two-arm cluster randomised trials with continuous outcomes, assurance has been proposed with both a frequentist analysis (hybrid designs) and a Bayesian analysis (fully Bayesian designs). A Bayesian analysis in this context ensures a consistent treatment of probability throughout the design and analysis of the trial. In the fully Bayesian design, inference has been achieved via Markov chain Monte Carlo sampling, and since assurance itself is evaluated via simulation, the result is a computationally intensive and often slow-to-run approach. In the case of two-arm cluster randomised trials with binary outcomes, assurance has not yet been explored to specify sample sizes, either in the hybrid or fully Bayesian case.

Methods:

This article considers fully Bayesian designs for two-arm cluster randomised trials with continuous and binary outcomes. For the analysis of the trial, we use a (generalised) linear mixed-effects model. We summarise the inference for the treatment effect based on quantiles of the posterior distribution. We use assurance to choose the sample size. In the continuous case, we investigate Integrated Nested Laplace Approximations for inference to speed up calculation of the assurance and compare Integrated Nested Laplace Approximations in computation time and accuracy to Markov chain Monte Carlo. In the binary case, we develop the first fully Bayesian design for cluster randomised trials and conduct a similar comparison between Integrated Nested Laplace Approximations and Markov chain Monte Carlo. We demonstrate our novel approach using assurance to choose sample sizes for the SPEEDY cluster randomised trial, based on the results of a formal prior elicitation exercise with two clinical experts.

Results:

We report comparisons of Integrated Nested Laplace Approximations and Markov chain Monte Carlo for a range of different scenarios for cluster randomised controlled trials (RCTs), to determine when each inference scheme should be used, balancing the computational cost in terms of speed and accuracy. Overall, Markov chain Monte Carlo with a very large number of samples produces very accurate inference but does not scale well in terms of computational speed compared to Integrated Nested Laplace Approximations. Based on our simulation study, we recommend that Integrated Nested Laplace Approximations is used for inference in cluster trials with binary outcomes and large (n > 500) cluster trials with continuous outcomes, and that Markov chain Monte Carlo is used in smaller (n ≤ 500) cluster trials with continuous outcomes. Our case study demonstrated how to incorporate the uncertainty of trial clinicians into the sample size calculation to give an overall assessment of the likelihood of success of the trial.

Conclusion:

A fully Bayesian design can be used for two-arm cluster trials with both continuous and binary outcomes. Integrated Nested Laplace Approximations can allow for more efficient assessment of the assurance for cluster trials with binary outcomes and large cluster trials with continuous outcomes, without loss of accuracy in inference. A fully Bayesian design of a cluster randomised trial provides a coherent design and analysis framework and incorporates uncertainty in model parameters when choosing the sample size.

Keywords

Bayesian design, cluster randomised controlled trial, continuous outcome, binary outcome, design and analysis, priors, sample size

Background/Aims

Introduction

Sample size calculations are important in clinical trials, as they balance the need for precision while taking into account practical considerations such as cost and time. It is unethical to recruit more participants than needed, but too few participants risk not being able to answer the research question, wasting time and money, and inconveniencing patients. In this article, we focus on sample size calculations for two-arm superiority cluster randomised trials (CRTs), both with a continuous outcome¹ and with a binary outcome.²

For the CRT sample size calculation, we will use a Bayesian approach, as it has the advantage of using prior knowledge, or information from previous studies, which is useful when there is uncertainty in the parameters and complexity in the inferential model. The Bayesian approach gives an intuitive interpretation in these cases. It also allows more flexible decision-making. The Bayesian approach used to calculate the sample size is assurance,¹ which is an alternative to power. The evaluation of the assurance typically requires a two-loop Monte Carlo scheme, sampling from a design prior distribution in the outer loop and performing a Markov Chain Monte Carlo (MCMC) update to obtain samples of the treatment effect in the inner loop for each sample in the outer loop.

A particular challenge in this case, which is a problem more generally in Bayesian design of experiments, is computational cost. It can be time-consuming to run a full MCMC scheme for every iteration in the Monte Carlo procedure described above. In an attempt to reduce computation time, in this article, we investigate Integrated Nested Laplace Approximations (INLA)³ as an alternative to MCMC.⁴ This approach has been considered for individually randomised controlled trials,⁵ but it has not been investigated before in CRTs, which are more complex trials inferentially, requiring modelling of the cluster effects and intra-cluster correlation coefficient (ICC). There are other papers that have discussed the comparison between INLA and MCMC^6–9 with regression models of various types, but either their sole focus was accuracy, they only considered very large MCMC runs, or the models they considered were not comparable to those in this article. Our investigation focuses on the trade-off between speed and accuracy of inference on the treatment effect based on approximation using INLA and MCMC under varying numbers of posterior samples. As such, it provides a new perspective on the relative merits of MCMC and INLA in a clinical trials context.

We compare the inference resulting from MCMC using different numbers of posterior samples and INLA for continuous outcomes, considering a linear mixed-effects model as in Wilson (2023).¹ In general, it should be faster to obtain the posterior distribution for the marginal treatment effect using INLA than using a sampling scheme such as MCMC, particularly for complex designs and large sample sizes. However, INLA is an approximation, whereas MCMC samples from the true posterior distribution, and so with enough samples, can be arbitrarily accurate. We further outline Bayesian inference for a CRT with a binary outcome and undertake a comparison of MCMC and INLA for this case. Based on our investigation, we provide guidance on when INLA and MCMC are most suitable for Bayesian analysis of CRTs.

We demonstrate the approach by calculating the sample size of the case study SPEEDY trial¹⁰ using assurance,¹ for both continuous and binary co-primary outcomes. Based on our investigation, we use MCMC for the continuous outcome and INLA for the binary outcome. To evaluate the assurance, we use the prior distributions resulting from an expert elicitation exercise with the two co-leads in SPEEDY. We report the assurance and required sample sizes in each case from the priors for each expert and from an equally weighted prior between the two experts.

This article is structured as follows. We first review a standard approach to power calculations for two-arm superiority CRTs with continuous and binary outcomes. We then detail Bayesian inference for these trials and how to calculate assurance for CRTs. Next, we perform a simulation study comparing inference via MCMC and INLA in both cases, evaluating their accuracy and computation time. We then apply the approach to the SPEEDY trial. Finally, we summarise this article and identify future work.

Power calculation for two-arm superiority CRTs

Here, we summarise standard power calculations for CRTs, to provide a contrast to the assurance described later.

The power for a two-arm CRT with a continuous outcome is given by the conditional probability that we reject the null hypothesis of a treatment effect of zero (for example), given an assumed treatment effect and values chosen for a set of nuisance parameters detailed below. For a sample size $n$, given by the product of the number of clusters $C$ and the average sample size in a cluster $\bar{n}$, we can approximate the power function for a one-sided Wald test of the treatment effect at significance level $α$¹¹ via

P (n \mid δ, ψ) = Φ \left( δ \sqrt{\frac{C \bar{n}}{4 σ^{2} [1 + \{(ν^{2} + 1) \bar{n} - 1\} ρ]}} - z_{1 - α} \right)

(1)

where $δ$ is the treatment effect, $ψ = (σ, ρ, ν)$ is the vector of nuisance parameters given by the overall standard deviation $σ$, ICC $ρ$ and coefficient of variation in cluster sizes $ν$, $z_{1 - α}$ is the $100 \times (1 - α)\%$ quantile of the standard normal distribution, $α$ is the significance level of the Wald test, and $Φ$ is the cumulative distribution function of the standard normal distribution. The sample size is chosen to be the smallest value which gives at least a desired power $1 - β$, where $β$ is the type II error rate.

The power in the binary case can be expressed^12,13 as

\begin{array}{l} P (n \mid p_{1}, p_{2}, ρ) = Φ \left\{ \frac{(p_{2} - p_{1}) - z_{1 - α} σ_{p}}{σ_{D}} \right\} \\ + Φ \left\{ \frac{(p_{1} - p_{2}) - z_{1 - α} σ_{p}}{σ_{D}} \right\} \end{array}

(2)

where $(p_{1}, p_{2})$ are the probabilities of a positive primary outcome in the control and treatment arms, respectively, and $σ_{p}$ is the pooled standard deviation given by

σ_{p} = \sqrt{\frac{τ [\bar{p} (1 - \bar{p})]}{n}}

and $σ_{D}$ is the standard deviation of the difference between the probabilities and is given by

σ_{D} = \sqrt{\frac{2 τ [p_{1} (1 - p_{1}) + p_{2} (1 - p_{2})]}{n}} .

Here, $\bar{p} = (p_{1} + p_{2}) / 2$ and $τ = 1 + ρ (\bar{n} - 1)$ is the design effect, which is assumed equal in the control and treatment arms. We choose $n$ as the smallest value that gives the required power $1 - β$. In both formulas, for the continuous and binary outcomes, using $\frac{α}{2}$ in place of $α$ gives the power for the two-sided test.

In general, we will have uncertainty about the true values of the (nuisance) parameters in the power calculations above. By defining a prior distribution on the (nuisance) parameters, rather than assuming single values as in power, we can take this uncertainty into account in the sample size calculation. The resulting quantity is known as the assurance and can be used to choose the sample size for a CRT in combination with either a frequentist or a Bayesian analysis, respectively known as a hybrid and a fully Bayesian design.

Methods

Bayesian inference for two-arm CRTs

An alternative to the hypothesis-testing analyses which formed the basis of the power functions in the previous section is to perform a Bayesian analysis of the trial. This has the advantage of allowing prior information to be incorporated into the analysis and provides a coherent framework for design and analysis if the assurance is to be used to choose the sample size, which will be described below. In this section, we detail Bayesian inference for CRTs.

We describe the inference for the treatment effect for a CRT with a continuous outcome, based on the posterior distribution, as described in Spiegelhalter.¹⁴ For the binary outcome, we can perform inference using a similar approach to that of Turner.² Then, based on this inference, we use the developed assurance from Wilson¹ to choose the CRT sample size. For the inference, we consider comparison of treatment with control.

A (generalised) linear mixed-effects model can be used, with continuous response $Y_{i j} ~ N (μ_{i j}, σ_{w}^{2})$ or binary response $Y_{ij} ~ Bern (θ_{ij})$ , where $Y_{ij}$ are observed for individuals $i = 1, \dots, n_{j}$ in clusters $j = 1, \dots, J$ , and the linear predictor is given by

η_{ij} = λ + X_{j} δ + c_{j}, c_{j} ~ N (0, σ_{b}^{2})

with $η_{ij} = μ_{ij}$ for the continuous outcome and $η_{ij} = \log (\frac{θ_{ij}}{1 - θ_{ij}})$ for the binary outcome. In addition, $λ$ is the control arm mean response, $X_{j} = 1$ if cluster $j$ is the treatment arm and $X_{j} = 0$ otherwise, $δ$ is the treatment effect, and $c_{j} ~ N (0, σ_{b}^{2})$ is a random cluster effect, with $σ_{b}^{2}$ being the between cluster variance, with additionally $σ_{w}^{2}$ being the within-cluster variance in the continuous case.
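To make the model concrete, the following Python sketch simulates one trial from the linear predictor above, for either outcome type. It is an illustration only: the function name `simulate_crt`, the alternating allocation of clusters to arms, and the equal cluster sizes are assumptions made for the example. The parameter values are the true values used in the simulation study reported later ($δ = 2$, an intercept of 1, $τ_{b} = 5$ and $τ_{w} = 0.25$).

```python
import math
import random

def simulate_crt(cluster_sizes, lam, delta, sigma_b, sigma_w=None, rng=None):
    """Simulate one two-arm CRT from eta_ij = lam + X_j * delta + c_j.

    If sigma_w is given, outcomes are continuous: Y_ij ~ N(eta_ij, sigma_w^2).
    If sigma_w is None, outcomes are binary: Y_ij ~ Bern(expit(eta_ij)).
    """
    rng = rng or random.Random()
    data = []
    for j, n_j in enumerate(cluster_sizes):
        x_j = j % 2                      # assumption: alternate clusters between arms
        c_j = rng.gauss(0.0, sigma_b)    # random cluster effect
        eta = lam + x_j * delta + c_j    # linear predictor, shared within a cluster
        if sigma_w is not None:
            y = [rng.gauss(eta, sigma_w) for _ in range(n_j)]
        else:
            theta = 1.0 / (1.0 + math.exp(-eta))   # inverse logit link
            y = [1 if rng.random() < theta else 0 for _ in range(n_j)]
        data.append({"cluster": j, "arm": x_j, "y": y})
    return data

rng = random.Random(1)
sigma_b = (1 / 5) ** 0.5      # from tau_b = 5
sigma_w = (1 / 0.25) ** 0.5   # from tau_w = 0.25
continuous = simulate_crt([10] * 8, lam=1.0, delta=2.0,
                          sigma_b=sigma_b, sigma_w=sigma_w, rng=rng)
binary = simulate_crt([10] * 8, lam=1.0, delta=2.0, sigma_b=sigma_b, rng=rng)
```

Each element of the returned list holds the outcomes for one cluster, which is the structure a mixed-effects analysis needs in order to estimate the cluster effects $c_{j}$.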

For Bayesian inference, the parameters $Ψ = (λ, δ, σ_{b}^{2})'$ and possibly $σ_{w}^{2}$ require prior distributions. There are various possibilities, but suitable forms for the marginal prior distributions^1,14 are

\begin{matrix} λ ~ N (m_{λ}, v_{λ}), δ ~ N (m_{δ}, v_{δ}), \\ τ_{b} = \frac{1}{σ_{b}^{2}} ~ Γ (r_{b}, s_{b}), τ_{w} = \frac{1}{σ_{w}^{2}} ~ Γ (r_{w}, s_{w}), \end{matrix}

where each $(m, v)$ and $(r, s)$ are hyperparameters to be chosen. In the analysis at the end of the trial, we may choose to make these prior distributions relatively non-informative, to be consistent with equipoise.

The inference in both cases is not conjugate, and so numerical or approximation methods are needed to evaluate the posterior distribution on the treatment effect $δ$ .

Previous work^1,14 in the continuous case has considered simulation from the posterior distribution of the treatment effect using MCMC. For large or complex CRTs, this can be computationally costly, and, when many runs of the MCMC are required as described for the design of the trial in the assurance, it may not be feasible to use MCMC at all. We propose INLA as an alternative to MCMC for inference on the treatment effect in CRTs and will compare MCMC and INLA under various scenarios, focusing on their accuracy and computational cost.

In the analyses in this article, we perform inference via MCMC using the R package rjags.⁴ The rjags package is used for Bayesian data analysis and interfaces between R and the JAGS library.¹⁵ It uses a combination of Gibbs sampling, Metropolis-Hastings sampling and slice sampling to sample from the posterior distribution. In our implementation of MCMC in rjags, we use a burn-in period to allow the MCMC chains to converge before recording samples.

To perform inference using INLA, we use the INLA package from the R-INLA project.³ The idea behind INLA is that it approximates the required integral to evaluate the posterior distribution using Laplace’s method. It can be used for the analysis of CRTs since the (generalised) linear mixed-effects models can be written as latent Gaussian models, for which the Laplace method can be applied (for further information, see Gómez-Rubio¹⁶). We obtain the required quantiles from the posterior distribution of the treatment effect directly from INLA, without the need for sampling.

Assurance

Following Wilson,¹ assurance evaluates the unconditional probability that the trial finds a significant treatment effect. This allows an appropriate sample size choice in the planning of any cluster randomised controlled trial (RCT) and is not conditional on chosen values of unknown parameters in the same way as the power.

Define an event ‘Success’ to be the successful outcome of the CRT, that is, treatment is superior to control. Then, for the sample size $n$ , the assurance is given by

A (n) = \int \int I_{A} [\text{Success} \mid y] \, f (y \mid Ψ, n) \, π_{D} (Ψ) \, d Ψ \, d y,

where $y$ is the vector of responses, $Ψ$ is the vector of model parameters, $I_{A} [Success | y]$ is an indicator function which takes the value 1 if the trial results in a success, $f (y | Ψ, n)$ is the probability density function of $y$ , and $π_{D} (Ψ)$ is the design prior distribution for $Ψ$ .

The total sample size in a cluster RCT is given by $n = \sum_{j = 1}^{J} n_{j}$. Specifying a total sample size in place of each individual cluster sample size is standard practice in cluster RCTs. In the case where there will not be the same number of individuals in each cluster, we can model the vector of cluster sizes $\mathbf{n} = {(n_{1}, \dots, n_{J})}^{'}$ as

\mathbf{n} ~ Multinomial (n, \mathbf{p})

where $\mathbf{p} = {(p_{1}, \dots, p_{J})}^{'}$ and $p_{j}$ is the probability that a randomly selected individual comes from cluster $j$. Similar to Wilson,¹ we choose for $\mathbf{p}$ a symmetrical Dirichlet prior distribution, $\mathbf{p} ~ Dirichlet (\mathbf{a})$, in the case where we have no reason to think any particular cluster is likely to be larger than any other a priori. In this case, $\mathbf{a} = {(a_{1}, \dots, a_{J})}^{'}$ and $a = a_{1} = \dots = a_{J}$. When the value of $a$ is smaller, the variation in cluster sizes increases. When $a_{j} \neq a_{j^{'}}$ for $j \neq j^{'}$, this will lead to unequal prior probabilities of recruitment in each cluster.
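The Dirichlet-multinomial model for cluster sizes can be sketched in a few lines of Python using only the standard library. The function name `sample_cluster_sizes` and the values of $a$ below are illustrative assumptions, not values from the trial.

```python
import random

def sample_cluster_sizes(n_total, n_clusters, a, rng=None):
    """Draw p ~ Dirichlet(a, ..., a), then cluster sizes ~ Multinomial(n_total, p)."""
    rng = rng or random.Random()
    # A symmetric Dirichlet draw is a vector of independent Gamma(a, 1) draws, normalised.
    g = [rng.gammavariate(a, 1.0) for _ in range(n_clusters)]
    total = sum(g)
    p = [x / total for x in g]
    # A multinomial draw assigns each of the n_total individuals to a cluster with probabilities p.
    sizes = [0] * n_clusters
    for j in rng.choices(range(n_clusters), weights=p, k=n_total):
        sizes[j] += 1
    return sizes

rng = random.Random(2024)
similar_sizes = sample_cluster_sizes(500, 10, a=50.0, rng=rng)   # large a: p_j close to 1/J
variable_sizes = sample_cluster_sizes(500, 10, a=0.5, rng=rng)   # small a: p_j vary widely
```

By construction the sampled sizes always sum to the total sample size, so the design is still indexed by a single number $n$ while allowing unequal clusters.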

The assurance for total sample size $n$ can be evaluated using a standard Monte Carlo simulation approach, as

A (n) = \frac{1}{L} \sum_{ℓ = 1}^{L} I (Success | y^{(ℓ)}),

where ‘Success’ denotes that treatment is found superior to control based on the posterior distribution in the analysis of the CRT and $I$ is an indicator variable which takes the value 1 if this is true. To obtain this, we sample ${(Ψ, \mathbf{p})}^{(ℓ)} = {(λ, δ, σ_{w}, σ_{b}, \mathbf{p})}^{(ℓ)}$ in the continuous case or ${(Ψ, \mathbf{p})}^{(ℓ)} = {(λ, δ, σ_{b}, \mathbf{p})}^{(ℓ)}$ in the binary case from the design prior distribution, for $ℓ = 1, \dots, L$, and then, based on these values, we sample the cluster sizes $\mathbf{n}^{(ℓ)}$ from the multinomial distribution and $y^{(ℓ)}$ from the likelihood function. Using this synthetic trial data, we evaluate the posterior distribution under the analysis prior distribution.

In the case of MCMC, we obtain samples of $δ^{(ℓ)} \mid y^{(ℓ)}$ and assess empirically whether a required quantile is above zero (or perhaps above the minimal clinically important difference, MCID) to evaluate the indicator function. This results in a two-loop sampling scheme. We denote the samples of $δ$ in this inner loop using subscript $k$, that is, $δ^{(k ℓ)}$ for $k = 1, \dots, K$. In the case of INLA, we can obtain the approximation of the required quantile of $δ$ directly, with no additional inner loop sampling. For a chosen sample size, these approximations will provide the assurance based on a total number of samples of $L \times K$ or $L$, respectively, excluding the burn-in iterations in the MCMC and the approximation calculations in INLA.
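The outer Monte Carlo loop can be sketched as follows for a continuous outcome. This is a minimal illustration under strong simplifying assumptions, not the analysis used in this article: the MCMC or INLA fit is replaced by a cluster-level normal approximation to the posterior of $δ$ (as would arise under a flat prior), cluster sizes are equal, clusters alternate between arms, and the design prior in `example_design_prior` is entirely hypothetical.

```python
import random
from statistics import NormalDist, mean, variance

def simulate_cluster_means(n_clusters, cluster_size, lam, delta, sigma_b, sigma_w, rng):
    """Simulate a continuous-outcome CRT and return (control, treatment) cluster means."""
    arms = ([], [])
    for j in range(n_clusters):
        x_j = j % 2                       # assumption: alternate allocation
        c_j = rng.gauss(0.0, sigma_b)     # random cluster effect
        # The mean of cluster_size N(eta, sigma_w^2) outcomes is N(eta, sigma_w^2 / cluster_size).
        y_bar = rng.gauss(lam + x_j * delta + c_j, sigma_w / cluster_size ** 0.5)
        arms[x_j].append(y_bar)
    return arms

def trial_success(control, treatment, q=0.025):
    """Stand-in analysis: success if the q-quantile of the approximate posterior of delta is > 0."""
    d_hat = mean(treatment) - mean(control)
    se = (variance(treatment) / len(treatment) + variance(control) / len(control)) ** 0.5
    return d_hat + NormalDist().inv_cdf(q) * se > 0.0

def assurance(n_clusters, cluster_size, draw_design_prior, L=2000, seed=0):
    """Average the success indicator over L synthetic trials drawn from the design prior."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(L):
        lam, delta, sigma_b, sigma_w = draw_design_prior(rng)
        control, treatment = simulate_cluster_means(
            n_clusters, cluster_size, lam, delta, sigma_b, sigma_w, rng)
        successes += trial_success(control, treatment)
    return successes / L

def example_design_prior(rng):
    """Hypothetical design prior, chosen only to make the sketch runnable."""
    return (rng.gauss(0.0, 1.0),              # lambda
            rng.gauss(0.5, 0.25),             # delta
            rng.gammavariate(4.0, 0.1),       # sigma_b (shape 4, scale 0.1)
            rng.gammavariate(16.0, 0.0625))   # sigma_w (shape 16, scale 0.0625)

estimate = assurance(n_clusters=12, cluster_size=50, draw_design_prior=example_design_prior)
```

Replacing `trial_success` with a full posterior computation recovers the two schemes described above: an MCMC fit adds an inner loop of $K$ samples per synthetic trial, whereas an INLA fit returns the quantile directly.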

Results

Comparison of INLA versus MCMC

For both the continuous and binary outcomes, we simulate a CRT with two different numbers of clusters ($C = 8$ and $C = 12$) and choose the following true values for the parameters: $λ = 1$ for the intercept and $δ = 2$ for the treatment effect. The value of the intercept is arbitrary, and different intercept values do not affect the reported results. The precisions are $τ_{b} = \frac{1}{σ_{b}^{2}} = \{5, 10\}$ and $τ_{w} = \frac{1}{σ_{w}^{2}} = \{0.25, 0.01\}$ for the continuous outcome, which gives two different ICC values of $ρ = \{0.05, 0.01\}$, representing moderate and relatively strong intra-cluster correlations in a CRT. Therefore, we have four different simulation scenarios for the comparison, considering 8 and 12 clusters with ICC values of 0.01 and 0.05.

To compare MCMC to INLA, we consider a range of numbers of MCMC samples, $K$, from ‘small’ runs to ‘large’ runs, specifically $K = \{100, 1000, 10000\}$. We also used $K = 100{,}000$ for one scenario ($C = 8$ clusters with $ρ = 0.05$) and decided not to include it for the other scenarios as it was very slow and gave almost identical results to $K = 10{,}000$. We vary the sample size in the simulated hypothetical trial, $N = \{100, 500, 1000, 2000, 10000\}$, and record the time in seconds to obtain the posterior distribution and the accuracy of the inference, evaluated as the difference between the posterior median of the treatment effect and its true value. We repeat the simulation of each hypothetical trial 100 times and report the mean values and standard deviations of these two quantities.

Figure 1(a) and (c) and Figure 2(a) and (c) show the reported mean values of the posterior median minus $δ$ , and Figure 1(b) and (d) and Figure 2(b) and (d) show the time to obtain the posterior distribution, for the continuous and binary outcomes, respectively. In each case, we include both the mean and an approximate 95% interval, the mean plus and minus two standard deviations.

Figure 1.

The difference between the posterior median and the true treatment effect and the run time for each method, in the continuous outcome case for scenarios $(C = 8, ρ = 0.05)$ and $(C = 12, ρ = 0.01)$ under each total sample size. In (a), MCMC with $K > 100$ and INLA are both accurate in the scenario with $C = 8$ clusters and an ICC of $ρ = 0.05$. However, with $K = 100$ MCMC samples and sample sizes of $N = 2{,}000$ and $N = 10{,}000$, the result is not accurate, as the chain does not appear to converge with so few MCMC samples. INLA appears to be as accurate as MCMC with $K = 10{,}000$ samples. In (c), for the scenario $C = 12$ and $ρ = 0.01$, MCMC is more accurate overall than in (a). In particular, when $K = 100$ with sample sizes of $N = 2{,}000$ and $N = 10{,}000$, the inference is much improved. In (b) and (d), MCMC is generally faster than INLA when the sample size is small. However, INLA scales better to large sample sizes. In addition, the computation time does not differ much when using different ICC values but increases substantially when increasing the number of clusters.

Figure 2.

The difference between the posterior median and the true treatment effect and the run time for each method in the binary outcome case for scenarios $(C = 8, ρ = 0.05)$ and $(C = 12, ρ = 0.01)$ under each total sample size. In (a), for eight clusters with an ICC of 0.05 and a sample size of $N = 100$, the result is not consistently accurate for any of the inference methods. With binary data, there is not enough information for accurate inference with such a small sample size. However, for larger sample sizes, the estimation of the treatment effect is accurate for INLA and all of the different numbers of MCMC samples. The accuracy of INLA is consistently between the accuracy of MCMC with $K = 10{,}000$ and $K = 100{,}000$ samples. Similarly, in (c), for 12 clusters with an ICC of 0.01, the result is accurate except when using $N = 100$, and the uncertainty decreases when using $N \geq 1000$, as a result of the low ICC value of 0.01. In (b) and (d), MCMC with a large number of samples runs very slowly. INLA is therefore generally the more practical choice in the binary outcome case.

Overall, in the continuous outcome case, we see that both INLA and MCMC are accurate for small trial sample sizes, with MCMC requiring at least 10,000 samples from the posterior distribution for trials with large sample sizes to ensure convergence. MCMC is faster than INLA for small sample sizes, but INLA is much faster than MCMC for large CRTs. This suggests that we should use MCMC with at least 10,000 samples to analyse continuous outcome CRTs with sample sizes of 100–500, and INLA for CRTs with a sample size above 500. In addition, as we increase the number of clusters to $C = 12$ and reduce the ICC to $ρ = 0.01$, the result tends to be more accurate, even when using a small number of MCMC samples. This makes intuitive sense, as both of these changes increase the effective sample size. The results for the remaining two scenarios, $(C = 8, ρ = 0.01)$ and $(C = 12, ρ = 0.05)$, are given in the Supplementary Materials.

For the binary case, INLA is as accurate as MCMC with a large number of posterior samples for all CRT sample sizes and is considerably faster. In the binary case, MCMC is not able to exploit the same conjugacy in the precision priors as the continuous case, explaining this disparity. The result is that INLA is a suitable approach to use for inference for two-arm cluster RCTs with a binary outcome, irrespective of the sample size of the CRT.

Based on the simulations in each of the four scenarios, we provide the following conclusions for the fastest approaches that provide accurate Bayesian inference in two-arm CRTs:

For a continuous outcome, we found MCMC with $10{,}000$ posterior samples to be best when the total sample size is small (generally less than 1000) and INLA to be best when the total sample size is large.

For a binary outcome, we found INLA to be best for all total sample sizes.

Application to the SPEEDY trial

Introduction to SPEEDY

SPEEDY¹⁰ is a two-arm CRT which aims to determine the clinical and cost-effectiveness of a novel specialist prehospital redirection pathway intended to facilitate thrombectomy treatment for acute stroke. The comparator is standard care. The units of cluster randomisation are ambulance stations, which are work bases for the ambulance practitioners who initiate the SPEEDY pathway or standard care. A broad study population is being enrolled, but the power calculation was based on a subset titled ‘primary analysis population’. This primary analysis population comprises the group of participants who are eligible for both pathway deployment and subsequent thrombectomy treatment. The wider population allows for other impacts of the pathway to be evaluated. The study has co-primary outcomes of thrombectomy rate and time from stroke symptom onset to thrombectomy. The sample size is 894 participants for thrombectomy rate and 564 participants for time to thrombectomy.

The sample size for time to thrombectomy is based on 90% power; a one-sided significance level of $α = 0.05$; $δ = 30$ minutes as a reasonable smallest clinically meaningful difference in the time to thrombectomy between the arms; 150 clusters allocated 1:1 to the two arms; an ICC of $ρ = 0.01$, based on previous studies;^17,18 and a standard deviation of the time to thrombectomy of $σ = 120$ minutes. In terms of the power calculation detailed above, the value used for the coefficient of variation in cluster size was $ν = 0$, as cluster size variability was not considered in the sample size calculation. The required average cluster size can then be found from the power calculation and multiplied by the total number of clusters to give the required sample size.
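As a check on these inputs, equation (1) can be evaluated directly. The following Python sketch is an illustration rather than the trial's own calculation: the function names are assumptions, and the search simply increases the total sample size by one participant at a time, allowing non-integer average cluster sizes.

```python
from statistics import NormalDist

def crt_power(n, n_clusters, delta, sigma, rho, nu=0.0, alpha=0.05):
    """One-sided power from equation (1) for total sample size n."""
    std_normal = NormalDist()
    n_bar = n / n_clusters                                     # average cluster size
    inflation = 1.0 + ((nu ** 2 + 1.0) * n_bar - 1.0) * rho    # bracketed term in equation (1)
    signal = delta * (n_clusters * n_bar / (4.0 * sigma ** 2 * inflation)) ** 0.5
    return std_normal.cdf(signal - std_normal.inv_cdf(1.0 - alpha))

def smallest_sample_size(target_power, **design):
    """Smallest total n reaching target_power, searching upwards from one participant per cluster."""
    n = design["n_clusters"]
    while crt_power(n, **design) < target_power:
        n += 1
    return n

# Inputs stated for the time to thrombectomy outcome.
speedy = dict(n_clusters=150, delta=30.0, sigma=120.0, rho=0.01, nu=0.0, alpha=0.05)
n_required = smallest_sample_size(0.90, **speedy)
```

Under these assumptions the search returns 564, which agrees with the sample size reported for time to thrombectomy and corresponds to an average cluster size of about 3.76.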

Similarly, for the sample size calculation for the thrombectomy rate, the same values of power, significance level $α$ , the number of clusters, cluster allocation and ICC were used with, additionally, assumed rates of $p_{1} = 0.132$ and $p_{2} = 0.216$ . Based on the power calculation, we find the required sample size. We will use the SPEEDY trial¹⁰ as a case study to demonstrate the Bayesian CRT design.

Elicitation for the SPEEDY Trial

In line with standard frequentist sample size calculations, the SPEEDY trial did not account for uncertainty in the model parameters. We wish to incorporate such uncertainty by using the assurance in place of power. This requires informative design prior distributions for each model parameter. We used expert elicitation to determine suitable prior distributions for the SPEEDY trial parameters, relating elicited values on observable quantities to the design prior distributions of interest. We will use these design prior distributions in our assurance calculation in the next section.

To perform the elicitation, we first prepared an evidence dossier for the quantities of interest. We held an elicitation workshop with two experts who are the co-leads of the SPEEDY trial. In this elicitation workshop, we used the quartile method to perform individual elicitations of the quantities of interest. However, we did not elicit the cluster size variability $ν$ in the session, as the experts felt that this would be better specified based on existing data. Instead we specified this prior based on the number of staff at each of the ambulance stations in SPEEDY, assuming that this would be proportional to the number of patients they would recruit in the trial. The elicitation approach we used was a variation on the Sheffield Elicitation Framework, detailed in Gosling¹⁹ and Hagan et al.²⁰ Full details of the elicitation and the documentation used are provided in the Supplementary Material.
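To illustrate how quartile judgements can be turned into a prior distribution, the sketch below fits a normal distribution by matching the elicited median and quartiles. This is one simple fitting choice assumed here for illustration, not necessarily the procedure used in the workshop, and the quartile values in the example are hypothetical rather than the experts' judgements.

```python
from statistics import NormalDist

def normal_from_quartiles(lower_quartile, median, upper_quartile):
    """Fit a normal prior whose median and interquartile range match the elicited values.

    Assumes roughly symmetric judgements; the fitted quartiles match the elicited
    ones exactly only when the median is midway between the two quartiles.
    """
    z75 = NormalDist().inv_cdf(0.75)                    # about 0.6745
    sd = (upper_quartile - lower_quartile) / (2.0 * z75)
    return NormalDist(mu=median, sigma=sd)

# Hypothetical judgements: median 100, quartiles 80 and 120.
prior = normal_from_quartiles(80.0, 100.0, 120.0)
```

For skewed or bounded quantities, such as a standard deviation or an ICC, a gamma or beta distribution would be fitted to the elicited quartiles instead, typically by numerical optimisation.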

Assurance for the time to thrombectomy

We reproduce the sample size calculation for time to thrombectomy using the assurance, as detailed in the Assurance section. Based on the general advice from the results of the comparison simulation study, we use MCMC for inference with $K = 10{,}000$ samples. To do so, we need to define the design prior distribution on the model parameters based on the elicitation results. We have three different sets of elicited design prior distributions using the information from expert 1, expert 2 and an equally weighted average of both experts’ distributions. The priors resulting from the elicitation for experts 1 and 2, and the average, for $λ$, $δ$, $ν$, $σ$ and $ρ$, for time to thrombectomy, are given in Table 1. The marginal prior distributions for $λ$, $δ$, $ρ$ and $σ$ are also provided in Figure 3.

Table 1.

The elicited prior distributions for each expert, and the average of both, for time to thrombectomy and thrombectomy rate.

	Expert 1	Expert 2	Average
$ν$	$Γ (0.48, 0.16)$	$Γ (0.48, 0.16)$	$Γ (0.48, 0.16)$
$ρ$	$B (0.10, 2.1)$	$B (0.08, 2.1)$	$B (0.09, 2.1)$
Time to thrombectomy
$λ$	$N (300, {133.4}^{2})$	$N (390, {222.4}^{2})$	$N (345, {177.9}^{2})$
$δ$	$N (120, {66.7}^{2})$	$N (60, {22.3}^{2})$	$N (90, {44.5}^{2})$
$σ$	$Γ (7.99, 0.06)$	$Γ (11.73, 0.08)$	$Γ (9.68, 0.07)$
Thrombectomy rate
$λ$	$N (- 1.22, {0.45}^{2})$	$N (- 1.99, {0.35}^{2})$	$N (- 1.64, {0.42}^{2})$
$δ$	$N (0.54, {0.36}^{2})$	$N (0.6, {0.19}^{2})$	$N (0.58, {0.28}^{2})$
$σ_{b}$	$Γ (0.14, 0.63)$	$Γ (0.11, 0.39)$	$Γ (0.12, 0.47)$

$Γ$ represents the gamma distribution and $B$ represents the beta distribution. The prior distributions for $ν$ and $ρ$ are used in both cases.

Figure 3.

The elicited prior probability density functions for the parameters ( $λ, δ, σ, ρ$ ), for time to thrombectomy. Expert 1 is given in red, expert 2 in blue and the average in green. The vertical dashed lines are the values used in the original power calculation.

The estimated sample size using the design prior distributions for expert 1, expert 2 and the average was 150 in each case, based on a minimum assurance of 90%. This is because the experts were very optimistic about the improvement in time to thrombectomy in the treatment arm, represented by $δ$, with almost all of the prior mass in each case being above zero in Figure 3. We note that this sample size is much smaller than the 564 obtained from the original power calculation, which assumed an MCID of 30 minutes. To investigate the relationship between the assurance and the cluster (and hence sample) size, we instead use 50 clusters. The results are given in Figure 4(a).

Figure 4.

(a) The assurance with different average cluster sizes for the time to thrombectomy. Expert 1 is more optimistic about the result than expert 2 since the assurance for any chosen average cluster size is larger. (b) The assurance with different average cluster sizes in the case of the thrombectomy rate. Expert 1 is more pessimistic about the result than expert 2 in this case, and we can see expert 2 and the average do reach an assurance of 0.9 in the plot, while for expert 1, the assurance with average cluster sizes of 25 is around 0.87.

We see that in each case the assurance, like power, is an increasing function of cluster size. Expert 1 is most optimistic about the treatment, with expert 2 less optimistic and the average lying somewhere between the two. As the cluster size gets very large, each assurance curve will tend to the probability, under that expert’s design prior distribution, that the treatment effect is positive.
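The limiting probability described here can be computed directly from the normal design priors for $δ$ listed in Table 1 for time to thrombectomy. The short Python sketch below does this; the variable names are illustrative, and the calculation assumes that success is defined relative to a treatment effect of zero.

```python
from statistics import NormalDist

# Design priors for delta (time to thrombectomy), taken from Table 1.
design_priors = {
    "expert 1": NormalDist(mu=120.0, sigma=66.7),
    "expert 2": NormalDist(mu=60.0, sigma=22.3),
    "average": NormalDist(mu=90.0, sigma=44.5),
}

# Prior probability that the treatment effect is positive, P(delta > 0).
limits = {name: 1.0 - prior.cdf(0.0) for name, prior in design_priors.items()}
```

All three probabilities exceed 0.95, which is consistent with almost all of the prior mass lying above zero and explains why an assurance of 90% is attainable for this outcome.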

Assurance for the thrombectomy rate

The elicited design prior distributions associated with the thrombectomy rate from expert 1, expert 2 and the average are given in Table 1 and plotted in Figure 5. We use these to calculate the assurance, and hence sample size, with a target assurance value of 90%. In this case, we used INLA for our assurance and sample size calculations, based on the general conclusions of the simulation comparison.
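Choosing a sample size from a target assurance amounts to finding the smallest design whose assurance reaches the target. The sketch below shows one way this search could be organised, assuming the assurance is non-decreasing in the sample size. The names `smallest_size_for_target` and `toy_assurance` are hypothetical, and the toy curve stands in for a real assurance calculation (for example, one based on INLA); it is not taken from this article.

```python
import math
from typing import Callable, Optional


def smallest_size_for_target(assurance_fn: Callable[[int], float],
                             target: float = 0.9,
                             lower: int = 1,
                             upper: int = 10_000) -> Optional[int]:
    """Smallest n in [lower, upper] with assurance_fn(n) >= target, or None.

    Assumes assurance_fn is non-decreasing in n (bisection search).
    """
    if assurance_fn(upper) < target:
        return None  # target is not attainable within the search range
    while lower < upper:
        mid = (lower + upper) // 2
        if assurance_fn(mid) >= target:
            upper = mid       # mid is large enough; look for something smaller
        else:
            lower = mid + 1   # mid is too small
    return lower


def toy_assurance(n: int) -> float:
    """Hypothetical curve that plateaus at 0.95; for illustration only."""
    return 0.95 * (1.0 - math.exp(-n / 1000.0))


if __name__ == "__main__":
    print(smallest_size_for_target(toy_assurance, target=0.90))
    print(smallest_size_for_target(toy_assurance, target=0.97))  # above the plateau
```

Because an assurance curve can level off below 1, the search returns `None` when the target lies above the value reached at the upper limit. With a simulation-based assurance, Monte Carlo noise can break exact monotonicity, so a fixed seed or a large number of simulations would be advisable before relying on bisection.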

Figure 5.

The elicited prior probability density functions of the parameters ( $λ, δ, σ_{b}$ ) for the thrombectomy rate. Expert 1 is given in red, expert 2 in blue and the average in green. The vertical dashed lines are the values used in the original power calculation.

The estimated sample sizes using the design prior distributions from expert 1, expert 2 and the average are 4800, 1650 and 3150, respectively. In this case, the experts were relatively pessimistic about the likely values of the treatment effect for the thrombectomy rate compared with the value assumed in the power calculation, which gave a sample size estimate of 894. However, the primary analysis population required to ensure an adequate sample size for the time to thrombectomy outcome means that between 2600 and 4300 patients will need to be recruited to the trial. In practice, therefore, the designs for expert 2 and the average will likely achieve 90% assurance, and the design for expert 1 will achieve relatively high assurance. Assurance also has a different interpretation to power, so requiring the same numerical value of assurance as of power is not an equivalent exercise.
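The difference in interpretation can be made explicit with the standard definitions. As a sketch, in notation introduced here for illustration only, let $\theta$ collect the treatment effect and nuisance parameters, let $\pi(\theta)$ be the design prior distribution, and let $S_n$ be the event that a trial of sample size $n$ meets its success criterion:

$$
\text{Power}(n; \theta^{*}) = \Pr(S_n \mid \theta = \theta^{*}), \qquad
\text{Assurance}(n) = \int \Pr(S_n \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta .
$$

Power conditions on a single assumed value $\theta^{*}$, whereas assurance averages the probability of success over the design prior, so a given numerical target (such as 90%) represents a different requirement in each case.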

Figure 4(b) shows the assurance for different average cluster sizes, based on the 150 clusters in SPEEDY, for both experts and the average.

We see a similar pattern to that in the continuous outcome case, with the assurance increasing as the number of patients in each cluster increases. The main difference is in the ordering of the curves, with expert 2 providing the highest assurance for each cluster size and expert 1 providing the lowest, whereas in Figure 4(a) the ordering was reversed.

Conclusion

In this article, we have considered the problem of choosing the sample size, using a Bayesian approach, for a two-arm superiority cluster RCT with a continuous outcome and with a binary outcome. We have compared inference using MCMC with inference using INLA, based on appropriate mixed-effects models. From the comparison, we found that INLA has advantages in CRTs with Bayesian designs: it was as accurate as MCMC with a large number of MCMC samples ( $K = 10{,}000$ or more) but was typically faster, especially when the trial requires a large sample size in the continuous case and, in general, in the binary case.

We used the SPEEDY trial as a case study of choosing the sample size via an assurance calculation, as SPEEDY has two primary outcomes, one continuous and one binary. The original sample size calculation for SPEEDY did not consider the uncertainty in the model parameters, and so we performed an expert elicitation to specify suitable design prior distributions for the parameters. The elicitation was performed with two experts, and we calculated the assurance and, hence, the sample size for each expert separately and for the average of both experts. For the continuous outcome, the resulting sample sizes were smaller than that from the original power calculation, since both experts were relatively optimistic about the ability of the SPEEDY pathway to reduce the time to thrombectomy by more than the value used in the power calculation. For the binary outcome, the resulting sample sizes were much larger than that from the power calculation, as both experts felt that the value used for power in this case was fairly ambitious. Because the two outcomes need to be powered simultaneously, the actual sample size for the binary outcome realised in SPEEDY will provide high assurance for both experts.

In general, assurance is particularly beneficial when there is substantial uncertainty in the values of nuisance parameters to which the power calculation is sensitive. One such parameter considered in this article is the ICC in a CRT, which can be particularly challenging to estimate accurately a priori. Assurance provides a way to take this uncertainty into account and provides a sample size which is more robust to mis-specification than a power calculation using a single estimated value. Trial statisticians should consider using assurance in place of power whenever they have substantial uncertainty about sensitive parameters in a power calculation.

The calculation of the assurance and sample size for large trials, particularly with binary outcomes, would be almost prohibitively computationally expensive and time-consuming given current widely available computing power without the use of INLA, as MCMC takes a very long time in these cases. This assurance approach, together with INLA (or MCMC) for inference, could be extended to more complex CRT designs, including survival outcomes, longitudinal designs, multi-arm trials and adaptive designs. This is left for future work.

Supplemental Material

sj-pdf-1-ctj-10.1177_17407745261421842 – Supplemental material for Bayesian design and analysis of two-arm cluster randomised trials using assurance: Extension to binary outcomes and comparison of Markov chain Monte Carlo and Integrated Nested Laplace Approximations

Supplemental material, sj-pdf-1-ctj-10.1177_17407745261421842 for Bayesian design and analysis of two-arm cluster randomised trials using assurance: Extension to binary outcomes and comparison of Markov chain Monte Carlo and Integrated Nested Laplace Approximations by Abdullah Aloufi, Kevin J Wilson, Nina Wilson, Lisa Shaw and Christopher Price in Clinical Trials

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.
