Randomized controlled clinical trials provide the gold standard for evidence generation in relation to the efficacy of a new treatment in clinical research. Relevant information from previous studies may be desirable to incorporate into the design and analysis of a new trial, with the Bayesian paradigm providing a coherent framework for formally incorporating prior knowledge. Many established methods involve the use of a discounting factor, sometimes related to a measure of ‘similarity’ between the historical and new trials. However, the sample size is often highly nonlinear in those discounting factors. This hinders communication with subject-matter experts when eliciting sensible values for borrowing strength at the trial design stage. Focussing on a method that can incorporate historical data from multiple sources, we highlight a particular issue of nonmonotonicity and explain why it undermines the interpretability of discounting factors (hereafter referred to as ‘weights’). We propose a solution from which an analytical sample size formula is derived. We then propose a linearization technique such that the sample size changes uniformly over the weights. This leads to interpretable weights (as a percentage of information to borrow/discount), which could facilitate easier elicitation of expert opinion on their values.
In clinical drug development randomized controlled trials (RCTs) are regarded as the gold standard for evaluating the efficacy of new treatments or interventions. Randomization of trial participants to the new treatment or a control group aims to reduce bias and provide a rigorous tool to examine whether a causal relationship exists between an intervention and outcome.1 Sample size calculations are an essential part of clinical trial design, with a sample needing to be at least large enough to meet the study objectives but also small enough to minimize (for example) ethical or cost concerns.2 In the frequentist paradigm, the number of participants recruited onto a study is often chosen to control the type I error rate (the rate of incorrectly declaring a treatment efficacious) and power (the rate of correctly declaring a treatment efficacious) to pre-specified levels, based on assumptions about the sampling distribution of the data and the size of the treatment effect considered clinically meaningful.
Designing a trial with a large enough sample size to achieve the frequentist power can sometimes be infeasible, especially when there are limited numbers of participants available. This might be the case, for example, in rare disease trials or trials in pediatric populations. Pre-trial information, from historical studies conducted under similar circumstances, or elicited directly from expert opinion, could be useful to overcome this challenge, with the Bayesian paradigm offering a powerful tool to formalize this approach. In the Bayesian framework, a prior distribution is formed for a parameter of interest, which is then updated by the observed data to form a posterior distribution from which inferences can be made. Instead of designing a trial around frequentist type I error rates and power, Bayesian designs rely on alternative metrics for success; for instance, specification of posterior decision thresholds (the level of confidence we desire to have that a treatment is efficacious or futile), or the width or coverage probabilities of Bayesian credible intervals. The application of Bayesian methodology for trial design to the specific areas noted above has been considered in the literature, for example, by Hampson et al.3 for trials in very rare diseases, and Wadsworth et al.4 for pediatric studies.
Neuenschwander et al.5 classify Bayesian methods for clinical trial design incorporating historical data according to the approach of constructing a prior distribution for a parameter of interest as follows:
‘Irrelevance’, where a prior is formed without reference to previous studies.
‘Similar’, also termed ‘exchangeable’, where a prior is formed by assuming that the parameter of interest in the new trial has been generated from the same underlying distribution as the parameter(s) in the historical trial(s). The meta-analytic predictive (MAP) prior proposed in Neuenschwander et al.5 is based on this assumption, with the authors noting the importance of careful selection of relevant historical data to render the exchangeability assumption plausible. A robust extension6 aims to effectively discount historical data in the case of prior/data conflict by using a weighted mixture distribution consisting of the MAP prior and a weakly informative component.
‘Equal but discounted’, which assumes parameters are the same, but discounts the precision of the parameter in the historical trial(s). The ‘power prior’ suggested by Ibrahim and Chen7 takes this approach, whereby historical evidence is downweighted by raising its likelihood to a power between 0 and 1.
‘Biased’, which assumes historical parameters are potentially biased versions of the parameter in the new trial. The ‘commensurate prior’8,9 comes under this category, where historical information is downweighted by a commensurability parameter to form a predictive prior for the new study. The commensurability parameter directly parameterizes the similarity between each historical source and new data.
‘Equal’, equivalent to pooling historical data with the new study data.
The importance of carefully selecting historical trials to be included for planning a new trial is well understood. If the assumption of similarity is not satisfied, this can result in increased mean square error (MSE) of point estimates due to bias and either reduced power or increased type I error rate depending on the direction of the bias.10 Conversely, incorporation of quality historical information allows for reduced MSE and increased power (or reduced type I error rate) within the new trial. A seminal paper by Pocock11 provided a set of criteria for assessing the comparability between historical and current trials. Expert elicitation can play an important role in assessing comparability and helping to choose model parameters but the elicitation process is not trivial.12 Johnson et al.13 review different methods to elicit beliefs for Bayesian priors.
This paper focusses on the design of a new two-arm RCT incorporating historical data from similar RCTs. We follow the line of research on sample size determination based on ‘commensurate priors’ in Zheng et al.14, in which the use of discrepancy weights quantifying the probability of (ir)relevance of information from multiple historical sources (with respect to the new trial) was recommended. The methodology in Zheng et al.14 was later extended to basket trials in Zheng et al.15 In the setting of borrowing from historical data, specification of study-specific discrepancy weights at the design stage provides an explicit opportunity to make judgments concerning the relevance and rigour of past studies with respect to the new study.5 Furthermore, the elicitation of study-specific discrepancy weights may be more intuitive than eliciting model parameters of a distribution.
It would be desirable that the discrepancy weights recommended in Zheng et al.14 act uniformly with respect to the amount of information that would subsequently be incorporated from a particular source. For example, a historical study-specific weight corresponding to a given percentage of discrepancy should result in exactly that percentage of the information from that source being discounted in the new trial design. In Section 2 we demonstrate that this is not the case, and that the weights in fact exhibit undesirable, highly nonlinear behaviour. Of primary concern is nonmonotonicity, caused by the method used to aggregate information from multiple sources into a single prior, which hinders interpretability and makes elicitation of such weights difficult. Nonlinearity more generally is also an issue, whereby small values of the weights result in faster changes in the amount of information incorporated into the prior than their complement. We propose a solution in two parts. Firstly, in Section 3, an alternative method of prior aggregation is proposed, for which the nonlinearity then has a simpler pattern, and from which a Bayesian sample size formula is derived. Secondly, a linearization technique is provided such that the weights provide uniform shrinkage with respect to the sample size. The aim is to make interpretation simpler and thereby facilitate easier elicitation of such values. Section 4 provides a motivating example in which a sample size is sought for a hypothetical new RCT using historical data from several real-life clinical trials. Section 5 presents a brief simulation study confirming that pre-specified statistical properties are preserved across a range of scenarios with sample sizes determined according to our method. We finish with a discussion highlighting areas for future research in Section 6.
Problem formulation
Consider planning a two-arm randomized controlled superiority trial (referred to as ‘new trial’ in the following) to evaluate an investigational treatment or intervention. Let be the measured post-randomization outcomes in the new trial for patient in treatment group . Explicitly, refers to the experimental treatment group and refers to the control group. We assume outcomes are normally distributed with common variance in the outcome measures such that . The groupwise sample means therefore follow a normal distribution, . Considering the distribution of the difference in group means leads to
where the parameter is the primary inferential target. are the total number of trial participants randomized (to treatment or control) at the initiation of the trial and is the proportion randomly assigned to the experimental treatment arm.
In the Bayesian framework with no borrowing from historical data (for assumed known ), a prior for is specified,
where and are user-defined hyper-parameters (which might be chosen for example in the case of no prior information such that the prior is only weakly informative relative to the likelihood). The prior is then updated by the trial data to give a posterior distribution,
The posterior mean is given by
and the posterior variance is
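As a concrete illustration, the no-borrowing update above is the standard conjugate normal calculation: the posterior precision is the sum of the prior and likelihood precisions, and the posterior mean is the precision-weighted average of the prior mean and the observed effect estimate. A minimal sketch (all numerical values are hypothetical):

```python
def posterior_normal(mu0, v0, theta_hat, var_hat):
    """Conjugate normal update with known sampling variance:
    prior N(mu0, v0) combined with an estimate theta_hat ~ N(theta, var_hat).
    Posterior precision = sum of precisions; posterior mean = precision-weighted average."""
    post_prec = 1.0 / v0 + 1.0 / var_hat
    post_var = 1.0 / post_prec
    post_mean = post_var * (mu0 / v0 + theta_hat / var_hat)
    return post_mean, post_var

# Weakly informative prior; difference in group means from a 1:1 trial of n = 100
sigma2, n, r = 4.0, 100, 0.5
var_hat = sigma2 / (n * r * (1 - r))   # variance of the difference in group means
mean, var = posterior_normal(mu0=0.0, v0=100.0, theta_hat=1.2, var_hat=var_hat)
```

With a weakly informative prior (large prior variance), the posterior mean stays close to the observed estimate and the posterior variance is only slightly smaller than the sampling variance.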
Formulating priors from multiple historical sources
Suppose instead that there are sources of historical data, , that are relevant to incorporate in the planning of the new trial. are the parameter counterparts of in the historical trials and it is assumed they have been summarized by posterior distributions, . Defining as the prediction for in the new trial based on the information from trial alone, a set of commensurate predictive prior distributions for are formed centred on each ,
We let , where parameterizes the ‘commensurability’15 between and in terms of precision (further details are given in the following section).
Estimating
To quantify the relevance of each historical data source in respect of the new experiment, Zheng et al.14 introduce discrepancy parameters, . The ‘discrepancy’ of interest, for a continuous parameter like treatment effect, is the mismatch in either the location or scale parameters, or both. That is, are prior weights intended to represent preliminary skepticism about how similar (and/or ) and (and/or ) are. Weights are incorporated into a Gamma mixture prior for the precision parameter, :
with . This mixture prior is favoured for robust inferences as it offers flexible downweighting or borrowing from source depending on the value of . Briefly, the values of are chosen such that the first Gamma mixture component has its mass on small values, therefore when , data from source is increasingly discounted. At the extreme, setting indicates complete irrelevance of information from source to the new trial. On the other hand, values of are chosen such that the second Gamma mixture component has its mass on large values. In this case, setting results in a greater degree of incorporation of information from source . Setting indicates exchangeability between and , that is . It is anticipated in a real application that, at the design stage of a new trial, are chosen in collaboration with a subject-matter expert(s) to reflect the anticipated degree of (ir)relevance between each historical trial and the new experiment. As detailed in Zheng et al.,14 the Gamma mixture prior in (4) can be approximated by matching the first two moments of a unimodal t mixture distribution. This leads to an approximation of the between-trial variance (i.e. between source and the new experiment),
We note that if we were being fully Bayesian we would keep the prior for in its distributional form; however, in this paper we aim to propose an asymptotically approximate sample size formula and so we make a simplifying assumption. The variance between each source and the new trial is therefore estimated as
The above equation for highlights the importance of choosing values of , , , according to the minimum and maximum amount of information to borrow from external sources. We generally suggest choosing values of , so that the discounting term, , is large enough to effectively discount all information from a particular source when (i.e. so that when ). Similarly, , should be chosen so that the borrowing term, , is sufficiently small (i.e. close to zero) to enable all information to be incorporated from a particular source when (i.e. so that when ). Alternatively, users may be interested in exploring various values for to adjust the minimum/maximum sample size saving that is available. We encourage the end-user to adapt our openly available code to obtain values suited to their context.
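To make the role of the two terms concrete, the following sketch treats the between-trial variance as a weight-w mixture of a ‘discounting’ term (large, dominant as w approaches 1) and a ‘borrowing’ term (near zero, dominant as w approaches 0). The linear-in-w form and the numerical values are illustrative stand-ins for the moment-matched quantities derived from the Gamma mixture prior:

```python
def between_trial_var(w, d_discount, d_borrow):
    """Illustrative between-trial variance as a weight-w mixture of a
    'discounting' term (large, active as w -> 1) and a 'borrowing' term
    (near zero, active as w -> 0). In the paper these terms come from
    moment-matching the Gamma mixture prior; here they are plain inputs."""
    return w * d_discount + (1.0 - w) * d_borrow

# w = 1 should effectively discard the source; w = 0 should borrow fully
tau2_full_discount = between_trial_var(1.0, d_discount=100.0, d_borrow=1e-4)
tau2_full_borrow = between_trial_var(0.0, d_discount=100.0, d_borrow=1e-4)
```

Choosing `d_discount` large and `d_borrow` close to zero reproduces the guidance above: full discounting yields a between-trial variance large enough to swamp the source's information, while full borrowing leaves it essentially unchanged.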
Aggregating multiple distributions to form a collective prior
In Zheng et al.,14 an informative collective prior (hereafter referred to as ‘CP’) is formed by aggregating the predictive distributions in (3) into a single prior such that using the convolution operator for the sum of normal random variables,16 where
are synthesis weights, set to a decreasing function of , such that they are all between 0 and 1 and sum to 1. In Zheng et al.,14
where is a pre-defined concentration parameter which governs how much influence have on . Further details on the function in (6) and how to choose are provided in Zheng and Wason17 and Zheng et al.14 The CP is updated by the trial data to give the posterior,
In the same way as Equations (1) and (2), the posterior mean and variance are given by
and
Varying to alter the amount of information from source
For fixed , , the CP precision is a function of and is a measure of the amount of prior information on the treatment effect in the new trial (which varies depending on the values of ),
In Figure 1, we visualize how varies according to in an example when . For illustrative purposes, values of all other parameters are held fixed (). It can be seen that for , corresponding to full incorporation of information from both historical sources, the CP precision is maximized (as desired). Similarly, for , corresponding to full discounting of information from both sources, the CP precision is minimized (as desired).
Collective prior (CP) precision, (equation 7), with respect to varying discrepancy weights, and , for borrowing from two historical sources, . Undesirable nonmonotonicity can clearly be seen around and .
We nonetheless also see the (undesirable) highly nonlinear nature of with respect to varying in two respects. Firstly, it is clear that the majority of the change in prior precision occurs rapidly across , rather than evenly as we would like; beyond around , there is almost no discernible change in . Assuming that are expert elicited probabilities, this could result in a large loss of information because specifying any will result in almost full discounting of data from source . This rapid nonlinear change is due to the functional form of the precision (specifically the general rectangular hyperbolic shape that results from taking the reciprocal of the variance), therefore occurs for any .
Secondly, and more importantly, when , local minima/maxima can be seen around and . This nonmonotonic behaviour in equation (7) occurs whenever , due to the method of prior aggregation as well as the higher order terms in introduced by the synthesis weighting function, equation (6). This is in contrast to how we would fundamentally wish the discrepancy weights to behave; it should be the case that increasing always leads to decreasing .
To be clear, this is a general problem, and not only for a specific set of parameters; that is, nonlinearity (hyperbolic change and nonmonotonicity) of the prior precision occurs in varying degrees for any value of and regardless of the values at which the other parameters are fixed. These two issues mean that are not interpretable as probabilities and hinder communication with subject-matter experts to elicit sensible values at the trial design stage.
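The nonmonotonicity can be reproduced numerically. The sketch below uses the convolution-based aggregation (CP variance equal to the sum of squared synthesis weights times source variances); the synthesis-weight form (1 − w)^c is an illustrative stand-in for equation (6), and the source variances are held fixed for clarity:

```python
def old_cp_precision(w, v, c=2.0):
    """Collective prior precision under convolution-based aggregation:
    theta_CP = sum_s p_s * theta_s gives Var = sum_s p_s^2 * v_s.
    The synthesis weights p_s decrease in w_s; (1 - w_s)^c is an
    illustrative stand-in for the weighting function in equation (6)."""
    raw = [(1.0 - ws) ** c for ws in w]
    total = sum(raw) or 1.0          # guard against all-ones weights
    p = [x / total for x in raw]
    return 1.0 / sum(ps ** 2 * vs for ps, vs in zip(p, v))

# Sweep w_1 with w_2 fixed at 0.5: increasing w_1 shifts synthesis weight
# onto source 2, so the precision first rises before falling -- nonmonotone.
prec = [old_cp_precision([w1, 0.5], [0.2, 0.2]) for w1 in (0.0, 0.3, 0.6, 0.9)]
```

Under these (hypothetical) inputs the precision increases as w_1 moves from 0 to 0.6 and only then decreases, mirroring the local minima/maxima visible in Figure 1.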
An alternative method of prior aggregation (and therefore a new way of formulating the CP precision) is necessary so that the nonlinearity has a simpler form. Specifically, the CP precision should be monotonically decreasing with respect to increasing . Details of our proposal are given in Section 3.1. Following derivation of a Bayesian sample size formula in Section 3.3, we also seek to recalibrate the weights. This is achieved in Section 3.4 via a functional transformation of each , where , such that the prior precision (and therefore the derived sample size function) varies linearly with respect to .
Methods
Proposed method of prior aggregation
Following the set of predictive priors in (3), we propose an alternative method of prior aggregation suggested in Winkler.18 This results in a new CP, , where
As in Section 2.3, the CP mean, , is a weighted linear sum of the means from (3). Synthesis weights now incorporate information on both and , rather than only as in equation (6) (since ). This preserves the desirable property that smaller correspond to larger , and introduces the (also desirable) property that smaller correspond to larger . As required, the synthesis weights sum to 1 and all lie between 0 and 1.
The CP variance, , is the reciprocal of the sum of the precisions, . Again, this preserves the desirable property that a smaller results in source receiving a larger weight in . The formulation of the CP mean and variance in this manner is exactly in line with the theory of Bayesian updating of normal distributions with conjugate priors, with an initial noninformative prior for (as discussed in Winkler18).
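The proposed aggregation is simple to compute; a minimal sketch of the precision-weighted pooling described above, with hypothetical inputs:

```python
def aggregate_cp(means, variances):
    """Precision-weighted aggregation (in the style of Winkler, 1968):
    the collective prior precision is the sum of the source precisions,
    and the CP mean is the precision-weighted average of the source means."""
    precs = [1.0 / v for v in variances]
    total_prec = sum(precs)
    cp_mean = sum(p * m for p, m in zip(precs, means)) / total_prec
    return cp_mean, 1.0 / total_prec

# Two sources: the more precise source (variance 0.25) pulls the CP mean toward it
cp_mean, cp_var = aggregate_cp([0.5, 1.0], [0.25, 0.5])
```

Because each source's variance is increasing in its weight, adding to any weight can only lower that source's precision contribution, so the CP precision is monotonically decreasing in every weight, which is exactly the behaviour sought.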
An advantage of both the proposed prior aggregation method and the method detailed in Section 2.3 is that they allow for analytic sample size calculations. The proposed aggregation method preserves the desirable properties of the previous method of prior aggregation (described above) as well as fitting neatly into our Bayesian framework. However, central to the purpose of this paper, the nonlinearity of in equation (8) with respect to varying now has a simpler pattern when compared with equation (7). Crucially, the proposed CP variance, , no longer relies on the original synthesis weights in equation (6), which caused the undesirable non-monotonic behaviour observed in Figure 1. This means that the precision, , is now strictly monotonically decreasing over . This can be proven by examining the first derivative of with respect to each which is always negative. In contrast to equation (7), this is now how we would wish to behave – that is increasing should always lead to decreased prior precision. This is visualized in Figure 2 using the same parameters as Figure 1 for borrowing from two historical datasets.
Proposed collective prior (CP) precision, (equation (8)), with respect to varying discrepancy weights, and , for borrowing from two historical sources, . The prior precision is now monotonically decreasing with respect to increasing .
Additional motivation to use this method of aggregation in our particular application is that terms in the CP precision relating to each are now linearly independent of each other, that is
This means we can now achieve linearization of with respect to sample size (details in the following sections), which would be impossible to achieve by the previous aggregation method due to the issue of nonmonotonicity. As in Section 2, the CP is updated by the trial data to give the posterior,
where
and
Bayesian decision making framework
We now introduce a Bayesian decision making framework proposed in Whitehead et al.19 For pre-specified posterior decision thresholds and , we seek a sample size to guarantee we have sufficient evidence to conclude either efficacy or futility respectively. These thresholds represent the degree of evidence we would require to be convinced of efficacy or futility of treatment over control. Explicitly, if then we conclude that the treatment is efficacious and if then we conclude that the treatment is futile, where and and is some minimally clinically important treatment effect size.
For a generic posterior distribution , the probability that the treatment effect is greater than zero is
where denotes the standard normal cumulative distribution function. Therefore, we will conclude convincing evidence of treatment benefit when , where satisfies .
Similarly, the posterior probability that the treatment effect is less than (or equal to) is
Therefore, convincing evidence of treatment futility occurs when , where satisfies .
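The two decision rules can be expressed compactly. The sketch below assumes a normal posterior and hypothetical thresholds eta = 0.975 and zeta = 0.9:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def decision(post_mean, post_var, delta, eta=0.975, zeta=0.9):
    """Bayesian decision rule: declare efficacy if P(theta > 0) >= eta,
    futility if P(theta <= delta) >= zeta, where delta is the MCID."""
    sd = sqrt(post_var)
    p_eff = phi(post_mean / sd)            # P(theta > 0 | data)
    p_fut = phi((delta - post_mean) / sd)  # P(theta <= delta | data)
    if p_eff >= eta:
        return "efficacy"
    if p_fut >= zeta:
        return "futility"
    return "inconclusive"
```

For example, a posterior centred at 1.2 with variance 0.16 yields P(theta > 0) = Phi(3) ≈ 0.999 and triggers the efficacy rule, while a diffuse posterior centred near zero satisfies neither threshold and is inconclusive.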
Bayesian sample size formula
Following the same approach detailed in Zheng et al.,15 to reach a decisive conclusion regarding treatment efficacy, we require a large enough sample size such that either or , that is
Simplifying and rearranging, this is equivalent to requiring that
We see that the left hand side of (11) is equal to the posterior precision. Replacing with the variance in (2), we therefore obtain a Bayesian sample size formula in the case of no borrowing,
Note that if we wished to consider a purely frequentist formulation of the problem, then the necessary sample size is simply,
where and are the usual parameters set to control type I and type II error rates, respectively.
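For reference, the frequentist formula can be sketched as follows; the values sigma^2 = 16 and delta = 2 are hypothetical, not those of the application in Section 4:

```python
from math import ceil
from statistics import NormalDist

def n_frequentist(sigma2, delta, alpha=0.025, beta=0.1, r=0.5):
    """Total sample size for a two-arm comparison of normal means with
    known variance: n = (z_{1-alpha} + z_{1-beta})^2 * sigma^2 / (r (1-r) delta^2),
    rounded up to the nearest even integer."""
    z = NormalDist().inv_cdf
    n = (z(1 - alpha) + z(1 - beta)) ** 2 * sigma2 / (r * (1 - r) * delta ** 2)
    return 2 * ceil(n / 2)

n_freq = n_frequentist(sigma2=16.0, delta=2.0, alpha=0.025, beta=0.1)
```

The r(1 − r) term in the denominator shows why equal allocation (r = 0.5) minimizes the total sample size: any unequal split shrinks r(1 − r) and inflates n.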
Replacing in (12) with from (8), we obtain our sample size calculation informed by sources of historical data,
that is
with . We note explicitly the assumptions embedded into this sample size formula, which are common to many normal models. The validity of the sample size calculation depends on these assumptions being satisfied:
Common (and known) variance in outcomes from the new trial.
Independence of observations.
Homoscedasticity and normality of residuals.
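Under these assumptions, the borrowing-informed calculation reduces to solving the posterior-precision inequality for n; a sketch with hypothetical inputs (in practice the CP precision would come from equation (8)):

```python
from math import ceil
from statistics import NormalDist

def n_bayes(sigma2, delta, cp_precision, eta=0.975, zeta=0.9, r=0.5):
    """Smallest (even) n such that the posterior precision reaches the
    decision threshold: n r(1-r)/sigma^2 + CP precision >= ((z_eta + z_zeta)/delta)^2."""
    z = NormalDist().inv_cdf
    target_prec = ((z(eta) + z(zeta)) / delta) ** 2
    n = sigma2 / (r * (1 - r)) * max(target_prec - cp_precision, 0.0)
    return 2 * ceil(n / 2)

# With zero prior precision the result matches the frequentist formula;
# prior precision gained by borrowing reduces n one-for-one on the precision scale.
n_no_borrow = n_bayes(16.0, 2.0, cp_precision=0.0)
n_borrow = n_bayes(16.0, 2.0, cp_precision=1.0)
```

Because the CP precision enters the formula additively, every unit of prior precision borrowed from historical data translates directly into a fixed saving of sigma^2 / (r(1 − r)) participants.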
For non-normal data, a suitably adapted formula based on the approach of constructing a normal test statistic in the generalized linear model framework via a transformation could be applied. In Supplemental Materials A.1, A.2 and A.3, we demonstrate this by deriving sample size formulas for RCTs with binary and time-to-event data, and for single-arm settings with binary outcomes.
Interpretable discrepancy weights
We now detail the linearization steps which result in that are directly interpretable as a degree of discrepancy on the information scale, . The idea is similar to the idea of functional uniform priors proposed in Bornkamp20,21 for nonlinear regression, in which a method for formulating a prior for a parameter of interest is proposed such that it is uniform in the space of functional shapes of the underlying nonlinear function. We start by isolating each nonlinear part of the sample size function in equation (14) with respect to (for fixed ). These are the individual precision terms making up in equation (8), that is
Step 1: Linearly interpolate between the values of (15) at the two endpoints. This essentially ‘draws a line’ between them so that changes in (and therefore the corresponding sample size) are spread evenly across the full range of . This also necessarily ensures that the mapping preserves the property that and .
Step 2: Find the inverse of (15). This allows calculation of any value corresponding to a given :
Step 3: Substitute the linearized values obtained in (16) into (17):
is now the necessary transformation of . Now, if we obtain expert-elicited values of , corresponding to a percentage degree of discrepancy between each historical source and the new trial, we can use equation (14) with the transformed weights to incorporate the corresponding proportion of the information from each historical dataset in the planning of the new trial.
This transformation is possible due to the proposed method of prior aggregation. Unlike the original in equation (7), the proposed is a monotonic function in , and each of the terms (which form ) are linearly independent (i.e. separated by the addition operator). The transformation procedure can easily be extended to any number of sources, , with each functional transformation of being performed independently with no additional complexity.
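The three steps can be implemented generically for any strictly monotone precision term. The bisection-based inverse below is an illustrative stand-in for the analytical inverse in (17), and the precision term f is hypothetical:

```python
def linearize_weight(w, f, tol=1e-10):
    """Map an elicited weight w in [0, 1] to w' such that f(w') lies on the
    straight line between f(0) and f(1) -- i.e. the precision (and hence the
    sample size) changes uniformly in w. f must be continuous and strictly
    decreasing in w, as the proposed CP precision terms are."""
    target = f(0.0) + w * (f(1.0) - f(0.0))  # Step 1: the linear target value
    lo, hi = 0.0, 1.0                        # Step 2: invert f by bisection
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > target:                  # f decreasing: mid is too small
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)                   # Step 3: use w' in place of w

# Illustrative precision term: 1 / (sigma_s^2 + tau_s^2(w)) with
# sigma_s^2 = 0.2 and tau_s^2(w) = 0.001 + 5 w (increasing in w)
f = lambda w: 1.0 / (0.2 + 0.001 + 5.0 * w)
w_prime = linearize_weight(0.5, f)
```

The endpoints are fixed points of the map (w = 0 maps to 0 and w = 1 maps to 1), and an elicited w = 0.5 maps to the w' at which exactly half of the precision change has occurred.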
The effect is visualized in Figures 3 and 4, which compare the sample size function (plotted at the boundary of the inequality, that is the smallest possible sample size fulfilling equation (14)) with respect to before and after the functional transformation of . These examples are for the simplest cases of borrowing from one and two historical datasets.
Sample size, , with respect to varying , for borrowing from a single source of data, both before (left) and after (right) functional transformation of . Note that as required, minimum and maximum sample sizes corresponding to and respectively remain identical in both cases.
Sample size (vertical axis) borrowing from two historical sources, with respect to varying and . The left figure is before the functional transformation of , that is , the right figure is after, that is .
Sample sizes corresponding to and remain fixed after the transformation of as required. This will be the case for borrowing from any number of sources, that is borrowing from sources will have fixed points corresponding to each unique combination of . Between these fixed points, via the proposed transformation, the change in the sample size is evenly distributed across . As before, the sample size is minimized with full incorporation of information from all sources, that is and maximized with full discounting of information from both sources, that is .
Application to the design of a randomized controlled trial in Alzheimer’s disease
In this section, we consider how the proposed method could be applied to determine an appropriate sample size for a hypothetical new trial using real data from several relevant historical RCTs.
Alzheimer’s disease (AD) is a chronic age-related illness characterized by cognitive decline. It is the most common form of dementia, with incidence increasing globally due to increasing life expectancy. There are limited pharmaceutical interventions which are effective in reducing symptoms of cognitive decline, however a systematic review by Du et al.22 highlighted that several previous studies have suggested that exercise may slow the progression of cognitive decline in patients with AD.
Consider planning a new two-arm RCT to investigate whether physical activity can improve cognition in patients with Alzheimer’s disease. The two treatments to be compared in the new trial are denoted (physical activity) and (standard/usual care). The primary outcome is the difference in treatment group means at a single post-randomization followup timepoint in the Mini Mental State Examination (MMSE) score.23 MMSE is a 30-point questionnaire that provides a summary measure of cognitive function where a higher score represents better cognitive performance. It is used extensively in clinical research settings to estimate the severity of impairment, and to document change in impairment over time. Suppose in the new trial that the MMSE of each subject at 4 months post-randomization will be denoted by , and will be treated as normally distributed with mean and common (known) variance , as in Section 3.3. The observed difference in means is assumed to be normally distributed, with positive values indicating an advantage for the physical activity group. Based on a recent study of MMSE scores in those with cognitive impairments, .24
Consider first a frequentist formulation of the sample size calculation. Suppose we wish to detect a minimum clinically important difference (MCID) between treatment groups of point on the MMSE (it was reported by Mishra et al.25 that MCID thresholds for MMSE in AD trials are commonly between 1 and 3 points). For a one-sided type I error rate and power , the total sample size required is minimized by equal allocation to treatment and control groups, that is, . For these parameters, equation (13) yields a total sample size of (rounded up to the nearest even integer). The Bayesian sample size calculation with no borrowing, equation (12), gives the same result setting a large (e.g. ), with and .
For obvious reasons, recruiting large numbers of patients onto AD trials might be challenging, with limitations due to ethical and practical issues. Furthermore, high costs can be a concern with trial participants necessarily needing more intense monitoring compared to cognitively intact individuals.26
Now, suppose that data from 7 historical trials is available with which to form an informative prior for , summarized in Table 1.
Results of seven historical RCTs measuring MMSE outcomes for individuals with AD, adapted from Du et al.22
Note: Treatment effects have been summarized in the form of . RCT: randomized controlled trial; MMSE: mini mental state examination; AD: Alzheimer’s disease.
It is clear from Table 1 that there is substantial heterogeneity between studies, therefore with the help of a clinical expert we suppose we have elicited probabilities which quantify the irrelevance of each historical trial in respect of the new study.
We note that it might be easier to elicit these quantities as a degree of relevance (rather than degree of skepticism); for example, if an expert thinks data source is relevant to the new trial then we set . We also note that the proposed methodology assumes that a single expert is consulted, or that multiple experts can agree on single values for . The process of eliciting and reconciling multiple expert opinions on probabilities is a complex topic outside the scope of this paper; see, for example, Hora34 for an in depth discussion. For illustrative purposes, let us assume that we have elicited a set of probabilities , with . This set would imply a desire to incorporate the most amount of information from source and least from source .
Firstly, using equation (18) to transform results in . By the method in Section 3.1, using , and leads to an informative prior for the treatment effect in the new trial, , where and . Equation (14) (setting ) gives the total sample size (for and ) as (rounded up to the nearest even integer).
Note, that if we just used the ‘raw’ in equation (14), we would be faced with the issue of over-discounting (described in previous sections), resulting in a sample size of . Also note that if we wished to include information to the specified degree (without transforming ), then we would have to have elicited values of , which would have been very difficult to elicit even if the expert(s) had substantial statistical knowledge.
Performance evaluation
We present a brief simulation study whose purpose is to verify that the proposed sample size function and linearization technique achieve the pre-specified statistical properties across a range of scenarios. To be clear, according to the criteria of the Bayesian decision framework in Section 3.2, the sample size should be large enough to guarantee a conclusion of efficacy, such that , or, if not, futility, such that . The goal of the simulation study therefore is not to compare our sample size formula and linearization technique against another method, but rather to test the hypothesis that, with sample sizes determined by the proposed method, trials will reach a definitive conclusion.
Basic settings
Four contrasting configurations, , of hypothetical historical data are investigated, each containing historical information from 5 independent sources, shown in Table 2. We suppose that probabilities have been elicited to implement the proposed approach for borrowing information from each respective source. We fix to be the same across all four configurations to facilitate comparison: . We suppose, for demonstration purposes, that data from historical data source is considered particularly relevant to the new trial, setting (in a real situation, this might be because the historical trial was performed most recently, or because it was undertaken at an earlier stage of the same pharmaceutical development pipeline). Sources 2–5 are considered less relevant, with set accordingly.
Configurations , of hypothetical historical data, where the mean treatment effect parameter from source is assumed to have been independently summarized by .

                                           Historical data source, k:
Config.   Config. description       Parameter     1       2       3       4       5
A         Weak historical info.     Mean          0.10    0.24    0.37    0      -0.05
                                    Variance      1.25    0.73    0.92    1.29    0.66
                                    Weight        0.20    0.40    0.80    0.60    0.70
B         Mixed 1                   Mean          0      -0.05    2.14    0.37    1.10
                                    Variance      1.29    0.66    0.50    0.92    0.75
                                    Weight        0.20    0.40    0.80    0.60    0.70
C         Mixed 2                   Mean          1.10    0.37   -0.05    2.14    0
                                    Variance      0.75    0.92    0.66    0.50    1.29
                                    Weight        0.20    0.40    0.80    0.60    0.70
D         Strong historical info.   Mean          1.10    2.14    1.07    0.60    0.85
                                    Variance      0.75    0.50    0.82    0.89    0.26
                                    Weight        0.20    0.40    0.80    0.60    0.70

Note: each source is accompanied by a weight for the borrowing of information, summarizing pre-experimental information about .
Configuration descriptions classify the nature of the treatment effects observed in the historical trials, with ‘weak historical info.’ meaning low/neutral relative treatment effects observed historically with relatively high variances, and ‘strong historical info.’ indicating more positive historical treatment effects with comparatively smaller variances. Mixed 1 and Mixed 2 use a combination of and from A and D; weights in Mixed 1 favour the neutral trials, while weights in Mixed 2 favour the more positive trials. Based on the example in Section 4, the MCID between treatment arms in the new trial is set to be , and we assume a common (known) variance in outcome measures of . Probability boundaries for decision making are for efficacy and for futility. For each configuration of historical data, a sample size is calculated for the new trial: first using equation (18) to transform , and then equation (14), with and (setting ). Note that, although and have been set to be equivalent to the often-used and of the frequentist paradigm (which control the type I error rate and power, respectively), it must be remembered that these values do not represent the same quantities. As discussed in Whitehead et al.,19 there is no reason to assume any form of equivalence, since their meanings are fundamentally different.
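For intuition on how a closed-form sample size can arise in this framework: under normal conjugacy, requiring that the posterior always supports either the efficacy or the futility conclusion amounts to a lower bound on the posterior precision, which yields a formula of the shape sketched below. The symbol names (outcome variance `sigma2`, MCID `delta`, boundaries `eta` and `zeta`, collective prior variance `prior_var`) and the exact algebraic form are illustrative assumptions in the spirit of the Whitehead et al. framework, not a reproduction of the paper's equation (14).

```python
from math import ceil
from statistics import NormalDist

def bayesian_sample_size(sigma2, delta, eta, zeta, prior_var):
    """Smallest even total n (equal allocation across two arms) such that
    the posterior for the treatment effect is precise enough to guarantee
    a decisive conclusion (efficacy or futility). Assumed shape:
    n >= 4*sigma2*((z_eta + z_zeta)^2 / delta^2 - 1/prior_var)."""
    z = NormalDist().inv_cdf(eta) + NormalDist().inv_cdf(zeta)
    n = 4.0 * sigma2 * ((z / delta) ** 2 - 1.0 / prior_var)
    n = max(n, 0.0)          # a sufficiently informative prior may need no data
    return 2 * ceil(n / 2.0)  # round up to the nearest even integer
```

Note how a more informative collective prior (smaller `prior_var`) directly reduces the required sample size, which is the benefit of borrowing that the weights control.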
For the new trial, we set equal allocation to treatment and control, . Outcomes in the control group are generated for each configuration according to (where indexes configurations ). Outcomes in the treatment group are generated according to . For each simulation replicate, true treatment effects are set to be one of the following:
Treatment efficacy, .
Treatment futility, .
A Bayesian analysis model is applied to each simulation replicate, with the prior set according to the CP from each configuration. Evidence of treatment efficacy is concluded if . If , then, according to our pre-specified criteria, it should be the case that . Results are summarized for and , respectively, by calculating the percentage of trials in which a decisive conclusion is reached, averaging across 10,000 simulated trial replicates. This results in a total of 8 scenarios. The Bayesian analysis model is fitted analytically using equations (9) and (10) in R version 4.2.1 (2022-06-23).
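A single simulation replicate of the kind described above might be sketched as follows, with a conjugate normal update standing in for the paper's analytical fit; the function names, the zero efficacy boundary and the MCID futility boundary are illustrative assumptions rather than the paper's exact specification.

```python
import random
from statistics import NormalDist

def simulate_trial(n, theta, sigma, mu0, tau2, eta, zeta, mcid, rng):
    """One replicate: equal allocation, normal outcomes, conjugate normal
    update for the treatment effect, then the efficacy/futility rule."""
    half = n // 2
    control = [rng.gauss(0.0, sigma) for _ in range(half)]
    treat = [rng.gauss(theta, sigma) for _ in range(half)]
    est = sum(treat) / half - sum(control) / half   # effect estimate
    est_var = 4.0 * sigma ** 2 / n                  # variance of the estimate
    post_prec = 1.0 / tau2 + 1.0 / est_var          # prior + data precision
    post_mean = (mu0 / tau2 + est / est_var) / post_prec
    post_sd = post_prec ** -0.5
    z = NormalDist()
    if 1.0 - z.cdf((0.0 - post_mean) / post_sd) >= eta:   # P(effect > 0)
        return "efficacious"
    if z.cdf((mcid - post_mean) / post_sd) >= zeta:       # P(effect < MCID)
        return "futile"
    return "inconclusive"
```

Averaging the returned labels over many replicates gives the percentages reported in Table 5; a sample size meeting the design criterion should make "inconclusive" (essentially) never occur.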
Results
Table 3 gives transformed values of via the method described in Section 3.4.
Transformed values of .

Original weight:   0.20    0.40    0.80    0.60    0.70
Config. A
Config. B
Config. C
Config. D

CP: collective prior.
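The transformation of Section 3.4 is not reproduced in this extract, but the linearization idea, choosing a raw weight so that the sample size lands exactly a fraction p of the way between the no-borrowing and full-borrowing sample sizes, can be sketched numerically for any monotone sample size function. The bisection approach below is an illustrative reconstruction, not the paper's analytical transform.

```python
def linearize_weight(p, n_of_w, tol=1e-8):
    """Map an elicited borrowing fraction p in [0, 1] to a raw weight w such
    that the sample size falls p of the way from n(no borrowing) toward
    n(full borrowing). n_of_w is assumed monotone decreasing in the weight."""
    target = (1.0 - p) * n_of_w(0.0) + p * n_of_w(1.0)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n_of_w(mid) > target:   # still above the target: borrow more
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under such a mapping, an elicited 50% relevance really does halve the sample size saving available from full borrowing, which is the interpretability property the weights are meant to have.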
Table 4 displays the sample sizes (rounded up to the nearest even integer to allow equal allocation between treatment groups) for each configuration, calculated using equation (14) with and , along with the corresponding prior parameters used for design and analysis. The prior for configuration A is centered closer to zero, with a higher variance than the priors for the other configurations, resulting in a sample size of . The prior for configuration D is the most ‘enthusiastic’, centered on a positive treatment effect with a lower variance, resulting in . Configuration B results in a prior centered on a low treatment effect, whereas the prior derived from configuration C is centered on a positive treatment effect; configurations B and C result in priors with similar variances.
Priors for treatment effect in the new experiment, , along with corresponding sample sizes (rounded up to the nearest even integer for ) for configurations .

Config.   CP mean   CP variance   Sample size
A         0.131     0.405         204
B         0.515     0.358         186
C         1.015     0.325         170
D         1.276     0.242         112

CP: collective prior.
Table 5 displays the percentage of simulated trials concluding that the experimental treatment is efficacious (% Eff.) or futile (% Fut.) for each configuration, in scenarios where and respectively. The percentage efficacious is defined as the percentage of the 10,000 simulated trials in which , while the percentage futile is the percentage in which and . The total percentage is 100% in all scenarios, demonstrating that the pre-specified statistical properties are upheld by the proposed method.
Percentage of simulated trials that conclude the treatment is efficacious or futile when and (analyzed using informative priors for as specified in Table 4).

                      Treatment efficacy:               Treatment futility:
Config.   n       % Eff.   % Fut.   Total %        % Eff.   % Fut.   Total %
A         204     49.3     50.7     100            2.6      97.4     100
B         186     66.0     34.0     100            7.3      92.7     100
C         170     88.7     11.3     100            29.2     70.8     100
D         112     98.7     1.3      100            79.8     20.2     100
We emphasize that an investigation of frequentist operating characteristics was not the purpose of this section. Nonetheless, as anticipated, and as mentioned in Section 1, it is clear from Table 5 that to realize the benefits of historical borrowing (at least, in traditional frequentist terms), the treatment effect in the new trial should be similar to the treatment effect in the historical trials. When this is the case, we observe higher ‘power’ (as in configurations C and D when ) or a lower ‘type I error rate’ (as in configurations A and B when ) through borrowing of information. However, this necessarily comes at the risk of a higher type I error rate or reduced power when there is a high degree of heterogeneity between the historical and current trials. As discussed in Kopp-Schneider et al.,35 if one wishes to control the type I error rate in the traditional sense, all prior information must be disregarded in the analysis. It may, however, be desirable to determine the weight parameters alongside consideration of the type I error rate agreed upon by sponsors and regulators at the design stage, as discussed, for example, in Lee.36 Whichever operating characteristics are considered, in any practical application careful selection of historical trials for inclusion, as well as extensive simulation at the trial design stage, would be necessary.
Discussion
The central goal of this paper has been twofold: firstly, to offer a solution to the problem of nonmonotonic behaviour of discrepancy weights caused by the prior aggregation method proposed in Zheng et al.14,15 Our proposed alternative ensures that discrepancy weights behave monotonically with respect to the amount of information included from a particular source. This leads us to derive a Bayesian sample size formula and to achieve our second goal of linearization to improve interpretability. Following our methodology, given a set of historical data sources, clinical expert(s) need only specify the amount of information to borrow (or discount) from each historical data source with respect to the current trial (), and a trial statistician can then incorporate the specified amount of information (using ). We hope that these ideas can encourage effective communication between statisticians and subject-matter experts to elicit sensible values for these weights.
The focus of this work is the design of a two-armed trial where there is prior information on the difference in means between the treatment and control arms. We acknowledge that in practice it is much more common to consider borrowing only on the control arm (i.e. using historical control information to augment or replace a concurrent control). The methods presented here could be adapted to this case, such that a prior would be formed for the arm-based statistic(s). The weight(s) would then relate to the anticipated (dis)similarity between the historical control data and the new control data, implying a reduced number of patients on the new control arm. As noted in Zheng et al.,14 selection of historical data on a single arm should be done carefully to avoid bias that may affect inference on the difference in means.
There are a number of ways in which this work could be extended/generalized. One possibility would be extension to other Bayesian methods proposed for clinical trials which utilize weights for borrowing, such as the robust MAP prior.6 More broadly, the methodology could be applied in any research area (not just clinical trials) where it would be desirable to design an experiment using information from previous studies or external data.
In the case of applying the method to survival data (see Supplemental Materials), the assumption of an exponential distribution is analytically convenient, and not uncommon in the practice of designing clinical trials. It is important to state that sample size formulae (including ours) based on assumptions made for analytical convenience are typically a good design approximation, but not an exact one. As a reviewer rightly noted, analysis of real survival data rarely relies on such assumptions. If the exact analysis model to be applied were known, a simulation-based approach to sample size determination would be more accurate.
We note that in this work we have assumed independence of historical data sources, as a simplified case of aggregating information by the method of Winkler,18 which also covers the case where historical sources are dependent. When historical studies are conducted on distinct patients, the independence assumption seems reasonable. However, if the historical data relate to multiple trials in the same patients (for example, phase II/III trials), the dependence between studies could readily be accounted for by the same method of Winkler,18 via calculation of the pairwise correlations between sources.
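For two correlated normal summaries, Winkler-style pooling reduces to a generalized-least-squares combination with weights determined by the covariance. The two-source closed form below is a minimal illustrative sketch (with `rho` the assumed pairwise correlation), not a full implementation of the dependent-sources method.

```python
def pool_dependent_pair(m1, v1, m2, v2, rho):
    """Combine two correlated normal summaries N(m1, v1) and N(m2, v2),
    with correlation rho, into a single pooled normal summary
    (two-source precision pooling for dependent sources)."""
    c = rho * (v1 * v2) ** 0.5          # covariance between the two sources
    denom = v1 + v2 - 2.0 * c
    w1 = (v2 - c) / denom               # weight on source 1
    mean = w1 * m1 + (1.0 - w1) * m2
    var = (v1 * v2 - c * c) / denom
    return mean, var
```

With `rho = 0` this recovers the familiar independent precision weighting; positive correlation reduces the effective information gained from the second source, as one would expect when two trials share patients.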
Our sample size formula and linearization technique could also be extended to other clinical trial designs where borrowing can be incorporated; for example, combined phase II/III trials using borrowing from the phase II part of the trial to reduce the sample size for the phase III part, or a basket trial setting (for concurrent borrowing between subtrials) in which a sample size is sought for each subtrial, , with sample sizes being solved as a system of simultaneous equations.
The proposed methodology utilizes a single prior for both the design and analysis of the new experiment. There may be instances where it is desirable to modify the analysis prior according to the observed similarity between the historic datasets and current trial. In this case, a distributional distance metric such as the Hellinger distance37 might be useful in updating for the analysis. However, as noted in Zheng et al.,15 this would affect the properties of the Bayesian decision making framework on which the sample size formula is based. Specifically, when are set to larger values in the analysis than in the design (i.e. less borrowing is implemented than planned), it may not be possible to reach a decisive conclusion regarding efficacy or futility. Conversely, using smaller in the analysis than the design (i.e. more borrowing is implemented than planned) would lead to a more precise posterior distribution which may have a higher risk of bias.
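For normal summaries the Hellinger distance has a closed form, so a distance-based update of this kind would be cheap to compute. The sketch below illustrates the metric only, not any specific weight-updating rule.

```python
from math import exp, sqrt

def hellinger_normal(mu1, s1, mu2, s2):
    """Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2), in [0, 1];
    one candidate measure of how similar a historical summary is to the
    observed data in the new trial."""
    bc = sqrt(2.0 * s1 * s2 / (s1 ** 2 + s2 ** 2)) \
        * exp(-((mu1 - mu2) ** 2) / (4.0 * (s1 ** 2 + s2 ** 2)))
    return sqrt(1.0 - bc)   # bc is the Bhattacharyya coefficient
```

A distance near 0 indicates near-identical distributions (supporting more borrowing), while a distance near 1 indicates strong conflict, though, as noted above, any data-driven modification of the weights at analysis would alter the properties guaranteed at design.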
In our approach, we have restricted focus to known variance in outcome measure, (common in many settings), and we approximated by making some simplifying assumptions, which resulted in a closed form for the sample size calculation. One avenue for development would be a more fully Bayesian approach in which priors are specified for and/or . Furthermore, in this paper we have focussed on the Bayesian decision making framework proposed in Whitehead et al.,19 however, it would be simple to adapt the sample size formula for consideration of other Bayesian properties. For example, a sample size formula controlling average properties of posterior interval probabilities could be achieved in a similar manner as in Zheng et al.,14 where a sample size formula is proposed for control of the average coverage criterion or the average length criterion; for implementation of our method this would simply require replacing the prior precision () proposed in Zheng et al.14 with our alternative proposal ().
In conclusion, historical data from a range of sources are often available when planning a new trial, but inclusion of such data in study design and analysis is not common practice. Part of the reason may be the difficulty of interpreting discrepancy parameters. We hope our work will help to bridge this gap and encourage uptake of these innovative methods; however, we caution that the sample size should not be the only consideration when determining whether a borrowing method is appropriate. Simulation is generally still needed to evaluate its performance (bias, power, type I error, etc.).
Supplemental Material
sj-pdf-1-smm-10.1177_09622802261432816 - Supplemental material for Bayesian sample size determination using robust commensurate priors with interpretable discrepancy weights
Supplemental material, sj-pdf-1-smm-10.1177_09622802261432816 for Bayesian sample size determination using robust commensurate priors with interpretable discrepancy weights by Lou E Whitehead, James MS Wason, Oliver Sailer and Haiyan Zheng in Statistical Methods in Medical Research
Acknowledgements
Dr Zheng’s contribution to this work was supported by Cancer Research UK (RCCPDF/100008, RCCCDF-May24/100001). James M. S. Wason is funded by NIHR Research Professorship (NIHR301614).
ORCID iDs
Lou E Whitehead
James MS Wason
Haiyan Zheng
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dr Zheng’s contribution to this work was supported by Cancer Research UK (RCCPDF/100008, RCCCDF-May24/100001). James M. S. Wason is funded by NIHR Research Professorship (NIHR301614).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Software
R code for reproducing the Motivating Example and Performance Evaluation is posted online at GitHub: .
Supplemental materials
Supplemental material for this article is available online.
References
1. Hariton E, Locascio JJ. Randomised controlled trials – the gold standard for effectiveness research. BJOG 2018; 125: 1716.
2. Julious SA. Sample sizes for clinical trials. Boca Raton, FL: CRC Press, 2023.
3. Hampson LV, Whitehead J, Eleftheriou D, et al. Bayesian methods for the design and interpretation of clinical trials in very rare diseases. Stat Med 2014; 33: 4186–4201.
4. Wadsworth I, Hampson LV, Jaki T. Extrapolation of efficacy and other data to support the development of new medicines for children: a systematic review of methods. Stat Methods Med Res 2018; 27: 398–413.
5. Neuenschwander B, Capkun-Niggli G, Branson M, et al. Summarizing historical information on controls in clinical trials. Clin Trials 2010; 7: 5–18.
6. Schmidli H, Gsteiger S, Roychoudhury S, et al. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics 2014; 70: 1023–1032.
7. Ibrahim JG, Chen MH. Power prior distributions for regression models. Stat Sci 2000; 15: 46–60.
8. Hobbs BP, Carlin BP, Mandrekar SJ, et al. Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics 2011; 67: 1047–1056.
9. Hobbs BP, Sargent DJ, Carlin BP. Commensurate priors for incorporating historical information in clinical trials using general and generalized linear models. Bayesian Anal 2012; 7: 639–674.
10. Viele K, Berry S, Neuenschwander B, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat 2014; 13: 41–54.
11. Pocock SJ. The combination of randomized and historical controls in clinical trials. J Chronic Dis 1976; 29: 175–188.
12. Dias LC, Morton A, Quigley J. Elicitation: the science and art of structuring judgement. International Series in Operations Research & Management Science. Springer International Publishing, 2017.
13. Johnson SR, Tomlinson GA, Hawker GA, et al. Methods to elicit beliefs for Bayesian priors: a systematic review. J Clin Epidemiol 2010; 63: 355–369.
14. Zheng H, Jaki T, Wason JMS. Bayesian sample size determination using commensurate priors to leverage preexperimental data. Biometrics 2023; 79: 669–683.
15. Zheng H, Grayling MJ, Mozgunov P, et al. Bayesian sample size determination in basket trials borrowing information between subsets. Biostatistics 2023; 24: 1000–1016.
16. Grinstead CM, Snell JL. Introduction to probability. Providence, RI: American Mathematical Society, 1997.
17. Zheng H, Wason JMS. Borrowing of information across patient subgroups in a basket trial based on distributional discrepancy. Biostatistics 2022; 23: 120–135.
18. Winkler RL. Combining probability distributions from dependent information sources. Manage Sci 1981; 27: 479–488.
19. Whitehead J, Valdés-Márquez E, Johnson P, et al. Bayesian sample size for exploratory clinical trials incorporating historical data. Stat Med 2008; 27: 2307–2327.
20. Bornkamp B. Functional uniform priors for nonlinear modeling. Biometrics 2012; 68: 893–901.
21. Bornkamp B. Practical considerations for using functional uniform prior distributions for dose-response estimation in clinical trials. Biom J 2014; 56: 947–962.
22. Du Z, Li Y, Li J, et al. Physical activity can improve cognition in patients with Alzheimer’s disease: a systematic review and meta-analysis of randomized controlled trials. Clin Interv Aging 2018; 13: 1593–1603.
23. Arevalo-Rodriguez I, Smailagic N, Roqué-Figuls M, et al. Mini-mental state examination (MMSE) for the early detection of dementia in people with mild cognitive impairment (MCI). Cochrane Database Syst Rev 2021; 7.
24. Salis F, Costaggiu D, Mandas A. Mini-mental state examination: optimal cut-off levels for mild and severe cognitive impairment. Geriatrics 2023; 8: 12.
25. Mishra B, Sudheer P, Agarwal A, et al. Minimal clinically important difference (MCID) in patient-reported outcome measures for neurological conditions: review of concept and methods. Ann Indian Acad Neurol 2023; 26: 334–343.
26. Chandra M, Harbishettar V, Sawhney H, et al. Ethical issues in dementia research. Indian J Psychol Med 2021; 43: S25–S30.
27. Vreugdenhil A, Cannell J, Davies A, et al. A community-based exercise programme to improve functional ability in people with Alzheimer’s disease: a randomized controlled trial. Scand J Caring Sci 2012; 26: 12–19.
28. Hoffmann K, Sobol NA, Frederiksen KS, et al. Moderate-to-high intensity physical exercise in patients with Alzheimer’s disease: a randomized controlled trial. J Alzheimers Dis 2016; 50: 443–453.
29. Venturelli M, Scarsini R, Schena F. Six-month walking program changes cognitive and ADL performance in patients with Alzheimer’s. Am J Alzheimers Dis Other Demen 2011; 26: 381–388.
30. Dky M, Szeto SL, Mak YF, et al. A randomised controlled trial on the effect of exercise on physical, cognitive and affective function in dementia subjects. Asian J Gerontol Geriatr 2008; 3: 8–16.
31. Yang SY, Shan CL, Qing H, et al. The effects of aerobic exercise on cognitive function of Alzheimer’s disease patients. CNS Neurol Disord Drug Targets 2015; 14: 1292–1297.
32. Holthoff VA, Marschner K, Scharf M, et al. Effects of physical activity training in patients with Alzheimer’s dementia: results of a pilot RCT study. PLoS ONE 2015; 10: e0121478.
33. Kwak Y-S, Um S-Y, Son T-G, et al. Effect of regular exercise on senile dementia patients. Int J Sports Med 2007; 29: 471–474.
34. Hora S. Probability elicitation. In: The Oxford handbook of probability and philosophy. Oxford University Press, 2016. doi:10.1093/oxfordhb/9780199607617.013.30
35. Kopp-Schneider A, Calderazzo S, Wiesenfarth M. Power gains by using external information in clinical trials are typically not possible when requiring strict type I error control. Biom J 2020; 62: 361–374.
36. Lee SY. Eliciting the discount parameter in a power prior method on the basis of the type I error consideration. Stat Biopharm Res 2024; 17: 1–24.
37. Dey DK, Birmiwal LR. Robust Bayesian analysis using divergence measures. Stat Probab Lett 1994; 20: 287–294.