Bayesian adaptive design for clinical trials with potential subgroup effects

Abstract

Adaptive clinical trial designs increasingly aim to improve efficiency while accommodating subgroup heterogeneity, yet most existing methods fix assumptions about drug efficacy and subgroup effects. We propose a Bayesian adaptive design that explicitly models and learns from uncertainty in both components. A hierarchical mixture prior represents uncertainty about overall treatment efficacy and the magnitude of a biomarker-defined subgroup effect. Interim data are used to update these hyperparameters into posterior distributions, enabling a decision-theoretic framework that adaptively selects the optimal testing strategy among three options: continuing with the overall population, focusing on the subgroup, or conducting a joint test of both. When joint testing is chosen, the posterior information further determines the optimal allocation of Type I error between populations by selecting an evidence-based $α$ -splitting parameter that maximizes expected power under error-rate constraints. The resulting optimization is solved efficiently using GPU-accelerated quasi-Monte Carlo integration and smooth search procedures. Simulation studies across a range of subgroup prevalences and effect sizes demonstrate that the proposed design maintains nominal error control, achieves superior power and decision accuracy, and adapts appropriately to prior misspecification. By unifying posterior learning and adaptive $α$ -allocation within a principled Bayesian framework, this design provides a transparent and computationally practical tool for confirmatory clinical trials with uncertain subgroup effects, supporting precision-medicine decision-making and regulatory reproducibility.

Keywords

Bayesian adaptive design; decision-theoretic clinical trial design; clinical trial; GPU-acceleration; Type I error control; alpha allocation

1. Introduction

The increasing adoption of molecularly targeted therapies in oncology has shifted the focus of drug development toward biomarker-defined subgroups of patients who are most likely to benefit from treatment.¹ This evolution in precision medicine introduces new statistical challenges for confirmatory phase III trials. In early development, there is often substantial uncertainty about both the true treatment efficacy and the strength of the biomarker effect, which may lead to incorrect assumptions when designing later-phase studies. Because these assumptions must be made before data are observed, some degree of prior misspecification is inevitable. Such misspecification can distort operating characteristics, leading to inefficient designs or inflated Type I error, and motivates the need for adaptive frameworks that can learn and self-correct as evidence accrues.

Regulatory agencies have increasingly encouraged innovative designs that improve efficiency and evidence generation. For example, the U.S. Food and Drug Administration recently launched Project FrontRunner, which promotes adaptive and seamless designs that can simultaneously support accelerated and regular approval pathways.² In parallel, several statistical frameworks have been developed to address biomarker uncertainty in confirmatory settings, including subset-effect testing, adaptive informational designs, and extension of the 2-in-1 design.^3–5 Chen and colleagues introduced the “2-in-1” design, which allows a phase II trial to expand seamlessly into a phase III confirmatory study, thereby reducing total development time and controlling Type I error across dual objectives.^6,7,8 Recent adaptive phase II/III work has also extended this framework to dose optimization, reflecting Project Optimus-motivated concerns about dose selection, Type I error control, and potential inconsistency between phase II and phase III evidence.^9,10 Shentu et al. proposed the auto-adaptive alpha allocation design, which dynamically re-weights the significance levels between the overall and subgroup hypotheses using interim information.¹¹ Lu et al. further extended this line of work by incorporating prior distributions on efficacy to determine fixed optimal testing thresholds before the trial begins, highlighting the computational complexity of such optimization problems.¹²

Despite these advances, existing frequentist strategies rely on prespecified decision rules and cannot correct for prior misspecification once the trial begins. Even designs that use prior distributions for calibration do so only at the planning stage, treating the prior as fixed and not subject to learning. Consequently, key decisions—such as whether to continue testing the entire population, the biomarker-defined subgroup, or both jointly, and how to optimally allocate the Type I error when a joint test is conducted—remain fixed rather than evidence-driven.

To address these limitations, we propose a Bayesian adaptive design that integrates posterior learning into both decision-making and error allocation. The design explicitly models two main components of uncertainty: (i) the unknown true treatment efficacy and (ii) the potential biomarker-specific effect. These are represented through a hierarchical mixture prior that captures uncertainty in both overall and subgroup-level parameters. Crucially, interim data are used to update all prior distributions into posterior distributions, allowing the design to automatically correct for the prior misspecification that is inevitable at the design stage. This posterior learning informs two adaptive actions. First, it selects the most appropriate trial branch for hypothesis testing based on the blinded interim data: testing only for the entire population, testing only for the subgroup, or testing jointly for both populations (formal definitions of the three branches are given in Section 2.3). Second, if joint testing is selected, the design determines the optimal allocation of the Type I error (the $α$-splitting parameter) that maximizes conditional power while maintaining global error control.

The resulting optimization problem involves multidimensional integration and repeated evaluations across design parameters such as subgroup prevalence ( $r$ ) and interim timing ( $t$ ). To ensure computational feasibility, we employ GPU-accelerated quasi-Monte Carlo integration and smooth optimization methods. Simulation studies across a broad range of biomarker effect sizes and prevalences demonstrate that the proposed Bayesian adaptive design maintains nominal Type I error control, improves power and decision accuracy relative to existing approaches, and exhibits robustness to prior misspecification. This framework provides a practical and interpretable approach to adaptive confirmatory trials under joint uncertainty in treatment efficacy and subgroup effects, aligning with the goals of modern precision medicine.

The remainder of this paper is organized as follows. Section 2 introduces the proposed Bayesian adaptive design, including the hierarchical mixture prior, posterior updating, and the decision-theoretic framework for adaptive branching and $α$ -splitting. Section 3 describes the computational implementation, including GPU-accelerated quasi-Monte Carlo integration and the optimization algorithm. Section 4 presents extensive simulation studies evaluating operating characteristics across a range of biomarker prevalences and effect sizes. Section 5 concludes with a discussion of practical considerations, potential extensions, and implications for future biomarker-driven confirmatory trials.

2. Methods

2.1. Notation and interim test statistics

Consider a randomized phase III trial comparing an experimental treatment to control, where the treatment may have a stronger effect in a predefined biomarker-positive subgroup. Let $Δ_{1}$ denote the log-hazard ratio in the entire population and $Δ_{2}$ the corresponding effect in the subgroup, and write $Δ = (Δ_{1}, Δ_{2})^{⊤}$ . An interim analysis is scheduled when a fraction $t \in (0, 1)$ of the total information has accrued. At interim, the $z$ -scores $X_{t} = (X_{1, t}, X_{2, t})^{⊤}$ for testing $H_{0 i} : Δ_{i} = 0$ are available; at the final analysis ( $t = 1$ ), the complete $z$ -scores are $X = (X_{1}, X_{2})^{⊤}$ .

Following Shentu et al.,¹¹ conditional on $Δ$

$$X_{t} ∣ Δ \sim N \left( μ_{0} = \begin{pmatrix} \sqrt{I t} Δ_{1} \\ \sqrt{r I t} Δ_{2} \end{pmatrix}, \; Σ_{0} = \begin{pmatrix} 1 & \sqrt{r} \\ \sqrt{r} & 1 \end{pmatrix} \right) \tag{2.1}$$

where $I$ represents the total information, corresponding to 25% of the planned number of events, which is directly related to the trial’s sample size, and $r$ represents the proportion of participants in the biomarker-positive subgroup. The correlation $\sqrt{r}$ can be derived analytically because the subgroup $z$-score is based on a subset of the overall data set. A larger $r$ implies that the drug would be administered to a broader patient population. However, this may also be associated with a weaker subgroup effect.
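The $\sqrt{r}$ correlation can be checked with a short derivation. This is only a sketch: it assumes that the overall score statistic is the sum of independent subgroup and complement score statistics, written here with the hypothetical symbols $U_{2}$ and $U_{c}$, whose variances are proportional to the information accrued in each stratum.

$$\begin{aligned} X_{1, t} & = \frac{U_{2} + U_{c}}{\sqrt{I t}}, \quad X_{2, t} = \frac{U_{2}}{\sqrt{r I t}}, \quad \mathrm{Var} (U_{2}) = r I t, \quad \mathrm{Var} (U_{c}) = (1 - r) I t, \\ \mathrm{Cov} (X_{1, t}, X_{2, t}) & = \frac{\mathrm{Var} (U_{2})}{\sqrt{I t} \sqrt{r I t}} = \frac{r I t}{\sqrt{r} \, I t} = \sqrt{r} \end{aligned}$$

Because both $z$-scores have unit variance under these assumptions, the covariance equals the correlation, which matches the off-diagonal entry of $Σ_{0}$ in (2.1).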

The trial is successful if at the final analysis one of the test statistics exceeds its critical value. Denote the final critical values by $Z_{1}$ for the entire population and $Z_{2}$ for the subgroup. These thresholds will be chosen by our optimization procedure.

2.2. Drug efficacy’s prior assumption and posterior updated from interim results

To encode uncertainty in both overall efficacy and a potential subgroup advantage, we specify a two-component normal mixture prior

$$Δ \sim (1 - p_{1}) N (μ_{1}, Σ_{1}) + p_{1} N (μ_{1}^{'}, Σ_{1}) \tag{2.2}$$

with $μ_{1} = (δ, δ)^{⊤}$ and $μ_{1}^{'} = (δ, δ + d)^{⊤}$, where $δ$ is the expected overall effect, $d \geq 0$ is the anticipated subgroup advantage, and $p_{1} \in [0, 1]$ is the prior probability of a true subgroup effect. We set $Σ_{1} = (δ / k)^{2} Σ_{0}$ with dispersion factor $k \geq 1$ to tune prior strength.

A convenient hierarchical representation of the above mixture prior is

$$Δ ∣ i \sim N ((δ, δ + d i)^{⊤}, Σ_{1}), \quad i \in \{0, 1\} \tag{2.3}$$

$$i \sim \mathrm{Bernoulli} (p_{1}) \tag{2.4}$$

The latent binary indicator $i$ represents the existence of a subgroup effect ($i = 1$) versus no subgroup effect ($i = 0$). The uncertainty of the subgroup effect is modeled by the Bernoulli distribution. The values of the parameters $δ$, $d$, $p_{1}$, and $Σ_{1}$ are required user inputs, which can be specified using information learned from early-phase clinical trials or the literature.

Given interim data $X_{t}$ , the posterior is again a two-component mixture

$$Δ ∣ X_{t} \sim (1 - p_{2}) N (μ_{2}, Σ_{2}) + p_{2} N (μ_{2}^{'}, Σ_{2}) \tag{2.5}$$

with closed-form updates

$$\begin{aligned} Σ_{2} & = (Σ_{1}^{- 1} + A_{t} Σ_{0}^{- 1} A_{t})^{- 1}, \quad A_{t} = \mathrm{diag} (\sqrt{I t}, \sqrt{r I t}), \\ μ_{2} & = Σ_{2} (Σ_{1}^{- 1} μ_{1} + A_{t} Σ_{0}^{- 1} X_{t}), \quad μ_{2}^{'} = Σ_{2} (Σ_{1}^{- 1} μ_{1}^{'} + A_{t} Σ_{0}^{- 1} X_{t}), \\ p_{2} & = \frac{p_{1} f_{1} (X_{t})}{(1 - p_{1}) f_{0} (X_{t}) + p_{1} f_{1} (X_{t})} \end{aligned}$$

where $f_{i} (X_{t})$ is the (closed-form) marginal density of $X_{t}$ under $i \in \{0, 1\}$. For a full derivation of these updates, refer to Supplemental material C.1.
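The updates in (2.5) can be sketched in a few lines of NumPy. This is an illustrative CPU sketch, not the reference implementation: the function name `posterior_update` and its argument names are hypothetical, and the marginal density is assumed to take the standard conjugate form $X_{t} ∣ i \sim N (A_{t} μ_{i}, Σ_{0} + A_{t} Σ_{1} A_{t})$, which should be checked against the derivation in the Supplemental material. The sketch requires $r < 1$ so that $Σ_{0}$ is invertible.

```python
import numpy as np
from scipy.stats import multivariate_normal


def posterior_update(x_t, delta, d, p1, k, r, t, info):
    """Return (Sigma_2, mu_2, mu_2', p_2) for interim z-scores x_t (needs r < 1)."""
    x_t = np.asarray(x_t, dtype=float)
    sigma0 = np.array([[1.0, np.sqrt(r)], [np.sqrt(r), 1.0]])
    sigma1 = (delta / k) ** 2 * sigma0                         # prior covariance
    a_t = np.diag([np.sqrt(info * t), np.sqrt(r * info * t)])  # A_t
    mu1 = np.array([delta, delta])                             # no subgroup effect
    mu1p = np.array([delta, delta + d])                        # subgroup effect

    s0_inv = np.linalg.inv(sigma0)
    s1_inv = np.linalg.inv(sigma1)
    sigma2 = np.linalg.inv(s1_inv + a_t @ s0_inv @ a_t)
    mu2 = sigma2 @ (s1_inv @ mu1 + a_t @ s0_inv @ x_t)
    mu2p = sigma2 @ (s1_inv @ mu1p + a_t @ s0_inv @ x_t)

    # Assumed marginal of X_t given the mixture label i (standard conjugate result)
    marg_cov = sigma0 + a_t @ sigma1 @ a_t
    f0 = multivariate_normal.pdf(x_t, mean=a_t @ mu1, cov=marg_cov)
    f1 = multivariate_normal.pdf(x_t, mean=a_t @ mu1p, cov=marg_cov)
    p2 = p1 * f1 / ((1.0 - p1) * f0 + p1 * f1)
    return sigma2, mu2, mu2p, p2


# Illustrative inputs only: interim z-scores set to their mean under a subgroup effect
r, t, info, delta, d, p1, k = 0.5, 0.25, 261.0, 0.20, 0.30, 0.5, 4.0
a_t = np.diag([np.sqrt(info * t), np.sqrt(r * info * t)])
x_with_effect = a_t @ np.array([delta, delta + d])
sigma2, mu2, mu2p, p2 = posterior_update(x_with_effect, delta, d, p1, k, r, t, info)
print("posterior probability of a subgroup effect:", p2)
```

With interim data centered on the subgroup-effect component, the posterior weight moves above the prior weight, and it moves below when the data are centered on the no-effect component.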

The posterior probability $p_{2}$ reflects updated confidence in a subgroup effect based on interim data. If the subgroup and overall population have similar efficacy, then $f_{0} (X_{t}) > f_{1} (X_{t})$ , leading to $p_{2} < p_{1}$ . Conversely, if a subgroup effect exists, then $f_{0} (X_{t}) < f_{1} (X_{t})$ , increasing $p_{2}$ .

This Bayesian updating is central: it not only propagates uncertainty in both $δ$ (overall efficacy) and $d$ (biomarker effect), but also mitigates unavoidable prior misspecification at the design stage by shrinking parameters toward values supported by interim data.

2.3. Adaptive decision branches and optimization

After the interim analysis, the trial proceeds along one of the three branches depending on the posterior evidence for a subgroup effect, as shown in Figure 1. Each branch specifies its own recruitment plan, confirmatory hypotheses, and final test. Within the selected branch, statistical testing is conducted at the nominal one-sided level $α^{⋆}$ (e.g. $α^{⋆} = 0.025$, corresponding to a two-sided 5% test). As proved in Supplemental Section C (using arguments similar to Zhang et al.,¹³ Chen et al.¹⁴), controlling the Type I error within each branch at level $α^{⋆}$ implies strong control of the overall Type I error for the entire three-branch design at the same level $α^{⋆}$, despite the data-driven selection of the branch. Consequently, all three branches maintain the same Type I error control, and the main operating differences between them arise in their statistical power and precision of decision in different biomarker-effect scenarios.

Figure 1.

Diagram of the Bayesian adaptive design with three branches based on interim analysis results. The trial evaluates drug efficacy in both the entire population (blue) and a subgroup with a potentially better response (orange). At an interim time $t$, partial data are used to assess efficacy, update prior knowledge, and guide the decision on which branch to follow.

Branch #1: Joint confirmatory test (entire population + subgroup)

Recruitment: continue as originally planned from the entire population. Hypotheses and success:

$$H_{0} : Δ_{1} = 0, Δ_{2} = 0 \quad \text{versus} \quad H_{1} : Δ_{1} \neq 0 \text{ or } Δ_{2} \neq 0$$

and the trial succeeds in this branch if $X_{1} \geq Z_{1}$ or $X_{2} \geq Z_{2}$ at the final analysis. Non-rejection probability:

$$S (Δ, X_{t}, Z_{1}, Z_{2}) = Φ_{\sqrt{r}} \left( \frac{Z_{1} - \sqrt{t} X_{1, t}}{\sqrt{1 - t}} - \sqrt{(1 - t) I} Δ_{1}, \; \frac{Z_{2} - \sqrt{t} X_{2, t}}{\sqrt{1 - t}} - \sqrt{(1 - t) r I} Δ_{2} \right) \tag{2.6}$$

where $Φ_{\sqrt{r}}$ is the bivariate normal cumulative distribution function (CDF) with correlation $\sqrt{r}$. Type I error and expected power (conditional on $X_{t}$):

$$\begin{aligned} α (X_{t}, Z_{1}, Z_{2}) & = 1 - S (Δ = 0, X_{t}, Z_{1}, Z_{2}) \\ Q (X_{t}, Z_{1}, Z_{2}) & = 1 - \int S (Δ, X_{t}, Z_{1}, Z_{2}) f_{Δ ∣ X_{t}} (Δ) d Δ \end{aligned}$$

Optimization (branch-level)

$$\arg \max_{Z_{1}, Z_{2}} Q (X_{t}, Z_{1}, Z_{2}) \quad \text{s.t.} \quad α (X_{t}, Z_{1}, Z_{2}) \leq α^{⋆} \tag{2.7}$$
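The non-rejection probability (2.6) and the conditional Type I error for the joint branch can be evaluated directly with SciPy's bivariate normal CDF. The following is a minimal CPU sketch under stated assumptions: the function names `nonrejection_prob` and `conditional_type1` are hypothetical, the numerical inputs are illustrative, and $r < 1$ is required so that the correlation matrix is non-singular. The starting thresholds split $α^{⋆}$ equally between the two tests (a conditional Bonferroni split), which is feasible for (2.7) but generally not optimal.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


def nonrejection_prob(delta, x_t, z, r, t, info):
    """S(Delta, X_t, Z_1, Z_2) in (2.6): P(X_1 < Z_1, X_2 < Z_2 | X_t, Delta)."""
    delta, x_t, z = (np.asarray(v, dtype=float) for v in (delta, x_t, z))
    drift = np.array([np.sqrt((1 - t) * info), np.sqrt((1 - t) * r * info)]) * delta
    upper = (z - np.sqrt(t) * x_t) / np.sqrt(1 - t) - drift
    cov = np.array([[1.0, np.sqrt(r)], [np.sqrt(r), 1.0]])
    return float(multivariate_normal.cdf(upper, mean=np.zeros(2), cov=cov))


def conditional_type1(x_t, z, r, t, info):
    """alpha(X_t, Z_1, Z_2) = 1 - S(0, X_t, Z_1, Z_2)."""
    return 1.0 - nonrejection_prob([0.0, 0.0], x_t, z, r, t, info)


# Illustrative interim z-scores and design constants
x_t = np.array([1.0, 1.2])
r, t, info, alpha_star = 0.5, 0.5, 261.0, 0.025

# Conditional Bonferroni thresholds: each test gets alpha_star / 2
z_bonf = np.sqrt(1 - t) * norm.ppf(1 - alpha_star / 2) + np.sqrt(t) * x_t
alpha_bonf = conditional_type1(x_t, z_bonf, r, t, info)
print("thresholds:", z_bonf, "conditional Type I error:", alpha_bonf)
```

Because the two statistics are positively correlated, the equal split spends less than $α^{⋆}$ in total, which is the slack that the optimization in (2.7) exploits.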

Branch #2: Entire-population only

Recruitment: continue as originally planned from the entire population. Hypotheses and success:

$$H_{0} : Δ_{1} = 0 \quad \text{versus} \quad H_{1} : Δ_{1} \neq 0$$

and success if $X_{1} \geq Z_{1}^{'}$. Non-rejection probability, Type I error, and expected power:

$$S (Δ_{1}, X_{1, t}, Z_{1}^{'}) = Φ \left( \frac{Z_{1}^{'} - \sqrt{t} X_{1, t}}{\sqrt{1 - t}} - \sqrt{(1 - t) I} Δ_{1} \right) \tag{2.8}$$

$$\begin{aligned} α (X_{1, t}, Z_{1}^{'}) & = 1 - S (0, X_{1, t}, Z_{1}^{'}) \\ Q (X_{1, t}, Z_{1}^{'}) & = 1 - \int S (Δ_{1}, X_{1, t}, Z_{1}^{'}) f_{Δ_{1} ∣ X_{1, t}} (Δ_{1}) d Δ_{1} \end{aligned}$$

For one-sided level $α^{⋆}$, the optimal threshold has the closed form

$$Z_{1}^{'} = \sqrt{1 - t} \, Φ^{- 1} (1 - α^{⋆}) + \sqrt{t} X_{1, t}$$

Branch #3: Subgroup-only recruitment and testing

Recruitment: after the interim, enroll only biomarker-positive patients (the subgroup); the complement is no longer accrued. Hypotheses and success:

$$H_{0} : Δ_{2} = 0 \quad \text{versus} \quad H_{1} : Δ_{2} \neq 0$$

and success if $X_{2} \geq Z_{2}^{'}$. Non-rejection probability, Type I error, and expected power:

$$S (Δ_{2}, X_{2, t}, Z_{2}^{'}) = Φ \left( \frac{Z_{2}^{'} - \sqrt{r t / (1 - (1 - r) t)} X_{2, t}}{\sqrt{(1 - t) / (1 - (1 - r) t)}} - \sqrt{I (1 - t)} Δ_{2} \right) \tag{2.9}$$

$$\begin{aligned} α (X_{2, t}, Z_{2}^{'}) & = 1 - S (0, X_{2, t}, Z_{2}^{'}) \\ Q (X_{2, t}, Z_{2}^{'}) & = 1 - \int S (Δ_{2}, X_{2, t}, Z_{2}^{'}) f_{Δ_{2} ∣ X_{2, t}} (Δ_{2}) d Δ_{2} \end{aligned}$$

The optimal one-sided boundary is

$$Z_{2}^{'} = \sqrt{\frac{1 - t}{1 - (1 - r) t}} \, Φ^{- 1} (1 - α^{⋆}) + \sqrt{\frac{r t}{1 - (1 - r) t}} X_{2, t}$$
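The two closed-form boundaries can be checked numerically: substituting each threshold into the corresponding conditional Type I error from (2.8) or (2.9) should return $α^{⋆}$ exactly. The sketch below does this for illustrative inputs; the function names are hypothetical and not taken from the reference implementation.

```python
from math import sqrt

from scipy.stats import norm


def branch2_threshold(x1t, t, alpha_star=0.025):
    """Closed-form Z_1' for the entire-population branch."""
    return sqrt(1 - t) * norm.ppf(1 - alpha_star) + sqrt(t) * x1t


def branch2_alpha(z, x1t, t):
    """Conditional Type I error 1 - S(0, X_{1,t}, Z_1') from (2.8)."""
    return 1.0 - norm.cdf((z - sqrt(t) * x1t) / sqrt(1 - t))


def branch3_threshold(x2t, t, r, alpha_star=0.025):
    """Closed-form Z_2' for the subgroup-only branch."""
    denom = 1 - (1 - r) * t
    return sqrt((1 - t) / denom) * norm.ppf(1 - alpha_star) + sqrt(r * t / denom) * x2t


def branch3_alpha(z, x2t, t, r):
    """Conditional Type I error 1 - S(0, X_{2,t}, Z_2') from (2.9)."""
    denom = 1 - (1 - r) * t
    return 1.0 - norm.cdf((z - sqrt(r * t / denom) * x2t) / sqrt((1 - t) / denom))


# Illustrative interim values
x1t, x2t, t, r = 1.0, 1.5, 0.25, 0.5
z1 = branch2_threshold(x1t, t)
z2 = branch3_threshold(x2t, t, r)
print("Z_1' =", z1, "alpha =", branch2_alpha(z1, x1t, t))
print("Z_2' =", z2, "alpha =", branch3_alpha(z2, x2t, t, r))
```

Both thresholds grow with the interim $z$-score, so a favorable interim result raises the final boundary by exactly the amount needed to keep the conditional error at $α^{⋆}$.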

Summary

Branches 2 and 3 admit closed-form optimal thresholds $Z_{1}^{'}$ and $Z_{2}^{'}$ that achieve level $α^{⋆}$ . Branch 1 requires numerical optimization of (2.7) because the constraint depends on the joint CDF in (2.6). The associated multidimensional integrations motivate our GPU-accelerated quasi-Monte Carlo implementation (Section 3). Proofs of (2.6)–(2.9) and the branch-selection error control are provided in Supplemental material Sections C.1 and C.2.

Supplemental Table S.1 summarizes the input parameters, including their ranges and examples, as well as the dependent parameters derived from these inputs and the outputs. The table also provides the relationships between input and dependent parameters to ensure clarity and facilitate reproducibility of the simulation results. The dependent parameters vary based on the data scheme after the interim analysis (joint, subgroup, or full model), but they all share the same inputs to ensure consistency in the problem setting.

3. Computational considerations and GPU-accelerated implementation

Evaluating the objective $Q (X_{t}, Z_{1}, Z_{2})$ and the constraint $α (X_{t}, Z_{1}, Z_{2})$ in (2.7) requires repeated evaluations of the bivariate normal probability in (2.6) and two nested expectations with respect to the posterior mixture (2.5). Because the design must be assessed across many $(r, t, I)$ settings and interim outcomes, a CPU-only approach quickly becomes prohibitive. Moreover, in real trial planning, the optimization needs to be repeated for multiple combinations of prior parameters $(δ, d, p_{1}, k)$ for sensitivity analysis. Although posterior updating in our framework already mitigates the effect of prior misspecification, regulators and sponsors typically prefer transparent exploration of robustness across priors. This process produces a decision surface summarizing optimal boundaries and expected power under different priors (see Section 4), making GPU acceleration especially valuable in practice. At the same time, GPU acceleration is not strictly required to apply the proposed design. A standard CPU implementation is generally feasible for single-design evaluation or a small number of scenario checks, whereas GPU computing becomes most useful when the optimization must be repeated many times across prior and design settings. To facilitate practical adoption, we provide an open-source Python/CUDA implementation on GitHub. In addition, readers without dedicated local GPU hardware may still access GPU resources through institutional computing clusters or commercial cloud services such as AWS.

On-device evaluation of $Q$ and $α$

We implement a high-throughput, all-on-device algorithm to compute $Q (X_{t}, Z_{1}, Z_{2})$ and $α (X_{t}, Z_{1}, Z_{2})$ entirely within GPU memory, thereby eliminating CPU–GPU data transfers and achieving substantial speed gains. Specifically, (i) posterior samples $\{Δ^{(k)}\}_{k = 1}^{N_{Δ}}$ are drawn from (2.5) directly on the GPU; (ii) for each $Δ^{(k)}$, we evaluate the bivariate normal cumulative distribution $S (Δ^{(k)}, X_{t}, Z_{1}, Z_{2})$ in (2.6) using a quasi-Monte Carlo method that combines the Genz transformation for Gaussian probabilities¹⁵ with the lattice rule of Sloan and Kachoyan¹⁶ and a van der Corput sequence for low-discrepancy sampling^17,18; and (iii) we average across samples to approximate the expected power

$$Q (X_{t}, Z_{1}, Z_{2}) \approx 1 - \frac{1}{N_{Δ}} \sum_{k = 1}^{N_{Δ}} S (Δ^{(k)}, X_{t}, Z_{1}, Z_{2}) \tag{3.1}$$

where $(Δ^{(1)}, \dots, Δ^{(N_{Δ})})$ are posterior samples from $Δ ∣ X_{t}$ defined in (2.5). The Type I error $α (X_{t}, Z_{1}, Z_{2}) = 1 - S (0, X_{t}, Z_{1}, Z_{2})$ is computed using the same kernel. Details appear in Supplemental Algorithm S.1; the vectorized computation of $S$ is described in Supplemental Algorithm S.2, and the approximation of expected power $Q$ in Supplemental Algorithm S.3. This design ensures all sampling, function evaluation, and aggregation occur on the GPU, fully exploiting parallelism and minimizing communication overhead.
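The structure of this computation can be illustrated on the CPU with NumPy. The sketch below is not the CUDA kernel described above: it applies the Genz transformation to the bivariate case (which reduces the probability to a one-dimensional integral over the unit interval), uses a plain base-2 van der Corput sequence without the lattice rule, and then averages over posterior draws as in (3.1). All function names and the posterior mixture values in the example are hypothetical, and $r < 1$ is assumed.

```python
import numpy as np
from scipy.special import ndtr, ndtri


def van_der_corput(n, base=2):
    """First n van der Corput points (indices 1..n), all strictly inside (0, 1)."""
    points = np.zeros(n)
    for j in range(n):
        i, frac = j + 1, 1.0 / base
        while i > 0:
            i, digit = divmod(i, base)
            points[j] += digit * frac
            frac /= base
    return points


def bvn_cdf_genz(upper, rho, u):
    """P(Y1 < a, Y2 < b) for standard normals with correlation rho.

    upper has shape (..., 2); u holds quasi-random points in (0, 1).
    """
    upper = np.asarray(upper, dtype=float)
    a = upper[..., 0, None]
    b = upper[..., 1, None]
    e1 = ndtr(a)                    # P(Y1 < a)
    y1 = ndtri(u * e1)              # Y1 restricted to (-inf, a)
    e2 = ndtr((b - rho * y1) / np.sqrt(1.0 - rho**2))
    return (e1 * e2).mean(axis=-1)  # average over the QMC points


def expected_power(delta_samples, x_t, z, r, t, info, n_qmc=1024):
    """Approximate Q in (3.1) from posterior draws of shape (N, 2)."""
    deltas = np.asarray(delta_samples, dtype=float)
    scale = np.array([np.sqrt((1 - t) * info), np.sqrt((1 - t) * r * info)])
    base = (np.asarray(z, float) - np.sqrt(t) * np.asarray(x_t, float)) / np.sqrt(1 - t)
    s = bvn_cdf_genz(base - scale * deltas, np.sqrt(r), van_der_corput(n_qmc))
    return 1.0 - s.mean()


# Hypothetical posterior mixture, for illustration only
rng = np.random.default_rng(0)
n_draws, p2 = 2000, 0.8
mu2, mu2p = np.array([0.15, 0.15]), np.array([0.15, 0.35])
sigma2 = 0.05**2 * np.array([[1.0, np.sqrt(0.5)], [np.sqrt(0.5), 1.0]])
has_effect = rng.random(n_draws) < p2
draws = np.where(has_effect[:, None], mu2p, mu2)
draws = draws + rng.multivariate_normal(np.zeros(2), sigma2, size=n_draws)
q_hat = expected_power(draws, x_t=[1.0, 1.5], z=[2.3, 2.6], r=0.5, t=0.5, info=261.0)
print("approximate expected power:", q_hat)
```

The array of shape (number of draws, number of QMC points) is the same data-parallel layout that makes the computation a natural fit for a GPU: every entry is evaluated independently and only the final averages are reduced.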

Solving the constrained optimization

For Branch #1, we solve (2.7) using two complementary strategies. First, we evaluate $Q (X_{t}, Z_{1}, Z_{2})$ on a moderate grid of $(Z_{1}, Z_{2})$ values satisfying $α (X_{t}, Z_{1}, Z_{2}) \leq α^{⋆}$, then apply thin-plate spline (TPS) smoothing to interpolate and denoise the grid. Because each $Q$ evaluation involves stochastic approximation via quasi-Monte Carlo, TPS smoothing serves not only as an interpolator but also as a variance-reduction mechanism, borrowing information from neighboring grid points to suppress Monte Carlo randomness while preserving the smooth underlying shape of $Q (Z_{1}, Z_{2})$. The smoothed surface is continuously differentiable, enabling efficient optimization using gradient-based solvers such as BFGS.¹² Second, for large-scale sensitivity analyses involving numerous combinations of design and prior parameters, we employ optimizers that do not require analytic gradients, including COBYLA,¹⁹ SLSQP²⁰ (a gradient-based method that can be run with finite-difference gradients), and Bayesian Optimization solvers.^21,22 These algorithms adaptively balance exploration and exploitation, estimate local constraints dynamically, and achieve high efficiency without requiring analytic gradients or large numbers of function evaluations. Branches #2 and #3 admit the closed-form thresholds $Z_{1}^{'}$ and $Z_{2}^{'}$ given in Section 2.3 after (2.8) and (2.9), and therefore do not require numerical optimization.
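As a small-scale illustration of the second strategy, the constrained problem (2.7) can be passed to SciPy's COBYLA solver. This is a sketch under several assumptions rather than the pipeline used for the reported results: it runs on the CPU, uses SciPy's bivariate normal CDF in place of the quasi-Monte Carlo kernel, keeps a fixed set of hypothetical posterior draws so that the objective is deterministic, and starts from a conditional Bonferroni split. The function name `optimize_thresholds` and all numerical inputs are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm


def optimize_thresholds(delta_samples, x_t, r, t, info, alpha_star=0.025):
    """Maximize expected power over (Z_1, Z_2) subject to alpha <= alpha_star."""
    x_t = np.asarray(x_t, dtype=float)
    cov = np.array([[1.0, np.sqrt(r)], [np.sqrt(r), 1.0]])
    scale = np.array([np.sqrt((1 - t) * info), np.sqrt((1 - t) * r * info)])

    def s_values(z, deltas):
        # Non-rejection probability (2.6) for each row of deltas
        upper = (z - np.sqrt(t) * x_t) / np.sqrt(1 - t) - scale * deltas
        return np.atleast_1d(multivariate_normal.cdf(upper, mean=np.zeros(2), cov=cov))

    def power(z):
        return 1.0 - s_values(z, delta_samples).mean()

    def alpha(z):
        return 1.0 - s_values(z, np.zeros((1, 2)))[0]

    # Feasible start: each test is given alpha_star / 2
    z_start = np.sqrt(1 - t) * norm.ppf(1 - alpha_star / 2) + np.sqrt(t) * x_t
    res = minimize(
        lambda z: -power(z),
        z_start,
        method="COBYLA",
        constraints=[{"type": "ineq", "fun": lambda z: alpha_star - alpha(z)}],
        options={"rhobeg": 0.2, "maxiter": 200},
    )
    return {"z": res.x, "power": power(res.x), "alpha": alpha(res.x),
            "power_start": power(z_start)}


# Hypothetical posterior draws centered at (0.10, 0.20), for illustration only
rng = np.random.default_rng(1)
draws = np.array([0.10, 0.20]) + 0.05 * rng.standard_normal((128, 2))
result = optimize_thresholds(draws, x_t=[1.0, 1.5], r=0.5, t=0.5, info=261.0)
print(result)
```

COBYLA enforces the constraint only up to a small tolerance, so in practice the returned conditional Type I error should be re-evaluated and, if necessary, the thresholds nudged upward before use.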

Reproducibility and implementation details

All computational components—including GPU kernels, quasi-Monte Carlo integration routines, and optimization solvers—are documented in Supplemental Section B (Algorithms S.1–S.3). Our implementation executes the entire workflow on device, performing sampling, CDF evaluation, and aggregation fully within GPU memory without intermediate CPU–GPU data transfer, which substantially improves throughput and scalability. The combination of quasi-Monte Carlo variance reduction and full GPU parallelization yields substantial speed gains over standard CPU computation. A reference implementation in Python/CUDA is openly available at the project’s GitHub repository to facilitate reproducibility and adaptation to other adaptive trial designs.

4. Application and numerical examples

To demonstrate the application of our Bayesian three-branch adaptive design and its operating characteristics, we conduct a set of computational experiments and discuss their results. The scenarios are chosen to reflect realistic phase III oncology settings and to stress-test the design under varying biomarker prevalence, biomarker effect size, prior strength, and interim timing.

4.1. Parameter settings of the computational experiments

We evaluate the operating characteristics under the inputs in Table 1. The grid spans interim fraction $t \in \{0.25, 0.50\}$, subgroup prevalence $r \in \{0.10, 0.15, \dots, 1.00\}$, prior strength via $k$, and prior probability of a subgroup effect $p_{1}$. For each scenario, the derived interim statistics $X_{t}$ are set to their conditional expectations given the assumed true effects $(δ, δ + d)$ under the model in (2.1), namely

$$E [X_{1, t} ∣ Δ_{1} = δ] = \sqrt{I t} \, δ, \quad E [X_{2, t} ∣ Δ_{2} = δ + d] = \sqrt{r I t} \, (δ + d)$$

or equivalently $X_{2, t} = \sqrt{r} X_{1, t} + \sqrt{r I t} d$. This lets us compare designs at a common information scale and isolates the impact of branching and thresholds.

Table 1.
Summary of user inputs explored in the numerical experiments.

Parameter	Value/specification
$X_{t}$ (for scenario generation)	$X_{1, t} = \sqrt{I t} δ$ ;
	$X_{2, t} = \sqrt{r} X_{1, t} + \sqrt{r I t} d = \sqrt{r I t} (δ + d)$
$I$	157 (no biomarker); 261 (with biomarker)
$r$	19 equally spaced points from $0.10$ to $1.00$ (step $0.05$)
$δ$	$0.25$ (no biomarker); $0.20$ (with biomarker)
$d$	$0$ (no biomarker); $d = 0.6 (1 - r)$ (with biomarker)
$p_{1}$	$\{0.25, 0.50\}$ (no biomarker); $\{0.50, 0.75\}$ (with biomarker)
$t$	$0.25$ or $0.50$
$k$	4 (weak prior, sensitivity analysis in Supplemental materials)

Quantities like $X_{t}$ are derived from assumed true effects for scenario generation and are not design inputs.
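The scenario grid in Table 1 is straightforward to generate programmatically. The sketch below builds the 19 prevalence values and the corresponding interim statistics for one efficacy pattern; the function name `scenario_grid` and its boolean `biomarker` flag are hypothetical conveniences, not part of the reference implementation.

```python
import numpy as np


def scenario_grid(delta, info, t, biomarker):
    """Return (r, d, X_{1,t}, X_{2,t}) for the Table 1 grid of subgroup prevalences."""
    r = np.linspace(0.10, 1.00, 19)                     # step 0.05
    d = 0.6 * (1.0 - r) if biomarker else np.zeros_like(r)
    x1t = np.full_like(r, np.sqrt(info * t) * delta)    # E[X_{1,t} | Delta_1 = delta]
    x2t = np.sqrt(r) * x1t + np.sqrt(r * info * t) * d  # E[X_{2,t} | Delta_2 = delta + d]
    return r, d, x1t, x2t


# "Biomarker effect present" pattern at the earlier interim look
r, d, x1t, x2t = scenario_grid(delta=0.20, info=261.0, t=0.25, biomarker=True)
for row in zip(r[:3], d[:3], x1t[:3], x2t[:3]):
    print(["%.3f" % value for value in row])
```

At $r = 1$ the subgroup coincides with the entire population, so $d = 0$ and the two interim statistics are equal.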

The choice of values for total information $I$ is determined according to Equation (7.3.8) of Chow et al.²³

$$I = \frac{(Z_{1 - α} + Z_{1 - β})^{2}}{\{\log (1 - δ)\}^{2}}, \quad α = 0.025 \text{ (two-sided 5\%)}, \quad 1 - β = 0.95$$

which links the required information to the target log-hazard ratio reduction $Δ$ under a given Type I error and power. Using this relationship, the prespecified values $I = 157$ (no biomarker) and $I = 261$ (with biomarker) correspond approximately to $Δ \approx 0.25$ and $0.20$, respectively. These values are consistent with the expected standardized drug efficacies used in our design inputs ($δ = 0.25$ without biomarker and $δ = 0.20$ with biomarker), ensuring that the chosen $I$ represents a realistic level of information for the assumed effect sizes in the entire population. In all numerical studies, $I$ is fixed at these values to allow direct comparison of operating characteristics across branching strategies, prior settings, and interim timings.
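The two information values can be reproduced from the formula with a few lines of Python. This is a minimal check using SciPy's normal quantile function; the function name `total_information` is hypothetical.

```python
from math import log

from scipy.stats import norm


def total_information(delta, alpha=0.025, power=0.95):
    """I = (Z_{1-alpha} + Z_{1-beta})^2 / log(1 - delta)^2."""
    z_sum = norm.ppf(1 - alpha) + norm.ppf(power)
    return z_sum**2 / log(1 - delta) ** 2


info_no_biomarker = total_information(0.25)    # rounds to 157
info_with_biomarker = total_information(0.20)  # rounds to 261
print(round(info_no_biomarker), round(info_with_biomarker))
```

Note that the denominator uses $\log (1 - δ)$, so the formula treats $δ$ as a proportional hazard reduction when converting it to the log scale.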

To reflect increasing informativeness with $r$, we set the working prior covariance as in Section 2.2

$$Σ_{0} = \begin{pmatrix} 1 & \sqrt{r} \\ \sqrt{r} & 1 \end{pmatrix}, \quad Σ_{1} = \left( \frac{δ}{k} \right)^{2} Σ_{0}$$

with $k \geq 1$ tuning prior dispersion. We consider two canonical efficacy patterns: (a) No biomarker effect $(δ = 0.25, d = 0)$; and (b) Biomarker effect present $(δ = 0.20, d = 0.6 (1 - r))$. When discussing “weak” versus “strong” biomarker effects in the results, we refer only to the magnitude induced by $d = 0.6 (1 - r)$ at different $r$ (small $r \Rightarrow$ larger $d$, large $r \Rightarrow$ smaller $d$); it is not a separate model specification. We emphasize that $X_{t}$, $δ$, $d$, $p_{1}$, and $Σ_{1}$ are scenario-defining quantities for operating-characteristic evaluation, not tunable design parameters.

4.2. Numerical results

4.2.1. Optimal power across scenarios

We compare the three branches (entire population, subgroup-only, joint) across $r$ , $t$ , and $p_{1}$ under each efficacy pattern, and identify the strategy that maximizes expected power subject to $α^{⋆} = 0.025$ at the branch level, as defined in Section 2.3. Major results of these experiments are shown in Figure 2.

Figure 2.

Operating characteristics of the three-branch design as a function of the subgroup proportion $r$ under two efficacy scenarios (rows) and two prior settings (columns). Top row: no biomarker effect, with overall efficacy $δ = 0.25$ and $d = 0$ . Bottom row: biomarker effect present, with lower overall efficacy $δ = 0.20$ and subgroup effect $d = 0.6 (1 - r)$ . Accordingly, in the bottom row the subgroup effect varies inversely with subgroup size $r$ : smaller values of $r$ (left side of the panels) correspond to larger $d$ and hence stronger biomarker effects, whereas larger values of $r$ correspond to smaller $d$ and weaker biomarker effects. Left and right columns correspond to different prior probabilities of a subgroup effect ( $p_{1}$ values), which have only a modest impact on the qualitative patterns. Colors distinguish interim analysis times $t = 0.25$ and $t = 0.50$ . Curves show the expected power of the joint, subgroup-only, and full-population branches.

No biomarker effect ( $δ = 0.25, d = 0$ , the top row of Figure 2)

The joint and entire-population branches deliver nearly identical power for all $r$ , reflecting the absence of effect heterogeneity. The subgroup branch’s power increases with $r$ and coincides with the entire-population branch at $r = 1$ . Later interim time ( $t = 0.50$ ) can slightly reduce power due to the implicit multiple-testing cost without compensatory enrichment benefit. Because $d = 0$ , the mixture prior effectively collapses ( $p_{1}$ has no operational impact), consistent with our posterior mechanics in Section 2.2.

Weak biomarker effect ( $δ = 0.20, d = 0.6 (1 - r)$ at larger $r$ )

This setting corresponds to the right-hand side of the bottom-row panels in Figure 2, where the subgroup advantage is present but moderate because larger $r$ implies smaller $d$ . When the subgroup advantage exists but is moderate, the joint branch is optimal for intermediate $r$ (roughly $0.4$ – $0.8$ in the bottom row of Figure 2), where borrowing across populations improves efficiency. The subgroup branch underperforms in this range because it discards information from the complement group, while the entire-population branch is uniformly least powerful. Power for the subgroup branch decreases with $r$ (dilution of $d$ ), whereas the joint branch remains comparatively stable.

Strong biomarker effect ( $δ = 0.20, d = 0.6 (1 - r)$ at smaller $r$ )

This setting corresponds to the left-hand side of the bottom-row panels in Figure 2, where the subgroup advantage is strong because smaller $r$ implies larger $d$ . For sufficiently small subgroup size $r < 0.4$ , the subgroup branch yields the highest power by focusing recruitment and testing on the enriched population (Section 2.3). As $r$ approaches $0.5$ , $d$ shrinks and the joint branch becomes preferable. The subgroup branch is more sensitive to $p_{1}$ because posterior learning of a true subgroup advantage ( $p_{2} ↑$ ) has a larger impact on its decision quality.

Across all scenarios

The relative performance of the three models depends critically on the interplay among the subgroup proportion $r$ , biomarker effect $d$ , interim timing $t$ , and the information-adjusted conditional distribution of the interim test statistics $X_{t}$ . As $r \to 1$ , the subgroup effect vanishes and the power of all models converges. The subgroup model is optimal when effect heterogeneity is large, the joint model performs best when differences are moderate—with its performance closely tracking the magnitude of $d$ —and the entire-population model is preferred when no biomarker effect is present. Together, these findings highlight that the proposed framework adapts flexibly across a broad spectrum of biomarker prevalence and treatment-effect heterogeneity.

4.2.2. Impact of prior variance on optimal power

We vary $k$ in $Σ_{1} = (δ / k)^{2} Σ_{0}$ to study how prior dispersion affects operating characteristics at interim. As $k$ increases, the prior variance decreases, concentrating around $δ$ (see Supplemental Figure 1). For a non-interim single-population setting, $Q_{0}$ approaches the familiar bound $1 - Φ (Z_{1 - α} - δ \sqrt{I})$ . In our experiments ( $X_{1, t} = 1.96$ , $I = 211$ , $p_{1} = 0.25$ , $δ = 0.2$ , $d = 0$ ), decreasing prior variance improves power for most branches, with the joint branch approaching the entire-population branch when heterogeneity is negligible. Sharper priors enable more informative posteriors, which reduces uncertainty at interim and yields better thresholding decisions; see Supplemental Figure 2 for trajectories of power versus $k$ .

4.2.3. Robustness to prior misspecification: posterior learning and true power

We stress-test robustness in settings where the data-generating mechanism includes a true biomarker effect (so $p_{true} = 1$). We fix $r = 0.5$, $k = 4$, $t = 0.25$, and take $p_{1} \in \{0.5, 0.75\}$, deliberately understating the prior probability of a subgroup advantage. For each $(p_{1}, δ_{true}, d_{true})$ configuration we analyze three working-prior cases: (i) $\hat{δ} = δ_{true}$ with $\hat{d} < d_{true}$ (biomarker effect underestimated); (ii) $\hat{d} = d_{true}$ with $\hat{δ} < δ_{true}$ (overall efficacy underestimated); and (iii) $\hat{δ} = δ_{true}$, $\hat{d} = d_{true}$ (effects correctly specified) while $p_{1}$ remains conservative.

For each configuration, the adaptive interim design and a non-interim comparator are both evaluated using the true effects to compute true power. This setting mimics a realistic scenario where trial planners adopt cautious priors, and we assess whether posterior updating at interim can recover the loss in operating characteristics that would otherwise arise from prior misspecification.

Table 2 summarizes the results using the power-gain metric $Δ (Q_{true}) = Q_{true} - Q_{true,naive}$ , where $Q_{true}$ and $Q_{true,naive}$ denote the true power of the interim and non-interim designs, respectively. Because both designs are evaluated under identical true parameter values, any positive $Δ (Q_{true})$ directly quantifies the gain achieved through interim posterior learning.

Table 2.
Comparison of interim vs. non-interim designs at $t = 25\%$ , $r = 0.5$ , and $k = 4$ .

True $μ$	Misspec	Prior $μ_{1}^{'}$	Posterior $μ_{2}^{'}$	$p_{2}$	$Δ (Q_{true})$
(a) $p_{1} = 0.50$
$(0.25, 0.65)$	$d < d_{true}$	$(0.25, 0.45)$	$(0.24, 0.46)$	0.897	+0.072
	Correct $(δ, δ + d)$	$(0.25, 0.65)$	$(0.25, 0.65)$	0.947	+0.070
	$δ < δ_{true}$	$(0.20, 0.60)$	$(0.21, 0.61)$	0.992	+0.025
$(0.30, 0.70)$	$d < d_{true}$	$(0.30, 0.50)$	$(0.29, 0.51)$	0.804	+0.044
	Correct $(δ, δ + d)$	$(0.30, 0.70)$	$(0.30, 0.70)$	0.868	+0.042
	$δ < δ_{true}$	$(0.25, 0.65)$	$(0.26, 0.66)$	0.949	+0.029
$(0.35, 0.75)$	$d < d_{true}$	$(0.35, 0.55)$	$(0.34, 0.56)$	0.726	+0.006
	Correct $(δ, δ + d)$	$(0.35, 0.75)$	$(0.35, 0.75)$	0.786	+0.006
	$δ < δ_{true}$	$(0.30, 0.70)$	$(0.31, 0.71)$	0.870	+0.018
(b) $p_{1} = 0.75$
$(0.25, 0.75)$	$d < d_{true}$	$(0.25, 0.55)$	$(0.24, 0.56)$	0.993	+0.045
	Correct $(δ, δ + d)$	$(0.25, 0.75)$	$(0.25, 0.75)$	0.996	+0.042
	$δ < δ_{true}$	$(0.20, 0.70)$	$(0.21, 0.71)$	1.000	+0.013
$(0.30, 0.80)$	$d < d_{true}$	$(0.30, 0.60)$	$(0.29, 0.61)$	0.973	+0.045
	Correct $(δ, δ + d)$	$(0.30, 0.80)$	$(0.30, 0.80)$	0.983	+0.041
	$δ < δ_{true}$	$(0.25, 0.75)$	$(0.26, 0.76)$	0.996	+0.019
$(0.35, 0.85)$	$d < d_{true}$	$(0.35, 0.65)$	$(0.34, 0.66)$	0.943	+0.034
	Correct $(δ, δ + d)$	$(0.35, 0.85)$	$(0.35, 0.85)$	0.958	+0.031
	$δ < δ_{true}$	$(0.30, 0.80)$	$(0.31, 0.81)$	0.983	+0.021

All scenarios assume a true subgroup effect ( $p_{true} = 1$ ), while the working priors use $p_{1} \in \{0.50, 0.75\}$ . For each configuration, we show the true parameter values $(μ)$ , the working prior means $(μ_{1}^{'})$ , the posterior means for the subgroup $(μ_{2}^{'})$ , and the updated posterior probability of a subgroup effect ( $p_{2}$ ). The last column reports the gain in true power, $Δ (Q_{true}) = Q_{true} - Q_{true,naive}$ , quantifying the benefit of interim posterior learning.

Across all settings, posterior updating consistently increases the posterior probability of a subgroup advantage ( $p_{2} > p_{1}$ ), even when the prior effect means $(μ_{1}^{'})$ are correctly specified. This posterior correction of $p_{2}$ contributes to positive $Δ (Q_{true})$ in every case, confirming that interim learning improves true power beyond the non-interim design. Moreover, when either the subgroup advantage $d$ or the overall efficacy $δ$ is underestimated, the misspecified elements of the posterior means $(μ_{2}^{'})$ shift toward the true values, demonstrating effective recovery from conservative priors. The second element of $(μ_{2}^{'})$ , representing the subgroup’s posterior drug efficacy, is never farther from the truth than its prior counterpart and is strictly closer whenever the prior means are misspecified, reinforcing that interim evidence refines both the magnitude and probability of a subgroup effect. Together, these results show that the Bayesian adaptive design systematically corrects prior misspecification in both effect size and subgroup probability, thereby preserving or improving true power while maintaining Type I error control.
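The patterns described above can be checked directly against Table 2. The sketch below transcribes the 18 table rows and tests three properties: $p_{2} > p_{1}$ , $Δ (Q_{true}) > 0$ , and the behavior of the subgroup (second) element of the posterior mean. The variable and function names are hypothetical; the numbers are those reported in the table.

```python
# Each row: (p1, true mu, prior mean mu_1', posterior mean mu_2', p2, power gain)
TABLE2 = [
    (0.50, (0.25, 0.65), (0.25, 0.45), (0.24, 0.46), 0.897, 0.072),
    (0.50, (0.25, 0.65), (0.25, 0.65), (0.25, 0.65), 0.947, 0.070),
    (0.50, (0.25, 0.65), (0.20, 0.60), (0.21, 0.61), 0.992, 0.025),
    (0.50, (0.30, 0.70), (0.30, 0.50), (0.29, 0.51), 0.804, 0.044),
    (0.50, (0.30, 0.70), (0.30, 0.70), (0.30, 0.70), 0.868, 0.042),
    (0.50, (0.30, 0.70), (0.25, 0.65), (0.26, 0.66), 0.949, 0.029),
    (0.50, (0.35, 0.75), (0.35, 0.55), (0.34, 0.56), 0.726, 0.006),
    (0.50, (0.35, 0.75), (0.35, 0.75), (0.35, 0.75), 0.786, 0.006),
    (0.50, (0.35, 0.75), (0.30, 0.70), (0.31, 0.71), 0.870, 0.018),
    (0.75, (0.25, 0.75), (0.25, 0.55), (0.24, 0.56), 0.993, 0.045),
    (0.75, (0.25, 0.75), (0.25, 0.75), (0.25, 0.75), 0.996, 0.042),
    (0.75, (0.25, 0.75), (0.20, 0.70), (0.21, 0.71), 1.000, 0.013),
    (0.75, (0.30, 0.80), (0.30, 0.60), (0.29, 0.61), 0.973, 0.045),
    (0.75, (0.30, 0.80), (0.30, 0.80), (0.30, 0.80), 0.983, 0.041),
    (0.75, (0.30, 0.80), (0.25, 0.75), (0.26, 0.76), 0.996, 0.019),
    (0.75, (0.35, 0.85), (0.35, 0.65), (0.34, 0.66), 0.943, 0.034),
    (0.75, (0.35, 0.85), (0.35, 0.85), (0.35, 0.85), 0.958, 0.031),
    (0.75, (0.35, 0.85), (0.30, 0.80), (0.31, 0.81), 0.983, 0.021),
]


def subgroup_error(mean, truth):
    """Absolute error of the subgroup (second) element against the true value."""
    return abs(mean[1] - truth[1])


all_p2_above_p1 = all(p2 > p1 for p1, _, _, _, p2, _ in TABLE2)
all_gains_positive = all(gain > 0 for *_, gain in TABLE2)

# Rows whose working prior means differ from the truth.
misspecified = [row for row in TABLE2 if row[2] != row[1]]
closer_when_misspecified = all(
    subgroup_error(post, truth) < subgroup_error(prior, truth)
    for _, truth, prior, post, _, _ in misspecified
)
```

All three flags evaluate to true for the tabulated values; in the correctly specified rows the prior and posterior subgroup means coincide with the truth, so there is nothing left to correct.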

Figure 3 provides a complementary view of this learning process by comparing prior and posterior distributions for representative scenarios. Posterior densities exhibit clear contraction toward the true parameters, reflected in reduced Kullback–Leibler divergence and 2-Wasserstein distance relative to the prior. These reductions confirm that the interim data drive the posterior closer to the truth—both in probability mass and geometric proximity—thereby enhancing inference accuracy.
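As a hedged illustration of the geometric part of this comparison: for any distribution $P$ , the squared 2-Wasserstein distance to a point mass at the truth equals the expected squared distance $E_{P} \|X - μ\|^{2}$ , which has a closed form for a Gaussian mixture. The sketch below uses that identity. The component means, covariances, and weights are hypothetical values loosely patterned on a " $δ < δ_{true}$ " row of Table 2, not the quantities behind Figure 3, and the sketch covers only the Wasserstein distance.

```python
def w2_to_point(weights, means, covariances, truth):
    """2-Wasserstein distance from a Gaussian mixture to a point mass at `truth`.

    W2^2 = sum_j w_j * (||m_j - truth||^2 + trace(Sigma_j)).
    """
    total = 0.0
    for weight, mean, cov in zip(weights, means, covariances):
        squared_bias = sum((m - t) ** 2 for m, t in zip(mean, truth))
        trace = sum(cov[i][i] for i in range(len(mean)))
        total += weight * (squared_bias + trace)
    return total ** 0.5


truth = (0.25, 0.65)

# Hypothetical prior: equal weight on "no subgroup effect" and "subgroup effect".
prior_w2 = w2_to_point(
    weights=[0.5, 0.5],
    means=[(0.20, 0.20), (0.20, 0.60)],
    covariances=[[[0.0025, 0.0], [0.0, 0.0025]]] * 2,
    truth=truth,
)

# Hypothetical posterior: weight moved to the effect component, tighter spread.
posterior_w2 = w2_to_point(
    weights=[0.008, 0.992],
    means=[(0.21, 0.21), (0.21, 0.61)],
    covariances=[[[0.0016, 0.0], [0.0, 0.0016]]] * 2,
    truth=truth,
)
```

With these illustrative inputs the posterior distance is smaller than the prior distance, driven mostly by the shift of mixture weight away from the no-effect component.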

Figure 3.

Posterior learning at interim. Panels correspond to the “ $δ < δ_{true}$ ” lines in each sub-block of Table 2. Shaded contours: 95% highest density regions for prior (blue) and posterior (orange). Markers: weighted prior mean ( $+$ ), weighted posterior mean ( $\times$ ), true value ( $*$ ). Subtitles report Kullback–Leibler (KL) divergence and 2-Wasserstein distance to the truth, quantifying information gain.

Together, the results from Table 2 and Figure 3 demonstrate that interim posterior updating not only restores true power lost under misspecified priors but also quantifies learning via explicit information-distance measures, providing robust protection against conservative or uncertain design assumptions.

5. Discussions and conclusion

In this paper, we propose a Bayesian adaptive design for phase III oncology clinical trials. Our design uses a mixture prior to represent uncertainties in both overall drug efficacy and the potential for enhanced response within a biomarker-defined subgroup. Interim results are used to update prior knowledge and obtain the posterior distribution of efficacy parameters in both the entire population and the subgroup. Through this posterior updating, the design corrects prior parameter misspecifications, refines subgroup probabilities, and reduces uncertainty. Conditional on the interim data, the design optimizes the expected study power while controlling the overall Type I error. Based on the optimized expected power, the trial adaptively selects one of the three branches for continuation after interim analysis: testing both populations jointly, testing only the entire population, or testing only the subgroup.

Our design extends and generalizes several recent innovations in adaptive clinical trials. Unlike frequentist strategies such as the “2-in-1” framework^6,7 or the “auto-adaptive alpha allocation” method,¹¹ which predefine reallocation rules for significance levels, our approach dynamically learns from interim data within a Bayesian decision-theoretic framework. Posterior updating not only refines effect-size estimates but also adjusts the inferred probability of a subgroup benefit, allowing decisions to align more closely with the evolving evidence. This adaptive learning makes the design robust to prior misspecification and provides a self-correcting mechanism that protects trial integrity when early assumptions deviate from reality.

5.1. Summary of simulation findings and influence of design parameters

Across scenarios spanning no biomarker effect, moderate differences, and pronounced heterogeneity, the branch that maximizes expected power behaves intuitively: the entire-population branch dominates when no heterogeneity is present; the subgroup branch is best when the biomarker effect is large; and the joint branch provides a stable compromise when differences are modest or uncertain. As $r \to 1$ , subgroup differences vanish and all branches converge in power. In stress tests of conservative planning (underestimated $δ$ or $d$ , and $p_{1} < p_{true}$ ), the interim posterior increased the probability of a subgroup advantage ( $p_{2} > p_{1}$ ), shifted posterior means toward the true effects, and improved true power relative to a non-interim comparator, as quantified by $Δ (Q_{true}) > 0$ in Table 2 and by posterior contraction in Figure 3. Even when the working means matched the truth, gains were observed because $p_{1}$ was conservative and was corrected by interim data.

The interim timing $t$ is consequential: analyses at $t = 0.25$ often outperformed $t = 0.50$ in our studies, striking a better balance between information for posterior learning and the $α$ penalty paid at the end. Subgroup prevalence $r$ also matters operationally and statistically; smaller biomarker-defined subgroups can yield larger effects (benefiting Branch #3) but narrow the target population. Prior dispersion $k$ modulates sensitivity to heterogeneity: tighter priors push the joint branch toward the full-population behavior, while moderately diffuse priors preserve flexibility to detect subgroup signals. Finally, the relative position of $(X_{1, t}, X_{2, t})$ to its model-based expectation helps explain branch switches (e.g. large $X_{2, t}$ with small $X_{1, t}$ favors subgroup-only testing).

5.2. Practical, economic, and operational implications for pharmaceutical development

The development of this Bayesian adaptive framework was motivated by a coauthor’s real-world needs encountered in late-phase oncology programs in the pharmaceutical industry. In early confirmatory design discussions, decision-making often relies on simplified representations of uncertainty that can be communicated clearly to multidisciplinary teams of clinicians, statisticians, and regulatory reviewers. The proposed two-component mixture prior—or its equivalent hierarchical representation with a Bernoulli latent indicator—captures this structure naturally. It allows investigators to encode uncertainty about the existence of a biomarker-defined subgroup effect in a form that is both statistically coherent and intuitively interpretable. This dual representation provides a practical mechanism for achieving consensus within cross-functional drug development teams when data are limited and assumptions must remain transparent.
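To make the hierarchical reading concrete, the following sketch updates the mixture weight for a single interim statistic. It is a univariate simplification with hypothetical names, not the paper's bivariate model: a latent indicator $γ \sim$ Bernoulli( $p_{1}$ ) switches the prior mean between $δ$ and $δ + d$ , and the interim statistic is assumed normal with unit variance given the effect. The information value 26.4 is an illustrative assumption (roughly $I \cdot t \cdot r$ with $I = 211$ , $t = 0.25$ , $r = 0.5$ ).

```python
from math import sqrt
from statistics import NormalDist


def update_subgroup_probability(x, p1, delta, d, prior_sd, info):
    """Posterior probability p2 of a subgroup effect given interim statistic x.

    Model: gamma ~ Bernoulli(p1); theta | gamma ~ N(delta + gamma*d, prior_sd^2);
    x | theta ~ N(theta*sqrt(info), 1). Integrating theta out gives
    x | gamma ~ N((delta + gamma*d)*sqrt(info), 1 + info*prior_sd^2).
    """
    marginal_sd = sqrt(1.0 + info * prior_sd ** 2)
    like_effect = NormalDist((delta + d) * sqrt(info), marginal_sd).pdf(x)
    like_no_effect = NormalDist(delta * sqrt(info), marginal_sd).pdf(x)
    # Bayes' rule on the latent indicator.
    numerator = p1 * like_effect
    return numerator / (numerator + (1.0 - p1) * like_no_effect)


# Hypothetical interim statistic x = 3.0 under a conservative p1 = 0.5.
p2 = update_subgroup_probability(
    x=3.0, p1=0.5, delta=0.25, d=0.40, prior_sd=0.0625, info=26.4
)
```

An interim statistic closer to the mean implied by the effect component pushes $p_{2}$ above $p_{1}$ , while a statistic midway between the two component means leaves the weight unchanged.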

Compared with previous alpha-allocation and fixed-prior approaches, the mixture prior directly reflects how efficacy evidence is discussed and aggregated in pharmaceutical decision forums. It bridges the gap between formal Bayesian modeling and the heuristic reasoning used in development committees, enabling more informed and quantitatively defensible design choices. The resulting posterior learning not only improves statistical efficiency but also enhances communication and regulatory transparency—two aspects critical for adaptive designs to be adopted in confirmatory settings.

It is also important to recognize the broader economic implications of incremental power gains. Although the absolute power improvements reported in Table 2 may appear modest numerically, their practical impact is substantial. For example, a one-percentage-point increase in power is roughly equivalent to a 3.7% reduction in required sample size (under a common confirmatory setting with one-sided $α = 0.025$ and targeted power $1 - β = 90\%$ ). In large-scale oncology trials where each patient can cost USD 200,000–250,000, such efficiency translates into multimillion-dollar savings or shorter timelines to reach conclusive evidence. When viewed from a portfolio perspective, even small statistical improvements propagate across multiple programs, producing a material return on investment. For drugs with expected revenues on the order of USD 1 billion, a 0.5% risk-adjusted improvement in success probability represents a tangible economic contribution of the statistical design.
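The power-to-sample-size conversion quoted above can be reproduced with a short calculation. The sketch assumes the standard normal-approximation formula in which the required sample size for a one-sided test is proportional to $(Z_{1 - α} + Z_{1 - β})^{2}$ ; the function name is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(power, alpha=0.025):
    """(z_{1-alpha} + z_{power})^2, proportional to the required sample size."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1.0 - alpha) + std_normal.inv_cdf(power)) ** 2


# Extra sample size needed to move from 90% to 91% power at a fixed effect size.
extra_fraction = relative_sample_size(0.91) / relative_sample_size(0.90) - 1.0
```

Under these assumptions `extra_fraction` comes out at about 0.037, so a design that gains one percentage point of power at 90% delivers what would otherwise cost roughly 3.7% more patients.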

In addition to these economic implications, the design also offers tangible operational advantages when the subgroup-only branch is selected. When interim results indicate that efficacy is confined to a biomarker-defined subgroup, continuing recruitment only within that subgroup can substantially reduce cost and accelerate completion without compromising power. In such cases, the design achieves either higher statistical efficiency at fixed cost or equivalent power with fewer patients, yielding meaningful resource savings. However, operationalizing this approach presents non-trivial challenges. Transitioning to subgroup-only recruitment mid-trial requires predefined decision criteria, coordination across sites, and careful engagement with regulatory agencies to ensure transparency and interpretability of the adaptation. Regulators may require clear justification that the decision rule was prespecified and that continued enrollment aligns with ethical principles and data integrity. These operational constraints do not diminish the value of the approach but highlight that its implementation should be guided by practical feasibility and regulatory acceptability in each specific development program. Regardless of how recruitment proceeds after the decision to test the subgroup only based on interim evidence, the mathematical framework is unchanged; only the closed-form solution for the subgroup-only branch needs to be adjusted.

From a statistical design perspective, the proposed method selects the preferred branch from three candidates as the one that maximizes predicted power for the observed interim result and design setting. The numerical examples suggest a consistent pattern: the entire-population branch is preferred when there is little evidence of treatment-effect heterogeneity; the subgroup branch is preferred when posterior evidence strongly supports subgroup enrichment; and the joint branch is preferred when heterogeneity is moderate or uncertain. In practice, however, the statistically optimal branch may not always be the most feasible one to implement. Its utility depends on the context of development, the plausibility of a biomarker effect, and the availability of interim data. In programs where efficacy heterogeneity is unlikely or interim data collection is logistically difficult, simpler fixed-threshold approaches may remain more practical. Conversely, when biomarker evidence is uncertain and interim adaptation is feasible, the Bayesian mixture-prior framework offers a conceptually transparent and economically meaningful solution. Users should therefore assess the trade-offs between computational complexity, operational feasibility, and potential gain according to their specific development scenarios.
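The selection rule, together with the feasibility caveat, can be sketched as an argmax restricted to the branches a program can actually implement. The branch labels, the power values, and the `feasible` argument are hypothetical illustrations; the predicted powers would come from the optimization described in the Methods.

```python
def select_branch(predicted_power, feasible=None):
    """Return the branch with the highest predicted power among feasible branches.

    predicted_power: dict mapping branch label -> optimized predicted power.
    feasible: optional collection of branch labels that can be implemented.
    """
    if feasible is None:
        candidates = dict(predicted_power)
    else:
        candidates = {b: q for b, q in predicted_power.items() if b in feasible}
    if not candidates:
        raise ValueError("no feasible branch to select")
    return max(candidates, key=candidates.get)


# Hypothetical predicted powers at one interim result.
powers = {"joint": 0.84, "entire": 0.79, "subgroup": 0.88}

best_statistical = select_branch(powers)
best_feasible = select_branch(powers, feasible={"joint", "entire"})
```

Here the subgroup-only branch is statistically preferred, but if subgroup-only recruitment is not operationally possible the joint branch becomes the choice.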

5.3. Computational considerations and practical implementation

The proposed design involves repeated evaluations of the nested integrals defining the objective function $Q$ and Type I constraint $α$ . This computation is intensive because $S (Δ, X_{t}, Z_{1}, Z_{2})$ in equation (2.6) must be integrated over the posterior mixture distribution of $Δ$ . To make this feasible, we implemented a fully on-device GPU algorithm that evaluates $S$ , $Q$ , and $α$ using quasi-Monte Carlo integration with the Genz transformation¹⁵ and van der Corput lattice sequence.^17,18 All computations are performed entirely on the GPU to avoid CPU–GPU data transfer overhead and to maximize parallel throughput. We further accelerate the constrained optimization using TPS smoothing of $Q$ to mitigate random Monte Carlo noise, followed by gradient-based (BFGS, SLSQP) and derivative-free (COBYLA, Bayesian optimization) solvers.^19–22 This GPU-accelerated implementation enables high-throughput exploration of design parameters across diverse prior combinations, facilitating robust sensitivity analyses. The complete Python/CUDA implementation and code for reproducibility are available on our GitHub repository (https://github.com/ubcxzhang/BayesianDesign4A).
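To show the flavor of the quasi-Monte Carlo step, here is a one-dimensional CPU sketch in plain Python. It is not the GPU implementation and does not use the Genz transformation: it generates base-2 van der Corput points, maps them through the inverse normal CDF to a scalar prior $N(δ, (δ / k)^{2})$ , and averages a single-population power function. All function names and the simplified integrand are assumptions made for illustration; the closed form serves only as a check.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def van_der_corput(index, base=2):
    """Radical inverse: mirror the base-b digits of a positive integer about the radix point."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def qmc_expected_power(delta, prior_sd, info, n_points=4096, alpha=0.025):
    """Quasi-Monte Carlo average of 1 - Phi(z - theta*sqrt(I)) over theta ~ N(delta, prior_sd^2)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    total = 0.0
    for i in range(1, n_points + 1):  # start at 1 so that u is never exactly 0
        u = van_der_corput(i)
        theta = delta + prior_sd * STD_NORMAL.inv_cdf(u)  # inverse-CDF map to the prior
        total += 1.0 - STD_NORMAL.cdf(z_crit - theta * sqrt(info))
    return total / n_points


def closed_form_expected_power(delta, prior_sd, info, alpha=0.025):
    """Exact value of the same integral, available because everything is Gaussian."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf((delta * sqrt(info) - z_crit) / sqrt(1.0 + info * prior_sd ** 2))


qmc_value = qmc_expected_power(delta=0.2, prior_sd=0.05, info=211)
exact_value = closed_form_expected_power(delta=0.2, prior_sd=0.05, info=211)
```

Because the points fill the unit interval evenly, the estimate carries no random Monte Carlo noise for a fixed point count, which is one reason low-discrepancy sequences pair well with a downstream optimizer.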

5.4. Bayesian–Frequentist spectrum

The proposed framework is best viewed as lying on the Bayesian–Frequentist spectrum rather than belonging exclusively to either paradigm. Its Bayesian component arises from the sequence of prior, interim data, and posterior, in which prior assumptions on overall efficacy and subgroup effect are updated by interim data and the resulting posterior distribution is used to guide branch selection and threshold optimization. At the same time, the framework also contains important non-Bayesian components. First, practical design recommendations are obtained by evaluating discrete design points and constructing a smoothed decision surface, which summarizes operating characteristics across parameter settings and supports optimization of the final strategy. Second, the confirmatory stage is calibrated through frequentist Type I error control, with strong control established in Supplemental Section C. Thus, the method combines Bayesian posterior learning for adaptive decision-making with computational decision-surface construction and frequentist confirmatory calibration.

5.5. Future directions

Building upon these findings, we envision two extensions to enhance the flexibility and practical applicability of the proposed design. First, we plan to automate the exploration of a high-dimensional design space by constructing a smooth solution surface for $Q$ across grid points of $(Z_{1}, Z_{2}, r, t, I)$ . Using TPSs, we can interpolate between grid evaluations and visualize a complete decision surface, helping users identify parameter combinations that optimize both power and practical utility. This approach offers a global perspective but requires engineering effort to balance grid density and computational cost. Second, we aim to collaborate with clinical trial practitioners to incorporate realistic utility functions that account not only for statistical power but also for clinical and logistical considerations such as recruitment cost, timeline, and expected patient benefit. This utility-based optimization framework provides a single optimal solution tailored to user-defined objectives, while the grid-based TPS approach offers a global overview of trade-offs. Both approaches are enabled by our GPU framework, which makes such high-throughput evaluations computationally feasible.

5.6. Conclusion

In summary, we developed a Bayesian adaptive design that unifies posterior learning, adaptive $α$ -allocation, and GPU-accelerated computation into a single coherent framework for precision oncology trials. The design maintains nominal Type I error control across all branches, dynamically corrects conservative priors through interim posterior updating, and substantially improves true power under uncertainty. By integrating statistical rigor, computational scalability, and practical interpretability, our framework provides a transparent and efficient approach for confirmatory biomarker-driven clinical trials. Future extensions toward multibiomarker and seamless phase II/III designs will further expand its utility in precision medicine development and regulatory science.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802261449368 - Supplemental material for Bayesian adaptive design for clinical trials with potential subgroup effects

Supplemental material, sj-pdf-1-smm-10.1177_09622802261449368 for Bayesian adaptive design for clinical trials with potential subgroup effects by Xuekui Zhang, Qianyun Zhao, Cong Chen, Belaid Moa and Shelley Gao in Statistical Methods in Medical Research

Footnotes

ORCID iD

Xuekui Zhang

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Xuekui Zhang is supported by the Canada Research Chairs (CRC-2021-00232 X.Z.), Michael Smith Foundation for Health Research Scholar Award (SCH-2022-2553 X.Z.), and NSERC Discovery Grant (RGPIN-2022-03050).

Declaration of conflicting interests

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Cong Chen is employed at Merck & Co. The other authors have no potential conflicts of interest.

Supplemental material

Supplemental material for this article is available online.

References

1. Min H-Y, Lee H-Y. Molecular targeted therapy for anticancer treatment. Exp Mol Med 2022; 54: 1670–1694.

2. Pazdur, Gormley, Kazandjian. Project FrontRunner – a new paradigm for oncology drug development. N Engl J Med 2022; 387: 1441–1443.

3. Chen, Beckman. Hypothesis testing in a confirmatory phase III trial with a possible subset effect. Stat Biopharm Res 2009; 1: 431–440.

4. Chen, Shentu, et al. Adaptive informational design of confirmatory phase III trials with an uncertain biomarker effect to improve the probability of success. Stat Biopharm Res 2016; 8: 237–247.

5. Fan, Zhao. The extension of 2-in-1 adaptive phase 2/3 designs and its application in oncology clinical trials. Contemp Clin Trials 2020; 98: 106148.

6. Chen, et al. Adaptive expansion of biomarker populations in phase 3 clinical trials. Contemp Clin Trials 2018b; 71: 181–185.

7. Chen, Deng. Extensions of the 2-in-1 adaptive design. Contemp Clin Trials 2020; 95: 106053.

8. Chen, Zhang. From bench to bedside, 2-in-1 design expedites phase 2/3 oncology drug development. Front Oncol 2023; 13: 1251672.

9. Chen, Huang, Zhang. Adaptive phase 2/3 design with dose optimization. Contemp Clin Trials 2025; 156: 108048.

10. Chen, Huang, Zhang. Adjustment for inconsistency in adaptive phase 2/3 designs with dose optimization. Pharm Stat 2025; 24: 1539–1604.

11. Shentu, Chen, Pang, et al. Auto-adaptive alpha allocation: a strategy to mitigate risk on study assumptions. Stat Biosci 2018; 10: 342–356.

12. Zhou, Xing, et al. The optimal design of clinical trials with potential biomarker effects: a novel computational approach. Stat Med 2021; 40: 1752–1766.

13. Zhang, Jia, Xing, et al. Application of group sequential methods to the 2-in-1 design and its extensions for interim monitoring. Stat Biopharm Res 2024; 16: 130–139.

14. Chen, Anderson, Mehrotra, et al. A 2-in-1 adaptive phase 2/3 design for expedited oncology drug development. Contemp Clin Trials 2018; 64: 238–242.

15. Genz. Numerical computation of multivariate normal probabilities. J Comput Graph Stat 1992; 1: 141–149.

16. Sloan, Kachoyan. Lattice methods for multiple integration: theory, error analysis and examples. SIAM J Numer Anal 1987; 24: 116–128.

17. Niederreiter. Random number generation and quasi-Monte Carlo methods. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1992.

18. Dick, Pillichshammer. Digital nets and sequences: discrepancy theory and quasi-Monte Carlo integration. Cambridge, UK: Cambridge University Press, 2010.

19. Powell MJD. A direct search optimization method that models the objective and constraint functions by linear interpolation. In: Advances in optimization and numerical analysis. Springer Netherlands, 1994, pp. 51–67.

20. Kraft. A software package for sequential quadratic programming. Köln, Germany: Wissenschaftliches Berichtswesen der DFVLR, 1988.

21. Balandat, Karrer, Jiang, et al. BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In: Advances in neural information processing systems 33, 2020.

22. Akiba, Sano, Yanase, et al. Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, 2019.

23. Chow S-C, Wang, Shao. Sample size calculations in clinical research. New York, USA: CRC Press, 2007.
