Abstract
Even in rare diseases, where the sample size is limited and blinding is less frequently implemented, randomized controlled trials are considered the gold standard to prove efficacy. Randomization is used to mitigate bias, and regulatory guidance recommends investigating the impact of bias on the test decision. We quantified how allocation bias affects the test decision in small-sample two-arm group sequential trials under a biasing policy based on the Blackwell–Hodges convergence strategy. Type I error and power were evaluated under Lan–DeMets spending (Pocock-, O’Brien–Fleming-, Wang–Tsiatis-type functions), with and without futility stopping (non-binding, binding), varying interim timing, number of looks and stage-wise restarting of randomization. Allocation bias inflated type I error most for more restrictive randomization procedures, especially permuted blocks with small block sizes. Spending more alpha at interim reduced inflation. Non-binding futility stopping reduced type I error, while binding futility stopping increased type I error inflation for more aggressive stopping boundaries. Stage-wise restarting modestly reduced inflation for most procedures. Overall, group sequential choices had a secondary effect and did not rescue a predictable randomization scheme. When allocation bias cannot be ruled out (e.g. open-label trials), we recommend less restrictive randomization procedures (e.g. big stick design) or, if using permuted blocks, large block sizes.
Keywords
Introduction
Randomized controlled trials (RCTs) are the gold standard in clinical research, primarily because of their effectiveness in minimizing bias. Despite safeguards in RCTs such as randomization, concealment, and blinding, bias can still affect their results. The US Food and Drug Administration (FDA) and European Medicines Agency caution that any clinical trial may be subject to systematic biases and urge active steps to reduce them, particularly when blinding is not feasible.1,2 One such threat is allocation bias, which is referred to as selection bias in earlier research papers.3 It can be classified as a subtype of selection bias that arises from predictable randomization sequences.4
For example, with permuted block randomization (PBR) with block size
Assessments of allocation bias are mostly descriptive, as in the RoB 2 tool, 7 which rates the bias from domains including the randomization process, deviations from intended interventions, missing data, outcome measurement, and selective reporting via signaling questions. Metrics such as the proportion of correct treatment guesses quantify predictability of a randomization procedure (RP),3,8 but they do not show how much this predictability distorts the test decision. The impact of such bias depends on the trial design,9–11 but group sequential designs (GSDs) have not been systematically evaluated.
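As an illustration of the predictability metric mentioned above, the following minimal sketch enumerates all balanced blocks of a given size and computes the average proportion of correct guesses when the guesser always predicts the currently underrepresented arm. The function names are hypothetical, and counting a guess made at balance as half a correct guess is an assumption of this sketch, not something taken from the article.

```python
from itertools import permutations


def correct_guesses(seq):
    """Expected number of correct guesses for one allocation sequence.

    The guesser predicts the currently underrepresented arm. At balance the
    guess is treated as a coin flip and counted as 0.5 (assumption).
    """
    n_e = n_c = 0
    total = 0.0
    for arm in seq:
        if n_e == n_c:
            total += 0.5
        else:
            guess = "E" if n_e < n_c else "C"
            total += 1.0 if guess == arm else 0.0
        if arm == "E":
            n_e += 1
        else:
            n_c += 1
    return total


def pbr_block_predictability(block_size):
    """Average proportion of correct guesses over all balanced blocks."""
    half = block_size // 2
    blocks = set(permutations("E" * half + "C" * half))
    mean_correct = sum(correct_guesses(b) for b in blocks) / len(blocks)
    return mean_correct / block_size


for k in (2, 4, 8):
    print(k, round(pbr_block_predictability(k), 4))
```

The sketch only quantifies how often the next allocation can be guessed; it says nothing about how much such guessing distorts the test decision, which is the gap the article addresses.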
GSDs are appealing in rare diseases, where sample sizes are small and efficient use of patient data is critical.12,13 These trials are more often open label than conventional trials, 14 increasing the risk of allocation bias. Classic formulations of GSDs assume exact 1:1 allocation at each look.15,16 In practice, the choice of RP can create interim or final imbalances that interact with stopping rules. RP choices that allow imbalance at interim analyses require corresponding choices of stopping boundaries to preserve type I error rates (T1E). Because the Lan–DeMets (LDM) framework recalibrates boundaries using observed information and is flexible under deviations from the planned allocation ratio, we adopt LDM here. 17 We focus on a set of representative RPs that span key properties: unrestricted allocation (complete randomization), terminal balance within blocks (permuted block randomization, with fixed and randomized block sizes and the random allocation rule (RAR)), bounded imbalance via a maximum-tolerated imbalance rule (big stick design (BSD)), probabilistic control around balance (Efron’s biased coin (EBC)), and the combination of bounded imbalance and probabilistic control (Chen’s procedure). We do not consider minimization or covariate-adaptive methods, which target different mechanisms and require modeling of prognostic covariates.
Our objective is to quantify how allocation bias affects T1E and power across representative RPs and how group sequential choices modify that effect. Section 2 defines the biasing policy, GSDs, RPs and the simulation setup. Section 3 is organized as follows: We begin with a fixed-sample design as a benchmark without interim analysis to show how allocation bias alone affects a one-sided
Methods
We consider a two-arm randomized controlled clinical trial that uses a GSD to test a superiority hypothesis for a continuous, normally distributed endpoint
Biasing policy
Our biasing policy will be based on the following assumptions: Patients can be categorized into groups with positive, neutral, and negative expected responses to the outcome. The investigator has a preference to demonstrate that the experimental treatment is superior to the control and can selectively decline patient enrollment. The investigator knows all past treatment allocations but has no knowledge of future assignments and does not know the RP (or its parameters, such as block size). He is aware only of the target allocation ratio. In this setting, a natural choice is the Blackwell–Hodges convergence strategy, which guesses that the next allocation will go to the currently underrepresented arm. The investigator uses this rule to influence enrollment and admit a patient with a positive expected response when the experimental arm is predicted next and a patient with a negative expected response when the control arm is predicted next. When the arms are balanced he admits a patient expected to be neutral.
These assumptions model a clinical trial in which the inclusion and exclusion criteria involve clinical judgment, so that the investigator can decline patient participation until a candidate favorable to him is identified. They further assume that the investigator can classify candidates as likely good, neutral, or bad responders, for example based on a holistic assessment of the patient’s clinical situation.
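The guessing and enrollment rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name and the coding of patient types as +1 (positive expected response), 0 (neutral) and -1 (negative expected response) are assumptions made for this example.

```python
def patient_types(allocations):
    """Patient type admitted at each position under the convergence strategy.

    allocations: iterable of "E" (experimental) or "C" (control).
    Returns +1 when the experimental arm is underrepresented (guess "E"),
    -1 when the control arm is underrepresented (guess "C"), 0 at balance.
    Only past allocations are used, matching the stated assumptions.
    """
    n_e = n_c = 0
    types = []
    for arm in allocations:
        if n_e < n_c:
            types.append(+1)   # guess E next, admit a good responder
        elif n_e > n_c:
            types.append(-1)   # guess C next, admit a bad responder
        else:
            types.append(0)    # balanced, admit a neutral patient
        if arm == "E":
            n_e += 1
        else:
            n_c += 1
    return types


print(patient_types("ECCE"))  # [0, -1, 0, 1]
```

In this example the guess is correct at positions 2 and 4, so a bad responder ends up in the control arm and a good responder in the experimental arm.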
According to ICH E9, 2 the investigator should not be aware of the technical specifications of the RP used. However, when using this strategy, the guessing strategy would be the same independent of whether the investigator knows the RP or not.
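To see how such a policy could translate into type I error inflation, consider the following rough sketch for a fixed-sample trial. It rests on several assumptions that are not taken from the article: a one-sided z-test with known unit variance (so the rejection probability per sequence is available in closed form and no outcomes need to be simulated), a hypothetical shift `eta` in expected response for good and bad responders, a hypothetical total sample size of 24, and a nominal level of 0.025.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pbr_sequence(n, block_size, rng):
    """Permuted block randomization; block_size == n gives one balanced list."""
    seq = []
    while len(seq) < n:
        block = ["E"] * (block_size // 2) + ["C"] * (block_size // 2)
        rng.shuffle(block)
        seq.extend(block)
    return seq[:n]


def bias_shifts(seq, eta):
    """Shift in expected response per patient under the convergence strategy."""
    n_e = n_c = 0
    shifts = []
    for arm in seq:
        if n_e < n_c:
            shifts.append(eta)
        elif n_e > n_c:
            shifts.append(-eta)
        else:
            shifts.append(0.0)
        if arm == "E":
            n_e += 1
        else:
            n_c += 1
    return shifts


def rejection_probability(seq, eta, alpha=0.025):
    """Exact rejection probability under H0 for a known-variance z-test."""
    shifts = bias_shifts(seq, eta)
    e = [s for s, a in zip(shifts, seq) if a == "E"]
    c = [s for s, a in zip(shifts, seq) if a == "C"]
    if not e or not c:
        return 0.0  # no test possible with an empty arm
    delta = (sum(e) / len(e) - sum(c) / len(c)) / sqrt(1 / len(e) + 1 / len(c))
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - delta)


def mean_t1e(n, block_size, eta, n_seq=2000, seed=1):
    """Average the rejection probability over randomly drawn sequences."""
    rng = random.Random(seed)
    total = sum(rejection_probability(pbr_sequence(n, block_size, rng), eta)
                for _ in range(n_seq))
    return total / n_seq


print("PBR(4), eta=0.5 :", round(mean_t1e(24, 4, 0.5), 3))
print("PBR(24), eta=0.5:", round(mean_t1e(24, 24, 0.5), 3))
print("PBR(4), eta=0   :", round(mean_t1e(24, 4, 0.0), 3))
```

Without bias the sketch returns the nominal level, and in this toy setting small blocks give a clearly larger rejection probability than a single block covering the whole trial. The numbers are illustrative only and are not comparable to the simulation results reported in the article.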
Under these assumptions, the investigator can change the composition of patients in both arms by guessing the upcoming allocations. For example, under PBR with block size

Illustration of how outcomes in a two-arm randomized controlled trial are affected using an allocation biasing policy for permuted block randomization with block size
We now present a mathematical formulation of this biasing policy. Let
We consider LDM alpha spending with Pocock- and O’Brien–Fleming (OBF)-type functions,19,20 because it updates critical values at each look using observed information fraction. This matters here since several RPs do not keep exact
Analyses use a one-sided
All RPs target an overall 1:1 allocation but can be imbalanced at interim (and sometimes at the final) depending on their constraints. We study the following representatives:
Complete Randomization (CR): Randomization achieved by flipping a fair coin.
Random Allocation Rule (RAR): Randomization assigning the same proportion of patients to each treatment.
Permuted Block Randomization (PBR(
Randomized Permuted Block Randomization (RPBR(
Efron’s biased coin (EBC(
Big stick design (BSD(
Chen’s design (CHEN(
Note, here “RAR” denotes the random allocation rule, not response adaptive randomization. We classify all RPs that limit the number of possible randomization sequences as restricted RPs, that is, all except for CR. To isolate design effects, we evaluate the following scenarios in sequence:
Fixed-sample design (no interim analysis).
One interim look with futility only (non-binding).
Efficacy stopping:
  With Wang–Tsiatis varying alpha spent during interim.
  With LDM Pocock and LDM OBF varying timing of the interim look.
  Power under allocation bias for LDM Pocock and LDM OBF.
  Varying bias magnitude for LDM Pocock and LDM OBF.
Combined efficacy (LDM Pocock/LDM OBF) and futility stopping (non-binding and binding).
Stage-wise restart of randomization (new list in each stage) versus a single full list, varying the number of stages
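The randomization procedures named above can be sketched as simple sequence generators. The definitions below follow the usual textbook descriptions and are assumptions of this sketch rather than the authors' code; the function names, the parameter names `a` (maximum-tolerated imbalance) and `p` (probability of assigning the underrepresented arm), and the use of Python's `random` module are all hypothetical. Randomized block sizes (RPBR) are omitted for brevity.

```python
import random


def complete_randomization(n, rng):
    """CR: fair coin flip for every patient."""
    return [rng.choice("EC") for _ in range(n)]


def permuted_blocks(n, block_size, rng):
    """PBR: each block contains both arms equally often, in random order."""
    seq = []
    while len(seq) < n:
        block = ["E"] * (block_size // 2) + ["C"] * (block_size // 2)
        rng.shuffle(block)
        seq.extend(block)
    return seq[:n]


def random_allocation_rule(n, rng):
    """RAR: one balanced block covering the whole trial (n must be even)."""
    return permuted_blocks(n, n, rng)


def _imbalance_driven(n, rng, prob_under, max_imbalance=None):
    """Shared loop for EBC, BSD and Chen-type rules; d = n_E - n_C."""
    seq, d = [], 0
    for _ in range(n):
        if max_imbalance is not None and abs(d) == max_imbalance:
            p_e = 0.0 if d > 0 else 1.0        # forced assignment at the bound
        elif d == 0:
            p_e = 0.5                          # fair coin at balance
        else:
            p_e = prob_under if d < 0 else 1 - prob_under
        arm = "E" if rng.random() < p_e else "C"
        d += 1 if arm == "E" else -1
        seq.append(arm)
    return seq


def efron_biased_coin(n, p, rng):
    """EBC: favor the underrepresented arm with probability p."""
    return _imbalance_driven(n, rng, p)


def big_stick(n, a, rng):
    """BSD: fair coin unless the imbalance has reached a."""
    return _imbalance_driven(n, rng, 0.5, a)


def chen(n, a, p, rng):
    """Chen-type rule: biased coin plus a hard imbalance bound."""
    return _imbalance_driven(n, rng, p, a)


rng = random.Random(2024)
print("".join(big_stick(12, 2, rng)))
print("".join(permuted_blocks(12, 4, rng)))
```

Passing an explicit `random.Random` instance keeps the generators reproducible and makes it easy to restart a procedure with a fresh list per stage.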
Simulation
For each RP, we generated 16,000 randomization sequences and simulated 1,000 clinical trials per sequence. Outcomes
Results
Investigation of allocation bias strategy for a
-test without interim analysis
As a benchmark, we first examined allocation bias in a fixed-sample trial (no interim analyses) using a one-sided

Mean type I error rates under different randomization procedures in a trial with total sample size
In what follows, we focus on two representative bias levels:

Mean type I error rates under different randomization procedures in a group sequential trial with a maximum sample size of

Mean type I error rates under different randomization procedures in a group sequential trial with a maximum sample size of

Mean type I error rates under different timings of the (efficacy) interim analysis for a maximum sample size of
We next considered a design with a maximum sample size of
Efficacy stopping only
We then evaluated one interim analysis for efficacy using Wang–Tsiatis boundaries with different
We next examine how the timing of the interim analysis influences T1E under these boundaries. The timing axis values
We also evaluated power under allocation bias. Figure 6 presents the results for OBF- and Pocock-type spending functions. For reference, we included the power of PBR without bias (the black line with a square in it). Differences between randomization procedures were most pronounced in the mid-power range (around 50–60%). For example, for OBF-type boundaries, at an effect size of

Mean power under different randomization procedures in a group sequential trial with a maximum sample size of
Next, we investigated the effect of efficacy stopping boundaries under varying levels of allocation bias. Figure 7 shows the T1E as a function of allocation bias

Mean type I error rates under different randomization procedures in a group sequential trial with a maximum sample size of
We then analyzed designs with both efficacy and futility stopping. As shown in Figure 8, under non-binding futility stopping, T1E decreased with stricter futility boundaries (i.e. using lower

Mean type I error rates under different randomization procedures in a group sequential trial with a maximum sample size of
Figure 9 shows the impact of binding futility stopping. Compared with the non-binding case, T1E inflation was even greater. This occurs because futility stopping is incorporated into the efficacy boundaries, which means that the adjusted alpha levels for both the interim and the final analysis become larger than in the non-binding setting. As a result, when using more aggressive (i.e. lower) binding futility boundaries

Mean type I error rates under different randomization procedures in a group sequential trial with a maximum sample size of
GSDs often assume an exact
To allow a fair comparison between stage-wise and full randomization, we assumed that the investigator is aware of the restart and resets the allocation biasing policy at the beginning of each stage. In this setup, the number of previous allocations to each arm is reset to zero whenever randomization restarts. Figure 10 shows the results for OBF- and Pocock-type spending functions, comparing full randomization sequences (solid lines) with stage-wise restarts (dashed lines). The case

Mean type I error rates under different randomization procedures in group sequential trials with maximum sample size
For BSD, CHEN, and EBC, T1E decreased slightly for stage-wise randomization compared to full randomization in the high-bias setting as the number of interim analyses increased. BSD and CHEN, which share the same maximum-tolerated imbalance, become less restrictive with more stages because the imbalance tolerance applies independently to each stage. This property does not apply to EBC. Interestingly, CR also exhibited lower T1E under stage-wise randomization. The reductions for both CR and EBC can be explained by the reset in the investigator’s biasing policy, which alters the effective patient population. For example, each restart introduces a neutral patient at the beginning of the stage. Details on the distribution of neutral, positive, and negative responders under both full and stage-wise randomization are provided in the Supplemental Appendix.
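The effect of resetting the biasing policy at each restart can be illustrated with a small sketch. The function name, the coding of patient types as +1, 0 and -1, and the representation of stages as a list of stage sizes are assumptions made for this example; the sketch only resets the investigator's counts and does not regenerate the randomization list itself.

```python
def staged_patient_types(allocations, stage_sizes=None):
    """Patient types under the convergence strategy with optional resets.

    stage_sizes: list of stage lengths; None means one full list.
    The counts of previous allocations are reset to zero at each stage start,
    so the first patient of every stage is a neutral patient.
    """
    if stage_sizes is None:
        stage_sizes = [len(allocations)]
    types = []
    start = 0
    for size in stage_sizes:
        n_e = n_c = 0  # reset at the start of each stage
        for arm in allocations[start:start + size]:
            if n_e < n_c:
                types.append(+1)
            elif n_e > n_c:
                types.append(-1)
            else:
                types.append(0)
            if arm == "E":
                n_e += 1
            else:
                n_c += 1
        start += size
    return types


full = staged_patient_types("EECC")
staged = staged_patient_types("EECC", stage_sizes=[2, 2])
print(full, full.count(0))      # [0, -1, -1, -1] 1
print(staged, staged.count(0))  # [0, -1, 0, 1] 2
```

For the same allocation sequence, the restart changes which patients are admitted: the stage-wise version contains one additional neutral patient, in line with the explanation given above.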
Finally, under high bias, increasing the number of stages reduced T1E for most restrictive RPs (with the exception of RAR), because more significance level is spent earlier through additional interim analyses. This effect was especially pronounced for Pocock-type boundaries, which allocate more alpha to interim looks than OBF-type boundaries. In particular, T1E dropped substantially under Pocock-type boundaries as the number of stages increased for restrictive RPs with the exception of RAR. By contrast, designs without interim analyses consistently showed higher T1E.
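The difference in how much alpha is spent early can be made concrete with the commonly used Lan–DeMets forms of the two spending functions. The sketch below assumes a one-sided level of 0.025 and an interim look at information fraction 0.5, both hypothetical choices for illustration; the function names are likewise assumptions. Only the first-look critical value is computed, since later boundaries require numerical integration over the joint distribution of the test statistics.

```python
from math import e, log, sqrt
from statistics import NormalDist

N01 = NormalDist()


def obf_type_spending(t, alpha=0.025):
    """OBF-type spending: 2 * (1 - Phi(z_{1 - alpha/2} / sqrt(t))), 0 < t <= 1."""
    return 2 * (1 - N01.cdf(N01.inv_cdf(1 - alpha / 2) / sqrt(t)))


def pocock_type_spending(t, alpha=0.025):
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t), 0 < t <= 1."""
    return alpha * log(1 + (e - 1) * t)


def first_look_critical_value(spent):
    """One-sided critical value that spends exactly `spent` at the first look."""
    return N01.inv_cdf(1 - spent)


t = 0.5  # hypothetical information fraction of the interim look
for name, fn in (("OBF-type", obf_type_spending),
                 ("Pocock-type", pocock_type_spending)):
    spent = fn(t)
    print(name, "alpha spent:", round(spent, 5),
          "critical value:", round(first_look_critical_value(spent), 3))
```

Both functions spend the full alpha at information fraction 1, but the Pocock-type function spends roughly ten times as much as the OBF-type function halfway through, which leaves correspondingly less for the final analysis.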
In summary, here are the key observations of our simulation:
When using a biasing strategy, the T1E was inflated for all RPs evaluated. Only the magnitude differed by the restrictiveness of the procedure. The observed gains in power depend on the RP and come at the expense of T1E control.
Generally, allowing for early stopping (efficacy, non-binding futility) decreases the inflation. The larger a trial, the more biased the test statistic will become. Therefore, if the likelihood of stopping early increases, on average less time is left to bias the test statistic.
If a trial allows for early stopping for efficacy, less alpha is left for the final analysis.
If early stopping for futility using non-binding futility boundaries is implemented, less time is left to “fully” bias the test statistic, which would otherwise have led to a rejection.
Under our investigated simulation scenarios and design settings, the RP was the main driver for allocation bias. Restrictive RPs, such as PBR with small block sizes, are more predictable and produce the largest inflation of T1E.
OBF-type boundaries resulted in more T1E inflation than Pocock-type boundaries, as more alpha is reserved for the final analysis.
More stages reduced the impact of bias on the T1E.
Stage-wise restarting lowered T1E for most RPs (BSD, CHEN, EBC, CR), had no effect for PBR, and increased T1E for RAR.
The inflation of T1E reduced with more stages, especially under Pocock-type boundaries, as more significance is spent earlier.
With non-binding futility, T1E decreased as the futility threshold became more aggressive (smaller
Larger block sizes for PBR reduced T1E inflation, randomized block sizes (e.g. RPBR(
The purpose of this research was to investigate the effect of allocation bias under different design choices for two-arm RCTs, focusing specifically on the impact of different RPs in conjunction with different types of interim analyses. Our findings revealed differences in the T1E across various RPs in all scenarios considered. PBR remains the most widely used RP in GSDs.17 However, under our biasing policy, PBR with small block sizes produced the greatest T1E inflation among the evaluated procedures, as was also observed in other trial designs.9–11 If interim analyses for both futility and efficacy are performed, the T1E inflation is smaller than in designs without interim analysis. The general tendency is that the higher the likelihood of stopping early, the smaller the T1E inflation. For example, if all design parameters such as the timing of the interim analysis and the maximum sample size are the same, OBF-type efficacy boundaries lead to a higher inflation of the T1E than Pocock-type boundaries, because the latter have a higher probability of stopping early. Our results should not be misinterpreted as a recommendation to use Pocock designs and many interim analyses. Pocock-type spending shows less inflation simply because less alpha is left for the final analysis (where the test statistic is most biased). The more patients are included, the more biased the test statistic will be. Similarly, using more interim analyses leaves less alpha for the final analysis. The simulations also show that using binding futility boundaries, that is, enlarging the adjusted alpha levels of the efficacy boundaries to account for potential futility stopping, is not advisable, as it further increases T1E inflation. In general, early stopping should require convincingly large effects, which aligns more closely with OBF-type spending. Stage-wise restarting of the RP might be useful, for example, when balance by stage is important, but it also alters the RP itself. For example, RAR becomes much more restrictive (behaving like PBR with block size equal to stage size) and BSD allows for a larger overall imbalance, as the imbalances are additive across stages. In our simulations, stage-wise restarting slightly reduced T1E for most procedures; the exceptions were PBR(
Whenever feasible, blinding should be ensured. Open-label settings increase predictability. From a bias perspective, unrestricted randomization is preferable, but if there is no bias, this results in a loss of power17 and, in extreme cases with very early interim looks, could even leave one group without observations at interim, preventing analysis (or with only one observation, in which case a
For the biasing policy, we used the concept of good, neutral, and bad responders, which can be thought of as a single, unknown prognostic factor. However, in practice, selective enrollment may be driven by the investigator’s overall clinical impression, which may not be directly captured by observable prognostic factors and may even evolve during trial conduct, as clinicians gain experience with the disease course independent of treatment allocation. A second important consideration is feasibility in rare diseases. We do not assume that investigators can defer enrollment indefinitely until a favorable or unfavorable patient appears. Instead, the biasing policy should be interpreted as a worst-case model used to assess the robustness of different designs. When the inclusion and exclusion criteria leave room for judgment, enrollment decisions may be influenced by the investigator’s expectation of a successful treatment rather than by the goal of reaching a specific sample size. Even if the total population is very limited, an investigator would not include a patient in a clinical trial if he expects that the patient will not profit from the treatment. Moreover, rare disease trials are often conducted in specialized centers where investigators may know patients well and may have some influence over the timing and ordering of enrollment. Tamm et al. discuss the rejection of candidates and show that even an imperfect or constrained selection mechanism can affect the test decision.30 These points underscore why we emphasize robustness at the design stage. In applied work, sensitivity analyses that explicitly model allocation bias mechanisms can be considered; for example, Wied et al. re-analyzed the EPISTOP trial and performed a bias-corrected analysis by including a term representing allocation bias in the statistical model.31
In multi-center trials, allocation bias can arise when stratifying by center.10 Central randomization reduces this risk but does not fully eliminate it if a center contributes consecutive patients to a central list. For example, with PBR(6) the next assignment can be correctly predicted, on average,
The scope of our simulation was necessarily limited and we followed the stepwise way adaptive designs are often planned. We did not aim to cover every setting. For example, we focused on a total sample size of
It could be interesting to examine allocation bias when the outcome is defined as a change from baseline. Future work should also evaluate multi-arm designs, such as platform trials, where staggered entry of arms, stage-wise restarting of randomization and unequal allocation ratios may alter the extent of bias. Such investigations will support the credibility of innovative trial designs.
Conclusion
Bias in clinical trials is a key concern for regulatory agencies.1,2,34 Our findings provide insights into mitigating allocation bias in GSDs with small sample sizes. The size of the problem depends mainly on the RP, with design choices regarding the GSD having only secondary effects in our investigated scenarios. Therefore, if allocation bias cannot be ruled out, a less restrictive RP, such as BSD, should be chosen over, for example, PBR with small block sizes. If PBR is used, large block sizes should be chosen. Although allocating more alpha at interim led to a lower T1E inflation than preserving more alpha for the final analysis, this should not be taken as a justification for this type of spending function. In our simulation studies, the maximum sample size was fixed when comparing different GSDs. It has to be noted that using Pocock-type boundaries instead of OBF-type boundaries would also require much larger sample sizes.18,19 Similarly, if futility stopping is implemented, the sample size has to be increased accordingly. Thus, the choice of design parameters in GSDs should not depend on how they might mitigate the impact of potential biasing strategies. Our work demonstrates that the choice of the RP is key.
List of abbreviations
BSD: big stick design with maximum-tolerated imbalance
CHEN: Chen’s design with probability
CR: complete randomization
EBC: Efron’s biased coin with probability
GSD: group sequential design
LDM: Lan and DeMets
OBF: O’Brien and Fleming
PBR: permuted block randomization with block length
RPBR: randomized permuted block randomization with block length
RAR: random allocation rule
RCT: randomized controlled trial
RP: randomization procedure
T1E: type I error rate
Supplemental Material
Supplemental material for When randomization is not random: Allocation bias in small sample, group sequential randomized clinical trials by Daniel Bodden, Ralf-Dieter Hilgers and Franz König in Statistical Methods in Medical Research:
sj-R-1-smm-10.1177_09622802261442914
sj-R-2-smm-10.1177_09622802261442914
sj-pdf-3-smm-10.1177_09622802261442914
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: R-DH is coordinator, and DB and FK are members of RealiseD supported by the Innovative Health Initiative Joint Undertaking (IHI JU) under grant agreement No. 101165912. The JU receives support from the European Union’s Horizon Europe research and innovation program and COCIR, EFPIA, EuropaBio, MedTech Europe, and Vaccines Europe. Views and opinions expressed are those of the author(s) only. This publication reflects the authors’ views. They do not necessarily reflect those of the Innovative Health Initiative Joint Undertaking and its members, who cannot be held responsible for them. R-DH received funding from the European Rare Disease Research Coordination and Support Action consortium (ERICA) funded by the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 964908. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Code availability
Supplemental material
Supplemental material for this article is available online.
References
