Abstract
In oncology, phase II clinical trials are often planned as single-arm two-stage designs with a binary endpoint, for example, progression-free survival after 12 months, and the option to stop for futility after the first stage. Simon’s two-stage design is a very popular approach but depending on the follow-up time required to measure the patients’ outcomes the trial may have to be paused undesirably long. To shorten this forced interruption, it was proposed to use a short-term endpoint for the interim decision, such as progression-free survival after 3 months. We show that if the assumptions for the short-term endpoint are misspecified, the decision-making in the interim can be misleading, resulting in a great loss of statistical power. For the setting of a binary endpoint with nested measurements, such as progression-free survival, we propose two approaches that utilize all available short-term and long-term assessments of the endpoint to guide the interim decision. One approach is based on conditional power and the other is based on Bayesian posterior predictive probability of success. In extensive simulations, we show that both methods perform similarly, when appropriately calibrated, and can greatly improve power compared to the existing approach in settings with slow patient recruitment. Software code to implement the methods is made publicly available.
Introduction
In oncology, phase II clinical trials are often carried out to test in a specific population of patients whether a therapy is sufficiently effective to warrant further investigation. Due to both ethical and practical constraints, sample sizes of such trials are typically strictly limited. Therefore, many trials apply a single-arm design to explore the treatment’s efficacy before initiating a larger randomized phase II or III clinical trial. Endpoints in such single-arm trials are usually binary, such as response according to the Response Evaluation Criteria In Solid Tumors or the rate of progression-free survival (PFS) at a specific time point. PFS itself is naturally a time-to-event endpoint, but the measurements of progression are often not continuous but occur in a few discrete time steps, for example, every 3 months. In many trials an option to stop for futility after an interim analysis is implemented. A popular design that allows stopping for futility for a single-arm trial with a binary endpoint is Simon’s two-stage design. 1 This design requires that at least a fixed number of responses/survivors be observed among the patients of the first stage to allow continuation to the second stage. The popularity of the design can be attributed to a large extent to its simplicity as well as its relatively small sample size, which often allows the studies to be conducted in a single center, thus reducing the expenses of the trial.
If the primary endpoint is assessed some time after recruitment or treatment initiation, for example, if the endpoint is PFS after 12 months confirmed via magnetic resonance imaging, the interim analysis may cause a severe delay, since the second stage usually cannot be initiated before a sufficient number of survivors has been observed in the first stage. This enforced trial interruption not only prolongs drug development, but might also negatively affect the momentum and discourage involved partners. In such situations, it would be attractive to use information on short-term endpoints to guide the decision whether to stop for futility or to continue, so that there is little or no delay during the trial conduct. As a solution to this problem, Kunz et al. 2 proposed a method that uses a short-term endpoint, for example, 3-month PFS, for the interim decision, while for the final decision at the end of the trial, a more relevant long-term endpoint, for example, 12-month PFS, is used as the primary endpoint. The approach requires making explicit a priori assumptions about the short-term endpoint, and therefore about its correlation with the long-term endpoint. It is possible to specify this assumption in the form of weakly informative distributions, but still, a fixed threshold of short-term survivors that need to be observed to make the interim decision is set before patient recruitment is initiated.
A similar approach to Kunz et al. 2 was presented recently by deVeaux et al. 3 Here, also a fixed threshold for the number of short-term responders is set and particular emphasis is put on the trial duration, which depends on the patient accrual rate. Within the Bayesian framework, Lin et al. 4 proposed a Bayesian time-to-event method to evaluate futility by taking into account the time that patients with pending outcomes have already spent in the trial in single-arm multi-stage designs. However, the method cannot handle early outcome assessments, and once again fixed thresholds for the interim decision are set in the planning stage.
When we planned an actual phase II clinical trial in the field of glioblastoma research and considered implementing the design by Kunz et al., the clinicians were very hesitant to provide any assumptions about the short-term endpoint, which would determine the threshold for stopping for futility and hence the fundamental design of the clinical trial. Even the assumption about the long-term endpoint, 12-month PFS, was a rather rough estimate based on a single study in similar patients, and for any earlier assessments of PFS, not even historical data were available because published studies only reported data for 12-month PFS. This was our motivation to seek a design that allows incorporation of short-term endpoints into the decision without having to make additional assumptions.
A concept that is naturally suited to guide interim decisions in clinical trials, such as stopping for futility, is conditional power (CP), which is the statistical power at the time of an interim analysis given the already observed data. 5 In the randomized clinical trial setting, recent studies used the concept of CP to evaluate futility,6,7 and proposed to further improve the decision-making process by adding information from prognostic baseline covariates. 8 A similar concept to CP is the Bayesian posterior predictive probability of success (PoS).9,10 The PoS evaluates the current probability of reaching the trial’s final goal under consideration of all currently available information. It can be used regardless of whether the primary analysis is Bayesian or frequentist.10,11 There have been approaches to combine the PoS and multi-stage designs similar to Simon’s design.12,13 For time-to-event endpoints, Waleed et al. 14 proposed a method that evaluates futility using the Bayesian PoS and allows more than one interim analysis. However, this method assumes continuous assessment of the endpoints and requires relatively large sample sizes.
Using both the CP and the PoS framework, we propose two new methods for futility stopping in single-arm phase II oncology settings: STE–CP and STE–PoS (combining
In Section 2, we present an overview of the existing approaches on which the new methods are built. In Section 3, we introduce the new methods STE–CP and STE–PoS. In Section 4, we present a simulation study to assess the operating characteristics of the new methods and compare them to the existing designs, considering various scenarios for the true underlying survival functions as well as different patient accrual rates. We conclude with a discussion in Section 5.
Methods
Simon’s two-stage design
We consider the case of a standard phase II single-arm clinical trial in the field of oncology with a binary endpoint and the option to stop for futility. Such a design was proposed by Simon 1 and is widely applied today. The original work used response as an endpoint, but we will focus on its application to survival endpoints with discrete time-to-event measurements, which is a common situation in clinical trials, for example, when tumor progression is assessed via magnetic resonance imaging every 3 months. Simon’s two-stage design tests the hypothesis
For futility stopping, an important operating characteristic of Simon’s two-stage design is the probability of early termination (PET). Under the null hypothesis, the PET should be large, while under the alternative hypothesis, it should be small, and definitely not larger than
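For a standard two-stage design with a binding futility rule, the PET and the probability of rejecting the null hypothesis follow directly from the binomial distribution. The following is a minimal sketch under stated assumptions: the names `r1`, `n1`, `r`, and `n` (stage-1 threshold, stage-1 sample size, overall threshold, overall sample size) and the numerical values are hypothetical and chosen only for illustration, not taken from this manuscript. The trial stops if at most `r1` of the `n1` stage-1 patients survive, and the null hypothesis is rejected if more than `r` of all `n` patients survive.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, min(k, n) + 1))


def prob_early_termination(r1: int, n1: int, p: float) -> float:
    """PET: stop for futility if at most r1 of the n1 stage-1 patients survive."""
    return binom_cdf(r1, n1, p)


def prob_reject_null(r1: int, n1: int, r: int, n: int, p: float) -> float:
    """Probability of rejecting H0 under a binding futility rule.

    The trial continues only if more than r1 survivors are seen in stage 1,
    and H0 is rejected if more than r survivors are seen among all n patients.
    """
    n2 = n - n1
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        # Survivors still needed in stage 2 to exceed r overall.
        needed = r + 1 - x1
        p_stage2 = 1.0 if needed <= 0 else 1.0 - binom_cdf(needed - 1, n2, p)
        total += binom_pmf(x1, n1, p) * p_stage2
    return total


# Hypothetical design and hypotheses, used only for illustration.
R1, N1, R, N = 1, 10, 5, 29
P_NULL, P_ALT = 0.10, 0.30

pet_null = prob_early_termination(R1, N1, P_NULL)
pet_alt = prob_early_termination(R1, N1, P_ALT)
type_one_error = prob_reject_null(R1, N1, R, N, P_NULL)
power = prob_reject_null(R1, N1, R, N, P_ALT)

print(f"PET under H0: {pet_null:.3f}, PET under H1: {pet_alt:.3f}")
print(f"Type I error: {type_one_error:.3f}, power: {power:.3f}")
```

With a binding rule, a large PET under the null hypothesis saves patients, whereas every trial stopped early under the alternative hypothesis is a lost rejection, so the PET under the alternative directly bounds the achievable power.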
Two-stage trials using a short-term endpoint
In a two-stage trial, such as Simon’s two-stage design, recruitment of patients is usually stopped after the first stage until enough survivors have been observed among the
The hypothesis tested at the end of the trial is the same as in Simon’s two-stage design, that is,
Approaches to interim monitoring
At the time point of interim analysis in a phase II clinical trial, the key question that is addressed is: Is the trial sufficiently likely to achieve its goal based on the data observed so far, or should the trial be stopped for futility? There exist two frameworks that are both naturally suited to answer this question: the CP15,5 and the PoS.10,9,16 In the literature, the terminology is not entirely consistent, so in the following their use in this manuscript is defined. Generally speaking, both approaches are used to determine the probability of trial success, that is, rejecting the null hypothesis at the end of the study, given the data observed so far. They differ in how they incorporate the information about the data observed so far. The CP is determined based on the assumption about an unknown parameter
New approaches: STE–CP and STE–PoS
Motivation
If recruitment is rather slow, as in many single-arm phase II trials, it is likely the interim data contains information not only on the short-term assessments of the endpoint but also on later assessments or even the long-term outcome, either because some patients already had the event and hence cannot be event-free at the long-term endpoint or because they have been observed for so long that their long-term outcome is known. This information cannot be used in Kunz’s design for the interim decision. In contrast, our new methods incorporate this information to evaluate the CP or the PoS in the interim analysis, while still allowing for the same reduction of delay (compared to Simon’s design) due to the interim analysis as Kunz’s design. The trial will continue only if the CP or the PoS are sufficiently large in an interim analysis, which can be conducted at any (prespecified) point of time during the trial.
General notation
In the following, we introduce our notation. A summary is provided in Table 1. The term “survival” will be used to refer to time-to-event endpoints in general, including particularly PFS, and the term “time point” to refer to the predefined time of the measurement or assessment of a patient’s survival status. We assume that a patient’s survival is measured at discrete time points
A key concept of survival analysis is censoring. Since in early phase II trials the time-to-event outcomes are typically treated as binary endpoints with primary evaluation at a specific time point, for which censoring technically does not exist, censoring due to loss to follow-up is commonly treated as failure in order to follow a conservative analysis strategy. Typically, this is expected to happen infrequently, because these trials are very small and follow-up is rather short. However, if the interim analysis is performed before all recruited patients have had their long-term outcome observed, there is administrative censoring at the interim. Patients who have been censored in this way are taken into account in equations (1) and (2) and will be considered accordingly in the decision process as explained hereinafter.
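To illustrate how administratively censored patients can still contribute to the interim data, the following sketch counts, for each assessment, the patients with an observed outcome and the survivors among them. It is a simplified discrete-time (life-table style) illustration and not necessarily identical to the likelihood in equation (1). The assessment schedule, the `Patient` class, the function names, and the example data are all hypothetical. Loss to follow-up, which is treated as failure, is not modeled here.

```python
from dataclasses import dataclass
from math import inf

ASSESSMENT_MONTHS = (3, 6, 9, 12)  # assumed schedule, months since enrollment


@dataclass
class Patient:
    enrolled_at: float  # calendar month of enrollment
    event_time: float   # months from enrollment to event (inf if no event)


def interim_counts(patients, interim_at):
    """Per assessment: patients at risk with an observed outcome, and survivors.

    A patient whose next assessment lies after the interim analysis is
    administratively censored and drops out of all later risk sets.
    """
    at_risk = [0] * len(ASSESSMENT_MONTHS)
    survived = [0] * len(ASSESSMENT_MONTHS)
    for pat in patients:
        follow_up = interim_at - pat.enrolled_at
        for k, month in enumerate(ASSESSMENT_MONTHS):
            if follow_up < month:
                break  # assessment not yet performed: administratively censored
            at_risk[k] += 1
            if pat.event_time > month:
                survived[k] += 1
            else:
                break  # event observed: not at risk at later assessments
    return at_risk, survived


def conditional_survival_estimates(at_risk, survived):
    """Observed proportion surviving each interval; None if nobody was at risk."""
    return [s / n if n > 0 else None for n, s in zip(at_risk, survived)]


# Hypothetical interim data: five patients, interim analysis at calendar month 13.
patients = [
    Patient(enrolled_at=0, event_time=inf),   # complete follow-up, long-term survivor
    Patient(enrolled_at=2, event_time=5),     # event found at the 6-month assessment
    Patient(enrolled_at=6, event_time=inf),   # censored after the 6-month assessment
    Patient(enrolled_at=9, event_time=2),     # event found at the 3-month assessment
    Patient(enrolled_at=11, event_time=inf),  # no assessment performed yet
]
at_risk, survived = interim_counts(patients, interim_at=13)
estimates = conditional_survival_estimates(at_risk, survived)
print(at_risk, survived, estimates)
```

In this example, the third patient contributes to the first two assessments only, and the second and fourth patients are already known to be long-term failures although they were enrolled less than 12 months before the interim analysis.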
Glossary of terms.
The calculation of the CP requires the prediction of future data. At the time of the interim analysis, STE–CP assumes the estimated empirical trend for long-term survival probability for each individual, that is, the maximum likelihood estimate of the survival probabilities based on the data observed so far. The trial design aims to assess the outcome of each patient at
Determination of the maximum likelihood estimate requires at least one patient with available data; if this is not the case, then an assumed value must be plugged in instead. Since power is calculated under the alternative hypothesis, we will use the a priori assumption for long-term survival under the alternative hypothesis,
STE–PoS for
measurement time points
The calculation of the PoS requires the likelihood and a prior to derive the posterior distribution of the conditional survival probabilities, from which the posterior predictive distribution of long-term survivors can then be generated. Given the likelihood in equation (1) and a prior distribution
In this small sample setting with strictly limited available information, the choice of the prior distributions is sensitive. In order to achieve a rather weakly informative prior distribution for the long-term survival probability, we choose negative log-Gamma distributions with shape parameter
For decision making, both STE–CP and STE–PoS require a cutoff to be defined for the CP or the PoS, respectively, below which the trial is stopped for futility. For both approaches, the PET is given by:
Simulation study
Simulation settings
To assess the operating characteristics of our new methods STE–CP and STE–PoS and to compare them to Kunz’s design, we conducted a simulation study. We assumed a phase II trial with long-term survival as a primary endpoint and the option to stop for futility. Long-term survival was defined as survival after 12 months and measurements were obtained every 3 months, so there were four measurement time points in total and

Visualization of the prior distributions used in the simulation and the posterior distribution for the example interim data presented in Table 2.
For all three designs, an interim analysis was conducted once

Null and alternative hypothesis for the long-term endpoint (after 12 months) with various scenarios for the short-term endpoint (after 3 months) and the corresponding Weibull survival functions with shape parameter
For illustration purposes, a single simulation run is summarized in Table 2. For this, time-to-event was sampled from a
Summary of example simulation run generated under scenario 1. Note that STE–PoS uses the whole posterior distribution, even though in this table only the posterior means are displayed.

Distribution of CP according to STE–CP and the PoS according to STE–PoS for all pairs of null and alternative hypothesis within scenario 1. Results of scenarios 2 and 3 are presented in the Supplementary Material. CP: conditional power; PoS: probability of success; STE: short-term endpoints.
Distribution of CP and PoS, and calibration of a cutoff
For the investigated simulation settings, the CP, as well as the PoS, tended toward small values in case the null hypothesis was true, and toward large values when the alternative hypothesis was true (scenario 1 shown in Figure 3, scenarios 2 and 3 shown in the Supplementary Material). This relation was stronger the slower the recruitment rate, since more long-term outcomes could be observed at the time of the interim analysis. Hence, the calculation could take into account more information. Although the general tendency was the same for STE–CP and STE–PoS, there were a few differences between the distributions. The distribution of the CP had a strong tendency toward the extreme values 0 and 1, while that of the PoS appeared smoother and produced less extreme values. This can be explained by the nature of the calculations within STE–CP and STE–PoS: for the calculation of the CP, a point estimate for the probability of an event was used to generate the new data, while for the calculation of the PoS the whole posterior distribution was used, which results for a specific interim analysis in a larger variance of the distribution of predicted outcomes. Therefore, the PoS will always be shrunken toward less extreme values compared to the CP. For very fast recruitment, that is, 4 patients per month on average, the distinction between null and alternative hypothesis became much less pronounced and the distributions were dominated by the prior assumptions rather than by the interim data. Combined with a high probability of 3 months survival, the PoS and CP were almost identical for the null and the alternative hypothesis scenario.
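The shrinkage of the PoS relative to the CP can be reproduced in a deliberately simplified setting. The sketch below assumes a single binary endpoint, a plug-in estimate for the CP, and a conjugate Beta prior for the PoS. This is not the STE–CP or STE–PoS calculation itself, which works with nested measurements and, for STE–PoS, with negative log-Gamma priors on the conditional survival probabilities. All function names and numbers are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def plug_in_cp(x_obs: int, n_obs: int, m_future: int, r: int) -> float:
    """P(total survivors > r) when future outcomes follow the point estimate."""
    p_hat = x_obs / n_obs
    needed = max(r + 1 - x_obs, 0)
    return sum(
        comb(m_future, y) * p_hat**y * (1.0 - p_hat) ** (m_future - y)
        for y in range(needed, m_future + 1)
    )


def predictive_pos(x_obs: int, n_obs: int, m_future: int, r: int,
                   a: float = 1.0, b: float = 1.0) -> float:
    """P(total survivors > r) under the Beta-binomial posterior predictive."""
    a_post, b_post = a + x_obs, b + n_obs - x_obs
    needed = max(r + 1 - x_obs, 0)
    return sum(
        comb(m_future, y)
        * exp(log_beta(a_post + y, b_post + m_future - y) - log_beta(a_post, b_post))
        for y in range(needed, m_future + 1)
    )


# Hypothetical interim: 10 patients with known outcome, 19 outcomes pending,
# success requires more than 5 survivors overall.
N_OBS, M_FUTURE, R = 10, 19, 5
for x_obs in (0, 2, 5):
    cp = plug_in_cp(x_obs, N_OBS, M_FUTURE, R)
    pos = predictive_pos(x_obs, N_OBS, M_FUTURE, R)
    print(f"survivors at interim: {x_obs}, CP: {cp:.4f}, PoS: {pos:.4f}")
```

With zero observed survivors, the plug-in estimate is exactly 0 and so is the CP, whereas the PoS remains positive because the posterior still places mass on nonzero survival probabilities. The same mechanism keeps the PoS away from 1 when the interim data look favorable.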
To define the cutoff for stopping for futility for each method, we assessed the distributions of PoS and CP for the scenario that the assumptions of Kunz’s design were correct and patient recruitment per month was Poisson distributed with mean
Probability of early termination
Using the established cutoff values, the PET for each scenario and each method was calculated (Figure 4, upper panel). In Kunz’s design, the PET was strongly dependent on the 3-month survival regardless of the true 12-month survival rate or patient recruitment speed. For both STE–CP and STE–PoS, the PET was very similar. It remained stable across all scenarios for the 3-month survival rates, if patient recruitment was rather slow, that is,

Probability of early termination after stage 1 of the investigated designs.

Type I error rate and power of the investigated designs. Black dashed lines indicate the prespecified type I error rate (0.10) and power (0.95) of Kunz’s design.
Type I error rate, defined as the rate of rejecting the null hypothesis when it was true, and power, defined as the rate of rejecting the null hypothesis when the alternative hypothesis was true, were calculated with the full data set, taking into account whether the trial had stopped early or continued to the second stage (Figure 5). Similar to the PET, the type I error rate and the power of Kunz’s design were dependent on the true 3-month survival rates but independent of the patient accrual. For STE–CP and STE–PoS they were stable across all scenarios for the 3-month survival rates if the patient accrual was slow. The faster the patient recruitment, the more the power and the type I error rate tended to resemble those of Kunz’s design. Under slow recruitment of 0.5 or 1 patient per month, the type I error rate remained close to the desired 0.1 level for all scenarios with inflation of 0.006 in the most extreme case. Under fast patient recruitment of 4 patients per month, the type I error rate could not be controlled and was inflated up to 0.19 in the most extreme case for both STE–PoS and STE–CP, similar to Kunz’s design which was inflated to 0.20 in the same scenario. The power in scenario 1 with slow patient recruitment was approximately between 0.75 and 0.8, in scenario 2 between 0.85 and 0.90, and in scenario 3 between 0.90 and 0.93. If the assumptions for the 3-month survival rate were exactly correct, Kunz’s design of course met the requirement of 0.95 power. However, for smaller 3-month survival rates, the power advantage of STE–PoS and STE–CP was remarkable, particularly in scenarios 1 and 2. For example, in scenario 1 with a 3-month survival rate of 0.7 and accrual rate of 1 patient per month, Kunz’s design had a power of 0.33 compared to 0.78 with STE–PoS and 0.79 with STE–CP. If the 3-month survival rate was as low as 0.4, the power of Kunz’s design approached 0 while the power of STE–PoS and STE–CP was still not affected.
In scenario 3, the observed differences between the designs were much smaller, because the parameter space of the 3-month survival rate was very limited due to the high long-term survival rate under the null and alternative hypotheses.
Informative priors
To explore the potential impact of informative priors in STE–PoS, we have additionally simulated scenario 1 with more informative prior distributions. In order to derive these priors, survival data of 10 patients was assumed to be available from a pilot study, of which 5 patients had survived at least up to 12 months, therefore supporting an optimistic expectation toward the trial’s outcome. The posterior distributions of the conditional survival probabilities were assessed. Using the R package
Discussion
Two-stage single-arm trials with the option to stop for futility are often planned according to Simon’s design. However, if the primary outcome takes a long time to be observed, such as 12-month survival, Simon’s design implies a long interruption of patient recruitment to conduct the interim analysis. Kunz et al. 2 have offered an efficient solution by assessing a short-term endpoint, such as 3-month survival, at the interim to make the decision about stopping for futility. We showed that if the assumptions about the short-term endpoint are wrong, Kunz’s design could suffer from drastically increased PET under the alternative hypothesis and thus diminished power. We proposed two methods, STE–CP and STE–PoS, that provide decision rules for such trials based on short-term endpoints without requiring any additional strong assumptions. In this regard, they complement the existing design of Kunz et al. 2 The methods apply the framework of CP and PoS, respectively. Since STE–PoS is embedded in the Bayesian framework, a definition of prior distributions is required. We have presented a default weakly informative prior distribution that can easily be adapted to be more optimistic or conservative.
The two proposed designs both allow for completely flexible timing of the interim analysis. In our simulations, the break between the two stages was of the same length as in Kunz’s design, and therefore considerably shorter than in Simon’s design, but that is not a requirement and it could also be shorter (or longer). The longer the break, the more patients will have complete follow-ups in the interim, so the more accurate the predictions will be. To correctly assess the operating characteristics of a specific trial, the length of the break must be prespecified in the planning stage. A (rather weak) assumption of the methods is that for each endpoint there should be at least one observation. If that is not the case, the survival probabilities need to be imputed, for example, by interpolating the assumed underlying survival curve. Our simulations have shown that assuming constant hazards over time works reliably also under non-constant hazards.
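Imputation under a constant hazard can be sketched as follows. If the assumed long-term survival probability at time T is p, a constant hazard implies S(t) = p^(t/T), so that the conditional survival probability is the same for every equally spaced interval. The function names and the value 0.5 for the assumed 12-month survival probability are hypothetical and serve only as an example.

```python
from math import exp, log


def constant_hazard_survival(p_long: float, t_long: float, times):
    """Survival probabilities at `times` under a constant hazard with S(t_long) = p_long."""
    hazard = -log(p_long) / t_long
    return [exp(-hazard * t) for t in times]


def conditional_survival(surv):
    """Probability of surviving each time point given survival up to the previous one."""
    out, prev = [], 1.0
    for s in surv:
        out.append(s / prev)
        prev = s
    return out


# Hypothetical assumption: 12-month survival of 0.5, assessments every 3 months.
TIMES = (3, 6, 9, 12)
surv = constant_hazard_survival(p_long=0.5, t_long=12, times=TIMES)
cond = conditional_survival(surv)
for t, s, c in zip(TIMES, surv, cond):
    print(f"month {t:2d}: S(t) = {s:.4f}, conditional survival = {c:.4f}")
```

An imputed conditional survival probability obtained this way would replace the missing estimate only for those assessments at which no patient has yet been observed.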
The simulations showed that STE–CP and STE–PoS closely control the type I error rate in all considered scenarios with slow patient recruitment and substantially improve power compared to Kunz’s design in many situations. The price is a loss of power in the narrow window where the true parameters of short-term and long-term survival are exactly as, or very close to, those postulated in the design stage. If the assumptions about the short-term survival from the design stage were correct, STE–CP and STE–PoS had lower power than Kunz’s design (between 0.75 and 0.9 compared to 0.95 under Kunz’s design). The more the design-stage assumptions about the short-term survival are off, the greater the advantage of the new methods over Kunz’s design, because the power of Kunz’s design will drop drastically (down to zero) while the power of STE–PoS and STE–CP is barely affected. We emphasize that this is only true for slow patient recruitment and that for trials with fast recruitment, the methods would not be applicable, because then little or no information on the long-term endpoint would be available for the interim analysis. The extent to which a trial could potentially benefit from STE–CP and STE–PoS also depends on the hypothesis to be tested: if the long-term survival rate is assumed to be rather low, the parameter space of the short-term survival rate is rather large, and hence the benefit in terms of power and type I error rate could be quite large (as seen from scenario 1 in our simulations). If the long-term survival rate is greater, the parameter space of the short-term survival rates gets smaller, hence less power can be gained (as seen from scenario 2 and particularly scenario 3).
If prior information is available, one might want to express this by informative priors. We have presented a simple way in which this can be done with STE–PoS in principle. Adjusting the prior distributions to express certain prior information is easily done, and as expected optimistic informative priors led to decreased PET, and increased type I error rate and power. Interestingly, the stability across different short-term survival rates diminished compared to weakly informative priors, which is a sign of a lack of robustness against deviations from the assumed relations between short-term and long-term survival rates. We would like to emphasize that the origin of this work was the situation of extremely scarce information, so weakly informative priors will be more appropriate in the intended applications, and the formulation of informative priors will require additional research before it is applicable in practice.
The differences between the operating characteristics of STE–CP and STE–PoS (with weakly informative priors) were small, so they are more relevant on a philosophical level. The distribution of CP values under STE–CP showed a strong tendency toward extreme values close to or equal to 0 or 1 compared to the distribution of PoS values under STE–PoS, which can be explained by the fact that STE–CP uses a single-point estimate of the probability of survival to predict new data, while STE–PoS uses the whole posterior distribution.
One should note that we have considered binding stopping rules, so our results are only valid if one follows the stopping rule. There are good arguments to have a non-binding stopping rule, the most important one being the ability to react flexibly to information not considered in the stopping rule, such as adverse events. 22 Nevertheless, the most popular design, Simon’s two-stage design, applies a binding stopping rule. Since it is impossible to quantify the operating characteristics under a non-binding stopping rule, calculations often assume no stopping under the null and a binding stopping rule under the alternative hypothesis, which is also how Kunz et al. originally defined their design. 2 If one applied such non-binding stopping rules to the presented simulation study, then under the null hypothesis the decision making of our methods would reduce to Kunz’s design and type I error would be strictly controlled for all methods, while under the alternative hypothesis, all would remain as presented. Therefore, we want to emphasize that the improvement in type I error by our methods will vanish if non-binding rules are considered, and that the focus of this research was to improve the power under uncertainty about the expected short-term survival.
Another aspect is that for Kunz’s design, computationally efficient closed-form solutions exist, which is an important practical advantage. For STE–PoS and STE–CP, calculations are much more time-consuming. Although it would be possible in principle to calculate exact operating characteristics, the number of possible outcomes is too high to be computationally feasible even for the small sample size and the few measurement time points that were considered here. Therefore, we relied on simulations. For the same reason, searching for suitable designs in the whole parameter space does not seem to be feasible at the moment. Therefore, we suggest first finding the optimal Kunz design with a somewhat higher power than desired and then using this design as the starting point for simulating the operating characteristics of STE–PoS or STE–CP.
When planning a trial, one should carefully consider whether the advantage of a faster conduct of the trial is worth the more complicated trial design compared to the relatively simple Simon’s two-stage design, which also has a completely stable type I error rate and power regardless of any other endpoints. In our discussions with clinical trial teams, the time advantage appeared often to be desirable, but the trade-off may be different for each situation. With this study, we hope to provide the required tools when facing the dilemma between an undesirably long interruption under Simon’s design and the risk of an underpowered study under Kunz’s design due to uncertainty about the short-term endpoint.
In principle, the presented methods are not limited to two-stage clinical trials with stopping for futility but could be applied to any setting in which one or more interim analyses are conducted and a survival endpoint with discrete measurement times is estimated. As they do not include a decision rule for the final analysis, such final decision rules need to be developed for the specific situation. Also, we have not considered the use of baseline covariates, as proposed recently in the context of randomized clinical trials. 8 Baseline covariates could possibly be utilized to account for informative censoring, although the small sample sizes might make it challenging to achieve a proper adjustment for multiple variables, since in the first stage of such early phase clinical trials often only around 15 to 25 patients are recruited. A further option to improve STE–PoS could be to incorporate informative priors, for which we have suggested a possible workflow, but more thorough investigation, particularly of robust priors, will be required to support their application in practice.
Supplemental Material
Supplemental material, sj-pdf-1-smm-10.1177_09622802231188515 for Using short-term endpoints to improve interim decision making and trial duration in two-stage phase II trials with nested binary endpoints by Dario Zocholl, Cornelia U. Kunz and Geraldine Rauch in Statistical Methods in Medical Research
Footnotes
Acknowledgment
We would like to thank Jarle Tufto from the Norwegian University of Science and Technology in Trondheim for pointing out the special relationship between the Beta distribution and the Gamma distribution that is used in this manuscript to motivate the prior distribution. We would further like to thank the two anonymous reviewers for their thoughtful and elaborate comments, which greatly improved the quality of this manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental materials for this article are available online.
