Sage Journals: Discover world-class research

Abstract

It is increasingly common for therapies in oncology to be given in combination. In some cases, patients can benefit from the interaction between two drugs, although often at the risk of higher toxicity. A large number of designs to conduct phase I trials in this setting are available, where the objective is to select the maximum tolerated dose combination. Recently, a number of model-free (also called model-assisted) designs have provoked interest, providing several practical advantages over the more conventional approaches of rule-based or model-based designs. In this paper, we demonstrate a novel calibration procedure for model-free designs to determine their most desirable parameters. Under the calibration procedure, we compare the behaviour of model-free designs to model-based designs in a comprehensive simulation study, covering a number of clinically plausible scenarios. It is found that model-free designs are competitive with the model-based designs in terms of the proportion of correct selections of the maximum tolerated dose combination. However, there are a number of scenarios in which model-free designs offer a safer alternative. This is also illustrated in the application of the designs to a case study using data from a phase I oncology trial.

Keywords

Dose-finding combination therapies model-free designs phase I trials

1. Introduction

The aim of phase I clinical trials investigating a single therapy is to find the highest dose that can be administered whilst ensuring that patients are at a low risk of serious side effects. To offer patients a higher chance of successful treatment, there is willingness to accept a dose that leads to more toxic responses, commonly labelled as dose-limiting toxicities (DLTs). The highest dose for which the treatment has a pre-specified probability of leading to a toxic outcome (target toxicity) is called the maximum tolerated dose (MTD). In an analysis of over 400,000 clinical trials conducted between 2000 and 2015,¹ it was found that 57.6% of all phase I oncology trials successfully progressed to phase II. It was found that in 73% of trials excluding oncology, treatments were successful in moving to phase II, thus demonstrating the importance of successful dose-finding methods in oncology, where drugs are clearly harder to develop.

In this work, we consider phase I oncology trials in which a combination of two therapies is investigated. Here the objective is to identify a maximum tolerated dose combination (MTC), the dose combination with a probability of toxicity closest to the target toxicity. Phase I oncology trials in this dual-agent setting have recently provoked notable interest.² In particular, it was found that immunotherapy, a targeted agent that stimulates the immune system to fight cancerous cells,³ can provide benefit to patients when administered in combination with chemotherapy or another targeted agent.⁴ One difficulty in the dual-agent setting is that the order of toxicity is unknown for some combinations – if the amount of one compound in the combination is increased while another is decreased, it is unknown whether the overall toxicity goes up or down.

A number of dose-finding methods for dual-agent combination phase I trials relaxing the monotonicity assumption on the order of some of the combinations have been proposed in the literature. They broadly belong to one of three categories; rule-based, model-based, and model-free (also known as model-assisted) designs. Rule-based designs (e.g. 3+3+3 or extensions of this^5–7) rely on a number of pre-specified rules to determine when a dose is escalated, de-escalated and chosen as the MTC. Model-based designs (e.g. Bayesian logistic regression model (BLRM),⁸ six-parameter model,⁹ partial ordering continual reassessment method (POCRM)¹⁰ and the modified logistic model¹¹) model the relationship between dose and probability of toxicity through a parametric function. Through the course of a trial, parameter estimates are updated to better describe this relationship. The model-free designs^12,13 do not pre-specify any relationship between dose and toxicity, thus do not rely on any parametric assumptions in their search for the MTC. However, unlike rule-based designs, the decision process in which the dose can be escalated or de-escalated is assisted with a statistical model.

Despite numerous papers demonstrating flaws in rule-based designs and their performance in drug combination trials,^14–16 it was reported that less than 5% of combination trials in oncology between 2011 and 2013 deviated from rule-based designs.¹⁷ It is perhaps the restrictions associated with model-based designs, such as difficulty of implementation or communication to clinicians, that have made these less commonly used in real trials. Recently, model-free designs have attracted attention due to their practicality,¹⁸ although these have not yet been fully evaluated in the literature.

The objective of this work is to review five recently proposed model-free dose-finding designs for phase I dual-agent combination studies, namely, the Bayesian optimal interval (BOIN) design¹⁹ BOIN, the Keyboard design²⁰ (KEY), the surface-free design²¹ (SFD), the product of independent beta probabilities design²² (PIPE), and the Waterfall design.²³ We evaluate their performance in an extensive simulation study. We note that some comparison has already been investigated by previous authors, for example, the comparison of the Keyboard design to two other designs.²⁰ The novelty in our approach is to compare the methods on equal grounds. We hence propose a calibration procedure that selects the parameters of each of the designs that maximise the proportion of correct selections (PCS), subject to a safety constraint. We compare the performance of these designs to three model-based designs; the BLRM, a model-based approach that uses a two-parameter logistic model for each compound⁸; the POCRM ¹⁰; and the modified logistic model,¹¹ as well as a non-parametric optimal benchmark.²⁴ We also evaluate the performance of each of the designs in a case study of neratinib and temsirolimus,²⁵ to highlight the differences between approaches in a real trial setting of a dose-finding trial from combination therapies.

The rest of the paper continues as follows. We first provide a review of model-free designs, before using a novel method to calibrate the parameters of each design leading to good performance. We then present detailed results from our simulation study across a wide range of toxicity scenarios, including the model-based designs for comparison. Each design is also applied to the real case study of neratinib and temsirolimus. We finish with a discussion of our results.

2. Methodological review

In this section, we describe the dose escalation procedure for each of the five model-free approaches in a general dose-finding trial. It is assumed patients enter the trial in cohorts, and the dose combination for the next cohort is assigned once the previous cohort’s responses are available. We first define the admissible combinations for each design. These are the dose combinations that are allowable for assignment for the next cohort of patients based on the last tested combination. We then describe the details of the escalation procedure in each of the designs in the following setting. Consider a dual-agent trial with $I$ doses of drug A, denoted $d_{1}^{A} < \dots < d_{I}^{A}$ and $J$ doses of drug B, denoted $d_{1}^{B} < \dots < d_{J}^{B}$ . Let $d_{i j}$ represent the combination of doses $d_{i}^{A}$ and $d_{j}^{B}$ for $i = 1, \dots, I$ and $j = 1, \dots, J$ . The total number of patients who receive combination $d_{i j}$ and the number of those who experience a toxic response on $d_{i j}$ during the trial are denoted $n_{i j}$ and $y_{i j}$ respectively. The probability of toxic response at $d_{i j}$ is written as $π_{i j}$ and the target toxicity is denoted $ϕ$ .

2.1. Admissible combinations

Before deciding on a dose for the next cohort, each design defines a set of combinations that are admissible; i.e. combinations that the next cohort could be allocated to. These are best illustrated with a diagram, Figure 1. Suppose we are at $d_{22}$ in Figure 1, indicated by the ‘ $∙$ ’ symbol. Admissible combinations for the BOIN, Waterfall and KEY designs are the same combination or adjacent combinations to the current one ( $d_{i j}$ $\to$ { $d_{i - 1, j}$ , $d_{i + 1, j}$ , $d_{i, j - 1}$ , $d_{i, j + 1}$ }), represented by the ‘ $\circ$ ’ symbols.

Figure 1.

Illustration for $I = J = 4$ of the admissible combinations for each design. The ‘ $∙$ ’ symbol illustrates the current dose combination at $d_{22}$ , the symbol ‘ $\circ$ ’ represents dose combinations achieved through escalation/de-escalation in one drug only and the symbol ‘ $*$ ’ represents dose combinations achieved through diagonal de-escalation or anti-diagonal escalation.

In addition to these combinations, the SFD and PIPE also allow for diagonal de-escalation, where the next cohort is administered a combination that is one dose level lower in each drug ( $d_{i j}$ $\to$ $d_{i - 1, j - 1}$ ), and also allow for anti-diagonal escalation, meaning the next cohort receives a combination that is one dose level higher in one drug and one dose level lower in the other ( $d_{i j}$ $\to$ { $d_{i - 1, j + 1}$ , $d_{i + 1, j - 1}$ }). These are depicted by the ‘ $*$ ’ symbols in Figure 1, where reaching $d_{11}$ requires diagonal de-escalation and reaching $d_{31}$ or $d_{13}$ requires anti-diagonal escalation. The rationale is that by enabling faster movement across the combination grid, the design can move to the MTC quickly, and de-escalate quickly if patients are treated at highly toxic combinations.

All designs prohibit diagonal escalation, where the next cohort receives a combination of one dose level higher in each drug ( $d_{i j}$ $\to$ $d_{i + 1, j + 1}$ ) and no dose levels can be skipped. These non-admissible doses are shown by the ‘ $\times$ ’ in the red cells in Figure 1.

2.2. The BOIN design

The BOIN design¹⁹ uses the intuitive estimator ${\hat{π}}_{i j} = y_{i j} / n_{i j}$ for the probability of toxicity at combination $d_{i j}$ , so that ${\hat{π}}_{i j}$ is the proportion of observed toxic responses on $d_{i j}$ across the whole trial. The estimator ${\hat{π}}_{i j}$ only updates after patient responses are observed on $d_{i j}$ , and is then used to guide dose escalation. This escalation process is defined by pre-specified values of $λ_{e}, λ_{d}$ to which ${\hat{π}}_{i j}$ is compared after each cohort. Whilst $ϕ$ is the target toxicity, $ϕ_{1}$ is the highest toxicity probability deemed sub-therapeutic and $ϕ_{2}$ is the lowest toxicity probability deemed overly toxic. These can be specified by the clinicians. Values of $ϕ_{1} < λ_{e} < ϕ$ and $ϕ_{2} > λ_{d} > ϕ$ are chosen to locally minimise the chance of incorrect escalation and de-escalation decisions during a trial, and are calculated using constants $ϕ_{1}$ and $ϕ_{2}$ . $λ_{e}$ and $λ_{d}$ are defined in equation (1):

\begin{aligned} λ_{e} = \frac{\log (\frac{1 - ϕ_{1}}{1 - ϕ})}{\log (\frac{ϕ (1 - ϕ_{1})}{ϕ_{1} (1 - ϕ)})} and λ_{d} = \frac{\log (\frac{1 - ϕ}{1 - ϕ_{2}})}{\log (\frac{ϕ_{2} (1 - ϕ)}{ϕ (1 - ϕ_{2})})} \end{aligned}

(1)

Both

λ_{e}

and

λ_{d}

are invariant to

d_{i j}

n_{i j}

and

y_{i j}

, so that optimising these parameters depends only on constants

ϕ

ϕ_{1}

and

ϕ_{2}

. After defining

λ_{e}

and

λ_{d}

, the rules for the dose-finding procedure are as follows:

If ${\hat{π}}_{i j} \leq λ_{e}$ , the next combination is chosen from $A_{E} = {d_{(i + 1) j}, d_{i (j + 1)}}$ .

If ${\hat{π}}_{i j} > λ_{d}$ , the next combination is chosen from $A_{D} = {d_{(i - 1) j}, d_{i (j - 1)}}$ .

Otherwise, $λ_{e} < {\hat{π}}_{i j} \leq λ_{d}$ and the next combination is the same.

In this way, dose skipping, diagonal escalation and diagonal de-escalation are prohibited – see the ‘Admissible Combinations’ section for more details. If the next combination is to be chosen from an empty

A_{E}

A_{D}

(for example the current combination is the highest in both doses and the design chooses to escalate), then the next cohort receives the same combination. The design assumes each patient response is independent,

y_{i j} \sim Binomial (n_{i j}, π_{i j})

and assigns a vague Beta(1,1) prior distribution to each

π_{i j}

, giving the posterior distribution for

π_{i j}

\begin{aligned} π_{i j} | n_{i j}, y_{i j} \sim Beta (y_{i j} + 1, n_{i j} - y_{i j} + 1) \end{aligned}

(2)

To choose between combinations in the chosen set, the BOIN design computes the posterior probability

P (π_{i j} \in (λ_{e}, λ_{d}) | n_{i j}, y_{i j})

. The combination maximising this probability is administered to the next cohort. For combinations yet to be tested, calculating this probability is based on the vague prior distribution only. In the event of ties, which is always the case when multiple potential combinations are yet to be administered, the next combination is selected at random from the chosen set. Note that no toxicity information is borrowed between the combinations under this model as the combinations are treated independently.

The design uses an overdosing criterion stating that a combination, and any that are more toxic under monotonicity, satisfying $P (π_{i j} > ϕ | n_{i j}, y_{i j}) \geq ϵ_{B O I N}$ for some overdosing probability threshold $0 < ϵ_{B O I N} \leq 1$ , cannot be administered to the next cohort. For the BOIN design, if $d_{i j}$ satisfies this condition, dose $d_{i j}$ and higher combinations are eliminated from the trial, and the dose maximising $P (π_{i j} \in (λ_{e}, λ_{d}) | n_{i j}, y_{i j})$ within $A_{D}$ is chosen for the next cohort. If combination $d_{11}$ satisfies the overdosing criterion, the trial is terminated earlier for safety.

After all patients are treated, estimates of each $π_{i j}$ are calculated via matrix isotonic regression.²⁶ The simple technique guarantees that estimates of $π_{i j}$ at higher combinations are at least as high as estimates of $π_{i j}$ at lower combinations, which follows the assumption of monotonicity. The MTC is selected as the combination with estimated $π_{i j}$ closest to $ϕ$ via isotonic regression.²⁶

2.3. The keyboard design

The Keyboard design (KEY)²⁰ is very similar to the BOIN design, defining an interval about the target toxicity $ϕ$ , denoted $I_{t a r g e t} = (ϕ - Δ_{1}, ϕ + Δ_{2})$ , for constants $Δ_{1}, Δ_{2} > 0$ , which can be chosen by the clinicians. A combination with estimated toxicity probability within this interval is said to have acceptable toxicity. The design then divides the (0,1) space into “keys”, defined as intervals $I_{t}$ of equal length $Δ_{1} + Δ_{2}$ (allowing for shorter keys at either end of (0,1)) for $t = 1, \dots, T$ , where $T$ is the number of keys. The interval $I_{t a r g e t}$ is fixed pre-trial, chosen to minimise the chance of incorrect escalation and de-escalation decisions.

The KEY design assigns a vague Beta(1,1) prior distribution to each $π_{i j}$ , and assumes that the number of toxic responses follows a binomial distribution, $y_{i j} \sim Binomial (n_{i j}, π_{i j})$ . The posterior distribution for each $π_{i j}$ is computed as in equation 2. Again, this means there is no borrowing of toxicity information across combinations. The design then identifies the key $I_{t}$ that is most likely to contain $π_{i j}$ , labelled $I_{m a x}$ ,

I_{m a x} = \underset{I_{t} : t \in (1, \dots, T)}{argmax} P (π_{i j} \in I_{t} | n_{i j}, y_{i j})

(3)

Once the key

I_{m a x}

is identified, escalation and de-escalation decisions happen as follows:

If $I_{m a x} < I_{t a r g e t}$ , the next combination is chosen from $A_{E} = {d_{(i + 1) j}, d_{i (j + 1)}}$ .

If $I_{m a x} > I_{t a r g e t}$ , the next combination is chosen from $A_{D} = {d_{(i - 1) j}, d_{i (j - 1)}}$ .

If $I_{m a x} = I_{t a r g e t}$ , the next combination is the same.

To choose between combinations in

A_{E}

(or

A_{D}

), the design computes the posterior probability

P (π_{i j} \in I_{t a r g e t} | n_{i j}, y_{i j})

for all combinations in

A_{E}

(or

A_{D}

). The combination maximising this probability is administered to the next cohort. The remainder of the escalation process and the selection of the MTC is analogous with the BOIN design, with an identical overdosing rule using

ϵ_{K E Y}

and the MTC chosen via isotonic regression.²⁶

Both the BOIN and KEY designs model dose combinations independently, however in the following two designs, the connections between the dose combinations are also taken into account.

2.4. The surface-free design

The SFD²¹ does not restrict the MTC search to a parametric surface and does not require the order of toxicity between combinations to be known. The main idea is to parametrise ratios between toxicity probabilities for different combinations, defining $θ = 1 - π_{11}$ , and $θ_{i} = \frac{1 - π_{i, j}}{1 - π_{i - 1, j}}$ . Then $θ$ is the probability of a patient having no toxic response on the lowest dose combination and $θ_{i}$ denotes the ratio between the probability of a patient having no toxic response on dose combinations $d_{i j}$ and $d_{(i - 1) j}$ for $j = 2, \dots, J$ and $i = 2, \dots, I$ . Similarly, $τ_{j} = \frac{1 - π_{i, j}}{1 - π_{i, j - 1}}$ is defined as the ratio between the probability of a patient having no toxic response on $d_{i j}$ and $d_{i (j - 1)}$ for $j = 2, \dots, J$ and $i = 2, \dots, I$ . Thus, the probability of toxicity for each combination $d_{i j}$ is

π_{i j} = 1 - θ θ_{2} \dots θ_{i} τ_{2} \dots τ_{j}

(4)

Due to monotonicity, each ratio

θ_{i}, τ_{j} \in (0, 1)

and the SFD assigns each of these ratios an independent Beta prior distribution. The hyper-parameters of the prior distributions can be chosen to match the clinicians’ prior mean estimates of toxicity probability on each combination and effective sample sizes.

After each cohort, the SFD updates the posterior means for ratios $θ, θ_{2}, \dots, θ_{I}, τ_{2}, \dots, τ_{J}$ using Bayes theorem, which can be related back to $π_{i j}$ through equation (4) to give estimates of the toxicity probabilities. In this way, the SFD is borrowing information across various drug combinations previously collected in the trial to make an informed decision on escalation. Additionally, the continual multiplication of Beta random variables implies that $π_{i j}$ for higher combinations has higher variance, allowing for more cautious escalation at higher combinations. Considering all neighbouring combinations apart from the one higher in both doses, the next combination is chosen as the one with estimated $π_{i j}$ closest to $ϕ$ . An overdosing criterion prohibits any combination from being administered if $P (π_{i j} > ϕ | n_{i j}, y_{i j}) \geq ϵ_{S F D}$ for some $ϵ_{S F D} > 0$ , and the trial is terminated if this is satisfied for $d_{11}$ .

Once all patients have been treated, the MTC is selected as the combination with toxicity probability closest to $ϕ$ . Note that the SFD design is more computationally intensive than the other model-free designs as MCMC methods are required to sample from the posterior distribution.

2.5. The PIPE design

The PIPE design²² differs from the model-free designs discussed so far in that it was originally proposed to find the MTC contour, labelled ${MTC}_{ϕ}$ . This is a line partitioning the combination space into safe and overly toxic combinations. Those below the contour are believed to have toxicity probability less than target toxicity $ϕ$ , whilst those above are believed to have toxicity probability greater than $ϕ$ .

Assuming the $π_{i j}$ are independent, they are assigned a Beta prior distribution, $π_{i j} \sim Beta (a_{i j}, b_{i j})$ for hyperparameters $a_{i j}$ and $b_{i j}$ , for $i = 1, \dots, I$ and $j = 1, \dots, J$ . Priors can be pre-specified if knowledge on the toxicity of combinations is available. Assuming each patient is independent such that $y_{i j} \sim Binomial (n_{i j}, π_{i j}) \forall i, j$ , the posterior for $π_{i j}$ can be written as

\begin{aligned} π_{i j} | n_{i j}, y_{i j} \sim Beta (y_{i j} + a_{i j}, n_{i j} - y_{i j} + b_{i j}) \end{aligned}

(5)

The posterior distribution is only updated after a cohort of patients is treated on the corresponding combination, but the

{MTC}_{ϕ}

is re-estimated regardless of which combination was tested. The monotonicity assumption means that the PIPE design needs only to consider contours satisfying this property, limiting the number of possible contours to

(\begin{matrix} I + J \\ I \end{matrix})

Each contour can be represented by a binary matrix, where entries are 0 or 1 depending on whether estimates of the toxicity probability for a combination are below or above the contour respectively. Let $ϑ$ be the set of all monotonic contours for an $I \times J$ dose combination space and define $C_{s} \in ϑ$ as the binary matrix representing the contour $s = 1, \dots, (\begin{matrix} I + J \\ I \end{matrix})$ .

To estimate the ${MTC}_{ϕ}$ given the current data, the design calculates the posterior probability of each toxicity probability being less than or equal to $ϕ$ , that is

\begin{aligned} p_{i j} (ϕ | n_{i j}, y_{i j}) = P (π_{i j} \leq ϕ | n_{i j}, y_{i j}, a_{i j}, b_{i j}) \end{aligned}

(6)

where the right-hand side of equation (6) is equal to the cumulative distribution function of a Beta distribution. Equation (8) gives the general formula for calculating the probability that the

{MTC}_{ϕ}

is defined by the matrix

C_{s}

\begin{aligned} P ({MTC}_{ϕ} = C_{s} | n_{i j}, y_{i j}) \end{aligned}

(7)

\begin{aligned} = \prod_{i = 1}^{I} \prod_{j = 1}^{J} {1 - p_{i j} (ϕ | n_{i j}, y_{i j})}^{C_{s} [i, j]} p_{i j} (ϕ | n_{i j}, y_{i j})^{1 - C_{s} [i, j]} \end{aligned}

(8)

where

[i, j]

represents the entry in the

i

th row and

j

th column of the binary matrix

C_{s}

. The contour maximising equation (8) is the contour most likely to be the

{MTC}_{ϕ}

given the current data. This contour then assists the escalation process by identifying the combinations closest to it, before the design selects one of these for the next cohort based on a weighted randomisation procedure. This involves weighting each combination by the inverse of their sample size, with the rationale being varied experimentation around the

{MTC}_{ϕ}

. Escalation continues in this way until all patients are treated, at which point all combinations closest from below the

{MTC}_{ϕ}

are recommended for phase II.

The design uses an overdosing rule that considers the expected probability of $d_{i j}$ being above the most probable ${MTC}_{ϕ}$ averaged over all monotonic contours. This is written as

q_{i j} = \sum_{C_{s} \in ϑ} C_{s} [i, j] P ({MTC}_{ϕ} = C_{s} | Y^{(m)})

and

d_{i j}

cannot be administered to the next cohort if

q_{i j} \geq ϵ_{P I P E}

for some

ϵ_{P I P E} > 0

. A trial is terminated if combination

d_{11}

satisfies this condition.

The PIPE design can recommend multiple combinations for phase II, as it recommends all combinations closest from below its ${MTC}_{ϕ}$ . For consistency across designs, in our implementation we ensure only one combination is recommended as the MTC. Therefore for each recommended combination, we find the posterior mean probability of toxicity, which can be calculated using the posterior distributions in equation (6). The combination with posterior mean closest to $ϕ$ is selected as the MTC, choosing a combination at random in the event of a tie.

2.6. The Waterfall design

The Waterfall design²³ also aims at finding the MTC contour. This design breaks down the two-dimensional dosing grid into a series of one-dimensional sub-trials. For the $I \times J$ dosing grid, the $I$ sub-trials are as follows:

\begin{aligned} S_{I} & = {d_{11}, \dots, d_{I 1}, d_{I 2}, \dots, d_{I J}} \\ S_{I - 1} & = {d_{I - 1, 2}, \dots, d_{I - 1, J}} \\ S_{I - 2} & = {d_{I - 2, 2}, \dots, d_{I - 2, J}} \\ \dots \\ S_{1} & = {d_{12}, \dots, d_{1 J}} \end{aligned}

Sub-trials are conducted sequentially using a single-agent dose-finding method, with the single-agent BOIN design recommended by the authors.

Firstly, sub-trial $S_{I}$ is conducted, starting at dose $d_{11}$ , and the first so-called ‘candidate MTD’ for the sub-trial is found using the single-agent design. The next sub-trial is chosen based on this candidate MTD. For a candidate MTD of $d_{i *, j *}$ , the next sub-trial to be conducted is $S_{i * - 1}$ . The process of selecting a sub-trial and candidate MTD is repeated until sub-trial $S_{1}$ is completed. All responses are then collated, and matrix isotonic regression is used to select the dose in each row of the dose grid with the estimate of toxicity probability closest to the target, unless all doses in that row are overly toxic. These dose combinations make up the MTC contour that is recommended.

Similarly to the PIPE design, since many combinations are recommended, for consistency we select one MTC based on the posterior mean probability of each dose combination that is recommended.

3. Calibration of designs

Model-based and model-free designs based on a Bayesian framework give clinicians more control over their performance. The PIPE design, the SFD and most model-based designs allow for knowledge on the toxicity of each drug from monotherapy trials to be incorporated into the design through their prior distributions. As the BOIN and KEY designs assign vague priors to the toxicity probabilities, their behaviour is primarily determined by the pre-defined intervals guiding escalation. Although it is in theory possible to incorporate historical data through the prior in the BOIN and KEY designs,²⁷ for the purpose of this comparison, it would defeat the purpose of a design with all escalation boundaries pre-specified at the design stage for ease of implementation.

Since for all designs, the hyper-parameter values of the prior and the values of the intervals have a substantial effect on the escalation procedure, any attempt to compare designs objectively must ensure that these values are specified in a fair way. In this comparison study, the aim of the novel calibration procedure is to give all designs a set-up that leads to consistently high proportions of selections of combinations with toxicity probability close to $ϕ$ in all scenarios, whilst keeping the number of patients allocated to unsafe doses low. Therefore, an a priori pre-specified criterion for selection of the prior parameters is used instead of the originally subjective proposed parameters for each model. Please see the Supplemental Materials for the comparison of calibrated and originally proposed hyper-parameters.

Each design considered in the comparison is calibrated using a novel two-stage approach. The first stage of the calibration is concerned with choosing values for hyper-parameters that give a good performance in selecting the MTC without considering safety. The second stage then focusses on safety, calibrating the overdose rule taking into account not only good performance in terms of recommending no combinations when considering an overly toxic scenario, but also the number of patients who are treated at unsafe doses. Although using similar principles to a standard fine-tuning approach, the novelty of this two-stage calibration is in this lesser subjective but still intuitive choice of hyper-parameters.

This two-stage calibration procedure based on high performance and safety is applied for all designs, employing a grid search over hyper-parameter or interval values (depending on the design). At each stage, this involves running simulations over four clinically plausible scenarios and determining which values lead to superior performance when averaged across the scenarios. We refer to the priors resulting in superior performance across the four scenarios as operational priors. Whilst this procedure gives the Bayesian designs an opportunity to be compared fairly against each other through their performance, these priors can also be applicable in the practical case where no reliable prior information about the compounds is available.

The first stage of the calibration procedure evaluates which design inputs lead to superior performance in recommending the MTC across the four scenarios. The PCS is examined in each scenario. This is the proportion of trials in which a design selects any combination with a true toxicity probability of exactly 0.30. To summarise the overall performance across these four scenarios, the geometric mean PCS is used. For the remainder of this section, the mean will refer to the geometric mean. Suppose $x_{1}, \dots, x_{N}$ represent the PCS in $N$ scenarios. The geometric mean, $(\prod_{i = 1}^{N} x_{i})^{1 / N}$ , is used instead of the arithmetic mean because it has the useful property of penalising cases in which PCS are more dispersed across scenarios. The combination of hyper-parameter or interval values that result in the highest geometric mean PCS across the scenarios is chosen. We note that during the first stage of the calibration procedure, no overdosing rules are included, meaning no trials are to be stopped before all patients have been recruited, because we choose to calibrate the parameter controlling the overdosing rule in the separate second stage. Once the first stage of calibration is complete, this will lead to the selection of intervals for the BOIN, Waterfall and KEY designs, and operational priors for the PIPE and SFD designs. The same approach is used to determine the operational priors for the BLRM, the POCRM and the logistic model, which are used as model-based comparators in the next section.

The second stage of the calibration procedure is for $ϵ$ , the parameter regulating the overdosing rule in each design. Calibration of $ϵ$ involves decreasing its value starting from 1, and observing the number of patients treated at overly toxic doses and the proportion of correct outcomes in the chosen scenarios. As selecting overly toxic combinations is more of an ethical concern, as a general rule we take as a starting point the highest value of $ϵ$ resulting in at least 85% of trials recommending no combinations when considering an overly toxic scenario. We acknowledge this proportion may differ in practice depending on the clinicians’ judgement. It is important to note that the interpretation of $ϵ$ differs between designs because of the construction of each overdosing rule, and should be accounted for when communicating with clinicians. This is reflected by subscripts for the individual designs in the following specifications. The second stage of the calibration procedure for each design is illustrated in Figure 2. Note that the PCS for all four scenarios is shown for completeness, although the second stage of the procedure itself uses this metric only for Scenario 14, where the correct selection is the selection of no dose combination since all are overly toxic.

Figure 2.

Calibration of $ϵ$ for the five designs. (a) BOIN & Waterfall: Patients treated at overly toxic doses; (b) BOIN & Waterfall: Proportion of correct selections; (c) KEY: Patients treated at overly toxic doses; (d) KEY: Proportion of correct selections;(e) Surface-Free: Patients treated at overly toxic doses; (f) Surface-free: Proportion of correct selections; (g) PIPE: Patients treated at overly toxic doses; (h) PIPE: Proportion of correct selections.BOIN: Bayesian optimal interval; KEY: the Keyboard design.

3.1. Setting

Each design is calibrated in the same setting that is then explored in the simulation study, representative of a phase I trial in oncology. There are two drugs with three dose levels each, which results in nine combinations, and the first cohort is treated at the lowest combination. The objective is to select a single combination as the MTC with true toxicity probability $ϕ = 0.30$ . The sample size is 36 patients for which are recruited in cohorts of three patients. All combination-toxicity scenarios are presented in Table 1. However, four scenarios are chosen to explore noticeably different clinical cases, in which the number and location of the MTCs vary, whilst restricting the number of scenarios makes the procedure computationally feasible.

Table 1.
Toxicity scenarios to evaluate the combination designs. Rows and columns refer to the dose of drug A and B respectively. True maximum tolerated dose combinations (MTCs) are in bold and ‘acceptable’ combinations are underlined.

$d_{1}^{B}$ $d_{2}^{B}$ $d_{3}^{B}$

Scenario 1

$d_{1}^{A}$ 0.05 0.10 0.15

$d_{2}^{A}$ 0.10 0.15 0.20

$d_{3}^{A}$ 0.15 0.20 0.30

Scenario 2

$d_{1}^{A}$ 0.05 0.10 0.15

$d_{2}^{A}$ 0.10 0.20 0.30

$d_{3}^{A}$ 0.20 0.30 0.45

Scenario 3

$d_{1}^{A}$ 0.02 0.05 0.10

$d_{2}^{A}$ 0.10 0.15 0.20

$d_{3}^{A}$ 0.20 0.30 0.45

Scenario 4

$d_{1}^{A}$ 0.05 0.10 0.15

$d_{2}^{A}$ 0.10 0.20 0.30

$d_{3}^{A}$ 0.20 0.45 0.60

Scenario 5

$d_{1}^{A}$ 0.02 0.05 0.15

$d_{2}^{A}$ 0.20 0.30 0.45

$d_{3}^{A}$ 0.45 0.55 0.65

Scenario 6

$d_{1}^{A}$ 0.10 0.15 0.30

$d_{2}^{A}$ 0.15 0.30 0.45

$d_{3}^{A}$ 0.30 0.45 0.60

Scenario 7

$d_{1}^{A}$ 0.10 0.20 0.45

$d_{2}^{A}$ 0.15 0.30 0.50

$d_{3}^{A}$ 0.30 0.50 0.60

Scenario 8

$d_{1}^{A}$ 0.05 0.10 0.20

$d_{2}^{A}$ 0.10 0.20 0.30

$d_{3}^{A}$ 0.30 0.45 0.55

Scenario 9

$d_{1}^{A}$ 0.10 0.15 0.30

$d_{2}^{A}$ 0.30 0.40 0.50

$d_{3}^{A}$ 0.40 0.50 0.60

Scenario 10

$d_{1}^{A}$ 0.15 0.30 0.45

$d_{2}^{A}$ 0.30 0.45 0.55

$d_{3}^{A}$ 0.45 0.55 0.65

Scenario 11

$d_{1}^{A}$ 0.02 0.05 0.10

$d_{2}^{A}$ 0.30 0.45 0.60

$d_{3}^{A}$ 0.45 0.60 0.75

Scenario 12

$d_{1}^{A}$ 0.20 0.30 0.45

$d_{2}^{A}$ 0.45 0.50 0.55

$d_{3}^{A}$ 0.65 0.70 0.75

Scenario 13

$d_{1}^{A}$ 0.30 0.45 0.50

$d_{2}^{A}$ 0.45 0.50 0.55

$d_{3}^{A}$ 0.50 0.55 0.60

Scenario 14

$d_{1}^{A}$ 0.45 0.50 0.55

$d_{2}^{A}$ 0.50 0.55 0.60

$d_{3}^{A}$ 0.55 0.60 0.65

Scenario 15

$d_{1}^{A}$ 0.10 0.10 0.10

$d_{2}^{A}$ 0.10 0.10 0.10

$d_{3}^{A}$ 0.10 0.10 0.10

	$d_{1}^{B}$	$d_{2}^{B}$	$d_{3}^{B}$
Scenario 1
$d_{1}^{A}$	0.05	0.10	0.15
$d_{2}^{A}$	0.10	0.15	0.20
$d_{3}^{A}$	0.15	0.20	0.30
Scenario 2
$d_{1}^{A}$	0.05	0.10	0.15
$d_{2}^{A}$	0.10	0.20	0.30
$d_{3}^{A}$	0.20	0.30	0.45
Scenario 3
$d_{1}^{A}$	0.02	0.05	0.10
$d_{2}^{A}$	0.10	0.15	0.20
$d_{3}^{A}$	0.20	0.30	0.45
Scenario 4
$d_{1}^{A}$	0.05	0.10	0.15
$d_{2}^{A}$	0.10	0.20	0.30
$d_{3}^{A}$	0.20	0.45	0.60
Scenario 5
$d_{1}^{A}$	0.02	0.05	0.15
$d_{2}^{A}$	0.20	0.30	0.45
$d_{3}^{A}$	0.45	0.55	0.65
Scenario 6
$d_{1}^{A}$	0.10	0.15	0.30
$d_{2}^{A}$	0.15	0.30	0.45
$d_{3}^{A}$	0.30	0.45	0.60
Scenario 7
$d_{1}^{A}$	0.10	0.20	0.45
$d_{2}^{A}$	0.15	0.30	0.50
$d_{3}^{A}$	0.30	0.50	0.60
Scenario 8
$d_{1}^{A}$	0.05	0.10	0.20
$d_{2}^{A}$	0.10	0.20	0.30
$d_{3}^{A}$	0.30	0.45	0.55
Scenario 9
$d_{1}^{A}$	0.10	0.15	0.30
$d_{2}^{A}$	0.30	0.40	0.50
$d_{3}^{A}$	0.40	0.50	0.60
Scenario 10
$d_{1}^{A}$	0.15	0.30	0.45
$d_{2}^{A}$	0.30	0.45	0.55
$d_{3}^{A}$	0.45	0.55	0.65
Scenario 11
$d_{1}^{A}$	0.02	0.05	0.10
$d_{2}^{A}$	0.30	0.45	0.60
$d_{3}^{A}$	0.45	0.60	0.75
Scenario 12
$d_{1}^{A}$	0.20	0.30	0.45
$d_{2}^{A}$	0.45	0.50	0.55
$d_{3}^{A}$	0.65	0.70	0.75
Scenario 13
$d_{1}^{A}$	0.30	0.45	0.50
$d_{2}^{A}$	0.45	0.50	0.55
$d_{3}^{A}$	0.50	0.55	0.60
Scenario 14
$d_{1}^{A}$	0.45	0.50	0.55
$d_{2}^{A}$	0.50	0.55	0.60
$d_{3}^{A}$	0.55	0.60	0.65
Scenario 15
$d_{1}^{A}$	0.10	0.10	0.10
$d_{2}^{A}$	0.10	0.10	0.10
$d_{3}^{A}$	0.10	0.10	0.10

In stage 1 of the calibration procedure, Scenarios 1, 8, 10 and 13 are chosen to give a diverse range of plausible scenarios. We have found the procedure to be robust to changes in these scenarios provided the qualitative features (e.g. if there is at least one scenario where the optimal combination is off diagonal) are maintained. The number of scenarios is chosen to balance computational feasibility and breadth of differing scenarios. Scenarios 1 and 13 are chosen to represent the extremes: when the highest combination is the only true MTC and all others are safe, and when the lowest combination is the only true MTC and all others are overly toxic, respectively. Scenario 8 covers situations in which most combinations are safe but true MTCs do not lie on the same diagonal. Scenario 10 captures the case where most combinations are overly toxic and true MTCs lie on the same diagonal. Note that we often refer to the set of combinations in a scenario as the combination grid.

In stage 2, simulations are run for each design over Scenarios 8, 10, 13 and 14 for different values of $ϵ$ . In Scenarios 8, 10 and 13, the PCS is as previously defined, whilst in the unsafe Scenario 14 we consider the PCS as the proportion of trials in which no combinations are selected. We refer to selecting no combinations in Scenario 14 as the ‘correct outcome’.

We summarise the optimal choice of hyper-parameters alongside the recommendations from the original proposal of each design in the online supplemental materials.

3.2. Calibrating the BOIN design

To guide dose escalation, the BOIN design relies on the interval $(λ_{e}, λ_{d})$ around the target toxicity. Interval boundaries $λ_{e}$ and $λ_{d}$ are a function of $ϕ$ , $ϕ_{1}$ and $ϕ_{2}$ , where $ϕ_{1} = a_{1} ϕ$ and $ϕ_{2} = a_{2} ϕ$ for constants $a_{1} < 1, a_{2} > 1$ . To calibrate the design, we run 4000 simulations for each scenario for pairs $(a_{1}, a_{2})$ from the sets $a_{1} = {0.85, 0.80, \dots, 0.40}$ and $a_{2} = {1.15, 1.20, \dots, 1.60}$ , resulting in a total of 100 pairs. As constants $a_{1}$ and $a_{2}$ deviate further from 1, the interval becomes wider, thus the design will choose to escalate and de-escalate on fewer occasions.

The optimal values are found to be $a_{1} = 0.65$ and $a_{2} = 1.4$ , which substituting into equation (1), we generate the interval boundaries $λ_{e}$ and $λ_{d}$ to give the interval (0.245, 0.359) to guide dose escalation. This interval implies that escalation occurs if $π_{i j}$ is below 0.245, de-escalation occurs if $π_{i j}$ is above 0.359, else the combination remains the same.

In the second stage of calibration, we find that as $ϵ_{{BOIN}}$ decreases, the design benefits more in Scenario 14, where the proportion of trials in which no combinations are recommended increases (see Figure 2). For $ϵ_{{BOIN}} \leq 0.84$ , over 85% of trials recommend no combinations in Scenario 14. The trade-off in the other scenarios with this $ϵ_{{BOIN}}$ value is that PCS increases steeply when $ϵ_{{BOIN}}$ increases, as well as the number of patients treated on overly toxic doses increasing. Therefore $ϵ_{{BOIN}} = 0.84$ is chosen.

3.3. Calibrating the keyboard design

Using a similar method to BOIN, we first calibrate the parameters that define the interval for KEY. The interval $I_{t a r g e t} = (b_{1}, b_{2})$ guides escalation entirely so is an important component of the design. We run 4000 simulations across each scenario for pairs $(b_{1}, b_{2})$ from the sets $b_{1} = {0.27, 0.25, \dots, 0.19}$ and $b_{2} = {0.33, 0.35, \dots, 0.41}$ , resulting in a total of 25 pairs. Mean PCS are displayed Figure S2 in the online supplemental materials, indicating that interval (0.21, 0.39) yields the highest mean PCS, which differs from the recommendation of (0.25, 0.35) in the original paper.²⁸ As explained in the Methodological Review, this means escalation occurs only if the posterior probability $P (π_{i j} \in (0.03, 0.21) | n_{i j}, y_{i j})$ is higher than $P (π_{i j} \in (0.21, 0.39) | n_{i j}, y_{i j})$ .

In the second stage of calibration, we find that as $ϵ_{{KEY}}$ decreases, the design benefits more in Scenario 14, where the proportion of trials in which no combinations are recommended increases. Choosing $ϵ_{{KEY}} = 0.84$ leads to approximately 85% of trials correctly selecting no combinations in Scenario 14, as shown in Figure 2, in line with the value obtained for the BOIN design.

3.4. Calibrating the surface-free design

The SFD assigns Beta priors to each of its parameters; the ratios between toxicity probabilities. In this setting, there are five ratios ( $θ$ , $θ_{2}$ , $θ_{3}$ , $τ_{2}$ & $τ_{3}$ defined in the Methodological Review) to parametrise, meaning a total of 10 hyper-parameters for the beta priors must be defined for the operational priors. Instead of specifying these directly, we specify a prior mean and prior effective sample size for each ratio, which can be used to calculate the corresponding hyper-parameters. To make the calibration task computationally feasible, we assume that all prior mean ratios, $m$ , are equal (meaning the increase in dose corresponds to the same proportion increase in toxicity) and all effective sample sizes for each ratio, $s_{{SFD}}$ , are equal. This drastically reduces the dimensionality of the grid search. We note, however, that if one has reliable prior information about the ratio not being equal, one can calibrate over various ratios being mindful of the computational complexity of such a calibration. Thus we only need to calibrate pairs of $m$ and $s$ , which we choose from sets $m = {0.95, 0.925, \dots, 0.85}$ and $s_{{SFD}} = {1, 2, \dots, 5}$ .

For each pair, we run 1000 simulations (which is lower than other model-free designs due to the computational demands of the design) and examine the mean PCS across the four scenarios. Our results in Figure S4 in the online supplemental materials show that the mean PCS is highest for $m = 0.875$ and $s_{{SFD}} = 4$ . This is equivalent to every ratio being assigned the prior distribution $Beta (3.5, 0.5)$ , and corresponds to mean prior toxicity probabilities on $d_{11}$ and $d_{33}$ of 0.125 and 0.487, respectively. We also note that the SFD has been found to be robust to small to moderate deviations in these parameters.

Figure 3.

Constructing prior mean toxicity probabilities when calibrating the PIPE design.

Figure 4.

An illustration of the PCS and PAS for Scenarios 1–13 for each design. The solid bars measure the PCS and the more transparent bars measure the PAS. The rightmost group of bars show the means. PCS: proportion of correct selection; PAS: proportion of acceptable selection.

For the calibration of $ϵ_{{SFD}}$ , in Figure 2, we found that $ϵ_{{SFD}} = 0.65$ resulted in at least 85% of trials selecting no combinations in Scenario 14. There is evidence to suggest that increasing or decreasing $ϵ_{{SFD}}$ not only has a sizeable effect on the PCS in Scenario 13, but also the number of patients treated at unsafe doses, demonstrating the design is highly sensitive to changes in its overdosing rule.

3.5. Calibrating the PIPE design

Similar to the SFD, the PIPE designs assigns beta priors to each $π_{i j}$ . A prior mean and prior sample size for each $π_{i j}$ are specified, giving a total of 18 values to specify from which the hyper-parameters for the beta priors can be calculated. To make calibration feasible, we assume that prior sample size $s_{{PIPE}}$ is equal for each combination and to set the prior means, we divide the grid of combinations into five diagonal segments, with toxicity increasing as we move through each segment. In this way, the design follows the monotonicity assumption. Like for the SFD, this drastically reduces the dimensionality of the grid search. To assign toxicity to each combination, we specify the toxicity of the lowest combination, $ρ$ , and the size of the increments in toxicity between each segment, $δ$ . In the illustration in Figure 3, we have chosen $ρ = 0.05$ and $δ = 0.025$ to construct the grid.

Our approach involves calibrating three parameters simultaneously to create operational priors, and are chosen from the sets $ρ = {0.025, 0.05, 0.075, 0.10}$ , $δ = {0.025, 0.05, 0.075, 0.10}$ and $s_{{PIPE}} = {1 / 72, 1 / 36, 1 / 18, 1 / 9}$ . For each triple, we run 2000 simulations in each of the four scenarios, which is fewer than for the BOIN and KEY designs due to the minor increase in computational expense. We provide one grid in Figure S5 in the online supplemental materials to account for mean PCS on each $s_{{PIPE}}$ value. The triple $s_{{PIPE}} = 1 / 18$ , $ρ = 0.05$ and $δ = 0.025$ leads to the highest mean PCS, although we observe that there were many triples that resulted in similar values. We note our choice of prior sample size, $s_{{PIPE}} = 1 / 18$ , only differs to the recommendation of 1/9 in the original paper.²² For prior sample sizes $s_{{PIPE}} \leq 1 / 18$ , the design is found to be robust. Mean PCS only varies between 37% and 40%, suggesting that a number of operational priors could lead to consistently high PCS.

Figure 5.

An illustration of the proportion of overly toxic selections across Scenarios 1–15 for each design. The rightmost groups of bars show the means.

For the second stage of the calibration, the value of $ϵ_{{PIPE}}$ is varied as shown in Figure 2, and $ϵ_{{PIPE}} = 0.50$ is chosen as it provides at least an 85% chance of correctly recommending no combinations in Scenario 14, as well as balancing the number of patients treated at unsafe doses in the four considered scenarios.

3.6. Calibrating the Waterfall design

The calibration of the Waterfall design is in line with the calibration of the BOIN design, since the parameters of escalation are the same. Therefore the values of the hyper-parameters used for the interval $(λ_{e}, λ_{d})$ are the same: (0.245, 0.359) and the parameter for the safety constraint is also the same, $ϵ_{W} = 0.84$ . For the $3 \times 3$ dosing grid, there are three sub-trials, the first has 5 dose combinations and the following two have 2 dose combinations. Therefore, to split the 12 cohorts between the sub-trials, six are allocated to the first sub-trial and three to each of the other sub-trials.

4. Simulation study: $3 \times 3$ dosing grid

In this section we describe the setting for the simulation study of $3 \times 3$ dosing grids before presenting the results, including a comparison to a model-based approach and a non-parametric optimal benchmark.

4.1. Setting

In order to compare the discussed designs, we conduct a simulation study, performing 2000 simulations of each of the 15 scenarios depicted in Table 1 for all five designs. As before, the objective is to select a single combination as the MTC with true toxicity probability $ϕ = 0.30$ . Any combination with probability of toxicity greater than 0.33 is labelled as overly toxic, and any combination with probability of toxicity in the interval [0.16,0.33] is labelled as acceptable. In this section, the mean refers to the arithmetic mean unless specified otherwise. All simulations are performed in R,²⁹ with code provided in the online supplemental materials.

In general, the number of overly toxic combinations available for selection increases as we move through Scenarios 1–14. Scenario 1 has a single MTC which is the highest combination available. Scenarios 3 and 4 contain very few overly toxic combinations and have MTCs on the edge of the grid. Scenario 5 is similar to these, except its only MTC is located in the centre of the grid. In Scenarios 2, 6, 7, 8, 9 and 10, there are multiple combinations to explore which have toxicity probability $ϕ$ . In particular, Scenarios 8 and 9 aim to investigate design behaviour when underlying MTCs are not on the same diagonal. Scenarios 11, 12 and 13 represent settings in which most combinations are overly toxic, meaning designs should avoid combinations away from $d_{11}$ . Scenario 14 is of importance because all of its combinations are overly toxic, making the trial very unethical. In this instance, the only correct outcome is to recommend no combination for phase II. Scenario 15 represents a situation where all combinations are true MTCs, and is used to monitor escalation behaviour when combinations are safe and increasing the dose of either drug does not affect toxicity.

In order to accentuate the differences in the designs, we do not implement any accuracy or sufficient information rules, as these may mask some key elements of the designs. We focus on the operating characteristics of PCS and proportion of acceptable selections (PAS) as measures of accuracy, and proportion of overly toxic selections and the number of patients treated on unsafe dose combinations as measures of safety.

4.2. Model-based comparators

To provide a comparison between model-free and model-based designs, we also consider conventional model-based approaches in our simulation study, the two-dimensional BLRM,⁸ the POCRM,¹⁰ and a modified design based on the logistic model (referred to here as the Riviere design).¹¹

The same proposed calibration procedure as is applied to the model-free designs is applied to the model-based, with details provided in the online supplemental materials. Note that the form of the overdosing rule may be different for the model-based designs, compared the model-free designs, as described in their respective original proposals.

4.2.1. Bayesian logistic regression model

In this approach, the toxicity probability for each combination, $π_{i j}$ , are modelled as in equation 10 for $i = 1, \dots, I$ and $j = 1, \dots, J$ , where doses $d_{i}^{A}$ and $d_{j}^{B}$ are scaled by reference doses. Let $d_{i j}$ be combination of $d_{i}^{A}$ and $d_{j}^{B}$ , while $n_{i j}$ and $y_{i j}$ are the number of patients and toxic responses on each combination respectively. Parameters $α_{1}$ and $β_{1}$ describe the toxicity of drug A, $α_{2}$ and $β_{2}$ describe the toxicity of drug B, and $η$ models the interaction between drugs. The five parameters are assigned normal prior distributions, and the likelihood is a product of Bernoulli densities, proportional to $\prod_{i = 1}^{I} \prod_{j = 1}^{J} π_{i j}^{y_{i j}} (1 - π_{i j})^{n_{i j} - y_{i j}}$ . After each cohort is observed, the joint posterior distribution is approximated using MCMC methods, and samples of each parameter are drawn from their full conditional distributions. Estimates of $π_{i j}$ are made by sampling parameters from their posteriors and substituting these along with the corresponding doses into equation 10. Note that all parameters except $η$ are sampled on the log scale and then exponentiated since they must be positive.

\begin{aligned} π_{i j} (α_{1}, α_{2}, β_{1}, β_{2}, η | d_{i}^{A}, d_{j}^{B}) = \end{aligned}

(9)

\begin{aligned} \frac{[α_{1} (d_{i}^{A})^{β_{1}} + α_{2} {(d_{j}^{B})}^{β_{2}} + α_{1} α_{2} {(d_{i}^{A})}^{β_{1}} {(d_{j}^{B})}^{β_{2}}] \exp (η d_{i}^{A} d_{j}^{B})}{1 + [α_{1} {(d_{i}^{A})}^{β_{1}} + α_{2} {(d_{j}^{B})}^{β_{2}} + α_{1} α_{2} {(d_{i}^{A})}^{β_{1}} {(d_{j}^{B})}^{β_{2}}] \exp (η d_{i}^{A} d_{j}^{B})} \end{aligned}

(10)

The BLRM can only escalate to combinations satisfying the neighbourhood constraint and Escalation With Overdose Control (EWOC) principle. The neighbourhood constraint prevents escalation or de-escalation to any combination that is more than one dose level of either drug away, and also prevents escalation to a combination in which both dose levels are higher. For a trial with target toxicity

ϕ = 0.30

, the EWOC principle states that

d_{i j}

can only be administered if

P (π_{i j} > 0.33) < ϵ_{{BLRM}}

. The combination maximising the probabilistic statement

P (0.16 < π_{i j} < 0.33)

is administered to the next cohort. If no combinations satisfy the two constraints, the trial is terminated. Once the sample size has been exhausted, the MTC is selected from combinations that have been experimented on with at least six patients, and is the one maximising

P (0.16 < π_{i j} < 0.33)

. The BLRM requires dosing quantities for each drug to be specified, in all of the implementations of the BLRM, these doses are 100, 200 and 300 mg for each drug.

4.2.2. Partial ordering continual reassessment method

The Bayesian POCRM¹⁰ generalises the original CRM design to the setting of combination trials. The POCRM design assumes that there are $R$ feasible orderings of the combinations satisfying the monotonicity assumption within each agent. Let $k$ be the index of the combination, $k = 1, \dots, K$ , $r$ be the index of ordering, $r = 1, \dots, R$ , $q_{k r}$ be the standardised regimen level at combination $k$ under the ordering $r$ , and let $π_{k r}$ be the corresponding probability of a DLT. Then, the combination-toxicity model takes the form

\begin{aligned} π_{k r} = q_{k r}^{\exp (α_{r})} \end{aligned}

(11)

where

α_{r}

is the (scalar) model parameter under the ordering

r

that has a normal prior distribution,

N (μ, σ^{2})

. The working models

q_{k r}

are constructed from standardised values,

{\tilde{q}}_{k}

(also known as skeleton) by re-ordering them according to the order

r

. The original design proposal included the possibility of the early stopping if the lower bound of the

η

-% approximated confidence interval³⁰ is above the target toxicity level,

ϕ

. We refer the reader to the original publication for further technical details.

4.2.3. Logistic model by Riviere et.al (2014)

Another model-based approach considered is the modified logistic model.¹¹ Following the recommendation on reducing the dimensionally of the parametric models in Phase I trials,³¹ we consider the 3-parameter logistic model (rather than the original 4-parameter one³²) and was found to result in the same or better, on average, operating characteristics of the design for small to moderate sample sizes.

Specifically, the combination-toxicity is modelled using the 3-parameter logistic model

\begin{aligned} l o g i t (π_{j, k}) = β_{0} + β_{1} u_{j} + β_{2} v_{k} \end{aligned}

(12)

where

β_{0} \in R

β_{1} > 0

β_{2} > 0

are unknown parameters that denote the intercept (

β_{0}

) the toxicity effect of agent 1 (

β_{1}

) and agent 2 (

β_{2}

), and

u_{j} = \log (\frac{p_{j}}{1 - p_{j}})

and

v_{k} = \log (\frac{q_{k}}{1 - q_{k}})

are the standardised doses of the two agents. The parameters

p_{1}, \dots, p_{j}

, and

q_{1}, \dots, q_{k}

, being the prior estimates of the toxicity probabilities for the dose levels of agent 1 and 2, respectively, when administered as monotherapies. The terms

u_{j}

and

v_{k}

are also known as the skeleton and are unchanged throughout the trial. The escalation rules follow the ones originally used.¹¹ The safety constraint for the early stopping is taking the form of if

P (π_{1, 1} > ϕ) > ϵ_{l o g i s t i c}

is satisfied then the trial is stopped earlier.

4.3. A non-parametric optimal benchmark comparator

While the primary goal of this work is to compare the performance of different model-free designs to each other, there is a risk that all methods might perform equally poorly on some scenarios. In this case, the comparison of the designs to each other would not identify why the poor performance is observed – due to the challenging scenario or due to all designs having difficulties identifying a particular MTC. To provide context for the comparison of operating characteristics, we include the performance of the non-parametric benchmark for combination studies, a tool that provides an estimate for the upper bound on the PCS under the given combination-toxicity scenario.^24,33 The benchmark takes into account the ‘difficulty’ of a scenario in terms of how close the toxicity risks for the combinations (under this scenario) are to the target level of 30%, and also accounts for the unknown monotonic ordering in the combination setting. We refer the reader to the recent work by Mozgunov et al.²⁴ for further technical details on the benchmark for combinations implementation.

4.4. Results

4.4.1. Accuracy index and proportions of correct and acceptable selections

The results are presented here in terms of proportions of correct and acceptable selections (PCS and PAS) as defined in the ‘Calibration of Designs’ section. We also calculate an accuracy index, defined in equation (13), where $ρ_{i j}$ is the proportion of simulations that select dose $d_{i j}$ as the MTC (see Hirakawa et al.³⁴ for further details). Table 2 presents these results.

\begin{aligned} A_{n} = 1 - I \times J \times \frac{\sum_{i = 1}^{I} \sum_{j = 1}^{J} | π_{i j} - ϕ | \times ρ_{i j}}{\sum_{i = 1}^{I} \sum_{j = 1}^{J} | π_{i j} - ϕ |} \end{aligned}

(13)

Table 2.

Values of the accuracy index (13) for each of the designs across Scenarios 1–15.

Scenario	BOIN	KEYBOARD	S-F	PIPE	Waterfall	BLRM	POCRM	Riviere
1	0.538	0.541	0.489	0.003	0.205	0.728	0.790	0.767
2	0.484	0.482	0.375	0.042	0.352	0.538	0.251	0.568
3	0.394	0.422	0.345	0.148	0.287	0.435	0.225	0.482
4	0.487	0.477	0.362	0.207	0.326	0.462	0.156	0.437
5	0.416	0.461	0.472	0.362	0.358	0.653	0.180	0.699
6	0.558	0.528	0.578	0.343	0.461	0.493	0.364	0.562
7	0.535	0.549	0.556	0.384	0.378	0.579	0.420	0.611
8	0.539	0.541	0.515	0.258	0.415	0.390	0.178	0.449
9	0.490	0.525	0.603	0.339	0.533	0.350	0.303	0.498
10	0.619	0.646	0.723	0.634	0.517	0.690	0.478	0.663
11	0.329	0.378	0.604	0.380	0.418	0.475	0.330	0.697
12	0.722	0.704	0.723	0.744	0.671	0.798	0.480	0.668
13	0.842	0.854	0.812	0.911	0.842	0.887	0.796	0.794
14	0.903	0.909	0.932	0.938	0.822	0.957	0.908	0.898
15	0.040	0.013	0.032	0.018	0.040	0.008	0.029	0.029
Mean	0.527	0.535	0.541	0.381	0.442	0.563	0.393	0.588

BLRM: Bayesian logistic regression model; POCRM: partial ordering continual reassessment method; BOIN: Bayesian optimal interval; S-F: Surface Free; PIPE: product of independent beta probabilities design.

Figure 4 presents the summary of the operating characteristics of the considered designs in terms of the PCS and PAS (with the full set of results given in the online supplemental materials). Model free designs are shown in blue, model-based designs in purple, and the non-parametric benchmark is in black. Scenarios 14 and 15 have been excluded as these have no true MTCs for the design to select. For scenarios in which the only acceptable combinations are also correct combinations (Scenarios 6, 9, 10, 11 and 13), the PCS and PAS are equal. The mean PCS across Scenarios 1-13 for the BOIN, KEY, SFD, PIPE, Waterfall, BLRM, POCRM and Riviere designs is 39.8%, 42.4%, 41.6%, 31.2%, 32.3%, 40.0%, 32.2% and 48.3% respectively, whilst the mean PAS are 58.7%, 62.1%, 59.0%, 56.0%, 53.4%, 58.4%, 44.4% and 64.6% respectively.

First of all, the benchmark reveals the differences in how challenging it is to identify the MTC in the considered scenarios: the PCS for the benchmark varies between approximately 35% under Scenario 7 to more than 80% under Scenario 13. As expected, the benchmark corresponds to the highest average PCS and PAS – 55% and nearly 70%, respectively. Similarly, under the majority of scenarios the benchmark corresponds to the highest PCS and PAS as it employs the concept of the complete information. The largest difference between the benchmark and other designs can be seen under Scenario 13. At the same time, there are scenarios under which the benchmark is outperformed by a competing design – this can be a sign of the design favouring particular combinations under the calibrated priors – for example under Scenario 7. Since the aim of the calibration procedure was to obtain a prior with good operating characteristics across many plausible scenarios, in the simulation study some scenarios will have better or worse performance than the benchmark.

The variety of performances across the scenarios demonstrates the variability between the different designs in different settings. Considering the model-free designs, on average the KEY design has the highest proportion of both correct and acceptable selections, but is vastly outperformed in some scenarios by the SFD design. In five of the scenarios, the KEY has the highest PCS out of all the model-free designs, being superior in scenarios with few overly toxic combinations. However, for example in Scenario 11, where the MTC is the middle dose of drug A and lowest dose of drug B, the SFD outperforms the next best performing design by 19.2%. The PIPE design shows poor performance in many scenarios, most notably in Scenario 1 where the PCS is 5.5% and PAS is 54.0%. A likely reason is that for the PIPE design, the choice of MTC must be below the MTC contour, and a scenario where the true MTC is the highest dose combination gives rise to underestimation since we cannot explore above the true MTC contour. In addition, the procedure discussed in the Methodological Review of the PIPE design to choose one MTC from the recommended set will make our results differ from those originally reported by Mander and Sweeting,²² where a ‘correct selection’ was defined as the MTC being in the set of recommended doses. Although the Waterfall design also recommends a set of doses, the performance in the simulations is better than that of the PIPE design. Although it has poor performance in Scenario 1, where the MTC is the highest dose combination, it is the best performing design in Scenarios 9 and 13.

When considering the model-based designs as a comparator, we see that in many scenarios these outperform the KEY. For example, in Scenario 1 where the MTC is the highest combination, all of the model based designs achieve a PCS at least 20% higher than the next best performing design, the KEY. In fact, when including these designs in the comparison, the KEY is only the best performing design in one scenario, Scenario 8. The SFD does however outperform the model-based designs in some cases, with the model-based designs having the highest PCS in Scenarios 1, 2, 3, 5, 7 and 11 and the SFD is the best performing in Scenarios 6, 10 and 12.

In terms of the accuracy index, the SFD, BLRM and Riviere designs show the highest mean value, supporting the collective evidence that these are the most accurate designs in selecting the MTC.

4.4.2. Proportions of overly toxic selections

Figure 5 illustrates the proportion of overly toxic selections for each design. Scenarios 1 and 15 have no overly toxic combinations, so the proportion is zero for these cases. Model-free designs showed lower proportions of overly toxic selection than model-based designs in many scenarios. We observe that the POCRM recommends the most overly toxic combinations on average by far, in 36.8% of trials. It stands out in multiple scenarios with a very high percentage of simulated trials recommending overly toxic doses. In 10 scenarios this is over 30%, highlighting how aggressive this approach is.

Of the model-free designs, the SFD has the highest percentage, in 20.4% of trials. In scenarios 5, 9 and 12, it recommends overly toxic combinations in over 25% of the simulated trials and in 9 of the scenarios, it is the model-free design with the highest proportion of overly toxic recommendations. This is evidence of the trade-off between selecting combinations close to $ϕ$ and the willingness to recommend more overly toxic combinations.

The PIPE design demonstrates a very low proportion of overly toxic selections with a mean of 9.2% across the 13 scenarios, 6.2% below any of the other designs. It has the lowest in all but four scenarios. This is a further illustration of the feature of the design to recommend combinations near but lower than the estimated MTC contour.

A focus on Scenario 14, where all dose combinations are overly toxic, shows the BLRM is the most efficient at stopping for safety, with 93.7% of simulations not recommending any dose combination. Noticeably, the Waterfall design is the least efficient at stopping for safety in this scenario.

4.4.3. Number of patients treated at overly toxic combinations

Figure 6 outlines the mean number of patients treated at overly toxic combinations in Scenarios 1–15 for each design. Note that we report the number rather than proportion of patients, as this will also give insight into how effectively each design stops for safety.

Figure 6.

An illustration of the number of patients treated at overly toxic combinations during trials in Scenarios 1–15 for each design. The rightmost group of bars show the means.

The most notable feature of these results is the large number of patients treated at overly toxic combinations by the model-based designs. This aggressive escalation is driven by the informative prior, calibrated to give high values of PCS. We refer the reader to the online supplemental materials where an alternative BLRM prior leading to more conservative escalation (but considerably lower PCS and PAS) is explored.

The SFD, KEY and BOIN have reasonable performance, with the Waterfall design showing a strong performance with the lowest overall mean number of patients treated on overly toxic doses of five patients.

Careful attention must again be paid to Scenario 14, where all dose combinations are overly toxic. The PIPE design treats an average of 20 patients per trial, over six cohorts, which is an unacceptable level of exploration in such a scenario. In this scenario, we also consider that although the BLRM showed good performance in stopping early for safety in the highest number of simulated trials, it also has a high number of patients treated on average before stopping. The Waterfall design has the fewest patients treated by far, with an average of five patients, but this leads to the erroneous recommendation of a dose in a large proportion of simulated trials.

We see that overall the model-free approaches are more conservative in their escalation than the model-based designs, with fewer patients treated on unsafe doses, with no noticeable increase in PCS. Of the model-free approaches, the SFD shows the most promising PCS over the different scenarios, at the cost of somewhat higher overly toxic selections. It is also worth noting that the SFD has a substantially higher computational cost than the other model-free designs.

5. Simulation study: Alternative dosing grids

All of the simulations so far have concerned a dosing grid whereby there are three levels of each drug. However, it may alternatively be the case where there are differing numbers of dose levels for the two drugs. The most common being two levels of one drug and three or four of the other. Therefore in this section, we consider two alternative dosing grid sizes: $2 \times 3$ and $2 \times 4$ .

5.1. Setting

Six additional scenarios are considered for the alternative dosing grids, outlined in Table 3. Scenarios 16-18 have a $2 \times 3$ dosing grid and are chosen to be similar to scenarios 10, 9 and 6, but without the highest dose of drug A in order to explore the behaviour of the smaller grid. Scenarios 19–21 have a $2 \times 4$ dosing grid, with Scenarios 19 and 21 having an MTC at the fourth dose level of drug B and Scenario 20 similar to Scenario 18, but with the fourth dose level of drug B unsafe.

Table 3.
Alternative dosing grid toxicity scenarios to evaluate the combination designs. Rows and columns refer to the dose of drugs A and B, respectively. True MTCs are in bold and ‘acceptable’ combinations are underlined.

$d_{1}^{B}$ $d_{2}^{B}$ $d_{3}^{B}$

Scenario 16

$d_{1}^{A}$ 0.10 0.30 0.45

$d_{2}^{A}$ 0.30 0.45 0.60

Scenario 17

$d_{1}^{A}$ 0.10 0.15 0.30

$d_{2}^{A}$ 0.30 0.45 0.60

Scenario 18

$d_{1}^{A}$ 0.10 0.20 0.30

$d_{2}^{A}$ 0.15 0.30 0.50

$d_{1}^{B}$ $d_{2}^{B}$ $d_{3}^{B}$ $d_{4}^{B}$

Scenario 19

$d_{1}^{A}$ 0.10 0.15 0.20 0.30

$d_{2}^{A}$ 0.30 0.40 0.50 0.60

Scenario 20

$d_{1}^{A}$ 0.10 0.20 0.30 0.40

$d_{2}^{A}$ 0.20 0.30 0.40 0.50

Scenario 21

$d_{1}^{A}$ 0.15 0.20 0.25 0.30

$d_{2}^{A}$ 0.40 0.45 0.50 0.55

	$d_{1}^{B}$	$d_{2}^{B}$	$d_{3}^{B}$
Scenario 16
$d_{1}^{A}$	0.10	0.30	0.45
$d_{2}^{A}$	0.30	0.45	0.60
Scenario 17
$d_{1}^{A}$	0.10	0.15	0.30
$d_{2}^{A}$	0.30	0.45	0.60
Scenario 18
$d_{1}^{A}$	0.10	0.20	0.30
$d_{2}^{A}$	0.15	0.30	0.50
	$d_{1}^{B}$	$d_{2}^{B}$	$d_{3}^{B}$	$d_{4}^{B}$
Scenario 19
$d_{1}^{A}$	0.10	0.15	0.20	0.30
$d_{2}^{A}$	0.30	0.40	0.50	0.60
Scenario 20
$d_{1}^{A}$	0.10	0.20	0.30	0.40
$d_{2}^{A}$	0.20	0.30	0.40	0.50
Scenario 21
$d_{1}^{A}$	0.15	0.20	0.25	0.30
$d_{2}^{A}$	0.40	0.45	0.50	0.55

The priors and intervals used in each design are equivalent to those used for the $3 \times 3$ dosing grid, meaning that hyper-parameter values are the same. The cohort size remains three patients, with a maximum of 12 cohorts. For the Waterfall design, 8 cohorts are allocated to the first sub-trial and 4 cohorts to the second sub-trial.

5.2. Results

We again compare the designs’ operating characteristics in terms of performance of selecting correct and acceptable dose combinations, and safety in terms of patients allocated to overly toxic doses and selection of overly toxic doses.

Table 4 gives the accuracy index for the designs across the six scenarios, calculated using equation 13. It shows a similar trend to the $3 \times 3$ dosing grid, with the SFD giving the best overall performance of the model-assisted designs, and PIPE giving the poorest performance, let down by scenarios where the MTC is high in the dosing grid.

Table 4.
Values of the accuracy index (13) for each of the designs across the alternative dosing grid scenarios.

Scenario BOIN KEYBOARD S-F PIPE Waterfall BLRM POCRM Riviere

16 0.545 0.577 0.745 0.289 0.447 0.657 0.431 0.689

17 0.535 0.549 0.568 0.241 0.567 0.533 0.206 0.375

18 0.418 0.444 0.495 0.079 0.459 0.612 0.192 0.520

19 0.524 0.497 0.451 0.371 0.505 0.428 0.111 0.345

20 0.366 0.392 0.520 0.229 0.326 0.478 0.307 0.475

21 0.404 0.330 0.397 0.294 0.372 0.386 0.154 0.349

Mean 0.465 0.465 0.529 0.250 0.446 0.516 0.233 0.459

Scenario	BOIN	KEYBOARD	S-F	PIPE	Waterfall	BLRM	POCRM	Riviere
16	0.545	0.577	0.745	0.289	0.447	0.657	0.431	0.689
17	0.535	0.549	0.568	0.241	0.567	0.533	0.206	0.375
18	0.418	0.444	0.495	0.079	0.459	0.612	0.192	0.520
19	0.524	0.497	0.451	0.371	0.505	0.428	0.111	0.345
20	0.366	0.392	0.520	0.229	0.326	0.478	0.307	0.475
21	0.404	0.330	0.397	0.294	0.372	0.386	0.154	0.349
Mean	0.465	0.465	0.529	0.250	0.446	0.516	0.233	0.459

BLRM: Bayesian logistic regression model; POCRM: partial ordering continual reassessment method; BOIN: Bayesian optimal interval; S-F: PIPE: product of independent beta probabilities design.

Figure 7 shows the percentage of correct and acceptable selections for the alternative dosing grid scenarios. In scenarios 16 and 17, the benchmark illustrates that these $2 \times 3$ dosing grid scenarios are less difficult than their $3 \times 3$ comparators. In Scenario 16, all designs perform at least as well as in the $3 \times 3$ case with the SFD performing the best in Scenario 16 and the Keyboard and Waterfall performing the best in Scenario 17, however in Scenario 17, the PIPE design performs slightly worse. PIPE performs considerably worse in Scenario 18, where it is evident that in the $3 \times 3$ grid, dose $d_{31}$ was frequently chosen, and when the third dose level of drug $A$ is removed, the other MTCs are not chosen as frequently. The benchmark shows that the $2 \times 4$ dosing grid scenarios are more challenging than the other sizes of grid, however most designs give a high level of PCS, apart from the PIPE design. However, the PIPE design does give the highest levels of PAS in the $2 \times 4$ dosing grid scenarios.

Figures 8 and 9 show the percentage of overly toxic selections and the number of patients treated at overly toxic combinations for scenarios 16-21, respectively. There is a clear trend that the model-based designs have a higher percentage of overly toxic selections, with the PIPE design showing a very low percentage, substantially lower than all designs apart from in Scenario 21, where it has the same level as the Waterfall design. There is a similar trend in the mean number of patients treated on overly toxic doses, with the PIPE design showing the smallest average overall, and POCRM and Riviere showing the highest. Scenario 20 has many more patients treated on unsafe doses than Scenario 18 in all but the SFD, PIPE and BLRM designs. This indicates that for the other five designs, escalation to the fourth dose level of drug $B$ was restricted by the grid rather than the design itself.

6. Case study

The simulation studies gave insight into the operating characteristics of each design, however for further insight into the escalation behaviour, we apply each method to an example case study. We consider a phase I oncology (breast and lung cancer) study enrolling patients to dosing combinations of four dose levels of neratinib and temsirolimus.²⁵ A total sample size of 60 patients (cohorts of size 2 or 3) were treated on 12 of 16 possible dosing combinations. Results from 52 patients were included and 10 DLTs were observed, with full results of the trial displayed in Table 5.

Table 5.
Results for each of the designs applied to the case study, including the raw trial data of the study by Gandhi et al.²⁵ Each entry represents $y_{i j} / n_{i j}$ . The MTC as chosen by each design is highlighted in bold. In the case of the BLRM, (c) indicates the calibrated prior hyper-parameters were used and (a) indicates the alternative values were used.

Temsirolimus

Raw Trial Data 15 mg 25 mg 50 mg 75 mg

Neratinib 120 mg 0/2 0/4 1/5 0/4

160mg 1/4 1/4 0/5 3/6

200 mg 0/4 1/8 1/2

240 mg 2/4

Temsirolimus

BOIN 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/0 0/0

160 mg 1/6 0/6 3/6

200 mg 2/12 2/3 0/0

Temsirolimus

KEY 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/0 0/0

160 mg 1/6 0/6 3/6

200 mg 2/9 3/6 0/0

Temsirolimus

SFD 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/0 0/6

160 mg 1/6 0/6 4/9

200 mg 0/3 2/3 0/0

Temsirolimus

PIPE 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/0 0/0

160 mg 1/3 1/12 0/0

200 mg 2/15 2/3 0/0

Temsirolimus

Waterfall 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/3 0/6

160 mg 1/6 0/3 3/6

200 mg 1/6 2/3 0/0

Temsirolimus

BLRM (c) 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/3 0/6

160 mg 0/0 0/9 4/9

200 mg 0/0 0/0 5/6

Temsirolimus

BLRM (a) 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/3 0/6

160 mg 0/0 0/9 4/12

200 mg 0/0 0/0 2/3

Neratinib 120 mg 0/3 1/6 4/12

160 mg 0/3 1/6 0/0

200 mg 0/0 2/6 0/0

Temsirolimus

Riviere 25 mg 50 mg 75 mg

Neratinib 120 mg 0/3 0/0 2/6

160 mg 0/3 1/9 2/3

200 mg 0/3 5/9 0/0

		Temsirolimus
Neratinib	120 mg	0/2	0/4	1/5	0/4
	160mg	1/4	1/4	0/5	3/6
	200 mg	0/4	1/8	1/2
	240 mg	2/4
		Temsirolimus
	BOIN	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/0
	160 mg	1/6	0/6	3/6
	200 mg	2/12	2/3	0/0
		Temsirolimus
	KEY	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/0
	160 mg	1/6	0/6	3/6
	200 mg	2/9	3/6	0/0
		Temsirolimus
	SFD	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/6
	160 mg	1/6	0/6	4/9
	200 mg	0/3	2/3	0/0
		Temsirolimus
	PIPE	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/0
	160 mg	1/3	1/12	0/0
	200 mg	2/15	2/3	0/0
		Temsirolimus
	Waterfall	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/3	0/6
	160 mg	1/6	0/3	3/6
	200 mg	1/6	2/3	0/0
		Temsirolimus
	BLRM (c)	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/3	0/6
	160 mg	0/0	0/9	4/9
	200 mg	0/0	0/0	5/6
		Temsirolimus
	BLRM (a)	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/3	0/6
	160 mg	0/0	0/9	4/12
	200 mg	0/0	0/0	2/3
Neratinib	120 mg	0/3	1/6	4/12
	160 mg	0/3	1/6	0/0
	200 mg	0/0	2/6	0/0
		Temsirolimus
	Riviere	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	2/6
	160 mg	0/3	1/9	2/3
	200 mg	0/3	5/9	0/0

In order to strengthen the case that our conclusions are generalisable with less risk of selective reporting, the purpose of this case study is not to investigate whether each design chooses the same MTC as the real study did. The purpose is to give an illustration of how each design explores the dosing grid, given identical patient responses.

In order to use the calibrated prior specifications, and in line with the simulation study, we restrict the dosing grid to three doses of each drug, removing the lowest dose of temsirolimus and the highest dose of neratinib. We also fix the cohort size to three patients and maximum total sample size to 36.

To ensure a fair comparison between designs, we define a fixed set of 36 ordered patient responses for each dose combination. The first patient responses in this set are the true $y_{i j}$ DLT responses and $n_{i j} - y_{i j}$ non-DLT responses, in a random permutation (note that this is the same random permutation for each of the methods). The remaining $36 - n_{i j}$ responses are generated in the following way. Each patient has an individual probability of DLT, generated from Beta $(1 + y_{i j}, 1 + n_{i j} - y_{i j})$ . Then a binary response is generated with this probability. Where there were no patients assigned to the dose combination in the real study, the individual P(DLT) is generated from a Beta(3,3) distribution, to indicate the dose combination is unsafe, since this is the reason the combination was not escalated to. This process uses the information from the real study, but also introduces enough variability in the subsequent responses to account for the small sample size.

Table 5 displays the results of each of the methods, with the number of patients treated at each combination, the number of DLTs observed, and the concluded MTC highlighted in bold.

The BOIN and KEY designs show very similar exploration, first escalating in neratinib, then temsirolimus. The highest combination is not explored, as the combinations with the next lowest dose of each drug were considered unsafe. The only difference is that the KEY assigns one more cohort to the 200 mg/50 mg combination, even when the previous cohort had 2/3 observed DLT responses.

The PIPE design explores differently, not escalating to the highest dose of temsirolimus at all, even though only 1/12 DLT responses were observed on the 160 mg/50 mg combination. The SFD explores more of the highest dose of temsirolimus, although still not the highest combination. An interesting observation here is that the final recommended dose has observed 4/9 DLT responses, a level that would generally be an unsafe standard. This is in line with the simulation results that showed this design to have the highest level of overly toxic selections. The Waterfall design explores the entire dosing grid apart from the highest dose combination 200 mg/75 mg, a more even spread of patients across the grid than the other model-free designs.

The BLRM is executed with two prior distributions, the calibrated prior and the alternative, more realistic based on safety concerns, prior. Surprisingly, both show a more aggressive escalation than the model-free designs, with patients allocated to the highest combination. The calibrated prior gives the most aggressive approach with a second cohort assigned to a dose, even when the first observed 2/3 DLT responses. This also means that for this prior, the dosing grid is not as well explored as some of the model-free designs, as the lowest dose of temsirolimus is only explored in combination with the lowest dose of neratinib. For the calibrated prior, these results are in line with the simulation study, where the BLRM had on average the most patients treated on overly toxic doses and also a high proportion of overly toxic recommendations. However, even the alternative prior shows more aggressive escalation than the model-free designs in this case study. Both the POCRM and the Riviere design had a balanced exploration of the dose grid.

Figure 7.

An illustration of the PCS and PAS for Scenarios 16–21 for each design. The solid bars measure the PCS and the more transparent bars measure the PAS. The rightmost group of bars show the means.

Figure 8.

An illustration of the percentage of overly toxic selections across Scenarios 16–21 for each design. The rightmost group of bars show the means.

The case study highlights some key differences in the approaches, illustrating how both the escalation schemes and final recommendation differ. Particularly of note is the somewhat aggressive behaviour of not de-escalating when observing 2/3 observed DLT responses, and recommending a final dose combination with 4/9 observed DLT responses from both the SFD and calibrated prior BLRM. This behaviour, which could be considered unsafe, is not necessarily obvious from simulation results and underlines the importance of studying the individual escalations in an example case study. It is also important to consider that in practice, such a statistical approach is a guidance for dose recommendation that should be supported by an overall evaluation of the safety, pharmacokinetics and clinical rationale.

7. Discussion

This paper provides a review of a wide range of combination designs in phase I oncology, exploring the more recently proposed model-free designs in detail, as well as providing a novel approach for the calibration of such designs. The comprehensive simulation study we conduct suggests that model-free designs are competitive with the model-based designs in terms of the proportion of correct combinations selected. The operating characteristics of model-free designs in a number of scenarios suggest they offer a safer alternative. The case study example highlighted the key differences in how the methods explore the dosing grid given the same patient responses, with more aggressive approaches missing the lower doses, and conservative approaches missing the higher ones.

The discussed results depend upon the specification of the intervals for the BOIN, KEY and Waterfall designs, and the operational priors for the PIPE, SFD, POCRM, Riviere and BLRM designs, which were calibrated using a novel approach. This included calibrating the overdosing rules in each design to reduce the risk of recommending overly toxic combinations for phase II. Naturally, our work does not allow for comparison between designs when complete and reliable prior information on the toxicity of each drug is available. In practice, the PIPE, SFD, POCRM, Riviere and BLRM designs can exploit this prior knowledge to help the escalation process.

Figure 9.

An illustration of the number of patients treated at overly toxic combinations during trials in Scenarios 16-21 for each design. The rightmost group of bars show the means.

The calibration procedure, although novel in approach, is relatively straightforward to implement. It does however highlight the computational intensity of the different methods. Both the BLRM and SFD are very computationally intensive, with the calibration procedure taking substantially longer than for any of the other designs. It has shown great promise in specifying prior distributions that yield high PCS values, removing the subjectivity from the specification.

Moreover, our simulations do not allow for the early selection of an MTC. For example, if at least 9 patients are treated at a combination and the next cohort is recommended to be treated at this combination, then a trial could be stopped and this combination selected as the MTC. We acknowledge this rule is useful to reduce sample sizes, especially in scenarios where the true MTC is a low-dose combination. One of the advantages of model-based approaches is that they allow for selection of unplanned intermediate doses. This is an advantage that was not used in the simulation study, but must be considered in practice.

An additional area of interest for such dose-finding studies is the sample size and cohort size. Conducting a sensitivity analysis on both of these for each design would be an excellent opportunity to investigate whether designs can still achieve high PCS with fewer patients, or significantly higher PCS with extra patients, and whether a larger or smaller cohort size would lead to better exploration of the dosing grid.

Finally, we conclude this comparison with an overview of recommendations for the use of each design in the context of this work. The BOIN and KEY designs give a balanced approach, with a good level of PCS and PAS across a range of scenarios. Overly toxic explorations and selections are also well-balanced across scenarios. The PIPE design is more cautious in its selection, with a consistently low proportion of overly toxic selections, although at the cost of also recommending correct combinations a lower proportion of the time. The Surface Free design offers a high PCS and PAS and a generally low number of patients treated at overly toxic selections, but this must be balanced with the high proportion of overly toxic selections. The Waterfall design is most cautious in its allocation of patients, with a similar level of overly toxic recommendations as KEY and BOIN. However, the overall PCS and PAS are somewhat lower than the other designs. The model-based designs provide the most aggressive approach with a calibrated prior, with a large number of patients treated on overly toxic doses, however a good level of PCS and PAS. For the BLRM, with an alternative, intuitive prior, the number of overly toxic explorations is reduced, but at the cost of the high PCS values.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802231220497 - Supplemental material for A comparison of model-free phase I dose escalation designs for dual-agent combination therapies

Supplemental material, sj-pdf-1-smm-10.1177_09622802231220497 for A comparison of model-free phase I dose escalation designs for dual-agent combination therapies by Helen Barnett, Matthew George, Donia Skanji, Gaelle Saint-Hilary, Thomas Jaki and Pavel Mozgunov in Statistical Methods in Medical Research

Footnotes

Data Availability Statement

The data that supports the findings of this research are available in of this article, originally from Gandhi et al.,²⁵ with all other data simulated according to the specifications described.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Institute for Health Research (NIHR Advanced Fellowship, Dr Pavel Mozgunov, NIHR300576; and Prof Jaki’s Senior Research Fellowship, NIHR-SRF-2015-08-001) and by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health and Social Care (DHSC). T Jaki and P Mozgunov received funding from UK Medical Research Council (MC_UU_00002/14).

ORCID iDs

Helen Barnett

Thomas Jaki

Pavel Mozgunov

Supplemental material

Supplemental material for this article is available online.

References

Wong

Siah

. Estimation of clinical trial success rates and related parameters. Biostatistics 2019; 20: 273–286.

Wong

Capasso

Eckhardt

. The changing landscape of phase I trials in oncology. Nature Reviews Clinical oncology 2016; 13: 106–117.

Couzin-Frankel

. Cancer immunotherapy. Science 2013; 342: 1432–1433.

Sharma

Allison

. Immune checkpoint targeting in cancer therapy: toward combination strategies with curative potential. Cell 2015; 161: 205–214.

Fan

Venook

. Design issues in dose-finding phase I trials for combinations of two agents. J Biopharm Stat 2009; 19: 509–523.

Hamberg

Verweij

. Phase I drug combination trial design: Walking the tightrope. J Clin Oncol 2009; 27: 4441–4443. PMID: 19704054.

Ivanova

Wang

. A non-parametric approach to the design and analysis of two-dimensional dose-finding trials. Stat Med 2004; 23: 1861–1870.

Neuenschwander

Matano

Tang

, et al. A Bayesian industry approach to phase I combination trials in oncology. In Statistical methods in drug combination studies, pages 95–135. Chapman & Hall/CRC Press: Boca Raton, FL, 2015.

Thall

Millikan

Mueller

, et al. Dose-finding with two agents in phase I oncology trials. Biometrics 2003; 59: 487–496.

10.

Wages

Conaway

O’Quigley

. Continual reassessment method for partial ordering. Biometrics 2011; 67: 1555–1563.

11.

Riviere

M-K

Yuan

Dubois

, et al. A bayesian dose-finding design for drug combination clinical trials based on the logistic model. Pharm Stat 2014; 13: 247–257.

12.

Abbas

Rossoni

Jaki

, et al. A comparison of phase I dose-finding designs in clinical trials with monotonicity assumption violation. Clinical Trials 2020; 17: 522–534.

13.

Mozgunov

Jaki

. An information theoretic phase I–II design for molecularly targeted agents that does not require an assumption of monotonicity. J R Stat Soc Ser C Appl Stat 2019; 68: 347.

14.

Conaway

Petroni

. The impact of early-phase trial design in the drug development process. Clin Cancer Res 2019; 25: 819–827.

15.

Riviere

M-K

Dubois

Zohar

. Competing designs for drug combination in phase I dose-finding clinical trials. Stat Med 2015; 34: 1–12.

16.

Thall

Lee

S-J

. Practical model-based dose-finding in phase I clinical trials: methods based on toxicity. Int J Gynecol Cancer 2003; 13: 251–261.

17.

Riviere

M-K

Le Tourneau

Paoletti

, et al. Designs of drug-combination phase I trials in oncology: a systematic review of the literature. Ann Oncol 2015; 26: 669–674.

18.

Wages

Ivanova

Marchenko

. Practical designs for phase I combination studies in oncology. J Biopharm Stat 2016; 26: 150–166.

19.

Lin

Yin

. Bayesian optimal interval design for dose finding in drug-combination trials. Stat Methods Med Res 2017; 26: 2155–2167.

20.

Pan

Lin

Zhou

, et al. Keyboard design for phase I drug-combination trials. Contemp Clin Trials 2020; 92: 105972.

21.

Mozgunov

Gasparini

Jaki

. A surface-free design for phase I dual-agent combination trials. Stat Methods Med Res 2020; 29: 3093–3109.

22.

Mander

Sweeting

. A product of independent beta probabilities dose escalation design for dual-agent phase I trials. Stat Med 2015; 34: 1261–1276.

23.

Zhang

Yuan

. A practical Bayesian design to identify the maximum tolerated dose contour for drug combination trials. Stat Med 2016; 35: 4924–4936.

24.

Mozgunov

Paoletti

Jaki

. A benchmark for dose-finding studies with unknown ordering. Biostatistics 2022; 23: 721–737.

25.

Gandhi

Bahleda

Tolaney

, et al. Phase I study of neratinib in combination with temsirolimus in patients with human epidermal growth factor receptor 2-dependent and other solid tumors. J Clin Oncol 2014; 32: 68–75.

26.

Dykstra

Robertson

. An algorithm for isotonic regression for two or more independent variables. The Annals of Statistics 1982; 10: 708–716.

27.

Zhou

Lee

Wang

, et al. Incorporating historical information to improve phase I clinical trials. Pharm Stat 2021; 20: 1017–1034.

28.

Yan

Mandrekar

Yuan

. Keyboard: a novel Bayesian toxicity probability interval design for phase I clinical trials. Clin Cancer Res 2017; 23: 3994–4003.

29.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.

30.

O’Quigley

Conaway

. Continual reassessment and related dose-finding designs. Statistical Sci: Rev J Inst Math Stat 2010; 25: 202.

31.

Gasparini

. General classes of multiple binary regression models in dose finding problems for combination therapies. J R Stat Soc: Series C (Appl Stat) 2013; 62: 115–133.

32.

Mozgunov

Knight

Barnett

, et al. Using an interaction parameter in model-based phase i trials for combination treatments? a simulation study. Int J Environ Res Public Health 2021; 18: 345.

33.

Mozgunov

Jaki

Paoletti

. A benchmark for dose finding studies with continuous outcomes. Biostatistics 2020; 21: 189–201.

34.

Hirakawa

Wages

Sato

, et al. A comparative study of adaptive dose-finding designs for phase I oncology trials of combination therapies. Stat Med 2015; 34: 3194–3213.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.36 MB

		Temsirolimus
	Raw Trial Data	15 mg	25 mg	50 mg	75 mg
Neratinib	120 mg	0/2	0/4	1/5	0/4
	160mg	1/4	1/4	0/5	3/6
	200 mg	0/4	1/8	1/2
	240 mg	2/4
		Temsirolimus
	BOIN	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/0
	160 mg	1/6	0/6	3/6
	200 mg	2/12	2/3	0/0
		Temsirolimus
	KEY	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/0
	160 mg	1/6	0/6	3/6
	200 mg	2/9	3/6	0/0
		Temsirolimus
	SFD	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/6
	160 mg	1/6	0/6	4/9
	200 mg	0/3	2/3	0/0
		Temsirolimus
	PIPE	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	0/0
	160 mg	1/3	1/12	0/0
	200 mg	2/15	2/3	0/0
		Temsirolimus
	Waterfall	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/3	0/6
	160 mg	1/6	0/3	3/6
	200 mg	1/6	2/3	0/0
		Temsirolimus
	BLRM (c)	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/3	0/6
	160 mg	0/0	0/9	4/9
	200 mg	0/0	0/0	5/6
		Temsirolimus
	BLRM (a)	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/3	0/6
	160 mg	0/0	0/9	4/12
	200 mg	0/0	0/0	2/3
Neratinib	120 mg	0/3	1/6	4/12
	160 mg	0/3	1/6	0/0
	200 mg	0/0	2/6	0/0
		Temsirolimus
	Riviere	25 mg	50 mg	75 mg
Neratinib	120 mg	0/3	0/0	2/6
	160 mg	0/3	1/9	2/3
	200 mg	0/3	5/9	0/0

A comparison of model-free phase I dose escalation designs for dual-agent combination therapies

Abstract

Keywords

1. Introduction

2. Methodological review

2.1. Admissible combinations

3. Calibration of designs

3.3. Calibrating the keyboard design

3.4. Calibrating the surface-free design

4. Simulation study: 3 × 3 dosing grid

4.1. Setting

4.2. Model-based comparators

4.2.1. Bayesian logistic regression model

4.4. Results

4.4.1. Accuracy index and proportions of correct and acceptable selections

4.4.3. Number of patients treated at overly toxic combinations

5.1. Setting

Supplemental Material

sj-pdf-1-smm-10.1177_09622802231220497 - Supplemental material for A comparison of model-free phase I dose escalation designs for dual-agent combination therapies

Footnotes

Data Availability Statement

Declaration of conflicting interests

Funding

ORCID iDs

Supplemental material

References

Supplementary Material

4. Simulation study: $3 \times 3$ dosing grid