Sage Journals: Discover world-class research

Abstract

We introduce a new multiple type I error criterion for clinical trials with multiple, overlapping populations. Such trials are of interest in precision medicine where the goal is to develop treatments that are targeted to specific sub-populations defined by genetic and/or clinical biomarkers. The new criterion is based on the observation that not all type I errors are relevant to all patients in the overall population. If disjoint sub-populations are considered, no multiplicity adjustment appears necessary, since a claim in one sub-population does not affect patients in the other ones. For intersecting sub-populations we suggest to control the average multiple type I error rate, i.e. the probability that a randomly selected patient will be exposed to an inefficient treatment. We call this the population-wise error rate, exemplify it by a number of examples and illustrate how to control it with an adjustment of critical boundaries or adjusted $p$ -values. We furthermore define corresponding simultaneous confidence intervals. We finally illustrate the power gain achieved by passing from family-wise to population-wise error rate control with two simple examples and a recently suggested multiple-testing approach for umbrella trials.

Keywords

Enrichment designs family-wise error rate multiple testing platform trials population-wise error rate umbrella trials

1 Introduction

The aim of precision medicine is to provide each patient with an optimal treatment tailored to his or her genetic and/or clinical profile. One strategy for reaching this goal is to undertake trials where one or several treatments are investigated in multiple sub-populations. Examples of such trials are umbrella and basket trials in oncology. In an umbrella trial patients with the same cancer type but different molecular alterations are enrolled and the treatments are tailored to the specific target sub-populations. In a basket trial, patients with different cancer types but one common molecular alteration are enrolled with the aim to study one specific treatment that is targeted to the common alteration (see e.g., Woodcock and LaVange,¹ Strzebonska and Waligora²). In many cases, the target or sub-populations are disjoint by nature, but when many different biomarkers or cancer types are used, it can also occur that patients belong to more than one sub-population. For example, in the FOCUS4 study,³ biomarker tests were conducted to define subgroups based on the mutations present in the patients’ tumour DNA. Some patients belonged to more than one subgroup and thus the subgroups were made disjoint by means of a hierarchical ordering structure defined for the different mutations. Another example are subgroup selection and adaptive enrichment designs (e.g. Brannath et al.,⁴ Glimm and Di Scala,⁵ Wassmer and Brannath,⁶ Stallard et al.⁷) where a single treatment is tested in an overall patient population and a specific biomarker subgroup of the overall patient population, for which earlier studies indicate that the treatment may be more or even only effective. In this manuscript, we explicitly allow biomarker-defined sub-populations to be overlapping such that patients become eligible for multiple targeted treatment strategies including the case of a single treatment with multiple target populations. This means that, by the trial’s result, future patients of the overlap may be exposed to more than a single test decision on potentially inefficient treatment strategies. Moreover, for such studies suitable allocation procedures have to be defined. Issues of eligibility for multiple target therapies have been addressed e.g. in Malik et al.,⁸ Collignon et al.⁹ and Kesselmeier et al.¹⁰

In confirmatory clinical trials with tests of several hypotheses the multiple type I error is usually kept small by controlling the family-wise error rate (FWER). With the growing effort of detecting new and more predictive biomarkers and an increasing focus on rare diseases, it is becoming more and more difficult to undertake clinical trials that are sufficiently powered and also provide sufficient control of type I errors. Since the control of multiple type I errors amplifies this issue, more efficient alternatives to the common approach of FWER control are of strong interest. If a treatment or a treatment strategy is tested in several disjoint populations and each population is affected by only a single hypothesis test, the overall study basically consists of separate trials that merely share the same infrastructure. Therefore, no multiplicity adjustments are needed (e.g. Glimm and Di Scala,⁵ Collignon et al.⁹). However, if some sub-populations are overlapping, these intersections will contain patients that are possibly exposed to multiple erroneously rejected null hypotheses, implying that one has to adjust for multiplicity (e.g. Collignon et al.⁹). Since only patients in the intersections are concerned with this multiplicity issue, there is no need for adjustments for patients in the complements, who can only be affected by at most one false rejection of a null hypothesis. The FWER would therefore be too conservative also in this case. Especially for small and/or highly stratified populations, as for instance encountered in paediatric oncology, a more efficient approach is desirable (e.g. Fletcher et al.¹¹). The purpose of this manuscript is to propose a new concept of multiple type I error control that is less conservative. With this new error rate, which we name population-wise error rate (PWER), we aim to keep the average multiple type I error rate at a reasonable level. This provides control of the probability that a randomly chosen future patient will be exposed to an inefficient treatment policy.

The paper is outlined as follows. First, the PWER is motivated by means of a simple example, followed by the general mathematical definition. Then, we demonstrate how to control the PWER at a pre-specified level by adjusting critical boundaries or $p$ -values. In Section 4, we present a mathematical result (with a limit) for the strata-wise FWER when adjusting the boundary for PWER control. In the subsequent section, the gain in power by using PWER- instead of FWER-control is illustrated by two examples. In the first example we will investigate the case of two overlapping populations and with (i) two different treatments in each population and (ii) the same treatment in both populations. The second example consists of an application of the PWER to a multiple testing approach for umbrella trials suggested in Sun et al.¹² In Section 6 we extend the multiple test with PWER-control to simultaneous confidence intervals (SCIs) and discuss their coverage properties. The paper concludes with a discussion in Section 7. All computations and simulations are done in R. The corresponding R-script files are all available at https://github.com/chillner/RCode_Paper_PWER.

2 The population-wise error rate

In this section, the aforementioned PWER is introduced conceptually and formally. Examples for different settings are given to further deepen the understanding.

2.1 General framework and definition

Consider an overall population $P$ consisting of $m \geq 2$ possibly overlapping sub-populations $P_{1}, \dots, P_{m} \subseteq P$ and suppose that we want to investigate for all $P_{i}$ a treatment $T_{i}$ in comparison to a control treatment. The population $P_{i}$ is defined by specific inclusion and exclusion criteria which may include specific biomarker characteristics. In trials on personalized treatments, the appropriate inclusion and exclusion criteria are often a research question itself and then the same treatment is investigated in several different populations simultaneously. This is the case, for instance, in basket and subgroup selection trials. In other trial examples, like umbrella trials, different experimental treatments are investigated in the different sub-populations. In the sequel we call the tuples $(P_{i}, T_{i})$ the treatment policies where we allow treatments to be different or equal to each other, so that there can be some $T_{i}$ and $T_{j}$ with $i \neq̸ j$ such that $T_{i} = T_{j}$ . To each treatment policy ( $P_{i}, T_{i}$ ) we assign the null hypothesis $H_{i} : θ_{i} \leq 0$ , where $θ_{i} = θ (P_{i}, T_{i})$ quantifies the efficacy of treatment $T_{i}$ in comparison to the control treatment in population $P_{i}$ . Also the control treatments can be the same or can differ between the different populations. The PWER is then given by the risk for a randomly chosen future patient to be assigned to one or more inefficient treatment policies, i.e. to belong to at least one tuple $(P_{i}, T_{i})$ where for the true effect $θ_{i} \leq 0$ but $H_{i}$ has been rejected. This future patient is imagined to be drawn after the study from the same population the study sample was drawn from.

In order to define the PWER mathematically, we partition the overall population into disjoint strata $P_{J} : = ⋂_{j \in J} P_{j} ∖ ⋃_{k \in I ∖ J} P_{k}$ for $J \subseteq I : = {1, \dots, m}$ . In Figure 1 we see an example for such a partition based on three sub-populations $P_{i}$ , $i = 1, 2, 3$ . Note that $P_{{1, 2, 3}} = \emptyset$ . For each non-empty stratum $P_{J}$ , we denote its proportion within the overall patient population $P$ by $π_{J}$ such that $\sum_{J \subseteq I, P_{J} \neq \emptyset} π_{J} = 1$ . In most parts of the paper we will assume that the relative prevalences $π_{J}$ are known. In practical cases, they will have to be estimated. The effect of estimation will be discussed in the Section 5.3. For any future patient in $P_{J}$ , $J \subseteq I$ , we commit a type I error if he/she belongs to at least one $(P_{i}, T_{i})$ with $i \in J$ and $θ_{i} \leq 0$ for which $H_{i}$ has been rejected. The PWER is then defined as

PWER = \sum_{J \subseteq I, P_{J} \neq \emptyset} π_{J} P (falsely reject any H_{i} with i \in J)

(1)To determine the PWER, we need to know for each stratum

P_{J}

the probability of rejecting at least one true null hypothesis that affects the stratum.

Figure 1.

$m = 3$ populations and their disjoint sub-populations.

Compared to the FWER, which controls the maximum risk for future patients to be assigned to an inefficient treatment strategy, the PWER is an average risk. It is more liberal and thereby more powerful, because

\begin{aligned} PWER & = \sum_{J \subseteq I, P_{J} \neq \emptyset} π_{J} P (falsely reject any H_{j} with j \in J) \\ \leq (\sum_{J \subseteq I, P_{J} \neq \emptyset} π_{J}) P (falsely reject any H_{i} for i \in I) = FWER \end{aligned}

Note that

PWER = FWER

only in the case where

P = P_{1} = \dots = P_{m}

2.2 Two intersecting populations

As an example consider a trial with two intersecting sub-populations $P_{1}$ and $P_{2}$ and two treatments $T_{1}$ , $T_{2}$ to be tested by means of the hypotheses $H_{1} : θ (P_{1}, T_{1}) \leq 0$ and $H_{2} : θ (P_{2}, T_{2}) \leq 0$ . Usually, the two treatments will be compared to the same control, however, the basic idea given in the subsequent sections also applies with treatment-specific controls. As illustrated in the left panel of Figure 2, the overall population can be partitioned into three disjoint sub-populations, $P_{{1}} : = P_{1} ∖ P_{2}$ , $P_{{2}} : = P_{2} ∖ P_{1}$ and $P_{{1, 2}} : = P_{1} \cap P_{2}$ . Obviously, we commit a type I error for $P_{{i}}$ whenever $H_{i}$ is falsely rejected, $i = 1, 2$ , and for $P_{{1, 2}}$ whenever $H_{1}$ or $H_{2}$ are falsely rejected. Hence, if $H_{1}$ and $H_{2}$ are both true, then

PWER = π_{{1}} P (reject H_{1}) + π_{{2}} P (reject H_{2}) + π_{{1, 2}} P (reject H_{1} or H_{2})

(2)If

H_{1}

is true and

H_{2}

is false, then

PWER = π_{{1}} P (reject H_{1}) + π_{{1, 2}} P (reject H_{1}) = (π_{{1}} + π_{{1, 2}}) P (reject H_{1})

. Hence, if only one null hypothesis is true, say

H_{i}

, the PWER reduces to the probability of rejecting

H_{i}

multiplied by the relative size of the population

P_{i}

Figure 2.

Left panel: $m = 2$ intersecting populations. Right panel: $m = 3$ nested populations.

In situations where a single treatment $T = T_{1} = T_{2}$ is tested in both sub-populations $P_{1}$ and $P_{2}$ , the specific case in which only one hypothesis $H_{i}$ is rejected deserves some discussion. In clinical trials the final decisions for future patients usually depend on additional analyses that explore efficacy and safety across different populations and strata. Rejection of $H_{i}$ is then a necessary but not a sufficient condition for the decision to treat future patients from $P_{i}$ with treatment $T$ . The PWER is then under control in any case, namely even in the worst-case scenario where all future patients of a population will receive treatment $T$ even though $T$ is inefficient for the other, intersecting population.

2.3 Nested populations

In practice, one often faces the problem of nested populations $P_{1} \supset P_{2} \supset \dots \supset P_{m}$ , as in the right panel of Figure 2. Think of a situation where the optimal eligibility criteria for a single treatment are unknown and thus a sequence of tightening eligibility criteria for testing the efficacy of that treatment is planned which then ultimately leads to a sequence of nested populations. Define the strata $P_{[i]} : = P_{{1, \dots, i}}$ for $i \leq m$ , which are $P_{[i]} = P_{i} ∖ P_{i + 1}$ for $i < m$ and $P_{[i]} = P_{i}$ for $i = m$ . We commit a type I error for $P_{[i]}$ whenever any true $H_{j}$ is rejected for $j \leq i$ . With relative prevalences $π_{[i]} : = π_{{1, \dots, i}}$ of $P_{[i]}$ the PWER under the global null hypothesis is given by

PWER = \sum_{i = 1}^{m} π_{[i]} P (reject at least one true H_{j} for j \leq i)

Especially, if

P_{i}

is defined by a biomarker

X

, i.e.

P_{i} = {X > t_{i}}

for cut-off points

t_{i}

i = 1, \dots, m + 1

(with

t_{m + 1} : = \infty

), the PWER under global null hypothesis can be written as

PWER = \sum_{i = 1}^{m} P (t_{i} < X \leq t_{i + 1}) P (reject at least one true H_{j} for j \leq i)

2.4 Three populations with two intersections

Finally, we give an example where the FWER is strictly conservative even for control of the maximum (instead of the average) type I error rate. Consider three populations $P_{1}$ , $P_{2}$ , $P_{3}$ with $P_{1} \cap P_{2} \neq \emptyset$ , $P_{2} \cap P_{3} \neq \emptyset$ and $P_{1} \cap P_{3} = \emptyset$ , as in Figure 1. Again, hypotheses of the form $H_{i} : θ (P_{i}, T_{i}) \leq 0$ are to be tested in each population, respectively. Under the global null hypothesis, where all null hypotheses $H_{i}$ are true, the PWER is given by

PWER = \sum_{i = 1}^{3} π_{{i}} P (reject H_{i}) + \sum_{i = 1}^{2} π_{{i, i + 1}} P (reject H_{i} or H_{i + 1})

The FWER under the global null hypothesis equals

P (reject H_{1} or H_{2} or H_{3})

. Since no patient can belong to

P_{1}

and

P_{3}

simultaneously, the FWER corrects for a multiplicity no patient is affected by.

3 Control of the PWER

In this section, we demonstrate how to achieve control of the PWER at a pre-specified level $α$ under the general framework of Section 2.1. Suppose that each $H_{i}$ can be tested with a test statistic $Z_{i}$ where larger values of $Z_{i}$ provide evidence against $H_{i}$ . We assume further that the joint distribution of ${Z_{i}}_{i = 1}^{m}$ is known (at least asymptotically). In order to control the PWER at a pre-specified significance level $α \in (0, 1)$ , we need to find the smallest critical value $c^{*} \in R$ such that

{PWER}_{θ^{*}} = \sum_{J \subseteq I} π_{J} P_{θ^{*}} (⋃_{i \in J \cap I (θ^{*})} {Z_{i} \geq c^{*}}) \leq α

(3)where

θ^{*} = (θ_{1}^{*}, \dots, θ_{m}^{*})

is the parameter configuration that maximizes the PWER and

I (θ^{*}) = {i \in I : θ_{i}^{*} \in H_{i}}

is the index set of corresponding true null hypotheses. The maximal PWER is usually obtained under the global null hypothesis, i.e. for

θ^{*} = (0, \dots, 0)

. This is e.g. the case under the subset pivotality condition (see e.g. Westfall and Young,¹³ Dickhaus¹⁴) which applies, for instance, to contrast

z

- or

t

-statistics for normal data with a variance that is homogeneous across treatment groups and population strata. Since the (asymptotic) correlations between the test statistics usually depend only on the relative prevalences

π_{J}

J \subseteq I

, the PWER-level can be exhausted under

θ^{*}

. When each

H_{i}

is tested by means of a p-value

p_{i}

, we can reach

PWER \leq α

by choice of an adjusted significance level

α^{*}

applied to all

p_{i}

. The critical value

c^{*}

in (3) or adjusted significance level

α^{*}

can be found by applying a univariate root finding method. Because the PWER is always bounded by the FWER, the critical value and adjusted significance level are more liberal than for FWER-control. Therefore the PWER leads to a higher power and a lower sample size to achieve a certain power.

Instead of determining the critical value $c^{*}$ we could report the PWER-adjusted $p$ -values

p_{j}^{P W E R} = \sum_{J \subseteq I} π_{J} P_{θ^{*}} (⋃_{i \in J \cap I (θ^{*})} {Z_{i} \geq z_{j}^{obs}}), j = 1, \dots, m

(4)where

z_{j}^{obs}

is the observed value of

Z_{j}

. Obviously,

p_{j}^{P W E R} \leq α

if an only if

z_{j}^{obs} \geq c^{*}

and hence

H_{j}

can alternatively be tested with the PWER-adjusted p-value

p_{j}^{P W E R}

. Furthermore,

p_{j}^{P W E R}

gives the smallest PWER-level the hypothesis

H_{j}

can be rejected with.

Note that we could control the PWER also with population-specific critical values $c_{i}^{*}$ (or adjusted levels $α_{i}^{*}$ ). Unique solutions for $c_{i}^{*}$ can be obtained by setting $c_{i}^{*} = w_{i} c^{*}$ for pre-specified weights $w_{i} > 0$ and searching for the $c^{*}$ that meets the pre-specified PWER-level. Multiplicity adjusted $p$ -values can also be calculated with the weights $w_{i}$ . The weights may, for instance, be larger for smaller populations $P_{i}$ in order to increase the chance of finding efficient treatment policies for small sub-populations. However, due to the weighting by $π_{J}$ in definition (1) and expression (3), the multiple type I error rate $P_{θ^{*}} (⋃_{i \in J \cap I (θ^{*})} {Z_{i} \geq c^{*}})$ for $P_{J}$ will automatically be larger for smaller $π_{J}$ . We will therefore only consider equal critical values $c_{i}^{*} = c^{*}$ in our examples below. This has the additional advantage that, when all test statistics have the same marginal null distribution, the critical value $c^{*}$ cannot fall short of the $(1 - α)$ -quantile of the marginal null distribution (see Appendix 1). We further note that due to the natural heterogeneity of the treatment effects within each sub-population, it would be advisable to apply a stratified test to investigate a treatment in each $P_{j}$ (as done in Section 5).

4 Behaviour of the strata-wise FWER under PWER-control

In order to understand the consequences of PWER-control it is of interest to investigate the behaviour of the FWERs for each stratum $P_{J}$ , i.e. the risk for future patients in $P_{J}$ to be exposed to at least one inefficient treatment strategy. Obviously, under PWER-control the strata-wise FWER for each $P_{J}$ is bounded by $α / π_{J}$ . This bound is useful only for sufficiently large $π_{J}$ , because there is another bound, which is independent from $π_{J}$ : As mentioned in the last section, in the common situation when all test statistics have the same marginal null distribution under $θ^{*}$ , the common critical value $c^{*}$ is bounded from below by the $(1 - α)$ -quantile $z_{α}$ of the marginal null distribution. Therefore, the strata-wise FWER for $P_{J}$ is at most ${\tilde{α}}_{J} : = P_{θ^{*}} (max_{j \in J} Z_{j} \geq z_{α})$ , see Appendix 1. Note that ${\tilde{α}}_{J}$ is independent of the prevalences because the defining probability is conditional on the sample sizes. Moreover, ${\tilde{α}}_{J}$ is at most $\min (| J | α, 1)$ by the Bonferroni inequality. The bound $α / π_{J}$ becomes useless latest when it is above the Bonferroni bound which is equivalent to $π_{J} > α / {\tilde{α}}_{J} \geq max (1 / | J |, α)$ . The following theorem gives an approximation for small $π_{J}$ which is often smaller than ${\tilde{α}}_{J}$ , but depends on the prevalences ${\tilde{π}}_{J^{'}}$ of $P_{J^{'}}$ for $J^{'} \neq̸ J$ relative to the complement $P ∖ P_{J}$ . For this theorem we need to define for $J \subseteq I$ the PWER that we obtain when removing $P_{J}$ from the entire population:

{c P W E R}_{J} (c) = \sum_{J^{'} \subseteq I, J^{'} \neq̸ J} {\tilde{π}}_{J^{'}} P_{θ^{*}} (max_{j \in J^{'} \cap I (θ^{*})} Z_{j} \geq c) where {\tilde{π}}_{J^{'}} = π_{J^{'}} / (1 - π_{J})

(5)Accordingly, we call

{c P W E R}_{J} (c)

the complementary PWER of

P_{J}

. Note that the sum in (5) is a weighted average of all non-

J

terms where the weights are the prevalence

{\tilde{π}}_{J^{'}}

P_{J^{'}}

for

J^{'} \neq̸ J

, relative to

P ∖ P_{J}

Theorem. Assume that for all $J \subseteq I$ the strata-wise ${FWER}_{J} (c) : = P_{θ^{*}} (max_{j \in J \cap I (θ^{*})} Z_{j} \geq c)$ is strictly decreasing and differentiable as a function in $c$ . Then ${PWER}_{θ^{*}} (c)$ and ${cPWER}_{J} (c)$ are differentiable and strictly decreasing in $c$ as well and we find $c^{*}, d_{J}^{*}$ such that ${PWER}_{θ^{*}} (c^{*}) = {cPWER}_{J} (d_{J}^{*}) = α$ . With this we get

{FWER}_{J} (c^{*}) = {FWER}_{J} (d_{J}^{*}) + O (π_{J})

(6)where

O (π_{J})

converges linearly with

π_{J}

to zero while all

{\tilde{π}}_{J^{'}}

J^{'} \neq̸ J

, remain constant. Furthermore, when

{FWER}_{J} (c) \geq {cPWER}_{J} (c)

for all

c

, then

{FWER}_{J} (c^{*}) \leq {FWER}_{J} (d_{J}^{*})

for all

π_{J}

Note that the last statement follows for instance when ${FWER}_{J} (c) = max_{J^{'} \subseteq I} {FWER}_{J^{'}} (c)$ for all $c$ , in particular, for $J = I$ when $π_{I} > 0$ . The proof of the theorem can be found in Appendix 2. Its conditions are weak: The strict monotonicity and differentiability of the strata-wise FWER are satisfied when the joint distribution of $Z_{1}, \dots, Z_{m}$ is continuous. This is often the case, at least asymptotically.

We illustrate the approximation (6) with the examples in Figure 2, starting with the left side, i.e. two intersecting populations. In this case ${FWER}_{{1, 2}} (c) \geq {cPWER}_{{1, 2}} (c) = {\tilde{π}}_{{1}} P_{θ^{*}} (Z_{1} \geq c) + {\tilde{π}}_{{2}} P_{θ^{*}} (Z_{2} \geq c)$ for all $c$ and any ${\tilde{π}}_{{i}} = π_{{i}} / (π_{{1}} + π_{{2}})$ and thereby the approximation ${FWER}_{{1, 2}} (d_{{1, 2}}^{*})$ is an upper bound for the maximal strata-wise family wise error rate ${FWER}_{{1, 2}} (c^{*})$ . If $Z_{1}, Z_{2}$ have the same marginal null distribution, then $d_{{1, 2}}^{*}$ is the $(1 - α)$ -quantile of this distribution, e.g. $Φ^{- 1} (1 - α)$ when $Z_{1}, Z_{2}$ are standard normal under $θ^{*}$ . The upper bound ${FWER}_{{1, 2}} (d_{{1, 2}}^{*})$ then equals the FWER with two un-adjusted tests that are always below $2 α$ and even smaller if we account for the positive correlation between $Z_{1}$ and $Z_{2}$ .

For three nested populations (right side of Figure 2) the maximal strata-wise FWER is, according to the theorem, given by ${FWER}_{{1, 2, 3}} (d^{*})$ where $d^{*}$ solves

{c P W E R}_{{1, 2, 3}} (d^{*}) = {\tilde{π}}_{{1}} P_{θ^{*}} (Z_{1} \geq d^{*}) + {\tilde{π}}_{{1, 2}} P_{θ^{*}} (max (Z_{1}, Z_{2}) \geq d^{*}) = α

with

{\tilde{π}}_{J} = π_{J} / (1 - π_{{1, 2, 3}})

for

J = {1}

and

J = {1, 2}

. The threshold

d^{*}

depends on the relative prevalences

{\tilde{π}}_{1}

and

{\tilde{π}}_{{1, 2}}

and the correlation between

Z_{1}

and

Z_{2}

, whereby the latter is determined by the sample sizes and the situation whether the same or different treatments are tested in the populations (see Section 5.2 below). For a numerical illustration, we assume the nominal level

α = 0.025

, a single treatment and prevalences

π_{{1}} = 0.75

π_{{1, 2}} = 0.24

and

π_{{1, 2, 3}} = 0.01

as well as standard normally distributed endpoints with test statistics

Z_{j}

j = 1, 2, 3

. Assuming balanced treatment groups in each stratum and that the sample sizes perfectly match the population prevalences, we obtain the correlations

Corr (Z_{1}, Z_{2}) = 0.5

Corr (Z_{1}, Z_{3}) = 0.1

and

Corr (Z_{2}, Z_{3}) = 0.2

. The critical value is then

d^{*} = 2.037

and the upper bound

{cPWER}_{{1, 2, 3}} (d^{*})

for the strata-wise FWER for

P_{{1, 2, 3}}

becomes

0.0572

. Since this bound results from taking the limit for

π_{{1, 2, 3}} ↓ 0

(while keeping the sample sizes and complementary conditional prevalences fixed), the maximal strata-wise FWER with prevalence

π_{{1, 2, 3}} = 0.01

is somewhat smaller, namely

0.0566

5 Comparison with FWER-controlling procedures

Due to the PWER being more liberal than the FWER, the next naturally arising question is how much this affects power and sample size. We will at first compare PWER-control with FWER-control for two intersecting sub-populations when (i) the same and (ii) two different treatments are investigated in each sub-population. Secondly, we will apply PWER-control to the multiple testing approach for umbrella trials considered in Sun et al.¹² and compare it to the originally suggested FWER-control.

5.1 Combination of independent samples

We start with a hypothetical, but statistically simple situation. Assume that a treatment $T$ is investigated for two intersecting populations $P_{i}$ , $i = 1, 2$ , that are defined by two different biomarkers. Assume further that a sponsor has decided to test the effect of $T$ for the two biomarker-positive groups in two disjoint samples within a single clinical trial. We consider here the one-sided hypotheses $H_{i} : θ_{i} = θ (P_{i}, T) \leq 0$ for the efficacy of $T$ in $P_{i}$ against a control, where $θ_{i} = (π_{{i}} θ_{{i}} + π_{{1, 2}} θ_{{1, 2}}) / π_{i}$ (with $π_{i} = π_{{i}} + π_{{1, 2}}$ ) is given as a weighted mean of the unknown effects in the respective strata $P_{{i}}$ and $P_{{1, 2}}$ . Since the analysis of the two samples is done in a single study, regulatory authorities may require multiple testing adjustment. Let us assume that PWER-control is accepted as a compromise between control of the FWER and the unadjusted testing, the latter being the case when submitting two different studies. PWER-control bounds the overall probability for a future patient to be exposed to an inefficient treatment strategy.

Since the two treatment strategies $(P_{i}, T)$ , $i = 1, 2$ , are investigated in two independent samples, the corresponding test statistics $Z_{i}$ are stochastically independent. Let us further assume that both $Z_{i}$ are normally distributed with variance 1. The question is now, what we gain in terms of power by switching from FWER- to PWER-control. We will assume an overlap between the two populations $P_{1}$ and $P_{2}$ of probability $π_{{1, 2}}$ that will be varied in our investigation.

Let $Φ$ and $Φ^{- 1}$ be the standard normal distribution and quantile functions, respectively. By the independence of the test statistics, the $FWER = 1 - Φ (c_{F}^{*})^{2}$ is controlled at $α$ by Šidák's critical value $c_{F}^{*} = Φ^{- 1} (\sqrt{1 - α})$ . Following the example in Section 2.2, the PWER is given by

PWER = (1 - π_{{1, 2}}) {1 - Φ (c_{P}^{*})} + π_{{1, 2}} {1 - Φ (c_{P}^{*})^{2}}

where

c_{P}^{*}

is the critical value used for control of the PWER at level

α

. Note that

π_{{1, 2}}

determines how much multiplicity adjustment is needed for PWER-control. Solving

PWER = α

yields

c_{P}^{*} = Φ^{- 1} (\frac{- (1 - π_{{1, 2}}) + \sqrt{(1 - π_{{1, 2}})^{2} + 4 π_{{1, 2}} (1 - α)}}{2 π_{{1, 2}}});

(7)see Appendix 3 for the derivation. For

π_{{1, 2}} ↓ 0

this critical value decreases to

Φ^{- 1} (1 - α)

coinciding with the unadjusted case and for

π_{{1, 2}} ↑ 1

we have

c_{P}^{*} ↑ c_{F}^{*}

To assess the power gain by using PWER- instead of FWER-control, we consider the factor of sample size increase with PWER- or FWER-control in comparison to the one with no multiplicity correction. Aiming for a marginal power of at least $1 - β$ , the sample size for each population $P_{j}$ has to be at least $n_{c} \geq (Φ^{- 1} (1 - β) + c)^{2} / δ_{j}^{2}$ with critical value $c$ and non-centrality parameter $δ_{j}$ in $P_{j}$ . The fractions

q_{α} (c) : = \frac{n_{c}}{n_{Φ^{- 1} (1 - α)}} = {(\frac{Φ^{- 1} (1 - β) + c}{Φ^{- 1} (1 - β) + Φ^{- 1} (1 - α)})}^{2} for c \in {c_{P}^{*}, c_{F}^{*}}

(8)describe how much more sample size one would need for a marginal power of

1 - β

when the multiplicity adjustments are performed.

Figure 3 shows $q_{α} (c)$ for $α = 0.025$ and $β = 0.2$ depending on the size $π_{{1, 2}}$ of $P_{1} \cap P_{2}$ when both populations are assumed to be of equal size. FWER-control requires an increase in sample size of about 21% while PWER-control requires considerably less depending on $π_{{1, 2}}$ . The larger the intersection, the more patients are exposed under the global null hypothesis to two false rejections, therefore the critical value increases and the sample size increases as well. At $π_{{1, 2}} = 1$ , PWER and FWER coincide and so do the sample sizes. If, for instance, the intersection makes up $40 %$ of the union of the two populations only around $10 %$ sample size increase is needed when using PWER-control, less than half of what is necessary with FWER-control.

Figure 3.

Factor of sample size increase compared to the unadjusted case to achieve a marginal power of $1 - β = 80 %$ with PWER- and FWER-control at $α = 0.025$ in a combination of two independent studies with different but overlapping populations. PWER: population-wise error rate; FWER: family-wise error rate.

5.2 Testing population specific effects in one study.

We consider now a single study with two overlapping populations $P_{i}$ , $i = 1, 2$ , for each of which a treatment $T_{i}$ is compared to a common control $C$ . We will investigate two possible scenarios, namely (i) $T_{1} \neq̸ T_{2}$ and (ii) $T_{1} = T_{2}$ . For simplicity, we assume again that both populations have the same size, i.e. $π_{{1}} = π_{{2}}$ . We assume further that the data from each population are normally distributed with mean treatment difference $θ_{i}$ and a common known variance $σ^{2}$ (across treatments and subgroups $P_{J}$ , $J \subseteq {1, 2}$ ) and z-tests are used to test $H_{i} : θ_{i} \leq 0$ . For $J \subseteq {1, 2}$ , we denote by $n_{J} = N \cdot π_{J}$ the sample size in $P_{J}$ and by $N = \sum_{J \subseteq {1, 2}} n_{J}$ the overall total sample size.

(i) Unequal treatments. In scenario (i) we have to think of a way to randomize patients to either treatment or control. In the complements $P_{{i}}$ we simply apply 1:1 randomization to treatment $T_{i}$ or control $C$ . In the intersection $P_{{1, 2}}$ we apply 1:1:1 randomization to the three groups $T_{1}, T_{2}$ and $C$ . By this we can assume that in $P_{{i}}$ there are $n_{{i}} / 2$ patients in the treatment and control group, whereas in the intersection there are $n_{{1, 2}} / 3$ patients in each group.

Obviously, this type of allocation leads to an inconsistency between the sample sizes and prevalences. Say $P_{1}$ has a prevalence of $π_{1} = π_{{1}} + π_{{1, 2}} = 100 / 170 = 0.59$ and of 100 patients in $P_{1}$ , 70 belong to $P_{{1}}$ and 30 to $P_{{1, 2}}$ . However, applying the above allocation rule implies that $35 / 45 \approx 77.7 %$ of the patients sampled from $P_{1}$ and assigned to treatment $T_{1}$ belong to $P_{{1}}$ . This means that the proportions of the strata-wise sample sizes within a treatment group do not match their corresponding proportions in the population. Hence, the population-wise means must be estimated by a weighted sum of strata-wise means:

{\hat{x}}_{i, G_{i}} = (\frac{π_{{i}}}{π_{{i}} + π_{{1, 2}}}) {\bar{x}}_{{i}, G_{i}} + (\frac{π_{{1, 2}}}{π_{{i}} + π_{{1, 2}}}) {\bar{x}}_{{1, 2}, G_{i}}, G_{i} \in {T_{i}, C}

where

{\bar{x}}_{J, G_{i}}

is the mean response in strata

P_{J}

J \subseteq {1, 2}

, under treatment

G_{i}

. In the above example, we would need to compute

{\hat{x}}_{1, T_{1}} = 0.7 \cdot {\bar{x}}_{{1}, T_{1}} + 0.3 \cdot {\bar{x}}_{{1, 2}, T_{1}}

for treatment

T_{1}

The $z$ -test statistic is finally given by $Z_{i} = ({\hat{x}}_{i, T_{i}} - {\hat{x}}_{i, C}) / \sqrt{Var ({\hat{x}}_{i, T_{i}} - {\hat{x}}_{i, C})}$ . Since in the intersection $P_{{1, 2}}$ the same control group is used for both test statistics, they are positively correlated. Assuming $π_{{1}} = π_{{2}}$ , we obtain $Corr (Z_{1}, Z_{2}) = (3 / 2) π_{{1, 2}} / (1 + 2 π_{{1, 2}})$ . The calculation of this correlation and an expression for the variance $Var ({\hat{x}}_{i, T_{i}} - {\hat{x}}_{i, C})$ can be found in Appendix 4.

(ii) Equal treatments. In scenario (ii), we investigate one and the same treatment $T_{1} = T_{2} = T$ in both populations and apply the 1:1 randomization to every stratum. By this we can use for $H_{i}$ the test statistic $Z_{i} = ({\bar{x}}_{T_{i}} - {\bar{x}}_{C}) / (2 σ / \sqrt{n_{{i}} + n_{{1, 2}}})$ . Because we are using the same treatment in both populations, we expect a higher correlation between $Z_{1}$ and $Z_{2}$ . Indeed, for $π_{{1}} = π_{{2}}$ the correlation is equal to $Corr (Z_{1}, Z_{2}) = 2 π_{{1, 2}} / (1 + π_{{1, 2}})$ which is greater or equal to the correlation with different treatments for all $π_{{1, 2}} \in [0, 1]$ ; see Appendix 4.

For both scenarios, we intend to find critical values to control the PWER and FWER, respectively. Following Section 2.2, the PWER under the global null is given by

\begin{aligned} PWER & = π_{{1}} P ({Z_{1} \geq c_{P}^{*}}) + π_{{2}} P ({Z_{2} \geq c_{P}^{*}}) + π_{{1, 2}} P ({Z_{1} \geq c_{P}^{*}} \cup {Z_{2} \geq c_{P}^{*}}) \\ = (1 - π_{{1, 2}}) {1 - Φ (c_{P}^{*})} + π_{{1, 2}} {1 - Φ_{ρ} (c_{P}^{*}, c_{P}^{*})} \end{aligned}

(9)with

c_{P}^{*}

being the critical value that is to be found, and

Φ_{ρ}

is the cumulative distribution function of the bivariate normal distribution with standard normal marginals and correlation

ρ

. A univariate root finding algorithm can now be used to solve

PWER = α

for

c_{P}^{*}

As an example, suppose we are in scenario (i) (multiple treatments) with $π_{{1}} = π_{{2}} = 0.4$ , $π_{{1, 2}} = 0.2$ , $β = 0.2$ and $α = 0.025$ . Then we have $ρ = Corr (Z_{1}, Z_{2}) \approx 0.01$ . We solve $FWER = 1 - Φ_{ρ} (c_{F}^{*}, c_{F}^{*}) = α$ to obtain $c_{F}^{*} \approx 2.23$ and $PWER = α$ to obtain $c_{P}^{*} \approx 2.03$ . Using (8), this yields a sample size increase of around 20% for the FWER and only an increase of 5% for the PWER.

Figure 4 shows graphs of sample size increases for both scenarios and both types of multiple error control in dependence of $π_{{1, 2}}$ . At $π_{{1, 2}} = 0$ (disjoint populations), for instance, the PWER-approach yields no sample size increase, where the FWER-based method yields an increase of over $20 %$ . With increasing intersection size the difference between sample sizes for PWER and FWER-control declines until both values fall together at $π_{{1, 2}} = 1$ where the PWER is equal to the FWER.

Figure 4.

Factor of sample size increase compared to the unadjusted case for FWER- and PWER-control at $α = 0.025$ in a single study with two overlapping populations depending on the size of the intersection $π_{{1, 2}}$ . The Left panel is for scenario (i) with different experimental treatments and a common control; the right panel is for scenario (ii) with equal experimental treatments. The power is $1 - β = 80 %$ in both scenarios. PWER: population-wise error rate; FWER: family-wise error rate.

For the PWER, this graphic also illustrates that the correlation of the test statistics and the degree of adjustments needed to correct for multiplicity behave like opposing ’forces’. At $π_{{1, 2}} = 0$ the test statistics are uncorrelated, but there is no need to adjust for multiplicity with PWER-control. At $π_{{1, 2}} = 1$ there is only one population, so the correlation is 1 which implies again that no multiplicity adjustment is needed, although we are formally testing two hypotheses for everyone. For $π_{{1, 2}}$ between 0 and 1 we obtain the maximum for the PWER and corresponding sample size increase. Mathematically, this can be seen by rewriting the PWER as $1 - Φ (c^{*}) + π_{{1, 2}} {Φ (c^{*}) - Φ_{ρ} (c^{*}, c^{*})}$ . For fixed $c^{*}$ , only the second term depends on $π_{{1, 2}}$ . It is the product of two non-negative factors, $π_{{1, 2}}$ and $Φ (c^{*}) - Φ_{ρ} (c^{*}, c^{*})$ ), where the first increases from 0 to 1 and the second decreases from 1 to 0.

5.3 Estimation of population prevalences

In clinical practice, the assumption of known prevalences $π_{J}$ is rarely justified and it is natural to ask whether the replacement of $π_{J}$ by an estimator ${\hat{π}}_{J}$ will significantly inflate the PWER. A suitable choice for ${\hat{π}}_{J}$ is the maximum likelihood estimator (MLE) from the multinomial distribution of $(n_{J})_{J \subseteq I}$ . Using these estimates instead of the true prevalences, we compute the critical value $c^{*}$ by solving $P W E R = α$ . This guarantees asymptotic control of the PWER, since $({\hat{π}}_{J})_{J \subseteq I}$ is consistent and the joint distribution of the test statistics used in the calculation of $c^{*}$ is conditional on $(n_{J})_{J \subseteq I}$ .

We examine the PWER by means of scenarios (i) and (ii) of Section 5.2. For each constellation of true prevalences, we generate sample size vectors $({\hat{n}}_{J})_{J \subseteq I}$ from the corresponding multinomial distribution and computed the MLEs $({\hat{π}}_{J})_{J \subseteq I}$ . To see by how much the true PWER is inflated, the probabilities for a type I error for each sub-population $P_{J}$ are computed by using the ‘estimated’ critical value and the conditional correlation structure of the involved test statistics. By weighting each of these probabilities by their respective true population prevalence $π_{J}$ , we obtain the true PWER for the given ‘estimated’ critical value. This procedure was repeated 10.000 times and the mean of each true PWER was taken as approximation of the actual overall PWER. Figure 5 shows contour plots of this approximation of the overall PWER for scenarios (i) and (ii) and $N = 50$ and $N = 100$ , respectively. The plots indicate that the target PWER of $0.025$ may be missed only slightly, even for $N = 50$ . More precisely, the mean true PWER values range from around $2.49 \cdot 10^{- 2}$ to at worst $2.51 \cdot 10^{- 2}$ where the largest standard error we observed is at around $1.2 \cdot 10^{- 5}$ , which all in all displays a fairly negligible deviation from 0.025. To be fair, however, this also implies that the standard deviation of a single true PWER amounts to $1.2 \cdot 10^{- 3}$ , meaning that values of around 0.026 are quite possible.

Figure 5.

Contour plots of the actual overall PWER when using ML-estimates ${\hat{π}}_{J}$ for the prevalences $π_{J}$ in the determination of the critical value $c_{P}^{*}$ at level $α = 0.025$ . The first row corresponds to scenario (i) and the second row to scenario (ii) from Section 5.2. Because of $π_{{1}} + π_{{2}} \leq 1$ , the contour plots are restricted to the lower left rectangle of the squares. PWER: population-wise error rate; ML: maximum likelihood.

In the case of a small relative population size $π_{J}$ it might be that, by chance, no patient is recruited in this group. This would mean that we would not account for all the multiplicity of these patients completely. If such an intersection cannot be excluded theoretically from the inclusion and exclusion criteria or due to medical arguments, we could introduce a small minimal number $π_{m i n}$ for all $π_{J}$ in order to be more conservative. Also, different approaches like shrinkage methods or Bayesian estimation of the $π_{J}$ are conceivable options in future research.

Lastly, note that we assumed here that the patient population of the trial is representative of the respective future patient populations. If this cannot be assumed, one could take prevalence estimates from more representative studies. If these estimates are based on sufficiently large sample sizes, the control of the PWER should be achieved similarly or better than with the trial data if the representative data are larger.

5.4 Multiple testing approaches for umbrella trials

We consider now a multiple testing approach for umbrella trials suggested in Sun et al.¹² and investigate the gain in power by switching from FWER- to PWER-control. Following Sun et al.,¹² we assume $l$ disjoint population strata, which are denoted here by $S_{1}, \dots, S_{l}$ and which are subsets of a larger and more general patient population $P = \cup_{i = 1}^{l} S_{i}$ from which we randomly sample. In each stratum a specific experimental treatment $E_{i}$ shall be compared to a control treatment $C$ . For simplicity, we assume that each population has the relative population prevalence $π_{i}$ which is assumed to equal $n_{i} / N$ , where $n_{i}$ is the number of patients in $S_{i}$ and $N$ is the total number of patients in the sample. This holds in practice at least approximately; see also Section 5.3.

With only small $n_{i}$ , the establishment of a treatment effect in the individual strata is difficult and impossible to achieve with sufficient power. Therefore, study designs have been suggested that compare $C$ with the global treatment strategy $E$ which assigns treatment $E_{i}$ to population strata $S_{i}$ . Application examples for such trial designs where a pooled comparison of the global strategy $E$ to $C$ has been chosen for the primary analysis (with the strata-wise comparisons as secondary analyses only) are Hill et al.¹⁵ and Owadally et al.¹⁶ Such an overall comparison of the strategy $E$ with $C$ utilizes the total sample size $N$ and does also not require multiple testing. However, it does not permit a claim for a sub-population when the effect of $E$ is heterogeneous. To improve the approach, Sun et al.¹² suggest to test all sub-strategies $E^{S}$ , $S \subseteq {1, \dots, l}$ , that consider only the union $P^{S} = \cup_{i \in S} S_{i}$ with treatment assignments as in $E$ , against the control in $P^{S}$ . This permits claims also for sub-populations and thereby increases the possibility for efficacy conclusions. Of course, such testing requires an adjustment for multiplicity. Sun et al.¹² provide a (single-step) procedure that controls the FWER. For the formal description of the procedures, let $θ = (θ_{1}, \dots, θ_{l})$ be the vector of unknown treatment effects (mean differences) in the populations, and consider for each $S \subseteq {1, \dots, l}$ the average treatment effect in $P^{S}$ :

θ^{S} = \sum_{i \in S} (π_{i} / π^{S}) θ_{i}

with

π^{S} = \sum_{i \in S} π_{i}

the relative prevalence of

P^{S}

. Sun et al.¹² assume the linear model

Y_{i j} = μ_{i} + θ_{i} X_{i j} + ε_{i j}

(10)where

X_{i j}

denotes the treatment indicator for patient

j

in group

i

which equals 1 if assigned to the experimental treatment

E_{i}

and otherwise

0

, and

θ_{i}

is the treatment effect of

E_{i}

in population

S_{i}

. The error terms

ε_{i j}

are assumed to be i.i.d. normally distributed with mean

0

and homogeneous variance

σ^{2}

. As mentioned above, the authors suggest to test

H^{S} : θ^{S} \leq 0 vs. K^{S} : θ^{S} > 0 for all S \subseteq L = {1, \dots, l}

(11)Note that the

P^{S}

and

H^{S}

S \subseteq L

, correspond to the

P_{i}

and

H_{i}

i \in I

, Section 2 and 3.

From the least squares estimate of the linear model, we obtain one-sided $t$ -test statistics $T^{S}$ for testing $H^{S}$ for each $S \subseteq L$ . In order to control the FWER, Sun et al.¹² conduct a single-step procedure that compares each $T^{S}$ with the upper $α$ -quantile $c_{F}^{*}$ of the distribution of $max {T^{S} | S \subseteq L}$ under the global null hypothesis, i.e. the assumption that none of the treatments $E_{i}$ is superior to the control. We finally select the subset $S_{F}^{*} \subseteq L$ for which a positive treatment effect is claimed and that yields the largest value of $T^{S}$ ,

\begin{aligned} S_{F}^{*} = {\begin{matrix} \arg \max_{S \subseteq I} T^{S}, & if max {T^{S} | S \subseteq L} > c_{F}^{*} \\ \emptyset, & else \end{matrix} \end{aligned}

(12)To achieve PWER-control at the same level

α

, we determine the critical value

c_{P}^{*}

such that

P W E R = α

holds under the global null hypotheses. While

S_{1}, \dots, S_{l}

are disjoint, some of their unions

P^{S}

overlap. Since not all

P^{S}

overlap, the FWER corrects the multiple type I error rate for cases that cannot occur (similar to example 3) and hence may be viewed as overly conservative.

The PWER under the global null hypothesis ( $θ = 0 = (0, \dots, 0)$ ) is given by

{PWER}_{0} = \sum_{i = 1}^{l} π_{i} P_{0} (⋃_{S ∋ i} {T^{S} \geq c_{P}^{*}})

(13)where ‘

S ∋ i

’ denotes all

S \subseteq L

that contain the index

i

. This is because population

S_{i}

is affected by a type I error whenever a hypothesis

H^{S}

is erroneously rejected that corresponds to a population

P^{S}

for which

i \in S

(or

S_{i} \subseteq P^{S}

Due to the assumption of a homogeneous residual variance and the $2 l$ mean parameter in the linear model (10), ${T^{S}}_{S \subseteq L}$ follows a joint $t$ -distribution with $N - 2 l$ degrees of freedom. In R, the distribution function of the multivariate $t$ -distribution is implemented in the mvtnorm-package (see Genz et al.¹⁷) via the function pmvt which needs the degrees of freedom and the correlation matrix of the test statistics as input (see e.g. Bretz et al.¹⁸). The correlation matrix can be computed using the contrast matrix and the design matrix of the linear model. Probabilities in (13) are then calculated by choosing the appropriate sub-matrices of the correlation matrix. Thus, for known values of $π_{i}$ , $i \in L$ , and $l$ , we can numerically determine the critical value $c_{P}^{*}$ such that $P W E R = α$ .

We know that $c_{F}^{*} > c_{P}^{*}$ , which implies that whenever the FWER-approach selects a non-empty $S_{F}^{*}$ , the same set is selected by the PWER-approach, $S_{P}^{*} = S_{F}^{*}$ . We may, however, select the empty set with the FWER-approach, $S_{F}^{*} = \emptyset$ , while $S_{P}^{*} \neq̸ \emptyset$ .

Performance measures. Sun et al.¹² examined several quality and performance measures to assess how good a selected subset $S^{*}$ is. For example, they considered the average effect in the overall population when applying treatment strategy $E^{S^{*}}$ in $P^{S^{*}}$ and the control in the rest of the population. We will consider the relative quantity $RAE = 100 E (\sum_{i \in S^{*}} π_{i} θ_{i}) / θ_{overall}$ where $E$ is the expectation with respect to the sample distribution and $θ_{overall}$ is the weighted average of the positive treatment effects,

θ_{overall} = \sum_{i \in L_{+}} π_{i} θ_{i} / \sum_{i \in L_{+}} π_{i} for L_{+} = {i = 1, \dots, l : θ_{i} > 0}

that describes how efficient the experimental treatment strategy

E

is for the union of sub-populations that benefit from

E

. Since the PWER-procedure chooses a non-empty

S^{*}

more often as the FWER-procedure, this quantity will always be larger for the PWER-approach.

In addition to this measure we will investigate the average size of the ‘correctly’ chosen subgroups within the selected ones, i.e. the average of $π^{S_{+}^{*}} / π^{S^{*}}$ where $S_{+}^{*} = {i \in S^{*} | θ_{i} > 0}$ and $π^{S_{+}^{*}} = \sum_{i \in S_{+}^{*}} π_{i}$ . This gives the fraction of the patient cohort that benefits from the experimental treatment strategy within the one that is exposed to $E^{S^{*}}$ by the results of the study. Analogously, we are interested in the average of the relative size of the ‘falsely’ chosen subgroups within the chosen ones: $π^{S_{0}^{*}} / π^{S^{*}}$ with $S_{0}^{*} = {i \in S^{*} | θ_{i} \leq 0}$ . Lastly, we consider the probability of rejecting at least one false null hypothesis,

Power = P (reject any H^{S} with θ^{S} > 0, S \subseteq L),

as a way to measure the power of the procedures.

Design of the simulation. To make our results comparable to those of Sun et al.,¹² we conducted simulations with roughly the same parameters. That is, for the cases of $l = 2, 4, 6$ sub-populations and a significance level $α = 0.025$ , we chose a total sample size of $N = 1056$ and assume that all group-specific intercepts $μ_{i}$ are equal to 0. Also, for simplicity, each group is assumed to be of equal size, i.e. $π_{1} = \dots = π_{l}$ .

As in Sun et al.,¹² we assume non-negative effects $θ_{i} \geq 0$ and choose $θ = (θ_{1}, \dots, θ_{l})$ based on the number of subgroups $l$ and three further characteristics. The first one is the percentage of true null hypothesis: $q = l_{0} / l$ with $l_{0}$ the size of $L_{0} = {i = 1, \dots, l : θ_{i} = 0}$ . The second one characterizes the treatment effect heterogeneity and is defined as

τ = (θ_{\max} - θ_{\min}) / (θ_{\max} + θ_{\min})

where

θ_{\max} = max_{i \in L_{+}} θ_{i}

and

θ_{\min} = \min_{i \in L_{+}} θ_{i}

. Note that

τ

equals the relative half-range of the positive

θ_{i}

’s, i.e. half of their range divided by the average of their extremes. Obviously, a large

τ

means a large heterogeneity between the positive

θ_{i}

. The third one is the weighted average

θ_{overall}

as previously introduced.

Given values for $l$ , $q$ , $θ_{overall}$ and $τ$ one finds a gird of $l$ equidistant points such that the three characteristics are met. One easily verifies that this grid is uniquely determined by the four quantities. Following Sun et al.¹² we chose $q$ such that $q \cdot l$ is always an integer. Note that for $q \geq (l - 1) / l$ there is at most one $θ_{i} \neq 0$ and so $τ = 0$ (no heterogeneity) is the only possible value for $τ$ . Moreover, we assume $θ_{overall} = 0.1$ in all simulations.

Results. The simulation results for $l = 2$ and $4$ are given in Table 1 and for $l = 6$ and $8$ in Appendix 5. On can see from the tables that control of the PWER, in comparison to FWER-control, provides a substantially larger power and larger average proportion of ‘correctly’ chosen subgroups and a larger average effect. It also increases the proportion of ‘falsely’ chosen subgroups. This is because a subgroup is selected more frequently with PWER-control.

Table 1.

Simulation results for $l = 2$ and $l = 4$ , assuming $θ_{overall} = 0.1$ .

		Power	Correct	False	RAE	Power	Correct	False	RAE
	$l = 2$	$q = 0$
$τ = 0$	PWER	36.4	36.4	0	29
	FWER	31.0	31.0	0	25
$τ = 0.4$	PWER	40.4	40.4	0	33
	FWER	34.6	34.6	0	28
$τ = 0.8$	PWER	51.2	51.2	0	47
	FWER	45.2	45.2	0	42
	$l = 2$	$q = 1 / 2$				$q = 1$
$τ = 0$	PWER	57.7	52.9	4.8	58	0	0	36	0
	FWER	52.0	47.8	4.2	52	0	0	24	0
	$l = 4$	$q = 0$			$q = 1 / 4$
$τ = 0$	PWER	36.2	36.2	0	23	42.2	38.8	3.5	30
	FWER	27.4	27.4	0	17	32.7	30.1	2.6	24
$τ = 0.4$	PWER	37.9	37.9	0	25	44.8	41.3	3.6	33
	FWER	29.1	29.1	0	19	35.5	32.7	2.7	26
$τ = 0.8$	PWER	43.0	43.0	0	30	52.7	48.9	3.7	42
	FWER	33.7	33.7	0	24	43.2	40.2	3.0	35
	$l = 4$	$q = 2 / 4$
$τ = 0$	PWER	53.2	45.7	7.6	46
	FWER	43.8	37.8	6.1	38
$τ = 0.4$	PWER	58.8	51.1	7.8	50
	FWER	49.5	43.1	6.4	42
$τ = 0.8$	PWER	73.9	65.9	8.0	68
	FWER	65.3	58.5	6.8	60
	$l = 4$	$q = 3 / 4$			$q = 1$
$τ = 0$	PWER	81.5	70.1	11.5	81	0	0	4.2	0
	FWER	75.1	64.9	10.2	75	0	0	2.4	0

Results for power (%), the percentage of correctly and falsely chosen sub-populations and the RAE for PWER- and FWER-control under parameter configurations $θ = (θ_{1}, \dots, θ_{l})$ that depend on the fraction of true null hypotheses $q$ and the relative half-range $τ$ of the positive $θ_{i}$ ‘s. RAE: relative average effect; FWER: family-wise error rate; PWER: population-wise error rate.

While the proportion of ‘falsely’ chosen subgroups is increased by at most 2.2% (percentage points) and remains below 5% (one-sided), the proportion of ‘correctly’ chosen subgroups (among the selected ones) and the power are increased by up to 10% and often by more than 5%. The expected effect RAE is always larger with PWER-control.

Under the global null hypothesis ( $P = 1$ ) the average proportion of ‘falsely’ selected populations equals by theory the one-sided FWER. With PWER-control at level 2.5% the FWER was found to be between 3.6% and 4.5% for $l = 2, 4, 6, 8$ . Note that the average proportion of ‘falsely’ selected populations exceeds the level of 2.5% (sometimes substantially) also with FWER-control when there is an effect in some but not all population strata.

In summary, we see that control of PWER substantially increases the chance for a delivery of efficient treatments while the risk of receiving an inefficient treatment and the percentage of patients that do not benefit from the treatment decisions is increased to a moderate extent and remains comparable to the procedure with FWER-control.

6 Extension to SCIs

We are coming back to the general set-up of Section 2 and 3. Utilizing the duality between (multiple) hypothesis tests and (simultaneous) confidence intervals, the multiple test procedure with control of the PWER, introduced in Section 3, can be extended to confidence intervals for the efficacy parameter $θ_{i} = θ (P_{i}, T_{i})$ , $i = 1, \dots, m$ . In this section, we will introduce the dual SCIs and discuss their coverage properties.

To introduce the confidence intervals, let $δ = (δ_{1}, \dots, δ_{m})$ be a vector of possible values for $θ = (θ_{1}, \dots, θ_{m})$ and consider the corresponding null hypotheses $H_{i}^{δ_{i}} : θ_{i} = δ_{i}$ , $i = 1, \dots, m$ . Assume further that $T_{i}^{δ_{i}}, i = 1, \dots, m$ , are (asymptotically) pivotal test statistics for $H_{i}^{δ_{i}}$ , i.e., the (asymptotic) joint distribution of $(T_{1}^{δ_{1}}, \dots, T_{m}^{δ_{m}})$ under $θ = δ$ is the same for all $δ$ . If $T_{i}^{δ_{i}}$ decreases in $δ_{i}$ for the given data, then it makes sense to form the one-sided intervals $C_{i} = [{\tilde{θ}}_{i}, \infty [$ with the lower bound

{\tilde{θ}}_{i} : = \min {δ_{i} : T_{i}^{δ_{i}} \leq c^{*}}

(14)where

c^{*}

is the critical value defined in (3) for

θ^{*} = δ

. Because

(T_{1}^{δ_{1}}, \dots, T_{m}^{δ_{m}})

is pivotal, the critical value

c^{*}

is independent from

δ

. The monotonicity of

T_{i}^{δ_{i}}

applies to most (one-sided) tests and is satisfied e.g. for Wald-type test statistics

T_{i}^{δ_{i}} = ({\hat{θ}}_{i} - δ_{i}) / S E_{i}

where

{\hat{θ}}_{i}

is an estimate of

θ_{i}

(e.g. the MLE) with an standard error

S E_{i}

that is independent of the parameter value

δ

. In this case, we obtain

{\tilde{θ}}_{i} = {\hat{θ}}_{i} - c^{*} S E_{i}

Upper confidence bounds can be derived by applying the same principle and two-sided confidence intervals are obtained by the intersection of the two one-sided intervals. With Wald-type dual tests we obtain the two-sided intervals $C_{i} = [{\hat{θ}}_{i} - c^{*} S E_{i}, {\hat{θ}}_{i} + c^{*} S E_{i}]$ .

We finally discuss the coverage properties of the above-introduced confidence bounds and intervals. We start with the lower confidence bounds ${\tilde{θ}}_{i}$ . To this end, consider a patient $P$ that is randomly drawn from $P$ and let $I_{P}$ be the set of indices of the sub-populations $P_{i}$ the patient $P$ belongs to, i.e. $I_{P} = {i : P \in P_{i}}$ . The set $I_{P}$ gives all population efficacy parameter $θ_{i}$ , $i \in I_{P}$ , that are relevant for patient $P$ . Note that $I_{P}$ is a random set because $P$ is randomly drawn from $P$ . If $θ_{i}$ is the true unknown efficacy parameter, then by the definition (14) we get ${\tilde{θ}}_{i} > θ_{i}$ if and only if $T_{i}^{θ_{i}} > c^{*}$ . Since the dual tests for $H_{1}^{θ_{1}}, \dots, H_{m}^{θ_{m}}$ control the PWER, the (simultaneous) probability that any of the lower confidence bounds ${\tilde{θ}}_{j}$ , $j \in I_{P}$ , fall above the true $θ_{j}$ is at most $α$ . This gives the coverage property

P_{θ} ({\tilde{θ}}_{j} \leq θ_{j} for all j \in I_{P}) \geq 1 - α

(15)meaning that with a probability of at most

1 - α

, for a randomly chosen patient

P

, the lower confidence intervals

[{\tilde{θ}}_{j}, \infty

[, j = 1, \dots, m

, cover all true

θ_{j} = θ (P_{j}, T_{j})

that are relevant to this patient. Because,

I_{P} = J

if and only if

P \in P_{J} = \cap_{j \in J} P_{j} ∖ ⋃_{k \in I ∖ J} P_{k}

we can write the coverage probability as

\sum_{J \subseteq I} π_{J} P_{θ} ({\tilde{θ}}_{j} \leq θ_{j} for all j \in J)

Hence, equation (15) means to control a kind of average simultaneous coverage probability where we focus in each stratum on the relevant confidence statements and average the strata-wise coverage probability over the entire population

P

The upper confidence bounds and two-sided confidence intervals control the same type of average simultaneous coverage probability. As for the classical confidence intervals, the two-sided interval has a twice as large non-coverage probability as the one-sided intervals.

7 Discussion

This paper introduces a new multiple type I error rate concept for clinical trials with multiple and possibly intersecting populations that permits for more powerful tests than control of the FWER. It relies on the observation that not all patients are affected by all test decisions, since not all hypotheses concern all population strata. By averaging the individually relevant, multiple type I errors over the entire population, it provides control of the probability that a randomly selected future patient will be exposed to an inefficient treatment strategy. This average multiple type I error rate, which we call the population-wise error rate (PWER), is smaller than the maximal FWER a patient strata is exposed to. In Section 4, we have discussed several bounds for the strata-wise FWER that follow from control of the PWER and illustrated them with numerical examples. Hence, control of the PWER guarantees control of the maximal strata-wise FWER on a larger and only indirectly defined level.

Let us recall that we only consider population-wise claims, i.e. claims on treatment strategies that consist of a treatment and a population the treatment is intended for and for which the average treatment effect is the estimand of interest. This is also the case when aiming for FWER-control. No individual efficacy claims are anticipated here. Error control of patient-wise claims is impossible without sacrificing power or making strong assumptions. However, a population-wise claim can be viewed as a proxy or approximation for individual claims in the target population.²⁶ Test results from more than a single population may be used for a more informed individual decision. Depending on the inconsistency of the efficacy estimates across the individual strata, we may refrain from taking a formal rejection for a claim in the corresponding population. Hence, with PWER-control, we consider the worst-case scenario, where an efficacy claim for a treatment strategy will always lead to an application of the treatment to all patients in the target population. Note that we do not account for a potential off-label use where a treatment is applied to patients outside its target population.

We have presented a simple approach for achieving PWER-control by adjustment of critical values and have illustrated the power gain when passing from FWER to PWER in a number of examples. We have mainly considered the simple situation of multivariate normal distributed test statistics. This situation applies at least asymptotically to a large number of hypothesis tests for which PWER-control is then guaranteed asymptotically. The methods and principle introduced here can also be implemented with other finite sample distributions like e.g. the multivariate $t$ -distribution (as done in Section 5.4) or be improved via resampling methods. Variance heterogeneity across populations is a general issue for trials with multiple populations that applies similarly to procedures with FWER-control (see e.g. Placzek and Friede²⁴). One can say, whenever control of the FWER is possible then control of the PWER is possible as well, since the latter just controls an average of FWERs. We have also extended the suggested multiple test to SCIs and showed that these intervals control, for a randomly chosen patient, the probability of a simultaneously correct statement on the parameters that are relevant for the corresponding stratum the patient belongs to.

One referee asked at which level the PWER should be controlled. We would suggest to control the PWER at the one-sided level of 2.5% which is usually used for FWER-control, because the PWER has an interpretation that is close to the FWER, namely represents the risk to expose a future patient to an inefficient treatment strategy. Of course, this choice of $α$ is as arbitrary as the choice is for FWER-control and there might be reasons for choosing another level, e.g. in phase II studies.

Control of the PWER requires knowledge of the relative prevalences of all disjoint population strata. These may either be obtained from previous studies or may be estimated at the end of the study. This complicates PWER-control. We have illustrated in an example with two populations that the estimation of the prevalences does not strongly harm PWER-control even with moderate sample sizes. However, more examples with more hypotheses are required to fully explore this issue. At least, PWER-control is always guaranteed asymptotically.

Since our procedure simply results in an adjustment of critical values, power calculations and power simulations deviate only minimally from approaches for classical multiple tests, except for the fact that the critical values may depend on the sample via the prevalence estimates. This can be resolved by using a priori estimates of the prevalences based on experience and past studies. The same issues arises from the estimation of the correlation structure of the test statistics used for an efficient PWER and FWER-control. A miss-specification of the prevalences may be corrected in a mid-trial blinded sample size review.¹⁹

In Section 3, we have suggested a single-step procedure to control the PWER and one might ask whether this procedure can be uniformly improved by a step-down test because this is the case for single-step tests with FWER-control (e.g. Dmitrienko et al.²⁰). For instance, in Example 2.2 with two intersecting hypotheses, we may ask whether we can test $H_{2}$ with a smaller critical $c_{2}^{*} < c^{*}$ when $H_{1}$ has already been rejected with critical value $c^{*}$ . One can quickly see that this is not possible. To this end assume that both hypotheses $H_{1}$ and $H_{2}$ are true. Rejection of $H_{2}$ when $Z_{2} \geq c^{*}$ or $Z_{2} \geq c_{2}^{*}$ with $Z_{1} \geq c^{*}$ , obviously increases the second and third terms in equation (2) of the PWER. Since we have chosen $c^{*}$ to be the smallest critical value that satisfies (3), which leads to a PWER equal to $α$ with continuously distributed $Z_{i}$ (a generic and common situation), we do not control the PWER for any $c_{2}^{*} < c^{*}$ . We may define PWER-controlling step-down tests with an enlarged $c^{*}$ in order to mimic and improve step-down tests with FWER-control. However, such procedures do not uniformly improve the single-step test with PWER-control and are therefore beyond the scope of this paper. The development of step-down tests with PWER-control is a topic for future research.

Single-step procedures have the advantage that they can directly be extended by simple and always informative SCIs. We have illustrated this in Section 6 for single-step tests with PWER-control. An extension to simple and always informative SCIs is impossible for step-down tests: Compatible SCIs often are non-informative in the sense that they do not provide any additional information to the sheer hypothesis tests,^21,22 and sufficiently informative SCIs are compatible only to a modification of the original step-down test.²³ This justifies the use of single-step tests in practice.

We finally remark that an extension of the presented PWER-approach to multi-stage and adaptive designs is under development by the authors and will be a topic of future contributions. Multi-stage and particularly flexible designs provide the opportunity for adding or dropping populations at interim analyses based on the unblinded interim data (e.g. Brannath et al.,⁴ Wassmer and Brannath,⁶ Placzek and Friede²⁴). In the example of Section 2.2, we may for instance add and enrich the intersection of the two populations for an investigation in a second stage of the study if the efficacy of the treatment is seen at interim only in one of the two populations. Hence, the development of adaptive and sequential designs with PWER-control is an interesting and valuable research task. It has also the potential to provide a valuable contribution to platform trials for which FWER-control has been discussed (e.g. Stallard et al.,⁷ Collignon et al.⁹) and alternatives (like FDR-control) have been suggested rather recently (e.g. Zehetmayer et al.,²⁵ Robertson et al.²⁶).

Footnotes

Acknowledgements

The authors thank the referees for their valuable comments that have lead to major improvements of the paper and Dr. Miriam Kesselmeier from the University Hospital of Jena for her constructive comments on a previous version of this manuscript. We finally thank Remi Luschei for his comments on a revised version of the manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the BMBF under funding number 01EK1503B.

ORCID iDs

Werner Brannath

Charlie Hillner

Appendix

References

Woodcock

LaVange

. Master protocols to study multiple therapies, multiple diseases, or both. N Engl J Med 2017; 377: 62–70.

Strzebonska

Waligora

. Umbrella and basket trials in oncology: ethical challenges. BMC Med Ethics 2019; 20: 58. DOI: 10.1186/s12910-019-0395-5.

Kaplan

Maughan

Crook

, et al. Evaluating many treatments and biomarkers in oncology: a new design. J Clini Oncol: Offi J Am Soc Clini Oncol 2013; 31. DOI: 10.1200/JCO.2013.50.7905.

Brannath

Zuber

Branson

, et al. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Stat Med 2009; 28: 1445–1463.

Glimm

Di Scala

. An approach to confirmatory testing of subpopulations in clinical trials. Biom J 2015; 57: 897–913.

Wassmer

Brannath

. Group sequential and confirmatory adaptive designs in clinical trials. Springer Springer International Publishing, Switzerland, 2016.

Stallard

Todd

Parashar

, et al. On the need to adjust for multiplicity in confirmatory clinical trials with master protocols. Ann Oncol 2019; 30: 506–509.

Malik

Pazdur

Abrams

, et al. Consensus report of a joint NCI thoracic malignancies steering committee: FDA workshop on strategies for integrating biomarkers into clinical development of new therapies for lung cancer leading to the inception of “master protocols” in lung cancer. J Thorac Oncol 2014; 9: 1443–1448.

Collignon

Gartner

Haidich

, et al. Current statistical considerations and regulatory perspectives on the planning of confirmatory basket, umbrella, and platform trials. Clin Pharmacol Ther 2020; 107: 1059–1067.

10.

Kesselmeier

Benda

Scherag

. Effect size estimates from umbrella designs: Handling patients with a positive test result for multiple biomarkers using random or pragmatic subtrial allocation. PLoS ONE 2020; 15: 1–24.

11.

Fletcher

Ziegler

Trahair

, et al. Too many targets, not enough patients: rethinking neuroblastoma clinical trials. Nat Rev Cancer 2018; 18: 389–400.

12.

Sun

Bretz

Gerke

, et al. Comparing a stratified treatment strategy with the standard treatment in randomized clinical trials. Stat Med 2016; 35: 5325–5337.

13.

Westphal

Young

. Resampling-Based Multiple Testing. Wiley, New York, 1993.

14.

Dickhaus

. Simultaneous Statistical Inference. Springer, Berlin, Heidelberg, 2014.

15.

Hill

Whitehurst

Lewis

, et al. Comparison of stratified primary care management for low back pain with current best practice (start back): a randomised controlled trial. The Lancet 2011; 378: 1560–1571.

16.

Owadally

Hurt

Timmins

, et al. PATHOS: a phase II/III trial of risk-stratified, reduced intensity adjuvant treatment in patients undergoing transoral surgery for Human papillomavirus (HPV) positive oropharyngeal cancer. BMC Cancer 2015; 15: 602.

17.

Genz

Bretz

Miwa

, et al. mvtnorm: Multivariate Normal and t Distributions, 2017. https://CRAN.R-project.org/package=mvtnorm. R package version 1.0-6.

18.

Bretz

Hothorn

Westfall

. Multiple Comparisons Using R. CRC Press, New York, 2016. ISBN 9781420010909.

19.

Placzek

Friede

. Clinical trials with nested subgroups: analysis, sample size determination and internal pilot studies. Stat Methods Med Res 2018; 27: 3286–3303.

20.

Dmitrienko

Tamhane

Bretz

. Multiple testing problems in pharmaceutical statistics. CRC Press, New York, 2009.

21.

Strassburger

Bretz

. Compatible simultaneous lower confidence bounds for the Holm procedure and other Bonferroni-based closed tests. Stat Med 2008; 27: 4914–4927.

22.

Guilbaud

. Alternative confidence regions for bonferroni-based closed-testing procedures that are not alpha-exhaustive. Biom J 2009; 51: 721–735.

23.

Brannath

Schmidt

. A new class of powerful and informative simultaneous confidence intervals. Stat Med 2014; 33: 3365–3386.

24.

Placzek

Friede

. A conditional error function approach for adaptive enrichment designs with continuous endpoints. Stat Med 2019; 38: 3105–3122.

25.

Zehetmayer

Posch

Koenig

. Online control of the false discovery rate in group-sequential platform trials, 2021. doi:10.48550/ARXIV.2112.10619. https://arxiv.org/abs/2112.10619.

26.

Robertson

Wason

JMS

König

, et al. Online error control for platform trials, 2022. doi:10.48550/ARXIV.2202.03838. https://arxiv.org/abs/2202.03838.

27.

Liu

Meng

. Comment: a fruitful resolution to simpson’s paradox via multiresolution inference. Am Stat 2014; 68: 17–29.

The population-wise error rate for clinical trials with overlapping populations

Abstract

Keywords

1 Introduction

2 The population-wise error rate

2.1 General framework and definition

2.4 Three populations with two intersections

3 Control of the PWER

5.1 Combination of independent samples

Footnotes

Acknowledgements

Declaration of conflicting interests

Funding

ORCID iDs

Appendix

References