The effect of estimating prevalences on the population-wise error rate

Abstract

The population-wise error rate is a type I error rate for clinical trials with multiple target populations. In such trials, a treatment is tested for its efficacy in each population. The population-wise error rate is defined as the probability that a randomly selected, future patient will be exposed to an inefficient treatment based on the study results. It can be understood and computed as an average of strata-specific family wise error rates and involves the prevalences of these strata. A major issue of this concept is that the prevalences are usually unknown in practice, so that the population-wise error rate cannot be directly controlled. Instead, one could use an estimator based on the given sample, like their maximum-likelihood estimator under a multinomial distribution. In this article, we demonstrate through simulations that this does not substantially inflate the true population-wise error rate. We differentiate between the expected population-wise error rate, which is almost perfectly controlled, and study-specific values of the population-wise error rate which are conditioned on all subgroup sample sizes and vary within a narrow range. Thereby, we consider up to eight different overlapping populations and moderate to large sample sizes. In these settings, we also consider the maximum strata-wise family wise error rate, which is found to be, on average, at least bounded by twice the significance level used for population-wise error rate control.

Keywords

Family wise error rate multiple testing personalized medicine population-wise error rate prevalence estimation umbrella trials

1. Introduction

Many clinical trials in personalized medicine examine multiple hypotheses, each about the efficacy of a medical treatment in a specific patient population. This is done, for example, in umbrella trials that enroll patients with the same cancer type, but define multiple study arms based on different mutations of this cancer. Current examples of umbrella trials and challenges associated with these trials have been discussed by Ouma et al.¹ One difficulty arises from the fact that in case of overlapping populations, taking a false test decision may affect more than one population. In this case, type I error control should be adjusted for multiplicity. For this purpose, Brannath et al.² introduced the population-wise error rate (PWER), a multiple type I error rate which is adapted to clinical trials with overlapping populations. It has been mentioned by Ouma et al.¹ and has the advantage of being more liberal than the family wise error rate (FWER), while still controlling an average type I error. This allows to achieve higher power, what is particularly useful for examining rare diseases that often involve small sample sizes.

The PWER gives the probability that a randomly selected patient will receive an inefficient treatment. For calculating it, the overall population is partitioned into disjoint strata containing the patients that are affected by the same test decisions. For each stratum, the respective multiple type I error probability is then weighted by its relative prevalence, that is, the proportion of the stratum in the total population. In practice, however, the prevalences are often unknown and must be estimated from the study sample. One way to do this, if no further knowledge about the prevalences is available, is to take the maximum-likelihood estimator from the multinomial distribution. This article aims to examine whether the use of this estimator inflates the true PWER and to which extent. To this end, we will carry out simulations in which the relative prevalences will be estimated and used for the calculation of the rejection boundaries in order to control the estimated PWER at the pre-specified level. We will then compare the true PWER (using the true prevalences) to this significance level.

Another important question regarding PWER control is to what extent the FWER can then still be controlled. In this context, the FWER corresponds to the maximal risk for future patients to be assigned to an inefficient treatment, while the PWER only controls the average risk. Brannath et al.² have specified several theoretical bounds for the FWER, which are somewhat rough and can often be improved by the simulated values. Specifically, we will see that the FWER is often less than twice as large as the significance level of PWER control.

We will also cover a special situation that may arise at the calculation of the estimated PWER, when no patients are recruited from (at least) one stratum. In this case, the concerned strata are neglected by the PWER, such that a randomly selected person from one of these strata could have a greater chance of receiving an inefficient treatment than the FWER level. Our solution is to introduce a minimal prevalence to include all concerned strata-wise error rates into the PWER. In our simulations we will therefore also examine to what extent PWER-controlling tests with a minimal prevalence actually become more conservative in this situation.

We start this article with the formal definition of the PWER in Section 2. In Section 3, we construct the test statistics used to handle the population-wise testing problems. The presentation of the simulations and their results is done in Section 4, including a motivating example based on an umbrella trial data set from Kesselmeier et al.³ in Section 4.1. The article ends with a discussion in Section 5. All simulations are done in R. The program files and outputs are available under https://github.com/rluschei/PWER-Estimate-Prevalences.

2. The population-wise error rate

In this section, we first recall the formal definition of the PWER from Brannath et al.² and present a result on least favorable parameter configurations for the PWER which will be used in the calculation of the rejection boundaries for PWER control.

2.1. Definition

Let $P_{1}, \dots, P_{m}$ be the sets representing the given patient populations. They may, in particular, be overlapping. For every $i \in I = {1, \dots, m}$ we want to test a treatment $T_{i}$ in the population $P_{i}$ . So we are interested in testing the null hypotheses $H_{i} : θ_{i} \leq 0$ , where $θ_{i} = θ (P_{i}, T_{i}) \in R$ denotes the effect of $T_{i}$ in comparison to a control treatment $C$ in the population $P_{i}$ . To define the PWER, we partition the overall population $P = \cup_{i = 1}^{m} P_{i}$ into the disjoint strata $P_{J} := (\cap_{j \in J} P_{j}) ∖ (\cup_{k \in I ∖ J} P_{k}), J \subseteq I$ . Each $P_{J}$ includes all individuals affected by the treatments indexed in $J$ . Figure 1 illustrates the resulting partition for $m = 3$ overlapping populations. We denote the relative prevalence of $P_{J}$ among $P$ by $π_{J}$ . Since the prevalences are usually unknown in practice, we will later propose how to estimate them. The population-wise error rate of the multiple testing problem is defined by the following equation:

PWER = \sum_{J \subseteq I} π_{J} P (falsely reject any H_{j} with j \in J)

Hence, it gives the average probability of committing at least one type I error that would concern the individuals in

P

. The PWER is more liberal than the FWER, that is, it fulfills

PWER \leq FWER

(see Brannath et al.²).

Figure 1.

Partition of $m = 3$ overlapping populations.

However, in order to control the PWER with estimated prevalences, we need to move to stronger null hypotheses. For any subset $J \subseteq I$ and $i \in J$ , we define the null hypothesis $H_{J, T_{i}} : θ_{J, T_{i}} \leq 0$ , which states that treatment $T_{i}$ is not better than the control in the stratum $P_{J}$ . We will test the intersection hypotheses $H_{i} = \cap_{J \subseteq I : i \in J} H_{J, T_{i}}$ for each $i \in I$ , stating that no treatment relevant to patients in $P_{i}$ is more effective than the control. We assume that each $H_{i}$ is tested by a right-tailed test with the test statistic $Z_{i}$ and a critical value $c_{i}$ . Then the PWER equals to

{PWER}_{θ} = \sum_{J \subseteq I} π_{J} P_{θ} (⋃_{j \in J \cap I_{0} (θ)} {Z_{j} > c_{j}})

where the vector

θ = (θ_{J})_{J \subseteq I}

with

θ_{J} = (θ_{J, T_{i}})_{i \in J}

equals the true effects, and the true null hypotheses are indexed by

I_{0} (θ) = {i \in I : H_{i} is true}

. In the following, we will mostly consider equal critical values (

c_{i} = c

for all

i \in I

) for the reasons given by Brannath et al.²

2.2. Least favorable parameter configurations

Under two additional assumptions it can be shown that the null vector $θ = 0$ is a worst case parameter for which the PWER attains its maximum. The first condition is the subset pivotality assumption (see Westfall and Young⁴ and Dickhaus⁵), which states that the multivariate distributions of the vectors $(Z_{j})_{j \in J \cap I_{0} (θ)}$ do not depend on the truth or falsity of those hypotheses they are not relevant for. It is formally described by the following equation:

\forall J \subseteq I, J \neq \emptyset : \forall θ \in H_{J} \exists θ^{*} \in H_{I} : P_{θ}^{max_{j \in J} Z_{j}} = P_{θ^{*}}^{max_{j \in J} Z_{j}}

(1)

where

H_{J} := ⋂_{j \in J} H_{j}, J \subseteq I

, and

P_{θ}^{max_{j \in J} Z_{j}}

denotes the distribution of

max_{j \in J} Z_{j}

in dependence of the parameter

θ

. The second assumption is stochastic monotonicity of the subvectors

(Z_{j})_{j \in J}

, stating that

\forall J \subseteq I, J \neq \emptyset : \forall θ_{1}, θ_{2} \in H_{I}, θ_{1} \leq θ_{2} : P_{θ_{1}} (max_{j \in J} Z_{j} > c) \leq P_{θ_{2}} (max_{j \in J} Z_{j} > c)

(2)

Here the relation

θ_{1} \leq θ_{2}

is meant component-wise.

Theorem 1

Under conditions equations (1) and (2) one obtains $sup_{θ} {PWER}_{θ} = {PWER}_{θ^{*}}$ for $θ^{*} = 0$ .

The proof can be found in Appendix A. We call every $θ^{*}$ fulfilling the equation from Theorem 1 a least favorable parameter configuration (LFC) for the PWER. If an LFC exists, the PWER can be strongly controlled at a given significance level $α \in (0, 1)$ , by determining the smallest critical value $c$ that satisfies ${PWER}_{θ^{*}} (c) \leq α$ . Note that this will also bound the FWER to a certain extent—some theoretical bounds are given by Brannath et al.² and we will also investigate this numerically in Section 4.5.

3. Test statistics for a single-stage design

We now want to construct the vector of test statistics $Z = (Z_{i})_{i \in I}$ in order to conduct the multiple test. We follow the approach proposed by Hillner⁶ but replace the true prevalences used there by the given sample proportions. This allows for a concrete calculation in practice, for example, under an umbrella trial framework with a common control as described by Ouma et al.¹

For every $i \in I$ , let $μ_{i, T_{i}} \in R$ be the expected response under treatment $T_{i}$ in the population $P_{i}$ and let $μ_{i, C} \in R$ be the expected response under the control $C$ in $P_{i}$ . Then the mean effect difference in population $P_{i}$ is $θ_{i} = μ_{i, T_{i}} - μ_{i, C}$ . Additionally, for every $J \subseteq I$ and $j \in J$ we define the expectations $μ_{J, T_{j}} \in R$ under $T_{j}$ and $μ_{J, C} \in R$ under $C$ in $P_{J}$ . The expectations in the population $P_{i}$ are then

μ_{i, T_{i}} = \sum_{J \subseteq I : i \in J} \frac{π_{J}}{π_{i}} μ_{J, T_{i}} and μ_{i, C} = \sum_{J \subseteq I : i \in J} \frac{π_{J}}{π_{i}} μ_{J, C}

where

π_{i}

denotes the sum of all

π_{J}

with

i \in J

(and equals the prevalence of

P_{i}

). Suppose that a sample of

N

patients is given, in which

n_{J}

patients belong to the stratum

P_{J}

, for every

J \subseteq I

. Let

n_{J, T}

be the number of patients assigned to treatment

T

in stratum

P_{J}

, with

T \in T_{J} := {T_{j} : j \in J} \cup {C}

, and let

n_{i, T} := \sum_{J \subseteq I : i \in J} n_{J, T}

be the number of patients assigned to

T

in population

P_{i}

. In case of a stratified randomization, that is, in every stratum

P_{J}

the patients are assigned to the treatments evenly, we have

n_{J, T} = n_{J} / (| J | + 1)

3.1. Normal distribution model with known and heterogeneous variances

In every stratum $P_{J}$ , let us represent the measured responses under the treatment $T \in T_{J}$ by the normally distributed random variables $X_{J, T}^{(k)} \sim N (μ_{J, T}, σ_{J, T}^{2})$ , for $k = 1, \dots, n_{J, T}$ . The variances $σ_{J, T}^{2} > 0$ are here assumed to be known, and all observations are assumed to be independent. For every $J$ and for every treatment $T \in T_{J}$ we define the strata-wise arithmetic mean ${\bar{X}}_{J, T} := n_{J, T}^{- 1} \sum_{k = 1}^{n_{J, T}} X_{J, T}^{(k)}$ . Then we can estimate the expected responses $μ_{i, T_{i}}$ and $μ_{i, C}$ through

{\hat{μ}}_{i, T_{i}} := \sum_{J \subseteq I : i \in J} \frac{n_{J, T_{i}}}{n_{i, T_{i}}} {\bar{X}}_{J, T_{i}} and {\hat{μ}}_{i, C} := \sum_{J \subseteq I : i \in J} \frac{n_{J, C}}{n_{i, C}} {\bar{X}}_{J, C}

This provides the test statistics

Z_{i} := \frac{{\hat{μ}}_{i, T_{i}} - {\hat{μ}}_{i, C}}{\sqrt{V_{i}}}, i \in I

with

V_{i} := Var ({\hat{μ}}_{i, T_{i}} - {\hat{μ}}_{i, C}) = \sum_{J \subseteq I : i \in J} \frac{n_{J, T_{i}}}{n_{i, T_{i}}^{2}} σ_{J, T_{i}}^{2} + \frac{n_{J, C}}{n_{i, C}^{2}} σ_{J, C}^{2}

Conditional on the sample sizes, the vector

Z = (Z_{i})_{i \in I}

follows a multivariate normal distribution with location

ν = (ν_{i})_{i \in I}

given by the following equation:

ν_{i} = \frac{1}{\sqrt{V_{i}}} (\sum_{J \subseteq I : i \in J} \frac{n_{J, T_{i}}}{n_{i, T_{i}}} μ_{J, T_{i}} - \frac{n_{J, C}}{n_{i, C}} μ_{J, C}), i \in I

The correlation matrix

Σ = (Σ_{i j})_{i, j \in I}

takes a different form depending on whether and which treatments are the same. When all treatments are different (

T_{i} \neq T_{j}

for

i \neq j

), we have

Σ_{i j} = \frac{\sum_{J \subseteq I : i, j \in J} n_{J, C} σ_{J, C}^{2}}{n_{i, C} n_{j, C} \sqrt{V_{i} V_{j}}}, i \neq j

The derivation of the above results can be found in Appendix B.

The vector $Z$ fulfills the conditions equations (1) and (2), so that $θ = 0$ is an LFC for the PWER. Since $ν$ converges to zero under $θ = 0$ , the maximal PWER of the asymptotic test can be calculated by the following equation:

PWER = \sum_{J \subseteq I} π_{J} (1 - Φ_{0, Σ_{J}} (c, \dots, c))

(3)

where

Φ_{0, Σ_{J}}

denotes the cdf of the normal distribution with parameters

0 \in R^{| J |}

and

Σ_{J} := (Σ_{i j})_{i, j \in J}

3.2. Known and homogeneous variances

When the variances $σ_{J, T}^{2}$ are known and homogeneous ( $σ_{J, T}^{2} = σ^{2}$ for all $J \subseteq I, T \in T_{J}$ ), the representation of the test statistics simplifies to

Z_{i} = \frac{{\hat{μ}}_{i, T_{i}} - {\hat{μ}}_{i, C}}{σ \sqrt{H_{i}}} with H_{i} = n_{i, T_{i}}^{- 1} + n_{i, C}^{- 1}

and the correlation matrix simplifies to

Σ_{i j} = \frac{\sum_{J \subseteq I : i, j \in J} n_{J, C}}{n_{i, C} n_{j, C} \sqrt{H_{i} H_{j}}}

(4)

and is then even independent from

σ

3.3. Unknown and homogeneous variances

We now assume the variances to be unknown, but still homogeneous. For every $J \subseteq I$ and $T \in T_{J}$ an unbiased estimator of $σ^{2}$ is given by the following equation:

{\hat{σ}}_{J, T}^{2} = \frac{1}{n_{J, T} - 1} \sum_{k = 1}^{n_{J, T}} {(X_{J, T}^{(k)} - {\bar{X}}_{J, T})}^{2}

Let

s

be the number of strata treatment combinations

(J, T)

with

n_{J, T} > 1

. We use the pooled variance estimator

{\hat{σ}}^{2} = \frac{\sum_{J \subseteq I, T \in T_{J}} (n_{J, T} - 1) {\hat{σ}}_{J, T}^{2}}{N - s}

to construct the test statistics

T_{i} := \frac{{\hat{μ}}_{i, T_{i}} - {\hat{μ}}_{i, C}}{\hat{σ} \sqrt{H_{i}}}, i \in I

The vector

(T_{i})_{i \in I}

then follows a multivariate

t

-distribution with the scale matrix from fomula equation (4) and

d f = N - s

degrees of freedom (see Kotz and Nadarajah⁷). Hence, under an unknown and common residual variance the PWER is obtained by replacing

Φ_{0, Σ_{J}}

with the cdf of the multivariate

t

-distribution in formula equation (3).

3.4. Unknown and heterogeneous variances

To account for unknown and heterogeneous variances across strata and treatments, we need to include the individual variance estimators ${\hat{σ}}_{J, T}^{2}$ in our test statistics:

T_{i}^{*} := \frac{{\hat{μ}}_{i, T_{i}} - {\hat{μ}}_{i, C}}{\sqrt{{\hat{V}}_{i}}} with {\hat{V}}_{i} = \sum_{J \subseteq I : i \in J} \frac{n_{J, T_{i}}}{n_{i, T_{i}}^{2}} {\hat{σ}}_{J, T_{i}}^{2} + \frac{n_{J, C}}{n_{i, C}^{2}} {\hat{σ}}_{J, C}^{2}

However, finding their joint distribution is an unsolved problem, so that the PWER cannot be calculated in this case. Hasler and Hothorn⁸ showed that FWER control can still be reached by computing individual critical values

c_{i}

from a

t

-approximation with degrees of freedom according to Satterthwaite⁹ and plug-in estimation of the correlation matrix. In Section 4.7, we will show how this can be applied to PWER control.

4. Estimation of the population prevalences

The populations $P_{1}, \dots, P_{m}$ are usually defined by certain inclusion and exclusion criteria and are assumed to have infinite size, because they not only include the study patients, but also patients outside of the trial and potential future patients. Consequently, the exact values of the relative prevalences $π_{J}$ are typically not known in practice and must be estimated in some way to compute the PWER. For this purpose, one could possibly use findings from previous studies. Another possibility is to utilize the sample sizes collected in the study. In the latter case, the vector $π = (π_{J})_{J \subseteq I}$ may be estimated by the maximum-likelihood estimator (MLE) of the multinomial distribution with parameters $N$ and $π$ , whose discrete density $f : (n_{J})_{J \subseteq I} \mapsto N! \prod_{J \subseteq I} π_{J}^{n_{J}} / n_{J}!$ gives the probability of selecting $n_{J}$ patients from the population $P_{J}$ in a random sample of $N = \sum_{J \subseteq I} n_{J}$ patients. The MLE $\hat{π} = ({\hat{π}}_{J})_{J \subseteq I}$ has the form ${\hat{π}}_{J} = n_{J} / N$ , and is a mean unbiased and asymptotically consistent estimator. This implies that for a given fixed critical value $c$ the PWER can be estimated mean unbiasedly and consistently by plugging in ${\hat{π}}_{J}$ . However, in practice we would rather determine the critical value such that the estimated PWER equals the significance level $α$ .

4.1. Umbrella trial example

To illustrate how the test decisions change under control of the estimated PWER compared to FWER control and unadjusted testing, we regard a real data example from Kesselmeier et al.³ It is introduced to investigate two treatment allocation strategies for patients that are eligible for multiple arms in umbrella trials: the pragmatic strategy assigning the patients to the eligible subtrial with currently fewest patients, and random allocation of these patients. The example is based on the MAXSEP study (Brunkhorst et al.¹⁰) that compared the effect of meropenem to the effect of a combination therapy with moxifloxacin and meropenem in patients with severe sepsis. Kesselmeier et al.³ defined two overlapping populations,

\begin{aligned} P_{1} & = patients with baseline lactate value > 2 mmol / L \\ P_{2} & = patients with baseline C-reactive proteine value > 128 mg / L \end{aligned}

and for these they build 1000 bootstrap samples of sizes

N \in {100, 200, 500}

from the study data. For each they apply both allocation stategies, and also the gold-standard independent trial design (where the subtrials screen their patients independently for just one of the biomarkers). We use the resulting allocation numbers and prevalence estimates to compute the rejection boundaries one would get under FWER control and control of the estimated PWER, for both allocation strategies. We assume the significance level

α = 0.025

and equal residual variances. In the independent subtrials case, one would typically not adjust for multiplicity, such that one would use

Φ^{- 1} (1 - α) = 1.96

as critical value, where

Φ^{- 1}

denotes the quantile function of the standard normal distribution. The distribution of the resulting critical values is plotted in Figure 2. We see that they can be significantly reduced by replacing FWER control with PWER control and that PWER control leads to a good compromise between unadjusted testing and the more strict FWER control. In Figure 2, we also added the true critical boundaries for PWER control that we obtain from the true prevalences, to examine how well they are approximated by the estimated ones. Therefore we assume that the true prevalences are equal to the mean estimates over all the 1000 bootstrap samples. We see that the estimated values are close to the true ones (especially their average) and that the approximation improves with increasing sample size.

Figure 2.

Critical boundaries for the real data example from Kesselmeier et al.³ under FWER control, control of the estimated PWER and control of the true PWER. For each bootstrap sample, we utilize the allocation numbers and prevalence estimates that result from applying the pragmatic strategy and the random allocation strategy to calculate the FWER and the PWER. We obtain the true critical values by plugging the mean prevalence estimates into the PWER. We assume known and equal residual variances and use $α = 0.025$ as significance level. The critical value for unadjusted testing is $c_{unadj} = Φ^{- 1} (1 - α) = 1.96$ . FWER: family wise error rate; PWER: population-wise error rate.

4.2. Setup of the simulations

We now want to investigate the accuracy of the PWER estimation more systematically in some simulations. We will first focus on the cases where the distribution of the test statistics is known (Sections 3.1 and 3.2) and will deal with the case of unknown, homogeneous variances in Section 4.7. We find the estimated critical value $\hat{c}$ from the condition

\hat{PWER} (\hat{c}) := \sum_{J \subseteq I} {\hat{π}}_{J} (1 - F_{0, Σ_{J}} (\hat{c}, \dots, \hat{c})) = α

(5)

where

F_{0, Σ_{J}}

denotes the cdf of the Gaussian or

t

-distribution. Our goal is now to find out if the true PWER

PWER (\hat{c}) = \sum_{J \subseteq I} π_{J} (1 - F_{0, Σ_{J}} (\hat{c}, \dots, \hat{c}))

(6)

is then still controlled at the significance level

α

. Asymptotically, this is the case due to the almost sure convergence of

{\hat{π}}_{J}

π_{J}

. We can also show that the sequence of critical values (

{\hat{c}}_{N})

converges in probability towards the true critical value

c (π)

that would be obtained from the true prevalences and correspondingly proportioned sample sizes. This is not even necessary for the proof of asymptotic PWER control, but could be of interest for further considerations (see the discussion). Both results are shown in Appendix C.

For our simulations, we define the populations $P_{1}, \dots, P_{m}$ by $m$ different binary biomarkers that are expressed with probabilities $p_{1}, \dots, p_{m}$ . For every $i \in I$ , the population $P_{i}$ consists of all patients with the $i$ -th biomarker being expressed. In particular, since the biomarkers are not assumed to be exclusive, these populations overlap. It should be noted that we do not consider the configuration where none of the biomarkers are present, since these patients would typically be excluded from the study for ethical reasons. Hence, we consider here the overall population of patients where at least one biomarker is expressed. In the first step of the simulation, we define the biomarker expression probabilities $p_{i}$ (e.g. by randomly generating them from the uniform distribution). From the $p_{i}$ , we generate in turn a multinomial random vector containing the $2^{m} - 1$ strata-wise sample sizes. We find $\hat{π}$ by dividing this vector with the total sample size $N$ . The critical value $\hat{c}$ is then computed from formula equation (5) and plugged into the true PWER (formula equation (6)). We repeat this procedure 10,000 times, draw a boxplot of the resulting values of $PWER (\hat{c})$ , and tabulate their summary statistics.

4.3. Results

We run the simulations described above for many different scenarios in which the following aspects are varied:

We consider different numbers $m$ of biomarkers, from $m = 2$ to $m = 8$ biomarkers.

We consider different total sample sizes, from $N = 25$ to $N = 500$ patients.

We consider independent and dependent biomarkers.

We consider known and unknown homogeneous residual variances and known heterogeneous variances (see Section 4.7 for the unknown, heterogeneous case).

We consider pairwise different treatments ( $T_{i} \neq T_{j}$ for $i \neq j$ ) and equal treatments ( $T_{i} = T_{j}$ ).

We consider fixed biomarker expression probabilities (= fixed true prevalences) and probabilities that vary with each simulation run.

We consider equal patient allocation to the treatments within the strata and random allocation.

We consider different significance levels ( $α = 0.025$ and $α = 0.01$ ).

For all these configurations, the distribution of true PWER (obtained from formula equation (6)) is found to be very similar. The detailed results can be found in our Github repository. Exemplary, we now describe the results for the following configuration: we fix the total sample size at

N = 500

patients, which corresponds to a medium to large size for a multi-population study. We use

t

-distributed test statistics with the correlation matrix from formula equation (4), and regard

m = 2

m = 8

independent biomarkers. The biomarker expression probabilities are randomly and independently generated from the uniform distribution on

(0, 1)

in each simulation run. With this we cover situations with up to

2^{8} - 1 = 255

different strata. Additionally, we assume equal allocation of the patients to the eligible treatments within the strata. As significance level we take

α = 0.025

. The distribution of the true PWER for this configuration is plotted in Figure 3. One can see that for all

m

, the values are clustered quite tightly and symmetrically around

α

. The mean values, which approximate the expected, overall PWER, are all equal to

α

at least up to the fourth decimal place, and the standard deviations are all smaller than

4.2 \times 10^{- 4}

. The detailed summary statistics for Figure 3 can be found in Appendix D. For these reasons, the true PWER appears to be well under control. Small variations may only occur in individual situations: 5.47% of the values in Figure 3 are outside the interval

(0.02375, 0.02625)

and thus deviate from

α

by at least 5%. In summary, this implies that, conditional on the actually observed sample sizes, the PWER may (with a small probability) be moderately inflated or deflated, however, is well under control in the average.

Figure 3.

Distribution of the true population-wise error rate (PWER) for different numbers $m$ of binary and independent biomarkers. This corresponds to $m$ overlapping populations with $2^{m} - 1$ strata. The critical value $\hat{c}$ is computed from the estimated PWER, under the significance level $α = 0.025$ . We assume a total sample size of $N = 500$ screened patients and a stratified randomization to the treatments.

Figure 4 shows the results for another configuration where the number of biomarkers is fixed at $m = 3$ and the sample size varies from $N = 25$ to $N = 500$ . While larger inaccuracies can occur at sample sizes $N = 25$ or $N = 50$ , for $N = 100$ all values of the true PWER already deviate <15% from the significance level, and for $N = 200$ <10%. So the mentioned convergence of $PWER (\hat{c})$ with respect to $N$ takes place very quickly and applies to practical situations.

Figure 4.

Distribution of the true population-wise error rate (PWER) for different overall sample sizes $N$ and $m = 3$ independent biomarkers. The critical value $\hat{c}$ is computed from the estimated PWER, under the significance level $α = 0.025$ .

In Figure 3, we see that the boxplot ranges decrease as the number $m$ of populations increases. This is probably due to the fastly increasing number of strata and thus decreasing prevalences, so that for larger $m$ , single estimation inaccuracies in the strata probabilities do not have a major impact on the PWER. In practice, of course, the sample size would have to be increased in order to have enough patients in all strata. We get similar results when we assume equal prevalences for all strata and leave them constant over all simulation runs. However, if individual prevalences remain constant with an increasing number of strata, this is not the case any more. This can be seen, for example, when one prevalence is set to 0.5 independently of the number of strata. The range of the simulated values then increases with increasing $m$ , up to a standard deviation of $7.7 \times 10^{- 4}$ for $m = 8$ .

The assumption of independent biomarkers is not necessarily guaranteed in practice. But the above simulations can easily be extended to correlated biomarkers by deriving the biomarker probabilities from a normally distributed random vector with an arbitrary covariance matrix. We have done this for various cases and found no particular differences to the results presented above. The same applies to the other cases presented before.

4.4. Marginal sum estimator

When defining the populations for our simulation, we assumed that all patients without any biomaker being expressed would not be included in the study. However, these patients usually also go through the screening process, so their number is usually known. We can use this to define another estimate. Let ${\hat{τ}}_{J}, J \subseteq I$ denote the prevalence estimators that we get when including this stratum into the total population (so that we have ${\hat{τ}}_{\emptyset} > 0$ ). In the independent biomarkers case, another possibility of estimating $π_{J}$ is then using the marginal prevalence estimator

{\tilde{π}}_{J} = (\prod_{j \in J} {\hat{p}}_{j}) (\prod_{k \in I ∖ J} (1 - {\hat{p}}_{k})) / (1 - {\hat{τ}}_{\emptyset})

(7)

which is based on the marginal frequencies

{\hat{p}}_{j} = \sum_{J \subseteq I : j \in J} {\hat{τ}}_{J}

of the strata. Under the independence assumption, this also gives consistent estimates. We adopted the marginal prevalence estimator into our simulations and got similar results to Section 4.3. The deviation of the true PWER from

α

is slightly lower than before, with standard deviations lying between

2.7 \cdot 10^{- 4}

(

m = 2

) and

3.6 \cdot 10^{- 4}

(

m = 4, 5

4.5. Behavior of the strata-wise FWER

In PWER-controlling test procedures, we may also be interested in the behavior of the single strata-wise FWERs, which are defined by ${SWER}_{J} (c) := P (max_{j \in J} Z_{j} > c)$ for every $J \subseteq I, P_{J} \neq \emptyset$ , to verify whether excessive type I error probabilities occur in individual strata. Brannath et al.² give some upper bounds for ${SWER}_{J} (c)$ , whose quality depend on different factors like the prevalence $π_{J}$ or the number of biomarkers the stratum $P_{J}$ belongs to. Under the setup of the previous simulations, we also examine the values of the maximum $max_{J \subseteq I, P_{J} \neq \emptyset} {SWER}_{J} (\hat{c})$ (which equals the global FWER when $P_{I} \neq \emptyset$ ) and of the mean over all ${SWER}_{J} (\hat{c}), J \subseteq I, P_{J} \neq \emptyset$ . They are plotted in Figure 5, for the first case presented (with randomly generated true prevalences, independent biomarkers and unknown residual variances). We see in these simulations that on average, the maximal strata wise FWER is limited by $0.05 = 2 α$ . The mean SWER is on average only slightly larger than $α$ . We get very similar results in most other cases (see the detailed results in the repository). But we note that the bound $2 α$ for the maximal SWER is just an empirical finding and may be exceeded in individual cases (for specific randomly drawn true prevalences and strata-wise sample sizes) which can occur in individual studies. For example, even the average maximal strata-wise FWER is found to be much higher (up to 0.07455 for $m = 8$ populations) in the case where we set one prevalence to 0.5, independently of the number of strata.

Figure 5.

Distribution of the maximal strata-wise FWER and of the mean strata wise FWER for different number of biomarkers $m$ , under PWER control at level $α = 0.025$ . The overall sample size is fixed at $N = 500$ . FWER: family wise error rate; PWER: population-wise error rate.

4.6. Introduction of a minimal prevalence for neglected strata

For very small but non-empty population strata $P_{J}$ , it may happen that no patients are sampled, meaning that they are not included in the estimated PWER. Then we would not directly account for the multiplicity in these strata and might therefore expose future patients to an increased risk of receiving inefficacious treatments. This is especially a problem for strata that are intersections of many different populations, since smaller intersections may partially be controlled by the larger ones. If the biomarkers are independent, a solution could be to replace the MLE of the prevalences with the marginal estimator ${\tilde{π}}_{J}$ from Section 4.4. Since these estimates are based on the empirical marginal prevalences of the larger subpopulations $P_{j}$ , the problem of missed prevalences is avoided or at least much reduced. In the general case, one could introduce a minimal prevalence $π_{min}$ to include all ${SWER}_{J}$ with $n_{J} = 0$ in the estimated PWER, and reduce the other prevalences proportionally. However, all strata with estimated prevalences smaller than $π_{min}$ should not be penalized, so we suggest to increase their weights to $π_{min}$ as well. The minimal prevalence should not be chosen too large, as this could result in greater inaccuracy of the estimations. One possible choice is $π_{min} = 1 / (2^{m + 1} - 2)$ , that is, half of the prevalence that each stratum would have if all strata were of same size. Given the very small value of $π_{min}$ , especially for high number of biomarkers, we are then unlikely to overestimate the true prevalences of the strata with no observations (and underestimate the others).

This approach actually leads to more conservative tests: In additional simulations, we restrict the biomarker probabilities $p_{1}$ ,…, $p_{m}$ to the interval $(0, 0.1)$ , in order to achieve that large intersections preferably get small prevalences. We then compute the critical boundary $\hat{c}$ and the adjusted ${\hat{c}}_{min}$ resulting from weighting up all neglected strata by $π_{min}$ . Figure 6 compares the distributions of the true PWER, the maximal SWER and the mean SWER for these two boundaries, in a setting with $m = 3$ populations. We see that in the unadjusted case, the maximal SWER is increased in comparison to the previous simulation results (because we only consider unfavorable cases here), and is reduced to an acceptable level by the suggested adjustment of the prevalences. The true PWER and the mean SWER are also becoming more conservative. For larger number of biomarkers $m$ , this reduction becomes larger in absolute terms. For example, for $m = 6$ the mean true PWER decreases from 0.025 to 0.01403 and the mean maximal SWER decreases from 0.12579 to 0.0723 by replacing $\hat{c}$ with ${\hat{c}}_{min}$ . Note that in general ${\hat{c}}_{min}$ is not necessarily larger than $\hat{c}$ , especially when strata belonging to only few populations have small prevalences. Therefore we would suggest comparing $\hat{c}$ and ${\hat{c}}_{min}$ and choosing the greater one for PWER control.

Figure 6.

Comparison of the true PWER (left plot), the maximal strata-wise FWER (middle plot) and the mean strata-wise FWER (right plot) with and without the adjustment by the minimal prevalence $π_{min} = 1 / (2^{m + 1} - 2)$ . We consider a setting with $m = 3$ popoulations, $N = 500$ patients and $α = 0.025$ as significance level for PWER control. Only cases in which at least one stratum has no observations are included here. PWER: population-wise error rate; FWER: family wise error rate.

4.7. The case of unknown, heterogeneous variances

Under the assumption of heterogeneous and unknown variances the estimated critical value cannot be determined from condition equation (5) anymore, because the joint distribution of the test statistics is unknown in this case. Instead we need to follow an approximate approach. For FWER control, Hasler and Hothorn⁸ proposed to determine population-specific critical boundaries $c_{i}$ from an $n$ -dimensional $t$ -distribution with degrees of freedom according to Satterthwaite⁹

d f_{i}^{*} = \frac{{({\hat{σ}}_{P_{i}, T_{i}}^{2} / n_{P_{i}, T_{i}} + {\hat{σ}}_{P_{i}, C}^{2} / n_{P_{i}, C})}^{2}}{\frac{{({\hat{σ}}_{P_{i}, T_{i}}^{2} / n_{P_{i}, T_{i}})}^{2}}{n_{P_{i}, T_{i}} - 1} + \frac{{({\hat{σ}}_{P_{i}, C}^{2} / n_{P_{i}, C})}^{2}}{n_{P_{i}, C} - 1}}

and plug-in estimation of the correlation matrix. We use this distribution to find our estimated, population-specific critical values

{\hat{c}}_{1}, \dots, {\hat{c}}_{n}

from

\sum_{J \subseteq I} {\hat{π}}_{J} (1 - F_{0, {\hat{Σ}}_{J}, d f_{i}^{*}} ({\hat{c}}_{i}, \dots, {\hat{c}}_{i})) = α for all i \in I

where

F_{0, {\hat{Σ}}_{J}, d f_{i}^{*}}

denotes the cdf of the

| J |

-dimensional

t

-distribution with parameters

0

{\hat{Σ}}_{J} = ({\hat{Σ}}_{i j})_{i, j \in J}

and

d f_{i}^{*}

degrees of freedom. To approximate the resulting true PWER, we generate 10,000 random samples under the global null hypothesis, store the test results for each sample, and take the weighted average over all strata-wise proportions of rejected hypotheses:

PWER ({\hat{c}}_{1}, \dots, {\hat{c}}_{n}) \approx \sum_{J \subseteq I} π_{J} \frac{# reject at least one H_{j}, j \in J}{# simulation runs}

The results for the configuration presented before (with

N = 500

patients, pairwise different treatments, independent biomarkers, equal patient allocation and

α = 0.025

) with randomly generated variances are contained in Figure 7 (only for

m = 2

and

m = 3

populations due to the increased computational effort needed). The corresponding summary statistics can be found in Appendix D. We see that the true PWER is again controlled very well in the average, but now observe a higher variance of the simulated values. This is due to the only approximate calculation of the true PWER. We have also tried this approximation for some cases from Section 4.3 (where the true PWER was known) and found a similar variance there.

Figure 7.

Distribution of the approximated true population-wise error rate (PWER) for heterogeneous residual variances between strata and treatments, for $m = 2$ and $m = 3$ populations.

5. Discussion

The PWER is a new concept for measuring multiple type I errors in clinical trials with overlapping patient populations. It gives the average probability of erroneous rejections in the disjoint population strata. This makes the PWER only consider type I errors that are really relevant to the patients. The aim of this work was to investigate the stability of the PWER under plug-in estimation of the (usually unknown) strata prevalences. We have seen that in situations with up to eight different biomarkers and up to 255 strata, estimating prevalences does not prevent from adequate control of the true PWER. When interpreting the results from Section 4, one should distinguish between individual values of the true PWER, which are conditional on the vector of sample sizes $(n_{J})_{J \subseteq I}$ , and their mean over all simulation runs. An individual value of $PWER (\hat{c})$ is only meaningful for a specific study with a specific vector of sample sizes drawn from the multinomial distribution. We have seen a rather limited fluctation of these values. The unconditional PWER, where we average over all sample size configurations, approximates the expected PWER over many studies. For critical boundaries that are based on sample estimates of the prevalences, it is very well under control. In practice the situation may even be improved by also utilizing historical data in the estimation of the prevalences. We were also able to prove that the true PWER converges almost surely towards the significance level $α$ , that is, its asymptotic distribution is degenerated.

Regarding the maximal strata-wise FWER, we have seen in the simulations that under PWER control it is often indirectly controlled at a higher level, here $2 α$ . Note that there is no guarantee for this, so in practice the maximal SWER should always be investigated (e.g. by simulations) and be taken into account in order to adjust the PWER or the prevalences used, as indicated in Section 4.6 (if deemed necessary). However, when using the same critical value for all test statistics, it can be expected that by PWER control the FWER is at least smaller than the FWER with no multiplicity control.

One might think that for high number of strata (e.g. 255 strata for $m = 8$ populations) and low correlations, PWER control would quickly lead to overly conservative tests. However, note that in the setting with overlapping populations, conservatism does not increase with the number of strata, but only with the number of populations that determines the number of tests to be performed. By formula equation (4), the correlations only depend on the relative size of the intersections of populations in the control group which does not necessarily decrease with increasing number of populations. For example, consider a sequence of independent binary biomarkers—the proportion of a group with a specific set of multiple expressed biomarkers in its populations is then independent of the number of biomarkers (and just depends on the nature of these biomarkers).

An important prerequisite for the practical applicability of the PWER is the development of a sample size estimation for PWER-controlling studies, which is a subject for future reasearch. The convergence of the rejection boundaries presented in Appendix C could be of interest for this. Due to the restriction on type I errors that are relevant to the patients, we expect PWER control to enable higher power and lower sample sizes than FWER control.

Footnotes

Acknowledgements

The authors thank the anonymous referees for their helpful comments. We would also like to thank Dr. Miriam Kesselmeier for kindly providing the data set of her real data example, and Dr. Charlie Hillner for his comments on a previous version of the manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Remi Luschei

Werner Brannath

Appendix

References

Ouma

, et al. Design and analysis of umbrella trials: where do we stand. Front Med 2022; 9: 1037439.

Brannath

Hillner

Rohmeyer

. The population-wise error rate for clinical trials with overlapping populations. Stat Methods Med Res 2023; 32: 334–352.

Kesselmeier

, et al. Effect size estimates from umbrella designs: handling patients with a positive test result for multiple biomarkers using random or pragmatic subtrial allocation. PLoS ONE 2020; 15: 1–24.

Westfall

Young

. Resampling-based multiple testing. Examples and methods for p-value adjustment. Wiley series in probability and mathematical statistics, Applied probability and statistics. New York: Wiley, 1993.

Dickhaus

. Simultaneous Statistical Inference. With Applications in the Life Sciences. Berlin, Heidelberg: Springer, 2014.

Hillner

. Adaptive group sequential designs with control of the population-wise error rate. PhD thesis, University of Bremen, 2021.

Kotz

Nadarajah

. Multivariate T-distributions and their applications. Cambridge: Cambridge University Press, 2004.

Hasler

Hothorn

. Multiple contrast tests in the presence of heteroscedasticity. Biom J 2008; 50: 793–800.

Satterthwaite

. An approximate distribution of estimates of variance components. Biom Bull 1946; 2: 110–114.

10.

Brunkhorst

, et al. Effect of empirical treatment with moxifloxacin and meropenem vs meropenem on sepsis-related organ dysfunction in patients with severe sepsis: a randomized trial. JAMA 2012; 307: 2390–2399.