We introduce a new multiple type I error criterion for clinical trials with multiple, overlapping populations. Such trials are of interest in precision medicine, where the goal is to develop treatments targeted to specific sub-populations defined by genetic and/or clinical biomarkers. The new criterion is based on the observation that not all type I errors are relevant to all patients in the overall population. If disjoint sub-populations are considered, no multiplicity adjustment appears necessary, since a claim in one sub-population does not affect patients in the others. For intersecting sub-populations we suggest controlling the average multiple type I error rate, i.e. the probability that a randomly selected patient will be exposed to an inefficient treatment. We call this the population-wise error rate, illustrate it with a number of examples and show how to control it with adjusted critical boundaries or adjusted p-values. We furthermore define corresponding simultaneous confidence intervals. We finally illustrate the power gain achieved by passing from family-wise to population-wise error rate control with two simple examples and a recently suggested multiple-testing approach for umbrella trials.
The aim of precision medicine is to provide each patient with an optimal treatment tailored to his or her genetic and/or clinical profile. One strategy for reaching this goal is to undertake trials in which one or several treatments are investigated in multiple sub-populations. Examples of such trials are umbrella and basket trials in oncology. In an umbrella trial, patients with the same cancer type but different molecular alterations are enrolled and the treatments are tailored to the specific target sub-populations. In a basket trial, patients with different cancer types but one common molecular alteration are enrolled with the aim of studying one specific treatment that is targeted to the common alteration (see e.g. Woodcock and LaVange,1 Strzebonska and Waligora2). In many cases, the target sub-populations are disjoint by nature, but when many different biomarkers or cancer types are used, it can also occur that patients belong to more than one sub-population. For example, in the FOCUS4 study,3 biomarker tests were conducted to define subgroups based on the mutations present in the patients’ tumour DNA. Some patients belonged to more than one subgroup, and thus the subgroups were made disjoint by means of a hierarchical ordering defined for the different mutations. Further examples are subgroup selection and adaptive enrichment designs (e.g. Brannath et al.,4 Glimm and Di Scala,5 Wassmer and Brannath,6 Stallard et al.7), in which a single treatment is tested in an overall patient population and in a specific biomarker subgroup of it, for which earlier studies indicate that the treatment may be more effective, or even effective only there. In this manuscript, we explicitly allow biomarker-defined sub-populations to overlap, such that patients become eligible for multiple targeted treatment strategies, including the case of a single treatment with multiple target populations.
This means that, as a consequence of the trial’s test decisions, future patients in the overlap may be exposed to more than one potentially inefficient treatment strategy. Moreover, for such studies suitable allocation procedures have to be defined. Issues of eligibility for multiple targeted therapies have been addressed e.g. in Malik et al.,8 Collignon et al.9 and Kesselmeier et al.10
In confirmatory clinical trials with tests of several hypotheses, the multiple type I error rate is usually kept small by controlling the family-wise error rate (FWER). With the growing effort of detecting new and more predictive biomarkers and an increasing focus on rare diseases, it is becoming more and more difficult to undertake clinical trials that are sufficiently powered and also provide sufficient control of type I errors. Since the control of multiple type I errors amplifies this issue, more efficient alternatives to the common approach of FWER control are of strong interest. If a treatment or a treatment strategy is tested in several disjoint populations and each population is affected by only a single hypothesis test, the overall study basically consists of separate trials that merely share the same infrastructure. Therefore, no multiplicity adjustments are needed (e.g. Glimm and Di Scala,5 Collignon et al.9). However, if some sub-populations overlap, their intersections will contain patients that are possibly exposed to multiple erroneously rejected null hypotheses, implying that one has to adjust for multiplicity (e.g. Collignon et al.9). Since only patients in the intersections are concerned with this multiplicity issue, there is no need for adjustments for patients in the complements, who can each be affected by at most one false rejection of a null hypothesis. The FWER would therefore also be too conservative in this case. Especially for small and/or highly stratified populations, as for instance encountered in paediatric oncology, a more efficient approach is desirable (e.g. Fletcher et al.11). The purpose of this manuscript is to propose a new concept of multiple type I error control that is less conservative. With this new error rate, which we name population-wise error rate (PWER), we aim to keep the average multiple type I error rate at a reasonable level.
This provides control of the probability that a randomly chosen future patient will be exposed to an inefficient treatment policy.
The paper is outlined as follows. First, the PWER is motivated by means of a simple example, followed by the general mathematical definition. Then, we demonstrate how to control the PWER at a pre-specified level by adjusting critical boundaries or p-values. In Section 4, we present a mathematical result (a limit approximation) for the strata-wise FWER when adjusting the boundary for PWER control. In the subsequent section, the gain in power by using PWER- instead of FWER-control is illustrated by two examples. In the first example we investigate the case of two overlapping populations with (i) two different treatments in each population and (ii) the same treatment in both populations. The second example consists of an application of the PWER to a multiple testing approach for umbrella trials suggested in Sun et al.12 In Section 6 we extend the multiple test with PWER-control to simultaneous confidence intervals (SCIs) and discuss their coverage properties. The paper concludes with a discussion in Section 7. All computations and simulations are done in R. The corresponding R-script files are all available at https://github.com/chillner/RCode_Paper_PWER.
The population-wise error rate
In this section, the aforementioned PWER is introduced conceptually and formally. Examples for different settings are given to further deepen the understanding.
General framework and definition
Consider an overall population $P$ consisting of $m$ possibly overlapping sub-populations $P_1, \dots, P_m$ and suppose that we want to investigate for each $P_i$ a treatment $T_i$ in comparison to a control treatment $C_i$. The population $P_i$ is defined by specific inclusion and exclusion criteria, which may include specific biomarker characteristics. In trials on personalized treatments, the appropriate inclusion and exclusion criteria are often a research question in themselves, and then the same treatment is investigated in several different populations simultaneously. This is the case, for instance, in basket and subgroup selection trials. In other trial examples, like umbrella trials, different experimental treatments are investigated in the different sub-populations. In the sequel we call the tuples $(P_i, T_i)$ the treatment policies, where we allow the treatments to differ or coincide, so that there can be some $i$ and $j$ with $i \ne j$ such that $T_i = T_j$. To each treatment policy $(P_i, T_i)$ we assign the null hypothesis $H_i\colon \theta_i \le 0$, where $\theta_i$ quantifies the efficacy of treatment $T_i$ in comparison to the control treatment $C_i$ in population $P_i$. The control treatments, too, can be the same or can differ between the populations. The PWER is then given by the risk for a randomly chosen future patient to be assigned to one or more inefficient treatment policies, i.e. to belong to at least one $P_i$ for which $\theta_i \le 0$ but $H_i$ has been rejected. This future patient is imagined to be drawn after the study from the same population the study sample was drawn from.
In order to define the PWER mathematically, we partition the overall population $P$ into the disjoint strata $P_J = \bigcap_{i \in J} P_i \setminus \bigcup_{i \notin J} P_i$ for non-empty $J \subseteq \{1, \dots, m\}$. In Figure 1 we see an example of such a partition based on three sub-populations $P_1, P_2, P_3$. Note that $P = \bigcup_J P_J$. For each non-empty stratum $P_J$, we denote its proportion within the overall patient population by $\pi_J$, such that $\sum_J \pi_J = 1$. In most parts of the paper we will assume that the relative prevalences $\pi_J$ are known. In practical cases, they will have to be estimated; the effect of estimation will be discussed in Section 5.3. For any future patient in $P_J$, we commit a type I error if he or she belongs to at least one $P_i$ with $i \in J$ and $\theta_i \le 0$ for which $H_i$ has been rejected. The PWER is then defined as
$$\mathrm{PWER} = \sum_{J} \pi_J \, P\big(H_i \text{ rejected for some } i \in J \text{ with } \theta_i \le 0\big). \tag{1}$$
To determine the PWER, we need to know for each stratum the probability of rejecting at least one true null hypothesis that affects the stratum.
Figure 1. Three intersecting populations and their disjoint sub-populations.
Compared to the FWER, which controls the maximum risk for future patients to be assigned to an inefficient treatment strategy, the PWER is an average risk. It is more liberal and thereby more powerful, because
$$\mathrm{PWER} = \sum_J \pi_J \, \mathrm{FWER}_J \le \mathrm{FWER},$$
where $\mathrm{FWER}_J$ denotes the family-wise error rate restricted to stratum $P_J$. Note that $\mathrm{PWER} = \mathrm{FWER}$ only in the case where every stratum with positive prevalence attains the maximal strata-wise FWER.
Two intersecting populations
As an example, consider a trial with two intersecting sub-populations $P_1$ and $P_2$ and two treatments $T_1$, $T_2$ to be tested by means of the hypotheses $H_1\colon \theta_1 \le 0$ and $H_2\colon \theta_2 \le 0$. Usually, the two treatments will be compared to the same control; however, the basic idea given in the subsequent sections also applies with treatment-specific controls. As illustrated in the left panel of Figure 2, the overall population can be partitioned into three disjoint sub-populations, $P_1 \setminus P_2$, $P_2 \setminus P_1$ and $P_1 \cap P_2$. Obviously, we commit a type I error for $P_1 \setminus P_2$ whenever $H_1$ is falsely rejected, for $P_2 \setminus P_1$ whenever $H_2$ is falsely rejected, and for $P_1 \cap P_2$ whenever $H_1$ or $H_2$ is falsely rejected. Hence, if $H_1$ and $H_2$ are both true, then
$$\mathrm{PWER} = \pi_{\{1\}}\, P(\text{reject } H_1) + \pi_{\{2\}}\, P(\text{reject } H_2) + \pi_{\{1,2\}}\, P(\text{reject } H_1 \text{ or } H_2).$$
If $H_1$ is true and $H_2$ is false, then $\mathrm{PWER} = (\pi_{\{1\}} + \pi_{\{1,2\}})\, P(\text{reject } H_1)$. Hence, if only one null hypothesis is true, say $H_1$, the PWER reduces to the probability of rejecting $H_1$ multiplied by the relative size of the population $P_1$.
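To make the two-population case concrete, the PWER under the global null can be evaluated numerically. The following Python sketch (the paper's own scripts are in R; the function name and prevalences are illustrative assumptions) considers independent standard normal test statistics tested one-sided at an unadjusted 2.5% level:

```python
from scipy.stats import norm

def pwer_two_populations(c, pi1, pi2, pi12):
    """PWER under the global null for two intersecting populations with
    independent standard normal test statistics and common critical value c."""
    p_rej = 1 - norm.cdf(c)        # P(reject H_i) for each single test
    p_any = 1 - norm.cdf(c) ** 2   # P(reject H_1 or H_2) under independence
    return pi1 * p_rej + pi2 * p_rej + pi12 * p_any

# Illustrative prevalences: 40% / 40% in the complements, 20% in the overlap
c_unadj = norm.ppf(0.975)
print(pwer_two_populations(c_unadj, 0.4, 0.4, 0.2))  # 0.029875, above 0.025
```

Only the overlap term exceeds the nominal 2.5%, which is why the inflation of the unadjusted PWER grows with the prevalence of the intersection.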
Figure 2. Left panel: two intersecting populations. Right panel: nested populations.
In situations where a single treatment $T = T_1 = T_2$ is tested in both sub-populations $P_1$ and $P_2$, the specific case in which only one hypothesis is rejected deserves some discussion. In clinical trials, the final decisions for future patients usually depend on additional analyses that explore efficacy and safety across the different populations and strata. Rejection of $H_1$, say, is then a necessary but not a sufficient condition for the decision to treat future patients from $P_1$ with treatment $T$. The PWER is then under control in any case, namely even in the worst-case scenario where all future patients of $P_1$ receive treatment $T$ although $T$ is inefficient in the other, intersecting population.
Nested populations
In practice, one often faces the problem of nested populations $P_1 \supseteq P_2 \supseteq \dots \supseteq P_m$, as in the right panel of Figure 2. Think of a situation where the optimal eligibility criteria for a single treatment are unknown and thus a sequence of tightening eligibility criteria for testing the efficacy of that treatment is planned, which ultimately leads to a sequence of nested populations. Define the strata $P_{J_i}$ for $J_i = \{1, \dots, i\}$, which are $P_{J_i} = P_i \setminus P_{i+1}$ for $i < m$ and $P_{J_m} = P_m$. We commit a type I error for $P_{J_i}$ whenever any true $H_j$ with $j \le i$ is rejected. With relative prevalences $\pi_{J_i}$ of $P_{J_i}$, the PWER under the global null hypothesis is given by
$$\mathrm{PWER} = \sum_{i=1}^m \pi_{J_i}\, P\big(\text{reject } H_j \text{ for some } j \le i\big).$$
In particular, if $P_i$ is defined by a biomarker $B$, i.e. $P_i = \{B \ge c_i\}$ for cut-off points $c_1 \le c_2 \le \dots \le c_m$ (with $c_{m+1} = \infty$), the PWER under the global null hypothesis can be written as
$$\mathrm{PWER} = \sum_{i=1}^m P\big(c_i \le B < c_{i+1}\big)\, P\big(\text{reject } H_j \text{ for some } j \le i\big).$$
Three populations with two intersections
Finally, we give an example where the FWER is strictly conservative even for control of the maximum (instead of the average) type I error rate. Consider three populations $P_1$, $P_2$, $P_3$ with $P_1 \cap P_2 \ne \emptyset$, $P_2 \cap P_3 \ne \emptyset$ and $P_1 \cap P_3 = \emptyset$, as in Figure 1. Again, hypotheses of the form $H_i\colon \theta_i \le 0$ are to be tested in each population, respectively. Under the global null hypothesis, where all null hypotheses are true, the PWER is given by
$$\mathrm{PWER} = \pi_{\{1\}} P(\text{reject } H_1) + \pi_{\{2\}} P(\text{reject } H_2) + \pi_{\{3\}} P(\text{reject } H_3) + \pi_{\{1,2\}} P(\text{reject } H_1 \text{ or } H_2) + \pi_{\{2,3\}} P(\text{reject } H_2 \text{ or } H_3).$$
The FWER under the global null hypothesis equals $P(\text{reject } H_1, H_2 \text{ or } H_3)$. Since no patient can belong to $P_1$ and $P_3$ simultaneously, the FWER corrects for a multiplicity no patient is affected by.
Control of the PWER
In this section, we demonstrate how to achieve control of the PWER at a pre-specified level $\alpha$ under the general framework of Section 2.1. Suppose that each $H_i$ can be tested with a test statistic $Z_i$, where larger values of $Z_i$ provide evidence against $H_i$. We assume further that the joint distribution of $(Z_1, \dots, Z_m)$ is known (at least asymptotically). In order to control the PWER at a pre-specified significance level $\alpha$, we need to find the smallest critical value $c_\alpha$ such that
$$\sum_J \pi_J\, P_\vartheta\big(Z_i \ge c_\alpha \text{ for some } i \in J \cap I_0(\vartheta)\big) \le \alpha, \tag{3}$$
where $\vartheta$ is the parameter configuration that maximizes the PWER and $I_0(\vartheta)$ is the index set of the corresponding true null hypotheses. The maximal PWER is usually obtained under the global null hypothesis, i.e. for $I_0 = \{1, \dots, m\}$. This is e.g. the case under the subset pivotality condition (see e.g. Westfall and Young,13 Dickhaus14), which applies, for instance, to contrast t- or z-statistics for normal data with a variance that is homogeneous across treatment groups and population strata. Since the (asymptotic) correlations between the test statistics usually depend only on the relative prevalences $\pi_J$, the PWER-level $\alpha$ can be exhausted under the global null hypothesis. When each $H_i$ is tested by means of a p-value $p_i$, we can reach PWER-control by the choice of an adjusted significance level $\alpha_{\mathrm{loc}}$ applied to all $p_i$. The critical value in (3) or the adjusted significance level can be found by applying a univariate root-finding method. Because the PWER is always bounded by the FWER, the critical value and adjusted significance level are more liberal than those for FWER-control. Therefore PWER-control leads to a higher power and a lower sample size requirement for a given target power.
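For independent standard normal test statistics, the defining condition reduces to a sum over strata of $\pi_J\,(1 - \Phi(c)^{|J|})$, and the critical value can be found with a univariate root finder as described above. A Python sketch under these assumptions (the paper's own code is in R; the strata encoding and prevalences are illustrative):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def pwer(c, strata):
    """PWER under the global null, independent standard normal statistics:
    sum over strata J of pi_J * P(max_{i in J} Z_i >= c)."""
    return sum(pi * (1 - norm.cdf(c) ** len(J)) for J, pi in strata)

def critical_value(strata, alpha=0.025):
    """Solve PWER(c) = alpha; PWER is strictly decreasing in c, and the root
    lies between the unadjusted and the Bonferroni critical value."""
    lo = norm.ppf(1 - alpha)
    hi = norm.ppf(1 - alpha / max(len(J) for J, _ in strata))
    return brentq(lambda c: pwer(c, strata) - alpha, lo, hi)

# Three populations as in Figure 1: five non-empty strata with prevalences
strata = [(frozenset({1}), 0.3), (frozenset({2}), 0.2), (frozenset({3}), 0.3),
          (frozenset({1, 2}), 0.1), (frozenset({2, 3}), 0.1)]
c_alpha = critical_value(strata)
```

The resulting critical value lies strictly between the unadjusted quantile $z_{1-\alpha}$ and the Bonferroni-adjusted one, reflecting that only a fifth of the patients are affected by two hypotheses.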
Instead of determining the critical value $c_\alpha$, we could report the PWER-adjusted p-values
$$\tilde p_i = \sum_J \pi_J\, P\big(Z_j \ge z_i \text{ for some } j \in J\big),$$
where $z_i$ is the observed value of $Z_i$. Obviously, $\tilde p_i \le \alpha$ if and only if $z_i \ge c_\alpha$, and hence $H_i$ can alternatively be tested with the PWER-adjusted p-value $\tilde p_i$. Furthermore, $\tilde p_i$ gives the smallest PWER-level at which the hypothesis $H_i$ can be rejected.
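Under the same independence assumptions as above, the PWER-adjusted p-value is simply the PWER evaluated with the observed statistic in place of the critical value. A minimal sketch (strata and observed value are illustrative):

```python
from scipy.stats import norm

def pwer_adjusted_p(z_obs, strata):
    """PWER-adjusted p-value: the PWER evaluated at the observed statistic
    z_obs used as common critical value (independent standard normal case)."""
    return sum(pi * (1 - norm.cdf(z_obs) ** len(J)) for J, pi in strata)

strata = [(frozenset({1}), 0.45), (frozenset({2}), 0.45), (frozenset({1, 2}), 0.10)]
p_adj = pwer_adjusted_p(2.1, strata)   # adjusted p-value for an observed z = 2.1
p_unadj = 1 - norm.cdf(2.1)            # unadjusted counterpart
```

In this example the adjustment is mild: the adjusted p-value lies between the unadjusted p-value and its Bonferroni-style doubling, because only the small overlap stratum contributes a second rejection opportunity.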
Note that we could control the PWER also with population-specific critical values $c_i$ (or adjusted levels $\alpha_i$). Unique solutions can be obtained by coupling the $c_i$ through pre-specified weights $w_i$ and searching for the solution that meets the pre-specified PWER-level. Multiplicity-adjusted p-values can also be calculated with the weights $w_i$. The weights may, for instance, be chosen larger for smaller populations in order to increase the chance of finding efficient treatment policies for small sub-populations. However, due to the weighting by $\pi_J$ in definition (1) and expression (3), the multiple type I error rate for $P_J$ will automatically be larger for smaller $\pi_J$. We will therefore only consider equal critical values in our examples below. This has the additional advantage that, when all test statistics have the same marginal null distribution, the critical value cannot fall short of the $(1-\alpha)$-quantile of the marginal null distribution (see Appendix 1). We further note that, due to the natural heterogeneity of the treatment effects within each sub-population, it would be advisable to apply a stratified test to investigate a treatment in each $P_i$ (as done in Section 5).
Behaviour of the strata-wise FWER under PWER-control
In order to understand the consequences of PWER-control, it is of interest to investigate the behaviour of the FWER for each stratum $P_J$, i.e. the risk for future patients in $P_J$ to be exposed to at least one inefficient treatment strategy. Obviously, under PWER-control the strata-wise FWER for each $P_J$ is bounded by $\alpha/\pi_J$. This bound is useful only for sufficiently large $\pi_J$, because there is another bound which is independent of $\pi_J$: as mentioned in the last section, in the common situation where all test statistics have the same marginal null distribution, the common critical value $c_\alpha$ is bounded from below by the $(1-\alpha)$-quantile of the marginal null distribution. Therefore, the strata-wise FWER for $P_J$ is at most $\beta_J$, the probability that at least one $Z_i$, $i \in J$, exceeds this quantile; see Appendix 1. Note that $\beta_J$ is independent of the prevalences because the defining probability is conditional on the sample sizes. Moreover, $\beta_J$ is at most $|J|\,\alpha$ by the Bonferroni inequality. The bound $\alpha/\pi_J$ becomes useless at the latest when it is above the Bonferroni bound, which is equivalent to $\pi_J \le 1/|J|$. The following theorem gives an approximation for small $\pi_J$ which is often smaller than $\beta_J$, but depends on the prevalences of the strata $P_{J'}$, $J' \ne J$, relative to the complement of $P_J$. For this theorem we need to define for each $J$ the PWER that we obtain when removing $P_J$ from the entire population:
$$\mathrm{PWER}_{-J}(c) = \sum_{J' \ne J} \frac{\pi_{J'}}{1 - \pi_J}\, P\big(Z_i \ge c \text{ for some } i \in J'\big). \tag{5}$$
Accordingly, we call $\mathrm{PWER}_{-J}$ the complementary PWER of $P_J$. Note that the sum in (5) is a weighted average of all non-$J$ terms, where the weights $\pi_{J'}/(1-\pi_J)$ are the prevalences of the $P_{J'}$, $J' \ne J$, relative to the complement of $P_J$.
Theorem. Assume that for all $J$ the strata-wise $\mathrm{FWER}_J(c)$ is strictly decreasing and differentiable as a function of $c$. Then $\mathrm{PWER}(c)$ and $\mathrm{PWER}_{-J}(c)$ are differentiable and strictly decreasing in $c$ as well, and we find $c_{-J}$ such that $\mathrm{PWER}_{-J}(c_{-J}) = \alpha$. With this we get
$$\mathrm{FWER}_J(c_\alpha) = \mathrm{FWER}_J(c_{-J}) + R(\pi_J), \tag{6}$$
where the remainder $R(\pi_J)$ converges linearly with $\pi_J$ to zero while all relative prevalences $\pi_{J'}/(1-\pi_J)$, $J' \ne J$, remain constant. Furthermore, when $\mathrm{FWER}_J(c) \ge \mathrm{PWER}_{-J}(c)$ for all $c$, then $\mathrm{FWER}_J(c_\alpha) \le \mathrm{FWER}_J(c_{-J})$ for all $\pi_J$.
Note that the last statement follows, for instance, when $\mathrm{FWER}_J(c) \ge \mathrm{FWER}_{J'}(c)$ for all $J' \ne J$, in particular for the stratum with the largest strata-wise FWER. The proof of the theorem can be found in Appendix 2. Its conditions are weak: the strict monotonicity and differentiability of the strata-wise FWER are satisfied when the joint distribution of $(Z_1, \dots, Z_m)$ is continuous. This is often the case, at least asymptotically.
We illustrate the approximation (6) with the examples in Figure 2, starting with the left panel, i.e. two intersecting populations. In this case the intersection stratum $J = \{1,2\}$ satisfies $\mathrm{FWER}_J(c) \ge \mathrm{FWER}_{J'}(c)$ for any $J' \ne J$, and thereby the approximation is an upper bound for the maximal strata-wise family-wise error rate. If $Z_1, Z_2$ have the same marginal null distribution, then $c_{-J}$ is the $(1-\alpha)$-quantile of this distribution, e.g. $c_{-J} = z_{1-\alpha}$ when $Z_1, Z_2$ are standard normal under the null hypotheses. The upper bound then equals the FWER of two unadjusted tests, which is always below $2\alpha$ and even smaller if we account for the positive correlation between $Z_1$ and $Z_2$.
For three nested populations (right panel of Figure 2), the maximal strata-wise FWER is, according to the theorem, approximately $\mathrm{FWER}_{\{1,2,3\}}(c_{-J})$ for the innermost stratum $J = \{1,2,3\}$, where $c_{-J}$ solves
$$\frac{\pi_{\{1\}}}{1 - \pi_{\{1,2,3\}}}\, P(Z_1 \ge c) + \frac{\pi_{\{1,2\}}}{1 - \pi_{\{1,2,3\}}}\, P(Z_1 \ge c \text{ or } Z_2 \ge c) = \alpha,$$
with the strata $P_{\{1\}} = P_1 \setminus P_2$ and $P_{\{1,2\}} = P_2 \setminus P_3$. The threshold $c_{-J}$ depends on the relative prevalences and on the correlation between $Z_1$ and $Z_2$, whereby the latter is determined by the sample sizes and by whether the same or different treatments are tested in the populations (see Section 5.2 below). In a numerical illustration with the nominal level $\alpha = 0.025$, a single treatment, standard normally distributed test statistics, balanced treatment groups in each stratum and sample sizes that match the population prevalences, the resulting upper bound for the strata-wise FWER stays clearly below the Bonferroni bound $3\alpha$. Since this bound results from taking the limit $\pi_{\{1,2,3\}} \to 0$ (while keeping the sample sizes and the complementary conditional prevalences fixed), the actual maximal strata-wise FWER for a positive prevalence $\pi_{\{1,2,3\}}$ is somewhat smaller.
Comparison with FWER-controlling procedures
Since the PWER is more liberal than the FWER, the natural next question is how much this affects power and sample size. We first compare PWER-control with FWER-control for two intersecting sub-populations when (i) the same and (ii) two different treatments are investigated in each sub-population. Second, we apply PWER-control to the multiple testing approach for umbrella trials considered in Sun et al.12 and compare it to the originally suggested FWER-control.
Combination of independent samples
We start with a hypothetical, but statistically simple situation. Assume that a treatment $T$ is investigated for two intersecting populations $P_1$, $P_2$ that are defined by two different biomarkers. Assume further that a sponsor has decided to test the effect of $T$ for the two biomarker-positive groups in two disjoint samples within a single clinical trial. We consider here the one-sided hypotheses $H_i\colon \theta_i \le 0$, $i = 1, 2$, for the efficacy of $T$ in $P_i$ against a control, where $\theta_i$ is given as a weighted mean of the unknown effects in the respective strata $P_i \setminus P_{3-i}$ and $P_1 \cap P_2$. Since the analysis of the two samples is done in a single study, regulatory authorities may require a multiple testing adjustment. Let us assume that PWER-control is accepted as a compromise between control of the FWER and unadjusted testing, the latter being the case when submitting two different studies. PWER-control bounds the overall probability for a future patient to be exposed to an inefficient treatment strategy.
Since the two treatment strategies $(P_i, T)$, $i = 1, 2$, are investigated in two independent samples, the corresponding test statistics $Z_1, Z_2$ are stochastically independent. Let us further assume that both are normally distributed with variance 1. The question is now what we gain in terms of power by switching from FWER- to PWER-control. We will assume an overlap between the two populations $P_1$ and $P_2$ of probability $\pi_{\{1,2\}}$ that will be varied in our investigation.
Let $\Phi$ and $\Phi^{-1}$ be the standard normal distribution and quantile functions, respectively. By the independence of the test statistics, the FWER is controlled at level $\alpha$ by Šidák's critical value $c_{\mathrm{FWER}} = \Phi^{-1}\big(\sqrt{1-\alpha}\big)$. Following the example in Section 2.2, the PWER is given by
$$\mathrm{PWER}(c) = \big(1 - \pi_{\{1,2\}}\big)\,\big(1 - \Phi(c)\big) + \pi_{\{1,2\}}\,\big(1 - \Phi(c)^2\big),$$
where $c$ is the critical value used for control of the PWER at level $\alpha$. Note that $\pi_{\{1,2\}}$ determines how much multiplicity adjustment is needed for PWER-control. Solving $\mathrm{PWER}(c) = \alpha$ yields
$$c_\alpha = \Phi^{-1}\left(\frac{-\big(1 - \pi_{\{1,2\}}\big) + \sqrt{\big(1 - \pi_{\{1,2\}}\big)^2 + 4\,\pi_{\{1,2\}}\,(1-\alpha)}}{2\,\pi_{\{1,2\}}}\right),$$
see Appendix 3 for the derivation. For $\pi_{\{1,2\}} \to 0$ this critical value decreases to $z_{1-\alpha}$, coinciding with the unadjusted case, and for $\pi_{\{1,2\}} \to 1$ we have $c_\alpha = \Phi^{-1}\big(\sqrt{1-\alpha}\big)$, the Šidák critical value.
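The closed-form solution can be checked against a direct numerical root search of the PWER equation. A Python sketch under the same assumptions (two independent samples, standard normal test statistics; the chosen overlap of 0.3 is illustrative):

```python
import math
from scipy.optimize import brentq
from scipy.stats import norm

def pwer_indep(c, pi12):
    """PWER under the global null: complement terms plus overlap term."""
    return (1 - pi12) * (1 - norm.cdf(c)) + pi12 * (1 - norm.cdf(c) ** 2)

def c_closed_form(pi12, alpha=0.025):
    """Solve (1 - x) * (1 + pi12 * x) = alpha for x = Phi(c), then invert Phi."""
    if pi12 == 0:
        return norm.ppf(1 - alpha)
    x = (-(1 - pi12) + math.sqrt((1 - pi12) ** 2 + 4 * pi12 * (1 - alpha))) / (2 * pi12)
    return norm.ppf(x)

c_num = brentq(lambda c: pwer_indep(c, 0.3) - 0.025, 1.5, 3.0)
c_cf = c_closed_form(0.3)   # agrees with the numerical root
```

In the limits, `c_closed_form(0.0)` returns the unadjusted quantile and `c_closed_form(1.0)` the Šidák critical value $\Phi^{-1}(\sqrt{1-\alpha})$, as stated above.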
To assess the power gain by using PWER- instead of FWER-control, we consider the factor of sample size increase with PWER- or FWER-control in comparison to the one with no multiplicity correction. Aiming for a marginal power of at least $1-\beta$, the sample size for each population has to be at least $n_c = \big((c + z_{1-\beta})/\delta\big)^2$ with critical value $c$ and standardized effect (non-centrality parameter) $\delta$. The fractions
$$\frac{n_{c_{\mathrm{PWER}}}}{n_{z_{1-\alpha}}} = \left(\frac{c_{\mathrm{PWER}} + z_{1-\beta}}{z_{1-\alpha} + z_{1-\beta}}\right)^2 \quad \text{and} \quad \frac{n_{c_{\mathrm{FWER}}}}{n_{z_{1-\alpha}}} = \left(\frac{c_{\mathrm{FWER}} + z_{1-\beta}}{z_{1-\alpha} + z_{1-\beta}}\right)^2 \tag{8}$$
describe how much more sample size one would need for a marginal power of $1-\beta$ when the multiplicity adjustments are performed.
Figure 3 shows these factors for FWER- and PWER-control depending on the size of $\pi_{\{1,2\}}$ when both populations are assumed to be of equal size. FWER-control requires an increase in sample size of about 21%, while PWER-control requires considerably less, depending on $\pi_{\{1,2\}}$. The larger the intersection, the more patients are exposed under the global null hypothesis to two false rejections; therefore the critical value increases and the sample size increases as well. At $\pi_{\{1,2\}} = 1$, PWER and FWER coincide and so do the sample sizes. If, for instance, the intersection makes up only a modest share of the union of the two populations, the sample size increase needed with PWER-control is less than half of what is necessary with FWER-control.
Figure 3. Factor of sample size increase compared to the unadjusted case to achieve a fixed marginal power with PWER- and FWER-control in a combination of two independent studies with different but overlapping populations. PWER: population-wise error rate; FWER: family-wise error rate.
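The factors in (8) are easy to reproduce. The sketch below assumes a one-sided level of 0.025 and a marginal power of 80%; these specific values are assumptions, chosen because they reproduce the roughly 21% increase for FWER-control quoted above:

```python
from scipy.stats import norm

def sample_size_factor(c, alpha=0.025, power=0.8):
    """Factor by which the per-population sample size grows when the
    critical value c replaces the unadjusted quantile z_{1-alpha},
    keeping the marginal power fixed."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    return ((c + z_b) / (z_a + z_b)) ** 2

# FWER-control of two independent tests via the Sidak critical value
c_sidak = norm.ppf((1 - 0.025) ** 0.5)
factor_fwer = sample_size_factor(c_sidak)   # about 1.21
```

Replacing `c_sidak` by the smaller PWER critical value yields correspondingly smaller factors, down to 1.0 for disjoint populations.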
Testing population-specific effects in one study
We consider now a single study with two overlapping populations $P_1$, $P_2$, in each of which a treatment $T_i$ is compared to a common control $C$. We will investigate two possible scenarios, namely (i) $T_1 \ne T_2$ and (ii) $T_1 = T_2$. For simplicity, we assume again that both populations have the same size, i.e. $\pi_{\{1\}} = \pi_{\{2\}}$. We assume further that the data from each population are normally distributed with mean treatment difference $\theta_i$ and a common known variance (across treatments and subgroups), and that z-tests are used to test $H_i\colon \theta_i \le 0$. For each stratum $P_J$, we denote by $n_J$ the sample size in $P_J$ and by $n$ the overall total sample size.
(i) Unequal treatments. In scenario (i) we have to think of a way to randomize patients to either treatment or control. In the complements $P_1 \setminus P_2$ and $P_2 \setminus P_1$ we simply apply 1:1 randomization to treatment $T_i$ or control $C$. In the intersection we apply 1:1:1 randomization to the three groups $T_1$, $T_2$ and $C$. By this we can assume that in the complement $P_i \setminus P_{3-i}$ there are $n_{\{i\}}/2$ patients in the treatment and control group, whereas in the intersection there are $n_{\{1,2\}}/3$ patients in each group.
Obviously, this type of allocation leads to an inconsistency between the sample sizes and prevalences. Say the intersection has a prevalence of 30% within $P_1$, so that of 100 patients in $P_1$, 70 belong to $P_1 \setminus P_2$ and 30 to $P_1 \cap P_2$. However, applying the above allocation rule implies that only 10 of the 45 patients sampled from $P_1$ and assigned to treatment $T_1$ (about 22% instead of 30%) belong to the intersection. This means that the proportions of the strata-wise sample sizes within a treatment group do not match their corresponding proportions in the population. Hence, the population-wise means must be estimated by a weighted sum of strata-wise means:
$$\hat\mu_i^{(g)} = \frac{\pi_{\{i\}}}{\pi_{\{i\}} + \pi_{\{1,2\}}}\, \bar X_{\{i\}}^{(g)} + \frac{\pi_{\{1,2\}}}{\pi_{\{i\}} + \pi_{\{1,2\}}}\, \bar X_{\{1,2\}}^{(g)},$$
where $\bar X_{J}^{(g)}$ is the mean response in stratum $P_J$, $J \in \{\{i\}, \{1,2\}\}$, under treatment group $g$. In the above example, we would need to compute $0.7\, \bar X_{\{1\}}^{(T_1)} + 0.3\, \bar X_{\{1,2\}}^{(T_1)}$ for treatment $T_1$.
The z-test statistic is finally given by $Z_i = \big(\hat\mu_i^{(T_i)} - \hat\mu_i^{(C)}\big) / \sqrt{\operatorname{Var}\big(\hat\mu_i^{(T_i)} - \hat\mu_i^{(C)}\big)}$. Since in the intersection the same control group is used for both test statistics, they are positively correlated. Assuming $\pi_{\{1\}} = \pi_{\{2\}}$, the correlation becomes a function of $\pi_{\{1,2\}}$. The calculation of this correlation and an expression for the variance can be found in Appendix 4.
(ii) Equal treatments. In scenario (ii), we investigate one and the same treatment $T$ in both populations and apply 1:1 randomization in every stratum. By this we can use the same form of test statistic $Z_i$ as above. Because we are using the same treatment in both populations, we expect a higher correlation between $Z_1$ and $Z_2$. Indeed, for $\pi_{\{1\}} = \pi_{\{2\}}$ the correlation is greater than or equal to the correlation with different treatments for all $\pi_{\{1,2\}}$; see Appendix 4.
For both scenarios, we intend to find critical values to control the PWER and FWER, respectively. Following Section 2.2, the PWER under the global null is given by
$$\mathrm{PWER}(c) = \big(\pi_{\{1\}} + \pi_{\{2\}}\big)\,\big(1 - \Phi(c)\big) + \pi_{\{1,2\}}\,\big(1 - \Phi_\rho(c, c)\big),$$
with $c$ being the critical value that is to be found, and $\Phi_\rho$ the cumulative distribution function of the bivariate normal distribution with standard normal marginals and correlation $\rho$. A univariate root-finding algorithm can now be used to solve $\mathrm{PWER}(c) = \alpha$ for $c$.
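With correlated test statistics the bivariate normal CDF enters the computation; in R this would use pmvnorm from the mvtnorm package, and the Python sketch below uses scipy's multivariate normal instead (prevalences and correlation are illustrative assumptions):

```python
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def pwer_bivariate(c, pi1, pi2, pi12, rho):
    """PWER under the global null with bivariate normal (Z_1, Z_2),
    standard normal margins and correlation rho."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    p_single = 1 - norm.cdf(c)
    p_any = 1 - mvn.cdf([c, c])    # P(Z_1 >= c or Z_2 >= c)
    return (pi1 + pi2) * p_single + pi12 * p_any

# Root search between the unadjusted and the Bonferroni critical value
c_pwer = brentq(lambda c: pwer_bivariate(c, 0.35, 0.35, 0.3, 0.5) - 0.025,
                norm.ppf(0.975), norm.ppf(1 - 0.0125))
```

The positive correlation induced by the shared control group lowers the required critical value compared to the independent case.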
As an example, suppose we are in scenario (i) (different treatments) with equally sized populations and a moderate intersection. Solving $\mathrm{PWER}(c) = \alpha$ and the corresponding FWER condition yields two critical values, the PWER-based one being smaller. Using (8), this translates into a sample size increase of around 20% for FWER-control but only around 5% for PWER-control.
Figure 4 shows graphs of the sample size increases for both scenarios and both types of multiple error control in dependence of $\pi_{\{1,2\}}$. At $\pi_{\{1,2\}} = 0$ (disjoint populations), for instance, the PWER approach yields no sample size increase, whereas the FWER-based method yields an increase of about 20%. With increasing intersection size, the difference between the sample sizes for PWER- and FWER-control declines until both values coincide at $\pi_{\{1,2\}} = 1$, where the PWER is equal to the FWER.
Figure 4. Factor of sample size increase compared to the unadjusted case for FWER- and PWER-control in a single study with two overlapping populations, depending on the size of the intersection $\pi_{\{1,2\}}$. The left panel is for scenario (i) with different experimental treatments and a common control; the right panel is for scenario (ii) with equal experimental treatments. The target marginal power is the same in both scenarios. PWER: population-wise error rate; FWER: family-wise error rate.
For the PWER, this graphic also illustrates that the correlation of the test statistics and the degree of adjustment needed to correct for multiplicity behave like opposing ’forces’. At $\pi_{\{1,2\}} = 0$ the test statistics are uncorrelated, but there is no need to adjust for multiplicity with PWER-control. At $\pi_{\{1,2\}} = 1$ there is only one population, so the correlation is 1, which again implies that no multiplicity adjustment is needed, although we are formally testing two hypotheses for everyone. For $\pi_{\{1,2\}}$ between 0 and 1 we obtain the maximum of the PWER and the corresponding sample size increase. Mathematically, this can be seen by rewriting the PWER as $\mathrm{PWER}(c) = 1 - \Phi(c) + \pi_{\{1,2\}}\,\big(\Phi(c) - \Phi_\rho(c, c)\big)$. For fixed $c$, only the second term depends on $\pi_{\{1,2\}}$. It is the product of two non-negative factors, $\pi_{\{1,2\}}$ and $\Phi(c) - \Phi_\rho(c, c)$, where the first increases from 0 to 1 while the second decreases to 0 as the correlation grows with the intersection.
Estimation of population prevalences
In clinical practice, the assumption of known prevalences is rarely justified, and it is natural to ask whether replacing $\pi_J$ by an estimate will significantly inflate the PWER. A suitable choice is the maximum likelihood estimator (MLE) $\hat\pi_J = n_J/n$ from the multinomial distribution of the strata-wise sample sizes. Using these estimates instead of the true prevalences, we compute the critical value $\hat c_\alpha$ by solving the estimated PWER equation. This guarantees asymptotic control of the PWER, since $\hat\pi_J$ is consistent and the joint distribution of the test statistics used in the calculation of $\hat c_\alpha$ is conditional on the sample sizes.
We examine the PWER by means of scenarios (i) and (ii) of Section 5.2. For each constellation of true prevalences, we generate sample size vectors from the corresponding multinomial distribution and compute the MLEs $\hat\pi_J$. To see by how much the true PWER is inflated, the probabilities of a type I error for each sub-population are computed by using the ‘estimated’ critical value $\hat c_\alpha$ and the conditional correlation structure of the involved test statistics. By weighting each of these probabilities by its respective true population prevalence $\pi_J$, we obtain the true PWER for the given ‘estimated’ critical value. This procedure was repeated 10,000 times, and the mean of the true PWERs was taken as an approximation of the actual overall PWER. Figure 5 shows contour plots of this approximation of the overall PWER for scenarios (i) and (ii) and two different total sample sizes. The plots indicate that the target PWER of 0.025 may be missed only slightly, even for moderate sample sizes. More precisely, the mean true PWER values deviate only negligibly from 0.025, with small standard errors throughout. To be fair, however, the standard deviation of a single true PWER implies that values of around 0.026 are quite possible.
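The simulation just described can be sketched in a simplified form. The following Python code considers two overlapping populations with independent test statistics (so that the critical value has the closed form of Section 5.1); the sample size, prevalences and number of repetitions are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n = 0.025, 200
pi_true = np.array([0.4, 0.4, 0.2])   # P1-only, P2-only, intersection

def crit(pi12):
    """Closed-form PWER critical value for two independent z-statistics."""
    if pi12 <= 0:
        return norm.ppf(1 - alpha)
    x = (-(1 - pi12) + np.sqrt((1 - pi12) ** 2 + 4 * pi12 * (1 - alpha))) / (2 * pi12)
    return norm.ppf(x)

def true_pwer(c):
    """PWER actually attained when the critical value c is used."""
    return (1 - pi_true[2]) * (1 - norm.cdf(c)) + pi_true[2] * (1 - norm.cdf(c) ** 2)

# Estimate prevalences from multinomial samples, recompute the critical
# value and evaluate the PWER attained under the true prevalences
pwers = [true_pwer(crit(rng.multinomial(n, pi_true)[2] / n)) for _ in range(2000)]
mean_pwer = float(np.mean(pwers))     # close to alpha on average
```

Averaged over repetitions, the attained PWER stays very close to the nominal level, in line with the contour plots of Figure 5.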
Figure 5. Contour plots of the actual overall PWER when using ML estimates for the prevalences in the determination of the critical value at level $\alpha = 0.025$. The first row corresponds to scenario (i) and the second row to scenario (ii) from Section 5.2. Because the prevalences sum to one, the contour plots are restricted to the lower left rectangle of the squares. PWER: population-wise error rate; ML: maximum likelihood.
In the case of a small relative population size it might happen that, by chance, no patient is recruited in this stratum. This would mean that we do not fully account for the multiplicity affecting these patients. If such an intersection cannot be excluded theoretically from the inclusion and exclusion criteria or by medical arguments, we could impose a small minimal value for all $\hat\pi_J$ in order to be more conservative. Also, different approaches like shrinkage methods or Bayesian estimation of the prevalences are conceivable options for future research.
Lastly, note that we have assumed here that the patient population of the trial is representative of the future patient population. If this cannot be assumed, one could take prevalence estimates from more representative studies. If these estimates are based on sufficiently large sample sizes, control of the PWER should be achieved similarly well as, or better than, with the trial data.
Multiple testing approaches for umbrella trials
We consider now a multiple testing approach for umbrella trials suggested in Sun et al.12 and investigate the gain in power by switching from FWER- to PWER-control. Following Sun et al.,12 we assume $m$ disjoint population strata, which are denoted here by $P_1, \dots, P_m$ and which are subsets of a larger and more general patient population from which we randomly sample. In each stratum $P_i$ a specific experimental treatment $T_i$ shall be compared to a control treatment $C$. For simplicity, we assume that each population $P_i$ has the relative population prevalence $\pi_i$, which is assumed to equal $n_i/n$, where $n_i$ is the number of patients in $P_i$ and $n$ is the total number of patients in the sample. This holds in practice at least approximately; see also Section 5.3.
With only small , the establishment of a treatment effect in the individual strata is difficult and impossible to achieve with sufficient power. Therefore, study designs have been suggested that compare with the global treatment strategy which assigns treatment to population strata . Application examples for such trial designs where a pooled comparison of the global strategy to has been chosen for the primary analysis (with the strata-wise comparisons as secondary analyses only) are Hill et al.15 and Owadally et al.16 Such an overall comparison of the strategy with utilizes the total sample size and does also not require multiple testing. However, it does not permit a claim for a sub-population when the effect of is heterogeneous. To improve the approach, Sun et al.12 suggest to test all sub-strategies , , that consider only the union with treatment assignments as in , against the control in . This permits claims also for sub-populations and thereby increases the possibility for efficacy conclusions. Of course, such testing requires an adjustment for multiplicity. Sun et al.12 provide a (single-step) procedure that controls the FWER. For the formal description of the procedures, let be the vector of unknown treatment effects (mean differences) in the populations, and consider for each the average treatment effect in :
with the relative prevalence of . Sun et al.12 assume the linear model
where denotes the treatment indicator for patient in group , which equals 1 if the patient is assigned to the experimental treatment and otherwise, and is the treatment effect of in population . The error terms are assumed to be i.i.d. normally distributed with mean and homogeneous variance . As mentioned above, the authors suggest testing
Note that the and , , correspond to the and , , of Sections 2 and 3.
From the least squares estimate of the linear model, we obtain one-sided -test statistics for testing each . In order to control the FWER, Sun et al.12 use a single-step procedure that compares each with the upper -quantile of the distribution of under the global null hypothesis, i.e. under the assumption that none of the treatments is superior to the control. We finally select the subset for which a positive treatment effect is claimed and which yields the largest value of ,
To achieve PWER-control at the same level , we determine the critical value such that holds under the global null hypothesis. While the strata are disjoint, some of their unions overlap. Since not all of them overlap, the FWER corrects the multiple type I error rate for cases that cannot occur (similar to Example 3) and hence may be viewed as overly conservative.
The PWER under the global null hypothesis () is given by
where ‘’ denotes all that contain the index . This is because population is affected by a type I error whenever a hypothesis is erroneously rejected that corresponds to a population for which (or ).
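Based on this description, the displayed formula can be reconstructed as follows; note that the symbols used here (prevalence $\pi_i$ of stratum $S_i$, union-wise test statistic $T_J$, common critical value $c$, and $K$ strata) are our own notational assumptions, since the original symbols did not survive extraction:

```latex
\mathrm{PWER}
  \;=\; \sum_{i=1}^{K} \pi_i \,
        P_{H_0}\!\Bigl( \bigcup_{J \ni i} \{\, T_J > c \,\} \Bigr),
```

i.e. the prevalence-weighted average, over the disjoint strata, of the probability that at least one hypothesis whose union contains stratum $i$ is rejected.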
Due to the assumption of a homogeneous residual variance and the mean parameter in the linear model (10), follows a joint -distribution with degrees of freedom. In R, the distribution function of the multivariate -distribution is implemented in the mvtnorm-package (see Genz et al.17) via the function pmvt which needs the degrees of freedom and the correlation matrix of the test statistics as input (see e.g. Bretz et al.18). The correlation matrix can be computed using the contrast matrix and the design matrix of the linear model. Probabilities in (13) are then calculated by choosing the appropriate sub-matrices of the correlation matrix. Thus, for known values of , , and , we can numerically determine the critical value such that .
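The numerical search for the critical value can be sketched as follows. The sketch below deviates from the paper's exact computation: it uses a standard normal approximation in place of the multivariate t (i.e. large degrees of freedom), assumes three equally sized strata with independent stratum-wise z-statistics so that the statistic of a union is the standardized sum, tests all non-empty unions, and evaluates the rejection probabilities by common-random-number Monte Carlo rather than pmvt. All names and the set-up are illustrative assumptions, not the authors' implementation.

```python
import itertools
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
K, alpha = 3, 0.025                       # three disjoint strata (assumption)
pi = np.full(K, 1.0 / K)                  # equal prevalences (assumption)
subsets = [J for r in range(1, K + 1)
           for J in itertools.combinations(range(K), r)]

# Under the global null: independent stratum-wise z-statistics; the
# statistic of a union J is the standardized sum over its strata.
Z = rng.standard_normal((200_000, K))
T = np.column_stack([Z[:, J].sum(axis=1) / np.sqrt(len(J)) for J in subsets])

def pwer(c):
    # prevalence-weighted average of P(some union containing i is rejected)
    return sum(
        pi[i] * (T[:, [j for j, J in enumerate(subsets) if i in J]]
                 .max(axis=1) > c).mean()
        for i in range(K))

def fwer(c):
    # probability of at least one rejection anywhere
    return (T.max(axis=1) > c).mean()

c_pwer = brentq(lambda c: pwer(c) - alpha, 1.0, 4.0)
c_fwer = brentq(lambda c: fwer(c) - alpha, 1.0, 4.0)
print(round(c_pwer, 2), round(c_fwer, 2))  # c_pwer is the smaller (less conservative)
```

Since every union contains only some of the strata, the PWER at a given cut-off is strictly below the FWER, so the PWER-calibrated critical value is smaller, which is the source of the power gain discussed below.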
We know that , which implies that whenever the FWER-approach selects a non-empty , the same set is selected by the PWER-approach, . We may, however, select the empty set with the FWER-approach, , while .
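The selection step shared by both approaches can be sketched as follows. Which quantity is maximized among the rejected unions is an assumption here (we use an estimated average effect), and all names are hypothetical:

```python
def select_subset(z, est_effect, subsets, c):
    """Single-step selection rule (sketch): among all tested unions whose
    test statistic exceeds the critical value c, return the union with the
    largest estimated average effect; return () if nothing is rejected."""
    rejected = [(J, e) for J, s, e in zip(subsets, z, est_effect) if s > c]
    if not rejected:
        return ()
    return max(rejected, key=lambda t: t[1])[0]
```

Because the PWER-calibrated critical value is smaller, every set selected under FWER-control is also rejected under PWER-control, while the PWER rule may additionally select a set when the FWER rule selects none.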
Performance measures. Sun et al.12 examined several quality and performance measures to assess how good a selected subset is. For example, they considered the average effect in the overall population when applying treatment strategy in and the control in the rest of the population. We will consider the relative quantity where is the expectation with respect to the sample distribution and is the weighted average of the positive treatment effects,
that describes how efficient the experimental treatment strategy is for the union of sub-populations that benefit from . Since the PWER-procedure chooses a non-empty more often than the FWER-procedure, this quantity will always be larger for the PWER-approach.
In addition to this measure, we investigate the average size of the ‘correctly’ chosen subgroups within the selected ones, i.e. the average of where and . This gives the fraction of the patient cohort that benefits from the experimental treatment strategy within the fraction that is exposed to as a result of the study. Analogously, we are interested in the average relative size of the ‘falsely’ chosen subgroups within the chosen ones: with . Lastly, we consider the probability of rejecting at least one false null hypothesis,
as a way to measure the power of the procedures.
Design of the simulation. To make our results comparable to those of Sun et al.,12 we conducted simulations with roughly the same parameters. That is, for the cases of sub-populations and a significance level , we chose a total sample size of and assumed that all group-specific intercepts equal 0. Also, for simplicity, each group is assumed to be of equal size, i.e. .
As in Sun et al.,12 we assume non-negative effects and choose based on the number of subgroups and three further characteristics. The first one is the percentage of true null hypotheses: with the size of . The second one characterizes the treatment effect heterogeneity and is defined as
where and . Note that equals the relative half-range of the positive ’s, i.e. half of their range divided by the average of their extremes. Obviously, a large means a large heterogeneity between the positive . The third one is the weighted average as previously introduced.
Given values for , , and , one finds a grid of equidistant points such that the three characteristics are met. One easily verifies that this grid is uniquely determined by the four quantities. Following Sun et al.,12 we chose such that is always an integer. Note that for there is at most one and so (no heterogeneity) is the only possible value for . Moreover, we assume in all simulations.
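The grid construction can be sketched as follows. Since the positive effects are equidistant, their mean equals the average of their extremes, so the relative half-range and the target average pin down the endpoints. A minimal Python sketch, assuming equal stratum sizes; the function and argument names are our own, not the paper's notation:

```python
import numpy as np

def effect_grid(K, pi0, b, avg_effect):
    """Equidistant effect vector with fraction pi0 of true nulls, relative
    half-range b and average positive effect avg_effect (a sketch)."""
    m0 = round(pi0 * K)          # number of true nulls (chosen to be an integer)
    m1 = K - m0                  # number of positive effects
    if m1 == 0:
        return np.zeros(K)
    if m1 == 1:
        assert b == 0, "a single positive effect has no heterogeneity"
        pos = np.array([avg_effect])
    else:
        # b = half-range / average of extremes, so the equidistant grid runs
        # from avg_effect*(1-b) to avg_effect*(1+b) and has mean avg_effect
        pos = np.linspace(avg_effect * (1 - b), avg_effect * (1 + b), m1)
    return np.concatenate([np.zeros(m0), pos])
```

For example, `effect_grid(8, 0.5, 0.5, 1.0)` yields four zero effects and four positive effects running from 0.5 to 1.5 with mean 1.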
Results. The simulation results for and are given in Table 1, and those for and in Appendix 5. One can see from the tables that control of the PWER, in comparison to FWER-control, provides substantially larger power, a larger average proportion of ‘correctly’ chosen subgroups and a larger average effect. It also increases the proportion of ‘falsely’ chosen subgroups. This is because a subgroup is selected more frequently with PWER-control.
Simulation results for and , assuming .
         Power  Correct  False   RAE     Power  Correct  False   RAE
PWER      36.4     36.4    0      29         –        –      –     –
FWER      31.0     31.0    0      25         –        –      –     –
PWER      40.4     40.4    0      33         –        –      –     –
FWER      34.6     34.6    0      28         –        –      –     –
PWER      51.2     51.2    0      47         –        –      –     –
FWER      45.2     45.2    0      42         –        –      –     –
PWER      57.7     52.9    4.8    58         0        0     36     0
FWER      52.0     47.8    4.2    52         0        0     24     0
PWER      36.2     36.2    0      23      42.2     38.8    3.5    30
FWER      27.4     27.4    0      17      32.7     30.1    2.6    24
PWER      37.9     37.9    0      25      44.8     41.3    3.6    33
FWER      29.1     29.1    0      19      35.5     32.7    2.7    26
PWER      43.0     43.0    0      30      52.7     48.9    3.7    42
FWER      33.7     33.7    0      24      43.2     40.2    3.0    35
PWER      53.2     45.7    7.6    46         –        –      –     –
FWER      43.8     37.8    6.1    38         –        –      –     –
PWER      58.8     51.1    7.8    50         –        –      –     –
FWER      49.5     43.1    6.4    42         –        –      –     –
PWER      73.9     65.9    8.0    68         –        –      –     –
FWER      65.3     58.5    6.8    60         –        –      –     –
PWER      81.5     70.1   11.5    81         0        0    4.2     0
FWER      75.1     64.9   10.2    75         0        0    2.4     0
Results for power (%), the percentage of correctly and falsely chosen sub-populations and the RAE for PWER- and FWER-control under parameter configurations that depend on the fraction of true null hypotheses and the relative half-range of the positive ‘s. RAE: relative average effect; FWER: family-wise error rate; PWER: population-wise error rate.
While the proportion of ‘falsely’ chosen subgroups is increased by at most 2.2 percentage points and remains below 5% (one-sided), the proportion of ‘correctly’ chosen subgroups (among the selected ones) and the power are increased by up to 10 percentage points and often by more than 5. The expected effect RAE is always larger with PWER-control.
Under the global null hypothesis (), the average proportion of ‘falsely’ selected populations theoretically equals the one-sided FWER. With PWER-control at level 2.5%, the FWER was found to lie between 3.6% and 4.5% for . Note that the average proportion of ‘falsely’ selected populations also exceeds the level of 2.5% (sometimes substantially) with FWER-control when there is an effect in some but not all population strata.
In summary, control of the PWER substantially increases the chance of delivering efficient treatments, while the risk of receiving an inefficient treatment and the percentage of patients who do not benefit from the treatment decisions increase only moderately and remain comparable to those of the procedure with FWER-control.
Extension to SCIs
We now return to the general set-up of Sections 2 and 3. Utilizing the duality between (multiple) hypothesis tests and (simultaneous) confidence intervals, the multiple test procedure with control of the PWER, introduced in Section 3, can be extended to confidence intervals for the efficacy parameters , . In this section, we introduce the dual SCIs and discuss their coverage properties.
To introduce the confidence intervals, let be a vector of possible values for and consider the corresponding null hypotheses , . Assume further that , are (asymptotically) pivotal test statistics for , i.e., the (asymptotic) joint distribution of under is the same for all . If decreases in for the given data, then it makes sense to form the one-sided intervals with the lower bound
where is the critical value defined in (3) for . Because is pivotal, the critical value is independent of . The monotonicity of applies to most (one-sided) tests and is satisfied, e.g., for Wald-type test statistics , where is an estimate of (e.g. the MLE) with a standard error that is independent of the parameter value . In this case, we obtain .
Upper confidence bounds can be derived by applying the same principle and two-sided confidence intervals are obtained by the intersection of the two one-sided intervals. With Wald-type dual tests we obtain the two-sided intervals .
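For Wald-type statistics, the dual bounds can be written down directly. The following minimal Python sketch (names are assumptions) forms the lower and upper bounds from the estimates, their standard errors and the PWER-adjusted critical value; using the same one-sided critical value on both sides gives a two-sided interval whose non-coverage probability doubles, as noted below:

```python
import numpy as np

def pwer_sci(theta_hat, se, c):
    """Two-sided simultaneous confidence intervals dual to the PWER test
    (Wald-type sketch): estimate -/+ c * standard error per parameter."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    return theta_hat - c * se, theta_hat + c * se
```

For example, with an estimate of 1.0, a standard error of 0.5 and a (hypothetical) critical value of 2.0, the interval is [0.0, 2.0].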
We finally discuss the coverage properties of the above-introduced confidence bounds and intervals. We start with the lower confidence bounds . To this end, consider a patient randomly drawn from and let be the set of indices of the sub-populations the patient belongs to, i.e. . The set gives all population efficacy parameters , , that are relevant for patient . Note that is a random set because is randomly drawn from . If is the true unknown efficacy parameter, then by definition (14) we get if and only if . Since the dual tests for control the PWER, the (simultaneous) probability that any of the lower confidence bounds , , falls above the true is at most . This gives the coverage property
meaning that with a probability of at most , for a randomly chosen patient , the lower confidence intervals , , cover all true that are relevant to this patient. Because if and only if , we can write the coverage probability as
Hence, equation (15) means to control a kind of average simultaneous coverage probability where we focus in each stratum on the relevant confidence statements and average the strata-wise coverage probability over the entire population .
The upper confidence bounds and two-sided confidence intervals control the same type of average simultaneous coverage probability. As with classical confidence intervals, the two-sided interval has a non-coverage probability twice as large as that of the one-sided intervals.
Discussion
This paper introduces a new multiple type I error rate concept for clinical trials with multiple and possibly intersecting populations that permits more powerful tests than control of the FWER. It relies on the observation that not all patients are affected by all test decisions, since not all hypotheses concern all population strata. By averaging the individually relevant multiple type I errors over the entire population, it provides control of the probability that a randomly selected future patient will be exposed to an inefficient treatment strategy. This average multiple type I error rate, which we call the population-wise error rate (PWER), is smaller than the maximal FWER a patient stratum is exposed to. In Section 4, we discussed several bounds for the strata-wise FWER that follow from control of the PWER and illustrated them with numerical examples. Hence, control of the PWER guarantees control of the maximal strata-wise FWER at a larger, only indirectly defined level.
Let us recall that we only consider population-wise claims, i.e. claims on treatment strategies that consist of a treatment and a population the treatment is intended for, and for which the average treatment effect is the estimand of interest. This is also the case when aiming for FWER-control. No individual efficacy claims are anticipated here. Error control of patient-wise claims is impossible without sacrificing power or making strong assumptions. However, a population-wise claim can be viewed as a proxy or approximation for individual claims in the target population.26 Test results from more than a single population may be used for a more informed individual decision. Depending on the inconsistency of the efficacy estimates across the individual strata, we may refrain from turning a formal rejection into a claim in the corresponding population. Hence, with PWER-control, we consider the worst-case scenario, where an efficacy claim for a treatment strategy always leads to an application of the treatment to all patients in the target population. Note that we do not account for potential off-label use, where a treatment is applied to patients outside its target population.
We have presented a simple approach for achieving PWER-control by adjustment of critical values and have illustrated the power gain when passing from FWER- to PWER-control in a number of examples. We have mainly considered the simple situation of multivariate normally distributed test statistics. This situation applies, at least asymptotically, to a large number of hypothesis tests, for which PWER-control is then guaranteed asymptotically. The methods and principles introduced here can also be implemented with other finite-sample distributions, such as the multivariate -distribution (as done in Section 5.4), or be improved via resampling methods. Variance heterogeneity across populations is a general issue for trials with multiple populations that applies similarly to procedures with FWER-control (see e.g. Placzek and Friede24). One can say that whenever control of the FWER is possible, control of the PWER is possible as well, since the latter just controls an average of FWERs. We have also extended the suggested multiple test to SCIs and shown that these intervals control, for a randomly chosen patient, the probability of a simultaneously correct statement on the parameters that are relevant for the stratum the patient belongs to.
One referee asked at which level the PWER should be controlled. We suggest controlling the PWER at the one-sided level of 2.5% that is usually used for FWER-control, because the PWER has an interpretation close to that of the FWER, namely the risk of exposing a future patient to an inefficient treatment strategy. Of course, this choice of is as arbitrary as it is for FWER-control, and there might be reasons for choosing another level, e.g. in phase II studies.
Control of the PWER requires knowledge of the relative prevalences of all disjoint population strata. These may either be obtained from previous studies or be estimated at the end of the study. This complicates PWER-control. We have illustrated in an example with two populations that the estimation of the prevalences does not strongly harm PWER-control, even with moderate sample sizes. However, more examples with more hypotheses are required to fully explore this issue. At least, PWER-control is always guaranteed asymptotically.
Since our procedure simply results in an adjustment of critical values, power calculations and power simulations deviate only minimally from approaches for classical multiple tests, except that the critical values may depend on the sample via the prevalence estimates. This can be resolved by using a priori estimates of the prevalences based on experience and past studies. The same issue arises from the estimation of the correlation structure of the test statistics used for efficient PWER- and FWER-control. A misspecification of the prevalences may be corrected in a mid-trial blinded sample size review.19
In Section 3, we suggested a single-step procedure to control the PWER, and one might ask whether this procedure can be uniformly improved by a step-down test, because this is the case for single-step tests with FWER-control (e.g. Dmitrienko et al.20). For instance, in Example 2.2 with two intersecting hypotheses, we may ask whether we can test with a smaller critical value when has already been rejected with critical value . One can quickly see that this is not possible. To this end, assume that both hypotheses and are true. Rejecting when or with obviously increases the second and third terms in equation (2) of the PWER. Since we have chosen to be the smallest critical value that satisfies (3), which leads to a PWER equal to with continuously distributed (a generic and common situation), we do not control the PWER for any . We may define PWER-controlling step-down tests with an enlarged in order to mimic and improve step-down tests with FWER-control. However, such procedures do not uniformly improve the single-step test with PWER-control and are therefore beyond the scope of this paper. The development of step-down tests with PWER-control is a topic for future research.
Single-step procedures have the advantage that they can directly be extended by simple and always informative SCIs. We illustrated this in Section 6 for single-step tests with PWER-control. An extension to simple and always informative SCIs is impossible for step-down tests: compatible SCIs are often non-informative in the sense that they do not provide any information beyond the hypothesis tests themselves,21,22 and sufficiently informative SCIs are compatible only with a modification of the original step-down test.23 This justifies the use of single-step tests in practice.
We finally remark that an extension of the presented PWER-approach to multi-stage and adaptive designs is under development by the authors and will be the topic of future contributions. Multi-stage and particularly flexible designs provide the opportunity for adding or dropping populations at interim analyses based on the unblinded interim data (e.g. Brannath et al.,4 Wassmer and Brannath,6 Placzek and Friede24). In the example of Section 2.2, we may, for instance, add and enrich the intersection of the two populations for investigation in a second stage of the study if efficacy of the treatment is seen at interim in only one of the two populations. Hence, the development of adaptive and sequential designs with PWER-control is an interesting and valuable research task. It also has the potential to provide a valuable contribution to platform trials, for which FWER-control has been discussed (e.g. Stallard et al.,7 Collignon et al.9) and alternatives (like FDR-control) have been suggested rather recently (e.g. Zehetmayer et al.,25 Robertson et al.26).
Acknowledgements
The authors thank the referees for their valuable comments that have led to major improvements of the paper, and Dr. Miriam Kesselmeier from the University Hospital of Jena for her constructive comments on a previous version of this manuscript. We finally thank Remi Luschei for his comments on a revised version of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the BMBF under funding number 01EK1503B.
ORCID iDs
Werner Brannath
Charlie Hillner
Appendix
References
1. Woodcock J, LaVange LM. Master protocols to study multiple therapies, multiple diseases, or both. N Engl J Med 2017; 377: 62–70.
2. Strzebonska K, Waligora M. Umbrella and basket trials in oncology: ethical challenges. BMC Med Ethics 2019; 20: 58. DOI: 10.1186/s12910-019-0395-5.
3. Kaplan R, Maughan T, Crook A, et al. Evaluating many treatments and biomarkers in oncology: a new design. J Clin Oncol 2013; 31. DOI: 10.1200/JCO.2013.50.7905.
4. Brannath W, Zuber E, Branson M, et al. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Stat Med 2009; 28: 1445–1463.
5. Glimm E, Di Scala L. An approach to confirmatory testing of subpopulations in clinical trials. Biom J 2015; 57: 897–913.
6. Wassmer G, Brannath W. Group sequential and confirmatory adaptive designs in clinical trials. Springer International Publishing, Switzerland, 2016.
7. Stallard N, Todd S, Parashar D, et al. On the need to adjust for multiplicity in confirmatory clinical trials with master protocols. Ann Oncol 2019; 30: 506–509.
8. Malik SM, Pazdur R, Abrams JS, et al. Consensus report of a joint NCI thoracic malignancies steering committee: FDA workshop on strategies for integrating biomarkers into clinical development of new therapies for lung cancer leading to the inception of “master protocols” in lung cancer. J Thorac Oncol 2014; 9: 1443–1448.
9. Collignon O, Gartner C, Haidich AB, et al. Current statistical considerations and regulatory perspectives on the planning of confirmatory basket, umbrella, and platform trials. Clin Pharmacol Ther 2020; 107: 1059–1067.
10. Kesselmeier M, Benda N, Scherag A. Effect size estimates from umbrella designs: handling patients with a positive test result for multiple biomarkers using random or pragmatic subtrial allocation. PLoS ONE 2020; 15: 1–24.
11. Fletcher JI, Ziegler DS, Trahair TN, et al. Too many targets, not enough patients: rethinking neuroblastoma clinical trials. Nat Rev Cancer 2018; 18: 389–400.
12. Sun H, Bretz F, Gerke O, et al. Comparing a stratified treatment strategy with the standard treatment in randomized clinical trials. Stat Med 2016; 35: 5325–5337.
13. Westfall PH, Young SS. Resampling-based multiple testing. Wiley, New York, 1993.
15. Hill JC, Whitehurst DG, Lewis M, et al. Comparison of stratified primary care management for low back pain with current best practice (STarT Back): a randomised controlled trial. The Lancet 2011; 378: 1560–1571.
16. Owadally W, Hurt C, Timmins H, et al. PATHOS: a phase II/III trial of risk-stratified, reduced intensity adjuvant treatment in patients undergoing transoral surgery for human papillomavirus (HPV) positive oropharyngeal cancer. BMC Cancer 2015; 15: 602.
18. Bretz F, Hothorn T, Westfall P. Multiple comparisons using R. CRC Press, New York, 2016. ISBN 9781420010909.
19. Placzek M, Friede T. Clinical trials with nested subgroups: analysis, sample size determination and internal pilot studies. Stat Methods Med Res 2018; 27: 3286–3303.
20. Dmitrienko A, Tamhane A, Bretz F. Multiple testing problems in pharmaceutical statistics. CRC Press, New York, 2009.
21. Strassburger K, Bretz F. Compatible simultaneous lower confidence bounds for the Holm procedure and other Bonferroni-based closed tests. Stat Med 2008; 27: 4914–4927.
22. Guilbaud O. Alternative confidence regions for Bonferroni-based closed-testing procedures that are not alpha-exhaustive. Biom J 2009; 51: 721–735.
23. Brannath W, Schmidt S. A new class of powerful and informative simultaneous confidence intervals. Stat Med 2014; 33: 3365–3386.
24. Placzek M, Friede T. A conditional error function approach for adaptive enrichment designs with continuous endpoints. Stat Med 2019; 38: 3105–3122.
25. Zehetmayer S, Posch M, Koenig F. Online control of the false discovery rate in group-sequential platform trials, 2021. DOI: 10.48550/ARXIV.2112.10619. https://arxiv.org/abs/2112.10619.
26. Robertson DS, Wason JMS, König F, et al. Online error control for platform trials, 2022. DOI: 10.48550/ARXIV.2202.03838. https://arxiv.org/abs/2202.03838.
27. Liu K, Meng XL. Comment: a fruitful resolution to Simpson’s paradox via multiresolution inference. Am Stat 2014; 68: 17–29.