Sage Journals: Discover world-class research

Abstract

Many randomized trials have used overall survival as the primary endpoint for establishing non-inferiority of one treatment compared to another. However, if a treatment is non-inferior to another treatment in terms of overall survival, clinicians may be interested in further exploring which treatment results in better health utility scores for patients. Examining health utility in a secondary analysis is feasible; however, since health utility is not the primary endpoint, it is usually not considered in the sample size calculation, and hence the power to detect a difference in health utility is not guaranteed. Furthermore, the premise of non-inferiority trials is often to test the assumption that an intervention provides a superior quality of life or toxicity profile without compromising survival when compared to the existing standard. Based on this consideration, it may be beneficial to consider both survival and utility when designing a trial. Existing methods can combine survival and quality of life into a single measure, but they either have strong restrictions or lack theoretical frameworks. In this manuscript, we propose a method called health utility adjusted survival, which can combine survival outcome and longitudinal utility measures for treatment comparison. We propose an innovative statistical framework as well as procedures to conduct power analysis and sample size calculation. Through comprehensive simulation studies involving summary statistics from the PET-NECK trial, we demonstrate that our new approach can achieve superior power performance using relatively small sample sizes, and that our composite endpoint can be considered as an alternative to overall survival in future clinical trial design and analysis where both survival and health utility are of interest.

Keywords

Health utility, overall survival, time-to-event data, hazard ratio, proportional hazards, randomized controlled trials

Statement of significance

We propose a composite endpoint that can be considered as an alternative to overall survival in future clinical trial design and analysis where both survival and health utility are of interest. It may achieve higher power to demonstrate the benefit of a treatment while requiring a smaller sample size.

1. Introduction

In many clinical studies, overall survival (OS) is used as the primary endpoint to assess the efficacy of treatments. Superiority trials are used to test whether a new treatment is better than a standard or control treatment, while non-inferiority trials are used to test whether the new treatment is not unacceptably worse than control. Non-inferiority trials are especially important in circumstances where the new treatment may have other benefits (e.g., lower costs, fewer side effects, improved quality of life, or is easier to implement) compared to control, and people are only interested in showing the new treatment is not worse than control in terms of OS. When non-inferiority has been established, clinicians may be interested in further examining whether the new treatment can benefit patients more in terms of health utility.¹ Health utility is a construct, usually ranging from 0 to 1 (although theoretically can also have negative values), that quantifies the preference for a given health state experienced by a patient at a certain time point. A higher value means a healthier state, while death usually corresponds to 0. Using health utility scores at different time points during the treatment and post-treatment, statistical analysis may be performed to compare different treatment groups’ utility scores.^2–4 However, given that the study design is usually based on the primary endpoint of OS without considering health utility, whether there will be enough power for health utility analysis is uncertain. Also, conducting tests for OS and health utility separately may not be the most efficient, because it involves multiple testing adjustment and can lose statistical power. Hence, it may be beneficial to consider using a composite endpoint that combines survival and utility, which may lead to increased statistical power and smaller required sample sizes. 
Some commonly used approaches do not require multiplicity adjustment, but the literature has shown that they have their own issues. For example, in hierarchical testing, if survival is placed above utility in the hierarchy and there is no significant difference in survival, the procedure will skip the health utility assessment, even if utility has a very small p-value (e.g., $p < 1 \times 10^{- 10}$).⁵ This can be problematic if the new treatment significantly improves utility but not survival, as the utility benefit could be ignored. On the other hand, if survival and utility are used as co-primary endpoints, both must be statistically significant for the trial to be considered successful. This approach is also not ideal when the new treatment meaningfully improves utility but does not show a survival benefit, potentially leading to the rejection of a clinically valuable treatment, and it may face challenges such as reverse multiplicity.⁶

Creating a composite endpoint of survival and utility can aid in the clinical interpretation of non-inferiority trials where non-inferiority of survival is not the only acceptable outcome. For example, a new therapeutic intervention may be purported as offering improvements in quality of life or toxicity. However, clinicians may not be willing to sacrifice disease control to provide these other benefits. In this case, testing the new intervention in a phase 3 non-inferiority trial where OS is the primary outcome and quality of life or toxicity is a secondary outcome may establish the intervention as non-inferior from a survival perspective and then falsely identify the new intervention as a standard of care without appropriate consideration of quality of life and toxicity. On the other hand, one may consider a situation where a patient's preference for improved quality of life (or utility) may outweigh their desire to have non-inferior survival. In this instance, demonstration of superiority of utility may not be enough if it is associated with a significant loss of survival, and the two outcomes cannot be interpreted in isolation. A combination of both survival and utility endpoints may then be needed to declare a new intervention superior.

Some methods that can combine survival and utility have been proposed and used to analyze clinical trial data, and the most commonly used method is called quality-adjusted time without symptoms of disease or toxicity (Q-TWiST).^7–14 Though Q-TWiST has not been commonly seen as a primary endpoint for designing new studies, researchers have derived its statistical properties as well as formulas for sample size calculations.¹⁰ That being said, one major issue with Q-TWiST is that it divides each patient's status into three states (toxicity, time without symptoms and toxicity, and relapse) and uses pre-selected weights for the different states. In many scenarios, with utility scores measured as continuous variables at different time points throughout the trial, it may be much more desirable to analyze them on their original scale rather than forcing them into three categories, which is likely to result in loss of information and decreased statistical power.

Quality-adjusted life years (QALY), of which Q-TWiST can be considered a special case, is the most intuitive way to combine survival with utility when comparing different treatments.^7,15–19 It has also been used in the field of cost-utility analysis, where similar methods have been proposed and compared.^20–23 Quality-adjusted progression-free survival, a similar concept with a slightly different focus, has also been used to assess the benefits of different treatments in randomized trials.^24–26 However, such measures have rarely been considered as a primary endpoint for designing new trials, and we are not aware of any fully developed statistical framework or comprehensive simulation studies that demonstrate the advantages and feasibility of a quality-adjusted survival endpoint compared to the traditional survival endpoint.

With these limitations and considerations, we propose an innovative composite endpoint for combining longitudinal health utility and survival, called health utility adjusted survival (HUS), with a detailed statistical testing framework as well as procedures to perform power analysis and sample size calculations. By assigning weights to health utility and survival, HUS can be modified to suit different scenarios with increased power.

This new endpoint may help better interpret the findings of clinical trials. Non-inferiority trials are often plagued by uncertainty about the efficacy of a new intervention that is statistically deemed non-inferior based mainly on survival estimates but that has not been clearly shown to be more effective from a toxicity reduction or quality of life improvement perspective. In Table 1, we provide several scenarios of how the new composite endpoint of HUS may improve interpretation of clinical trial findings if this composite endpoint is used in place of standard primary endpoints. For example, one may consider three scenarios in which a new treatment is deemed non-inferior based on a primary outcome of survival in a typical non-inferiority design, where different utility scores may produce drastically different trial conclusions if a composite HUS endpoint were used. If a new intervention had lower utility than the comparator, a non-inferiority trial would declare the new intervention non-inferior, when in fact a HUS endpoint would appropriately declare the new intervention inferior. In addition, as we will show in the simulations, sufficient power for statistical inference may be achieved with smaller sample sizes than in non-inferiority trials based on a non-inferiority margin of survival. This feature may improve the efficiency of trial conduct and allow meaningful conclusions to be reached with smaller samples.

Table 1.
Interpretations for different scenarios of survival and utility.

Scenario	Non-inferiority interpretation	Health utility interpretation	Clinical interpretation and caveats
Survival non-inferior; improved utility	New treatment non-inferior	New treatment superior	With the composite endpoint, patients and clinicians can be confident that weighted health utility adjusted survival is superior.
Survival non-inferior; worse utility	New treatment non-inferior	New treatment not superior	With a non-inferiority design, the new treatment may be falsely accepted as a treatment option despite worse utility.
Survival non-inferior; similar utility	New treatment non-inferior	New treatment not superior	As above
Survival inferior; improved utility	New treatment inferior	New treatment may be superior, similar or worse depending on magnitude of effect	With a non-inferiority design, the new option is not accepted as non-inferior. However, if there is a large therapeutic benefit with the new intervention, a composite endpoint may demonstrate this new treatment to be superior.
Survival inferior; worse utility	New treatment inferior	New treatment inferior	A non-inferiority design may appropriately declare the new treatment inferior.
Survival inferior; similar utility	New treatment inferior	New treatment inferior	As above

This manuscript is structured as follows. In section 2, we present the methodology of the HUS endpoint, including its construction, sample size calculation, and power analysis. In section 3, we use comprehensive simulation studies with various settings, including scenarios incorporating parameter estimates based on the PET-NECK trial,¹ to demonstrate the power advantage of HUS when analyzing study data and its potential to reduce required sample sizes when designing new trials. Finally, we provide a discussion of the advantages, limitations and future directions for HUS in section 4.

2. Methods

2.1. Health utility adjusted survival

In this section, we describe the basic framework of HUS. In many clinical studies, OS is chosen as the primary endpoint, which determines the sample size, while health utility scores are usually analyzed in the secondary analyses. To construct a composite endpoint combining survival and health utility, we can take the product of the survival curve and the utility curve, as illustrated in Figure 1.

Figure 1.

Basic framework of health utility adjusted survival (HUS), using the product of survival and health utility.

Suppose the total length of the study follow-up time is T, and we are interested in comparing survival and health utility between treatment groups 1 and 2. We define a Q-statistic to represent the HUS of each treatment group as

\begin{aligned} Q_{1} & = \int_{0}^{T} S_{1} (t) {\bar{U}}_{1} (t) d t, \end{aligned}

(1)

\begin{aligned} Q_{2} & = \int_{0}^{T} S_{2} (t) {\bar{U}}_{2} (t) d t, \end{aligned}

(2)

where $S_{1} (t)$ and ${\bar{U}}_{1} (t)$ are the survival function (proportion of patients alive at $t$) and average utility score of those alive at $t$ for group 1, and $S_{2} (t)$ and ${\bar{U}}_{2} (t)$ are the survival function and average utility score of those alive at $t$ for group 2. We propose to use the Kaplan-Meier (KM) estimated survival functions ${\hat{S}}_{1} (t)$ and ${\hat{S}}_{2} (t)$ to substitute for $S_{1} (t)$ and $S_{2} (t)$.

We can also assign weights to the survival and utility separately by defining

\begin{aligned} Q_{1} & = \int_{0}^{T} {[S_{1} (t)]}^{λ_{1}} {[{\bar{U}}_{1} (t)]}^{λ_{2}} d t, \end{aligned}

(3)

\begin{aligned} Q_{2} & = \int_{0}^{T} {[S_{2} (t)]}^{λ_{1}} {[{\bar{U}}_{2} (t)]}^{λ_{2}} d t . \end{aligned}

(4)

If $λ_{1} = 0$ and $λ_{2} = 1$ , then $Q_{1}$ and $Q_{2}$ only consider the utility functions without including survival. If $λ_{1} = 1$ and $λ_{2} = 0$ , then $Q_{1}$ and $Q_{2}$ simply calculate the areas under the survival curves without adjusting for utility. For simplicity, we suggest fixing the weight $λ_{1}$ as 1, since survival is usually considered as important. $λ_{2}$ can be chosen from different values (e.g., 0.5, 1, 2), and $λ_{2} = 1$ leads to the standard definition of HUS. The higher $λ_{2}$ is, the more importance is assigned to health utility. For the rest of this manuscript, we focus on $λ_{1} = 1$ and $λ_{2} = 1$ unless otherwise specified. We will also show some results with various $λ_{2}$ in our simulation studies and discuss its effect.
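As an illustration of equations (3) and (4), the following is a minimal Python sketch of how a weighted Q-statistic could be computed for one group. It is not the authors' implementation: the function names `km_survival` and `hus_q`, the evaluation grid, and the trapezoidal integration are assumptions made for this example, and the mean utility curve is simply passed in as a function of time.

```python
import numpy as np

def km_survival(time, event, grid):
    """Kaplan-Meier estimate of S(t) at each point of an increasing grid."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event == 1])  # sorted distinct death times
    surv = np.ones(len(grid))
    s, j = 1.0, 0
    for k, t in enumerate(grid):
        # apply every KM factor whose death time is <= the current grid point
        while j < len(event_times) and event_times[j] <= t:
            at_risk = np.sum(time >= event_times[j])
            deaths = np.sum((time == event_times[j]) & (event == 1))
            s *= 1.0 - deaths / at_risk
            j += 1
        surv[k] = s
    return surv

def hus_q(time, event, mean_utility, T, lam1=1.0, lam2=1.0, n_grid=2001):
    """Approximate Q = int_0^T S(t)^lam1 * U(t)^lam2 dt on a regular grid.

    Assumes the mean utility is non-negative (a negative value raised to a
    fractional power is undefined).
    """
    grid = np.linspace(0.0, T, n_grid)
    integrand = km_survival(time, event, grid) ** lam1 * mean_utility(grid) ** lam2
    # trapezoidal rule written out explicitly
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(grid)) / 2.0)

# Toy data (hypothetical): four deaths at months 1 to 4, constant utility 0.5.
q = hus_q([1, 2, 3, 4], [1, 1, 1, 1], lambda t: 0.5 * np.ones_like(t), T=4.0)
print(q)
```

Setting `lam2=0` in this sketch returns the area under the KM curve alone, which matches the special case $λ_{1} = 1$, $λ_{2} = 0$ described above.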

2.2. Hypothesis testing

To examine the difference in HUS between the two treatment groups, we can define the test statistic as

\begin{aligned} T = Q_{1} - Q_{2} . \end{aligned}

(5)

To perform a one-sided test on whether group 1 has better HUS than group 2, we can either use the bootstrap method to obtain the confidence interval of $T$, or use the permutation method to obtain the distribution of $T$ under the null hypothesis.²⁷ We can reject or fail to reject the null hypothesis ($H_{0}$: $T \leq 0$) based on bootstrap confidence intervals. Suppose groups 1 and 2 have $n_{1}$ and $n_{2}$ subjects, respectively, and the chosen significance threshold is $α$. The bootstrap procedure can be described as follows:

(1) For iteration $b$ ($b = 1, \dots, B$), take a bootstrap dataset from the original samples, meaning that we randomly sample $n_{1}$ subjects with replacement from treatment group 1 to be group 1 in the new sample, and $n_{2}$ subjects with replacement from treatment group 2 to be group 2 in the new sample.

(2) Calculate the $T$ test statistic for the new sample, denoted by $T^{(b)}$.

(3) After obtaining the $T^{(b)}$'s ($b = 1, \dots, B$), calculate the $(1 - α)$ confidence interval based on these $B$ bootstrap samples. If the confidence interval does not contain 0, reject the null hypothesis. Note that the confidence interval should be constructed based on the test of interest (one-sided or two-sided).
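The bootstrap steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `bootstrap_hus_test` is hypothetical, the one-sided interval is taken to be a percentile interval (the text does not specify the interval type), and `q_stat` stands for any function that maps one group's data (first axis indexing subjects) to its Q-statistic. In the toy call, the sample mean is used as a stand-in for Q.

```python
import numpy as np

def bootstrap_hus_test(group1, group2, q_stat, B=500, alpha=0.05, seed=0):
    """One-sided bootstrap test of H0: T <= 0, where T = Q1 - Q2."""
    rng = np.random.default_rng(seed)
    group1, group2 = np.asarray(group1), np.asarray(group2)
    n1, n2 = len(group1), len(group2)
    t_boot = np.empty(B)
    for b in range(B):
        # step (1): resample subjects with replacement within each group
        s1 = group1[rng.integers(0, n1, size=n1)]
        s2 = group2[rng.integers(0, n2, size=n2)]
        # step (2): recompute the test statistic on the bootstrap dataset
        t_boot[b] = q_stat(s1) - q_stat(s2)
    # step (3): lower limit of the one-sided (1 - alpha) percentile interval
    lower = float(np.quantile(t_boot, alpha))
    return lower, lower > 0

# Toy illustration with the sample mean standing in for the Q-statistic.
g1 = np.linspace(1.0, 2.0, 50)
g2 = np.linspace(0.0, 1.0, 50)
print(bootstrap_hus_test(g1, g2, np.mean))
```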

The permutation procedure can be described as follows:

(1) For iteration $b$ ($b = 1, \dots, B$), permute the original samples to get a new permutation dataset, meaning that we randomly reassign all of the subjects into two groups with sample sizes $n_{1}$ and $n_{2}$.

(2) Calculate the $T$ test statistic for the new sample, denoted by $T^{(b)}$.

(3) After obtaining the $T^{(b)}$'s ($b = 1, \dots, B$), calculate the $(1 - α)$ confidence interval based on these $B$ permutation samples. If the observed test statistic $T$ is outside the confidence interval, reject the null hypothesis.
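The permutation steps above can be sketched in the same way. Again, this is only an illustration: the function name `permutation_hus_test` is hypothetical, the test is taken to be one-sided with the upper $(1 - α)$ quantile of the permutation distribution as the critical value, and `q_stat` is any function that maps one group's data to its Q-statistic (the sample mean is used as a stand-in in the toy call).

```python
import numpy as np

def permutation_hus_test(group1, group2, q_stat, B=500, alpha=0.05, seed=0):
    """One-sided permutation test comparing the observed T with its null distribution."""
    rng = np.random.default_rng(seed)
    group1, group2 = np.asarray(group1), np.asarray(group2)
    pooled = np.concatenate([group1, group2])
    n1 = len(group1)
    t_obs = q_stat(group1) - q_stat(group2)
    t_perm = np.empty(B)
    for b in range(B):
        # step (1): randomly reassign all subjects to groups of sizes n1 and n2
        idx = rng.permutation(len(pooled))
        # step (2): recompute the test statistic on the permuted dataset
        t_perm[b] = q_stat(pooled[idx[:n1]]) - q_stat(pooled[idx[n1:]])
    # step (3): reject if the observed T exceeds the (1 - alpha) null quantile
    upper = float(np.quantile(t_perm, 1 - alpha))
    return float(t_obs), upper, t_obs > upper

# Toy illustration with the sample mean standing in for the Q-statistic.
g1 = np.linspace(1.0, 2.0, 50)
g2 = np.linspace(0.0, 1.0, 50)
print(permutation_hus_test(g1, g2, np.mean))
```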

Note that the distribution generated by bootstrap is under the alternative hypothesis, whereas the distribution generated by permutation is under the null hypothesis, which is why the former is compared with 0, while the latter is compared with the observed test statistic. Based on our experience, both bootstrap and permutation methods can control type I errors, but bootstrap tends to have slightly higher power than permutation. Hence, we focus on the bootstrap method by default. Some simulation results comparing bootstrap and permutation can be found in the supplementary materials (Table S4, Figure S1). Besides, as hinted by Glasziou et al.,¹⁵ Jackknife resampling can also be used to obtain the distribution of $T$ under the alternative hypothesis.^28,29 However, our past experience shows that there is little difference in terms of type I error and power when comparing the bootstrap method with Jackknife, while the distribution of $T$ based on bootstrap samples tends to be closer to normal. As a result, we suggest using the bootstrap method as default. In terms of the number of resamples, $B = 500$ is usually sufficient for controlling type I errors and obtaining decent power. Examples showing the performance of Jackknife and evaluating the choice of B are also provided in the supplementary materials (Tables S4-S5, Figure S1).

2.3. Theoretical properties

If we assume the survival time follows a piecewise exponential distribution, we can derive a Monte Carlo approach to calculate the variance of the test statistic, which can be used for power analysis and sample size calculation.³⁰ A similar idea was used by Royston and Parmar³¹ to calculate the variance for restricted mean survival time.^32–34

We consider a simple case with three key time points: 0 (baseline), C (end of surgery), and T (end of study). Focusing on one treatment group, suppose the survival time is piecewise exponential, with piecewise constant hazards $h_{1}, h_{2}$ for time periods $0 \sim C$ , $C \sim T$ , respectively. The utility function is piecewise linear, which starts from $A_{1}$ at time 0, changes to $A_{2}$ at time C, and then goes to $A_{3}$ at time T. Let $X = min (ξ, T)$ , where $ξ$ is the survival time with cumulative hazard function $H (t)$ and survival function $S (t)$ . We can decompose X as $X_{1} + X_{2}$ , where

\begin{aligned} X_{1} & = \begin{cases} ξ & (0 \leq ξ \leq C) \\ C & (ξ > C) \end{cases}, \end{aligned}

(6)

\begin{aligned} X_{2} & = \begin{cases} 0 & (0 \leq ξ \leq C) \\ ξ - C & (C < ξ \leq T) \\ T - C & (ξ > T) \end{cases} . \end{aligned}

(7)

Denote $M = \int_{t = 0}^{T} S (t) U_{0} (t) d t$ where $U_{0} (t)$ is the base utility function for the currently considered treatment group. Write its statistic of HUS as $Q = \int_{t = 0}^{T} \hat{S} (t) \bar{U} (t) d t$ . If we define

\begin{aligned} X^{*} = A_{1} X_{1} + \frac{T A_{2} - C A_{3}}{T - C} X_{2} + \frac{A_{2} - A_{1}}{2 C} X_{1}^{2} + \frac{A_{3} - A_{2}}{2 (T - C)} X_{2}^{2} + \frac{A_{3} - A_{2}}{T - C} X_{1} X_{2}, \end{aligned}

(8)

we can derive that $M = E (X^{*})$. Following Royston & Parmar (2013), we can assume that for a specific scenario, we have

\begin{aligned} SE (Q) = ϕ \frac{SD (X^{*})}{\sqrt{n}}, \end{aligned}

(9)

where $ϕ$ is a factor no less than 1 and $n$ is the sample size for the group we are currently looking at. For convenience, we call $ϕ$ the variance balance factor, which takes account of the extra variance introduced into HUS by missing utility, censored survival, KM estimation, etc. $SD (X^{*})$ can be calculated using the parameters, while $ϕ$ can be estimated by Monte Carlo sampling. More details including the derivations are provided in the supplementary materials (Tables S2-S3). We will demonstrate in our simulations that $ϕ$ is robust to different sample sizes.
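The identity $M = E (X^{*})$ can be checked numerically. Expanding equation (8) shows that $X^{*}$ equals the integral of the piecewise linear base utility from 0 to $X = \min (ξ, T)$, so its expectation is the integral of $S (t) U_{0} (t)$. The Python sketch below compares a Monte Carlo estimate of $E (X^{*})$ with a numerical integral of $S (t) U_{0} (t)$ and also reports a Monte Carlo estimate of $SD (X^{*})$. The function names and the hazards $h_{1} = 0.01$, $h_{2} = 0.02$ are hypothetical choices for illustration; the utility values 0.8, 0.5 and 0.8 are those of group 1 in scenario 1 of the simulation study.

```python
import numpy as np

def x_star(xi, A1, A2, A3, C, T):
    """Equation (8), evaluated for survival times xi."""
    xi = np.asarray(xi, dtype=float)
    x1 = np.minimum(xi, C)          # equation (6)
    x2 = np.clip(xi, C, T) - C      # equation (7)
    return (A1 * x1
            + (T * A2 - C * A3) / (T - C) * x2
            + (A2 - A1) / (2 * C) * x1 ** 2
            + (A3 - A2) / (2 * (T - C)) * x2 ** 2
            + (A3 - A2) / (T - C) * x1 * x2)

def sample_piecewise_exp(n, h1, h2, C, rng):
    """Survival times with hazard h1 on [0, C] and h2 afterwards."""
    e = rng.exponential(1.0, size=n)  # unit exponential = cumulative hazard at death
    return np.where(e <= h1 * C, e / h1, C + (e - h1 * C) / h2)

def hus_mean_numeric(h1, h2, A1, A2, A3, C, T, n_grid=200001):
    """M = int_0^T S(t) U0(t) dt by the trapezoidal rule."""
    t = np.linspace(0.0, T, n_grid)
    cum_hazard = np.where(t <= C, h1 * t, h1 * C + h2 * (t - C))
    u0 = np.where(t <= C,
                  A1 + (A2 - A1) / C * t,
                  (T * A2 - C * A3) / (T - C) + (A3 - A2) / (T - C) * t)
    f = np.exp(-cum_hazard) * u0
    return float(np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2.0)

rng = np.random.default_rng(1)
params = dict(A1=0.8, A2=0.5, A3=0.8, C=3.0, T=36.0)
xi = sample_piecewise_exp(200_000, h1=0.01, h2=0.02, C=3.0, rng=rng)
xs = x_star(xi, **params)
m_numeric = hus_mean_numeric(0.01, 0.02, **params)
print(xs.mean(), m_numeric, xs.std(ddof=1))
```

The first two printed values should agree up to Monte Carlo error; the third is the quantity that enters equation (9).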

Note that when two treatment groups are compared, they should have their own variance balance factors, which we denote as $ϕ_{1}$ and $ϕ_{2}$ . Applying our assumed property to each of the groups, we have

\begin{aligned} SE (Q_{1}) & = ϕ_{1} \frac{SD (X^{*}_{1})}{\sqrt{n_{1}}}, \end{aligned}

(10)

\begin{aligned} SE (Q_{2}) & = ϕ_{2} \frac{SD (X^{*}_{2})}{\sqrt{n_{2}}}, \end{aligned}

(11)

where $Q_{1}$ and $Q_{2}$ are the statistics of HUS for treatment groups 1 and 2, and $n_{1}, n_{2}$ are the sample sizes of the two groups. $X^{*}_{1}$ and $X^{*}_{2}$ are constructed separately for the two groups using their own parameter settings. Hence, the variance of $T = Q_{1} - Q_{2}$ is

\begin{aligned} var (T) = {[SE (Q_{1})]}^{2} + {[SE (Q_{2})]}^{2} . \end{aligned}

(12)

For the one-sided test, we can reject the null hypothesis if $T - z_{1 - α} \sqrt{var (T)} > 0$ .

2.4. Power analysis and sample size calculation

In any scenario with prespecified parameters, we can use simulations to calculate the power of HUS for a range of sample sizes. The resulting table of power against sample size can then be used to determine the sample size needed to achieve a specific power (e.g., 80%) for a new trial. Detailed examples are provided in section 3.1.

If we assume that the special case described in section 2.3 is true, then we only need to run one simulation given a fixed sample size (e.g., 200 subjects per treatment group), which can give us estimates of $ϕ_{1}$ and $ϕ_{2}$ . For the one-sided test where we reject the null hypothesis if $T - z_{1 - α} \sqrt{var (T)} > 0$ , the power is

\begin{aligned} ω = P (T - z_{1 - α} \sqrt{var (T)} > 0) = P (T > z_{1 - α} \sqrt{var (T)}) . \end{aligned}

(13)

Assuming $T$ follows $N (T_{true}, var (T))$, we have

\begin{aligned} ω = Φ \left( \frac{T_{true}}{\sqrt{var (T)}} - z_{1 - α} \right), \end{aligned}

(14)

where $Φ$ is the cumulative distribution function of the standard normal distribution. On the other hand, to achieve power $ω$, the required sample sizes should satisfy

\begin{aligned} ϕ_{1}^{2} \frac{{[SD (X^{*}_{1})]}^{2}}{n_{1}} + ϕ_{2}^{2} \frac{{[SD (X^{*}_{2})]}^{2}}{n_{2}} = {\left( \frac{T_{true}}{Φ^{- 1} (ω) + z_{1 - α}} \right)}^{2} . \end{aligned}

(15)

If we assume $n_{1} = n_{2}$, then the required sample size per arm is

\begin{aligned} n_{1} = \frac{{(Φ^{- 1} (ω) + z_{1 - α})}^{2} [ϕ_{1}^{2} var (X^{*}_{1}) + ϕ_{2}^{2} var (X^{*}_{2})]}{T_{true}^{2}} . \end{aligned}

(16)

Note that it is difficult to calculate $T_{true}$ based on the setting of parameters. However, we can estimate it by using the average of the observed $T$ from our simulated samples. To summarize, in the special situation with simplified settings described in section 2.3, we can use the following procedure to calculate the power yielded by a specific sample size:

(1) Calculate $var (X^{*}_{1})$, $var (X^{*}_{2})$ based on parameter settings.

(2) Simulate samples to estimate $ϕ_{1}$, $ϕ_{2}$ and $T_{true}$.

(3) For each new sample size combination $n_{1}$, $n_{2}$, calculate $SE (Q_{1})$, $SE (Q_{2})$ using the estimated $ϕ_{1}$, $ϕ_{2}$.

(4) Calculate $var (T)$ and power $ω$.

Conversely, we can use the following procedure to calculate the sample size required to achieve a specific power:

(1) Calculate $var (X^{*}_{1})$, $var (X^{*}_{2})$ based on parameter settings.

(2) Simulate samples to estimate $ϕ_{1}$, $ϕ_{2}$ and $T_{true}$.

(3) Calculate $n_{1}$ using the sample size formula.
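Once $ϕ_{1}$, $ϕ_{2}$, $T_{true}$ and the two standard deviations are available, the two procedures above reduce to evaluating equations (14) and (16). The following Python sketch uses only the standard library. The function names are hypothetical, and the value 7.0 used for both standard deviations is a made-up placeholder (the manuscript does not report them), so the printed numbers are not expected to reproduce any result in the paper; $ϕ_{1} = 1.07$, $ϕ_{2} = 1.12$ and $T_{true} = 3.11$ are the scenario 1 estimates reported in section 3.1.

```python
from math import ceil, sqrt
from statistics import NormalDist

def hus_power(t_true, sd1, sd2, phi1, phi2, n1, n2, alpha=0.05):
    """Equation (14): power of the one-sided test for given sample sizes."""
    # equations (10)-(12): var(T) = SE(Q1)^2 + SE(Q2)^2
    var_t = (phi1 * sd1) ** 2 / n1 + (phi2 * sd2) ** 2 / n2
    z = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(t_true / sqrt(var_t) - z)

def hus_sample_size(t_true, sd1, sd2, phi1, phi2, power=0.8, alpha=0.05):
    """Equation (16): per-arm sample size for equal allocation, rounded up."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(power) + nd.inv_cdf(1 - alpha)
    n = z_sum ** 2 * ((phi1 * sd1) ** 2 + (phi2 * sd2) ** 2) / t_true ** 2
    return ceil(n)

# sd1 = sd2 = 7.0 are hypothetical placeholders, not values from the paper.
n_per_arm = hus_sample_size(t_true=3.11, sd1=7.0, sd2=7.0, phi1=1.07, phi2=1.12)
print(n_per_arm, hus_power(3.11, 7.0, 7.0, 1.07, 1.12, n_per_arm, n_per_arm))
```

Rounding up in `hus_sample_size` guarantees that the power returned by `hus_power` at the computed sample size is at least the requested value.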

2.5. Handling missing utility scores

In clinical studies, utility scores may not be available at each time point for all subjects, while the current framework of HUS requires complete utility profiles to calculate the test statistic. The most intuitive solution is to impute the missing utility scores. In the first method, we use linear functions to fill in each subject's utility scores using the available data. If a subject's utility score is only available at one time point, then we use that score as the imputed utility at all other time points. This approach may seem simple, but it can be quite effective. The second method is to impute the group average at each key time point (i.e., each time point at which at least one subject has their utility score recorded), and then use linear functions to fill in the other missing scores. This approach can be regarded as a combination of the cross-mean and linear interpolation methods.³⁵ While imputing the group average, we can also add some variation using a normal distribution with mean zero and standard deviation equal to the standard deviation of the recorded scores at that time point. In this way, the imputed values may be closer to the true values, which may lead to an increase in statistical power. It is also worth noting that many other methods are available for imputing longitudinal data, and a very recent study has compared the effects of different imputation methods and shown that most of them perform similarly in various scenarios, whereas trajectory mean single imputation has the best overall performance.³⁵ Hence, we consider trajectory mean imputation as a third method. A comparison of the three methods using simulation results is provided in the supplementary materials (Table S1), which shows that method 1 has much worse performance when the missing rate is higher, while methods 2 and 3 are not affected as much. For convenience, we use method 2 by default.
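The first two imputation methods can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, missing scores are coded as `NaN`, values outside the range of observed time points are held constant (an assumption, since the text does not describe extrapolation), the optional random variation around the group average is omitted, and survival status is ignored.

```python
import numpy as np

def impute_linear(times, scores, grid):
    """Method 1: per-subject linear interpolation of the observed scores."""
    times = np.asarray(times, dtype=float)
    scores = np.asarray(scores, dtype=float)
    observed = ~np.isnan(scores)
    if observed.sum() == 0:
        raise ValueError("subject has no recorded utility score")
    if observed.sum() == 1:
        # a single recorded score is used at all other time points
        return np.full(len(grid), scores[observed][0])
    # np.interp holds the end values constant outside the observed range
    return np.interp(grid, times[observed], scores[observed])

def impute_group_mean(times, score_matrix, grid):
    """Method 2: group average at each key time point, then linear fill-in.

    Rows are subjects, columns are key time points; every column is assumed
    to contain at least one recorded score.
    """
    scores = np.array(score_matrix, dtype=float)
    col_means = np.nanmean(scores, axis=0)
    filled = np.where(np.isnan(scores), col_means, scores)
    return np.vstack([np.interp(grid, times, row) for row in filled])

times = [0.0, 3.0, 36.0]
print(impute_linear(times, [0.8, np.nan, 0.7], [0.0, 18.0, 36.0]))
print(impute_group_mean(times, [[0.8, np.nan, 0.7], [0.6, 0.4, np.nan]], times))
```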

3. Results

3.1. Simulations with simplified settings

3.1.1 Power comparison

We conduct simulations in various scenarios to assess the performance of HUS. Suppose we are designing a randomized clinical trial with two treatment arms. The total length of study is 36 months ( $T = 36$ ), and each patient receives surgery at 3 months ( $C = 3$ ). The two arms are assigned to different treatment strategies to help them recover, and we are interested in comparing the two treatments in terms of both survival and health utility. Denote the true survival time, observed survival time, and survival status for patient i from group g as $T_{g i}$ , $X_{g i}$ and $δ_{g i}$ , respectively. Groups 1 and 2 have sample sizes $n_{1}$ and $n_{2}$ . The survival data are simulated using

\begin{aligned} T_{g i} & \sim \mathrm{Exp} (h_{g}), \\ ξ_{g i} & \sim \mathrm{Unif} (0, ζ), \\ X_{g i} & = \min (T_{g i}, ξ_{g i}, T), \\ δ_{g i} & = \begin{cases} 1 & (T_{g i} < ξ_{g i} \text{ and } T_{g i} < T) \\ 0 & (\text{otherwise}) \end{cases}, \end{aligned}

where $ζ$ is chosen to control the censoring rate, denoted by $p_{censoring}$. The hazard ratio of treatment 1 against treatment 2 is $h_{1} / h_{2}$.

To simulate the health utility score, we first define base utility functions for the two groups. The base utility at time t for group g can be written as

\begin{aligned} U_{g 0} (t) = \begin{cases} A_{g 1} + \frac{A_{g 2} - A_{g 1}}{C} t & (0 \leq t \leq C) \\ \frac{T A_{g 2} - C A_{g 3}}{T - C} + \frac{A_{g 3} - A_{g 2}}{T - C} t & (C < t \leq T) \end{cases} . \end{aligned}

This definition means the average utility for group g starts from $A_{g 1}$ at baseline, changes to $A_{g 2}$ at 3 months, and then changes to $A_{g 3}$ at the end of the study. The change is piecewise linear. Our motivation for this setting is that usually, a cancer patient's health utility reaches the lowest at the end of treatment and gradually recovers after that. For patient i from group g, the health utility score at time t, denoted by $U_{g i} (t)$ , follows a normal distribution with mean $U_{g 0} (t)$ and standard deviation 0.1.

In practice, we do not expect health utility scores to be collected at each time point. Furthermore, some of the scores scheduled to be collected may be missing. For our main simulation study, we assume that the health utility scores are only collected at $t = 1, C$ and T. When $t = 1$ , all subjects have their utility scores collected. When $t = C$ or T, the subjects that are still being followed have their utility scores collected, while there is a $p_{missingU}$ chance that the score is missing.
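The data-generating mechanism described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' simulation code: the function names, the hazard `0.02` and the censoring bound `zeta=80.0` are hypothetical, and the collection times default to $t = 1$, $C$ and $T$ as stated above. The utility parameters are those of group 1 in scenario 1.

```python
import numpy as np

def base_utility(t, A1, A2, A3, C=3.0, T=36.0):
    """Piecewise linear base utility U_g0(t)."""
    t = np.asarray(t, dtype=float)
    early = A1 + (A2 - A1) / C * t
    late = (T * A2 - C * A3) / (T - C) + (A3 - A2) / (T - C) * t
    return np.where(t <= C, early, late)

def simulate_group(n, hazard, zeta, A, p_missing, C=3.0, T=36.0,
                   visits=(1.0, 3.0, 36.0), sd=0.1, seed=0):
    """Simulate observed times, status and utility scores for one group."""
    rng = np.random.default_rng(seed)
    true_time = rng.exponential(1.0 / hazard, size=n)   # T_gi ~ Exp(h_g)
    censor = rng.uniform(0.0, zeta, size=n)             # xi_gi ~ Unif(0, zeta)
    obs_time = np.minimum(np.minimum(true_time, censor), T)
    status = ((true_time < censor) & (true_time < T)).astype(int)

    visits = np.asarray(visits, dtype=float)
    mean_u = base_utility(visits, *A, C=C, T=T)
    utility = rng.normal(mean_u, sd, size=(n, len(visits)))

    # later visits: no score if follow-up ended earlier, or missing at random
    not_followed = obs_time[:, None] < visits[None, :]
    missing = not_followed | (rng.random((n, len(visits))) < p_missing)
    missing[:, 0] = False            # first visit is recorded for everyone
    utility[missing] = np.nan
    return obs_time, status, utility

obs_time, status, utility = simulate_group(
    n=200, hazard=0.02, zeta=80.0, A=(0.8, 0.5, 0.8), p_missing=0.3)
print(status.mean(), np.isnan(utility).mean(axis=0))
```

In practice `zeta` would be tuned until the observed censoring proportion matches the target $p_{censoring}$.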

In this section, we focus on the situation where the two treatment groups do not have a difference in OS, which is the situation that motivated our HUS framework. Other situations (e.g., the two treatment groups differ in both OS and health utility) are explored in section 3.2 and the supplementary materials (Tables S6-S9). Table 2 shows a summary of our major scenarios. In each scenario, we compare the theoretical rejection rate using our results from section 2.4 and the empirical rejection rates of HUS using bootstrap with $B = 500$ . We consider three choices of $λ_{2}$ : $λ_{2} = 1$ corresponds to the standard HUS approach; $λ_{2} = 0.5$ means giving utility less weight than survival; $λ_{2} = 2$ means giving utility more weight than survival. We also examine the performance of OS-based tests. In the tables, “sup” represents the log-rank test that tests whether group 1 is superior to group 2 in terms of OS using KM estimates. “5%” and “10%” correspond to the non-inferiority test using the hazard ratio with margins of 5% and 10%, respectively. For instance, a 5% margin means we establish non-inferiority (treatment 1 is non-inferior to treatment 2 in terms of OS) if the upper bound of the 95% CI of the hazard ratio is smaller than 1.05.

Table 2.
Simulation settings for the different scenarios.

				Average utility
Scenario	$p_{censoring}$	$p_{missingU}$	Group	Baseline	3 months	36 months
0	30%	30%	1	0.8	0.4	0.7
			2	0.8	0.4	0.7
1	30%	30%	1	0.8	0.5	0.8
			2	0.8	0.35	0.7
2	60%	60%	1	0.8	0.5	0.8
			2	0.8	0.4	0.7

In scenario 0, we examine the rejection rates of different methods when the two treatment groups have the same OS and health utility. As shown in Table 3, all of the superiority tests are able to control type I errors at 0.05. The rejection rates of the non-inferiority tests are power instead of type I errors, since the alternative is true (treatment 1 is not inferior to treatment 2). This is why they may be higher than 0.05.

Table 3.

Rejection rates of different methods in scenario 0 based on 1000 replications.

$n_{1}, n_{2}$	HUS theoretical	HUS bootstrap ($λ_{2} = 1$)	HUS bootstrap ($λ_{2} = 0.5$)	HUS bootstrap ($λ_{2} = 2$)	OS superiority test	OS non-inferiority test with margin 5%	OS non-inferiority test with margin 10%
50	0.052	0.052	0.050	0.053	0.056	0.040	0.050
100	0.052	0.054	0.053	0.051	0.068	0.051	0.073
150	0.053	0.047	0.052	0.048	0.054	0.042	0.079
200	0.053	0.052	0.046	0.048	0.053	0.050	0.092
500	0.054	0.049	0.051	0.050	0.045	0.083	0.184

HUS: health utility adjusted survival; OS: overall survival.
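When reading empirical rejection rates such as those in Table 3, it can help to keep the Monte Carlo error in mind. The sketch below uses the usual binomial standard error for a proportion estimated from independent replications; it is a generic check under that assumption, not part of the HUS methodology, and the function name is hypothetical.

```python
import math


def mc_interval(p, n_rep, z=1.959964):
    """Approximate 95% Monte Carlo interval for a rejection rate.

    p: nominal (or true) rejection probability.
    n_rep: number of independent simulation replications.
    """
    se = math.sqrt(p * (1.0 - p) / n_rep)
    return p - z * se, p + z * se


# Nominal type I error 0.05 estimated from 1000 replications.
lo, hi = mc_interval(0.05, 1000)
```

Under these assumptions, empirical rates between roughly 0.036 and 0.064 are compatible with a nominal level of 0.05 when 1000 replications are used.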

In scenario 1, we compare the power of different methods when treatment group 1 has better health utility than treatment group 2. For the theoretical power analysis, we first run one simulation with $n_{1} = n_{2} = 200$ and 4000 replications to obtain the estimates $ϕ_{1} = 1.07, ϕ_{2} = 1.12, T_{true} = 3.11$ . We can then calculate the power for different sample sizes. For the other methods, such as bootstrap with $λ_{2} = 1$ , $λ_{2} = 0.5$ , and $λ_{2} = 2$ , we need to simulate new datasets (200 replications) for each sample size to obtain the empirical power. Note that $ϕ_{1}$ , $ϕ_{2}$ and $T_{true}$ are quite robust to the sample size. For example, if we use $n_{1} = n_{2} = 500$ , the obtained estimates are $ϕ_{1} = 1.06, ϕ_{2} = 1.11, T_{true} = 3.11$ , which are very close to those obtained with $n_{1} = n_{2} = 200$ . More results regarding the variance balance factors are provided in the supplementary materials (Tables S2-S3).

As shown in Table 4, the bootstrap method with $λ_{2} = 1$ performs close to the theoretical results, which makes sense since the theoretical results are based on the standard HUS with $λ_{2} = 1$ . A larger $λ_{2}$ tends to lead to higher power by giving more weight to utility than to survival. This is also expected because the two groups differ only in terms of utility. Meanwhile, the superiority and non-inferiority tests based on OS have little power since there is no real difference in the two groups' OS. We also calculate the power corresponding to different sample sizes using our theoretical results and plot the power curves in Figure 2. For scenario 1, to achieve 80% power, using HUS as the endpoint requires only 85 subjects per arm.
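The shape of such a power curve can be sketched with a generic normal approximation. This is not the formula from section 2.4 (which involves $ϕ_{1}$ , $ϕ_{2}$ and $T_{true}$ ); it assumes a one-sided test at level 0.05 whose statistic is approximately normal with a mean that grows with the square root of the per-arm sample size. The standardized effect is calibrated from a single reported value (theoretical power 0.86 at 100 subjects per arm in scenario 1), and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def calibrate_effect(power, n_per_arm, alpha=0.05):
    """Standardized effect per sqrt(subject) implied by one (n, power) pair."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return (STD_NORMAL.inv_cdf(power) + z_alpha) / math.sqrt(n_per_arm)


def approx_power(n_per_arm, effect, alpha=0.05):
    """Approximate one-sided power at a given per-arm sample size."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(effect * math.sqrt(n_per_arm) - z_alpha)


def approx_n(power, effect, alpha=0.05):
    """Smallest per-arm sample size reaching the target power (approximate)."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return math.ceil(((z_alpha + STD_NORMAL.inv_cdf(power)) / effect) ** 2)


# Calibrate on one tabulated value, then trace the rest of the curve.
effect = calibrate_effect(power=0.86, n_per_arm=100)
curve = {n: approx_power(n, effect) for n in (50, 100, 150, 200)}
n_for_80 = approx_n(0.80, effect)
```

Because the calibration uses a power value rounded to two decimals, the resulting curve and sample size should only be expected to land near the tabulated theoretical values, not to match them exactly.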

Figure 2.

Power curves based on theoretical calculations. Left: scenario 1; right: scenario 2.

Table 4.

Power comparison of different methods in scenarios 1 and 2 based on 200 replications.

Scenario 1
	HUS				OS
		Bootstrap			Superiority	Non-inferiority	Non-inferiority
$n_{1}, n_{2}$	Theoretical	$(λ_{2} = 1)$	$λ_{2} = 0.5$	$λ_{2} = 2$	test	test with margin 5%	test with margin 10%
50	0.61	0.56	0.28	0.9	0.05	0.04	0.05
100	0.86	0.85	0.44	1	0.05	0.05	0.07
150	0.95	0.95	0.59	1	0.05	0.06	0.1
200	0.99	1	0.71	1	0.06	0.04	0.06
Scenario 2
	HUS				OS
		Bootstrap			Superiority	Non-inferiority	Non-inferiority
$n_{1}, n_{2}$	Theoretical	$(λ_{2} = 1)$	$λ_{2} = 0.5$	$λ_{2} = 2$	test	test with margin 5%	test with margin 10%
50	0.42	0.42	0.2	0.76	0.05	0.04	0.06
100	0.65	0.67	0.31	0.94	0.06	0.06	0.08
150	0.80	0.82	0.42	0.97	0.06	0.05	0.1
200	0.89	0.92	0.48	1	0.05	0.05	0.06

HUS: health utility adjusted survival; OS: overall survival.

In scenario 2, we increase the censoring rate and the missing rate to 60%, and reduce the difference between the two groups' health utility scores. As shown in Table 4 and Figure 2, the results are very similar to those in scenario 1. Again, HUS is able to obtain decent power with relatively small sample sizes, while the superiority and non-inferiority tests based on OS struggle to find enough evidence of treatment 1's benefit over treatment 2. If we design a trial based on HUS with the assumptions in scenario 2, we need only 151 patients in each treatment group.

We would like to point out that even though choosing a larger $λ_{2}$ may seem to have higher power in the above scenarios, it may not always be a good choice, especially when there is a difference in OS. We recommend using $λ_{2} = 1$ as default, though it can be modified depending on the knowledge of the two treatments (e.g., whether treatment 1 is likely to have better OS than treatment 2).

3.1.2 Sample size calculation

In this subsection, we use our developed sample size calculation formulas to calculate sample sizes needed for the composite endpoint, and the standard formulas to calculate sample sizes needed for the basic survival endpoint (implemented in PASS 2023, v23.0.2 with the one-sided log-rank test), to further demonstrate the advantage of HUS. Following scenario 1 from 3.1.1, where treatment 1 has better utility than treatment 2, we consider four different cases. In the first case, there is no survival difference, which is consistent with the focus of this manuscript, and the endpoint OS does not have power. In the second case, we assume that treatment 1 has better survival than treatment 2, while in the third case, we assume that treatment 2 has better survival. In the last case, we assume treatment 1 has better survival, but there is no difference in utility, and the utility function is the same as that in scenario 0. As shown in Table 5, with scenario 1's utility functions, when $h_{1}$ is smaller than $h_{2}$ , meaning that treatment 1 has better OS than treatment 2, the required sample size for HUS is decreased, which makes sense because the difference in HUS is larger. When $h_{1}$ is larger than $h_{2}$ , meaning that treatment 1 has worse OS than treatment 2, the required sample size for HUS is increased. Nevertheless, the numbers are still much smaller than those calculated for OS. If there is no utility difference, HUS will require more subjects than OS, which is expected, though the difference is not as big. These results show again that using the composite endpoint may help greatly reduce the required sample size to detect a significant difference when two treatments differ in utility.

Table 5.
Sample size calculation under a significance level of 0.05. When there is a utility difference, utility functions from scenario 1 are used. When there is no utility difference, utility functions from scenario 0 are used.

Utility: 1 > 2; survival: 1 = 2 ( $h_{1} / h_{2} = 1$ )
	Sample size requirement for each arm
Targeted power	HUS	OS
70%	65	/
80%	85	/
90%	118	/
Utility: 1 > 2; survival: 1 > 2 ( $h_{1} / h_{2} = 0.9$ )
	Sample size requirement for each arm
Targeted power	HUS	OS
70%	47	1710
80%	62	2247
90%	85	3112
Utility: 1 > 2; survival: 1 < 2 ( $h_{1} / h_{2} = 1.1$ )
	Sample size requirement for each arm
Targeted power	HUS	OS
70%	78	2243
80%	103	2946
90%	143	4080
Utility: 1 = 2; survival: 1 > 2 ( $h_{1} / h_{2} = 0.7$ )
	Sample size requirement for each arm
Targeted power	HUS	OS
70%	184	138
80%	242	180
90%	335	249

HUS: health utility adjusted survival; OS: overall survival.
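To see why the OS endpoint requires so many subjects when the hazard ratio is close to 1, the following sketch uses Schoenfeld's approximation for the total number of events needed by a one-sided log-rank test with equal allocation. This is a swapped-in textbook formula, not the PASS procedure used for Table 5: it returns events rather than subjects and ignores censoring and follow-up, so it is not expected to reproduce the tabulated OS values. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate total events for a one-sided log-rank test, 1:1 allocation.

    Returns math.inf when the hazard ratio is exactly 1, since no finite
    number of events gives power against a null effect.
    """
    if hazard_ratio == 1.0:
        return math.inf
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1.0 - alpha) + std_normal.inv_cdf(power)
    return math.ceil(4.0 * z_sum ** 2 / math.log(hazard_ratio) ** 2)


events_hr_09 = schoenfeld_events(0.9)
events_hr_07 = schoenfeld_events(0.7)
events_hr_10 = schoenfeld_events(1.0)
```

The required number of events scales with the inverse square of the log hazard ratio, so moving the ratio from 0.7 to 0.9 raises the requirement by roughly an order of magnitude; the infinite value at a ratio of 1 corresponds to the "/" entries in the first block of Table 5.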

3.3. Simulations with real data estimates

To demonstrate the benefit of HUS in a more practical scenario, we conduct additional simulations with average utility scores and a hazard ratio mimicking the summary data from a real randomized trial, PET-NECK.¹ PET-NECK is a randomized phase III non-inferiority trial that compares a positron emission tomography-computerized tomography (PET-CT)-guided watch-and-wait policy with planned neck dissection (planned ND) for head and neck cancer patients. The two-year OS rates of the two treatment groups (PET-CT and planned ND), with 282 subjects per arm, are 84.9% and 81.5%, respectively, which leads to a hazard ratio of 0.80. We conduct simulations using this parameter setting to emulate the survival times in PET-NECK. Figure 3 shows the average utility scores at different time points in the study, with the maximum time being 24 months. Hence, in this scenario, we set $T = 24$ and define the base utility functions following the observed average utility scores. We also record the utilities at baseline and at months 1, 3, 6, 12, and 24 with a 30% missing rate. Note that this scenario does not fall into the framework of section 2.3, and thus we cannot apply our theoretical results directly to calculate the power and sample sizes. However, obtaining the empirical results is similar to what we describe in section 3.1.
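The step from the two-year OS rates to a hazard ratio of 0.80 can be checked with a short calculation. The sketch assumes proportional hazards, so that the survival functions satisfy $S_{1}(t) = S_{2}(t)^{HR}$ , and additionally assumes constant (exponential) hazards to obtain monthly rates that could drive a simulation; the function names are hypothetical, and the exponential assumption is ours rather than something stated in the text.

```python
import math


def hr_from_survival(surv_group1, surv_group2):
    """Hazard ratio (group 1 vs group 2) implied by survival at a common time,
    assuming proportional hazards."""
    return math.log(surv_group1) / math.log(surv_group2)


def exponential_hazard(surv, time):
    """Constant hazard rate that yields the given survival at the given time."""
    return -math.log(surv) / time


# Two-year OS rates reported for PET-CT and planned ND.
hr = hr_from_survival(0.849, 0.815)
h_pet_ct = exponential_hazard(0.849, 24.0)      # per month
h_planned_nd = exponential_hazard(0.815, 24.0)  # per month
print(round(hr, 2))  # 0.8
```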

Figure 3.

Average utility scores based on real summary data.

As shown in Table 6, with the two groups differing in both OS and health utility, the superiority test based on HUS still has much higher power than the superiority and non-inferiority tests based on OS. We would need only 200 subjects per arm to achieve 80% power to show that PET-CT has better HUS than planned ND, which is fewer than the 282 subjects per arm in the original study, whose design was based on the OS comparison. In terms of weighting, $λ_{2} = 2$ again leads to higher power, while $λ_{2} = 0.5$ has lower power compared to the standard HUS. Nevertheless, in certain scenarios, especially if the difference in health utility is small, using a larger $λ_{2}$ may not be as beneficial. More results, including scenarios where there is no difference in health utility, are available in the supplementary materials (Tables S6-S9, Figures S2-S4).

Table 6.

Power comparison for simulations using real data estimates.

	HUS			OS
	Bootstrap			Superiority	Non-inferiority	Non-inferiority
$n_{1}, n_{2}$	$(λ_{2} = 1)$	$λ_{2} = 0.5$	$λ_{2} = 2$	test	test with margin 5%	test with margin 10%
50	0.34	0.26	0.47	0.09	0.1	0.12
100	0.54	0.4	0.69	0.14	0.17	0.22
150	0.74	0.54	0.84	0.19	0.24	0.34
200	0.82	0.61	0.97	0.2	0.25	0.33
282	0.94	0.76	0.99	0.27	0.36	0.44

HUS: health utility adjusted survival; OS: overall survival.

4. Discussion

We have presented a methodological framework to compare two treatment groups using HUS as a composite endpoint combining survival and health utility. As demonstrated by our comprehensive simulation studies, when there is a difference in health utility, HUS has a significant power advantage over the statistical tests based on OS endpoint, meaning that using HUS as an endpoint for new trials may require much smaller sample sizes to achieve decent power. We have also demonstrated two different procedures (theoretical and empirical approaches) to conduct power analysis and sample size calculation with specified parameters. When the model assumptions are met, the two procedures yield similar results.

There are several different options when applying HUS. We recommend using bootstrap given its popularity as well as its convenience of constructing confidence intervals for the test statistic, though permutation may be theoretically more appropriate for testing the null hypothesis, since it can obtain the null distribution of the test statistic.
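The contrast between the two resampling options can be sketched on a stand-in statistic. The code below uses the difference in group means of a per-subject score in place of the HUS test statistic, whose definition belongs to section 2; the data, function names, and the percentile-interval construction are illustrative assumptions, not the authors' implementation.

```python
import random
from statistics import mean


def mean_diff(group_a, group_b):
    """Stand-in test statistic: difference in group means."""
    return mean(group_a) - mean(group_b)


def bootstrap_ci(group_a, group_b, n_boot=500, level=0.95, seed=1):
    """Percentile bootstrap CI: resample each group with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        mean_diff(rng.choices(group_a, k=len(group_a)),
                  rng.choices(group_b, k=len(group_b)))
        for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


def permutation_pvalue(group_a, group_b, n_perm=500, seed=1):
    """One-sided permutation p-value: shuffle group labels to mimic the null."""
    rng = random.Random(seed)
    observed = mean_diff(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean_diff(pooled[:len(group_a)], pooled[len(group_a):]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


# Hypothetical, clearly separated per-subject scores.
scores_1 = [5.0 + i / 10 for i in range(20)]
scores_2 = [i / 10 for i in range(20)]
ci_low, ci_high = bootstrap_ci(scores_1, scores_2)
p_perm = permutation_pvalue(scores_1, scores_2)
```

The bootstrap resamples within each group and therefore describes the sampling variability of the statistic around its observed value, which is what makes confidence intervals convenient. The permutation scheme pools the groups and so generates the statistic's distribution under the null hypothesis of no group difference.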

In practice, the choice of weights for HUS is important. The main purpose of this paper is to provide the methodological framework; by default, we currently use equal weights, which give equal balance to survival and health utility. We have a few recommendations for future clinical study design. First, clinicians’ input should be considered. For example, when evaluating a new treatment, physicians may be more interested in the survival difference or in the utility difference. Second, if preliminary clinical data are available from pilot studies, researchers can estimate the weights from such data. Third, for studies in which there is not enough prior information about the new treatment effects, or clinicians have no clear preference, we suggest designing the study with the default equal weights. In the data analysis stage, if non-equal weights are chosen, we recommend conducting sensitivity analyses with several weighting schemes (up-weighting or down-weighting survival) to assess the robustness of the model performance. Another possible way to combine different weighting options without having to choose one is to apply an idea similar to the aSPU test.^36,37

Note that the theoretical properties we have presented are based on assumptions by analogy with the assumptions used by Royston and Parmar,³¹ though we have shown the validity of our theoretical results in our simulations. Similar to the approximate distribution of the difference between weighted mean survivals derived by Shen and Fleming,³⁸ we are exploring the asymptotic property of the HUS test statistics, especially for the complicated weighted HUS models, which may improve computational efficiency. We would also like to point out that the linear imputation method we use to fill in the utility scores may be problematic in some cases, especially if the scores are only recorded at a few time points and the missing rate is high. We may consider other imputation methods or modifying the definition of HUS so that it does not require complete utility score profiles as input.^39,40 Besides that, given the possible drawbacks brought by KM estimates, sometimes it may be beneficial to apply other models, including the flexible parametric model for survival analysis.⁴¹
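The concern about linear imputation can be made concrete with a small sketch. It interpolates a missing utility score linearly between the nearest observed time points and carries the nearest observed value beyond the ends; the end-point rule and the function name are assumptions for illustration, and the exact imputation procedure is the one defined earlier in the paper. The example uses the scenario 0 average utilities (0.8 at baseline, 0.4 at 3 months, 0.7 at 36 months).

```python
def interpolate_utility(times, scores, t):
    """Linearly interpolate a utility score at time t.

    times: measurement times (assumed distinct).
    scores: utility scores, with None marking a missing measurement.
    Outside the observed range the nearest observed score is carried over.
    """
    observed = sorted((ti, ui) for ti, ui in zip(times, scores) if ui is not None)
    if not observed:
        raise ValueError("no observed utility scores")
    if t <= observed[0][0]:
        return observed[0][1]
    if t >= observed[-1][0]:
        return observed[-1][1]
    for (t0, u0), (t1, u1) in zip(observed, observed[1:]):
        if t0 <= t <= t1:
            return u0 + (u1 - u0) * (t - t0) / (t1 - t0)


times = [0, 3, 36]
complete = interpolate_utility(times, [0.8, 0.4, 0.7], 3)   # observed dip
imputed = interpolate_utility(times, [0.8, None, 0.7], 3)   # dip is missing
```

When the 3-month score is missing, the interpolated value stays close to the baseline level (about 0.79) instead of the observed dip of 0.4, which illustrates how sparse measurement schedules combined with high missing rates can distort the utility profile.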

Another possible direction worth exploring is to take different functions of the utility score into consideration. One special case is that in many clinical studies, multiple measures of health status are recorded. There are various ways to combine different measures into a single utility score.^42–44 Extending HUS to handle any function of utility may increase power and help us gain insight into how utility differs between treatment groups. Furthermore, considering that utility may have different importance at different time points, we may assign different weights across time. For example, having a better utility score at a later stage of the study, which means the patients have recovered better, may be more important than having a better utility score at the end of surgery. In such cases, we can consider giving higher weights to later time points, and the resulting HUS may provide a clearer picture of which treatment is more beneficial for recovery.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802251338409 - Supplemental material for Health utility adjusted survival: A composite endpoint for clinical trial designs

Supplemental material, sj-pdf-1-smm-10.1177_09622802251338409 for Health utility adjusted survival: A composite endpoint for clinical trial designs by Yangqing Deng, John de Almeida and Wei Xu in Medical Research

Footnotes

Acknowledgments

The authors would like to acknowledge the contributions of Dr Hisham Mehanna (Institute of Head and Neck Studies and Education, University of Birmingham) and Dr Sue Yom (Department of Radiation Oncology, University of California) for clinical insights and discussion.

Data availability

R code for our simulation studies and summary data of health utility are available at .

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Alan Brown Chair in Molecular Genomics, the Lusi Wong Family Fund, and the Posluns Family Fund, all through the Princess Margaret Cancer Foundation.

ORCID iD

Yangqing Deng

Supplemental material

Supplemental material for this article is available online.

References

1. Mehanna, McConkey, Rahman, et al. PET-NECK: a multicentre randomised phase III non-inferiority trial comparing a positron emission tomography–computerised tomography-guided watch-and-wait policy with planned neck dissection in the management of locally advanced (N2/N3) nodal metastases in patients with squamous cell head and neck cancer. Health Technol Assess 2017; 21: 1–122.

2. Mathias, Bates, Pasta, et al. Use of the health utilities Index with stroke patients and their caregivers. Stroke 1997; 28: 1888–1894.

3. Horsman, Furlong, Feeny, et al. The health utilities Index (HUI®): concepts, measurement properties and applications. Health Qual Life Outcomes 2003; 1: 54.

4. Jewell, Smrtka, Broadwater, et al. Utility scores and treatment preferences for clinical early-stage cervical cancer. Value Health 2011; 14: 582–586.

5. Snapinn. Some remaining challenges regarding multiple endpoints in clinical trials. Stat Med 2017; 36: 4441–4445.

6. Offen, Chuang-Stein, Dmitrienko, et al. Multiple co-primary endpoints: medical and statistical solutions: a report from the multiple endpoints expert team of the pharmaceutical research and manufacturers of America. Drug Information J 2007; 41: 31–46.

7. Glasziou, Simes, Gelber. Quality adjusted survival analysis. Statist Med 1990; 9: 1259–1276.

8. Gelber. Quality-of-Life-Adjusted evaluation of adjuvant therapies for operable breast cancer. Ann Intern Med 1991; 114: 621.

9. Gelber, Goldhirsch, Cole, et al. A quality-adjusted time without symptoms or toxicity (Q-TWiST) analysis of adjuvant radiation therapy and chemotherapy for resectable rectal cancer. JNCI Journal of the National Cancer Institute 1996; 88: 1039–1045.

10. Murray, Cole. Variance and sample size calculations in quality-of-life-adjusted survival analysis (Q-TWiST). Biometrics 2000; 56: 173–182.

11. Konski, Winter, Cole, et al. Quality-adjusted survival analysis of radiation therapy oncology group (RTOG) 90-03: phase III randomized study comparing altered fractionation to standard fractionation radiotherapy for locally advanced head and neck squamous cell carcinoma. Head Neck 2009; 31: 207–212.

12. Zbrozek, Hudes, Levy, et al. Q-TWiST analysis of patients receiving temsirolimus or interferon alpha for treatment of advanced renal cell carcinoma. Pharmacoeconomics 2010; 28: 577–584.

13. Seymour, Gaitonde, Emeribe, et al. A quality-adjusted survival (Q-TWiST) analysis to assess benefit-risk of acalabrutinib versus idelalisib/bendamustine plus rituximab or ibrutinib among relapsed/refractory (R/R) chronic lymphocytic leukemia (CLL) patients. Blood 2021; 138: 3722–3722.

14. Jerusalem, Delea, Martin, et al. Quality-Adjusted survival with ribociclib plus fulvestrant versus placebo plus fulvestrant in postmenopausal women with HR±HER2− advanced breast cancer in the MONALEESA-3 trial. Clin Breast Cancer 2022; 22: 326–335.

15. Glasziou, Cole, Gelber, et al. Quality adjusted survival analysis with repeated quality of life measures. Stat Med 1998; 17: 1215–1229.

16. Prieto, Sacristán JA. Problems and solutions in calculating quality-adjusted life years (QALYs). Health Qual Life Outcomes 2003; 1: 80.

17. Whitehead, Ali. Health outcomes in economic evaluation: the QALY and utilities. Br Med Bull 2010; 96: 5–21.

18. Touray MML. Estimation of quality-adjusted life years alongside clinical trials: the impact of ‘time-effects’ on trial results. J Pharm Health Serv Res 2018; 9: 109–114.

19. Chung C-H, T-H, Wang J-D, et al. Estimation of quality-adjusted life expectancy of patients with oral cancer: integration of lifetime survival with repeated quality-of-life measurements. Value Health Reg Issues 2020; 21: 59–65.

20. Laska, Meisner, Siegel. Power and sample size in cost-effectiveness analysis. Med Decis Making 1999; 19: 339–343.

21. Willan, Lin. Incremental net benefit in randomized clinical trials. Statist Med 2001; 20: 1563–1574.

22. Hollingworth, McKell-Redwood, Hampson, et al. Cost–utility analysis conducted alongside randomized controlled trials: are economic end points considered in sample size calculations and does it matter? Clinical Trials 2013; 10: 43–53.

23. Bader, Cossin, Maillard, et al. A new approach for sample size calculation in cost-effectiveness studies based on value of information. BMC Med Res Methodol 2018; 18: 113.

24. Billingham, Abrams, Jones. Methods for the analysis of quality-of-life and survival data in health technology assessment. Health Technol Assess 1999; 3: 1–152.

25. Diaby, Adunlin, Ali, et al. Using quality-adjusted progression-free survival as an outcome measure to assess the benefits of cancer drugs in randomized-controlled trials: case of the BOLERO-2 trial. Breast Cancer Res Treat 2014; 146: 669–673.

26. Oza, Lorusso, Aghajanian, et al. Patient-Centered outcomes in ARIEL3, a phase III, randomized, placebo-controlled trial of rucaparib maintenance treatment in patients with recurrent ovarian carcinoma. JCO 2020; 38: 3494–3505.

27. Good. Permutation, parametric and bootstrap tests of hypotheses. 3rd ed. New York: Springer, 2005.

28. CFJ. Jackknife, bootstrap and other resampling methods in regression analysis [Internet]. Ann Statist 1986 [cited 2022 Dec 14]; 14: 1270–1283. Available from: https://projecteuclid.org/journals/annals-of-statistics/volume-14/issue-4/Jackknife-Bootstrap-and-Other-Resampling-Methods-in-Regression-Analysis/10.1214/aos/1176350142.full.

29. Shao. The Jackknife and Bootstrap [Internet]. New York, NY: Springer New York, 1995 [cited 2022 Dec 14]. Available from: http://link.springer.com/10.1007/978-1-4612-0795-5.

30. Myers, Ahn, Jin. Sample size and power estimates for a confirmatory factor analytic model in exercise and sport: a monte carlo approach. Res Q Exerc Sport 2011; 82: 412–423.

31. Royston, Parmar MKB. Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med Res Methodol 2013; 13: 152.

32. Irwin. The standard error of an estimate of expectation of life, with special reference to expectation of tumourless life in experiments with mice. J Hyg 1949; 47: 188–189.

33. Royston, Parmar MKB. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Statist Med 2011; 30: 2409–2421.

34. Zhao, Claggett, Tian, et al. On the restricted mean survival time curve in survival analysis. Biometrics 2016; 72: 215–221.

35. Jahangiri, Kazemnejad, Goldfeld, et al. A wide range of missing imputation approaches in longitudinal data: a simulation study and real data analysis. BMC Med Res Methodol 2023; 23: 161.

36. Pan, Kim, Zhang, et al. A powerful and adaptive association test for rare variants. Genetics 2014; 197: 1081–1095.

37. Kim, Bai, Pan. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genet Epidemiol 2015; 39: 651–663.

38. Shen, Fleming. Weighted mean survival test statistics: a class of distance tests for censored survival data. Journal of the Royal Statistical Society Series B: Statistical Methodology 1997; 59: 269–280.

39. Naeim, Keeler, Mangione. Options for handling missing data in the health utilities Index mark 3. Med Decis Making 2005; 25: 186–198.

40. Graham. Missing Data [Internet]. New York, NY: Springer New York, 2012 [cited 2022 Dec 14]. Available from: http://link.springer.com/10.1007/978-1-4614-4018-5.

41. Lambert, Royston. Further development of flexible parametric models for survival analysis. Stata J 2009; 9: 265–290.

42. Hawthorne, Richardson, Day NA. A comparison of the assessment of quality of life (AQoL) with four other generic utility instruments. Ann Med 2001; 33: 358–370.

43. Fisk. A comparison of health utility measures for the evaluation of multiple sclerosis treatments. J Neurol Neurosurg Psychiatry 2005; 76: 58–63.

44. Pickard, Ray, Ganguli, et al. Comparison of FACT- and EQ-5D–based utility scores in cancer. Value Health 2012; 15: 305–311.
