
Abstract

The win ratio has been increasingly used in trials with hierarchical composite endpoints. While the outcomes involved and the rule for their comparisons vary with the application, there is invariably little attention to the estimand of the resulting statistic, causing difficulties in interpretation and cross-trial comparison. We make the case for articulating the estimand as a first step to win ratio analysis and establish that the root cause for its elusiveness is its intrinsic dependency on the time frame of comparison, which, if left unspecified, is set haphazardly by trial-specific censoring. From the statistical literature, we summarize two general approaches to overcome this uncertainty—a nonparametric one that pre-specifies the time frame for all comparisons, and a semiparametric one that posits a constant win ratio across all times—each with publicly available software and real examples. Finally, we discuss unsolved challenges, such as estimand construction and inference in the presence of intercurrent events.

Keywords

Censoring, hierarchical endpoints, pairwise comparisons, proportionality, restricted mean time

Introduction

It is widely accepted that the traditional time-to-first-event analysis of a composite endpoint is less than ideal.^1,2 The first event mixes patient death with lesser events, such as hospitalization, and ignores whatever happens to the patient afterward (in case of a nonfatal first event). Yet only in the past decade has another method gained enough traction to become a viable alternative.

Win ratio and hierarchical endpoints

The new method is called the win ratio, first proposed in the work by Pocock et al.³ in 2012 in the European Heart Journal. It compares each pair of treated and untreated patients through a hierarchy of endpoints, for example, death > hospitalization > 6-min walk test (6MWT),⁴ with a lower component considered only if the prioritized ones are inconclusive. This allows more data to be used, and, more importantly, it prioritizes patient survival over nonfatal clinical events and, in turn, possibly other “softer” endpoints, such as quality-of-life measures and biomarkers. As a summary of treatment effect, the proportion of “wins” by the treatment is divided by that of their “losses” against the control. Close relatives^5,6 of the win ratio include the proportion in favor of treatment (or net benefit),⁷ which uses the difference rather than ratio, and the win odds,^8–10 which gives the numerator and denominator each half of the “ties.”

This type of approach, and the idea of a hierarchical composite it facilitates, quickly grew in popularity.^11–14 A reckoning of the online registry ClinicalTrials.gov finds a sharp rise in recent years in both the number of trials that adopt this methodology and the total number of patients involved (Figure 1). Most of the trials are cardiovascular in nature,¹⁵ some in disease areas such as cancer and diabetes. In a high-profile case, an early variation of the win ratio (namely, Finkelstein–Schoenfeld method)⁵ in the ATTR-ACT trial supported the Food and Drug Administration (FDA) approval of tafamidis in 2019 for the treatment of cardiomyopathy.¹⁶

Figure 1.

Registered trials (by start year) that specify win ratio-like approach to hierarchical composite endpoints in primary, secondary, or other analyses.

Impact of censoring on estimand

The win ratio has its downsides. A notable one is that its estimand, that is, the population-level quantity estimated by the sample-based statistic, varies with the censoring distribution. The statistical literature on this phenomenon has grown fairly rich and complete.^17–21 However, some explanations in practical terms may help the applied researcher to better understand its cause and implications.

Censoring decides the time frame of comparison

Simply put, the estimand’s dependency on censoring arises because different time frames are used to compare patient pairs censored at different times. This creates a mixture of comparisons whose underlying time frames lack consistency. Recall that two patients censored differently are compared through their minimum follow-up time.³ For example, consider a pair whose minimum follow-up is 1 year (e.g. one patient censored at Year 1 and the other at, say, Year 2). This means that their comparison is based on the data collected during the first year. Consider another pair in which neither patient is censored until Year 5. For them, the time frame of comparison is much longer. Over this longer time, ties are less likely as there are more events to compare on. So both the win and loss probabilities go up (regardless of the treatment effect). Win–loss is also more likely to be determined by prioritized components, as they are more likely to give conclusive results, leaving fewer pairs to be passed down to lower-priority components (in an extreme case, if most patients die within 5 years, then hospitalization would be of no use). Figure 2 illustrates how the win–loss status, and its deciding component, changes over time. In brief, censoring sets the time frame of comparison, which systematically influences the magnitude of win–loss proportions and the relative contributions by components (Table 1). In the end, the estimand is an average of shorter-term (e.g. 1-year) versus longer-term (e.g. 5-year) comparisons weighted by the censoring distribution.¹⁹

Figure 2.

An example of win–loss outcome changing over time.
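The dependence of a single pair's result on the time frame can be sketched in a few lines. The snippet below is a minimal illustration, assuming a two-tiered hierarchy (death ranked above first hospitalization) and two hypothetical patients; the names `Patient` and `compare` are illustrative and not taken from any package.

```python
import math
from typing import NamedTuple, Tuple


class Patient(NamedTuple):
    death: float  # time of death in years (math.inf if the patient never dies)
    hosp: float   # time of first hospitalization (math.inf if none)


def compare(treated: Patient, control: Patient, horizon: float) -> Tuple[str, str]:
    """Compare one treated-control pair using only data in [0, horizon].

    Returns (result for the treated patient, deciding component).
    """
    # Tier 1: whoever dies first within the horizon loses.
    if control.death < min(treated.death, horizon):
        return "win", "death"
    if treated.death < min(control.death, horizon):
        return "loss", "death"
    # Tier 2: consulted only if both patients survive the whole horizon.
    if min(treated.death, control.death) > horizon:
        if control.hosp < min(treated.hosp, horizon):
            return "win", "hospitalization"
        if treated.hosp < min(control.hosp, horizon):
            return "loss", "hospitalization"
    return "tie", "none"


# Hypothetical pair: the treated patient is hospitalized early but lives longer.
treated = Patient(death=4.0, hosp=0.5)
control = Patient(death=2.0, hosp=math.inf)
for horizon in (0.25, 1.0, 5.0):
    print(horizon, compare(treated, control, horizon))
```

For this pair, the result moves from a tie (3 months) to a loss on hospitalization (1 year) to a win on survival (5 years), so the follow-up available to a pair decides both the result and the component that determines it.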

Table 1.

How longer follow-up changes parameters and relative weight of components.

Win probability	Loss probability	Tie probability	Weight of higher-priority components	Weight of lower-priority components	Win ratio
↑	↑	↓	↑	↓	Uncertain

Bias versus ambiguity in estimand

Some call this the “bias” due to censoring, which presumes an estimand in mind to begin with. Sometimes this is almost true. For example, every outcome measure on ClinicalTrials.gov has a “time frame” tag, supposedly specifying the period over which patient outcomes are to be collected for analysis. In that case, a natural estimand would be the (population) win ratio with all patients followed over the same period of time. However, because of staggered entry or random withdrawal, some patients may have shorter follow-up than specified. A naive calculation would bias the win–loss estimands in directions opposite to those listed in Table 1 (since follow-up is shorter than the target). What needs to be done here is to correct the bias by addressing the curtailed follow-up in the observed data (more on this later).

More often there is no clear estimand in mind. The intrinsic dependency of the measure on the time frame is ignored, and a single measure of effect size is reported as if universal in nature. To be fair, this is not completely without merit. In any case, the win ratio statistic is still a (censoring-)weighted average of the time (frame)-dependent win ratios, much like the hazard ratio statistic being one of the time-dependent hazard ratios in case of non-proportional hazards (PH).²² Moreover, to test certain hypotheses where the treatment performs consistently better than the control over time, the win ratio does offer a valid test with desirable properties.^17,20 (The analogy with the univariate case is that the log-rank statistic offers a valid test of ordered-hazards alternatives even without proportionality.)²³

Importance of estimand as a measure of effect size

More care is needed when it comes to estimation, whose aim is to quantify the treatment effect. To do so, one needs to articulate the way the effect is measured on the target population. This requires an estimand that is generalizable and not subject to change by the design and logistics of specific trials, which is not true of the win ratio, as shown in Table 1. For one thing, it is hard to compare trials of different lengths, let alone perform a meta-analysis thereof. Even for trials of similar durations, factors affecting the distribution of follow-up times within the trial (for example, whether more patients are recruited toward the beginning or the end) can move the result in ways not controlled by the investigator.

What’s more, the International Council for Harmonization of Technical Requirements for Pharmaceuticals for Human Use (ICH), an initiative to build consensus on the evaluation of medicinal products around the world, recently issued an “addendum on estimands and sensitivity analysis” to the “guideline on statistical principles for clinical trials” (ICH E9 (R1)).²⁴ The addendum demands clarity in the way treatment effect is measured, which it sees as one of the “central questions for drug development and licensing.” It also lists four key attributes of a meaningful estimand, namely, the treatment, target population, endpoint, and population-level summary.²⁵ The associated training material specifically warns that “missing data and loss-to-follow-up are irrelevant to the construction of estimands.”²⁶ The document has since been adopted by the FDA and European Medicines Agency (EMA), both regulatory members of the ICH, and is in the final step of implementation (Step 5).^27,28

What this means for the win ratio

All this calls for attention to the estimand of the win ratio. A valid estimand should summarize the scientific variables (endpoints) without entangling censoring (though the latter must be dealt with in estimation). The question is both old and new. Old because censoring predates win ratio (no pun intended) and affects univariate and composite endpoints alike. New because win ratio involves multiple outcomes, with a pairwise-comparison routine that is complicated both arithmetically and statistically. With a bit more care, however, and some instructive comparisons with the univariate case, we can realign the win ratio toward the requisite estimand-driven framework.

Two approaches to estimand construction

Since ambiguity in the estimand is due to stochastically shifting time frames, there are two ways to pin it down. One is to fix the time frame, which can be done nonparametrically. The other is to posit a measure that stays constant over time, which requires a (at least semiparametric) temporal model.

Time restriction: a nonparametric approach

The idea of pre-specifying a time horizon to measure the outcome is hardly new. We have mentioned, for example, that each outcome measure on ClinicalTrials.gov is listed with a “time frame” attribute (e.g. 6 and 12 months, 5 years, etc.), though it is not always used to define an estimand explicitly.

Time restriction in univariate setting

For time-to-event outcomes, it is particularly helpful, even mandatory, to specify the time frame over which events are counted. In breast cancer, for example, it is common to report the 5-year survival rate (relative to the healthy population).²⁹ Let $D^{(a)}$ denote the survival time of a generic patient in group $a$ , where $a = 1, 0$ indicate the active treatment versus the control. Then, the 5-year survival rate amounts to a restriction of survival function $S^{(a)} (t) = P (D^{(a)} > t)$ at $t = 5$ years. Both $S^{(1)} (t) - S^{(0)} (t)$ and $S^{(1)} (t) / S^{(0)} (t)$ , that is, absolute and relative increases in $t$ -year survival rate, are sensible estimands of effect size.

Better yet, the survival rates over a time range, say from 0 to $τ$ years, can be combined into the restricted mean survival time (RMST) $μ_{D}^{(a)} (τ) = E {min (D^{(a)}, τ)}$ , the average time lived in the first $τ$ years. It is a cumulative summary of the survival rate because of the alternate expression $μ_{D}^{(a)} (τ) = \int_{0}^{τ} S^{(a)} (t) d t$ .³⁰ Intuitively, the RMST captures not just the survival status at a particular point but its distribution over a time window. As a result, many consider it a fuller measure of patient experience than a cross-sectional survival rate. Likewise, we can use $μ^{(1)} (τ) - μ^{(0)} (τ)$ and $μ^{(1)} (τ) / μ^{(0)} (τ)$ to measure the absolute or relative increases in the average life time during the first $τ$ years.

Time-restricted win ratio

A lesson learned from the simple setting above is that the estimand should summarize the latent, structural variable uncorrupted by censoring. This feels natural and effortless if the outcome is just $D^{(a)}$ , and all summaries end up a functional of $S^{(a)} (t)$ , which can be estimated by the Kaplan–Meier estimator. It may not be so for the win ratio, which involves pairwise comparisons on multiple ranked outcomes. In fact, because the comparison must take into account when patients are under observation (two patients become uncomparable after one is censored), is it even possible to strip away censoring in its definition? If so, how should we redefine it in structural terms?

The key is to envision all patients followed to the same restriction time $τ$ , none censored early (early censoring in the observed data, if any, is a problem for estimation, not one for estimand construction). Call $τ$ a “structural censoring” time (Oakes¹⁹ calls this progressive censoring). With $τ$ the common “censoring” time in all pairwise comparisons, a structural win ratio is produced with a consistent time frame $[0, τ]$ that is transferable and generalizable across trials.

Start with the univariate case. The survival time up to $τ$ is $min (D^{(a)}, τ)$ . Thus, a win for group $a$ compared with $1 - a$ $(a = 1, 0)$ amounts to $min (D^{(1 - a)}, τ) < min (D^{(a)}, τ)$ , which can be succinctly written as $D^{(1 - a)} < min (D^{(a)}, τ)$ (since $τ \geq min (D^{(a)}, τ)$ always). This means that the $τ$ -restricted win probability is $P {D^{(1 - a)} < min (D^{(a)}, τ)}$ , giving rise to win ratio estimand

$$r(τ) = \frac{P\{D^{(0)} < \min(D^{(1)}, τ)\}}{P\{D^{(1)} < \min(D^{(0)}, τ)\}}. \tag{1}$$

Hence, the treated are $r (τ)$ times as likely, or $r (τ) - 1$ times more likely, to survive longer than the untreated by time $τ$ .

The composite case is similar. For simplicity consider a two-tiered composite with a single nonfatal event (hospitalization) time $T^{(a)}$ of secondary importance to death. A win for group $a$ compared with $1 - a$ by time $τ$ can result from the former being a longer survivor $(D^{(1 - a)} < min (D^{(a)}, τ))$ or, when both survive to the end $(\min (D^{(1)}, D^{(0)}) > τ)$ , the winner enjoying a longer time event-free $(T^{(1 - a)} < min (T^{(a)}, τ))$ . Under this rule, the $τ$ -restricted win probability is

$$\begin{aligned} w_{a, 1 - a}(τ) &= P\{D^{(1 - a)} < \min(D^{(a)}, τ)\} \\ &\quad + P\{\min(D^{(1)}, D^{(0)}) > τ,\ T^{(1 - a)} < \min(T^{(a)}, τ)\} \end{aligned} \tag{2}$$

with corresponding win ratio estimand $r (τ) = w_{1, 0} (τ) / w_{0, 1} (τ)$ .

More generally, the outcome may comprise multiple types of (possibly recurrent) events,³¹ complete with biomarkers or patient-reported quality-of-life scores. Let $H^{(a)} (τ)$ denote the totality of such data collected up to $τ$ . For example, if $N_{D}^{(a)} (t), N_{1}^{(a)} (t), \dots, N_{K}^{(a)} (t)$ are the counting processes for death and $K$ other types of nonfatal events, respectively, and if $Y^{(a)} (t)$ is a set of quantitative measures at $t$ , then

$$H^{(a)}(τ) = \{N_{D}^{(a)}(t), N_{1}^{(a)}(t), \dots, N_{K}^{(a)}(t), Y^{(a)}(t) : 0 \leq t \leq τ\},$$

containing all event trajectories and longitudinal measurements during $[0, τ]$ . To calculate the win ratio, let $W (\cdot, \cdot)$ be the win indicator, in the sense that

$$W\{H^{(a)}(τ), H^{(1 - a)}(τ)\} = I\{H^{(a)}(τ) \text{ is better than } H^{(1 - a)}(τ)\}. \tag{3}$$

Then the $τ$ -restricted win probability can be expressed as $w_{a, 1 - a} (τ) = P [W {H^{(a)} (τ), H^{(1 - a)} (τ)} = 1]$ , with corresponding win ratio $r (τ) = w_{1, 0} (τ) / w_{0, 1} (τ)$ . Similarly, this is interpreted as the treated faring $r (τ)$ times as well as, or $r (τ) - 1$ times better than, the untreated by time $τ$ (with “better” defined by the rule of comparison encoded in $W$ ).
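When every patient is actually followed to $τ$ (no early censoring), the win–loss probabilities reduce to simple proportions over all treated–control pairs. The sketch below assumes such complete data and takes the comparison rule as a user-supplied function; `pairwise_summary`, `longer_restricted_survival`, and the survival times are hypothetical. It also returns the net benefit and win odds described in the Introduction.

```python
import math
from itertools import product
from typing import Callable, Dict, Sequence, TypeVar

H = TypeVar("H")  # one patient's record of outcomes collected over [0, tau]


def pairwise_summary(
    treated: Sequence[H],
    control: Sequence[H],
    is_better: Callable[[H, H], bool],
) -> Dict[str, float]:
    """Win/loss/tie proportions over all treated-control pairs (complete data only)."""
    wins = losses = 0
    for h1, h0 in product(treated, control):
        if is_better(h1, h0):
            wins += 1
        elif is_better(h0, h1):
            losses += 1
    n_pairs = len(treated) * len(control)
    ties = n_pairs - wins - losses
    odds_den = losses + ties / 2  # win odds split the ties evenly
    return {
        "win": wins / n_pairs,
        "loss": losses / n_pairs,
        "tie": ties / n_pairs,
        "win_ratio": wins / losses if losses else math.inf,
        "net_benefit": (wins - losses) / n_pairs,
        "win_odds": (wins + ties / 2) / odds_den if odds_den else math.inf,
    }


TAU = 5.0  # restriction time in years (assumed)


def longer_restricted_survival(d_a: float, d_b: float) -> bool:
    """Univariate rule: a beats b if b dies first within [0, TAU]."""
    return min(d_b, TAU) < min(d_a, TAU)


# Hypothetical, fully observed survival times (years).
summary = pairwise_summary([2.0, 6.0, 7.0], [1.0, 3.0, 8.0], longer_restricted_survival)
print(summary)
```

Passing the rule as a function mirrors the role of $W$ in Equation (3): the pairwise loop stays the same for any hierarchy of outcomes.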

Techniques in estimation

We have seen that defining a time-restricted estimand for composite endpoints is conceptually no more complex than one for a univariate endpoint. The difficulty in estimating it, however, increases with the number of variables. This has to do with how censoring is handled.

It is true that one can specify a small $τ$ so that most patients are uncensored by then. However, this usually results in a window too limited to be of medical interest, not to mention the waste of data. With a reasonably set $τ$ , some patients are likely censored before it. In that case, the $H^{(a)} (τ)$ under comparison are not fully observed. This means that we can’t use standard two-sample $U$ -statistics, that is, empirical versions of the win–loss probabilities, to estimate the estimand.

Working around this in the univariate case is relatively easy. One can express the estimand as a function of the outcome distribution, which can then be estimated in the presence of censoring by the Kaplan–Meier method. For example, we can rewrite Equation (1) as $r (τ) = \int_{0}^{τ} S^{(1)} (t) d F^{(0)} (t) / \int_{0}^{τ} S^{(0)} (t) d F^{(1)} (t)$ (first fix the losing patient’s survival time at $t$ , then integrate over its distribution), where $F^{(a)} (t) = 1 - S^{(a)} (t)$ . Plugging in the Kaplan–Meier estimators of the $S^{(a)} (t)$ gives us a valid estimator of $r (τ)$ .
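The plug-in idea can be sketched as follows. This is an illustrative, unoptimized implementation assuming independent censoring; the function names and data are hypothetical, and variance estimation is not attempted.

```python
from typing import List, Sequence, Tuple

Curve = List[Tuple[float, float]]  # [(death time, S just after that time)]


def kaplan_meier(times: Sequence[float], died: Sequence[bool]) -> Curve:
    """Kaplan-Meier survival estimates at the distinct observed death times."""
    curve, surv = [], 1.0
    for t in sorted({x for x, d in zip(times, died) if d}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, d in zip(times, died) if d and x == t)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def surv_at(curve: Curve, t: float) -> float:
    """Right-continuous step function S(t) = P(D > t)."""
    s = 1.0
    for tk, sk in curve:
        if tk > t:
            break
        s = sk
    return s


def win_prob(winner: Curve, loser: Curve, tau: float) -> float:
    """Sum of S_winner(t) * dF_loser(t) over the loser's death times before tau."""
    total, prev = 0.0, 1.0
    for tk, sk in loser:
        if tk >= tau:
            break
        total += surv_at(winner, tk) * (prev - sk)  # prev - sk is the jump of F_loser at tk
        prev = sk
    return total


def restricted_win_ratio(km1: Curve, km0: Curve, tau: float) -> float:
    """Plug-in estimate of r(tau): treated wins divided by control wins."""
    return win_prob(km1, km0, tau) / win_prob(km0, km1, tau)


# Hypothetical follow-up times (years) and death indicators (False = censored).
km1 = kaplan_meier([2.0, 4.0, 6.0, 7.0], [True, False, True, True])  # treated
km0 = kaplan_meier([1.0, 3.0, 3.5, 8.0], [True, True, False, True])  # control
print(restricted_win_ratio(km1, km0, tau=5.0))
```

Without censoring, the two sums reduce to the empirical proportions of pairwise wins and losses, so the estimator reproduces the naive win ratio in that case.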

The challenge in the general case is not that a relationship between the estimand and outcome distribution fails to hold, but that the outcome distribution is harder to estimate with multiple censored components. Indeed, we can similarly express the win–loss probability in Equation (2), or one with even more components, using the joint distributions of the event times involved. However, an equivalent of the Kaplan–Meier estimator that is nonparametrically valid, stable, and efficient in the multidimensional setting is nonexistent. To capitalize on the estimand-outcome distribution relationship (in an “integral approach”),³² one typically needs a joint (e.g. shared-frailty) model for the outcomes.³³ The modeling can easily become unwieldy as more components are added.

It is thus preferable to take the nonparametric approach if there is one. To do so, one can tweak the standard two-sample $U$ -statistic (in a “counting approach”)³² to a proper adjustment of censoring. The key is to inversely weight the uncensored patients usable in the cross-group comparisons by the probability of them being uncensored, in an effort to overcome the bias in selecting them. Let $C^{(a)}$ denote the censoring time. A patient is uncensored (including dead) by time $τ$ if $C^{(a)} > min (D^{(a)}, τ)$ . Under independent censoring, the proper inverse weight is $G^{(a)} {min (D^{(a)}, τ)}$ , where $G^{(a)} (t) = P (C^{(a)} > t)$ and can be replaced by the Kaplan–Meier estimator for censoring. This idea of inverse probability censoring weighting (IPCW) applies in the general setup of Equation (3), though in specific cases it may be modified to capture more patients (for example, a patient hospitalized and then censored is still comparable to one known to be event-free by $τ$ ). Dong et al.³⁴ worked out the common case with hierarchical time-to-event components. If censoring depends on the outcome, but the dependency is fully explained by baseline covariates, an extension is available through the use of covariate-specific censoring weights (at the price of a nuisance model for censoring against covariates).³⁵ R-package WINS implements this methodology, with a nicely written tutorial available on the Comprehensive R Archive Network (CRAN).³⁶ Besides IPCW, multiple imputations are also used to deal with incomplete observation of general types of outcomes before the restriction time.³⁷
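To make the weighting concrete, here is a minimal univariate sketch of the IPCW idea, not the WINS implementation. It assumes independent censoring and no tied death and censoring times, and it uses the left-continuous version $G (t -)$ of the censoring survival function as the weight. All names and data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Obs = Tuple[float, bool]  # (follow-up time X = min(D, C), True if death was observed)


def censoring_survival(data: Sequence[Obs]) -> Callable[[float], float]:
    """Kaplan-Meier estimate of G(t-) = P(C >= t), treating censoring as the event."""
    steps, g = [], 1.0
    for c in sorted({x for x, died in data if not died}):
        at_risk = sum(1 for x, _ in data if x >= c)
        n_cens = sum(1 for x, died in data if x == c and not died)
        g *= 1.0 - n_cens / at_risk
        steps.append((c, g))

    def g_minus(t: float) -> float:
        value = 1.0
        for c, gc in steps:
            if c >= t:
                break
            value = gc
        return value

    return g_minus


def usable(data: Sequence[Obs], tau: float) -> List[Tuple[float, float]]:
    """Patients whose status over [0, tau] is fully known, with their inverse weights."""
    g_minus = censoring_survival(data)
    out = []
    for x, died in data:
        if died or x >= tau:       # death observed, or followed through tau
            m = min(x, tau)        # restricted survival time min(D, tau)
            out.append((m, 1.0 / g_minus(m)))
    return out


def ipcw_win_prob(winner: Sequence[Obs], loser: Sequence[Obs], tau: float) -> float:
    """IPCW estimate of P{D_loser < min(D_winner, tau)}."""
    total = 0.0
    for m_w, wt_w in usable(winner, tau):
        for m_l, wt_l in usable(loser, tau):
            if m_l < m_w:  # loser's restricted survival time is shorter
                total += wt_w * wt_l
    return total / (len(winner) * len(loser))


# Hypothetical data: one treated patient is censored at year 3, before tau = 5.
treated = [(2.0, True), (3.0, False), (6.0, True)]
control = [(1.0, True), (4.0, True)]
w10 = ipcw_win_prob(treated, control, tau=5.0)
w01 = ipcw_win_prob(control, treated, tau=5.0)
print(w10, w01, w10 / w01)
```

In this sketch the patient censored at year 3 is dropped, and the treated patient followed past $τ$ is up-weighted to stand in for that patient, which is the selection-bias correction described above.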

Variations of restricted win ratio

The win–loss probabilities $w_{a, 1 - a} (τ)$ can also be used to calculate restricted win odds or proportion in favor of treatment. For further extensions, the win indicator $W$ can be replaced by a quantitative function, for example, one that measures the length of time one “wins” against the other.

In the univariate case, since a (cross-sectional) win at $t$ means $D^{(a)} > t$ and $D^{(1 - a)} \leq t$ ( $a$ alive and $1 - a$ dead), the length of win time during $[0, τ]$ is

$$W\{H^{(a)}(τ), H^{(1 - a)}(τ)\} = \int_{0}^{τ} I(D^{(a)} > t, D^{(1 - a)} \leq t)\, dt.$$

So the average win time is $w_{a, 1 - a} (τ) = E [W {H^{(a)} (τ), H^{(1 - a)} (τ)}] = \int_{0}^{τ} P (D^{(a)} > t, D^{(1 - a)} \leq t) d t = \int_{0}^{τ} S^{(a)} (t) {1 - S^{(1 - a)} (t)} d t$ . Now that the $w_{a, 1 - a} (τ)$ are average times, not proportions, $r (τ) = w_{1, 0} (τ) / w_{0, 1} (τ)$ might be called a restricted “win time” ratio. Alternatively, we can take their difference to obtain the restricted “mean time” (instead of proportion) in favor (RMT-IF) of treatment.³⁸ Interestingly, this gives us something familiar

$$\begin{aligned} μ(τ) &= w_{1, 0}(τ) - w_{0, 1}(τ) \\ &= \int_{0}^{τ} S^{(1)}(t)\, dt - \int_{0}^{τ} S^{(0)}(t)\, dt \\ &= E\{\min(D^{(1)}, τ)\} - E\{\min(D^{(0)}, τ)\}, \end{aligned} \tag{4}$$

that is, the difference in RMST (net time in favor is just the extra survival time).
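The step from the first to the second line of Equation (4) may be worth spelling out. Substituting the average win times $\int_{0}^{τ} S^{(a)} (t) {1 - S^{(1 - a)} (t)} d t$ for each group, the product terms cancel:

```latex
\begin{aligned}
w_{1,0}(\tau) - w_{0,1}(\tau)
  &= \int_0^\tau S^{(1)}(t)\{1 - S^{(0)}(t)\}\,dt
     - \int_0^\tau S^{(0)}(t)\{1 - S^{(1)}(t)\}\,dt \\
  &= \int_0^\tau \{S^{(1)}(t) - S^{(1)}(t)S^{(0)}(t)
     - S^{(0)}(t) + S^{(0)}(t)S^{(1)}(t)\}\,dt \\
  &= \int_0^\tau S^{(1)}(t)\,dt - \int_0^\tau S^{(0)}(t)\,dt .
\end{aligned}
```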

With multiple ranked events, the overall RMT-IF can be divided into component-specific pieces, each expressible as an integral of the component’s survival functions similarly to the second line of Equation (4).³⁹ This means that we can again insert the Kaplan–Meier estimator, whose numerical stability is reliable, instead of resorting to IPCW or multiple imputations to handle censoring. As an example, consider the landmark colon cancer trial reported by Moertel et al.,⁴⁰ with death $(D^{(a)})$ and cancer recurrence $(T^{(a)})$ in a two-tiered composite endpoint. The RMT-IF’s death component is the same as in Equation (4). For recurrence, the average win time is $E {\int_{0}^{τ} I (D^{(1)} > t, D^{(0)} > t, T^{(a)} > t, T^{(1 - a)} \leq t) d t} = \int_{0}^{τ} {\tilde{S}}^{(a)} (t) {S^{(1 - a)} (t) - {\tilde{S}}^{(1 - a)} (t)} d t$ , where ${\tilde{S}}^{(a)} (t)$ is the survival function of the traditional time to the first event $min (D^{(a)}, T^{(a)})$ (also estimable by the Kaplan–Meier method). With $τ = 7.5$ years, Table 2 summarizes the restricted mean times in favor of levamisole + fluorouracil $(n = 304)$ against control $(n = 314)$ .³⁸ On average, the combined treatment extends patient survival by 0.62 year (net win time on survival), with an extra 0.35 year recurrence-free in the living (net win time on recurrence). General implementations of the RMT-IF, along with sample size calculation tools,⁴¹ are available in the rmt package on CRAN.⁴²

Table 2.

Restricted mean times in favor of treatment in a colon cancer trial by $τ = 7.5$ years.

Component	Estimate (years)	95% CI (years)	p
Death	0.62	(0.20, 1.04)	0.004
Recurrence	0.35	(0.21, 0.49)	<0.001
Overall	0.97	(0.47, 1.46)	<0.001

Temporal modeling: a semiparametric approach

A temporal model imposes a constraint on the time trajectory of certain features. A prime example is the proportionality assumption in the Cox model—group-specific hazard rates are constrained to be proportional, or equivalently their ratio constant, over time. This allows us to report the hazard ratio as a singular measure of effect size regardless of the length of follow-up. A similar strategy can be applied to the win ratio to break its dependency on the time frame.

Proportionality of win fractions (proportions)

Recall that the win–loss proportions (fractions) $w_{a, 1 - a} (t)$ both increase with $t$ (Table 1). In the presence of censoring before $τ$ , the naive calculation involves shorter time frames than $[0, τ]$ , thereby biasing the estimand. Indeed, the actual estimand is shown to be $E [w_{a, 1 - a} {min (τ, C^{(1)}, C^{(0)})}] < w_{a, 1 - a} (τ)$ , as $min (τ, C^{(1)}, C^{(0)}) \leq τ$ with strict inequality if censoring ever occurs before $τ$ . This is where the IPCW comes in—it picks, with probabilistic adjustment, only those uncensored by $τ$ for comparison. While this achieves consistency, it leaves some data unused.

To learn from the Cox model, we can posit a relationship of win and loss probabilities over time so that all data can be used. This temporal model would also obviate the need for a restriction time. The simplest idea is to posit a constant win ratio, or equivalently, proportional win and loss probabilities

$$\frac{w_{1, 0}(t)}{w_{0, 1}(t)} = \exp(θ) \quad \text{for some } θ \text{ and all } t. \tag{5}$$

This is called the proportional win-fractions (PW) model.⁴³ It is semiparametric because it parametrizes only the win–loss relationship over time, while leaving other aspects of the outcomes, like the baseline event rates, unspecified. This achieves a minimal model for time-constant win ratio.

Plausibility of proportionality

Although the possibility of a constant win ratio cannot be ruled out a priori (Table 1), how plausible is it in practice? Can model (5) actually hold?

In the univariate case, we have already seen that the time-dependent win ratio on the left hand side is $r (t) = \int_{0}^{t} S^{(1)} (u) d F^{(0)} (u) / \int_{0}^{t} S^{(0)} (u) d F^{(1)} (u)$ . Let $Λ^{(a)} (u)$ denote the cumulative hazard function of $D^{(a)}$ . The standard conversion $d F^{(a)} (u) = S^{(a)} (u) d Λ^{(a)} (u)$ (bar discontinuities) helps to rewrite $r (t) = \int_{0}^{t} S^{(1)} (u) S^{(0)} (u) d Λ^{(0)} (u) / \int_{0}^{t} S^{(1)} (u) S^{(0)} (u) d Λ^{(1)} (u)$ . Under the Cox PH model with $Λ^{(1)} (u) = \exp (β) Λ^{(0)} (u)$ , it becomes $r (t) = \exp (- β)$ . This means that the PH model implies the PW model, with the win ratio equal to the inverse of the hazard ratio. In fact, the reverse implication is also true, so the two models (PW and PH) are actually equivalent.⁴⁴
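The implication from PH to PW can be checked numerically. The sketch below assumes a Weibull baseline cumulative hazard $Λ^{(0)} (u) = u^{1.5}$ and a hazard ratio of 0.7, both arbitrary choices for illustration, and evaluates the two integrals in $r (t)$ on a fine grid.

```python
import math


def win_ratio_at(t: float, beta: float, steps: int = 20000) -> float:
    """Numerically evaluate r(t) under proportional hazards with a Weibull baseline."""

    def surv0(u: float) -> float:
        return math.exp(-(u ** 1.5))  # assumed baseline: Lambda0(u) = u^1.5

    def surv1(u: float) -> float:
        return math.exp(-math.exp(beta) * u ** 1.5)  # PH: Lambda1 = exp(beta) * Lambda0

    h = t / steps
    num = den = 0.0
    for k in range(steps):
        lo, hi = k * h, (k + 1) * h
        mid = 0.5 * (lo + hi)
        num += surv1(mid) * (surv0(lo) - surv0(hi))  # S1 dF0: control dies first
        den += surv0(mid) * (surv1(lo) - surv1(hi))  # S0 dF1: treated dies first
    return num / den


beta = math.log(0.7)  # hazard ratio 0.7 in favor of treatment (assumed)
for t in (0.5, 1.0, 3.0):
    print(t, win_ratio_at(t, beta), math.exp(-beta))
```

Up to discretization error, the computed win ratio stays at $\exp (- β) = 1 / 0.7$ for every $t$, even though the win and loss probabilities themselves keep growing with $t$.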

Their relationship goes beyond the univariate case. In a two-tiered composite, for example, if a PH model holds not only on survival but also on the nonfatal event conditioning on survival with the same hazard ratio, then the PW model holds, again with the win ratio equal to the inverse of the (two components’ shared) hazard ratio. This extends to three components and more. In the general case, Equation (5) is implied by a Lehmann model, where the joint survival function in the treatment is equal to that in the control raised to the power of $\exp (- θ)$ .¹⁹ However, because the Lehmann model in a multidimensional setting has more constraints than Equation (5) requires, the reverse implication is false—the PW model is more relaxed and encompasses a wide range of non-Lehmann scenarios.⁴³

Estimation and model diagnostics

One benefit of proportionality is that it makes estimation with censoring easier. The standard win ratio statistic, for example, is now a valid estimator of $\exp (θ)$ under Equation (5). In fact, any time-weighted version thereof also works. Let $\hat{W} (t)$ and $\hat{L} (t)$ be the standard win and loss statistics calculated based on observed data cut off by $t$ (with $C^{(a)}$ replaced by $min (C^{(a)}, t)$ ). Although they themselves are not valid estimators of $w_{1, 0} (t)$ or $w_{0, 1} (t)$ (due to censoring before $t$ ), proportionality means that their weighted ratio $\int_{0}^{\infty} H (t) d \hat{W} (t) / \int_{0}^{\infty} H (t) d \hat{L} (t)$ is always a valid estimator of $\exp (θ)$ regardless of the weight function $H (t)$ . Taking $H (t) \equiv 1$ reduces it to the standard win ratio statistic $\hat{W} (\infty) / \hat{L} (\infty)$ . Other choices of $H (t)$ that systematically improve the efficiency of the standard estimator (i.e. narrow its confidence interval) have yet to be established.

If the model requires no modification to the standard approach, it may feel like just assuming away the problem (i.e. time dependency) to justify business as usual. However, making the assumption transparent has its own merit. Among other things, it allows us to identify when the model is violated so as not to opt for business as usual. For example, Equation (5) implies that the conditional probability of a win among determinate pairs (those excluding ties) is a constant $\exp (θ) / {1 + \exp (θ)}$ (think of a conditional logistic regression of case–control pairs).⁴⁵ To exploit this for model diagnostics, let $\hat{D} (t) = \hat{W} (t) + \hat{L} (t)$ and consider the residual process

$$\mathrm{resid}(t) = \underbrace{\hat{W}(t)}_{\text{Observed wins}} - \underbrace{\hat{D}(t) \cdot \frac{\exp(\hat{θ})}{1 + \exp(\hat{θ})}}_{\text{Model-predicted wins}},$$

which under Equation (5) should be unbiased around zero at all times (Figure 3(a)). A closer look reveals that $resid (0) = resid (\infty) = 0$ always (the latter because $\exp (\hat{θ}) = \hat{W} (\infty) / \hat{L} (\infty)$ by definition). It is the pattern in between that could show signs of non-proportionality. For example, if the residual process goes up and then down, there are proportionally more wins earlier than later, suggesting the win ratio is decreasing (Figure 3(b)). Likewise, if it goes down and then up, there are proportionally fewer wins earlier than later, suggesting the win ratio is increasing (Figure 3(c)). Both violate the constancy of win ratio required in Equation (5).
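The raw residual process is simple to compute once the cumulative win and loss statistics are available on a time grid. The sketch below takes hypothetical values of $\hat{W} (t)$ and $\hat{L} (t)$ as inputs; it does not perform the standardization to a Brownian bridge used in Figure 3, which requires variance estimates not shown here.

```python
import math
from typing import List, Sequence


def pw_residuals(wins: Sequence[float], losses: Sequence[float]) -> List[float]:
    """Raw residual process resid(t) on a time grid.

    wins[k] and losses[k] are the cumulative win and loss statistics from
    data cut off at the k-th grid time; the last entry plays the role of
    t = infinity.
    """
    theta_hat = math.log(wins[-1] / losses[-1])  # log of the overall win ratio
    p_win = math.exp(theta_hat) / (1.0 + math.exp(theta_hat))
    return [w - (w + l) * p_win for w, l in zip(wins, losses)]


# Hypothetical cumulative statistics: wins accumulate faster early on.
w_hat = [0.0, 30.0, 55.0, 70.0, 80.0]
l_hat = [0.0, 15.0, 35.0, 52.0, 64.0]
print(pw_residuals(w_hat, l_hat))
```

In this made-up example the residuals start at zero, rise, and return to zero, the up-then-down pattern that would suggest a decreasing win ratio.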

Figure 3.

Residual processes standardized to a Brownian bridge. Extremum exceeding ± 2 suggests bias (non-proportionality).

To take a substantive view of model violations, recall that, for a two-tiered composite, proportionality is guaranteed by a Lehmann model with a common component-wise hazard ratio. We can roughly interpret this as the treatment having the same effect on both components. This helps explain why the win ratio stays constant even as time shifts more focus to the prioritized one (Table 1). To flip the argument, when the treatment has different effects on the components, the win ratio will change magnitude to align itself more with the prioritized one as time goes on. For example, if the treatment offers a greater/smaller reduction in mortality than in the risk of hospitalization among survivors, the win ratio will increase/decrease, respectively, over time, making Equation (5) untenable. In such cases, the estimand of model-based estimators will be a (censoring) time-weighted average of (log-)win ratios (similarly to the partial-likelihood estimator of hazard ratio in presence of non-PH).²² Nonetheless, it may still yield a valid test if one group wins (or loses) consistently over time compared to the other.

Covariate adjustment

Another benefit of a model like Equation (5) is that it can adjust for covariates the same way as it does the treatment arm. Let $z = (a, x)^{T}$ , where $x$ contains baseline covariates, for example, patient sex, age, medical history, and so on. To compare $z$ with another $z' = (a', x')^{T}$ , let $w_{z, z'} (t)$ and $w_{z', z} (t)$ be the win and loss probabilities, respectively. Similarly to Equation (5), the win ratio can be modeled as

$$\frac{w_{z, z'}(t)}{w_{z', z}(t)} = \exp\{θ(a - a') + β^{T}(x - x')\} \quad \text{for all } t. \tag{6}$$

In this model, $\exp (θ)$ is still the treatment-versus-control win ratio, albeit a conditional one holding other covariates constant, and $\exp (β)$ contains the win ratios resulting from unit increases in the corresponding covariates. To allow for treatment–covariate interaction, add $γ^{T} (ax - a' x')$ to the linear predictor on the right hand side of Equation (6). Then, unit increases in $x$ change the treatment-versus-control win ratio by factors of $\exp (γ)$ .

Consider a subset of $n = 1051$ heart failure (HF) patients in the HF-ACTION trial conducted over 2003–2007 to evaluate the effects of exercise training in addition to usual care.⁴⁶ With a two-tiered composite of all-cause mortality and first all-cause hospitalization, model (6) is fit with $a = I (training vs usual care)$ adjusting for $x =$ patient age, sex, HF etiology (ischemic or not), pre-treatment cardiopulmonary exercise (CPX) test (duration of exercise before reporting discomfort), histories of atrial fibrillation (AF) or diabetes, among others. Table 3 lists the regression results for key variables.⁴³

Table 3. Multiple PW regression in HF-ACTION trial.

Predictor	Win ratio	95% CI	p-value
Training versus usual	1.06	(0.95, 1.19)	0.275
Age (10 years)	1.02	(0.97, 1.07)	0.468
Male versus female	0.72	(0.63, 0.82)	<0.001
Ischemic versus no	0.87	(0.76, 0.98)	0.027
CPX (min)	1.11	(1.09, 1.13)	<0.001
AF versus no	0.80	(0.70, 0.92)	0.002
Diabetes versus no	0.98	(0.87, 1.11)	0.726

Although exercise training adds only a modest 6% to the win ratio, an effect that is not statistically significant ($p = 0.275$), several other factors do substantially and significantly affect patient outcomes. Of note, a mere 1-min increase in CPX test duration raises the win ratio, that is, the ratio of the probability of a better outcome to that of a worse one, by 11%. Would this measure of physical stamina also modify the effect of exercise training? One can find out by refitting the model with an extra treatment × CPX interaction.
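Because model (6) is log-linear in the covariates, the win ratio for a $k$-unit increase in a covariate is the per-unit win ratio raised to the power $k$, and the confidence limits transform the same way. The Python sketch below applies this to the CPX row of Table 3. The function name `scale_win_ratio` is hypothetical, and the inputs are the rounded values from the table, so the result inherits that rounding.

```python
def scale_win_ratio(estimate, ci_lower, ci_upper, k):
    """Win ratio and CI for a k-unit covariate increase under a log-linear model.

    Under exp(beta * x), a k-unit increase multiplies the win ratio by
    exp(k * beta) = exp(beta) ** k. For k > 0 the map r -> r ** k is
    increasing, so the confidence limits can be transformed directly.
    """
    if k <= 0:
        raise ValueError("this sketch only handles positive increments k")
    return estimate ** k, ci_lower ** k, ci_upper ** k


# CPX row of Table 3: win ratio 1.11 per minute, 95% CI (1.09, 1.13).
point, lower, upper = scale_win_ratio(1.11, 1.09, 1.13, k=5)
print(f"5-min increase in CPX: {point:.2f} ({lower:.2f}, {upper:.2f})")
```

With these rounded inputs, a 5-min increase in CPX duration corresponds to a win ratio of roughly 1.69 (1.54 to 1.84), holding the other covariates constant.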

Covariate adjustment by Equation (6) carries some caveats. First, the conditional win ratio is not comparable to the marginal one, where the treatment is the only predictor, due to “non-collapsibility” of the metric (the average of the ratio is not the ratio of the averages). Second, additional predictors bring in additional assumptions, namely, that proportionality must hold for every one of them. Fortunately, similar residual processes can be used to check on each covariate. If a categorical covariate is found to have a non-proportional effect, one can stratify (rather than regress) the model on it,⁴⁷ essentially restricting comparisons to patients within the same stratum.^48–50 For two-tiered composites, PW methodology including model diagnostics and stratification is available in the WR package on CRAN.⁵¹

Discussions

Starting a win ratio analysis by first defining the estimand clarifies the scientific goal, meets regulatory guidelines and requirements, and is, as we have seen, not that hard to do. In a nonparametric approach, one pre-specifies the time frame for the comparisons and handles censored data within it in an unbiased way through inverse probability of censoring weighting (IPCW). In a semiparametric approach, one constrains the win ratio to be constant regardless of the restriction time (PW model). The validity of this assumption can be checked by plotting the residuals between observed and model-predicted wins over time, whose pattern shows the actual trend of the win ratio.
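The role of the pre-specified time frame can be illustrated with a minimal Python sketch of a time-restricted pairwise comparison for a two-tiered composite (death, then first hospitalization). It assumes fully observed outcomes up to the restriction time `tau`, so no censoring weights are involved; the function names, the toy data, and the rule that a later event time wins are illustrative assumptions rather than the interface of any cited package.

```python
import math
from itertools import product


def compare_pair(treated, control, tau):
    """Return 1 (treated wins), -1 (treated loses), or 0 (tie) on [0, tau].

    Each patient is a tuple (death_time, first_hosp_time); math.inf means
    the event never occurs. Event times are truncated at tau, so events
    after tau cannot influence the comparison.
    """
    # Tier 1: longer survival within [0, tau] wins.
    d_t, d_c = min(treated[0], tau), min(control[0], tau)
    if d_t != d_c:
        return 1 if d_t > d_c else -1
    # Tier 2: consulted only when tier 1 is tied; later hospitalization wins.
    h_t, h_c = min(treated[1], tau), min(control[1], tau)
    if h_t != h_c:
        return 1 if h_t > h_c else -1
    return 0


def count_wins_losses(treated_group, control_group, tau):
    """Count treatment wins, losses, and ties over all cross-group pairs."""
    results = [compare_pair(t, c, tau)
               for t, c in product(treated_group, control_group)]
    return results.count(1), results.count(-1), results.count(0)


def win_ratio(treated_group, control_group, tau):
    """Time-restricted win ratio: number of wins divided by number of losses."""
    wins, losses, _ = count_wins_losses(treated_group, control_group, tau)
    return wins / losses if losses > 0 else math.inf


# Toy data (hypothetical): (death_time, first_hosp_time) per patient.
treated = [(math.inf, math.inf), (4.0, 2.0), (math.inf, 3.0)]
control = [(2.0, 1.0), (math.inf, 1.0), (math.inf, math.inf)]

for tau in (2.5, 5.0):
    print(tau, count_wins_losses(treated, control, tau),
          win_ratio(treated, control, tau))
```

On this toy data set the same patients give 6 wins against 1 loss when compared over $[0, 2.5]$ but 5 wins against 3 losses over $[0, 5]$, which is the dependence on the time frame that the estimand has to pin down.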

Separation of estimand and estimation has limits

A clear definition of estimand benefits from separating it from estimation—the former uses the full data one wishes to observe while the latter makes do with a censored version that is actually observed. This separation is not absolute. In the nonparametric approach, for example, if the restriction time is set so far that no patient is still under observation by that time, the corresponding estimand would not be identifiable or estimable.³⁰ Likewise in the semiparametric approach, even if proportionality is found to be true by the observed residuals, extrapolation of the model beyond the last observation would still be a matter of faith. As a rule of thumb, a sizable portion of patients must be present at any given time in order for an estimand to be estimable or a model to be checkable with reasonable precision or confidence.
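One way to see this limit is through the censoring weights themselves: IPCW divides by an estimate of the probability of remaining uncensored, which must stay positive up to the restriction time. The Python sketch below computes a Kaplan-Meier estimate of that probability from follow-up data. The function name, the toy data, and the simplifying assumption that no death coincides with a censoring time are illustrative choices, not taken from the cited software.

```python
def censoring_survival(times, events, tau):
    """Kaplan-Meier estimate of G(tau) = P(censoring time > tau).

    times  : follow-up time of each patient
    events : 1 if follow-up ended with the outcome event (e.g. death),
             0 if it ended with censoring
    Censoring is treated as the "event" of interest here. The sketch
    assumes no outcome event shares its time with a censoring.
    """
    g = 1.0
    censor_times = sorted({t for t, e in zip(times, events)
                           if e == 0 and t <= tau})
    for u in censor_times:
        at_risk = sum(1 for t in times if t >= u)
        censored = sum(1 for t, e in zip(times, events) if t == u and e == 0)
        g *= 1.0 - censored / at_risk
    return g


# Toy follow-up data (hypothetical): the last patient is censored at time 5.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 0, 1, 0, 0]

for tau in (3.0, 5.0):
    print(tau, censoring_survival(times, events, tau))
```

In this toy example the estimated probability is 0.75 at a restriction time of 3 but drops to 0 at a restriction time of 5, where nobody remains under observation and the inverse weights are no longer defined.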

Trade-offs between two approaches

The nonparametric approach by definition has fewer assumptions, but also requires pre-specification of a restriction time and results in less data being used. The semiparametric one combines all data in a global measure of effect size, but at the price of a (somewhat stringent) temporal model. For statistical efficiency, the semiparametric approach may be preferred if its assumption is supported by the observed residuals. Otherwise, the nonparametric one offers a more robust alternative by localizing the time frame of comparison.

Improvement in efficiency and robustness

This efficiency–robustness trade-off exists largely because the current methods are still underdeveloped. An eclectic, more sophisticated approach may improve both qualities. For example, we can define the estimand nonparametrically through time restriction, and then use semiparametric means to estimate it. In randomized trials, this often entails positing a “working model” to augment a standard estimator (e.g. IPCW) by model-based predictions of missing/censored values. This makes the estimator more efficient when the model is true, yet still unbiased when it is not (thanks to the presence of the model-free standard estimator to fall back on).⁵² The same idea applies to covariate adjustment, which provides an alternative to the regression modeling of Equation (6), differing in both the estimand (marginal versus conditional) and inference (robust versus model-dependent). Although this covariate adjustment approach has been well established for standard endpoints,^53–55 even with an FDA recommendation,⁵⁶ using it on the win ratio faces statistical challenges, such as correlated pairs resulting from cross-group comparisons.⁵⁷

Handling of intercurrent events

With a broad brush, we have considered loss to follow-up as random censoring, a missing-data mechanism that plays no role in the estimand. This, however, should not include patient withdrawal or change of treatment due to toxicity or non-response, or what ICH E9 (R1) calls “intercurrent events”—those “occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest.”²⁴ Thankfully, the addendum offers some general strategies for handling such events in estimand construction, and those are instructive for the win ratio.

To start, the “hypothetical strategy” means envisioning a scenario where the composite endpoints, or at least their deduced comparison results, are those that would have been observed had the intercurrent event not occurred. This is close to treating the intercurrent event as “dependent censoring” and can be accommodated relatively easily within our existing frameworks. Indeed, the estimand would stay the same (with full data defined hypothetically in the absence of the intercurrent event), but the estimator would need changes to correct for bias. For example, we can replace the censoring weights in the IPCW with ones that account for the dependence of the intercurrent event on the observed data, or apply them to the estimating function of the PW model for the same purpose.

The “composite strategy” sounds even easier as it implies just adding the intercurrent event as a component to the outcome. Yet there are delicate issues to ponder, not least at which tier to insert the event and what to do with the prioritized events that occur after it. Consider treatment failure as an intercurrent event second in importance only to death. If a patient dies after treatment failure and her survival time is used for comparison, then we are essentially following a “treatment policy strategy,” that is, intent-to-treat, for at the point of death the patient is no longer under the initial treatment she was randomized to. However, if the patient’s survival time is unknown or considered unusable after treatment failure, then we need to account for the latter as a competing risk. (Although death is also a competing risk, it poses no difficulty because it takes precedence over other components.) Different options give the estimand different meanings and require different approaches to estimation.

The “principal strata strategy” may be the most challenging, both conceptually and technically. It restricts the population (one of the four attributes of an estimand) to a subset defined by the potential status of an intercurrent event under different treatments.⁵⁸ More concretely, consider the principal stratum of patients who would not experience treatment failure if treated. This includes not only those in the treatment group who did not experience treatment failure but also those in the control group who would not have experienced treatment failure had they been assigned to treatment. This counterfactual thinking implies a need to grapple with unobserved variables, with concomitant identifiability issues. Adding to that is the requirement that both patients under comparison in the win ratio must come from the same (counterfactually defined) stratum. Technicalities aside, the principal strata approach does have its place in causal reasoning. Besides measuring treatment effect on particular subgroups, it can help investigate treatment mechanisms. For example, the extent to which treatment effect is mediated by a biomarker can be assessed by looking at the treatment effect on patients whose biomarkers would be the same whether treated or not.⁵⁸ Win ratio-like methods framed in such counterfactual terms have only just begun to emerge.^59–61 More work is certainly welcome.

Footnotes

Acknowledgements

The author thanks the Editor, Associate Editor, and an anonymous referee for their helpful comments.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institutes of Health (NIH; grant no. R01HL149875).

ORCID iD

Lu Mao

References

1. Freemantle, Calvert, Wood, et al. Composite outcomes in randomized trials: greater precision but with greater uncertainty. J Am Med Assoc 2003; 289: 2554–2559.
2. Anker, McMurray. Time to move on from ‘time-to-first’: should all events be included in the analysis of clinical trials. Eur Heart J 2012; 33(22): 2764–2765.
3. Pocock, Ariti, Collier, et al. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J 2012; 33(2): 176–182.
4. Rasekaba, Lee, Naughton, et al. The six-minute walk test: a useful metric for the cardiopulmonary patient. Intern Med J 2009; 39(8): 495–501.
5. Finkelstein, Schoenfeld. Combining mortality and longitudinal measures in clinical trials. Stat Med 1999; 18: 1341–1354.
6. Buyse. Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Stat Med 2010; 29: 3245–3257.
7. Péron, Buyse, Ozenne, et al. An extension of generalized pairwise comparisons for prioritized outcomes in the presence of censoring. Stat Methods Med Res 2018; 27(4): 1230–1239.
8. Dong, Hoaglin, Qiu, et al. The win ratio: on interpretation and handling of ties. Stat Biopharm Res 2020; 12: 99–106.
9. Brunner, Vandemeulebroecke, Mütze. Win odds: an adaptation of the win ratio to include ties. Stat Med 2021; 40(14): 3367–3384.
10. Song, Verbeeck, Huang, et al. The win odds: statistical inference and regression. J Biopharm Stat 2023; 33(2): 140–150.
11. Abdalla, Montez-Rath, Parfrey, et al. The win ratio approach to analyzing composite outcomes: an application to the EVOLVE trial. Contemp Clin Trials 2016; 48: 119–124.
12. Cui, Dong, Kuan, et al. Evidence synthesis analysis with prioritized benefit outcomes in oncology clinical trials. J Biopharm Stat 2023; 33(3): 272–288.
13. Seifu, Mt-Isa, Duke, et al. Design of paediatric trials with benefit-risk endpoints using a composite score of adverse events of interest (AEI) and win-statistics. J Biopharm Stat 2023; 33(6): 696–707.
14. Dong, Huang, Verbeeck, et al. Win statistics (win ratio, win odds, and net benefit) can complement one another to show the strength of the treatment effect on time-to-event outcomes. Pharm Stat 2023; 22(1): 20–33.
15. Redfors, Gregson, Crowley, et al. The win ratio approach for composite endpoints: practical guidance based on previous experience. Eur Heart J 2020; 41(46): 4391–4399.
16. Maurer, Schwartz, Gundapaneni, et al. Tafamidis treatment for patients with transthyretin amyloid cardiomyopathy. N Engl J Med 2018; 379(11): 1007–1016.
17. Luo, Tian, Mohanty, et al. An alternative approach to confidence interval estimation for the win ratio statistic. Biometrics 2015; 71(1): 139–145.
18. Bebu, Lachin. Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics 2016; 17(1): 178–187.
19. Oakes. On the win-ratio statistic in clinical trials with multiple types of event. Biometrika 2016; 103: 742–745.
20. Mao. On the alternative hypotheses for the win ratio. Biometrics 2019; 75: 347–351.
21. Chen, et al. The elusiveness of the win ratio parameter in the presence of missing data. Ther Innov Regul Sci 2024; 58(3): 431–432.
22. Struthers, Kalbfleisch. Misspecified proportional hazard models. Biometrika 1986; 73: 363–369.
23. Fleming, Harrington. Counting processes and survival analysis. Hoboken, NJ: John Wiley & Sons, 1991.
24. ICH E9 (R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials, step 5. London: European Medicines Evaluation Agency, 2020.
25. Akacha, Bretz, Ohlssen, et al. Estimands and their role in clinical trials. Stat Biopharm Res 2017; 9(3): 268–271.
26. ICH. The ICH E9(R1) step 2 training material, https://database.ich.org/sites/default/files/E9(R1)TrainingMaterial-PDF_0.pdf, 2018.
27. van der Laan, DeGeorge. Global approach in safety testing: ICH guidelines explained. New York: Springer, 2013.
28. Ionan, Paterniti, Mehrotra, et al. Clinical and statistical perspectives on the ICH E9 (R1) estimand framework implementation. Stat Biopharm Res 2023; 15(3): 554–559.
29. Helgeson, Tomich. Surviving cancer: a comparison of 5-year disease-free breast cancer survivors with healthy women. Psychooncology 2005; 14(4): 307–317.
30. Tian, Jin, Uno, et al. On the empirical choice of the time window for restricted mean survival time. Biometrics 2020; 76(4): 1157–1166.
31. Mao, Kim. On recurrent-event win ratio. Stat Methods Med Res 2022; 31(6): 1120–1134.
32. Dong, Huang, Chang, et al. The win ratio: impact of censoring and follow-up time and use with nonproportional hazards. Pharm Stat 2020; 19(3): 168–177.
33. Finkelstein, Schoenfeld. Graphing the win ratio and its components over time. Stat Med 2019; 38: 53–61.
34. Dong, Mao, Huang, et al. The inverse-probability of censoring weighting (IPCW) adjusted win ratio statistic: an unbiased estimator in the presence of independent censoring. J Biopharm Stat 2020; 30(5): 882–899.
35. Dong, Huang, Wang, et al. Adjusting win statistics for dependent censoring. Pharm Stat 2021; 20(3): 440–450.
36. Cui, Huang. Introduction to the R package WINS, 2023, https://CRAN.R-project.org/package=WINS
37. Wang, Zilinskas, et al. Missing data imputation for a multivariate outcome of mixed variable types. Stat Biopharm Res 2023; 15: 1–12.
38. Mao. On restricted mean time in favor of treatment. Biometrics 2023; 79(1): 61–72.
39. Mao, Wang. Dissecting the restricted mean time in favor of treatment. J Biopharm Stat 2023: 1–16.
40. Moertel, Fleming, Macdonald, et al. Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma. N Engl J Med 1990; 322(6): 352–358.
41. Mao. Study design for restricted mean time analysis of recurrent events and death. Biometrics 2023; 79(4): 3701–3714.
42. Mao. rmt: restricted mean time in favor of treatment, 2021, https://cran.r-project.org/package=rmt
43. Mao, Wang. A class of proportional win-fractions regression models for composite outcomes. Biometrics 2021; 77(4): 1265–1275.
44. Moser, McCann. Reformulating the hazard ratio to enhance communication with clinical investigators. Clinical Trials 2008; 5: 248–252.
45. Connolly, Liang. Conditional logistic regression models for correlated binary data. Biometrika 1988; 75(3): 501–506.
46. O’Connor, Whellan, Lee, et al. Efficacy and safety of exercise training in patients with chronic heart failure: HF-ACTION randomized controlled trial. J Am Med Assoc 2009; 301: 1439–1450.
47. Wang, Mao. Stratified proportional win-fractions regression analysis. Stat Med 2022; 41(26): 5305–5318.
48. Dong, Qiu, Wang, et al. The stratified win ratio. Pharm Stat 2018; 28: 778–796.
49. Gasparyan, Folkvaljon, Bengtsson, et al. Adjusted win ratio with stratification: calculation methods and interpretation. Stat Methods Med Res 2021; 30(2): 580–611.
50. Dong, Hoaglin, Huang, et al. The stratified win statistics (win ratio, win odds, and net benefit). Pharm Stat 2023; 22(4): 748–756.
51. Mao, Wang. WR: win ratio analysis of composite time-to-event outcomes, 2021, https://cran.r-project.org/package=WR
52. Tsiatis. Semiparametric theory and missing data. New York: Springer, 2006.
53. Tsiatis, Davidian, Zhang, et al. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Stat Med 2008; 27(23): 4658–4677.
54. Wang, Susukida, Mojtabai, et al. Model-robust inference for clinical trials that improve precision by stratified randomization and covariate adjustment. J Am Stat Assoc 2023; 118(542): 1152–1163.
55. Shao, et al. Toward better practice of covariate adjustment in analyzing randomized clinical trials. J Am Stat Assoc 2023; 118(544): 2370–2382.
56. FDA. Guidance document: adjusting for covariates in randomized clinical trials for drugs and biological products. Silver Spring, MD: US Food and Drug Administration, 2023.
57. Mao. On causal estimation using u-statistics. Biometrika 2018; 105(1): 215–220.
58. Bornkamp, Rufibach, Lin, et al. Principal stratum strategy: potential role in drug development. Pharm Stat 2021; 20(4): 737–751.
59. Han, Chen, et al. Causal inference for Mann–Whitney–Wilcoxon rank sum and other nonparametric statistics. Stat Med 2014; 33(8): 1261–1271.
60. Fay, Brittain, Shih, et al. Causal estimands and confidence intervals associated with Wilcoxon-Mann-Whitney tests in randomized experiments. Stat Med 2018; 37(20): 2923–2937.
61. Zhang, Wisniewski, Jeong. Causal inference on win ratio for observational data with dependent subjects. arXiv preprint. arXiv:2212.06676, 2022.

Defining estimand for the win ratio: Separate the true effect from censoring
