
Abstract

The concordance index is often used to measure how well a biomarker predicts the time to an event. Estimators of the concordance index for predictors of right-censored data are reviewed, including those based on censored pairs, inverse probability weighting and a proportional-hazards model. Predictive and prognostic biomarkers often lose strength with time, and in this case the aforementioned statistics depend on the length of follow up. A semi-parametric estimator of the concordance index is developed that accommodates converging hazards through a single parameter in a Pareto model. Concordance index estimators are assessed through simulations, which demonstrate substantial bias of classical censored-pairs and proportional-hazards model estimators. Prognostic biomarkers in a cohort of women diagnosed with breast cancer are evaluated using new and classical estimators of the concordance index.

Keywords

Biomarkers C-index discrimination Pareto model proportional-hazards model survival analysis

1 Introduction

After determining if predictors of censored survival data are significant, a common objective is to measure their predictive strength on a scale that is not sample dependent. A plethora of statistics have been suggested. Some have attempted to transfer the concept of R² from linear regression to censored data.^1,2 In this article we consider use of the concordance index for censored data.

The first part of the paper reviews the concordance index for predictors of censored survival data. The second part develops concordance index estimators that are valid when the strength of the predictor becomes diminished with follow up. Our proposals are compared with classical methods using computer simulations and a breast cancer prognostic biomarker example.

2 Concordance index

The concordance index was initially developed to estimate the degree to which a randomly chosen observation from one distribution was larger than one chosen independently from another distribution.³ When T₁ and T₂ are continuous independent random variables with cumulative distribution functions F₁ and F₂ the concordance index is

\begin{matrix} C = P (T_{1} > T_{2}) \\ = \int \{1 - F_{1} (u)\} d F_{2} (u) \end{matrix}

If T₁ and T₂ place positive mass at the same point then we count half for ties and define C as P(T₁ > T₂) + P(T₁ = T₂)/2 so that

\begin{matrix} C = \int \{1 - F_{1} (u) + \frac{1}{2} P (T_{1} = u)\} d F_{2} (u) \end{matrix}

(1)

and C = 0.5 when the two distributions are the same, even with ties. The concordance index can be estimated from the normalized Wilcoxon ranksum (Mann–Whitney) statistic, by

\begin{matrix} \hat{C} = (nm)^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{m} \{I (T_{1 i} > T_{2 j}) + \frac{1}{2} I (T_{1 i} = T_{2 j})\} \end{matrix}

where $T_{1 i}$ (i = 1, …, n) and $T_{2 j}$ (j = 1, …, m) are independent samples from F₁ and F₂ respectively, and I(.) denotes the indicator function. If $R_{i}$ denotes the rank of $T_{1 i}$ (i = 1, …, n) in the combined sample $(T_{1 1}, \dots, T_{1 n}, T_{2 1}, \dots, T_{2 m})$ with the ranks of tied observations averaged, then the Wilcoxon ranksum test statistic is given by $W = \sum_{i = 1}^{n} R_{i}$, which can be related to $\hat{C}$ through $W = nm \hat{C} + n (n + 1) / 2$. When the samples (T₁ and T₂) come from cases and controls respectively, the concordance index is the area under the receiver operating characteristic curve for (F₁, F₂).⁴ When the samples are from two arms of a randomised controlled trial, C is a measure of the treatment effect. Some variations of C have also been studied. These include the odds of concordance C(1−C)⁻¹,^5–7 and a modification to account for matched case-control designs,⁸ but they are not considered further in this article.
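The relation between the rank sum and the concordance estimate can be checked numerically. The sketch below is only an illustration on made-up data; the function names `concordance` and `rank_sum` are hypothetical and not part of the original work.

```python
def concordance(t1, t2):
    """Normalised Mann-Whitney statistic: mean of I(a > b) + 0.5 * I(a == b)."""
    total = 0.0
    for a in t1:
        for b in t2:
            if a > b:
                total += 1.0
            elif a == b:
                total += 0.5
    return total / (len(t1) * len(t2))


def rank_sum(t1, t2):
    """Wilcoxon rank sum W of the first sample, with tied ranks averaged."""
    combined = t1 + t2

    def average_rank(value):
        below = sum(1 for c in combined if c < value)
        equal = sum(1 for c in combined if c == value)
        return below + (equal + 1) / 2.0

    return sum(average_rank(a) for a in t1)


# Illustrative data with ties between and within the samples
t1 = [5.0, 7.0, 7.0, 9.0]
t2 = [3.0, 7.0, 8.0]
n, m = len(t1), len(t2)
c_hat = concordance(t1, t2)   # 7 / 12
w = rank_sum(t1, t2)          # 17.0, which equals n * m * c_hat + n * (n + 1) / 2
```

Here the 12 pairs contribute a total of 7 (counting half for each tie), and the ranks of the first sample in the combined sample are 2, 4, 4 and 7.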

For a one-parameter family {T_Z} of random variables indexed by real number Z from distribution {F_Z}, a concordance index that quantifies the degree of association between T_Z and Z is defined as

\begin{matrix} C_{Z} = 2 \int_{z_{1} > z_{2}} \int \{P (T_{z_{1}} > T_{z_{2}}) + \frac{1}{2} P (T_{z_{1}} = T_{z_{2}})\} d F_{Z} (z_{1}) d F_{Z} (z_{2}) + \frac{1}{2} P (Z_{1} = Z_{2}) \end{matrix}

(2)

where the last term essentially derives from allowing ties in Z to be broken at random.⁹ The definition has the advantage of being continuous in the distribution F_Z and is equivalent to Kendall’s τ rank correlation coefficient because $C_{Z} = 0.5 + τ / 2$.

C_Z and C are not the same when Z is a two-point distribution, but they are linearly related. Consider where Z = 1, 2 (e.g. respectively cases and controls, or treated and untreated) and P(Z = 1) = P(Z = 2) = 0.5. Then $C_{Z} = 2 \times P (T_{2} > T_{1}) \times 0.5 \times 0.5 + 1 / 2 \times 0.5 = C / 2 + 1 / 4$. Thus for the balanced two-sample situation the range of C_Z is only (1/4, 3/4) and not (0, 1) as for C. This important aspect is due to ties in Z, and interpretation of C_Z is affected whenever ties in Z are possible. For example, the upper bound of C_Z may decrease if a continuous Z is rounded. Although obvious from (2), this might seem surprising because in practice it is often implicitly assumed that the range of the concordance index C_Z is always (0, 1). Some bounds on the range of C_Z are as follows. Suppose there are n discrete values of Z. Then the smallest possible P(Z₁ = Z₂) occurs when they are distributed uniformly, so that $P (Z_{1} = Z_{2}) = 1 / n$; the smallest attainable minimum of C_Z with n points is (2n)⁻¹ and the largest attainable maximum is $1 - (2 n)^{- 1}$. Therefore, with discrete data one might normalize C_Z so that it can theoretically attain 0 and 1 via $\{C_{Z} - (2 n)^{- 1}\} (1 - 1 / n)^{- 1}$. For large n the range of C_Z is less of an issue, and for continuous distributions of Z the range of C_Z is (0, 1), as can be seen by letting T_Z = {−Z} and T_Z = {Z} respectively be a set of degenerate one-point distributions for continuous Z.
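The arithmetic behind these bounds can be sketched in a few lines. The helper names below (`cz_two_point`, `cz_range`, `normalise_cz`) are hypothetical, and the normalisation assumes Z is uniform on n points as in the text.

```python
def cz_two_point(c, p1=0.5):
    """C_Z from (2) for a two-point Z.

    c is the pairwise concordance for the z1 > z2 pair and p1 = P(Z = 1).
    """
    p2 = 1.0 - p1
    tie_probability = p1 * p1 + p2 * p2          # P(Z1 = Z2)
    return 2.0 * c * p1 * p2 + 0.5 * tie_probability


def cz_range(n):
    """Attainable range of C_Z when Z is uniform on n discrete values."""
    return 1.0 / (2 * n), 1.0 - 1.0 / (2 * n)


def normalise_cz(cz, n):
    """Rescale C_Z so that it can attain 0 and 1 with n equally likely values."""
    return (cz - 1.0 / (2 * n)) / (1.0 - 1.0 / n)


lower, upper = cz_range(2)            # (0.25, 0.75) for the balanced two-sample case
best = cz_two_point(1.0)              # 0.75: perfect separation still gives only 3/4
rescaled = normalise_cz(best, 2)      # 1.0 after normalisation
```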

In the rest of the paper we focus on estimators of C and C_Z for right-censored data.

3 Estimator review

3.1 Censored-pairs estimators

The concordance indices (1) and (2) have been extended to censored data by ignoring pairs when the smaller survival time is censored and using a normalising constant to account for these uninformative pairs.^10,11 While such statistics can be useful for comparing different models on the same data set, Efron¹² noted that Gehan’s approach¹⁰ was dependent on the censoring distribution, and so was not a universal measure of P(T₁ > T₂). Others have noted that Harrell’s approach¹¹ likewise depends on the censoring distribution.¹³ If the censoring random variable H_Z is conditionally independent of T_Z given Z, so that the observed survival function is (1−F_{T_Z})(1−F_{H_Z}), then from equation (2), the censored-pairs concordance index is given by

\begin{matrix} C_{ZH} = [2 \int_{z_{1} > z_{2}} \int \{P (T_{z_{1}} > T_{z_{2}}) + \frac{1}{2} P (T_{z_{1}} = T_{z_{2}})\} \\ \times P (H_{z_{1}} > T_{z_{2}}) P (H_{z_{2}} > T_{z_{2}}) d F_{Z} (z_{1}) d F_{Z} (z_{2}) + \frac{1}{2} P (Z_{1} = Z_{2})] \\ \times [2 \int_{z_{1} > z_{2}} \int P (H_{z_{1}} > T_{z_{2}}) P (H_{z_{2}} > T_{z_{2}}) d F_{Z} (z_{1}) d F_{Z} (z_{2}) + \frac{1}{2} P (Z_{1} = Z_{2})]^{- 1} \end{matrix}

(3)

The $P (H_{z_{1}} > T_{z_{2}}) P (H_{z_{2}} > T_{z_{2}})$ terms in the numerator and denominator arise because contributions to the statistic only occur for pairs of observations when the smaller survival time is not censored. The following methods were developed to be independent of the censoring distribution.
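A minimal sketch of a Harrell-style censored-pairs statistic may help fix ideas. It follows the sign convention of (2), so a pair is concordant when the larger z has the longer time (negate z for a risk score). The function name is hypothetical, tied observation times are simply skipped, and this is not presented as the exact published estimator.

```python
def censored_pairs_c(times, events, z):
    """Concordance over usable pairs only.

    A pair (i, j) is usable when j has the strictly smaller observed time
    and that time is an event (events[j] == 1). Ties in z count one half.
    """
    concordant = 0.0
    usable = 0.0
    n = len(times)
    for j in range(n):
        if events[j] != 1:
            continue                      # smaller time censored: pair is uninformative
        for i in range(n):
            if times[i] > times[j]:
                usable += 1.0
                if z[i] > z[j]:
                    concordant += 1.0
                elif z[i] == z[j]:
                    concordant += 0.5
    return concordant / usable


# Illustrative data: the second subject is censored at time 4
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 0, 1, 1]
c_up = censored_pairs_c(times, events, [1.0, 2.0, 3.0, 4.0])     # 1.0
c_down = censored_pairs_c(times, events, [4.0, 3.0, 2.0, 1.0])   # 0.0
```

Only four of the six pairs are usable here, because the two pairs in which the censored subject has the smaller time are dropped from both numerator and denominator.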

3.2 Efron’s estimator of C

For the two-sample situation, Efron¹² suggested a solution using the Kaplan–Meier estimates for the survival distribution given by S₁(t) = 1−F₁(t) and S₂(t) = 1−F₂(t), and computing P(T₁ > T₂) based on these estimates through

\begin{matrix} {\hat{C}}_{E} = - \int {\hat{S}}_{1} (u) d {\hat{S}}_{2} (u) \end{matrix}

where ${\hat{S}}_{1} (u)$ and ${\hat{S}}_{2} (u)$ are the Kaplan–Meier estimates of the survival functions S₁ and S₂ respectively.¹⁴ That is

\begin{matrix} {\hat{C}}_{E} = (nm)^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{m} \hat{Q} (t_{1 i}, t_{2 j}, y_{1 i}, y_{2 j}) \end{matrix}

(4)

where the observed data are in pairs of event times and indicators $(t_{1 i}, y_{1 i})$ in group 1 and $(t_{2 j}, y_{2 j})$ in group 2, where $y_{1 i} = 0$ if $t_{1 i}$ is censored, one otherwise, and similarly for $y_{2 j}$, and

Q (t_{1 i}, t_{2 j}, y_{1 i}, y_{2 j}) = P (T_{1} > T_{2} | t_{1 i}, t_{2 j}, y_{1 i}, y_{2 j})

is estimated by substituting Kaplan–Meier estimates of survival functions into the relevant terms in Table 1. Examples to show the difference between $E ({\hat{C}}_{E})$ and the censored-pairs approach have been reported.¹⁵

Table 1.

Values of Efron’s $Q (t_{i}, t_{j}, y_{i}, y_{j})$ for the concordance statistic. Note that for the two-sample estimator of C the 1 and 2 subscripts have been dropped, so that for example $t_{i}$ represents $t_{1 i}$ and $t_{j}$ is $t_{2 j}$; similarly $S_{i}$ is $S_{1}$, etc. This notation is used so that the table generalises to estimators of C_Z.

(y_i, y_j)	t_i ≥ t_j	t_i < t_j
(1, 1)	1	0
(0, 1)	1	$\frac{S_{i} (t_{j})}{S_{i} (t_{i})}$
(1, 0)	$1 - \frac{S_{j} (t_{i})}{S_{j} (t_{j})}$	0
(0, 0)	$1 - \frac{S_{j} (t_{i})}{S_{j} (t_{j})} + \frac{\int_{t_{i}}^{\infty} S_{i} (u) d F_{j} (u)}{S_{i} (t_{i}) S_{j} (t_{j})}$	$\frac{\int_{t_{j}}^{\infty} S_{i} (u) d F_{j} (u)}{S_{i} (t_{i}) S_{j} (t_{j})}$

${\hat{C}}_{E}$ overcomes limitations of the censored-pairs approach for the two-group problem but requires that the estimated survival functions decrease to zero, so that one treats the last event time in each group as not censored in the Kaplan–Meier estimator. When there is censoring due to incomplete follow up, with everyone censored by t_max and where S₁(t_max) > 0 and S₂(t_max) > 0, then Efron’s estimator may be very unstable. An important example of this situation is when individuals are enrolled sequentially in a clinical trial and events are recorded until (say) 10 years after the first entry (t_max = 10). In such situations taking the last time in each group to be an event will substantially bias the concordance index in the direction of the group with the longest surviving member beyond that time. For example, if 90% are at risk in both groups after the last event has occurred, then 81% of the terms in the double summation (4) will favour the group with the longest surviving (censored) member, and ${\hat{C}}_{E}$ is guaranteed to be greater than 0.81−0.19 = 0.62.
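The instability can be reproduced with a small sketch that evaluates $-\int {\hat{S}}_{1} d {\hat{S}}_{2}$ directly from Kaplan–Meier curves. The function names are hypothetical, the last observed time in each group is treated as an event as described above, and ties between the two groups are ignored for simplicity.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier curve with the largest observed time treated as an event.

    Returns (jump_times, survival_after_jump) as parallel lists.
    """
    last = max(times)
    data = [(t, 1 if t == last else y) for t, y in zip(times, events)]
    jump_times, survival, s = [], [], 1.0
    for t in sorted({t for t, y in data if y == 1}):
        at_risk = sum(1 for u, _ in data if u >= t)
        deaths = sum(1 for u, y in data if u == t and y == 1)
        s *= 1.0 - deaths / at_risk
        jump_times.append(t)
        survival.append(s)
    return jump_times, survival


def survival_at(jump_times, survival, u):
    """Right-continuous step function S(u) = P(T > u)."""
    s = 1.0
    for t, v in zip(jump_times, survival):
        if t > u:
            break
        s = v
    return s


def efron_c(t1, y1, t2, y2):
    """Sum of S1-hat(t) times the drop in S2-hat at each jump time t of S2-hat."""
    jt1, s1 = kaplan_meier(t1, y1)
    jt2, s2 = kaplan_meier(t2, y2)
    total, previous = 0.0, 1.0
    for t, s in zip(jt2, s2):
        total += survival_at(jt1, s1, t) * (previous - s)
        previous = s
    return total


# No censoring, no ties: agrees with the Mann-Whitney estimate 7/12
c_complete = efron_c([5, 7, 9, 11], [1, 1, 1, 1], [3, 8, 10], [1, 1, 1])

# 90% censored near t = 10; only the identity of the longest survivor changes
early, late = [1.0] + [10.0] * 9, [2.0] + [10.0] * 9
flags = [1] + [0] * 9
c_group1_longest = efron_c(early[:-1] + [10.1], flags, late, flags)   # 0.90
c_group2_longest = efron_c(early, flags, late[:-1] + [10.1], flags)   # 0.09
```

In this made-up example, moving a single censored time from 10 to 10.1 in one group or the other swings the estimate between 0.09 and 0.90.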

3.3 Uno’s estimator of C_Z

Uno and colleagues¹³ developed a censored-pairs estimator of the concordance index (2) based on inverse probability weighting. Their solution uses a Kaplan–Meier estimate of the censoring distribution S_H, treating it as independent of Z and T_Z, and re-weights the censored-pairs contribution when t_i > t_j to be ${\hat{S}}_{H} (t_{j})^{- 2}$ , rather than one. The approach is justified by inspection of (3); the weighting cancels out the $P (H_{z_{1}} > T_{z_{2}}) P (H_{z_{2}} > T_{z_{2}})$ terms, so that it is (asymptotically) independent of the censoring distribution and converges to C_Z.
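The re-weighting can be sketched as follows, under the sign convention of (2) (a pair is concordant when the larger z has the longer time). The function names are hypothetical; the censoring survivor function is evaluated just before each event time, and the handling of ties between events and censorings is simplified, so this is an illustration and not the published implementation.

```python
def censoring_survival_before(times, events, u):
    """Kaplan-Meier estimate of P(H >= u), with censorings (y = 0) as the 'events'."""
    g = 1.0
    censor_times = sorted({t for t, y in zip(times, events) if y == 0 and t < u})
    for c in censor_times:
        at_risk = sum(1 for t in times if t >= c)
        censored = sum(1 for t, y in zip(times, events) if t == c and y == 0)
        g *= 1.0 - censored / at_risk
    return g


def ipcw_c(times, events, z, weighted=True):
    """Censored-pairs concordance with optional inverse-probability weights.

    Each usable pair (event at t_j, comparator with t_i > t_j) gets weight
    S_H-hat(t_j)^(-2); weighted=False gives the unweighted censored-pairs value.
    """
    num = den = 0.0
    for j in range(len(times)):
        if events[j] != 1:
            continue
        w = 1.0
        if weighted:
            w = censoring_survival_before(times, events, times[j]) ** -2
        for i in range(len(times)):
            if times[i] > times[j]:
                den += w
                if z[i] > z[j]:
                    num += w
                elif z[i] == z[j]:
                    num += 0.5 * w
    return num / den


times = [1, 2, 3, 4, 5, 6]
events = [1, 0, 1, 1, 0, 1]
z = [3, 1, 2, 5, 4, 6]
c_unweighted = ipcw_c(times, events, z, weighted=False)   # 7 / 10
c_weighted = ipcw_c(times, events, z)                     # 9.25 / 12.8125
```

In this made-up example the censoring at time 2 gives the later events a weight of 0.8⁻² = 1.5625, which moves the statistic from 0.70 to about 0.72. When everyone is censored at one common time the weights all equal one and the two versions coincide.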

However, the resulting estimator is only completely independent of the censoring distribution if, as above for the Efron estimator, the maximal follow up for all patients is to a time τ such that the marginal survival distribution S(τ) = P(T > τ) = 0. If not, then the censored-pairs approach will converge to a quantity greater than C_Z. Informally, this is because the individuals with high Z have the event first whether or not hazards also converge with time. More formally, this may be seen by re-expressing C_Z as

\begin{matrix} C_{Z} = \int_{0}^{\infty} C_{t} \frac{S (t)}{\int_{0}^{\infty} S (u) d u} d t \end{matrix}

(5)

where

S (t) = \int P (T > t | z) d F_{Z} (z)

and

\begin{matrix} C_{t} = \int \int \{P (z > z^{*}) + \frac{1}{2} P (z = z^{*})\} d F_{Z} (z | T > t) d F_{Z} (z^{*} | T = t) \end{matrix}

(6)

where

d F_{Z} (z^{*} | T = t) = λ (t | z^{*}) / \int λ (t | u) d F_{Z} (u | T = t)

and

d F_{Z} (z | T > t) = P (T > t | z) d F_{Z} (z) / S (t)

from Bayes’ rule. As t increases, the distribution of Z in those still at risk becomes weighted towards those with longer survival, and C_t decreases. When follow up is until t = τ, the censored-pairs concordance index converges to

\begin{matrix} \int_{0}^{τ} C_{t} \frac{S (t)}{\int_{0}^{τ} S (u) d u} d t \end{matrix}

and because C_t is decreasing this limit is greater than C_Z (anti-conservatively biased) unless S(τ) = 0. One can also see that the limit of Uno’s concordance index for τ close to the longest follow up will be less than Harrell’s version, since it gives relatively more weight to those C_t that are closer to t = τ.

3.4 Proportional-hazards model

A common approach is to estimate linear predictors of outcomes with censored event times using a proportional-hazards model. Here an estimator of the concordance index that does not depend on the censoring distribution or follow up was achieved by Gönen and Heller.¹⁶ If T_Z has hazard of form $λ (T | Z) = λ_{0} (T) g (Z)$ , then, because

\begin{matrix} P (T_{Z_{1}} > T_{Z_{2}}) = \frac{g (Z_{2})}{g (Z_{1}) + g (Z_{2})} \end{matrix}

(7)

we have from (2) that

\begin{matrix} C_{Z} = 2 \int_{z_{1} > z_{2}} \int \frac{g (z_{2})}{g (z_{1}) + g (z_{2})} d F_{Z} (z_{1}) d F_{Z} (z_{2}) + \frac{1}{2} P (Z_{1} = Z_{2}) \end{matrix}

(8)

where Z₁ and Z₂ are independent samples from distribution function F_Z. When $z = β_{1} x_{1} + \dots + β_{k} x_{k}$ is a linear combination of covariates x = (x₁, …, x_k) with coefficients $β = (β_{1}, \dots, β_{k})$, $g (.) = exp (.)$, and both T_Z and Z are continuous, the concordance index depends on the distribution of z and equals

\begin{matrix} C_{Z} = 2 \int_{z_{1} > z_{2}} \int \frac{1}{1 + exp (z_{1} - z_{2})} d F_{Z} (z_{1}) d F_{Z} (z_{2}) = 2 E [I (Z_{1} > Z_{2}) \{1 + exp (Z_{1} - Z_{2})\}^{- 1}] \end{matrix}

(9)

which is linked to T_Z only through the distribution of the coefficients β and covariates x. Equation (9) may be estimated by replacing F_Z with its empirical distribution so that

\begin{matrix} {\hat{C}}_{Z} = 2 \{N (N - 1)\}^{- 1} \sum_{i = 1}^{N - 1} \sum_{j = i}^{N} \frac{I ({\hat{z}}_{i} > {\hat{z}}_{j})}{1 + exp ({\hat{z}}_{i} - {\hat{z}}_{j})} \end{matrix}

(10)

where ${\hat{z}}_{i}$ uses the proportional-hazards estimates ${\hat{β}}_{1}, \dots, {\hat{β}}_{k}$, and similarly for the more general (8). Its variance is estimable from re-sampling methods or from asymptotic formulae¹⁶ which depend on the covariance matrix of β that is routinely available from the partial-likelihood methods of the proportional-hazards model.
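As an illustration, (10) can be evaluated from fitted linear predictors alone. The sketch below reads (10) as counting every unordered pair once with the larger ẑ taken first, which is the reading that matches (9); that reading and the function name are assumptions. Tied pairs contribute one half, in line with the last term of (8).

```python
import math


def proportional_hazards_c(z_hat):
    """Average over unordered pairs of 1 / (1 + exp(z_larger - z_smaller)).

    z_hat holds fitted linear predictors from a proportional-hazards model.
    A tied pair gives 1 / (1 + exp(0)) = 0.5, matching the tie term in (8).
    """
    n = len(z_hat)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            gap = abs(z_hat[i] - z_hat[j])
            total += 1.0 / (1.0 + math.exp(gap))
    return 2.0 * total / (n * (n - 1))


two_points = proportional_hazards_c([0.0, math.log(3.0)])   # 1 / (1 + 3) = 0.25
all_tied = proportional_hazards_c([1.0, 1.0, 1.0])          # 0.5
```

Under the sign convention of (9), where a larger z means a higher hazard, each term is at most one half; one minus the result is the probability that the higher-hazard member of a pair fails first. No survival times enter the calculation, which is why the estimate does not depend on the censoring pattern once β has been estimated.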

4 New estimators

4.1 Motivation

The methods reviewed above are not universal when the predictor loses strength with time, and may depend on the length of follow up. In particular, formulas (8) and (9) depend implicitly on the validity of the proportional-hazards assumption. Further developments would be useful because hazards are often observed to converge, so that the effect of a predictive factor diminishes as follow-up time increases. This issue is pervasive in applications.⁵ For example, in breast cancer epidemiology, many prognostic factors are based on characteristics of the tumour that lose relevance once an individual has survived a period of time.¹⁷ We next propose modifications to the Efron and the proportional-hazards estimators, before introducing a more parsimonious approach.

4.2 Modified two-sample estimator

Recall that when there is censoring due to incomplete follow up, Efron’s estimator may be very unstable. The following modification of Table 1 solves this problem by accounting for when the last time in each group is censored.

Denote $A_{i} = (t_{1 i} < T_{1} \leq t_{max})$ , $B_{j} = (t_{2 j} < T_{2} \leq t_{max})$ , $a = (T_{1} > t_{max})$ and $b = (T_{2} > t_{max})$ . Let $w_{1 i} = P (a | T_{1} > t_{1 i}) = S_{1} (t_{max}) / S_{1} (t_{1 i})$ and $w_{2 j} = S_{2} (t_{max}) / S_{2} (t_{2 j})$ , being respectively defined to be zero when $S_{1} (t_{max}) = 0$ or $S_{2} (t_{max}) = 0$ . Now when $y_{1 i} = y_{2 j} = 0$ , P(T₁ > T₂) may be partitioned as

\begin{matrix} P (T_{1} > T_{2} | A_{i}, B_{j}) P (A_{i}, B_{j}) + P (T_{1} > T_{2} | a, B_{j}) P (a, B_{j}) + P (T_{1} > T_{2} | a, b) P (a, b) \end{matrix}

since $P (T_{1} > T_{2} | A_{i}, b) = 0$. Then $Q (t_{1 i}, t_{2 j}, y_{1 i} = 0, y_{2 j} = 0)$ from Table 1 is redefined to be, for $t_{1 i} \geq t_{2 j}$,

\begin{matrix} \left\{1 - \frac{S_{2} (t_{1 i})}{S_{2} (t_{2 j})} + \frac{\int_{t_{1 i}}^{t_{max}} S_{1} (u) d F_{2} (u)}{S_{1} (t_{1 i}) S_{2} (t_{2 j})}\right\} (1 - w_{1 i}) (1 - w_{2 j}) + w_{1 i} (1 - w_{2 j}) + \frac{w_{1 i} w_{2 j}}{2} \end{matrix}

and, for $t_{1 i} < t_{2 j}$,

\begin{matrix} \left\{\frac{\int_{t_{2 j}}^{t_{max}} S_{1} (u) d F_{2} (u)}{S_{1} (t_{1 i}) S_{2} (t_{2 j})}\right\} (1 - w_{1 i}) (1 - w_{2 j}) + w_{1 i} (1 - w_{2 j}) + \frac{w_{1 i} w_{2 j}}{2} \end{matrix}

The terms are estimated by using Kaplan–Meier estimates of S₁(t) for $w_{1 i}$ and of S₂(t) for $w_{2 j}$; for example S₁(t_max) is the Kaplan–Meier estimate at the last non-censored time in the first group.

Like the original Efron estimator, the modified estimator is not a universal measure when censoring is due to incomplete follow up because it depends on t_max, but it is more stable than the Efron estimator because it does not depend on which group has the longest surviving censored member. It is not consistent for the concordance index if $S_{1} (t_{max}) > 0$ and $S_{2} (t_{max}) > 0$ but, in this case, clearly it is not possible to obtain a consistent estimator of the concordance index without making assumptions. However, one may obtain an estimate of the concordance index for different follow-up periods by varying t_max, where the modified estimator consistently estimates

C_{E} (t_{max}) = - \int_{0}^{t_{max}} S_{1} (u) d S_{2} (u)

Thus, one approach to facilitate comparisons between studies is to present the estimate of this for different values of t_max. This idea has been used in a similar context elsewhere,^6,13 and is considered further in later simulations (Figure 2) and an example (Figure 5).
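As a worked illustration of how this quantity varies with t_max, suppose (purely as an assumption for the example) that the two groups have exponential survival with constant hazards λ₁ and λ₂, so that $S_{1} (u) = e^{- λ_{1} u}$ and $S_{2} (u) = e^{- λ_{2} u}$. The integral then has a closed form:

```latex
\begin{aligned}
C_{E}(t_{max}) &= -\int_{0}^{t_{max}} S_{1}(u)\, dS_{2}(u)
  = \int_{0}^{t_{max}} e^{-\lambda_{1} u}\, \lambda_{2} e^{-\lambda_{2} u}\, du \\
 &= \frac{\lambda_{2}}{\lambda_{1} + \lambda_{2}}
    \left\{ 1 - e^{-(\lambda_{1} + \lambda_{2})\, t_{max}} \right\}
\end{aligned}
```

With λ₁ = 1 and λ₂ = 2 this gives approximately 0.518 at t_max = 0.5, and the value rises towards λ₂/(λ₁ + λ₂) = 2/3 as t_max increases.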

Figure 1.

Illustration of the effect of converging hazards and censoring on concordance index estimators. Solid lines (—) use the classical censored-pairs approach, and the proportional-hazards model estimator is dashed (– – –). The true concordance index for this model is the value obtained when there was no censoring (— black).

Figure 2.

Illustration of the effect of censoring on the two-group concordance statistic estimator. The lines show the concordance index under a Pareto model, with the γ parameter shown in the key.

Figure 3.

Concordance index estimates from simulations and true value (– – –). H: censored-pairs estimator; Ga: proportional-hazards estimator (10); Gb: hybrid proportional-hazards estimator based on (11); Pa: Pareto estimator using model fit; Pb: hybrid Pareto estimator using Table 1.

Figure 4.

Pareto model fit in example. Plot (a) is Schoenfeld partial residuals from a proportional-hazards (o) and Pareto model (end of line linked to o). Least squares trend lines of the residuals are shown for the proportional-hazards (—) and Pareto models (– –); the line at 0.5 indicates good model fit (- - -). Plot (b) compares the expected Ki67 at each event from the two models and least squares trend line. Plot (c) shows the fitted hazard ratios. Plot (d) is the estimated cumulative risk for a binarised Ki67 in the data (KM, Kaplan–Meier) and the models (— above median, – – – less than or equal to median).

Figure 5.

Plot of two-sample concordance index against type I censoring time (t_max) for binarized Ki67 and HER2 from the example. Point-wise 95% confidence intervals (empirical bootstrap) are also shown.

4.3 Modified proportional-hazards model estimator

A problem with the estimator of Gönen and Heller¹⁶ is that if there is no censoring but proportional hazards do not hold, then the estimator will not agree with the classical approach. A partial solution to this is to modify the approach of Efron and write

\begin{matrix} C_{EZ} = 2 \{N (N - 1)\}^{- 1} \sum_{i = 1}^{N - 1} \sum_{j = i}^{N} Q (t_{i}, t_{j}, y_{i}, y_{j}, z_{i}, z_{j}) \end{matrix}

(11)

where $Q (t_{i}, t_{j}, y_{i}, y_{j}, z_{i}, z_{j}) = P \{T_{i} > T_{j} | (t_{i}, y_{i}), (t_{j}, y_{j}), z_{i}, z_{j}\}$. Under a proportional-hazards model, C_EZ may be estimated via the terms in Table 1, but the proportional-hazards assumption is only needed to calculate the non-trivial terms and so the estimator agrees with the classical formula when there is no censoring. A further difference from the above is that it requires an estimate of the baseline survivor function S₀(t). This approach will be anti-conservatively biased when the data are censored and proportional hazards hold. It is intended for use when censoring is light and one would like robustness against large departures from proportional hazards.

One might consider allowing $λ (T | Z) = λ_{0} (T) g_{T} (Z)$ for time-varying hazards g_T. In this case

\begin{matrix} P (T_{z_{1}} > T_{z_{2}}) = \int_{0}^{\infty} λ_{0} (t) g_{t} (z_{2}) exp [- \int_{0}^{t} λ_{0} (s) \{g_{s} (z_{1}) + g_{s} (z_{2})\} d s] d t \end{matrix}

(12)

A concordance index based on this involves O(N²) evaluations of this double integral, which would need to be evaluated numerically. One also cannot use the model beyond the maximal follow-up time.

4.4 Pareto model

A parsimonious approach is to use a simple one-parameter model to account for varying degrees of convergence by introducing an unobserved additive covariate (frailty) to the proportional-hazards model, independent from other covariates, with a log-gamma distribution with mean one and variance γ.¹⁸ This leads to a transformation model based on the Pareto distribution, so that if the baseline hazard and cumulative hazard are given by λ₀(t) and Λ₀(t) respectively, then an individual with covariate $z = exp (β x')$ has survival function

\begin{matrix} S (t | z; γ) = 1 - F_{z, γ} (t) \\ = \{1 + γ z Λ_{0} (t)\}^{- 1 / γ} \end{matrix}

(13)

and hazard function

\begin{matrix} λ (t | z; γ) = z λ_{0} (t) \{1 + γ z Λ_{0} (t)\}^{- 1} \end{matrix}

(14)

This very flexible model has some attractive features. The hazard ratio is given by

\begin{matrix} \frac{λ (t | z_{1}; γ)}{λ (t | z_{2}; γ)} = \frac{1 + γ z_{2} Λ_{0} (t)}{z_{2} / z_{1} + γ z_{2} Λ_{0} (t)} \end{matrix}

so that a consequence of the frailty (γ > 0) is that the hazard ratio approaches one as t gets large. When γ = 0 there is no frailty and it becomes the proportional-hazards model; when γ = 1 it becomes the proportional-odds model.
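These properties are easy to verify numerically from (13) and (14). In the sketch below the function names are hypothetical, the cumulative baseline hazard Λ₀(t) is passed in directly as a number, and the hazard ratio is computed as the ratio of two evaluations of (14).

```python
import math


def pareto_survival(cum_hazard, z, gamma):
    """S(t | z; gamma) from (13); gamma = 0 is the proportional-hazards limit."""
    if gamma == 0.0:
        return math.exp(-z * cum_hazard)
    return (1.0 + gamma * z * cum_hazard) ** (-1.0 / gamma)


def relative_hazard(cum_hazard, z, gamma):
    """lambda(t | z; gamma) / lambda_0(t) from (14)."""
    return z / (1.0 + gamma * z * cum_hazard)


def hazard_ratio(cum_hazard, z1, z2, gamma):
    """Ratio of the hazards (14) for covariate values z1 and z2."""
    return relative_hazard(cum_hazard, z1, gamma) / relative_hazard(cum_hazard, z2, gamma)


ratio_start = hazard_ratio(0.0, 2.0, 1.0, gamma=1.0)     # 2.0 at t = 0
ratio_late = hazard_ratio(1e6, 2.0, 1.0, gamma=1.0)      # close to 1 for large Lambda_0(t)
ratio_ph = hazard_ratio(1e6, 2.0, 1.0, gamma=0.0)        # stays at 2.0 when gamma = 0
odds_half = pareto_survival(1.0, 1.0, gamma=1.0)         # 0.5: survival odds 1 / (z * Lambda_0)
```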

Technical aspects of estimation and inference are considered in the appendix.

4.4.1 Concordance index

Computation of the Pareto concordance index involves a formula with γ, the {Z} and the baseline cumulative hazard function $Λ_{0} (t)$

\begin{matrix} P (T_{z_{1}} > T_{z_{2}} | T_{z_{1}}, T_{z_{2}} > s) = \int_{s}^{\infty} \{1 + γ z_{1} Λ_{0} (t)\}^{- 1 / γ} z_{2} λ_{0} (t) \{1 + γ z_{2} Λ_{0} (t)\}^{- (1 + 1 / γ)} d t \\ = γ^{- 1} \int_{v}^{\infty} \{1 + (z_{1} / z_{2}) u\}^{- 1 / γ} (1 + u)^{- (1 + 1 / γ)} d u \end{matrix}

(15)

where $v = γ z_{2} Λ_{0} (s)$, and analysis of concordance index (2) can proceed as for the two previous approaches for proportional hazards. That is, the Pareto model can be used with $\{1 + exp (Z_{1} - Z_{2})\}^{- 1}$ in (9) replaced by (15) with s = 0, or via the hybrid approach replacing the non-trivial terms in Table 1 with the Pareto terms. The integral in (15) is needed for both approaches. Although it does not appear to be analytically tractable it may be estimated numerically, and it requires much less computation than (12).
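One way to evaluate (15) with s = 0 numerically is sketched below. It is not taken from the original work: the change of variable $w = (1 + u)^{- 1 / γ}$ is introduced here for convenience, because it maps the integral onto (0, 1) with a bounded integrand, $\int_{0}^{1} [1 + (z_{1} / z_{2}) (w^{- γ} - 1)]^{- 1 / γ} d w$. The function name and the choice of the midpoint rule are likewise assumptions.

```python
import math


def pareto_pair_probability(z1, z2, gamma, steps=4000):
    """P(T_{z1} > T_{z2}) from (15) with s = 0, by the midpoint rule.

    Uses w = (1 + u)^(-1/gamma), so the integrand is
    [1 + (z1/z2) * (w^(-gamma) - 1)]^(-1/gamma) on (0, 1).
    Very large gamma may overflow in w ** -gamma.
    """
    ratio = z1 / z2
    if gamma == 0.0:
        return 1.0 / (1.0 + ratio)        # proportional-hazards case, as in (7)
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        w = (k + 0.5) * width             # midpoints avoid the endpoint w = 0
        total += (1.0 + ratio * (w ** -gamma - 1.0)) ** (-1.0 / gamma)
    return total * width


p_odds = pareto_pair_probability(2.0, 1.0, gamma=1.0)      # 2 * ln(2) - 1, about 0.386
p_equal = pareto_pair_probability(1.5, 1.5, gamma=6.6)     # 0.5 when z1 = z2
p_forward = pareto_pair_probability(2.0, 1.0, gamma=6.6)
p_reverse = pareto_pair_probability(1.0, 2.0, gamma=6.6)   # p_forward + p_reverse is about 1
```

For γ = 1 the transformed integrand reduces to w/(2 − w) when z₁/z₂ = 2, whose integral is 2 ln 2 − 1, which gives a simple check on the quadrature. Each evaluation is a single one-dimensional integral that depends only on γ and the ratio z₁/z₂.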

4.4.2 Goodness-of-fit

We lastly consider model goodness-of-fit, partly because the Pareto concordance index is not needed when a proportional-hazards assumption is appropriate. One method is an asymptotic score test for when a Pareto model is taken as the alternative hypothesis to proportional hazards.¹⁹ Another approach in this line is to apply a likelihood-ratio test for γ = 0,²⁰ with adjustment for model-boundary testing.²¹ Schoenfeld residuals²² are sometimes used, and in the general setting are defined for all $i = 1, \dots, N$ when a non-censored event occurred (y_i = 1) to be

\begin{matrix} {\hat{s}}_{i} = x_{i} - \hat{E} (x | t \geq t_{i}) \end{matrix}

where

\begin{matrix} \hat{E} (x | t \geq t_{i}) = \frac{\sum_{j = 1}^{N} I (t_{j} \geq t_{i}) \hat{λ} (t_{i} | x_{j}) x_{j}}{\sum_{j = 1}^{N} I (t_{j} \geq t_{i}) \hat{λ} (t_{i} | x_{j})} \end{matrix}

and $\hat{λ} (t_{i} | x_{j})$ are model estimates. These residuals show the difference between the observed and expected covariate at each event time, and have expectation zero if the model is correct. Plots of ŝ_i against t_i and fitted trends may help to identify departures from the model, and a chi-squared test based on scaled residuals is commonly used to test a proportional-hazards assumption,²³ without taking a Pareto model as the alternative. Because Schoenfeld residuals were designed to check the proportional-hazards assumption, a direct comparison with the Pareto model will help assess whether it satisfactorily addressed lack of fit. A related goodness-of-fit test is to use partial residuals $\hat{P} (x \geq x_{i})$ defined as²²

\begin{matrix} {\hat{r}}_{i} = \frac{\sum_{j = 1}^{N} I (t_{j} \geq t_{i}) \hat{λ} (t_{i} | x_{j}) I (x_{j} > x_{i})}{\sum_{j = 1}^{N} I (t_{j} \geq t_{i}) \hat{λ} (t_{i} | x_{j})} \end{matrix}

(16)

Under the model these should be distributed uniformly between zero and one, independently of t_i. Empirical distribution function goodness-of-fit tests²⁴ could be used to assess the distribution of r_i in early and late periods.
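A minimal sketch of both residuals for a single covariate under a fitted proportional-hazards model is given below. It assumes the coefficient `beta` has already been estimated elsewhere; under proportional hazards the baseline hazard cancels from the ratios, so $\hat{λ} (t_{i} | x_{j})$ is replaced by exp(βx_j). The function name is hypothetical.

```python
import math


def ph_residuals(times, events, x, beta):
    """Return (t_i, s_i, r_i) for every uncensored event.

    s_i is the Schoenfeld residual x_i - E-hat(x | t >= t_i) and
    r_i is the partial residual (16), the model-weighted fraction of the
    risk set with a covariate larger than x_i.
    """
    results = []
    for i, (t_i, y_i) in enumerate(zip(times, events)):
        if y_i != 1:
            continue                                   # defined at events only
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(beta * x[j]) for j in risk_set]
        total = sum(weights)
        expected_x = sum(w * x[j] for w, j in zip(weights, risk_set)) / total
        above = sum(w for w, j in zip(weights, risk_set) if x[j] > x[i]) / total
        results.append((t_i, x[i] - expected_x, above))
    return results


# Illustrative data; with beta = 0 every member of the risk set has equal weight
residuals = ph_residuals([1.0, 2.0, 3.0], [1, 1, 0], [3.0, 1.0, 2.0], beta=0.0)
# [(1.0, 1.0, 0.0), (2.0, -0.5, 0.5)]
```

For a Pareto fit the weights would instead come from (14), which also needs an estimate of the cumulative baseline hazard at each event time.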

5 Simulations

5.1 Bias

A simulation was used to demonstrate issues with existing methodology when there are converging hazards. Twenty-thousand individuals were simulated with survival times from a Pareto distribution; the rate for an individual was the exponential of a random normal covariate with unit mean and variance, multiplied by a frailty sampled from a gamma distribution with mean one and variance γ. Type I censoring was considered, so that events occurred before a maximal follow-up time based on the expected proportion censored. For exposition we show 90%, 50% and 20% censoring. For ∼10-year follow up, heavy censoring might correspond to survival such as for distant recurrence in women diagnosed with estrogen-receptor positive breast cancer;²⁵ mid-range censoring (∼50%) might be seen for survival following disease such as an acute myocardial infarction event;⁷ light censoring occurs when survival rates are low, for example, for survival following complete resection of non-small-cell lung cancer.⁵ In all simulation scenarios there is no difference between the censored-pairs estimators of Harrell or Uno because everyone is censored at the same time. Concordance indices using a proportional-hazards model and the censored-pairs statistic were calculated and compared with the true index, obtained using a simulation without censoring.
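A data-generating step of this kind might be sketched as follows. This is not the authors' simulation code: it assumes a unit baseline hazard (so Λ₀(t) = t), a standard normal covariate, and hypothetical names throughout. Given the frailty, the survival time is exponential, and averaging over the gamma frailty gives the Pareto survival function (13).

```python
import math
import random


def simulate_pareto(n, beta, gamma, t_max, seed=1):
    """Draw (times, events, x) with gamma frailty and type I censoring at t_max.

    Assumes Lambda_0(t) = t and x ~ N(0, 1); both are choices made for this sketch.
    """
    rng = random.Random(seed)
    times, events, xs = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # gammavariate(shape, scale): mean = shape * scale = 1, variance = gamma
        frailty = rng.gammavariate(1.0 / gamma, gamma) if gamma > 0.0 else 1.0
        rate = math.exp(beta * x) * frailty
        t = rng.expovariate(rate) if rate > 0.0 else math.inf
        times.append(min(t, t_max))
        events.append(1 if t <= t_max else 0)
        xs.append(x)
    return times, events, xs


# With beta = 0 and gamma = 1, S(t) = 1 / (1 + t), so about half are censored at t_max = 1
times, events, xs = simulate_pareto(20000, beta=0.0, gamma=1.0, t_max=1.0)
censored_fraction = 1.0 - sum(events) / len(events)
```

The follow-up time t_max for a target censoring proportion can be found by solving the marginal survival function for the required level; in the special case above it is available in closed form.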

The results in Figure 1 show that for this model the proportional-hazard estimate was conservative when there was no censoring, but had positive bias when censoring was more than about 50%. The classical estimator substantially overestimated the concordance index when censoring was 50% or more; this bias was more pronounced for heavy censoring as the frailty variance γ increased.

A second simulation was used to demonstrate the dependence of the two-sample estimator on follow up. Ten-thousand individuals were simulated in two groups, with survival time from an exponential distribution with rate one or two, compounded with a gamma frailty with variance γ, which was chosen to show the effect of a change from constant hazards (γ = 0) to when they converge very quickly (γ = 20). Censoring was generated by allowing individuals to be enrolled into a study at different times according to a uniform distribution on [0.00, 0.05], and then they were censored at a maximum follow-up time. The results in Figure 2 show that the two-sample statistic was conservatively biased when there was heavy censoring. Considering the chart from right (heavy censoring due to short follow up) to left (no censoring), one can see that the concordance index estimate increased with more follow up (later censoring) until the covariate had ceased to influence survival due to converging hazards. The plot shows that the statistic is actually better when there are converging hazards, since it will converge to the true value with less follow up.

5.2 Comparison of estimators

A final simulation was used to compare estimators of C_Z. Survival times were from a Pareto distribution with rate given by the exponential of a standard random normal covariate (x) multiplied by 0.7 (i.e. z = exp(βx) with β = 0.7) and compounded by a frailty sampled from a gamma distribution with mean one and variance γ. Two choices of γ were considered (1.0 and 6.6) and three levels of censoring (follow up to time with expected censoring percentage 87%, 50% and 20%). The sample size was 1125 and 500 replications were used. The Pareto model was fitted by maximizing the profile likelihood (see Appendix).

The reason for choosing β = 0.7, γ = 6.6, 87% censoring and n = 1125 is that these correspond to an example in the next section (Table 3(b), Ki67). We also considered γ = 1 in order to assess a scenario where the proportional-hazards assumption is violated more slowly, and partly for theoretical interest because it corresponds to a proportional-odds model. The censoring levels were varied to help assess the estimators as more follow up is accrued.

The distribution of estimated concordance indices is shown in Figure 3. The concordance-index estimates from a Pareto model were substantially less biased than the other methods with heavy censoring (Table 2). The Pareto estimator was biased for heavy censoring at this sample size because it fits a proportional-hazards model where there is insufficient power to detect non-proportional hazards. Harrell’s statistic and the modified proportional-hazards statistic became less biased as the level of censoring decreased. The Pareto estimator had a lower mean squared error than the other estimators (Table 2).

Table 2.

Simulation estimation results for two scenarios of γ.

	Mean bias (×100)			MSE (×100)
Censoring:	87%	50%	20%	87%	50%	20%
γ = 1 (proportional odds)
Censored pairs	4.8	2.1	0.5	28.4	5.7	1.2
PH-fit	3.2	0.4	−1.6	13.7	1.4	3.3
PH-hybrid	3.3	1.1	0.1	14.1	2.4	0.9
Pareto-fit	0.6	0.0	−0.1	10.5	1.3	0.8
Pareto-hybrid	0.6	0.0	−0.1	10.7	1.4	0.9
γ = 6.6
Censored pairs	8.3	1.8	0.2	75.5	4.8	1.0
PH-fit	6.8	0.5	−1.3	50.3	1.7	2.6
PH-hybrid	6.9	1.2	0.1	51.8	3.0	1.0
Pareto-fit	1.7	0.4	0.5	9.4	0.9	0.9
Pareto-hybrid	1.6	0.1	0.0	9.2	0.9	0.9

MSE: mean squared error; PH-fit: proportional-hazards estimator (10); PH-hybrid: proportional-hazards estimator based on (11); Pareto-fit: estimate using model fit only; Pareto-hybrid: Pareto model estimator using Table 1.

Some differences were seen between a proportional-hazards concordance index based solely on model fit and the hybrid approach using Table 1. As expected the hybrid approach worked best for light censoring. It was worse under 50% censoring for the proportional-hazards model because it shifted the estimate towards the Harrell estimate, and the censored-pairs estimators are expected to be anti-conservative unless follow up is to a point where survival is zero (cf. Figure 1). Thus, we do not recommend the hybrid approach unless censoring is light.

6 Example

The example uses a sample of 1125 women with oestrogen-receptor positive breast cancer, of whom 145 had a distant recurrence after a median 8.5 years of follow up in a clinical trial (ATAC trial, ISRCTN registration number ISRCTN18233230). This sample from the transATAC study (approved by the South-East London Research Ethics Committee (REC ref no. 971037)) was previously used to show that some immunohistochemical (IHC) biomarkers added useful information to classical clinical prognostic factors.²⁵ For demonstration and insight we focus next on some of the individual biomarkers used in the IHC risk score. We do not present results from the hybrid estimators because censoring is heavy, but there was little difference because model assumptions dominate the calculations (87% of women were censored).

Table 3 shows some univariate concordance index estimates. The following points are of note. First, the two-sample estimates differed from the other forms of concordance index. Second, Harrell’s and Uno’s statistics were closer to each other than the proportional-hazards and Pareto model statistics. This is likely due to the bias from follow up, as discussed earlier. Third, Pareto estimates were substantially lower than those from the proportional-hazards model when $\hat{γ} \neq 0$, reflecting an assumption of converging hazards. Finally, the concordance indices of binarised predictors were lower than those of their continuous counterparts, due to the information lost by dichotomising.

Table 3.

Estimated univariate concordance indices and model coefficients from example.

	Grade	HER2	Nodes	Ki67	ER
(a) Binary predictor
2-sample	0.57	0.61	0.59	0.55	0.53
Harrell	0.59	0.57	0.63	0.61	0.56
Uno	0.58	0.57	0.63	0.58	0.56
PH	0.57	0.54	0.60	0.59	0.56
Pareto	0.53	*	*	0.53	*
PH $\hat{β}$ (LR-χ²)	0.9 (24.9)	1.1 (23.1)	1.2 (47.5)	0.8 (21.6)	−0.5 (7.8)
Pareto $\hat{β}$ (LR-χ²)	1.3 (27.0)	*	*	1.4 (25.2)	*
$\hat{γ}$ (LR-χ²)	4.0 (2.1)	0.0 (0.0)	0.0 (0.0)	8.7 (3.6)	0.0 (0.0)
(b) Continuous predictor
Harrell			0.65	0.64	0.57
Uno			0.64	0.62	0.58
PH			0.61	0.63	0.57
Pareto			*	0.55	0.54
PH $\hat{β}$ (LR-χ²)			1.0 (72.7)	0.4 (31.8)	−0.2 (11.5)
Pareto $\hat{β}$ (LR-χ²)			*	0.7 (35.2)	−0.2 (12.0)
$\hat{γ}$ (LR-χ²)			0.0 (0.0)	6.6 (3.5)	2.8 (0.4)

PH: using proportional-hazards assumption and (10); Grade: moderate or worse; HER2: positive; Nodes: lymph node positive or number of nodes (ordinal: 0, 1–3, > 4); Ki67: above median or continuous marker; ER: oestrogen-receptor score above median or continuous; LR-χ²: likelihood-ratio statistic; $\hat{β}$ : estimated regression coefficient for predictor; * indicates when Pareto model fit was proportional hazards.

To explore further we focus on Ki67, whose Pareto concordance index estimate was 0.552 (standard error (SE) 0.0156) compared with 0.631 (SE 0.0210) under a proportional-hazards assumption, 0.644 (SE 0.0220) for Harrell’s estimator and 0.624 (SE 0.0213) for Uno’s adjusted version. Ki67 showed evidence of a departure from proportional hazards, seen informally by inspection of Table 4. More formally, a likelihood-ratio test (Table 3) that γ = 0 had P = 0.03 (after correction for model-boundary testing²¹); a different test for non-proportionality²³ yielded χ₁² = 4.16, P = 0.04. Schoenfeld partial residuals in Figure 4(a) show that allowing for converging hazards via a Pareto model improved the residuals at the start and end. Figure 4(b) helps to show why: the expected value of Ki67 for events decreased more rapidly than expected under a proportional-hazards assumption. Figure 4(c) shows the fitted hazard ratio from the Pareto model, which approximately halved over the period. Figure 4(d) demonstrates that a Pareto model for a binary Ki67 predictor better matched the Kaplan–Meier estimates than a proportional-hazards model.
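The two P-values quoted above can be checked approximately from the reported statistics. The sketch below is illustrative and is not the code used for the analysis. It assumes that the boundary correction²¹ is the usual 50:50 mixture of a point mass at zero and a χ₁² distribution, so that the naive χ₁² tail probability is halved, and it takes the likelihood-ratio statistic for γ from the continuous Ki67 column of Table 3 (3.5). The function names are hypothetical.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # With 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lr_pvalue(lr_stat: float) -> float:
    """P-value for a likelihood-ratio test of a parameter on the boundary.

    Assumption: the null distribution is a 50:50 mixture of a point mass
    at zero and chi-squared(1), so the usual tail probability is halved.
    """
    if lr_stat <= 0.0:
        return 1.0
    return 0.5 * chi2_sf_1df(lr_stat)


# Likelihood-ratio statistic for gamma, continuous Ki67 (Table 3).
p_gamma = boundary_lr_pvalue(3.5)

# Non-proportionality test statistic quoted in the text (1 df, no boundary issue).
p_nonprop = chi2_sf_1df(4.16)

print(f"boundary-corrected LR test: P = {p_gamma:.2f}")
print(f"non-proportionality test:   P = {p_nonprop:.2f}")
```

Under these assumptions both values agree with the reported P = 0.03 and P = 0.04 to two decimal places. Using the binary Ki67 statistic from Table 3 (3.6) instead of 3.5 gives the same rounded result.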

Table 4.

Number of events in each year, split by Ki67 median (low/high).

Year	Low Ki67	High Ki67	Ratio
1	4	8	2.0
2	3	14	4.7
3	3	16	5.3
4	5	10	2.0
5	4	13	3.2
6	4	12	3.0
7	9	9	1.0
8	8	10	1.2
9	6	5	0.8
10	2	0	0.0
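The pattern in Table 4 can be summarised by pooling years. The sketch below is a crude illustration only: it uses the counts transcribed from Table 4, the early (years 1 to 3) and late (years 7 to 10) windows are chosen here for illustration, and a ratio of event counts is not a hazard ratio because it ignores the numbers at risk in each group. The function name is hypothetical.

```python
# Events per year from Table 4: (year, low Ki67, high Ki67).
events = [
    (1, 4, 8), (2, 3, 14), (3, 3, 16), (4, 5, 10), (5, 4, 13),
    (6, 4, 12), (7, 9, 9), (8, 8, 10), (9, 6, 5), (10, 2, 0),
]


def pooled_counts(rows):
    """Return (low events, high events, high/low ratio) pooled over the given rows."""
    low = sum(r[1] for r in rows)
    high = sum(r[2] for r in rows)
    return low, high, high / low


overall = pooled_counts(events)
early = pooled_counts([r for r in events if r[0] <= 3])   # years 1-3
late = pooled_counts([r for r in events if r[0] >= 7])    # years 7-10

for label, (low, high, ratio) in [("all years", overall), ("years 1-3", early), ("years 7-10", late)]:
    print(f"{label:>10}: low={low:3d} high={high:3d} ratio={ratio:.2f}")
```

The pooled counts sum to the 145 distant recurrences in the sample. The high-to-low ratio of events is well above one in the early window and close to one in the late window, which is the informal evidence of converging hazards referred to in the text.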

A goodness-of-fit test of the Pareto model is suggested by Figure 4(a), where most of the change in partial residuals between the proportional-hazards and Pareto models was in the first and last three years. A two-sample Kolmogorov–Smirnov test of equality in distribution between the residuals in years ≤ 3 vs > 6 rejected the proportional-hazards model (D = 0.28, two-sided P = 0.03). The trend line shows that the Pareto model fitted somewhat better, and the same test did not reject the Pareto model (D = 0.22, P = 0.17). Thus the data gave some support to the Pareto model fit, which was certainly better than the proportional-hazards fit, and hence to its concordance index estimate, which was lower than that from the proportional-hazards model or the other approaches.

Figure 5 plots the two-sample concordance index for binarised Ki67 by censoring time. The concordance index increased, and then appeared to plateau after six years. Thus one might surmise that, because of converging hazards, the two-sample estimate from 10-year follow up is unlikely to increase for this variable with further follow up (cf. Figure 2). HER2 positivity is included for comparison; its estimated concordance index increased with follow up, in better agreement with a proportional-hazards assumption.
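To make the idea of converging hazards concrete, the following sketch evaluates a hazard ratio that decays with follow up. It rests on an assumption that should be checked against the model definition: that the Pareto model has survivor function S(t | z) = {1 + γ exp(βz) Λ₀(t)}^(−1/γ), which reduces to proportional hazards as γ → 0 and to proportional odds at γ = 1. Under that form the hazard ratio for z = 1 versus z = 0 is exp(β){1 + γΛ₀(t)} / {1 + γ exp(β)Λ₀(t)}. The values of β and γ are the binary Ki67 estimates from Table 3; the grid of baseline cumulative hazards and the function name are hypothetical.

```python
import math


def pareto_hazard_ratio(beta: float, gamma: float, cum_base_hazard: float) -> float:
    """Hazard ratio for z = 1 versus z = 0 at baseline cumulative hazard Lambda0(t).

    Assumed model (not confirmed by this section of the paper):
        S(t | z) = (1 + gamma * exp(beta * z) * Lambda0(t)) ** (-1 / gamma)
    which gives hazard exp(beta * z) * lambda0(t) / (1 + gamma * exp(beta * z) * Lambda0(t)).
    """
    e = math.exp(beta)
    return e * (1.0 + gamma * cum_base_hazard) / (1.0 + gamma * e * cum_base_hazard)


beta_hat, gamma_hat = 1.4, 8.7          # binary Ki67, Table 3
grid = [0.0, 0.02, 0.05, 0.10]          # hypothetical values of Lambda0(t)

ratios = [pareto_hazard_ratio(beta_hat, gamma_hat, L) for L in grid]
for L, hr in zip(grid, ratios):
    print(f"Lambda0 = {L:.2f}: hazard ratio = {hr:.2f}")
```

At Λ₀ = 0 the ratio equals exp(β), it decreases as the baseline cumulative hazard accumulates, and it tends to one in the limit. Setting γ = 0 in the function returns exp(β) at every time point, which is the proportional-hazards case.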

7 Conclusion

The concordance index is routinely used to measure how well a variable predicts the time to a censored event. However, current estimators depend on the extent of follow up, and many predictors of survival data lose their discriminatory power as follow-up time increases. To account for this phenomenon we developed a concordance index based on a Pareto model. This semi-parametric model accounts for converging hazards, but leaves the baseline hazard function unspecified. In simulations under the model it was substantially less biased than other estimators. In a breast-cancer application the ordering of prognostic biomarker concordance index estimates changed when converging hazards were modelled, reflecting that some predictors are more useful for longer-term predictions than others. Our semi-parametric concordance index estimator is recommended for predictors of censored survival data when there is evidence of converging hazards.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by Cancer Research UK (grant number C569/A16891).

References

1. Choodari-Oskooei, Royston, Parmar. A simulation study of predictive ability measures in a survival model I: Explained variation measures. Stat Med 2012; 31(23): 2627–2643.

2. Choodari-Oskooei, Royston, Parmar MKB. A simulation study of predictive ability measures in a survival model II: Explained randomness and predictive accuracy. Stat Med 2012; 31(23): 2644–2659.

3. Mann, Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 1947; 18(1): 50–60.

4. Hanley, McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143(1): 29–36.

5. Schemper, Wakounig, Heinze. The estimation of average hazard ratios by weighted Cox regression. Stat Med 2009; 28(19): 2473–2489.

6. Martinussen, Pipper. Estimation of odds of concordance based on the Aalen additive model. Lifetime Data Anal 2013; 19(1): 100–116.

7. Martinussen, Pipper. Estimation of causal odds of concordance using the Aalen additive model. Scand J Statist 2014; 41(1): 141–151.

8. Brentnall, Cuzick, Field, et al. A concordance index for matched case-control studies with applications in cancer risk. Stat Med 2015; 34(3): 396–405.

9. Harrell, Lee, Mark. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996; 15(4): 361–387.

10. Gehan. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 1965; 52: 203–223.

11. Harrell, Califf, Pryor, et al. Evaluating the yield of medical tests. JAMA - J Am Med Assoc 1982; 247(18): 2543–2546.

12. Efron B. The two sample problem with censored data. In: Le Cam LM and Neyman J (eds) Fifth Berkeley Symposium on Mathematical Statistics and Probability, Statistical Laboratory of the University of California, Berkeley, 21 June–18 July 1965 and 27 December 1965–7 January 1966, p.666. Berkeley, Calif: University of California Press, 1967, ISSN: 0097-0433, http://projecteuclid.org/euclid.bsmsp/1200512974.

13. Uno, Cai, Pencina, et al. On the c-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 2011; 30(10): 1105–1117.

14. Kaplan, Meier. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53(282): 457–481.

15. Koziol, Jia. The concordance index c and the Mann–Whitney parameter Pr(X > Y) with randomly censored data. Biometrical J 2009; 51(3): 467–474.

16. Gönen, Heller. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 2005; 92(4): 965–970.

17. Sestak, Cuzick. Markers for the identification of late breast cancer recurrence. Breast Cancer Res 2015; 17(1): 10+.

18. Clayton D and Cuzick J. The semi-parametric Pareto model for regression analysis of survival times. In: Collected papers on semiparametric models at the centenary session of the International Statistical Institute, pp.19–30. Amsterdam: Centrum voor Wiskunde en Informatica.

19. Clayton, Cuzick. Multivariate generalizations of the proportional hazards model. J Roy Stat Soc A 1985; 148(2): 82–117.

20. Murphy, van der Vaart. On profile likelihood. J Am Stat Assoc 2000; 95(450): 449–465.

21. Self, Liang. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Stat Assoc 1987; 82(398): 605–610.

22. Schoenfeld. Partial residuals for the proportional hazards regression model. Biometrika 1982; 69(1): 239–241.

23. Grambsch, Therneau. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 1994; 81(3): 515–526.

24. Stephens. EDF statistics for goodness of fit and some comparisons. J Am Stat Assoc 1974; 69(347): 730–737.

25. Cuzick, Dowsett, Pineda, et al. Prognostic value of a combined estrogen receptor, progesterone receptor, Ki-67, and human epidermal growth factor receptor 2 immunohistochemical score and comparison with the Genomic Health recurrence score in early breast cancer. J Clin Oncol 2011; 29(32): 4273–4278.

26. Therneau, Grambsch, Pankratz. Penalized survival models and frailty. J Comput Graph Stat 2003; 12(1): 156–175.

27. Therneau TM. A package for survival analysis in S. R package version 2.37-7, 2014.

28. Zeng, Lin. Maximum likelihood estimation in semiparametric regression models with censored data. J Roy Stat Soc B 2007; 69(4): 507–564.

29. Kosorok, Lee, Fine. Robust inference for univariate proportional hazards frailty regression models. Ann Stat 2004; 32(4): 1448–1491.

30. Cheng, Huang. Bootstrap consistency for general semiparametric M-estimation. Ann Stat 2010; 38(5): 2884–2915.

31. Murphy, van der Vaart. Observed information in semi-parametric models. Bernoulli 1999; 5(3): 381–412.

32. Dixon, Kosorok, Lee. Functional inference in semiparametric models using the piggyback bootstrap. Ann Inst Statist Math 2005; 57(2): 255–277.

33. Lee, Kosorok, Fine. The profile sampler. J Am Stat Assoc 2005; 100(471): 960–969.

Use of the concordance index for predictors of censored survival data
