Abstract
Background:
Although the analysis of event-based clinical trials commonly relies on assumptions about the underlying hazard functions, in practice it is rare to see estimates of those functions.
Methods:
I describe conventional and novel methods for estimating the hazard function using discrete and discretized continuous survival models. The conventional approach involves parametric modeling; the novel approach applies Bayesian model averaging to flexible modeling by splines or fractional polynomials. I evaluate the methods in a Monte Carlo study and illustrate them in the analysis of three historical clinical trials.
Results:
Although flexible models can capture features of the hazard functions—such as multimodality—that parametric models miss, they are not foolproof. Spline modeling was generally the most reliable, in the sense of yielding good coverage probabilities for the mean and median with modest loss of efficiency. In the examples, the discreteness of the measurements—days, weeks, or months—had little effect on the shape of estimated hazard functions. All three data sets showed some evidence of departure from the proportional hazards assumption, but in only one did a test for proportionality detect this departure.
Conclusion:
Flexible parametric models, estimated in the Bayesian model averaging framework, offer a robust approach to recovering the shape of the hazard function. Analyses of three clinical trial databases suggest that visualization of the hazard function can be a valuable adjunct to conventional survival analysis.
Introduction
In event-based clinical trials, it is common to base analyses of survival outcomes on the Cox proportional hazards model.1 The simplifying assumption of proportionality enables deployment of a partial likelihood analysis that avoids estimating the hazard functions. It is typical to interpret the proportionality factor—the hazard ratio—as a treatment effect that measures the relative risk of an event at any point in the follow-up.
Recognizing that proportional hazards is unlikely to hold exactly, statisticians have devised alternative versions of the hazard ratio.2,3 Most significantly, the “Cox model average hazard ratio” is a censoring-dependent parameter that the partial likelihood analysis consistently estimates. It is generally taught that the estimand in a Cox analysis is a hazard ratio function averaged over time.
Graphical analyses of event-based trials typically present the data as a comparison of survival curves.4 This analysis is reliable and straightforward, because the Kaplan–Meier curve consistently estimates the underlying survival curve under mild assumptions and is readily interpretable. As the survival curve is fully equivalent to the hazard function—in the sense that knowing the survival function one can calculate the hazard function, and vice versa—there is no theoretical reason to prefer one to the other. Yet despite the common use of the hazard ratio as a touchstone for model-based analysis, it is rare to see estimated hazard functions from clinical trials.
To see why, consider Figure 1, which displays the nonparametric estimate of the hazard function in the medical arm of the REMATCH trial.5 Spikes occur at event times, with height equal to the fraction of subjects in the risk set who had an event at that time. To extract underlying structure from such graphs, investigators have proposed methods for estimating a smooth hazard function from censored data.6–23 An important early example is the method of Efron,9 in which one estimates the dependence of the hazard on time by logistic regression on a spline predictor. Alternatively, one can directly smooth the nonparametric hazard function.19

Figure 1. Nonparametric estimate of the hazard function in the medical arm of the REMATCH trial.
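To make Figure 1's construction concrete, here is a minimal R sketch that computes the raw discrete hazard from a small hypothetical sample (not the REMATCH data) using the survival package; the spike heights are simply the events divided by the number at risk at each observed time.

    # Raw (unsmoothed) discrete hazard from right-censored data.
    library(survival)
    time  <- c(5, 8, 8, 12, 20, 25, 25, 30)  # illustrative survival times (days)
    event <- c(1, 1, 0, 1, 1, 1, 0, 1)       # 1 = event observed, 0 = censored
    fit <- survfit(Surv(time, event) ~ 1)
    raw_hazard <- fit$n.event / fit$n.risk   # fraction of risk set failing at each time
    plot(fit$time, raw_hazard, type = "h", xlab = "Day", ylab = "Hazard")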
Unfortunately, such estimates can be sensitive to spline degree, the placement of knots, and tuning parameters. Thus, model selection is consequential, and measures of variability should properly reflect model uncertainty. Although there are many such estimation methods, including some implemented in popular software, clinical trial applications remain rare.
This article revisits the problem of estimating hazard functions in an event-based clinical trial. I extend Efron's9 method by incorporating fractional polynomials,12,17 and I account for model selection by Bayesian model averaging.24,25 In a simulation study, I observe that flexible logistic regression with Bayesian model averaging performs nearly as well as a correctly specified parametric model and better than an incorrectly specified parametric model.
Analyses of three clinical trial examples reveal substantial departures from proportional hazards that significance testing detects in only one.
Methods
Notation
Assume a two-arm clinical trial involving $n$ subjects. The variable $T_i$ denotes the discrete survival time of subject $i$, taking values $t = 0, 1, 2, \ldots$ and possibly right-censored. Elements of the statistical model are the hazard function
$$\lambda(t \mid \theta) = \Pr(T = t \mid T \ge t; \theta),$$
the survival function
$$S(t \mid \theta) = \Pr(T \ge t \mid \theta), \quad (1)$$
and the probability mass function
$$f(t \mid \theta) = \Pr(T = t \mid \theta),$$
all defined on $t = 0, 1, 2, \ldots$. One can express the survival function in terms of the hazard as $S(t \mid \theta) = \prod_{s<t}\{1 - \lambda(s \mid \theta)\}$ and the probability mass function as
$$f(t \mid \theta) = \lambda(t \mid \theta)\,S(t \mid \theta). \quad (2)$$
Here, θ stands for an unknown model parameter.
My definition of the survival function (≥ rather than the usual >) in (1) permits writing the density simply as in (2). Cox1 (p. 187) also defined the survival function this way.
Standard parametric models
I estimated two kinds of standard parametric models: discrete models and discretized continuous models. With a continuous underlying density $g(t \mid \theta)$ and survival function $G(t \mid \theta) = \int_t^{\infty} g(u \mid \theta)\,du$, the discretized model takes
$$f(t \mid \theta) = G(t \mid \theta) - G(t + 1 \mid \theta), \qquad S(t \mid \theta) = G(t \mid \theta),$$
for $t = 0, 1, 2, \ldots$, consistent with the convention in (1).
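As a concrete illustration, the following R sketch discretizes a Weibull model onto a daily grid under this convention; the parameter values are arbitrary, and the discretization rule shown is an assumption consistent with (1), not necessarily the paper's exact implementation.

    # Discretizing a continuous Weibull model onto days t = 0, 1, 2, ...
    shape <- 1.5; scale <- 400                # illustrative parameters
    t <- 0:1500
    G <- function(u) pweibull(u, shape, scale, lower.tail = FALSE)  # continuous survival
    f <- G(t) - G(t + 1)                      # discrete mass: failure in [t, t + 1)
    S <- G(t)                                 # discrete survival Pr(T >= t)
    lambda <- f / S                           # discrete hazard, as in (2)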
Flexible parametric models
With discrete survival, it is convenient to model the event hazard at time t by logistic regression.9 Defining $\operatorname{logit}(p) = \log\{p/(1-p)\}$, the flexible models take the form
$$\operatorname{logit}\,\lambda(t \mid \theta) = \theta^{\prime} x(t),$$
where $x(t)$ is a vector of basis terms in $t$: spline or fractional polynomial functions, as described below.
Estimation
Whatever the model, estimation is based on the log-likelihood for right-censored data,
$$\ell(\theta) = \sum_{i=1}^{n} \left[ d_i \log f(t_i \mid \theta) + (1 - d_i) \log S(t_i \mid \theta) \right],$$
where $t_i$ is the observed survival or censoring time of subject $i$ and $d_i$ indicates an observed event. To estimate θ in the parametric models, we maximize $\ell(\theta)$ numerically.
We implement maximum likelihood for the flexible hazard models using Efron's partial logistic regression,9 which involves creating, for subject $i$ with survival time $t_i$ and event indicator $d_i$, a sequence of binary responses: a 0 for each time point survived and, when the event was observed, a final 1 at $t_i$. A logistic regression of these responses on the basis terms $x(t)$ then maximizes $\ell(\theta)$. Alternatively, one can create vectors $r$ and $d$ for each arm, where $r_t$ counts the subjects at risk and $d_t$ the events at time $t$, and fit the equivalent binomial regression of $d_t$ events in $r_t$ trials on $x(t)$.
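The following R sketch illustrates the expansion on hypothetical data; the helper expand_subject, the data, and the particular spline basis are illustrative assumptions rather than the paper's exact implementation.

    # Hypothetical right-censored data: observed days and event indicators.
    time  <- c(5, 8, 8, 12, 20, 25, 25, 30)
    event <- c(1, 1, 0, 1, 1, 1, 0, 1)

    # Expand subject i into binary records: a 0 for each day survived and,
    # when the event was observed, a final 1 at t_i. (One consistent
    # convention; treatments of censored records vary.)
    expand_subject <- function(t_i, d_i) {
      if (d_i == 1) data.frame(day = 0:t_i, y = c(rep(0, t_i), 1))
      else          data.frame(day = seq_len(t_i) - 1, y = rep(0, t_i))
    }
    pp <- do.call(rbind, Map(expand_subject, time, event))

    # One flexible hazard model: logistic regression on a cubic spline in day.
    library(splines)
    fit <- glm(y ~ bs(day, degree = 3, df = 5), family = binomial, data = pp)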
One computes large-sample standard errors for the hazard and survival ordinates as follows: The variance of the linear predictor $\hat{\eta}(t) = \hat{\theta}^{\prime} x(t)$ is
$$\widehat{\operatorname{Var}}\{\hat{\eta}(t)\} = x(t)^{\prime} (X^{\prime} W X)^{-1} x(t),$$
where $X$ is the design matrix and $W$ is a diagonal matrix with elements equal to the usual binomial weights $r_t \hat{\lambda}(t)\{1 - \hat{\lambda}(t)\}$. Applying the delta method, we get
$$\widehat{\operatorname{Var}}\{\hat{\lambda}(t)\} \approx \left[ \hat{\lambda}(t)\{1 - \hat{\lambda}(t)\} \right]^2 \widehat{\operatorname{Var}}\{\hat{\eta}(t)\},$$
and a further application of the delta method to the product in (1) yields the variance of the survival ordinate $\hat{S}(t)$.
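Continuing the sketch above, predict() supplies the standard error of the linear predictor, from which the delta-method quantities follow; the survival ordinate then comes from the product formula in (1).

    grid <- data.frame(day = 0:30)
    pr   <- predict(fit, newdata = grid, se.fit = TRUE)  # linear-predictor scale
    lam  <- plogis(pr$fit)                               # estimated hazard
    se_lam <- lam * (1 - lam) * pr$se.fit                # delta-method SE
    S    <- cumprod(c(1, 1 - lam))[seq_along(lam)]       # S(t) = prod_{s<t} {1 - lambda(s)}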
Bayesian model averaging
With flexible modeling, it is challenging to select the spline or fractional polynomial terms to include in the regression. One approach is to choose, from a suitably large family, the model that optimizes a measure of fit such as the Akaike information criterion26 or Bayesian information criterion (BIC).27 If one does not account for model selection, standard errors may be too small and significance tests based on model statistics may have size exceeding the nominal level.
An alternative in such cases is Bayesian model averaging.24,25 This method involves positing a comprehensive set of models, assigning each a prior probability, estimating each model and computing its posterior probability, and averaging inferences across this posterior. With many types of analyses, the class of potential models is so large that it is impossible to estimate them all. But with flexible parametric regression, even a relatively small set of models—perhaps only a few dozen—should manifest enough shapes to describe any pattern likely to be seen in real data. Thus, one can, with modest effort, conduct an analysis that properly adjusts for model uncertainty.
Once one has estimated each model in the proposed class, the posterior probability of model $M_k$ is
$$\Pr(M_k \mid \text{data}) = \frac{\Pr(\text{data} \mid M_k)\,\Pr(M_k)}{\sum_j \Pr(\text{data} \mid M_j)\,\Pr(M_j)},$$
where $\Pr(\text{data} \mid M_k)$ is the marginal likelihood under model $M_k$. A quantity of interest $\phi$ then has posterior mean
$$\operatorname{E}(\phi \mid \text{data}) = \sum_k \operatorname{E}(\phi \mid M_k, \text{data})\,\Pr(M_k \mid \text{data})$$
and posterior variance
$$\operatorname{Var}(\phi \mid \text{data}) = \sum_k \left[ \operatorname{Var}(\phi \mid M_k, \text{data}) + \left\{ \operatorname{E}(\phi \mid M_k, \text{data}) - \operatorname{E}(\phi \mid \text{data}) \right\}^2 \right] \Pr(M_k \mid \text{data}).$$
That is, the posterior mean of a parameter is the average of its within-model estimates, and the posterior variance is the average of its within-model variances plus the variance of its within-model estimates, always averaging against the posterior distribution over the models.
For generalized linear models, such as those estimated in flexible modeling by logistic regression, one can approximate the marginal likelihood by $\Pr(\text{data} \mid M_k) \approx \exp(-\mathrm{BIC}_k/2)$; with equal prior model probabilities, the posterior probability of $M_k$ is then proportional to $\exp(-\mathrm{BIC}_k/2)$.
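A minimal R sketch of the BIC weighting, reusing pp and library(splines) from the estimation sketch; the four-model class is an illustrative stand-in for the larger classes described next.

    forms <- list(y ~ 1,                            # intercept only (geometric)
                  y ~ bs(day, degree = 1, df = 3),
                  y ~ bs(day, degree = 2, df = 4),
                  y ~ bs(day, degree = 3, df = 5))
    fits <- lapply(forms, function(f) glm(f, family = binomial, data = pp))
    bic  <- sapply(fits, BIC)
    w    <- exp(-(bic - min(bic)) / 2)
    w    <- w / sum(w)                              # posterior model probabilities

    # Model-averaged hazard at a given day: posterior mean and variance
    # (within-model variance plus between-model spread).
    bma_hazard <- function(day0) {
      est <- v <- numeric(length(fits))
      for (k in seq_along(fits)) {
        pr     <- predict(fits[[k]], newdata = data.frame(day = day0), se.fit = TRUE)
        lam    <- plogis(pr$fit)
        est[k] <- lam
        v[k]   <- (lam * (1 - lam) * pr$se.fit)^2
      }
      m <- sum(w * est)
      c(mean = m, var = sum(w * (v + (est - m)^2)))
    }
    bma_hazard(6)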
In spline-based analyses, I used a collection of 19 models: an intercept-only model (equivalent to a geometric distribution for discrete data and approximating an exponential for discretized continuous data), plus the 18 combinations of 0 to 5 knots (at equally spaced quantiles of the observed event times) with linear, quadratic, and cubic spline terms. I omitted zero-order splines—analogous to the piecewise exponential—because they give non-continuous hazard functions.
With fractional polynomials, I used a collection of 45 models: a null model (intercept-only, geometric/exponential); 8 models including an intercept and a single term t^p, with power p drawn from the standard set {−2, −1, −0.5, 0, 0.5, 1, 2, 3} (reading t^0 as log t); and 36 models including an intercept and two such terms, a repeated power p entering as t^p and t^p log t.
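Assuming the standard fractional polynomial conventions just described, a short R sketch enumerates the model space and confirms the count of 45.

    powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)
    one_term <- as.list(powers)                         # 8 one-term models
    two_term <- c(combn(powers, 2, simplify = FALSE),   # 28 distinct pairs
                  lapply(powers, function(p) c(p, p)))  # 8 repeated powers
    length(one_term) + length(two_term) + 1             # 8 + 36 + 1 = 45

    # One basis term; t must be positive (shift the origin if t = 0 occurs).
    fp_term <- function(t, p) if (p == 0) log(t) else t^p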
Simulation
I conducted a simulation to evaluate frequency properties of the estimation methods in samples of moderate size. I simulated 400 single-arm data sets from each of five generating models:
1. a negative binomial;
2. a discretized exponential;
3. a discretized (non-exponential) Weibull;
4. a discretized lognormal; and
5. a model (henceforth denoted mixture) defined by a bimodal hazard function.
I chose the exponential distribution to have mean (and standard deviation) 365.25 days, and the other parametric distributions to have mean 365.25 days and standard deviation 400 days. For each data set under each model, I estimated the hazard and the mean and median survival. Because the flexible models had finite support, when computing the mean, I appended an exponential tail beyond the largest t. With independent censoring uniform on 0–36 months (6–36 months for the mixture model), simulations gave on average roughly 70% events.
I computed Wald confidence intervals for mean survival. To obtain an interval for the median, I took as lower confidence bound the smallest t for which the lower limit of the 95% confidence interval of the survival was less than 0.5, and as upper confidence bound the largest t for which the upper limit of the 95% confidence interval was greater than 0.5.
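This rule translates directly into a short R helper, assuming a grid of times and pointwise 95% survival limits that cross 0.5 somewhere on the grid.

    # t: grid of times; lo, hi: pointwise 95% limits for the survival curve.
    median_ci <- function(t, lo, hi) {
      c(lower = min(t[lo < 0.5]),  # smallest t with lower limit below 0.5
        upper = max(t[hi > 0.5]))  # largest t with upper limit above 0.5
    }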
I summarized simulation results by the root mean squared error of estimation and the coverage probability of the 95% confidence interval. With 400 simulations, assuming true 95% coverage probability, the Monte Carlo standard error of the coverage probability estimate is roughly 1%.
Analysis of clinical trial data sets
I applied the flexible methods to data from three historical randomized, event-based trials: REMATCH,5 the gastric carcinoma trial analyzed by Stablein et al,28,29 and the head-and-neck cancer trial analyzed by Efron.9 For each trial, I applied both methods to estimate the control and test hazard functions and the hazard ratio as a function of t. To assess the effect of the level of discreteness, I conducted the analysis of each data set with survival in days (the original scale), rounded to the nearest week, and rounded to the nearest month.
Computation
I conducted simulations in R Version 4.5.1 and analyses in R Version 4.5.2.30
Results
Simulation
Results for mean survival appear in the upper half of Table 1. Coverage probabilities were close to 95% except when one of the generating or analysis models was lognormal. Precision was typically best, or nearly so, under the generating model; again, the lognormal was the exception. The flexible models in most cases had precision somewhat less than the generating model. The spline gave the most precise mean for mixture data.
Table 1. Simulated coverage probability and root mean squared error for mean and median survival, by generating and analysis model. Each cell summarizes 400 replications of a single arm.
Results for median survival appear in the lower half of Table 1. Coverage probabilities were close to 95% except when the simulation model was lognormal or mixture, or the analysis model was lognormal. Precision was best, or nearly so, under the generating model. Where their coverage was calibrated, the flexible models had precision only slightly less than the generating model, except that the fractional polynomial analysis failed badly for mixture data.
Graphs of averaged estimates of the mixture hazard function (see Supplementary Figures S.1–S.6) reveal that only the spline analysis captures its multiple modes. Figure S.5 shows that the fractional polynomial model underestimated the mixture hazard over part of its range.
Example: REMATCH
The REMATCH trial randomized heart failure patients to receive either optimal medical management or an implantable left-ventricular assist device.5 Of 129 subjects, 107 experienced events. Analysis by the Cox model gives an estimated hazard ratio of 0.52 (favoring the device) with 95% confidence interval (0.34, 0.78) and a logrank test that is significant at conventional levels.
Figure 2 displays the spline estimates of the REMATCH hazard functions. The graph shows an elevated hazard in the device arm at early t, reflecting perioperative mortality from the implantation. The hazard in the control arm is more nearly constant. Figure S.7 displays the REMATCH hazard ratio as a function of time, estimated under the spline model with Bayesian model averaging. The graph suggests the possibility of non-proportional hazards.

Figure 2. Bayesian model-averaged spline estimates of the REMATCH hazard functions with pointwise standard error bars.
Example: gastric carcinoma
The Gastrointestinal Tumor Study Group conducted a randomized trial comparing chemotherapy alone to chemotherapy plus external radiotherapy in locally advanced gastric carcinoma.29 Of 90 subjects, 74 experienced events. An analysis by the Cox model gives a hazard ratio for survival of 1.30 (favoring chemotherapy alone), with 95% confidence interval (0.83, 2.06) and a nonsignificant logrank test.
The original publication stated that “combined therapy was associated with an increased number of early deaths.”29 Moreover, the survival curves crossed, although not until a late follow-up time when standard errors were large. On the basis of the slightly improved survival in the combined arm beyond about 30 months, the investigators concluded that combined therapy would be superior to chemotherapy alone if only one could mitigate its early toxic effects. Stablein et al.28 expressed this idea by modeling the log hazard ratio as quadratic in time.
Figure 3 displays spline estimates of the hazard functions. The graph shows an elevated hazard in the test arm at early t, with hazard functions for the two arms converging before day 500. Figure S.8 displays the hazard ratio as a function of time, estimated under the spline model. The graph suggests that the hazard ratio declines from a value of 2.5 at the outset, settling below 1.0 by 2 years.

Figure 3. Bayesian model-averaged spline estimates of the gastric cancer trial hazard functions with pointwise standard error bars.
Example: head-and-neck cancer
Efron analyzed data from a randomized trial comparing radiation therapy alone to chemotherapy plus radiation in head-and-neck cancer.9 Of 96 patients, 73 experienced events. Cox analysis gives a hazard ratio of 0.58 (favoring chemoradiotherapy), with 95% confidence interval (0.35, 0.91) and a logrank test that is significant at the 0.05 level. Figure 4 displays the spline estimates of the hazard functions.

Figure 4. Bayesian model-averaged spline estimates of the head-and-neck cancer trial hazard functions with pointwise standard error bars.
Evaluation of the effect of coarsening
Starting with survival time t denominated in days, one can transform to weeks by rounding t/7 to the nearest integer, and to months analogously. Figure 5 displays the resulting estimates of the head-and-neck hazard ratio on the three time scales.

Figure 5. Estimates of the head-and-neck hazard ratio from the spline model by Bayesian model averaging, on three different time scales.
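A sketch of the coarsening transforms in R; the 365.25/12-day month is an assumption, since the text does not specify the divisor.

    days   <- c(5, 40, 123, 365)            # illustrative survival times in days
    weeks  <- round(days / 7)               # nearest week
    months <- round(days / (365.25 / 12))   # nearest month (assumed 30.44-day months)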
Discussion
Discrete versus continuous models
The models proposed here depart from common practice in treating survival as discrete. This device enables effective approximation of continuous-data models and straightforward estimation of hazards via binary regression. Most importantly, these models reflect the fact that clinical trial survival data are discrete, typically denominated in days with the possibility of zero values and ties. Thus, it is the continuous-data models that we should view as approximations, not the other way around.
My empirical findings confirm Efron's9 prediction that the shape of the estimated hazard function is robust to the level of discretization. Indeed, the degree of coarsening appears to have less influence on estimates than the choice of flexible modeling technique. This agrees with general observations that analysis of discrete data as though continuous is valid unless the number of distinct data values is small, perhaps fewer than 10.32
Flexible models
Flexible-model confidence intervals were calibrated for the mean except when the underlying data were lognormal. Plausibly, this is because one can define flexible models only on the range of the data; to extrapolate beyond this range, I added an exponential tail that figured in the computation of the mean. This adjustment was evidently unsuitable for the lognormal but adequate for the others. For medians, where tail length is irrelevant, the flexible estimates were calibrated for all the parametric generating models.
In simulations, the spline model estimated all hazards well, whereas the fractional polynomial was accurate for all but the mixture hazard. This was because the fractional polynomial was not sufficiently flexible to describe the multiple modes of the mixture hazard, leading to bias. Including a third term in the fractional polynomial should address this deficiency.
Model-averaged hazard estimates inherit features of each component model. For instance, when the model space includes a linear spline with a single knot, the averaged hazard function will also have discontinuous first derivative at that knot (see Figures 2 –5 and S.7–S.9). I therefore excluded zero-order splines (step functions), which would have created discontinuities at the knots.
Extensions of the model
Various approaches to extending the flexible model are possible: increasing the number of fractional polynomial terms;12 mixing model types in the model space; using informative priors for coefficients in the regression models; and replacing the logistic link with an alternative such as the complementary log-log.
Why estimate hazards?
Recent literature challenges the common practice of interpreting the hazard ratio as a causal parameter.3,33–37 The issue is that at any time t > 0, the hazard compares only those subjects who have survived to t; because treatment itself can affect who survives, these risk sets are selected subpopulations, and the contrast between them no longer enjoys the protection of randomization.
Its equivalence with other descriptors of survival nevertheless suggests a role for the hazard in applications. First, whereas survival curves look largely the same, declining from 1 to 0, hazards can take a wide range of shapes, including some that are characteristic of particular distributions. Therefore, estimating hazards can help distinguish distributions and suggest parametric forms for future analyses. Second, because the survival function is essentially an integrated version of the hazard,38 it can smooth away some potentially informative detail. Consider that in the gastric cancer data, the survival curves cross only after 30 months,29 whereas Figure 3 shows the hazard curves crossing around 12 months—substantially changing the interpretation. Third, the hazard function represents an instantaneous mortality rate that is, in a sense, “upstream” of the survival curve and density. Thus, it is natural to conceptualize an intervention as operating at each time t to reduce the instantaneous risk, thereby increasing future survival. This idea is implicit in the ubiquitous practice of describing clinical trial treatment effects with hazard ratios.2 Yet despite the popularity of such analyses, it continues to be rare to see graphs of estimated hazards in clinical trial practice.38
Proportional hazards tests
Tests of the null hypothesis of proportional hazards31,39 have limited power when the departure from proportionality is modest, the sample size is small, or censoring is abundant40,41—conditions that hold in many clinical trials. It is therefore informative to inspect hazard function graphs and compare them in light of estimated error bounds.
Consider our examples: In the gastric cancer trial, the Grambsch-Therneau test is significant, and the hazard plots (Figures 3 and S.8) demonstrate marked departure from proportionality. In REMATCH, a larger trial, there was an expectation of excess early hazard reflecting perioperative mortality in the device arm. Although Grambsch-Therneau is not significant, inspection of the hazard curves (Figures 2 and S.7) reveals nonproportionality in the anticipated direction. Averaging these varying hazard ratios—some above 1.0 and others below 0.4—obscures important differences in the mortality distributions.
Conclusion
Given the hazard function’s central role in survival modeling, I recommend its estimation as a routine element of clinical trial practice.
Acknowledgements
I thank the deputy editor, associate editor, and a referee for helpful comments.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by grant 2208892 from the US National Science Foundation and grant ME-2012C2-22793-IC from the Patient-Centered Outcomes Research Institute.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data and code are available in the Supplement.
Supplemental material
Supplemental material for this article is available online.
References