Empirical power comparison of statistical tests in contemporary phase III randomized controlled trials with time-to-event outcomes in oncology

Abstract

Background:

More than 95% of recent cancer randomized controlled trials used the log-rank test to detect a treatment difference, making it the predominant tool for comparing two survival functions. Like any test, the log-rank test has advantages and disadvantages. One advantage is that it offers the highest power against proportional hazards alternatives, which may be a major reason why alternative methods have rarely been used in practice. The performance of statistical tests has traditionally been investigated both theoretically and numerically for several patterns of difference between two survival functions. However, to the best of our knowledge, no study has compared the performance of various statistical tests using empirical data from past oncology randomized controlled trials. It is therefore unknown whether the log-rank test offers a meaningful power advantage over alternative testing methods in contemporary cancer randomized controlled trials. Focusing on recently reported phase III cancer randomized controlled trials, we assessed whether the log-rank test gave meaningfully greater power than five alternative testing methods: the generalized Wilcoxon test, a test based on the maximum of the statistics from multiple weighted log-rank tests, the difference in t-year event rate, and the difference in restricted mean survival time with a fixed or an adaptively chosen truncation time $\tau$.
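Three of the compared tests belong to one family: the weighted log-rank tests with Fleming–Harrington weights $G(\rho,\gamma)$, where the weight at event time $t$ is $\hat S(t-)^{\rho}\{1-\hat S(t-)\}^{\gamma}$ and $\hat S$ is the pooled Kaplan–Meier estimate. A minimal sketch, assuming Python with NumPy, is given below. $G(0,0)$ is the ordinary log-rank test; $G(1,0)$ is closely related to the Peto–Prentice form of the generalized Wilcoxon test (which Wilcoxon variant the study used is not stated in this abstract). The max-combo function takes the largest $|Z_k|$ over the weights $(0,0), (0,1), (1,0), (1,1)$, a frequently used set that is assumed here, and approximates its null p-value by Monte Carlo draws from the joint normal distribution of the $Z_k$ rather than by exact numerical integration. All function names are hypothetical.

```python
import math
import numpy as np


def weighted_logrank_parts(time, event, group, weights):
    """Score U_k and covariance of U for Fleming-Harrington G(rho, gamma) weights.

    Weight at event time t: S(t-)**rho * (1 - S(t-))**gamma, where S is the
    pooled Kaplan-Meier estimate. `group` is coded 0/1; U compares observed
    events in group 1 with their expectation under equal survival.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=int)
    u = np.zeros(len(weights))
    cov = np.zeros((len(weights), len(weights)))
    s_left = 1.0  # pooled KM just before the current event time
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        dead = (time == t) & event
        d = dead.sum()
        d1 = (dead & (group == 1)).sum()
        # Hypergeometric mean and variance of d1 given the risk-set margins.
        expected = d * n1 / n
        var_t = d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1) if n > 1 else 0.0
        w = np.array([s_left**rho * (1 - s_left)**gamma for rho, gamma in weights])
        u += w * (d1 - expected)
        cov += np.outer(w, w) * var_t
        s_left *= 1 - d / n  # update pooled KM after using its left limit
    return u, cov


def weighted_logrank(time, event, group, rho=0, gamma=0):
    """Two-sided weighted log-rank test; rho=gamma=0 gives the ordinary log-rank."""
    u, cov = weighted_logrank_parts(time, event, group, [(rho, gamma)])
    z = u[0] / math.sqrt(cov[0, 0])
    return z, math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value


def maxcombo(time, event, group, weights=((0, 0), (0, 1), (1, 0), (1, 1)),
             n_sim=100_000, seed=0):
    """Max of |Z_k| over several weights; p-value by Monte Carlo under the null."""
    u, cov = weighted_logrank_parts(time, event, group, list(weights))
    sd = np.sqrt(np.diag(cov))  # assumes every weight carries some information
    z = u / sd
    corr = cov / np.outer(sd, sd)
    observed_max = np.max(np.abs(z))
    rng = np.random.default_rng(seed)
    null_draws = rng.multivariate_normal(np.zeros(len(z)), corr, size=n_sim)
    p = np.mean(np.max(np.abs(null_draws), axis=1) >= observed_max)
    return z, float(p)
```

For example, `weighted_logrank(time, event, group, rho=1, gamma=0)` returns the $G(1,0)$ statistic and its p-value. Note that the $G(0,0)$ score equals the sum of the $G(1,0)$ and $G(0,1)$ scores, so the four max-combo statistics are strongly correlated; this is why the multiplicity penalty for taking the maximum is smaller than a Bonferroni correction would suggest.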

Methods:

Using manuscripts from cancer randomized controlled trials recently published in high-tier clinical journals, we reconstructed patient-level data for overall survival (69 trials) and progression-free survival (54 trials). For each trial endpoint, we estimated the empirical power of each test, defined as the proportion of trials in which the test would have yielded a significant result (p < 0.05).
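Empirical power, as defined here, is the share of trials whose p-value falls below 0.05. A minimal sketch, assuming Python with NumPy, pairs that count with one of the compared tests: the difference in restricted mean survival time (RMST) at a fixed truncation time $\tau$, estimated as the area under each arm's Kaplan–Meier curve up to $\tau$ with a Greenwood-type variance. The function names are hypothetical, $\tau$ is assumed not to exceed the follow-up in either arm, and the p-values in the usage line are made-up illustrative numbers, not results from the study.

```python
import math
import numpy as np


def km_rmst(time, event, tau):
    """Kaplan-Meier RMST up to tau and its Greenwood-type variance."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    grid, surv_steps = [0.0], [1.0]  # S(t) = surv_steps[i] on [grid[i], grid[i+1])
    deaths, risk = [], []
    surv = 1.0
    for t in np.unique(time[event & (time <= tau)]):
        n = np.sum(time >= t)
        d = np.sum((time == t) & event)
        surv *= 1.0 - d / n
        grid.append(t)
        surv_steps.append(surv)
        deaths.append(d)
        risk.append(n)
    grid.append(tau)
    areas = np.array(surv_steps) * np.diff(grid)  # area of each KM step
    tail = np.cumsum(areas[::-1])[::-1]  # tail[i] = area under KM from grid[i] to tau
    var = 0.0
    for j, (d, n) in enumerate(zip(deaths, risk)):
        if n > d:  # once KM reaches zero the remaining tail area is zero anyway
            var += tail[j + 1] ** 2 * d / (n * (n - d))
    return float(areas.sum()), var


def rmst_difference_test(time, event, group, tau):
    """Two-sided z-test for RMST(group 1) - RMST(group 0) at a fixed tau."""
    time, event, group = map(np.asarray, (time, event, group))
    r1, v1 = km_rmst(time[group == 1], event[group == 1], tau)
    r0, v0 = km_rmst(time[group == 0], event[group == 0], tau)
    diff = r1 - r0
    z = diff / math.sqrt(v1 + v0)
    return diff, math.erfc(abs(z) / math.sqrt(2))


def empirical_power(p_values, alpha=0.05):
    """Share of trials in which the test reached p < alpha."""
    return float(np.mean(np.asarray(p_values) < alpha))
```

With one p-value per reconstructed trial for a given test, `empirical_power([0.01, 0.20, 0.04, 0.60])` returns 0.5; repeating this for every test on the same set of trials gives the comparison reported in the Results.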

Results:

For overall survival, the t-year event rate test had the lowest empirical power (30.4%) and the restricted mean survival time test with fixed $\tau$ had the highest (43.5%). The empirical power of the other tests was almost identical (36.2%–37.7%). For progression-free survival, all tests we investigated had similar empirical power (55.6%–61.1%). No test consistently outperformed any other.
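Because empirical power is a proportion of a fixed number of trials, each reported percentage corresponds to a whole number of trials. The sketch below back-calculates those counts, assuming each percentage is (count / total trials) × 100 rounded to one decimal place, with 69 trials for overall survival and 54 for progression-free survival; the helper name is hypothetical.

```python
# Back-calculate how many trials each reported empirical power represents,
# assuming every percentage is (count / total trials) * 100 rounded to 1 d.p.

def implied_counts(percent, n_trials):
    """All integer counts k for which 100 * k / n_trials rounds to `percent`."""
    return [
        k for k in range(n_trials + 1)
        if abs(100 * k / n_trials - percent) <= 0.05 + 1e-9
    ]


# Percentages as reported in the Results; totals as stated in the Methods.
reported = {
    "overall survival": (69, {"t-year event rate": 30.4,
                              "RMST, fixed tau": 43.5,
                              "other tests, lower bound": 36.2,
                              "other tests, upper bound": 37.7}),
    "progression-free survival": (54, {"lower bound": 55.6,
                                       "upper bound": 61.1}),
}

for endpoint, (n_trials, powers) in reported.items():
    for label, pct in powers.items():
        counts = implied_counts(pct, n_trials)
        print(f"{endpoint}: {label} {pct}% -> {counts} of {n_trials} trials")
```

Under that assumption, each percentage maps to exactly one count: for overall survival the range runs from 21 to 30 of 69 trials, with the remaining tests at 25 or 26 trials, and for progression-free survival from 30 to 33 of 54 trials. The gaps between tests therefore amount to only a handful of trials.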

Conclusion:

The empirical power assessment based on past cancer randomized controlled trials provided new insights into the performance of statistical tests. Although the log-rank test has been used in almost all trials, our study suggests that it is not the only option from an empirical power perspective: its near-universal use is not supported by a meaningful difference in empirical power. Clinical trial investigators could consider alternative methods, beyond the log-rank test, for the primary analysis when designing a cancer randomized controlled trial. Factors other than power (e.g., interpretability of the estimated treatment effect) should receive greater consideration when selecting statistical tests for cancer randomized controlled trials.

Keywords

Hazard ratio; log-rank test; restricted mean survival time; survival data analysis; weighted log-rank test


