Abstract
The semiparametric accelerated failure time mixture cure model is an appealing alternative to the proportional hazards mixture cure model in analyzing failure time data with long-term survivors. However, this model was only proposed for independent survival data and it has not been extended to clustered or correlated survival data, partly due to the complexity of the estimation method for the model. In this paper, we consider a marginal semiparametric accelerated failure time mixture cure model for clustered right-censored failure time data with a potential cure fraction. We overcome the complexity of the existing semiparametric method by proposing a generalized estimating equations approach based on the expectation–maximization algorithm to estimate the regression parameters in the model. The correlation structures within clusters are modeled by working correlation matrices in the proposed generalized estimating equations. The large sample properties of the regression estimators are established. Numerical studies demonstrate that the proposed estimation method is easy to use and robust to the misspecification of working matrices and that higher efficiency is achieved when the working correlation structure is closer to the true correlation structure. We apply the proposed model and estimation method to a contralateral breast cancer study and reveal new insights when the potential correlation between patients is taken into account.
Keywords
Introduction
Many studies involving time-to-event analysis have a fraction of study subjects who will not experience the event of interest even after an extended follow-up. They are often deemed as cured or long-term survivors who are immune or non-susceptible to the event. For instance, only 5% to 50% of patients with head and neck cancer 1 experienced local recurrences, and many patients were free of symptoms of cancer at the end of the sufficiently long observation period and can be considered cured. Due to the existence of cured subjects, using classical survival models, such as Cox’s proportional hazards (PH) model, for such data, can result in biased estimates and information loss, and cure models2,3 that have been developed to take the cured subjects into account should be considered.
In addition to potentially cured subjects, clustered survival times are often observed in the studies. Clustering may occur when there are multiple events from one subject 4 or multiple subjects from the same family or hospital, 5 and the times to the event of interest from the same cluster tend to be correlated due to shared genetic or other common environments.
To appropriately account for the correlation in a cluster, the two most studied approaches are marginal models and frailty models. The marginal models focus on the population average on the marginals of the joint distribution of data from one cluster. The correlation within a cluster is either estimated by a working correlation or treated as a nuisance parameter in the marginal models. Alternatively, frailty models explicitly formulate the underlying dependence structure by random effects or frailties, and the failure times are assumed to be independent conditional on the unobservable frailty. These models have been studied extensively in the literature. For example, Rubio and Drikvandi 6 developed a novel parametric mixed-effects general hazard model for the analysis of clustered survival data. The heterogeneity between clusters is modeled via the incorporation of random effects into a hazard-based regression model. Chiou et al. 7 considered a semiparametric accelerated failure time (AFT) model for clustered failure times from stratified random sampling and proposed weighted rank-based estimating equations for fitting the model with the induced smoothing approach. The generalized estimating equations (GEE) method 8 has been adopted in both the marginal AFT models9,10 and the marginal PH models.11,12
To model clustered survival data with a cured fraction, the marginal mixture cure model is often assumed and a robust variance estimation is used for inference.5,13 The PH assumption was considered for survival times among uncured subjects in the marginal models. This approach was further generalized with a transformation model for survival times among uncured subjects14,15 and with GEE to allow for more efficient estimation.16–18 Other approaches for modeling clustered survival data were also considered in cure models, including the random effects/frailty approach19–23 and the copula approach.24–27
The existing models for clustered/correlated survival time with a cured fraction are largely based on the PH assumption when modeling the effects of covariates on the survival time of uncured subjects. Although the PH assumption is widely used in modeling censored survival data, it is not an assumption that is easier to satisfy in practice than other assumptions. It also suffers from difficulty in interpreting the estimated effects. Researchers will benefit from having more than one analytic technique at their disposal when the PH assumption is not appropriate.
One alternative assumption to the PH assumption is the AFT assumption. The attractive feature of the AFT model is that the effects of covariates are modeled directly on the expected value of the survival time, making the interpretation more intuitive and straightforward than the effects from the PH assumption. 28 The AFT model also enjoys other desirable properties, including collapsibility, that are not exhibited by a PH model. The collapsibility makes the AFT model particularly attractive when quantifying confounding effects 29 and mediation effects of covariates30,31 in causal inference.
The AFT assumption has been considered in models for survival data with a cure fraction, including parametric AFT mixture cure models32,33 and semiparametric AFT mixture cure models.34–41 However, we are not aware of any existing work on modeling clustered survival data with a cured fraction using the AFT assumption, particularly under semiparametric models. This may be due to the challenges in extending the existing semiparametric estimation methods to clustered survival data. It motivates us to consider AFT assumption-based cure models for clustered survival time with a cured fraction. This work is important because it fills the gap in the literature and provides useful alternatives to the PH-based models for clustered survival data with a cured fraction. The interpretation of covariate effects in the AFT assumption-based models is more straightforward and intuitive, and the models are more suitable for future work in causal inference.
In this paper, we propose a marginal semiparametric AFT mixture cure model for clustered survival time data with a cured fraction. An estimating equation approach is employed to estimate the regression parameters in the model with flexible working correlation structures for cure statuses and for the survival times of uncured subjects within clusters. The paper is organized as follows. Section 2 introduces the marginal semiparametric AFT mixture cure model for clustered survival data with a cure fraction. A semiparametric estimation method of the model based on a set of GEE is presented and the asymptotic properties of the estimators are investigated in this section. A simulation study is conducted in Section 3 to evaluate the finite sample performance of the proposed estimation method. The proposed model and the estimation method are applied to contralateral breast cancer data in Section 4. Conclusions and discussions are presented in Section 5.
Marginal AFT mixture cure model
Let
Let
If we assume all
To take the potential correlation within clusters into account in the estimation, we consider incorporating working correlation matrices into the estimating equations (5) and (7). Let
To develop the GEE for
Using the generalized estimation equations (9) and (10) also requires a full specification of Set initial values E-Step: Calculate M-Step:
Calculate the estimate of Given the current estimates of Update the estimate of Repeat steps (ii) and (iii) until convergence. Denote the estimates as Repeat steps (b) and (c) until the algorithm converges. The stopping criterion is that the sum of the squares of the differences in estimates between two adjacent iterations is below
Let
Let the estimator as
A proof of the theorem is provided in the Appendix. Note that the theorem only provides the asymptotic properties of
We conduct an extensive simulation study to investigate the finite sample performance of the proposed method. Simulated clustered survival data are generated from a mixture cure model with the marginal specified by (1) to (3). Two covariates are considered in
To generate correlated
To generate correlated
We set
Bias, Var, Var
, CP of 95% confidence intervals of
from the proposed method under the IND working correlation , EX correlation , and AR(1) correlation and from the method of ZP
35
for data simulated under the cure rate 0.35 and the EX structure.
Bias, Var, Var
Var: empirical variance; Var*: average of bootstrap variance; CP: coverage percentage; IND: independent; EX: exchangeable; AR(1): first-order autoregressive; ZP: Zhang and Peng.
Bias, Var, Var
Var: empirical variance; Var*: average of bootstrap variance; CP: coverage percentage; IND: independent; EX: exchangeable; AR(1): first-order autoregressive; ZP: Zhang and Peng.
Bias, Var, Var
Var: empirical variance; Var*: average of bootstrap variance; CP: coverage percentage; IND: independent; EX: exchangeable; AR(1): first-order autoregressive; ZP: Zhang and Peng.
Bias, Var, Var
Var: empirical variance; Var*: average of bootstrap variance; CP: coverage percentage; IND: independent; EX: exchangeable; AR(1): first-order autoregressive; ZP: Zhang and Peng.
For each setting above, we generate 1000 datasets and then fit each dataset with the proposed marginal model and the estimation method. As a comparison, we also fit the data with the AFT mixture cure model of Zhang and Peng
35
(referred to as the ZP method) that does not consider the correlation within clusters. This method is available in the R package
To further evaluate the efficiency gain of the proposed method, we calculate the relative efficiency defined as the ratio of mean squared errors (MSEs) of the estimates from the ZP method and the proposed method with the working independent, exchangeable, and first-order autoregressive correlation structures to the MSE of the estimates from the proposed method with the correctly specified working correlation structure. The results are summarized in Table 5, which shows that the MSEs from the proposed method with three working correlation structures are generally smaller than those from the ZP method when the values of
Relative efficiencies (ratios of MSEs) from the proposed method under the IND working correlation, EX correlation, and AR(1) correlation and from the method of ZP 35 against the proposed method under the correct correlation structure for data simulated under the cure rate 0.35 and 0.5 and the EX and the AR(1) correlation structures.
MSE: mean squared error; IND: independent; EX: exchangeable; AR(1): first-order autoregressive; ZP; Zhang and Peng.
The proposed estimation method also produces estimates of
Mean, Var, and the Var* of (
Var: empirical variance; Var*: average of bootstrap variance; EX: exchangeable; AR(1): first-order autoregressive.
We apply the proposed method to a dataset of contralateral breast cancer patients from the SEER database of the National Cancer Institute (https://seer.cancer.gov/data/). Patients with unilateral breast cancer diagnosed between 2005 and 2008 and followed for contralateral breast cancer cases, invasive ductal carcinoma of no special type (ICD-O-3: 8500/3), positive lymph node statuses, and positive histology are considered along with the following baseline covariates: radiation therapy, age at diagnosis, estrogen receptor (ER) status, progesterone receptor (PR) status, and the lymph node ratio (LNR) defined as the ratio between the number of positive lymph nodes and the total number of examined lymph nodes. 48 Patients with these characteristics are extracted from the SEER Plus Data 17 Registries database using SEER*Stat 8.4.0.1 software.
There are 694 eligible patients included in our study with a censoring rate of 52.6%. The survival time of interest is defined as the time (in years) to relapse or death due to the cancer. We plot the Kaplan–Meier survival curve in Figure 1(a) and observe that the survival curve presents a high plateau and levels off at a value substantially greater than 0 after 10 years of follow-up due to a large number of long-term survivors of the cancer. It indicates that the long-term survivors may be considered cured since they are unlikely to relapse or die of the cancer. We also conduct a nonparametric test
49
for the existence of long-term survivors and the obtained

(a) The Kaplan–Meier survival curve and its pointwise 95% confidence interval and (b) the logarithm of estimated cumulative hazard functions of survival time distribution from some groups defined by the covariates in the cure model for the breast cancer data.
The 694 patients are from 44 clusters formed by SEER registries. Patients from the same cluster may share a similar lifestyle, a similar socioeconomic status, and a similar healthcare system, and their cure statuses and failure times of uncured patients may tend to be correlated due to the shared environments. Due to the number of clusters in the data, it will not be efficient to consider clusters as a categorical variable in the model. Therefore, the proposed marginal mixture cure model is suitable for analyzing the data to take into account the potential correlation within clusters.
Since we do not have any a priori subject knowledge about which of the baseline covariates should be in which part of the cure model, we consider all of them in both parts. The covariates are denoted and coded as follows: Radiation (a binary covariate with value 1 if a patient received radiation therapy and 0 otherwise), Age (a binary covariate with value 1 if a patient’s age is more than or equal to 50 years old and 0 otherwise), ER (a binary covariate with value 1 if a patient’s ER status is positive and 0 otherwise), PR (a binary covariate with value 1 if a patient’s PR status is positive and 0 otherwise), and LNR (the LNR between 0 and 1).
To examine the suitability of the AFT assumption and the PH assumption in the marginal mixture cure model for the data, following the idea of Zhang and Peng,
35
we plot the logarithm of the cumulative hazard functions obtained from the weighted Kaplan–Meier survival estimators for the groups determined by the five covariates considered (LNR is dichotomized at 0.15
50
). The weight is estimated by
We fit the proposed model to the data under two working correlation structures: the working independent correlation and the working exchangeable correlation structures. As a comparison, we also fit the data with the AFT mixture cure model using the ZP method. The standard errors of estimates from the models are obtained using 100 bootstrap samples. The results from the models, summarized in Table 7, show some substantial differences. For example, in the latency part, we observe that Radiation is significant (
Estimated parameters and their SEs from the ZP method and from the proposed marginal semiparametric AFT mixture cure model under the working IND correlation and the working EX correlation structures for the breast cancer data.
SE; standard error; AFT: accelerated failure time; ZP: Zhang and Peng; EX: exchangeable; IND: independent; ER: estrogen receptor; PR: progesterone receptor; LNR: lymph node ratio.
Marginal cure models have been widely used for analyzing multivariate survival data with a cure fraction. However, most efforts have been focused on the marginal PH mixture cure models which may be improper for the applications when the PH assumption is not satisfied for the latency. In this paper, we considered a semiparametric marginal AFT mixture cure model for correlated survival data with a cure proportion. A novel estimation approach is developed based on the GEE in the marginal AFT mixture cure model. We showed that the regression estimators are consistent and asymptotically normal, and employed a bootstrap method to estimate the variances of the estimated parameters. Our work relaxes the independent observations assumption for potentially correlated survival data on the usage of the AFT mixture cure model 35 by incorporating working correlation structures in the estimation procedure. The proposed method is also an extension of the marginal AFT model, 9 which was proposed for correlated failure time data without a cure fraction.
Our numerical studies show that the proposed method for clustered survival data improves the estimation efficiency of the regression estimators compared with the method in Zhang and Peng, 35 especially when the correlation is strong and the cluster size is large. When the prespecified working correlation structure matches the underlying true correlation structure, the empirical variances are in general smaller than those under other working correlation structures. The results from the proposed method are comparable with the results from the method in Zhang and Peng, 35 when the correlation strength declines or the cluster size decreases. Hence, the proposed semiparametric marginal AFT mixture cure model provides a new approach for the analysis of correlated lifetime data with a cure proportion, particularly when one is interested in characterizing covariate effects on the failure times of uncured patients directly, and the dependence among the survival times of uncured patients and among the cure statuses can be described by a few unknown parameters. The proposed model and estimation method also facilitate for future development of causal inference for clustered survival data with a cured fraction.
It is worth noting that although the models with random effects or frailties are often considered for clustered data where the random effects/frailties formulate the underlying dependence in a cluster, for clustered survival data with a cure fraction, we are not aware of any existing work based on the AFT assumption. Therefore, as pointed out by one reviewer, considering the AFT mixture cure model with random effects/frailties for clustered survival data should be an important and interesting topic for future research. However, a random effects/frailty model may not always be written as a marginal model unless under special cases such as a linear model with normal random effects. Thus, the estimates from a marginal model are generally not comparable with those from a random effects/frailty model. 52
The bootstrap method is employed to estimate the variances of the estimators in the model in all the numerical work reported in this paper. This method is relatively straightforward to implement but can be computationally intensive. Having a computationally efficient method for variance estimation is always preferable to the bootstrap method. 53 One interesting work in the future is to explore possibilities to simplify the variance estimation in the proposed method.
The proposed semiparametric estimation method for the marginal AFT mixture cure model was implemented in our R package
Footnotes
Acknowledgements
The authors acknowledge the efforts of the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute in the creation of the SEER database. The dataset that supports the findings in this paper is available from the website of the National Cancer Institute at
. Restrictions apply to the availability of these data, which are used under license for this study.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work of Yingwei Peng is partially supported by a research grant from the Natural Sciences and Engineering Research Council of Canada. The work of Yi Niu is partially supported by grants from the National Natural Science Foundation of China (11401072), the Fundamental Research Funds for the Central Universities of China (DUT20LK24, DUT22LAB302), and the Dalian High-level Talent Innovation Project (2020RD09).
Appendix
The proof of the theorem is developed based on Rosen et al.
54
and Liang and Zeger.
8
We begin by listing the assumed conditions required for the consistency and asymptotic normality of The censoring mechanism is noninformative. The covariate vectors With probability one, there exists a positive constant For each fixed The function The expected value There exists a continuous function which is bounded away from zero, There are strictly positive numbers The expectation Let Next we prove that, Let Next we show the details of two items of For the second component of
Let
