A novel mixture cure frailty model is introduced for handling censored survival data. Mixture cure models are preferable when the existence of a cured fraction among patients can be assumed. However, such models are heavily underexplored: frailty structures within cure models remain largely undeveloped, and furthermore, most existing methods do not work for high-dimensional datasets, where the number of predictors is significantly larger than the number of observations. In this study, we introduce a novel extension of the Weibull mixture cure model that incorporates a frailty component, employed to model an underlying latent population heterogeneity with respect to the outcome risk. Additionally, high-dimensional covariates are integrated into both the cure rate and survival part of the model, providing a comprehensive approach to employ the model in the context of high-dimensional omics data. We also perform variable selection via an adaptive elastic-net penalization, and propose a novel approach to inference using the expectation–maximization (EM) algorithm. Extensive simulation studies are conducted across various scenarios to demonstrate the performance of the model, and results indicate that our proposed method outperforms competitor models. We apply the novel approach to analyze bulk RNA-seq gene expression data from breast cancer patients included in The Cancer Genome Atlas (TCGA) database. A set of prognostic biomarkers is then derived from selected genes, and subsequently validated via both functional enrichment analysis and comparison to the existing biological literature. Finally, a prognostic risk score index based on the identified biomarkers is proposed and validated by exploring the patients’ survival.
When analyzing time-to-event data, the presence of substantial censoring following a prolonged follow-up period often indicates the presence of “long-term survivors” or “cured individuals,” that is, individuals who may never experience the event of interest. This phenomenon can be observed in certain clinical investigations such as cancer studies, where successful treatment can effectively prevent disease recurrence. This scenario becomes more noticeable among patients who are diagnosed in the early stages of cancer development. The presence of a cured fraction violates a key assumption of traditional survival models, such as the Cox proportional hazards (PH) model, namely that all subjects will inevitably experience the event of interest. Hence, the use of a Cox PH model in such cases can lead to an underestimation of hazard rates and an overestimation of survival probabilities for individuals susceptible to the event.1 In practice, the cured subjects cannot be observed directly, as they are censored together with the individuals who will eventually experience the outcome. However, the Kaplan–Meier (KM) estimated survival curve can be used for determining the existence of a cured fraction when the follow-up period is sufficiently long. A long and consistent plateau in this curve could imply the existence of a cured subset of subjects. Cure models expand the scope of survival analysis by incorporating a fraction of cured individuals. The mixture cure (MC) model is the most commonly employed cure rate model, which was first introduced by Boag2 and improved by Berkson and Gage.3 The MC model incorporates two components: one component represents the cured fraction, that is, the fraction of individuals who will never experience the event of interest, having a survival probability of one.
The other component instead captures the susceptible fraction, that is the fraction of individuals for whom the event occurrence is governed by a proper survival distribution, which is often referred to as the “latency distribution.”
In medical and epidemiological studies, it is common to assume the existence of unobserved factors generating heterogeneity among individuals that cannot be explained via the observed covariates. Frailty models can be employed to incorporate and account for unobserved heterogeneity among individuals, to allow for a more accurate modeling of the survival outcome. These models introduce a random frailty term, often assumed to follow a specific distribution, which captures individual-specific characteristics that affect the hazard or risk of the event. Price and Manatunga1 introduced a gamma frailty term into the latency distribution to address the presence of unobserved risks within the MC model. Peng and Zhang4 extended this model by incorporating covariates into both the cure rate and the latency distribution, and they named this model the mixture cure gamma frailty model. Due to the presence of missing variables in cure models, Sy and Taylor5 as well as Peng and Dear6 employed the EM algorithm7 to derive maximum likelihood estimates of the model parameters. Moreover, Peng and Zhang4 studied semi-parametric estimation methods for the mixture cure gamma frailty model using the EM algorithm and a multiple imputation method with low-dimensional covariates. They also examined the identifiability of the general mixture cure frailty model (MCFM) in Peng and Zhang.8 Cai et al.9 developed an R package, called smcure,10 to facilitate the application of semi-parametric estimation techniques to both the proportional hazards MC model and the accelerated failure time MC model. Finally, frailty models have also been used in this context when dealing with correlated data, and particularly to be able to take into account heterogeneity in the presence of recurrent events. For instance, Rondeau et al.11 studied the analysis of recurrent time-to-event data using frailty models with a cured fraction.
The focus of the present article is on situations where the available covariates far outnumber the subjects included in the sample. Thanks to advancements in biomedicine, we can now generate and gather molecular data from diverse modalities, such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics, often resulting in high-dimensional data sets. Due to the volume and complexity of such data, and related challenges and opportunities, more sophisticated methods than the traditional ones are required, especially to deal with variable selection. To this purpose, the use of advanced regularization methods such as the least absolute shrinkage and selection operator (lasso),12 the elastic net,13 the adaptive elastic net,14 and others has become a standard practice for enhancing accuracy and interpretation. The three methods mentioned above are established popular choices for performing simultaneous prediction and variable selection in high-dimensional linear regression problems. These techniques allow discovering meaningful patterns, such as identifying relevant prognostic biomarkers for a certain disease, from high-dimensional molecular data. When it comes to generalized linear models, Friedman et al.15 introduced a novel approach utilizing coordinate descent to compute regularization paths, and also implemented their method in the now very popular R package glmnet.16 Simon et al.17 extended the previous work to Cox models for right-censored data, incorporating elastic net regularization, with a very efficient algorithm that was then integrated into the glmnet R package, enabling the solution of significantly larger problems than previously possible. Tay et al.18 expanded the scope of the elastic net-regularized regression to encompass all generalized linear model families, including Cox models with (start, stop] data and strata, also implemented within the glmnet R package.
Adaptive versions of both the lasso and the elastic net14 have also been developed, and their properties explored under specific conditions. In a discussion article, Bühlmann and Meier19 emphasized the relevance of a multi-step procedure for the adaptive lasso, showing that it allows for sparser models at each step. Furthermore, Xiao and Xu20 introduced the multi-step adaptive elastic-net, showing via extensive numerical exploration that this method effectively reduces the false positives in the variable selection process while preserving estimation accuracy, and implemented it into the msaenet R package.21 For further insights into regularization methods and their applications, interested readers may refer to Bühlmann and van de Geer,22 Hastie et al.23 and James et al.24
In the context of survival data within high-dimensional settings, the traditional Cox PH model has been commonly used. Although there has recently been a growing interest in the application of cure models to high-dimensional data, the existing literature on this subject has remained somewhat limited. Liu et al.25 introduced a variable selection procedure for the semi-parametric MC model using penalties based on the smoothly clipped absolute deviation (SCAD) and lasso. Fan et al.26 explored the minimax concave penalty (MCP), ridge, and lasso penalties, to estimate the MC model in high-dimensional data sets. Masud et al.27 studied variable selection problems for both the mixture and promotion time cure models, employing penalty terms such as lasso and adaptive lasso. Beretta and Heuchenne28 conducted a study on variable selection using SCAD penalties for the semi-parametric PH cure model, considering time-varying covariates, and implemented the method in the penPHcure29 R package. Sun et al.30 developed a variable selection methodology that incorporates lasso, adaptive lasso, and SCAD-type penalties, tailored for the semi-parametric promotion time cure model, and particularly suitable for interval-censored data. Bussy et al.31 proposed a novel statistical model known as C-mix, specifically designed as a mixture-of-experts model for handling censored survival outcomes, to model the patients’ heterogeneity by detecting distinct subgroups, and allowing for high-dimensional covariates by employing an elastic net penalization. They also conducted a comparison of the C-mix model with both the MC and Cox PH models through simulation experiments. Shi et al.32 developed a novel penalty method for the MC model that includes both the MCP and a new penalty term that promotes sign consistency using the covariate effects. Xie and Yu33 proposed the use of neural networks to address inference in the MC model, demonstrating their favorable predictive ability in high-dimensional settings.
Xu et al.34 considered the variable selection problem by adopting penalties such as lasso, adaptive lasso, and SCAD for the generalized odds rate MC model, particularly in the context of interval-censored data. Lastly, Fu et al.35 investigated the Weibull MC model, incorporating generalized monotone incremental forward stagewise (GMIFS)36 and EM algorithm techniques for inference. They applied the EM algorithm to a penalized Weibull MC model, employing a lasso-type penalty. Their findings indicated a superior performance of their penalized MC model as compared to alternative methods in various simulation scenarios. It is noteworthy that all the aforementioned studies on cure models exhibit significant limitations concerning the number of covariates, relative to the sample size, that can be handled by the model, in both simulation experiments and real-data applications, with the only exception being the case studies conducted by Bussy et al.31 and Fu et al.35 Moreover, neither of the latter two studies includes a frailty term in the model.
This study is therefore aimed at developing a first version of the MCFM useful in instances where high-dimensional covariates are potentially associated with both the cure rate and the survival of uncured subjects in the sample. To this purpose, a new penalized EM algorithm is introduced, incorporating adaptive elastic-net penalties specifically tailored for the MCFM. In particular, the model assumes that the distribution of uncured subjects follows a Weibull distribution, and that their survival probability at each time depends on a subset of the high-dimensional set of covariates, which is also a target of inference. To the best of our knowledge, this study represents the first exploration of the MCFM, that is the first attempt at incorporating frailty into the MC model, within high-dimensional settings. Moreover, moving beyond the predominant focus of the existing MCFM literature on low-dimensional settings, our study explores scenarios in which the number of covariates significantly exceeds the sample size in both simulated and real-data applications. The rest of this article is organized as follows: in Section 2, we present the classical MCFM and its extension to high-dimensional settings. In Section 3, we introduce the novel penalized MCFM (penMCFM), with the associated adaptive EM algorithm for inference, in detail. The performances of our proposed methods are compared with some competitors through a comprehensive Monte Carlo simulation study in Section 4. In Section 5, all considered methods are applied to publicly available breast cancer RNA-seq data, with the aim of identifying potential biomarker genes. We also conduct functional enrichment analyses for better interpretation, and determine a prognostic risk score for enhanced validation, of all the identified biomarker genes. Finally, in Section 6 we provide a brief summary and discussion of the study, also mentioning potential future research directions.
The penalized mixture cure frailty model
Classical mixture cure frailty models
Let the random variable $T$ represent the lifetime of interest with survival function denoted by $S(t)$. Let $Y$ be the indicator for a subject eventually ($Y=1$) or never ($Y=0$) experiencing the event of interest, with $\pi = P(Y=1)$ representing the probability of a subject being susceptible (or uncured) for the event of interest. Among the subjects for whom $Y=0$, the survival function is $S(t \mid Y=0) = 1$, and for those who experience the event ($Y=1$), the survival function and the probability density function (pdf) are $S_u(t)$ and $f_u(t)$, respectively. $Y$ is not observed for a censored subject. The population survival function is therefore defined as
$$S_{pop}(t) = 1 - \pi + \pi S_u(t). \quad (1)$$
Note that since $S_{pop}(t) \to 1 - \pi > 0$ as $t \to \infty$, $S_{pop}(t)$ is not a proper survival function. The uncured rate $\pi$ and the survival function of the uncured subjects $S_u(t)$ are also referred to as the incidence and the latency distribution, respectively.
The basic model introduced in (1) can be extended to include the covariates associated with the incidence and latency distributions. Let us denote via $x$ and $z$ the covariates that have an effect on the latency distribution and the incidence, respectively. The model (1) can then be rewritten as
$$S_{pop}(t \mid x, z) = 1 - \pi(z) + \pi(z) S_u(t \mid x), \quad (2)$$
where $\pi(z)$ is the probability of a subject being uncured conditionally on $z$, and $S_u(t \mid x)$ is the survival function of the lifetime distribution of uncured subjects conditionally on $x$. Concerning the modeling of the effect of the covariates on the incidence, as previously proposed in Farewell37 we use a logistic regression model of the form $\pi(z_i) = \exp(z_i^{\top} b) / \{1 + \exp(z_i^{\top} b)\}$, where $Z$ is a covariate matrix, with columns $z_i$, and $b$ is a vector of unknown regression coefficients. When the mixture cure model defined in (2) is specified via proportional hazards, we get the following PH mixture cure model
$$S_{pop}(t \mid x, z) = 1 - \pi(z) + \pi(z) S_0(t)^{\exp(x^{\top} \beta)}, \quad (3)$$
where $S_0(t)$ is the baseline survival function, $X$ is the covariate matrix, with columns $x_i$, and $\beta$ is the vector of unknown regression coefficients for the latency distribution.
In medical and epidemiological studies, frailty models extend the Cox PH model to account for unobservable heterogeneity among individuals. This extension provides a more flexible structure for analysis. A frailty is defined as an unobservable, random, multiplicative factor acting on the hazard function. Let $u_i$ be a non-negative frailty random variable associated to the $i$th subject, $i = 1, \ldots, n$, with cumulative distribution function (cdf) $G(u)$. The hazard function of the $i$th subject with frailty $u_i$ is
$$h(t \mid x_i, u_i) = u_i h_0(t) \exp(x_i^{\top} \beta),$$
where $h_0(t)$ is a baseline hazard function common for all subjects and $x_i$ is the covariate vector for the $i$th subject. If we include the frailty in the latency distribution in model (3), the conditional survival function given the frailty takes the form
$$S_u(t \mid x_i, u_i) = \exp\{- u_i H_0(t) \exp(x_i^{\top} \beta)\},$$
where $H_0(t) = \int_0^t h_0(s) \, ds$ is the baseline cumulative hazard function. Then, the marginal survival function of uncured subjects based on the frailty model is given by
$$S_u(t \mid x_i) = E_{u_i}\left[ S_u(t \mid x_i, u_i) \right] = \mathcal{L}\{ H_0(t) \exp(x_i^{\top} \beta) \},$$
where $\mathcal{L}(s) = E[\exp(-s u)]$ is the Laplace transformation of the frailty $u$. Hence, model (3) with frailty becomes
$$S_{pop}(t \mid x, z) = 1 - \pi(z) + \pi(z) \mathcal{L}\{ H_0(t) \exp(x^{\top} \beta) \}. \quad (4)$$
Model (4) was first introduced by Peng and Zhang4 and called the mixture cure frailty model. Model (4) reduces to the PH mixture cure model in (3) when there is no frailty effect, namely when $\mathrm{Var}(u_i) = 0$ (so that $u_i \equiv 1$), and it reduces to a standard frailty model when no cured fraction exists in the population, namely when $\pi(z) = 1$.
In this study, we consider a fully parametric version of model (4). The Weibull distribution is one of the most commonly used and well-known distributions in survival analysis, recognized for its flexibility in modeling various hazard functions and accurately representing increasing, decreasing, and constant hazard rates. Additionally, because similar mixture cure models in the literature often use the Weibull distribution, employing it in our model facilitates easier comparison with existing studies such as Fu et al.35 Hence, we assume that the baseline of the latency distribution follows a Weibull distribution with $\alpha$ and $\gamma$ being the scale and shape parameters, respectively. The hazard and cumulative hazard functions for the baseline of the latency distribution then become $h_0(t) = \alpha \gamma t^{\gamma - 1}$ and $H_0(t) = \alpha t^{\gamma}$, respectively.
The choice of frailty distribution is crucial for deriving the functional form of our model, as the population survival function in (4) includes the Laplace transform of the frailty distribution. The Gamma distribution is particularly flexible, capable of capturing both increasing and decreasing hazard rates, and it features a closed and simple Laplace transform. Due to these mathematical conveniences, the Gamma distribution is widely used as a frailty distribution in survival analysis (see the nice discussion in the book by Klein and Moeschberger38). Nonetheless, to address identifiability issues associated with the mixture cure frailty model,8 we must fix the mean of the frailty distribution to $1$. The Gamma distribution then allows a convenient reparameterization that adheres to this identifiability constraint. The Gamma is frequently employed as a frailty distribution in the mixture cure model literature, as demonstrated by Price and Manatunga1 and Peng and Zhang.4 For these reasons, and to effectively handle high-dimensional covariates within this framework, using the Gamma distribution as a frailty component is both practical and meaningful. Therefore, we assume that the frailty follows a gamma distribution with mean $1$ and variance $1/\theta$. The Laplace transformation of the frailty is then $\mathcal{L}(s) = (1 + s/\theta)^{-\theta}$. Therefore, given our parametric assumptions, the survival function of model (4) can be rewritten as
$$S_{pop}(t \mid x, z) = 1 - \pi(z) + \pi(z) \left\{ 1 + \frac{\alpha t^{\gamma} \exp(x^{\top} \beta)}{\theta} \right\}^{-\theta}. \quad (5)$$
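As an illustration of the resulting functional form, the population survival function in (5) can be evaluated as in the following Python sketch (all function and argument names are illustrative assumptions, not part of our R implementation):

```python
import numpy as np

def pop_survival(t, x, z, beta, b, alpha, gamma, theta):
    """Population survival of the Weibull mixture cure frailty model:
    1 - pi(z) + pi(z) * (1 + alpha * t^gamma * exp(x'beta) / theta)^(-theta),
    with alpha, gamma the Weibull scale and shape, and theta the inverse
    of the frailty variance (gamma frailty with mean 1)."""
    pi = 1.0 / (1.0 + np.exp(-(z @ b)))            # logistic incidence model
    risk = alpha * t ** gamma * np.exp(x @ beta)   # H_0(t) * exp(x'beta)
    s_u = (1.0 + risk / theta) ** (-theta)         # marginal latency survival
    return 1.0 - pi + pi * s_u
```

As simple sanity checks, the function equals one at $t = 0$, decreases in $t$, and plateaus at the cured fraction $1 - \pi(z)$ as $t$ grows.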
Note that, with the parametric assumptions as detailed above, the hazard function of the latency, that is of the uncured subjects, does not satisfy the PH assumption.4
Mixture cure frailty models with high-dimensional covariates
In the context of medical applications, molecular data of several kinds can be used to better estimate both the cured fraction and the survival of uncured patients. This is particularly relevant in cancer studies, since tissue samples originating from biopsies of the tumor are often available, and these can provide excellent information on the tumor composition and characterization. Such molecular omics data often include genomics, epigenomics, transcriptomics, proteomics, metabolomics and radiomics information, often generated using advanced high-throughput technologies to allow the understanding of complex biological systems and their underlying mechanisms. A common characteristic of all these data layers is their high-dimensionality, meaning that the number of variables included in each omics data layer is much larger than the sample size. In the following, we modify the mixture cure frailty model discussed above so that high-dimensional omics data can be used as covariates in the model.
In this set-up, we consider both covariates $x$ and $z$, related to the latency part and the uncured rate of the model, respectively, as potentially high-dimensional, for example including some kind of omics information. These covariates can also include lower-dimensional information associated with the patients’ demographics or clinical characteristics, such as age, sex, or treatment. This means that the two vectors of variables in $x$ and $z$ could potentially combine very diverse exogenous information, including both clinical variables and high-dimensional omics features, which often show varying degrees of accuracy, redundancy, and noise.
We apply well-known regularization approaches such as the lasso12 and the adaptive elastic net14 to regression models to find the relevant variables from our complete set of omics covariates in both $Z$ and $X$. The high-dimensional covariate matrices $Z$ and $X$ can share features or not, depending on the available domain knowledge, and moreover the low-dimensional covariates (e.g. demographics and clinical information) in the matrices are left unpenalized. In what follows, we will refer to the low- and high-dimensional parts of the covariate matrices via “unpenalized” and “penalized” variables, respectively. For these covariates and for their corresponding regression coefficients, we use the following notations. The covariate matrices $Z$ and $X$ can include both types of unpenalized and penalized variables, which are represented as $Z = (Z_U, Z_P)$ and $X = (X_U, X_P)$, where $Z_U, X_U$ and $Z_P, X_P$ represent the unpenalized and penalized variables in $Z$ and $X$, respectively. The vectors of regression coefficients $b$ and $\beta$ are also split in the same way into $b = (b_0, b_U^{\top}, b_P^{\top})^{\top}$ and $\beta = (\beta_U^{\top}, \beta_P^{\top})^{\top}$, where $b_0$ is the intercept, and $b_U, \beta_U$ and $b_P, \beta_P$ represent the unpenalized and penalized regression coefficients, respectively.
Inference in the penalized mixture cure frailty model
Let the observed data be $O = \{(t_i, \delta_i, z_{U,i}, z_{P,i}, x_{U,i}, x_{P,i}), \ i = 1, \ldots, n\}$, where $t_i$ is the observed survival time for the $i$th subject and $\delta_i$ is an indicator function of censoring, with $\delta_i = 1$ for an uncensored time and $\delta_i = 0$ for a censored time. $z_{U,i}$ and $z_{P,i}$ are respectively the unpenalized and penalized observed covariates associated to the cure part of the model for the $i$th subject, $i = 1, \ldots, n$, while $x_{U,i}$ and $x_{P,i}$ are the same for the survival part. When possible, we will use the compact notation $z_i = (1, z_{U,i}^{\top}, z_{P,i}^{\top})^{\top}$ and $x_i = (x_{U,i}^{\top}, x_{P,i}^{\top})^{\top}$.
The likelihood function for the right-censored observed survival data is
$$L(\Theta; O) = \prod_{i=1}^{n} \left[ \pi(z_i) f_u(t_i \mid x_i) \right]^{\delta_i} \left[ 1 - \pi(z_i) + \pi(z_i) S_u(t_i \mid x_i) \right]^{1 - \delta_i}, \quad (6)$$
where $f_u(\cdot \mid x_i)$ and $S_u(\cdot \mid x_i)$ denote the marginal pdf and survival function of the uncured subjects, $\alpha$ and $\gamma$ are the unknown parameters of the Weibull distribution, $\theta$ is the unknown parameter of the gamma distribution used for the frailty, and $b$ and $\beta$ are vectors of unknown regression coefficients for the covariates $z$ and $x$, respectively, where we indicate $\Theta = (b, \beta, \alpha, \gamma, \theta)$. The maximum likelihood estimator of the unknown parameters can be obtained by direct maximization of the observed likelihood function in (6) when the number of unknown parameters is not large. Since we would like to consider high-dimensional covariates and perform variable selection via penalization, direct maximization of the likelihood function will not provide parameter estimates. We therefore propose to adapt the EM algorithm to include regularization methods for the purposes of parameter estimation and variable selection in penMCFM. Details on the proposed implementation of the EM for penMCFM are given in Section 3.2, after recalling the standard EM for MCFM in Section 3.1. For the sake of comparisons in the simulation studies, we have also adapted the GMIFS method from Ref.36 to obtain parameter estimates in penMCFM using the observed likelihood function in (6), and details on the implementation of this method are given in Section 3.4.
EM algorithm for the standard MCFM
Recall that the cure indicator $Y_i$ is a latent boolean variable such that $Y_i = 1$ if an individual is susceptible (uncured), and $Y_i = 0$ if non-susceptible (cured). It follows from the censoring assumption that if $\delta_i = 1$ then $Y_i = 1$, while if $\delta_i = 0$ then $Y_i$ is not observable; in this case, $Y_i$ is latent and can assume either value (one or zero). Hence, $Y_i$ is only partially observed, which means that we need an EM algorithm to carry out inference on the latent variables.
The conditional survival function of model (5) for the $i$-th subject, given the covariates and the latent value of the frailty $u_i$, is
$$S_{pop}(t \mid z_i, x_i, u_i) = 1 - \pi(z_i) + \pi(z_i) \exp\{- u_i H_0(t) \exp(x_i^{\top} \beta)\},$$
while the conditional survival and hazard functions of the susceptible subjects, that is those such that $Y_i = 1$, are respectively
$$S_u(t \mid x_i, u_i) = \exp\{- u_i H_0(t) \exp(x_i^{\top} \beta)\}$$
and
$$h_u(t \mid x_i, u_i) = u_i h_0(t) \exp(x_i^{\top} \beta).$$
Note that, for the $i$-th subject, we have that $f_u(t \mid x_i, u_i) = h_u(t \mid x_i, u_i) S_u(t \mid x_i, u_i)$. Hence, given the frailty $u_i$, the contribution of the $i$-th subject to the likelihood function is $\pi(z_i) f_u(t_i \mid x_i, u_i)$ when $\delta_i = 1$ and $1 - \pi(z_i) + \pi(z_i) S_u(t_i \mid x_i, u_i)$ when $\delta_i = 0$. Therefore, the complete-data likelihood function corresponding to (6) can be expressed as
$$L_c(\Theta) = \prod_{i=1}^{n} \left[ \pi(z_i)^{Y_i} \{1 - \pi(z_i)\}^{1 - Y_i} \right] \left[ h_u(t_i \mid x_i, u_i)^{\delta_i} S_u(t_i \mid x_i, u_i) \right]^{Y_i} g(u_i; \theta),$$
given the values of the latent outcome $Y_i$ and frailty $u_i$, for $i = 1, \ldots, n$, where $g(u; \theta)$ is the pdf of the gamma distribution. Hence, the corresponding complete-data log-likelihood function $\ell_c(\Theta) = \log L_c(\Theta)$, where $\Theta = (b, \beta, \alpha, \gamma, \theta)$, can be written as the sum of the log-likelihood functions corresponding to the different model parts, and specifically
$$\ell_c(\Theta) = \ell_1(b) + \ell_2(\alpha, \gamma, \beta) + \ell_3(\theta),$$
where
$$\ell_1(b) = \sum_{i=1}^{n} \left[ Y_i \log \pi(z_i) + (1 - Y_i) \log\{1 - \pi(z_i)\} \right],$$
$$\ell_2(\alpha, \gamma, \beta) = \sum_{i=1}^{n} Y_i \left[ \delta_i \left\{ \log u_i + \log h_0(t_i) + x_i^{\top} \beta \right\} - u_i H_0(t_i) \exp(x_i^{\top} \beta) \right],$$
and
$$\ell_3(\theta) = \sum_{i=1}^{n} \log g(u_i; \theta).$$
In the E-step, we evaluate the conditional expectation of the complete-data log-likelihood function with respect to the latent variables $Y_i$'s and $u_i$'s given the parameter estimates $\Theta^{(k)}$ at the $k$-th iteration of the algorithm and given the observed data $O$. We then need to compute the conditional expectations of $Y_i$, $u_i Y_i$ and $Y_i \log u_i$ given $\Theta^{(k)}$ and $O$. These expectations for the $k$-th iteration of the E-step are obtained as follows:
$$w_i^{(k)} = E\left[ Y_i \mid \Theta^{(k)}, O \right] = \delta_i + (1 - \delta_i) \frac{\pi(z_i) S_u(t_i \mid x_i)}{1 - \pi(z_i) + \pi(z_i) S_u(t_i \mid x_i)}, \quad (8)$$
$$E\left[ u_i \mid Y_i = 1, \Theta^{(k)}, O \right] = \frac{\theta + \delta_i}{\theta + H_0(t_i) \exp(x_i^{\top} \beta)}, \quad (9)$$
$$E\left[ u_i Y_i \mid \Theta^{(k)}, O \right] = w_i^{(k)} \, \frac{\theta + \delta_i}{\theta + H_0(t_i) \exp(x_i^{\top} \beta)}, \quad (10)$$
$$E\left[ Y_i \log u_i \mid \Theta^{(k)}, O \right] = w_i^{(k)} \left[ \psi(\theta + \delta_i) - \log\left\{ \theta + H_0(t_i) \exp(x_i^{\top} \beta) \right\} \right], \quad (11)$$
where all quantities on the right-hand sides are evaluated at $\Theta^{(k)}$ and $\psi(\cdot)$ is the digamma function; equations (9) to (11) exploit the fact that, conditionally on $Y_i = 1$ and the observed data, the frailty $u_i$ follows a gamma distribution with shape $\theta + \delta_i$ and rate $\theta + H_0(t_i) \exp(x_i^{\top} \beta)$. In the M-step, the conditional expectation of the complete-data log-likelihood function is maximized with respect to the unknown parameters. For details about the derivation of these expectations, please refer to the relevant calculations presented in Peng and Zhang.4
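For concreteness, the E-step quantities can be sketched as follows (Python for illustration; the function and argument names are our own, and the expressions rely on the standard result that, given $Y_i = 1$, the frailty is conditionally gamma-distributed with shape $\theta + \delta_i$ and rate $\theta + H_0(t_i)\exp(x_i^{\top}\beta)$):

```python
import numpy as np
from scipy.special import digamma

def e_step(t, delta, eta_x, pi_z, alpha, gamma, theta):
    """E-step quantities for the Weibull-gamma mixture cure frailty model.
    t: observed times; delta: event indicators; eta_x = exp(x_i'beta);
    pi_z = P(Y_i = 1 | z_i). Returns w_i = E[Y_i | data] and the conditional
    frailty moments E[u_i | Y_i = 1, data] and E[log u_i | Y_i = 1, data]."""
    H = alpha * t ** gamma * eta_x                 # H_0(t_i) * exp(x_i'beta)
    s_u = (1.0 + H / theta) ** (-theta)            # marginal latency survival
    # events are certainly uncured; censored subjects get a posterior weight
    w = np.where(delta == 1, 1.0,
                 pi_z * s_u / (1.0 - pi_z + pi_z * s_u))
    e_u = (theta + delta) / (theta + H)            # posterior gamma mean
    e_logu = digamma(theta + delta) - np.log(theta + H)
    return w, e_u, e_logu
```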
EM algorithm for penMCFM
When the covariate vectors $x$ and $z$ are high-dimensional, variable selection is also a purpose of the analysis, together with the estimation of the unknown parameters and regression coefficients. To this aim, we consider a penalized version of the complete-data likelihood function, where a penalty term is added for the coefficients $b_P$ and $\beta_P$ corresponding to the penalized part of the high-dimensional covariates $Z_P$ and $X_P$. Then, the penalized complete-data log-likelihood function takes the form
$$\ell_c^{pen}(\Theta) = \ell_c(\Theta) - p_{\lambda_1}(b_P) - p_{\lambda_2}(\beta_P), \quad (7)$$
where, from the notation introduced at the beginning of Section 3, $q$ and $p$ are the numbers of covariates included in the covariate vectors $z_{P,i}$ and $x_{P,i}$, respectively, and $p_{\lambda_j}(\cdot)$ is a penalty function depending on a vector of penalty parameters $\lambda_j$, for $j = 1, 2$.
In the E-step, the conditional expectations of the penalized complete-data log-likelihood function with respect to the latent variables $Y_i$'s and $u_i$'s are the same as in equations (8) to (11), which are therefore used at the $k$-th iteration of the E-step in penMCFM. Then, the conditional expectation of the penalized complete-data log-likelihood function at the $k$-th step of the algorithm is $Q^{pen}(\Theta \mid \Theta^{(k)}) = Q_1^{pen}(b) + Q_2^{pen}(\alpha, \gamma, \beta) + Q_3(\theta)$, where
$$Q_1^{pen}(b) = \sum_{i=1}^{n} \left[ w_i^{(k)} \log \pi(z_i) + (1 - w_i^{(k)}) \log\{1 - \pi(z_i)\} \right] - p_{\lambda_1}(b_P), \quad (12)$$
$$Q_2^{pen}(\alpha, \gamma, \beta) = \sum_{i=1}^{n} \left[ \delta_i \left\{ E^{(k)}[\log u_i] + \log h_0(t_i) + x_i^{\top} \beta \right\} - E^{(k)}[u_i Y_i] \, H_0(t_i) \exp(x_i^{\top} \beta) \right] - p_{\lambda_2}(\beta_P), \quad (13)$$
$$Q_3(\theta) = \sum_{i=1}^{n} \left[ \theta \log \theta - \log \Gamma(\theta) + (\theta - 1) E^{(k)}[\log u_i] - \theta E^{(k)}[u_i] \right]. \quad (14)$$
We focus on the multi-step adaptive elastic-net penalty described by Xiao and Xu,20 which aims to achieve increased sparsity while concurrently reducing false positives in the variable selection process, and we adapt it to the case of penMCFM. When applying such elastic-net penalization for the high-dimensional covariates corresponding to $b_P$ and $\beta_P$ in (12) and (13), the penalty components of the penalized complete-data log-likelihood function at the $k$th iteration of the E-step become
$$p_{\lambda_1}(b_P) = \lambda_1 \sum_{j=1}^{q} \hat{w}_{1,j} \left\{ \eta \, |b_{P,j}| + \frac{1 - \eta}{2} \, b_{P,j}^2 \right\},$$
$$p_{\lambda_2}(\beta_P) = \lambda_2 \sum_{j=1}^{p} \hat{w}_{2,j} \left\{ \eta \, |\beta_{P,j}| + \frac{1 - \eta}{2} \, \beta_{P,j}^2 \right\},$$
where $\lambda_1$ and $\lambda_2$ are the tuning parameters controlling the amount of penalty, $\eta \in [0, 1]$ is a higher level hyperparameter balancing the lasso and ridge components, and $\hat{w}_{\cdot,j}$ is the data-driven weighting parameter corresponding to the $j$th variable in the model (referred to as “adaptive weights” in the following). The tuning parameter values can be identified through cross-validation, and choices related to the tuning of these parameters are discussed in the following section. We update the adaptive weights $M$ times as
$$\hat{w}_{1,j}^{(m)} = \left( \left| \hat{b}_{P,j}^{(m-1)} \right| + 1/n \right)^{-1}$$
and
$$\hat{w}_{2,j}^{(m)} = \left( \left| \hat{\beta}_{P,j}^{(m-1)} \right| + 1/n \right)^{-1},$$
for $m = 1, \ldots, M$, with $j = 1, \ldots, q$ and $j = 1, \ldots, p$, respectively. When $m = 0$, the adaptive weights equal $1$ for all variables, and this case is equivalent to the regular elastic-net method. When $M \geq 1$, the adaptive elastic-net method is used. The estimates of $b$ and $\beta$ at the $k$th iteration of the E-step are obtained by minimizing the negative penalized complete-data log-likelihood functions, that is respectively:
$$\hat{b}^{(k)} = \operatorname{arg\,min}_{b} \left\{ - Q_1^{pen}(b) \right\} \quad (15)$$
and
$$\hat{\beta}^{(k)} = \operatorname{arg\,min}_{\beta} \left\{ - Q_2^{pen}(\alpha, \gamma, \beta) \right\}. \quad (16)$$
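A minimal numerical sketch of the adaptive weight update and of the resulting penalty term follows (Python for illustration; the small constant inside the weight, added to avoid division by zero, and the unit exponent are implementation assumptions rather than prescriptions from msaenet):

```python
import numpy as np

def update_weights(coef_prev, n, power=1.0):
    """Adaptive weights from the previous step's estimates: variables shrunk
    to (near) zero get large weights, hence heavier penalization next step."""
    return (np.abs(coef_prev) + 1.0 / n) ** (-power)

def aenet_penalty(coef, lam, eta, weights):
    """Adaptive elastic-net penalty:
    lam * sum_j weights_j * (eta * |coef_j| + (1 - eta)/2 * coef_j^2)."""
    return lam * np.sum(weights * (eta * np.abs(coef)
                                   + 0.5 * (1.0 - eta) * coef ** 2))
```

With all weights equal to one the penalty reduces to the regular elastic net, and with $\eta = 1$ to a weighted lasso.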
At the $k$th iteration of the M-step, the maximization of $Q^{pen}(\Theta \mid \Theta^{(k)})$ is equivalent to maximizing $Q_1^{pen}$, $Q_2^{pen}$ and $Q_3$ separately to obtain the updated parameter estimates $\Theta^{(k+1)} = (b^{(k+1)}, \beta^{(k+1)}, \alpha^{(k+1)}, \gamma^{(k+1)}, \theta^{(k+1)})$.
Implementation of the penMCFM algorithm
At the $k$th iteration of the E-step, $Q_1^{pen}$ in (12) has the same structure as the penalized log-likelihood function of a logistic regression model, except that the responses are given by the weights $w_i^{(k)}$. As evident from the definition in (8), $w_i^{(k)}$ is not constrained to binary values, but can take any value in the interval $[0, 1]$. It is important to note that these response values are not directly observable within this framework, in contrast to the typical situation in penalized logistic regression problems. Consequently, optimizing $Q_1^{pen}$ poses more challenges compared to solving a conventional penalization problem. This challenge has been recently highlighted by Bussy et al.31 and Fu et al.,35 and implies the impossibility to directly employ the glmnet R package16 to address the high-dimensional component of our problem. The second term, $Q_2^{pen}$ in (13), can be considered as the log-likelihood function of a penalized Weibull regression model, with the frailty expectation $E^{(k)}[u_i Y_i]$ entering as a constant that is evaluated by using (11) before the optimization steps of the algorithm. We employ the lbfgs R package,39 which provides the functionalities to utilize $L_1$ penalties for the minimization of the cost functions outlined in (15) and (16). Finally the third term, $Q_3$ in (14), is maximized to obtain the estimate of the frailty distribution parameter $\theta$ (or its reciprocal, i.e. the variance).
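The fractional-response logistic step can be illustrated as follows. The sketch below (Python; our actual implementation uses the lbfgs R package instead) minimizes the negative weighted logistic log-likelihood with responses $w_i \in [0, 1]$ plus an elastic-net penalty, using a smooth approximation of the absolute value so that a plain quasi-Newton routine applies; all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def fit_fractional_logistic(Z, w, lam, eta, pen_w):
    """Penalized logistic regression with fractional responses w_i in [0, 1].
    Minimizes -sum_i [w_i * lin_i - log(1 + exp(lin_i))] + penalty, where
    lin_i = b0 + Z_i'b; the intercept b0 is left unpenalized."""
    n, p = Z.shape

    def objective(coef):
        b0, bvec = coef[0], coef[1:]
        lin = b0 + Z @ bvec
        loglik = np.sum(w * lin - np.logaddexp(0.0, lin))
        # smooth |b| approximation keeps the objective differentiable
        pen = lam * np.sum(pen_w * (eta * np.sqrt(bvec ** 2 + 1e-8)
                                    + 0.5 * (1.0 - eta) * bvec ** 2))
        return -loglik + pen

    res = minimize(objective, np.zeros(p + 1), method="L-BFGS-B")
    return res.x
```

For instance, with constant responses $w_i = 0.8$ and pure-noise covariates, the fitted intercept approaches $\log(0.8/0.2)$ while the penalized slopes shrink towards zero.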
We give a pseudo-code description of our EM algorithm for penMCFM in Algorithm 1. Initialization is carried out as follows: we use Cox regression estimates for $\beta$, moment estimates for the Weibull distribution parameters $\alpha$ and $\gamma$, and fixed starting values for the remaining parameters. For the sake of simplicity, it is generally assumed that the tuning parameters for both penalties are common, that is $\lambda_1 = \lambda_2 = \lambda$. We have followed this approach in our simulations as well. We then generate a sequence of $\lambda$ values based on the penalized Cox regression of the observed data using the glmnet R package.16
GMIFS method for penMCFM
The GMIFS method is a variable selection technique that iteratively enhances the predictive power of a model by progressively adding variables, thus it can be adapted to obtain parameter estimations in penMCFM. Unlike traditional forward selection methods, GMIFS enforces a monotonic constraint on the coefficients of the added variables, ensuring that they either remain constant or increase with each iteration. By sequentially introducing variables with controlled increases in their coefficients, GMIFS strikes a balance between model sparsity and predictive accuracy. This method was first introduced by Hastie et al.36 for solving penalized least squares regression problems. Since then, Makowski and Archer40 and Hou and Archer41 provided extensions to Poisson regression and ordinal responses in the high-dimensional setting, Yu et al.42 to variable selection for nonlinear regression, and Fu et al.35 to the Weibull MC model.
In the simulation studies, with the aim of comparing our EM-based method for penMCFM not only with the penalized Weibull MC model introduced in Fu et al.,35 but also with another variable selection algorithm, the GMIFS method is adapted so as to target our observed likelihood function in (6). For a comprehensive introduction to this method, readers are encouraged to refer to Hastie et al.36 and Fu et al.,35 and for further details on the implementation, to check our code at https://github.com/fatihki/penMCFM.
Simulation study
In this section, we describe the results of extensive simulation studies that we have conducted to evaluate the performance of our proposed model. The evaluation of model performance involves assessing both the accuracy of the model predictions, and its capability to correctly identify covariates with true non-zero coefficients. We examine different scenarios involving varying censoring rates, cure rates, correlation among covariates, and number of nonzero coefficients.
For the sake of performance comparisons in the simulation studies, besides our proposed penMCFM EM-based algorithm (detailed in Sections 3.2-3.3), which we refer to as penMCFM(EM), we also implemented a version of the penMCFM algorithm where we incorporate the GMIFS method (detailed in Section 3.4), named penMCFM(GMIFS), and the GMIFS-based penalized Weibull MC model from Fu et al.,35 named MCM(GMIFS). We are also interested in exploring the performance of a survival model that does not account for a cure fraction. To this aim, the Cox PH model with lasso penalty is employed by using the glmnet16 R package based on the observed data, where the tuning parameter that gives the most regularized model such that the cross-validated error is within one standard error of the minimum is chosen. This value is referred to as “lambda.1se” in Friedman et al.,16 and thus we refer to this method as penCox.1se. Note that this method only considers the latency part of our model, namely the part related to the latency regression coefficients $\beta$.
The R code for implementing the proposed estimation algorithms and for replicating the simulation studies is available at https://github.com/fatihki/penMCFM.
Data generation under the Weibull MCFM
In order to generate random samples from a Weibull MCFM with right-censoring, cure rate and covariates, we follow the procedure outlined in Asano et al.43 This method can be regarded as a modification of the inverse sampling technique, particularly suited for situations where the data generating process is assumed to have an improper survival function. From the population survival function in (5) we can derive the inverse function for the cdf of the population as
where . The pseudo-code description of a random sample generation for the Weibull MCFM model is detailed in Algorithm 2.
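The inverse-sampling idea of Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming a gamma frailty with variance theta marginalized into the uncured survival S_u(t) = (1 + theta (t/scale)^shape e^eta)^(-1/theta) and independent exponential censoring; the exact parametrization of (5) and of Algorithm 2 may differ:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_mcfm(n, shape, scale, theta, p_uncured, eta, cens_mean):
    """Inverse sampling from a Weibull mixture cure frailty model with
    right-censoring (cf. Algorithm 2). eta is the latency linear
    predictor; assumed marginal uncured survival as in the lead-in."""
    cured = rng.random(n) >= p_uncured   # cure indicators
    u = rng.random(n)
    # invert S_u(t) = u:  t = scale * ((u**(-theta) - 1) / (theta * e^eta))**(1/shape)
    t_event = scale * ((u ** (-theta) - 1.0) / (theta * np.exp(eta))) ** (1.0 / shape)
    t_event[cured] = np.inf              # cured subjects never experience the event
    c = rng.exponential(cens_mean, n)    # independent exponential censoring
    return np.minimum(t_event, c), (t_event <= c).astype(int)
```

Setting the cured event times to infinity reproduces the improper population survival function: cured subjects always end up censored.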
Simulation design
We carry out a large-scale simulation study to test the novel high-dimensional setup of our model. Our data generating process can be described as follows: we set the number of observations to , and the penalized covariate sizes as . The sparsity levels of and are set to be , meaning that is the number of non-zero coefficients (thus, 2% sparsity in this setting). We randomly split the dataset into a training and a testing set to assess performance, and implement a -fold cross-validation on the training set to tune the model parameters. The parameter is tuned over a grid of values on a log scale in the interval , using the same sequence of values as used for penalized Cox regression in the glmnet R package. The parameter is chosen among the values and .
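The glmnet-style log-scale lambda grid mentioned above can be sketched as follows; a hedged illustration of the convention (largest lambda chosen so that all lasso coefficients are zero, then a geometric decrease), not the paper's exact sequence:

```python
import numpy as np

def lambda_grid(X, y, n_lambda=100, eps=0.01):
    """Log-spaced lambda sequence in the style of glmnet: lambda_max is
    the smallest lasso penalty that zeroes every coefficient (for
    centered y), and the grid decreases geometrically to eps * lambda_max."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n
    return np.geomspace(lam_max, eps * lam_max, n_lambda)
```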
A categorical variable with three levels is used as the only unpenalized covariate with , generated with weight probabilities . The unpenalized covariates associated with the survival part of the model are i.i.d. standard normal, and . The penalized covariates and are assumed to be common, and generated from a -dimensional Gaussian distribution, where is a block-diagonal matrix with block size , and with the correlation between any pair of covariates within the same block defined by , ; the tested values for are . The unpenalized regression coefficients are set as , while is randomly generated from a uniform distribution on the interval . Concerning the penalized regression coefficients, we consider five different settings with respect to their nonzero coefficients. In each setting, components of both and are nonzero and take the same value, which we refer to as the "signal." The signals occur at roughly equally spaced indices between and , while the remaining coefficients are equal to . Hence, the non-zero coefficients are distributed equally across blocks, specifically one non-zero coefficient per block. Five different signal values are considered, namely , to test the model as the signal in the data becomes progressively weaker.
Finally, we set the Weibull distribution parameters as , the frailty distribution parameter and for generating the censoring times. After generating all model coefficients and covariates according to these assumptions, we generate the data using Algorithm 2. This process is repeated times, and the average results across all runs of the EM on the datasets so obtained are subsequently reported. Notably, the average censoring (cure) rates in our simulated samples are approximately for the respective values chosen in
Performance evaluation
To compare the performances of the various models considered in our simulation studies, we consider several performance metrics. For selecting the tuning parameter , we use -fold cross-validation on a revised version of the concordance index defined in (17) below, and also used in Fu et al.35 Harrell’s concordance index (or C-statistic, or C-index) is widely used as a measure of performance in fitted survival models for censored data. The C-statistic for a standard survival model is the proportion of concordant pairs out of the total number of possible evaluation pairs, given by
where , and is the estimated vector of coefficients. Asano and Hirakawa44 noted that the value of is the same for both cured patients and censored uncured patients, and therefore proposed a modified C-statistic that addresses this issue by introducing a cure status weighting for each patient. The weight is defined as for the uncured patients who experienced the event, for the cured patients, and is equal to the estimate of the uncured probability for the censored patients. Then, the C-statistic is defined by Asano and Hirakawa44 as
where is an indicator function taking values if or , and if is missing, and where . Notice that in (17) reduces to when and for all patients.
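The pair-weighted structure of this cure-adjusted C-statistic can be sketched as follows; an illustrative implementation assuming each comparable pair contributes the product of the two patients' cure-status weights (ties ignored for brevity), not the paper's exact estimator in (17):

```python
import numpy as np

def weighted_c_statistic(time, status, eta, weight):
    """Cure-adjusted C-statistic in the spirit of Asano and Hirakawa:
    comparable pairs (i, j) have an observed event for i before time j,
    each pair is weighted by weight[i] * weight[j], and concordance
    means a larger risk score eta for the earlier failure."""
    num = den = 0.0
    n = len(time)
    for i in range(n):
        if status[i] != 1:          # i must have an observed event
            continue
        for j in range(n):
            if time[i] < time[j]:   # comparable pair
                w = weight[i] * weight[j]
                den += w
                if eta[i] > eta[j]:
                    num += w
    return num / den if den > 0 else np.nan
```

With all weights equal to one, this reduces to Harrell's C-statistic, mirroring the reduction noted after (17).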
Moreover, we evaluate the performance of the variable selection part of the model by computing the associated sensitivity (the percentage of non-zero coefficients correctly estimated as non-zero), specificity (the percentage of zero coefficients correctly estimated as zero), and false positive rate (FPR, the proportion of zero coefficients among those estimated as non-zero, that is, the proportion of irrelevant variables mistakenly identified as significant). We also evaluate the model prediction performance via the relative model error (RME), and the model estimation performance via the estimation error (ERR). For the vector of regression coefficients , these measures are defined as
where is the covariance matrix of the covariates, is the true vector of coefficients, and is the oracle estimate of the coefficients, derived from the model in which only the true signals are included and the coefficients of the other covariates are forced to zero.
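These evaluation measures can be sketched as follows; a hedged illustration assuming the usual definitions ME(b) = (b - beta_true)' Sigma (b - beta_true) for the model error and the L2 norm for the estimation error:

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """Sensitivity, specificity and FPR of the selected support
    (nonzero estimated coefficients vs. true nonzeros)."""
    sel = beta_hat != 0
    true = beta_true != 0
    tp = np.sum(sel & true); fp = np.sum(sel & ~true)
    fn = np.sum(~sel & true); tn = np.sum(~sel & ~true)
    return tp / (tp + fn), tn / (tn + fp), fp / (fp + tn)

def estimation_errors(beta_hat, beta_oracle, beta_true, sigma):
    """RME = ME(beta_hat) / ME(beta_oracle) with the quadratic model
    error above, and ERR = ||beta_hat - beta_true||_2; assumed forms."""
    def me(b):
        d = b - beta_true
        return float(d @ sigma @ d)
    return me(beta_hat) / me(beta_oracle), float(np.linalg.norm(beta_hat - beta_true))
```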
Furthermore, according to the model, the probability of being cured is sample-specific: this means that, even when two patients share identical clinical characteristics, their unique genomic profiles can influence the cure probability in different ways. Recall that the uncured probability for the th subject is computed by using the formula , for given coefficients and for the th subject covariates , . Therefore, the average true uncured probability can be computed as over Monte Carlo runs of the EM algorithm, and over the subjects in the sample, where is the true uncured probability for the th subject in the th Monte Carlo simulation, with being the total number of simulated datasets. Consequently, we can also evaluate the accuracy of model estimates in terms of bias and mean squared error (MSE) of the estimated uncured probability , which are calculated as
and
where and are respectively the true uncured probability and its estimate for the -th subject and the -th Monte Carlo run, similarly to what is reported in Pal et al.45
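The bias and MSE of the estimated uncured probabilities can be sketched as follows; a minimal illustration assuming a logistic incidence link for the uncured probability (the paper's link function in the incidence part may differ):

```python
import numpy as np

def uncured_prob(Z, b):
    """Assumed logistic incidence model: P(uncured | z) = expit(z' b)."""
    return 1.0 / (1.0 + np.exp(-(Z @ b)))

def bias_and_mse(pi_true, pi_hat):
    """Average bias and MSE of the estimated uncured probabilities over
    M Monte Carlo runs and n subjects; pi_true, pi_hat are (M, n) arrays."""
    bias = np.mean(pi_hat - pi_true)
    mse = np.mean((pi_hat - pi_true) ** 2)
    return bias, mse
```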
Simulation results
We report here the simulation results for each scenario specified in the simulation design of Section 4.2, over simulated datasets, using the different performance measures introduced in Section 4.3. We plot the number of non-zero regression coefficients, and the sensitivity, specificity, and FPR, to show the variable selection performance for both sets of regression coefficients and in Figures 1 and 2, respectively. The C-statistics plots for both training and testing data are shown in Figure 3. Here the regular C-statistic, namely , is used for the penCox.1se method (which does not include a cure part), while is used for all other methods. The performance of the uncured rate estimates is shown in Figure 4, where the absolute bias and MSE are reported. Finally, we present results illustrating the prediction and estimation performance of the models, as well as the cure rates, by reporting the RME, ERR, and in Table 1, for the case in which the data are generated with . Corresponding tables for the cases of and are included in Tables S1 and S2 of the Supplementary Material. Additionally, we investigate the uncertainty of the non-zero regression coefficients and the uncured rate. We compute the empirical standard error of the parameter estimates of the non-zero coefficients for both and , as well as , across the repeatedly simulated datasets. We present the average estimates of these parameters, accompanied by error bars of width one standard error, for the cases of and , in Figures S1-S4 of the Supplementary Material.
Results of simulation studies. Variable selection performance in inference for .
Results of simulation studies. Variable selection performance in inference for .
Results of simulation studies. C-statistics plot based on the train (top row) and test (bottom row) data.
Results of simulation studies. Absolute bias and MSE plots for the uncured rate estimates.
Each panel in the figures shows the average of a given metric over repetitions for the respective methods under consideration. In the table, mean results of the different methods are listed along with the standard deviation (in parentheses). It is worth noting that the GMIFS method is independent of the parameter, while penCox.1se only includes the lasso case for the penalized Cox regression; thus, results associated with these two methods remain unchanged regardless of the values.
To assess the accuracy of variable selection, it is vital to consider the sensitivity, specificity, and FPR plots in combination. Figures 1 and 2 reveal that the proposed penMCFM(EM) method exhibits very similar average results when choosing equal to either value across all metrics; moreover, these figures also show that penMCFM(EM) is the method providing the best overall results across all simulated scenarios and considered metrics. Notably, when looking specifically at Figure 1, the penMCFM(GMIFS) method consistently selects the fewest variables, often below the true value of , while conversely the MCM model version, MCM(GMIFS), chooses the largest number of variables compared to the other methods. These behaviors are clearly reflected in the often suboptimal sensitivity, specificity and FPR values attained by these two methods. The penCox.1se method, which utilizes penalized Cox regression with a lasso penalty, ranks second with regard to the number of selected variables in Figure 1, thus showing large sensitivity but suboptimal specificity and FPR. If we compare the variable selection performance of our proposed penMCFM(EM) method when choosing and penCox.1se, they show similar sensitivity plots, although penMCFM(EM) selects fewer variables than penCox.1se. Moreover, penMCFM(EM) attains better performance across values than penCox.1se in terms of specificity and FPR across all and settings. The MCM(GMIFS) method demonstrates better sensitivity performance for (Figure 2) as compared to (Figure 1). However, similarly to what is shown for in Figure 1, penMCFM(EM) consistently outperforms the other methods across all values also when estimating (Figure 2), as it selects the fewest variables and frequently approaches the true value of more closely than the other methods, while maintaining large sensitivity and specificity, and exhibiting low FPR.
The overall performance of the methods in terms of variable selection generally deteriorates when the signal becomes weaker, specifically when , and improves as the signal magnitude increases, as observed in both figures. The effect of the signal parameter is particularly apparent in the number of selected variables and in the sensitivity. On the other hand, while the other methods show slightly deteriorating performance as the correlation among variables increases, the impact of the correlation parameter on the performance of penMCFM(EM) is hardly discernible. We therefore conclude that the penMCFM(EM) method, especially when choosing or 1, attains better variable selection performance than the other methods in terms of the true number of selected variables, high sensitivity and specificity, and low FPR.
From inspection of Figure 3, we note that performance in terms of C-statistics is generally slightly better on the testing set than on the training set for all methods, as previously observed in a similar model setting by Fu et al.35 As the signal magnitude increases, the penMCFM(EM) and penCox.1se methods, as expected, tend to attain larger C-statistic values for both the train and test data. penCox.1se generally shows the best performance except for the cases and , where its performance deteriorates more than that of the other methods. penMCFM(EM) generally shows larger C-statistic values when choosing as compared to , and with respect to penMCFM(GMIFS) (except for on the test data). Even though the C-statistic values attained by MCM(GMIFS) tend to be larger than those attained by penMCFM(EM), the performance of the two methods becomes more and more similar as the signal magnitude increases, both on the training and on the testing datasets. It is also worth noting that, when calculating the C-statistic for a cure model, in (17), both coefficient vectors and enter the calculations, which plausibly contributes to the larger values attained by MCM(GMIFS) as compared to penMCFM(EM). Nonetheless, the MCM(GMIFS) method consistently exhibits the highest count of nonzero variables for both and in all scenarios (Figures 1 and 2, respectively), thus showing worse variable selection performance. To summarize, we can conclude that all methods generally show acceptable and comparable C-statistic values for almost all considered scenarios. We thus give more importance to the evaluation of the variable selection performance, which discriminates more among the considered methods, and which has a much more crucial impact on the reliable interpretation of the model outputs.
Finally, Figure 4 shows that penMCFM(EM), when choosing and , generally yields the first- and second-lowest average absolute bias and MSE values of , respectively. penMCFM(GMIFS) exhibits the least favorable performance among all methods. Additionally, it is worth noting that the penMCFM(EM) method exhibits consistent behavior across varying correlation sizes for each value.
From Table 1, it can be observed that our proposed method penMCFM(EM) outperforms other methods in terms of the RME and ERR metrics, except for the most difficult case when . In almost all scenarios, penMCFM(EM) with shows the best prediction and estimation performance for the regression coefficients in the cure part, and therefore we observe similar performance for the estimation of .
Application to RNA-Seq data from TCGA-BRCA
In this section we utilize publicly available omics data from The Cancer Genome Atlas (TCGA) project1 for showcasing the use of the proposed penMCFM model on a real case application. The overall survival time, demographic and gene expression data from primary invasive BReast CAncer patients in the TCGA database (TCGA-BRCA) were obtained from the Genomic Data Commons Data Portal data release v33.0. We retrieve the RNA-Seq data from the primary tumor of TCGA-BRCA patients, together with the accompanying metadata comprising survival outcomes, as well as clinical and demographic variables. We also use the BCR Biotab files to gather some additional clinical variables related to hormone status, such as estrogen receptor (ER), human epidermal growth receptor 2 (HER2) and progesterone receptor (PR), for the same TCGA-BRCA patients.
The original data set comprises samples with gene expression features, and clinical and demographic variables. We specifically consider protein-coding genes, and after eliminating duplicate genes we are left with RNA-seq features. We then remove observations showing a survival time of less than one month, leaving samples before the processing of the clinical variables. We use years as the unit for the survival time in this analysis. Before further statistical analysis, we apply the DESeq2 normalization approach for RNA-seq data as implemented in the R/Bioconductor package DESeq2,46 as suggested by Zhao et al.47,48
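The DESeq2 normalization is based on median-of-ratios size factors; the R/Bioconductor package is the reference implementation, and the following numpy sketch only illustrates the underlying idea:

```python
import numpy as np

def median_of_ratios(counts):
    """Median-of-ratios normalization as in DESeq2: each sample's size
    factor is the median, over genes expressed in every sample, of the
    ratio of its count to the gene's geometric mean across samples.
    counts: (genes, samples) array of raw counts."""
    counts = np.asarray(counts, dtype=float)
    keep = np.all(counts > 0, axis=1)                 # genes with nonzero counts everywhere
    log_geo_mean = np.mean(np.log(counts[keep]), axis=1)
    log_ratios = np.log(counts[keep]) - log_geo_mean[:, None]
    size_factors = np.exp(np.median(log_ratios, axis=0))
    return counts / size_factors                      # normalized counts
```

Dividing by the size factors removes sample-specific sequencing-depth effects while leaving relative expression differences between genes intact.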
For data preprocessing, we reduce the RNA-seq features using the variance filter method implemented in the M3C R package,49 thus restricting the analysis to the top RNA-seq features, which explain around of the variation in the whole data. Since prior knowledge of BRCA plays a crucial role in the identification of potential prognostic biomarkers, we consolidate genes of interest from five distinct sources of prior knowledge, as in Li and Liu.50,51 This selection includes:
the genes related to BRCA from the top ranked gene ontology (GO) terms sorted in gene ontology annotations (GOA),53
the genes known as the “MammaPrint” BRCA signatures,54
the genes collected by the online consensus survival analysis web server for breast cancers (OSbrca) previously published BRCA biomarkers,55
the BRCA prognosis signatures selected by the scPrognosis method using single-cell RNA sequencing (scRNA-seq) data.56
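The preprocessing described above, a variance filter combined with prior-knowledge gene lists, can be sketched as follows; an illustrative Python function (the paper uses the M3C R package), with hypothetical gene names:

```python
import numpy as np

def variance_filter(expr, gene_names, top_k, prior_genes=()):
    """Keep the top_k most variable genes (the idea behind the M3C
    variance filter) and union them with prior-knowledge genes present
    in the data. expr: (genes, samples) expression matrix. Also returns
    the fraction of total variance explained by the top_k genes."""
    var = np.var(expr, axis=1)
    order = np.argsort(var)[::-1]                    # most variable first
    top = {gene_names[i] for i in order[:top_k]}
    kept = sorted(top | (set(prior_genes) & set(gene_names)))
    explained = var[order[:top_k]].sum() / var.sum()
    return kept, explained
```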
When we take the intersection of these sources with our already reduced set of RNA-seq features, and combine them with the top features, we obtain a total of RNA-seq features. Lastly, we filter the samples based on the clinical variables listed in Table 2, resulting in a dataset of observations with around censoring rate. The remaining portion of the RNA-seq data, which includes observations and has around censoring rate, will be used for validating the identified genes. Even though the censoring rate is slightly lower on this validation set (as compared to the training and testing sets), filtering on the clinical variables makes this validation dataset as homogeneous as possible to the training and testing sets with respect to the covariates, which is crucial when calculating the Prognostic Risk Score.
Demographic and clinical characteristics of TCGA-BRCA patients.

| Clinical outcome | Mean | Range | Event rate |
|---|---|---|---|
| Overall survival (year) | 3.19 | | 0.109 |

| Clinical covariate | Level | n |
|---|---|---|
| PAM50 subtypes | Basal | 144 |
| | Her2 | 58 |
| | LumA | 432 |
| | LumB | 164 |
| | Normal | 30 |
| Pathological stage (PS) | Stage I | 151 |
| | Stage II | 478 |
| | Stage III | 180 |
| | Stage IV | 14 |
| | Stage X | 5 |
| Lymph nodes (LN) | N0 | 402 |
| | N1 | 273 |
| | N2 | 87 |
| | N3 | 56 |
| | NX | 10 |
| Tumor size (TS) | T1 | 221 |
| | T2 | 481 |
| | T3 | 101 |
| | T4 | 24 |
| | TX | 1 |
| Metastases stage (MS) | M0 | 706 |
| | M1 | 14 |
| | MX | 108 |
| HER2 status | Equivocal | 172 |
| | Positive | 143 |
| | Negative | 513 |
| ER status | Positive | 648 |
| | Negative | 180 |
| PR status | Positive | 567 |
| | Negative | 261 |
| Pharmaceutical therapy (PT) | Yes | 669 |
| | No | 159 |
| Radiation therapy (RT) | Yes | 451 |
| | No | 377 |
| Age (year)1 | 58.79 | |

1 Mean and range.
We consider two real data scenarios for the choice of the penalized covariates and , which are the same for the cure and the survival part: Scenario , where the penalized covariates only include the top RNA-seq features after the variance filtering, and Scenario , which is built upon the larger set of RNA-seq features, combining the same top RNA-seq features as in Scenario with the known genes related to BRCA. In both scenarios, we employ the same set of unpenalized covariates and . Unpenalized covariates in include the following clinical variables: ER, PR, HER2, PT, RT, Age, while includes PAM50, MS, LN, TS, PS, ER, PR, HER2 (as in De Bin et al.,57 Volkmann et al.58 and Li and Liu50).
Identification of biomarkers
To identify the relevant features (biomarkers), we initially partition the dataset randomly into a training and a testing set. In order to select the tuning parameter , we employ a -fold cross-validation on the training dataset to obtain the optimal that maximizes the C-statistic ( for the penCox.1se method and for all other methods). We additionally tune the parameter, considering values in the set , as the one maximizing the C-statistic on the testing dataset when employing the parameter selected from the training dataset. We then repeat this random splitting (and subsequent parameter tuning) times, obtaining several new training and testing datasets. Such an approach is crucial to ensure the robustness of feature selection, as the selected features are consolidated as the union of the genes associated with non-zero coefficients across all experiments.
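The consolidation over repeated random splits can be sketched as follows; a minimal illustration in which each run yields a set of selected gene names (the gene names in the usage example are purely illustrative):

```python
def consolidate_selected(selection_runs):
    """Consolidate variable selection over repeated random train/test
    splits: the final set is the union of genes with nonzero coefficients
    in any run, with per-gene selection frequencies as a stability
    measure. selection_runs: list of sets of selected gene names."""
    freq = {}
    for run in selection_runs:
        for gene in run:
            freq[gene] = freq.get(gene, 0) + 1
    return sorted(freq), freq
```

Reporting the selection frequency alongside the union makes it easy to distinguish genes picked in nearly every repeat from those selected only once.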
We employ our methods in both Scenarios 1 and 2 as detailed above. The tuning parameter is determined based on the average values of the C-statistic over the repeats for the penMCFM(EM) and penCox.1se methods, the only ones involving this parameter. The results obtained using the optimally selected parameters are shown in Figure 5, which reports the boxplots of the C-statistic for the different methods. The number of identified biomarker genes differs across methods: for instance, in Scenario , the penMCFM(EM) method with selects genes for the latency part of the model, that is, genes with nonzero values in , while MCM(GMIFS) selects genes. These identified sets of biomarker genes show great overlap across the different methods, as shown in Figure 6, which illustrates the overlap of the selected genes for the latency part of the model across the four methods. Interestingly, penMCFM(EM) generally does not select any features for the incidence part of the model, meaning that no coefficient in is estimated to be significantly different from zero, while the GMIFS method (when used with both the penMCFM and MCM models) usually selects some features. A similar result was also observed in Fu et al.35 in an application to acute myeloid leukemia data. A summary of the overlap of the selected genes of these two methods for the incidence part of the model is also given in Figure S5 of the Supplementary Material. Furthermore, the selected sets of biomarker genes for the different methods, along with their frequencies of occurrence over the repeats of each method, are provided at https://github.com/fatihki/penMCFM as Excel files.
Results of the analysis of the TCGA-BRCA data. Boxplots of the C-statistic, AUC and IBS values obtained on the testing datasets over the repeated data-splitting processes. (a) Scenario and (b) Scenario .
Results of the analysis of the TCGA-BRCA data. Overlap of the selected gene sets (nonzero coefficients) among the four methods: the blue barplots report the frequencies of intersections among the methods, while the bottom green lines report which methods are considered for the overlap. (a) Scenario and (b) Scenario .
From inspection of Figure 5 one can see that the proposed method penMCFM(EM) achieves the second-largest C-statistic values, whereas MCM(GMIFS) attains the largest. It is worth noting, however, that the latter method selects a larger number of genes for both parts of the model compared to the other methods, which may contribute to its greater C-statistic values. Moreover, it is also possible to consider other metrics, such as the area under the receiver operating characteristic (ROC) curve and the integrated Brier score, for comparing the different methods on both the incidence and latency parts of the model. To calculate the area under the ROC curve (AUC), each method is used to estimate the -year survival of the subjects in the testing set, which is then compared to the observed outcomes on the corresponding time interval. The integrated Brier score (IBS) is a composite measure of discrimination and calibration, with lower values indicating better model calibration and predictive accuracy. The procedures outlined by Zhao et al.48 were followed to compute both the AUC and IBS metrics. Our analysis leads to the conclusion that the penMCFM(EM) method, as depicted in Figures 5(a) and 5(b), exhibits superior AUC and IBS performance compared to the alternative methods for both the incidence and latency components of the model.
For a fair comparison of the methods performance, we carry out more investigations for the identified biomarker genes via gene set enrichment analyses, and by constructing a prognostic risk score, as described in Sections 5.2 and 5.3, respectively. Furthermore, we validate a subset of the top-identified genes derived from penMCFM(EM) when considering the intersection of results from both scenarios, by referring to the existing BRCA literature. This additional validation can be found in Table S3 of the Supplementary Material.
Functional enrichment analysis
We perform both GO and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses (EA) on the union of the selected genes obtained in both scenarios, to investigate the pathological implications of these identified biomarker genes. To this aim, we utilize the R/Bioconductor package clusterProfiler59,60 for all the analyses described in this section. First, we perform GO-EA on the genes identified by penMCFM(EM), penMCFM(GMIFS), MCM(GMIFS) and penCox.1se in Scenarios and , and then we conduct the KEGG pathway EA (KEGG-EA). We only present here the results obtained when studying the genes identified for the latency part of the model by penMCFM(EM), while the results obtained with the other methods (namely penMCFM(GMIFS), MCM(GMIFS), and penCox.1se) are provided in the Supplementary Materials. Moreover, the detailed results of the enrichment analyses for each method are provided at https://github.com/fatihki/penMCFM as Excel files.
In Scenario , only one biological process (BP) for the GO-EA is significant, the "xenobiotic metabolic process," and the association between this BP and similar breast cancer evolution processes is examined by Lee et al.61 Another BP identified by GO-EA is "GABAergic neuron differentiation," with a p-value of . For KEGG-EA, the only significantly enriched term is "metabolism of xenobiotics by cytochrome P450." The association between this enzyme expression and cancer risk, progression, metastasis and prognosis has been widely reported in basic and clinical studies (see the review by Luo et al.62 for more details). The role of the CYP1A1 and CYP2A13 genes in breast and lung cancers has also been investigated extensively (see Sneha et al.63 and references therein for more details). The gene network plots related to the significant GO and KEGG pathway terms are presented in Figures 7(a) and 7(b). The EA results for the other methods, namely penMCFM(GMIFS) and MCM(GMIFS), are respectively shown in Figures S6-S7 in the Supplementary Materials. While the number of enriched terms in both GO- and KEGG-EA is limited for all the methods, some of the identified terms exhibit relevance to breast cancer.
Results of the analysis of the TCGA-BRCA data. Gene network plots related to the significant GO and KEGG pathway terms obtained from EA for Scenario : (a) GO-enriched terms and (b) KEGG-enriched terms.
In Scenario , we detect a total of enriched BP terms via GO-EA, including some important BPs associated with breast cancer, such as the "Notch signaling pathway",64 the "canonical Wnt signaling pathway",65,66 and the "BMP signaling pathway".67 The barplot for some of the enriched terms in GO-EA, and the gene network plot related to the top significant GO terms, are presented in Figures 8(a) and 8(b), respectively. Concerning the results of KEGG-EA in this scenario, Figure 9(a) shows that the "Wnt signaling" and "breast cancer" pathways are two of the most significant KEGG pathways, as confirmed by the growing number of studies in the literature demonstrating that Wnt signaling is involved in the proliferation, metastasis, immune microenvironment regulation, stemness maintenance, therapeutic resistance, and phenotype shaping of breast cancer (see Xu et al.65 and Abreu de Oliveira et al.,66 and references therein, for more details). The gene network plot related to the significant KEGG-EA pathway terms is presented in Figure 9(b). Moreover, the enriched KEGG pathway "hsa05224: Breast cancer" is reported in Figure S9 in the Supplementary Materials to more clearly show the associated biological meaning. We observe a greater number of enriched terms in the EA for all methods in Scenario compared to Scenario . Notably, the "breast cancer" pathway is identified as one of the most enriched terms in at least one part of the model structure for all methods.
Results of the analysis of the TCGA-BRCA data. GO-EA for Scenario : (a) barplot of significantly enriched terms and (b) network plot of enriched GO terms and related selected genes.
Results of the analysis of the TCGA-BRCA data. KEGG-EA for Scenario : (a) barplot of significantly enriched terms and (b) network plot of enriched KEGG pathway terms and related selected genes.
Prognostic risk score construction and validation of identified genes
We establish a prognostic risk score (PRS) to demonstrate the prognostic power of our model and methods. The PRS is constructed similarly to Li and Liu,50 by using the average estimates of the coefficients corresponding to the selected genes over the repeats. Precisely, the PRS is defined as
where is the total number of selected genes for the considered method, is the average estimate of the coefficient corresponding to the -th gene, and is the expression value of the -th gene in the data. We dichotomize the prognostic scores based on the median value and subsequently employ a log-rank test to compare the survival curves between the two groups of patients.
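The PRS computation and median dichotomization can be sketched as follows; a minimal illustration of the weighted-sum score and median split (the subsequent log-rank comparison of the two groups would use a dedicated survival package, e.g. the survival package in R):

```python
import numpy as np

def prognostic_risk_score(expr, avg_beta):
    """PRS_i = sum_j avg_beta[j] * expr[i, j] over the selected genes,
    using average coefficient estimates across repeats; subjects are
    then dichotomized at the median PRS into high- and low-risk groups.
    expr: (subjects, selected genes) expression matrix."""
    prs = expr @ avg_beta
    high_risk = prs > np.median(prs)
    return prs, high_risk
```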
We use the validation dataset, which includes observations with their RNA-seq features and survival time, to validate the selected genes with the proposed PRS. We assign the subjects in the validation dataset to a high and low risk group using the median of the obtained PRS values. This procedure is applied when using the selected genes obtained with penMCFM(EM), penMCFM(GMIFS), MCM(GMIFS) and penCox.1se in Scenario . The KM curves of these subgroups are presented in Figure 10 for penMCFM(EM) and MCM(GMIFS): in the figure it can be observed that the log-rank test shows significant differences between the survival distributions corresponding to the high- and low-risk groups as obtained by the PRS calculations carried out on the outputs of both penMCFM(EM) and MCM(GMIFS) methods (with p-values and , respectively). Since the -values of the log-rank tests for penMCFM(GMIFS) and penCox.1se are not significant, these KM plots are only shown in Figure S18(a) and (b) in the Supplementary Materials.
Results of the analysis of the TCGA-BRCA data. KM curves for the TCGA-BRCA patients in the validation dataset when dichotomized into two groups by the median PRS using the average results of method repeats in Scenario : (a) penMCFM(EM) and (b) MCM(GMIFS).
Moreover, the heatmaps of the expression values of the selected genes in the validation dataset, based on the penMCFM(EM) and MCM(GMIFS) methods, are also presented in Figure S19 in the Supplementary Materials. For visualization purposes, only the genes selected in at least out of re-runs of the methods are shown in these heatmaps. From inspection of these figures, it can be observed that the penMCFM(EM) method yields much more clearly visible clusters of genes with similar expression behavior than the MCM(GMIFS) method.
Discussion and conclusion
In this study, we have introduced a novel method that adapts the mixture cure frailty model to account for high-dimensional covariates in both the incidence and the latency components of the model. The method extends the classical MCFM to higher dimensions using a multi-step adaptive elastic-net penalty term inside the EM algorithm, hence the name penMCFM. The proposed penMCFM enables variable selection in the MCFM in high-dimensional settings where the number of covariates is significantly larger than the sample size. A comparative analysis with alternative survival models, including the penalized mixture cure model and penalized Cox regression, was also carried out. We also applied the proposed model, along with the alternative approaches, to analyze RNAseq data from BRCA samples from TCGA. Finally, we conducted functional enrichment analyses of the identified biomarkers to highlight the disease mechanisms associated with BRCA, and then validated these biomarkers using an independent validation dataset to demonstrate their effectiveness and usefulness.
Moreover, we conducted a comprehensive investigation comparing our primary methodology, penMCFM(EM), with alternative methods using various metrics including C-statistic, AUC, and IBS, alongside functional enrichment analysis and the construction of prognostic risk scores, on various real data scenarios. Our analysis revealed that penMCFM(EM) and MCM(GMIFS) methods demonstrated comparable performances under certain circumstances, while penMCFM(EM) exhibited superior performance in others. Notably, despite similarities in performance observed in certain scenarios, penMCFM(EM) consistently selected fewer genes than MCM(GMIFS), its main competitor. Moreover, through rigorous simulation studies, penMCFM(EM) consistently outperformed alternative methods across different evaluation metrics. Consequently, we conclude that penMCFM(EM) represents a compelling alternative, offering advantages across diverse data generating scenarios. Additionally, our future research agenda entails applying this method to other cancer datasets, and conducting further comparative analyses against relevant methods.
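For concreteness, the C-statistic used among the comparison metrics above can be computed with a minimal sketch of Harrell's concordance index (an O(n²) pairwise count; the toy data below are hypothetical):

```python
def concordance_index(time, event, risk):
    """Harrell's C-statistic: among comparable pairs (the subject with the
    observed event failed first), count how often the higher risk score
    belongs to the earlier failure; ties in risk count 1/2."""
    num = den = 0.0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(time)):
            if time[j] > time[i]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1.0
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

# perfectly concordant toy example: shorter survival pairs with higher risk score
c = concordance_index([2.0, 5.0, 1.0, 8.0], [1, 0, 1, 1], [0.9, 0.3, 1.2, 0.1])
```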
A few practical issues regarding the proposed algorithm deserve some discussion. One notable computational challenge pertains to the efficient selection of the tuning parameters, specifically and . While penMCFM assumes that these parameters are identical for both penalties, alternative methods (such as grid search, random search, or Latin hypercube sampling) could be employed to identify optimal distinct values for and , albeit at a much higher computational cost. In one experimental setting on simulated data, using Latin hypercube sampling with pairs for tuning and did not improve the results compared to using the same value for the two parameters, leading us to omit this approach. Moreover, when selecting the optimal tuning parameters and via cross-validation, we opted to optimize the C-statistics, that is and , instead of the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC). While AIC and BIC were initially included in the cross-validation tuning process, we observed that these criteria tended to select only a limited number of variables; consequently, those results are not presented in the study.
The considered MCFM model has certain limitations. First, we employed a parametric baseline distribution for the latency component of the model, specifically choosing a Weibull distribution. While this choice may limit the flexibility of the model’s baseline structure, it provides several advantages, including mathematical tractability in certain formulations, a flexible hazard function (as demonstrated in recent similar but low-dimensional studies by Jiang and Basu68 and Pal et al.69), and ease of comparison with other studies in the literature, such as Fu et al.35 Another limitation is the lack of completely independent validation datasets from sources other than TCGA, which are needed to assess the model’s applicability and robustness in practical applications. It would be particularly valuable to utilize completely independent validation datasets, such as the Norwegian breast cancer Oslo2 cohort data from Aure et al.,70 as well as datasets from different breast cancer cohorts. Additionally, considering other cancer types separately and incorporating data from these types could provide further insights. Addressing these aspects remains an important focus for future work.
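As a minimal numerical illustration of the Weibull latency choice discussed above, the population survival of a mixture cure model, S_pop(t) = π + (1 − π) S_u(t), plateaus at the cure fraction π. The sketch below omits the frailty component and covariates, and uses arbitrary parameter values, not estimates from the paper:

```python
import numpy as np

def weibull_surv(t, shape, scale):
    """Weibull survival for the uncured: S_u(t) = exp(-(t / scale) ** shape)."""
    return np.exp(-(np.asarray(t, dtype=float) / scale) ** shape)

def mixture_cure_surv(t, cure_prob, shape, scale):
    """Population survival of a mixture cure model:
    S_pop(t) = pi + (1 - pi) * S_u(t), which plateaus at pi for large t."""
    return cure_prob + (1.0 - cure_prob) * weibull_surv(t, shape, scale)

# arbitrary illustrative parameters: 30% cured, Weibull(shape=1.5, scale=2)
s = mixture_cure_surv(np.array([0.0, 1.0, 50.0]), cure_prob=0.3, shape=1.5, scale=2.0)
```

The long-run plateau at the cure fraction mirrors the KM-plateau diagnostic for the existence of a cured subgroup.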
Some potential directions for future research include exploring a semi-parametric approach as an alternative to the current use of parametric distributions in the latency part. Additionally, investigating a Bayesian approach for variable selection could be a fruitful direction for further investigation. Finally, to the best of our knowledge, the literature on cure frailty models for recurrent events in high-dimensional settings has not flourished to the same extent as that based on survival data. This could be another potentially fruitful direction of future research.
Supplemental Material
sj-pdf-1-smm-10.1177_09622802251327687: Supplemental material for "A Weibull mixture cure frailty model for high-dimensional covariates" by Fatih Kızılaslan, David Michael Swanson and Valeria Vitelli, Statistical Methods in Medical Research.
Acknowledgments
Computer simulations were performed in HPC solutions provided by the Oslo Centre for Epidemiology and Biostatistics, at the University of Oslo.
The results published in the real data study are based upon data generated by the TCGA Research Network.
The authors would also like to thank Dr. Zhi Zhao for fruitful discussions.
Authors' contributions
FK, DMS, and VV jointly contributed to the method conception and development. FK was responsible for the implementation, for all data analyses, and for drafting the paper. VV contributed to the paper drafting and to the interpretation of all results. Both DMS and VV revised the paper for submission.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801133.
ORCID iDs
David Michael Swanson
Valeria Vitelli
Supplemental material
Supplemental material for this article is available online.
References
1. Price DL, Manatunga AK. Modelling survival data with a cured fraction using frailty models. Stat Med 2001; 20: 1515–1527.
2. Boag JW. Maximum likelihood estimates of the proportion of patients cured by cancer therapy. J R Stat Soc Series B Stat Methodol 1949; 11: 15–53.
3. Berkson J, Gage RP. Survival curve for cancer patients following treatment. J Am Stat Assoc 1952; 47: 501–515.
4. Peng Y, Zhang J. Estimation method of the semiparametric mixture cure gamma frailty model. Stat Med 2008; 27: 5177–5194.
5. Sy JP, Taylor JM. Estimation in a Cox proportional hazards cure model. Biometrics 2000; 56: 227–236.
6. Peng Y, Dear KB. A nonparametric mixture model for cure rate estimation. Biometrics 2000; 56: 237–243.
7. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol 1977; 39: 1–22.
8. Peng Y, Zhang J. Identifiability of a mixture cure frailty model. Stat Probab Lett 2008; 78: 2604–2608.
9. Cai C, Zou Y, Peng Y, et al. smcure: an R-package for estimating semiparametric mixture cure models. Comput Methods Programs Biomed 2012; 108: 1255–1260.
Rondeau V, Schaffner E, Corbiere F, et al. Cure frailty models for survival data: application to recurrences for breast cancer and to hospital readmissions for colorectal cancer. Stat Methods Med Res 2013; 22: 243–260.
12. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 1996; 58: 267–288.
13. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 2005; 67: 301–320.
14. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann Stat 2009; 37: 1733–1751.
15. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010; 33: 1–22.
Bühlmann P, Van De Geer S. Statistics for high-dimensional data: methods, theory and applications. Heidelberg, Berlin: Springer, 2011.
23. Hastie T, Tibshirani R, Wainwright M. Statistical learning with sparsity: the lasso and generalizations. New York: CRC Press, 2015.
24. James G, Witten D, Hastie T, et al. An introduction to statistical learning. New York, NY: Springer, 2021.
25. Liu X, Peng Y, Tu D, et al. Variable selection in semiparametric cure models based on penalized likelihood, with application to breast cancer clinical trials. Stat Med 2012; 31: 2882–2891.
26. Fan X, Liu M, Fang K, et al. Promoting structural effects of covariates in the cure rate model with penalization. Stat Methods Med Res 2017; 26: 2078–2092.
27. Masud A, Tu W, Yu Z. Variable selection for mixture and promotion time cure rate models. Stat Methods Med Res 2018; 27: 2185–2199.
28. Beretta A, Heuchenne C. Variable selection in proportional hazards cure model with time-varying covariates, application to US bank failures. J Appl Stat 2019; 46: 1529–1549.
Sun L, Li S, Wang L, et al. Variable selection in semiparametric nonmixture cure model with interval-censored failure time data: an application to the prostate cancer screening study. Stat Med 2019; 38: 3026–3039.
31. Bussy S, Guilloux A, Gaïffas S, et al. C-mix: a high-dimensional mixture model for censored durations, with applications to genetic data. Stat Methods Med Res 2019; 28: 1523–1539.
32. Shi X, Ma S, Huang Y. Promoting sign consistency in the cure model estimation and selection. Stat Methods Med Res 2020; 29: 15–28.
Xu Y, Zhao S, Hu T, et al. Variable selection for generalized odds rate mixture cure models with interval-censored failure time data. Comput Stat Data Anal 2021; 156: 1–17.
35. Fu H, Nicolet D, Mrózek K, et al. Controlled variable selection in Weibull mixture cure models for high-dimensional data. Stat Med 2022; 41: 4340–4366.
36. Hastie T, Taylor J, Tibshirani R, et al. Forward stagewise regression and the monotone lasso. Electron J Stat 2007; 1: 1–29.
37. Farewell VT. The use of mixture models for the analysis of survival data with long-term survivors. Biometrics 1982; 38: 1041–1046.
38. Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data. New York: Springer, 2003.
Makowski M, Archer KJ. Generalized monotone incremental forward stagewise method for modeling count data: application predicting micronuclei frequency. Cancer Inform 2015; 14: 97–105.
41. Hou J, Archer KJ. Regularization method for predicting an ordinal response using longitudinal high-dimensional genomic data. Stat Appl Genet Mol Biol 2015; 14: 93–111.
42. Yu T. Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach. Stat Anal Data Min 2018; 11: 188–197.
43. Asano J, Hirakawa A, Hamada C. Assessing the prediction accuracy of cure in the Cox proportional hazards cure model: an application to breast cancer data. Pharm Stat 2014; 13: 357–363.
44. Asano J, Hirakawa A. Assessing the prediction accuracy of a cure model for censored survival data with long-term survivors: application to breast cancer data. J Biopharm Stat 2017; 27: 918–932.
45. Pal S, Peng Y, Aselisewine W. A new approach to modeling the cure rate in the presence of interval censored data. Comput Stat 2024; 39: 2743–2769.
46. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014; 15: 1–21.
47. Zhao Y, Li MC, Konaté MM, et al. TPM, FPKM, or normalized counts? A comparative study of quantification measures for the analysis of RNA-seq data from the NCI patient-derived models repository. J Transl Med 2021; 19: 1–15.
48. Zhao Z, Zobolas J, Zucknick M, et al. Tutorial on survival modeling with applications to omics data. Bioinformatics 2024; 40: 1–15.
49. John CR, Watson D, Russ D, et al. M3C: Monte Carlo reference-based consensus clustering. Sci Rep 2020; 10: 1–14.
50. Li L, Liu ZP. Detecting prognostic biomarkers of breast cancer by regularized Cox proportional hazards models. J Transl Med 2021; 19: 1–20.
51. Li L, Liu ZP. Biomarker discovery from high-throughput data by connected network-constrained support vector machine. Expert Syst Appl 2023; 226: 1–12.
52. Kanehisa M, Furumichi M, Sato Y, et al. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res 2021; 49: D545–D551.
53. The Gene Ontology Consortium. The Gene Ontology resource: enriching a gold mine. Nucleic Acids Res 2021; 49: D325–D334.
54. Cardoso F, van't Veer LJ, Bogaerts J, et al. 70-Gene signature as an aid to treatment decisions in early-stage breast cancer. N Engl J Med 2016; 375: 717–729.
55. Yan Z, Wang Q, Sun X, et al. OSbrca: a web server for breast cancer prognostic biomarker investigation with massive data from tens of cohorts. Front Oncol 2019; 9: 1–8.
56. Li X, Liu L, Goodall GJ, et al. A novel single-cell based method for breast cancer prognosis. PLoS Comput Biol 2020; 16: 1–20.
57. De Bin R, Sauerbrei W, Boulesteix AL. Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 2014; 33: 5310–5329.
58. Volkmann A, De Bin R, Sauerbrei W, et al. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Med Res Methodol 2019; 19: 1–15.
59. Yu G, Wang LG, Han Y, et al. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A J Integr Biol 2012; 16: 284–287.
60. Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. The Innovation 2021; 2: 1–11.
61. Lee DG, Schuetz JM, Lai AS, et al. Interactions between exposure to polycyclic aromatic hydrocarbons and xenobiotic metabolism genes, and risk of breast cancer. Breast Cancer (Auckl) 2022; 29: 38–49.
62. Luo B, Yan D, Yan H, et al. Cytochrome P450: implications for human breast cancer (review). Oncol Lett 2021; 22: 1–9.
63. Sneha S, Baker SC, Green A, et al. Intratumoural cytochrome P450 expression in breast cancer: impact on standard of care treatment and new efforts to develop tumour-selective therapies. Biomedicines 2021; 9: 1–22.
64. Yousefi H, Bahramy A, Zafari N, et al. Notch signaling pathway: a comprehensive prognostic and gene expression profile analysis in breast cancer. BMC Cancer 2022; 22: 1282.
65. Xu X, Zhang M, Xu F, et al. Wnt signaling in breast cancer: biological mechanisms, challenges and opportunities. Mol Cancer 2020; 19: 1–35.
66. Abreu de Oliveira WA, El Laithy Y, Bruna A, et al. Wnt signaling in the breast: from development to disease. Front Cell Dev Biol 2022; 10: 1–19.
67. Ehata S, Miyazono K. Bone morphogenetic protein signaling in cancer; some topics in the recent 10 years. Front Cell Dev Biol 2022; 10: 883523.
68. Jiang Q, Basu S. Cure models with adaptive activation for modeling cancer survival. Stat Methods Med Res 2024; 33: 227–242.
69. Pal S, Peng Y, Aselisewine W. A new approach to modeling the cure rate in the presence of interval censored data. Comput Stat 2024; 39: 2743–2769.
70. Aure MR, Vitelli V, Jernström S, et al. Integrative clustering reveals a novel split in the luminal A subtype of breast cancer with impact on outcome. Breast Cancer Res 2017; 19: 19–44.