Sage Journals: Discover world-class research

Abstract

A novel mixture cure frailty model is introduced for handling censored survival data. Mixture cure models are preferable when the existence of a cured fraction among patients can be assumed. However, such models are heavily underexplored: frailty structures within cure models remain largely undeveloped, and furthermore, most existing methods do not work for high-dimensional datasets, when the number of predictors is significantly larger than the number of observations. In this study, we introduce a novel extension of the Weibull mixture cure model that incorporates a frailty component, employed to model an underlying latent population heterogeneity with respect to the outcome risk. Additionally, high-dimensional covariates are integrated into both the cure rate and survival part of the model, providing a comprehensive approach to employ the model in the context of high-dimensional omics data. We also perform variable selection via an adaptive elastic-net penalization, and propose a novel approach to inference using the expectation–maximization (EM) algorithm. Extensive simulation studies are conducted across various scenarios to demonstrate the performance of the model, and results indicate that our proposed method outperforms competitor models. We apply the novel approach to analyze RNAseq gene expression data from bulk breast cancer patients included in The Cancer Genome Atlas (TCGA) database. A set of prognostic biomarkers is then derived from selected genes, and subsequently validated via both functional enrichment analysis and comparison to the existing biological literature. Finally, a prognostic risk score index based on the identified biomarkers is proposed and validated by exploring the patients’ survival.

Keywords

Mixture cure frailty model variable selection adaptive elastic-net expectation–maximization method biomarker discovery

1. Introduction

When analyzing time-to-event data, the presence of substantial censoring following a prolonged follow-up period often indicates the presence of “long-term survivors” or “cured individuals,” that is individuals who may never experience the event of interest. This phenomenon can be observed in certain clinical investigations such as cancer studies, where successful treatment can effectively prevent disease recurrence. This scenario becomes more noticeable among patients who are diagnosed in the early stages of cancer development. The presence of a cured fraction violates a key assumption in traditional survival models, like the Cox proportional hazards (PH) model, which assumes that all subjects will inevitably experience the event of interest. When a cured fraction exists, this assumption no longer holds true. Hence, the use of a Cox PH model in such cases can lead to an underestimation of hazard rates and an overestimation of survival probabilities for individuals susceptible to the event.¹ In practice, the cured subjects cannot be observed directly, as they are censored together with the individuals who will eventually experience the outcome. However, the Kaplan–Meier (KM) estimated survival curve can be used for determining the existence of a cured fraction when the follow-up period is sufficiently long. A long and consistent plateau in this curve could imply the existence of a cured subset of subjects. Cure models expand the scope of survival analysis by incorporating a fraction of cured individuals. The mixture cure (MC) model is the most commonly employed cure rate model, which was first introduced by Boag² and improved by Berkson and Gage.³ The MC model incorporates two components: one component represents the cured fraction, that is the fraction of individuals who will never experience the event of interest, having a survival probability of one. The other component instead captures the susceptible fraction, that is the fraction of individuals for whom the event occurrence is governed by a proper survival distribution, which is often referred to as the “latency distribution.”

In medical and epidemiological studies, it is common to assume the existence of unobserved factors generating heterogeneity among individuals that cannot be explained via the observed covariates. Frailty models can be employed to incorporate and account for unobserved heterogeneity among individuals, to allow for a more accurate modeling of the survival outcome. These models introduce a random frailty term, often assumed to follow a specific distribution, which captures individual-specific characteristics that affect the hazard or risk of the event. Price and Manatunga¹ introduced a gamma frailty term into the latency distribution to address the presence of unobserved risks within the MC model. Peng and Zhang⁴ extended this model by incorporating covariates into both the cure rate and the latency distribution, and they named this model the mixture cure gamma frailty model. Due to the presence of missing variables in cure models, Sy and Taylor⁵ as well as Peng and Dear⁶ employed the EM algorithm⁷ to derive maximum likelihood estimates of the model parameters. Moreover, Peng and Zhang⁴ studied semi-parametric estimation methods for the mixture cure gamma frailty model using the EM algorithm and a multiple imputation method with low-dimensional covariates. They also examined the identifiability of the general mixture cure frailty model (MCFM) in Peng and Zhang.⁸ Cai et al.⁹ developed an R package, called smcure,¹⁰ to facilitate the application of semi-parametric estimation techniques to both the proportional hazards MC model and the accelerated failure time MC model. Finally, frailty models have also been used in this context when dealing with correlated data, and particularly to be able to take into account heterogeneity in the presence of recurrent events. For instance, Rondeau et al.¹¹ studied the analysis of recurrent time-to-event data using frailty models with a cured fraction.

The focus of the present article is on situations when the available covariates are way more than the subjects included in the sample. Thanks to advancements in biomedicine, we can now generate and gather molecular data from diverse modalities, such as genomics, epigenomics, transcriptomics, proteomics, and metabolomics, often resulting in high-dimensional data sets. Due to the volume and complexity of such data, and related challenges and opportunities, more sophisticated methods than the traditional ones are required, especially to deal with variable selection. To this purpose, the use of advanced regularization methods such as least absolute shrinkage and selection operator (lasso),¹² elastic net,¹³ the adaptive elastic net,¹⁴ and others has become a standard practice for enhancing accuracy and interpretation. The three methods mentioned above are established popular choices for performing simultaneous prediction and variable selection in high-dimensional linear regression problems. These techniques allow discovering meaningful patterns, such as identifying relevant prognostic biomarkers for a certain disease, from high-dimensional molecular data. When it comes to generalized linear models, Friedman et al.¹⁵ introduced a novel approach utilizing coordinate descent to compute regularization paths, and also implemented their method in the now very popular R package glmnet.¹⁶ Simon et al.¹⁷ extended the previous work to Cox models for right-censored data, incorporating elastic net regularization, with a very efficient algorithm that was then integrated into the glmnet R package, enabling the solution of significantly larger problems than previously possible. Tay et al.¹⁸ expanded the scope of the elastic net-regularized regression to encompass all generalized linear model families, including Cox models with (start, stop] data and strata, also implemented within the glmnet R package. Adaptive versions of both the lasso Zou and Hastie¹³ and elastic net¹⁴ have also been developed, and their properties explored under specific conditions. In a discussion article, Bühlmann and Meier¹⁹ emphasized the relevance of a multi-step procedure for the adaptive lasso, showing that it allows for sparser models at each step. Furthermore, Xiao and Xu²⁰ introduced the multi-step adaptive elastic-net, showing via extensive numerical exploration that this method effectively reduces the false positives in the variable selection process while preserving estimation accuracy, and implemented it into the msaenet R package.²¹ For further insights into regularization methods and their applications, interested readers may refer to Bühlmann and van de Geer,²² Hastie et al.²³ and James et al.²⁴

In the context of survival data within high-dimensional settings, the traditional Cox PH model has been commonly used. Although there has recently been a growing interest in the application of cure models to high-dimensional data, the existing literature on this subject has remained somewhat limited. Liu et al.²⁵ introduced a variable selection procedure for the semi-parametric MC model using penalties based on the smoothly clipped absolute deviation (SCAD) and lasso. Fan et al.²⁶ explored the minimax concave penalty (MCP), ridge, and lasso penalties, to estimate the MC model in high-dimensional data sets. Masud et al.²⁷ studied variable selection problems for both the mixture and promotion time cure models, employing penalty terms such as lasso and adaptive lasso. Baretta and Heuchenne²⁸ conducted a study on variable selection using SCAD penalties for the semi-parametric PH cure model, considering time-varying covariates, and implemented the method in the penPHcure²⁹ R package. Sun et al.³⁰ developed a variable selection methodology that incorporates lasso, adaptive lasso, and SCAD-type penalties, tailored for the semi-parametric promotion time cure model, and particularly suitable for interval-censored data. Bussy et al.³¹ proposed a novel statistical model known as C-mix, specifically designed as a mixture-of-experts model for handling censored survival outcomes, to model the patients’ heterogeneity by detecting distinct subgroups, and allowed high-dimensional covariates by employing an elastic net penalization. They also conducted a comparison of the C-mix model with both the MC and Cox PH models through simulation experiments. Shi et al.³² developed a novel penalty method for the MC model that includes both the MCP and a new penalty term that promotes sign consistency using the covariate effects. Xie and Yu³³ proposed the use of neural networks to address inference in the MC model, demonstrating their favorable predictive ability in high-dimensional settings. Xu et al.³⁴ considered the variable selection problem by adopting penalties such as lasso, adaptive lasso, and SCAD for the generalized odds rate MC model, particularly in the context of interval-censored data. Lastly, Fu et al.³⁵ investigated the Weibull MC model, incorporating generalized monotone incremental forward stagewise (GMIFS)³⁶ and EM algorithm techniques for inference. They applied the EM algorithm to a penalized Weibull MC model, employing a lasso type penalty. Their findings indicated a superior performance of their penalized MC model as compared to alternative methods in various simulation scenarios. It is noteworthy to emphasize that all aforementioned studies on cure models exhibit significant limitations concerning the size of covariates relative to the sample size that can be handled by the model, in both simulation experiments and real-data applications, with the only exception being the case studies conducted by Bussy et al.³¹ and Fu et al.³⁵ Moreover, neither of the two latter studies include a frailty term in the model.

This study is therefore aimed at developing a first novel version of the MCFM useful in instances where high-dimensional covariates are potentially associated with both the cured rate and uncured subjects in the sample. To this purpose, a new penalized EM algorithm is introduced, incorporating adaptive elastic net penalties specifically tailored for the MCFM. In particular, the model assumes that the distribution of uncured subjects follows a Weibull distribution, and that their survival probability at each time depends on a subset of the high-dimensional set of covariates, which is also a target of inference. To the best of our knowledge, this study represents the first exploration of MCFM, that is the first attempt at incorporating frailty into the MCM, within high-dimensional settings. Nonetheless, surpassing the predominant focus of the existing literature on MCFM on low-dimensional settings, our study explores scenarios in which the sizes of covariates significantly exceed the sample sizes in both simulated and real-data applications. The rest of this article is organized as follows: in Section 2, we present the classical MCFM and its extension to high-dimensional settings. In Section 3, we introduce the novel penalized MCFM (penMCFM), with the associated adaptive EM algorithm for inference, in details. The performances of our proposed methods are compared with some competitors through a comprehensive Monte Carlo simulation study in Section 4. In Section 5, all considered methods are applied to publicly available breast cancer RNA-seq data, with the aim of identifying potential biomarker genes. We also conduct functional enrichment analyses for better interpretation, and determine a prognostic risk score for enhanced validation, of all the identified biomarker genes. Finally, in Section 6 we provide a brief summary and discussion of the study, also mentioning potential future research directions.

2. The penalized mixture cure frailty model

2.1. Classical mixture cure frailty models

Let the random variable $T$ represent the lifetime of interest with survival function denoted by $S_{p o p} (t), t \in [0, + \infty)$ . Let $Y$ be the indicator for a subject eventually $(Y = 1)$ or never $(Y = 0)$ experiencing the event of interest, with $π = P (Y = 1)$ representing the probability of a subject being susceptible (or uncured) for the event of interest. Among the subjects for whom $Y = 0$ , the survival function is $S (t | Y = 0) = 1, \forall t \in [0, + \infty)$ , and for those who experience the event ( $Y = 1$ ) , the survival function and the probability density function (pdf) are $S (t | Y = 1)$ and $f (t | Y = 1)$ , respectively. $Y$ is not observed for a censored subject. The population survival function is therefore defined as

S_{p o p} (t) = 1 - π + π S (t | Y = 1)

(1)

Note that since

S_{p o p} (t) \to 1 - π

t \to + \infty

S_{p o p} (t)

is not a proper survival function. The uncured rate

π

and the survival function of the uncured subjects

S (t | Y = 1)

are also referred to as the incidence and the latency distribution, respectively.

The basic model introduced in (1) can be extended to include the covariates associated with the incidence and latency distributions. Let us denote via $x$ and $z$ the covariates that have effect on the latency distribution and the incidence, respectively. The model (1) can be then rewritten as

S_{p o p} (t | x, z) = 1 - π (z) + π (z) S (t | Y = 1, x)

(2)

where

π (z)

is the probability of a subject being uncured conditionally on

z

, and

S (t | Y = 1, x)

is the survival function of the lifetime distribution of uncured subjects conditionally on

x

. Concerning the modeling of the effect of the covariates

z

on the incidence, as previously proposed in Farewell³⁷ we use a logistic regression model of the form

π (z) = e^{z^{⊤} b} / (1 + e^{z^{⊤} b}),

where

z^{⊤} \in R^{n \times P_{1} + 1}

is a covariate matrix, with columns

z_{1}, \dots, z_{n} \in R^{P_{1} + 1}

, and

b = (b_{0}, b_{1}, \dots, b_{P_{1}})^{⊤} \in R^{P_{1} + 1}

is a vector of unknown regression coefficients. When the mixture cure model defined in (2) is specified via proportional hazards, we get the following PH mixture cure model

S_{p o p} (t | x, z) = 1 - π (z) + π (z) S_{0} {(t)}^{\exp (x^{⊤} β)}

(3)

where

S_{0} (t)

is the baseline survival function,

x^{⊤} \in R^{n \times P_{2}}

is the covariate matrix, with columns

x_{1}, \dots, x_{n} \in R^{P_{2}}

, and

β = (β_{1}, \dots, β_{P_{2}})^{⊤} \in R^{P_{2}}

is the vector of unknown regression coefficients for the latency distribution.

In medical and epidemiological studies, frailty models extend the Cox PH model to account for unobservable heterogeneity among individuals. This extension provides a more flexible structure for analysis. A frailty is defined as an unobservable, random, multiplicative factor acting on the hazard function. Let $W_{i}$ be a non-negative frailty random variable associated to the $i$ th subject, $i = 1, \dots, n$ , with cumulative distribution function (cdf) $F_{W_{i}} (w)$ . The hazard function of the $i$ th subject with frailty $W_{i}$ is

h (t | W_{i}) = W_{i} h_{0} (t) \exp ({x_{i}}^{⊤} β)

where

h_{0} (t)

is a baseline hazard function common for all subjects and

x_{i}

is the covariate vector for the

i

th subject. If we include the frailty in the latency distribution in model ( 3), the conditional survival function given the frailty

W

takes the form

S (t | Y = 1, W, x) = \exp (- W e^{x^{⊤} β} H_{0} (t))

where

H_{0} (t)

is the baseline cumulative hazard function. Then, the marginal survival function of uncured subjects based on the frailty model is given by

\begin{aligned} S (t | Y = 1, x) & = \int_{0}^{+ \infty} S (t | Y = 1, W = w, x) d F_{W} (w) \\ = L_{W} (e^{x^{⊤} β} H_{0} (t)) \end{aligned}

where

L_{W} (s) = E (e^{- w s})

is the Laplace transformation of the frailty

W

. Hence, model (3) with frailty becomes

S_{p o p} (t | x, z) = 1 - π (z) + π (z) L_{W} (e^{x^{⊤} β} H_{0} (t))

(4)

Model (4) was first introduced by Peng and Zhang⁴ and called the mixture cure frailty model. Model (4) reduces to the PH mixture cure model in (3) when there is no frailty effect, namely

W \equiv 1

, and it reduces to a standard frailty model when there is no cure fraction existing in the population, namely

π (z) \equiv 1

In this study, we consider a fully parametric version of model (4). The Weibull distribution is one of the most commonly used and well-known distributions in survival analysis, recognized for its flexibility in modeling various hazard functions and accurately representing increasing, decreasing, and constant hazard rates. Additionally, because similar mixture cure models in the literature often use the Weibull distribution, employing it in our model facilitates easier comparison with existing studies such as Fu et al.³⁵ Hence, we assume that the baseline of the latency distribution follows a Weibull distribution $W E (α, γ)$ with $α$ and $γ$ being the scale and shape parameters, respectively. The hazard and cumulative hazard functions for the baseline of the latency distribution then become $h_{0} (t) = α γ t^{γ - 1}$ and $H_{0} (t) = α t^{γ}$ , respectively.

The choice of frailty distribution is crucial for deriving the functional form of our model, as the population survival function in (4) includes the Laplace transform of the frailty distribution. The Gamma distribution is particularly flexible, capable of capturing both increasing and decreasing hazard rates, and it features a closed and simple Laplace transform. Due to these mathematical conveniences, the Gamma distribution is widely used as a frailty distribution in survival analysis (see the nice discussion in the book by Klein and Moeschberger³⁸). Nonetheless, to address identifiability issues associated with the mixture cure frailty model,⁸ we must fix the mean of the frailty distribution to $1$ . The Gamma distribution then allows a convenient reparameterization that adheres to this identifiability constraint. The Gamma is frequently employed as a frailty distribution in the mixture cure model literature, as demonstrated by Price and Manatunga¹ and Peng and Zhang.⁴ For these reasons, and to effectively handle high-dimensional covariates within this framework, using the Gamma distribution as a frailty component is both practical and meaningful. Therefore, we assume that the frailty $W$ follows a gamma distribution with mean $1$ and variance $1 / θ$ . The Laplace transformation of the frailty is then $L_{w} (s) = (1 + s / θ)^{- θ}$ . Therefore, given our parametric assumptions, the survival function of model (4) can be rewritten as

S_{p o p} (t | x, z) = \frac{1}{1 + e^{z^{⊤} b}} + \frac{e^{z^{⊤} b}}{1 + e^{z^{⊤} b}} {(1 + \frac{α t^{γ} e^{x^{⊤} β}}{θ})}^{- θ}

(5)

Note that, with the parametric assumptions as detailed above, the hazard function of the latency, that is of the uncured subjects, does not satisfy the PH assumption.⁴

2.2. Mixture cure frailty models with high-dimensional covariates

In the context of medical applications, and particularly in cancer studies, molecular data of several kinds can be used to better estimate both the cured fraction, and the uncured patients survival. This is particularly relevant in cancer studies, since tissue samples originating from biopsies of the tumor are often available, and these can provide excellent information on the tumor composition and characterization. Such molecular omics data often include genomics, epigenomics, transcriptomics, proteomics, metabolomics and radiomics information, often generated using advanced high-throughput technologies to allow the understanding of complex biological systems and their underlying mechanisms. Common characteristic of all these data layers is their high-dimensionality, meaning that the number of variables $p$ included in each omics data layer is much larger than the sample size $n$ . We plan to modify the mixture cure frailty model as discussed above so that high-dimensional omics data can be used as covariates in the model.

In this set-up, we consider both covariates $x$ and $z$ , related to the latency part and the uncured rate of the model, respectively, as potentially high-dimensional, for example including some kind of omics information. These covariates can also include lower-dimensional information associated to the patients’ demographics or clinical characteristics, such as age, sex, treatment method. This means that the two vectors of variables in $x$ and $z$ could potentially combine very diverse exogenous information, including both clinical variables and high-dimensional omics features, which often show varying degrees of accuracy, redundancy, and noise.

We apply well-known regularization approaches such as lasso¹² and the adaptive elastic net¹⁴ to regression models to find the relevant variables from our complete set of omics covariates in both $x$ and $z$ . The high-dimensional covariate matrices $x$ and $z$ can share features or not, depending on the available domain knowledge, and moreover the low-dimensional covariates (e.g. demographics and clinical information) in the matrices are left unpenalized. In what follows, we will refer to the low- and high-dimensional parts of the covariate matrices via “unpenalized” and “penalized” variables, respectively. For these covariates and for their corresponding regression coefficients, we use the following notations. The covariate matrices $x$ and $z$ can include both types of unpenalized and penalized variables, which are represented as $x = (x_{u}, x_{p})$ and $z = (z_{u}, z_{p}),$ where $x_{u}, z_{u}$ and $x_{p}, z_{p}$ represent the unpenalized and penalized variables in $x$ and $z$ , respectively. The vector of regression coefficients $β$ and $b$ are also split in the same way into $β = (β_{u}, β_{p})$ and $b = (b_{0}, b_{u}, b_{p}),$ where $b_{0}$ is the intercept, and $β_{u}, b_{u}$ and $β_{p}, b_{p}$ represent the unpenalized and penalized regression coefficients, respectively.

3. Inference in the penalized mixture cure frailty model

Let the observed data be $D_{i} = (δ_{i}, t_{i}, x_{u, i}, x_{p, i}, z_{u, i}, z_{p, i}),$ $i = 1, \dots, n,$ where $t_{i}$ is the observed survival time for the $i$ th subject, $δ_{i}$ is an indicator function of censoring, with $δ_{i} = 1$ for the uncensored time and $δ_{i} = 0$ for the censored time. $z_{u, i} \in R^{P_{1}^{u}}$ and $z_{p, i} \in R^{P_{1}^{p}},$ are respectively the unpenalized and penalized observed covariates associated to the cure part of the model for the $i$ th subject, $i = 1, \dots, n$ , while $x_{u, i} \in R^{P_{2}^{u}}$ and $x_{p, i} \in R^{P_{2}^{p}}$ are the same for the survival part. When possible, we will use the compact notation $z_{i} = (1, z_{u, i}, z_{p, i})$ and $x_{i} = (x_{u, i}, x_{p, i})$ , with $z_{i} \in R^{P_{1} + 1}$ and $x_{i} \in R^{P_{2}}$ , $P_{1} = P_{1}^{u} + P_{1}^{p}$ and $P_{2} = P_{2}^{u} + P_{2}^{p}$ .

The likelihood function for the right-censored observed survival data $D = (D_{1}, \dots, D_{n})$ is

\begin{aligned} L (α, γ, θ, β, b | D) & = \prod_{i = 1}^{n} {f_{p o p} (t_{i} | x_{i} {, z}_{i})}^{δ_{i}} {S_{p o p} (t_{i} | x_{i} {, z}_{i})}^{1 - δ_{i}} \\ = \prod_{i = 1}^{n} {\frac{e^{z_{i}^{⊤} b}}{1 + e^{z_{i}^{⊤} b}} α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β} {(1 + \frac{α t_{i}^{γ} e^{x_{i}^{⊤} β}}{θ})}^{- θ - 1}}^{δ_{i}} \\ \times {\frac{1}{1 + e^{z_{i}^{⊤} b}} (1 + e^{z_{i}^{⊤} b} {(1 + \frac{α t_{i}^{γ} e^{x_{i}^{⊤} β}}{θ})}^{- θ})}^{1 - δ_{i}} \end{aligned}

(6)

where

α, γ

are the unknown parameters of the Weibull distribution,

θ

is the unknown parameter of the gamma distribution used for the frailty,

b \in R^{P_{1} + 1}

and

β \in R^{P_{2}}

are vectors of unknown regression coefficients for the covariates

x

and

z

, respectively, where we indicate

z_{i}^{⊤} b = b_{0} + z_{u, i}^{⊤} b_{u} + z_{p, i}^{⊤} b_{p}

and

x_{i}^{⊤} β = x_{u, i}^{⊤} β_{u} + x_{p, i}^{⊤} β_{p}

i = 1, \dots, n

. The maximum likelihood estimator of the unknown parameters can be obtained by direct maximization of the observed likelihood function in (6) when the number of unknown parameters is not large. Since we would like to consider high-dimensional covariates and perform variable selection via penalization, direct maximization of the likelihood function will not provide parameter estimates. We therefore propose to adapt the EM algorithm to include regularization methods to the purposes of parameter estimation and variable selection in penMCFM. Details on the proposed implementation of the EM for penMCFM are given in Section 3.2, after recalling the standard EM for MCFM in Section 3.1. For the sake of comparisons in the simulation studies, we have also adapted the GMIFS method from Ref.³⁶ to obtain parameter estimations in penMCFM using the observed likelihood function in (6), and details on the implementation of this method are given in Section 3.4.

3.1. EM algorithm for the standard MCFM

Recall that the cure indicator $Y$ is a latent boolean variable such that $Y = 1$ if an individual is susceptible (uncured), and $Y = 0$ if non-susceptible (cured). It follows from the censoring assumption that if $δ_{i} = 1$ then $y_{i} = 1$ , and if $δ_{i} = 0$ then $y_{i}$ is not observable; in this case, $y_{i}$ is latent and can assume either values (one or zero). Hence, $y = (y_{1}, \dots, y_{n})$ is only partially observed, which means that we need an EM algorithm to carry out inference on the latent variables.

The conditional survival function of model (5) for the $i$ -th subject, given the covariates and the latent value of the frailty $w_{i}$ , is

S_{p o p} (t_{i} | x_{i} {, z}_{i}, w_{i}) = 1 - π (z_{i}) + π (z_{i}) \exp (- w_{i} α t_{i}^{γ} e^{x_{i}^{⊤} β})

while the conditional survival and hazard functions of the susceptible subjects, that is those such that

Y_{i} = 1,

are respectively

S (t_{i} | x_{i}, w_{i}, Y_{i} = 1) = \exp (- w_{i} α t_{i}^{γ} e^{x_{i}^{⊤} β})

and

h (t_{i} | x, w_{i}, Y_{i} = 1) = w_{i} α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β}

Note that, for the

i

-th subject, we have that

1 - π (z_{i}) = P (y_{i} = 0 | z_{i})

. Hence, given the frailty

w_{i}

, the contribution of the

i

-th subject to the likelihood function is

1 - π (z_{i})

when

Y_{i} = 0

and

π (z_{i}) S (t_{i} | x_{i}, w_{i}, Y_{i} = 1) {[h (t_{i} | x_{i}, w_{i}, Y_{i} = 1)]}^{δ_{i}}

when

Y_{i} = 1

. Therefore, the complete-data likelihood function corresponding to (6) can be expressed as

\begin{aligned} L_{c} (α, γ, θ, β, b | D, y, w) & = \prod_{i = 1}^{n} {[1 - π (z_{i})]}^{1 - y_{i}} π (z_{i})^{y_{i}} \\ \times {[{h (t | x, y_{i} = 1)}^{δ_{i}} S (t_{i} | x_{i}, y_{i} = 1)]}^{y_{i}} f_{W} (w_{i}) \end{aligned}

given the values of the latent outcome

Y_{i} = y_{i}

and frailty

W_{i} = w_{i}

, for

i = 1, \dots n,

and where

f_{W} (w_{i})

is the pdf of the gamma distribution. Hence, the corresponding complete-data log-likelihood function

l_{c} (Λ),

where

Λ = (α, γ, θ, β, b),

can be written as the sum of the log-likelihood functions corresponding to the different model parts, and specifically

\begin{aligned} l_{c} (Λ) & = l_{c} (α, γ, θ, β, b | D, y, w) \\ = l_{c_{1}} (b) + l_{c_{2}} (α, γ, β) + l_{c_{3}} (θ) \end{aligned}

(7)

where

\begin{aligned} l_{c_{1}} (b) & = \sum_{i = 1}^{n} [(1 - y_{i}) \log (1 - π (z_{i})) + y_{i} \log π (z_{i})] \end{aligned}

\begin{aligned} l_{c_{2}} (α, γ, β) & = \sum_{i = 1}^{n} [δ_{i} \log (α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β}) - w_{i} y_{i} α t_{i}^{γ} e^{x_{i}^{⊤} β}] \end{aligned}

\begin{aligned} l_{c_{3}} (θ) & = \sum_{i = 1}^{n} [θ \log θ - \log Γ (θ) - w_{i} θ + (δ_{i} + θ - 1) \log w_{i}] \end{aligned}

and

δ_{i} y_{i} = δ_{i} .

In the E-step, we evaluate the conditional expectation of the complete-data log-likelihood function with respect to the latent variables $y_{i}$ ’s and $w_{i}$ ’s given the parameter estimates at the $m$ -th iteration of the algorithm $Λ^{(m)} = (α^{(m)}, γ^{(m)}, θ^{(m)}, β^{(m)}, b^{(m)}),$ and given the observed data $D$ . We then need to compute the conditional expectations of $y_{i},$ $w_{i}$ , $\log w_{i}$ and $w_{i} y_{i}$ given $(Λ^{(m)}, D)$ . These expectations for the $m$ -th iteration of the E-step are obtained as follows:

\begin{aligned} p_{i}^{(m)} & \equiv E (y_{i} | Λ^{(m)}, D) = δ_{i} + (1 - δ_{i}) {\frac{π (z_{i}) {(1 + α t_{i}^{γ} e^{x_{i}^{⊤} β} / θ)}^{- θ}}{(1 - π (z_{i})) + π (z_{i}) {(1 + α t_{i}^{γ} e^{x_{i}^{⊤} β} / θ)}^{- θ}} |}_{(Λ^{(m)}, D)}, \end{aligned}

(8)

\begin{aligned} a_{i}^{(m)} & \equiv E (w_{i} | Λ^{(m)}, D) = \frac{δ_{i} + θ}{θ + α t_{i}^{γ} e^{x_{i}^{⊤} β}} p_{i} + {\frac{δ_{i} + θ}{θ} (1 - p_{i}) |}_{(Λ^{(m)}, D)}, \end{aligned}

(9)

\begin{aligned} b_{i}^{(m)} & \equiv E (\log w_{i} | Λ^{(m)}, D) = [φ (δ_{i} + θ) - \log θ] (1 - p_{i}) + {[φ (δ_{i} + θ) - \log (θ + α t_{i}^{γ} e^{x_{i}^{⊤} β})] p_{i} |}_{(Λ^{(m)}, D)}, \end{aligned}

(10)

\begin{aligned} c_{i}^{(m)} & \equiv E (w_{i} y_{i} | Λ^{(m)}, D) = {\frac{δ_{i} + θ}{θ + α t_{i}^{γ} e^{x_{i}^{⊤} β}} p_{i} |}_{(Λ^{(m)}, D)} \end{aligned}

(11)

where

φ (.)

is the digamma function. In the M-step, the conditional expectation of the complete-data log-likelihood function is maximized with respect to the unknown parameters. For details about the derivation of expectations, please refer to the relevant calculations presented in Peng and Zhang.⁴

3.2. EM algorithm for penMCFM

When the covariate vectors $x$ and $z$ are high-dimensional, variable selection is also a purpose of the analysis, together with estimation of the unknown parameters and regression coefficients. To this aim, we consider a penalized version of the complete-data likelihood function, where a penalty term is added for the coefficients $β_{p}$ and $b_{p},$ corresponding to the penalized part of the high-dimensional covariates $x$ and $z$ . Then, the penalized complete-data log-likelihood function takes the form

\begin{aligned} l_{c}^{p e n} (Λ) & = l_{c_{1}} (b) + l_{c_{2}} (α, γ, β) + l_{c_{3}} (θ) - n \sum_{j = 1}^{P_{1}^{p}} P_{λ_{1}} (b_{p, j}) - n \sum_{l = 1}^{P_{2}^{p}} P_{λ_{2}} (β_{p, l}), \end{aligned}

where, from the notation introduced at the beginning of Section 3,

P_{1}^{p}

and

P_{2}^{p}

are the number of covariates included in the covariate vectors

z_{p}

and

x_{p}

, respectively, and

P_{λ_{i}} (\cdot)

is a penalty function depending on a vector of penalty parameters

λ_{i},

for

i = 1, 2

In the E-step, the conditional expectation of the penalized complete-data log-likelihood function with respect to the latent variables $y_{i}$ ’s and $w_{i}$ ’s are the same as in equations (8) to (11). Hence, $p_{i}^{(m)}, a_{i}^{(m)}, b_{i}^{(m)}$ and $c_{i}^{(m)}$ in (8) to (11) are used at the $m$ -th iteration of the E-step in penMCFM. Then, the conditional expectation of the penalized complete-data log-likelihood function at the $m$ -th step of the algorithm is $E l_{c}^{p e n, (m)} (Λ) = E l_{c_{1}}^{(m)} (b) + E l_{c_{2}}^{(m)} (α, γ, β) + E l_{c_{3}}^{(m)} (θ),$ where

\begin{aligned} E l_{c_{1}}^{(m)} (b) & = \sum_{i = 1}^{n} [p_{i}^{(m)} z_{i}^{⊤} b - \log (1 + e^{z_{i}^{⊤} b})] - n \sum_{j = 1}^{P_{1}^{p}} P_{λ_{1}} (b_{p, j}), \end{aligned}

(12)

\begin{aligned} E l_{c_{2}}^{(m)} (α, γ, β) & = \sum_{i = 1}^{n} δ_{i} \log (α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β}) - \sum_{i = 1}^{n} c_{i}^{(m)} α t_{i}^{γ} e^{x_{i}^{⊤} β} - n \sum_{l = 1}^{P_{2}^{p}} P_{λ_{2}} (β_{p, l}), \end{aligned}

(13)

\begin{aligned} E l_{c_{3}}^{(m)} (θ) & = \sum_{i = 1}^{n} [(δ_{i} + θ - 1) b_{i}^{(m)} - a_{i}^{(m)} θ] + n (θ \log θ - \log Γ (θ)) . \end{aligned}

(14)

We focus on the multi-step adaptive elastic-net penalty described by Xiao and Xu²⁰, which aims to achieve increased sparsity while concurrently reducing false positives in the variable selection process, and we adapt it to the case of penMCFM. When applying such elastic-net penalization for the high-dimensional covariates corresponding to

b_{p}

and

β_{p}

in (12) and (13), the components of the penalized complete-data log-likelihood function at the

m

th iteration of the E-step become

\begin{aligned} E l_{c_{1}}^{(m)} (b) & = \frac{1}{n} \sum_{i = 1}^{n} [p_{i}^{(m)} z_{i}^{⊤} b - \log (1 + e^{z_{i}^{⊤} b})] \\ - λ_{1, E n e t} [\frac{(1 - α_{E n e t})}{2} \sum_{j = 1}^{P_{1}^{p}} b_{p,_{j}}^{2} + α_{E n e t} \sum_{j = 1}^{P_{1}^{p}} w_{j} | b_{p, j} |] \end{aligned}

\begin{aligned} E l_{c_{2}}^{(m)} (α, γ, β) & = \frac{1}{n} \sum_{i = 1}^{n} [δ_{i} \log (α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β}) - c_{i}^{(m)} α t_{i}^{γ} e^{x_{i}^{⊤} β}] \\ - λ_{2, E n e t} [\frac{(1 - α_{E n e t})}{2} \sum_{l = 1}^{P_{2}^{p}} β_{p,_{l}}^{2} + α_{E n e t} \sum_{l = 1}^{P_{2}^{p}} w_{l} | β_{p, l} |] \end{aligned}

where

λ_{1, E n e t}

and

λ_{2, E n e t}

are the tuning parameters controlling the amount of penalty,

α_{E n e t} \in [0, 1]

is a higher level hyperparameter, and

w_{j}

is the data-driven weighting parameter corresponding to the

j -

th variable in the model (referred to as “adaptive weights” in the following). The tuning parameter values can be identified through cross-validation, and choices related to the tuning of these parameters are discussed in the following section. We update the adaptive weights

k

times as

w_{j}^{(k)} = \frac{1}{| {\hat{b}}_{p, j}^{(k - 1)} |}, j = 1, \dots, P_{1}

and

w_{l}^{(k)} = \frac{1}{| {\hat{β}}_{p, l}^{(k - 1)} |}, l = 1, \dots, P_{2}

for

b_{p}

and

β_{p}

. When

k = 1

, the adaptive weights equal

1

for all variables, and this case is equivalent to the regular elastic-net method. When

k = 2,

then the adaptive elastic-net method is used. The estimates of

b_{p}

and

β_{p}

at the

m

th iteration of the E-step are obtained by minimizing the negative penalized complete-data log-likelihood functions, that is respectively:

\begin{aligned} - E l_{c_{1}}^{(m)} (b) & = - \frac{1}{n} \sum_{i = 1}^{n} [p_{i}^{(m)} z_{i}^{⊤} b - \log (1 + e^{z_{i}^{⊤} b})] \\ + λ_{1, E n e t} [\frac{(1 - α_{E n e t})}{2} \sum_{j = 1}^{P_{1}^{p}} b_{p,_{j}}^{2} + α_{E n e t} \sum_{j = 1}^{P_{1}^{p}} w_{j}^{(k)} | b_{p, j} |] \end{aligned}

(15)

\begin{aligned} - E l_{c_{2}}^{(m)} (α, γ, β) & = - \frac{1}{n} \sum_{i = 1}^{n} [δ_{i} \log (α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β}) - c_{i}^{(m)} α t_{i}^{γ} e^{x_{i}^{⊤} β}] \\ + λ_{2, E n e t} [\frac{(1 - α_{E n e t})}{2} \sum_{l = 1}^{P_{2}^{p}} β_{p,_{l}}^{2} + α_{E n e t} \sum_{l = 1}^{P_{2}^{p}} w_{l}^{(k)} | β_{p, l} |] \end{aligned}

(16)

At the

m

th iteration of the M-step, the maximization of the

E l_{c}^{p e n, (m)} (Λ)

is equivalent to maximizing the

E l_{c_{1}}^{(m)} (b)

E l_{c_{2}}^{(m)} (α, γ, β)

and

E l_{c_{3}}^{(m)} (θ)

separately to obtain updated parameter estimates

Λ^{(m + 1)}

3.3. Implementation of the penMCFM algorithm

At the $m$ th iteration of the E-step, $E l_{c_{1}}^{(m)} (b)$ in (12) has the same structure as the penalized log-likelihood function of a logistic regression model, except for the parameters $p_{i}^{(m)}$ . As evident from the definition in (8), $p_{i}^{(m)}$ is not constrained to binary values, but can take any value in the interval $[0, 1]$ . Notably, it is important to note that these response values $p_{i}^{(m)}$ are not directly observable within this framework, in contrast to the typical situation in penalized logistic regression problems. Consequently, optimizing $E l_{c_{1}}^{(m)} (b)$ poses more challenges compared to solving a conventional penalization problem. This challenge has been recently highlighted by Bussy et al.³¹ and Fu et al.³⁵, and implies the impossibility to directly employ the glmnet R package¹⁶ to address the high-dimensional component of our problem. The second term, $E l_{c_{2}}^{(m)} (α, γ, β)$ in (13), can be considered as the log-likelihood function of the penalized Weibull regression model with $h (t_{i}) = α γ t_{i}^{γ - 1} e^{x_{i}^{⊤} β + \log c_{i}^{(m)}}$ and $S (t_{i}) = \exp (- α t_{i}^{γ} e^{x_{i}^{⊤} β + \log c_{i}^{(m)}}) .$ Here $c_{i}^{(m)}$ is a constant that is evaluated by using (11) before the optimization steps of the algorithm. We employ the lbfgs R package,³⁹ which provides the functionalities to utilize $L_{1}$ penalties for the minimization of the cost functions outlined in (15) and (16). Finally the third term, $E l_{c_{3}}^{(m)} (θ)$ in (14), is maximized to obtain the estimate of the frailty distribution parameter $θ$ (or its reciprocal, i.e. the variance).

We give a pseudo-code description of our EM algorithm for penMCFM in Algorithm 1. Initialization is carried out as follows: we use Cox regression estimates for $β_{u}$ , moment estimates for the Weibull distribution parameters $α$ and $γ$ , and we set $β_{p} = b_{0} = b_{u} = b_{p} = 0$ and $θ = 1$ . Generally, it is assumed that the tuning parameters for both penalties are common, that is $λ_{1, E n e t} = λ_{2, E n e t} = λ_{E n e t}$ for the sake of simplicity. We have followed this approach in our simulations as well. We then generate a sequence of $λ_{E n e t}$ based on the penalized Cox regression of $t, δ, x_{u}, x_{p}$ using the glmnet R package.¹⁶

3.4. GMIFS method for penMCFM

The GMIFS method is a variable selection technique that iteratively enhances the predictive power of a model by progressively adding variables, thus it can be adapted to obtain parameter estimations in penMCFM. Unlike traditional forward selection methods, GMIFS enforces a monotonic constraint on the coefficients of the added variables, ensuring that they either remain constant or increase with each iteration. By sequentially introducing variables with controlled increases in their coefficients, GMIFS strikes a balance between model sparsity and predictive accuracy. This method was first introduced by Hastie et al.³⁶ for solving penalized least squares regression problems. Since then, Makowski and Archer⁴⁰ and Hou and Archer⁴¹ provided extensions to Poisson regression and ordinal responses in the high dimensional setting, Yu et al.⁴² to variable selection for nonlinear regression, and Fu et al.³⁵ to the Weibull MCM.

In the simulation studies, with the aim of comparing our EM-based method for penMCFM not only with the penalized Weibull MC model introduced in Fu et al.,³⁵ but also with another variable selection algorithm, the GMIFS method is adapted so to be able to target our observed likelihood function in (6). For a comprehensive introduction to this method, readers are encouraged to refer to Hastie et al.³⁶ and Fu et al.,³⁵ and for further details on the implementation, to check our code details at https://github.com/fatihki/penMCFM.

4. Simulation study

In this section, we describe the results of extensive simulation studies that we have conducted to evaluate the performance of our proposed model. The evaluation of model performance involves assessing both the accuracy of the model predictions, and its capability to correctly identify covariates with true non-zero coefficients. We examine different scenarios involving varying censoring rates, cure rates, correlation among covariates, and number of nonzero coefficients.

For the sake of performance comparisons in the simulation studies, besides our proposed penMCFM EM-based algorithm (detailed in Sections 3.2-3.3), which we refer to as penMCFM(EM), we also implemented a version of the penMCFM algorithm where we incorporate the GMIFS method (detailed in Section 3.4), named penMCFM(GMIFS), and the GMIFS-based penalized Weibull MC model from Fu et al.,³⁵ named MCM(GMIFS). We are also interested in exploring the performance of a survival model that does not account for a cure fraction. To this aim, the Cox PH model with lasso penalty is employed by using the glmnet¹⁶ R package based on the observed data $(δ, t, x_{u}, x_{p})$ , where the tuning parameter $λ$ that gives the most regularized model such that the cross-validated error is within one standard error of the minimum is chosen. This $λ$ value is referred to as “lambda.1se” in Friedman et al.,¹⁶ and thus we refer to this method as penCox.1se. Note that this method only considers the latency part of our model, namely the one related to the $β_{p}$ regression coefficients.

The R code for implementing the proposed estimation algorithms and for replicating the simulation studies is available at https://github.com/fatihki/penMCFM.

4.1. Data generation under the Weibull MCFM

In order to generate random samples from a Weibull MCFM with right-censoring, cure rate and covariates, we follow the procedure outlined in Asano et al.⁴³ This method can be regarded as a modification of the inverse sampling technique, particularly suited for situations where the data generating process is assumed to have an improper survival function. From the population survival function $S_{p o p} (t | x, z)$ in (5) we can derive the inverse function for the cdf of the population as

F_{p o p}^{- 1} (u | x, z) = {θ α^{- 1} e^{- x^{⊤} β} ({[1 - u / π (z)]}^{- 1 / θ} - 1)}^{1 / γ}

where

0 \leq u \leq (1 - π (z))

. The pseudo-code description of a random sample generation for the Weibull MCFM model is detailed in Algorithm 2.

4.2. Simulation design

We carry out a large-scale simulation study to test the novel high-dimensional setup of our model. Our data generating process can be described as follows: we set the number of observations to $n = 500$ , and the penalized covariates sizes as $P_{1}^{p} = P_{2}^{p} = P = 1000$ . The sparsity levels of $β_{p}$ and $b_{p}$ are set to be $s = 20,$ meaning that $s$ is the number of non-zero coefficients (thus, 2% sparsity in this setting). We randomly split the dataset into a $80 %$ training and a $20 %$ testing set to assess performance, and implement a $4$ -fold cross-validation on the training set to tune the model parameters. The parameter $λ_{E n e t}$ is tuned over a grid $λ_{E n e t}$ of $50$ values on a log scale in the interval $[λ_{E n e t}^{min}, λ_{E n e t}^{max}],$ using the same sequence of $λ$ values as used for penalized Cox regression in the glmnet R package. The $α_{E n e t}$ parameter is chosen among the values $0.1, 0.5, 0.9$ and $1$ .

A categorical variable with three levels is used as unique unpenalized covariate $Z_{u} \in R^{n \times P_{1}^{u}}$ with $P_{1}^{u} = 1$ , generated with weight probabilities $0.4, 0.35, 0.25$ . The unpenalized covariates associated to the survival part of the model $X_{u} \in R^{n \times P_{2}^{u}}$ are i.i.d. standard normal, and $P_{2}^{u} = 10$ . The penalized covariates $Z_{p}$ and $X_{p}$ are assumed common, and generated from a $P$ -dimensional Gaussian distribution $M V N (0, Σ),$ where $Σ \in R^{P \times P}$ is a block diagonal matrix with block size $50$ , and with the correlation between any pair of covariates within the same block defined by $c o r r (Z_{p, i}, Z_{p, j}) = c o r r (X_{p, i}, X_{p, j}) = ρ^{| i - j |}$ , $\forall i, j = 1, \dots, P$ ; the tested values for $ρ$ are $0, 0.2, 0.5$ . The unpenalized regression coefficients are set as $(b_{0}, b_{u}) = (- 2, - 1, 1),$ while $β_{u} \in R^{n \times P_{2}^{u}}$ is randomly generated from a uniform distribution on the interval $[- 3, 3]$ . For what concerns the penalized regression coefficients $β_{p}, b_{p} \in R^{P},$ we consider five different settings with respect to their nonzero coefficients. In each setting, $s$ components of both $β_{p}$ and $b_{p}$ are nonzero and take the same value $v,$ which we will refer to as the “signal.” The signals are occurring at roughly equally-spaced indices between $1$ and $P$ , while the rest of the coefficients is equal to $0$ . Hence, the non-zero coefficients are distributed equally in each block, and specifically one non-zero coefficient per block. Five different signal values are considered, specifically $v = (0.5, 1, 1.5, 2, 2.5)$ , to test the model when the signal in the data becomes smaller and smaller.

Finally, we set the Weibull distribution parameters as $(α, γ) = (1.25, 2.5)$ , the frailty distribution parameter $θ = 0.5,$ and $λ_{c} = 0.5$ for generating the censoring times. After generating all model coefficients and covariates according to these assumptions, we generate the data using Algorithm 2. This process is repeated $100$ times, and the average results across all runs of the EM on the $100$ datasets so obtained are subsequently reported. Notably, the average censoring (cure) rates in our simulated samples are approximately $85 (76) %, 76 (70) %, 70 (62) %, 66 (59) %, 63 (57) %$ for the respective $v$ values chosen in $v .$

4.3. Performance evaluation

To compare the performances of the various models considered in our simulation studies, we consider several performance metrics. For selecting the tuning parameter $λ_{E n e t}$ , we use $4$ -fold cross-validation on a revised version of the concordance index defined in (17) below, and also used in Fu et al.³⁵ Harrell’s concordance index (or C-statistic, or C-index) is widely used as a measure of performance in fitted survival models for censored data. The C-statistic for a standard survival model is the proportion of concordant pairs out of the total number of possible evaluation pairs, given by

\hat{C} = \frac{\sum_{i = 1}^{n} \sum_{j = 1, i \neq j}^{n} I (x_{i}^{⊤} \hat{β} > x_{j}^{⊤} \hat{β}) I_{i, j}}{\sum_{i = 1}^{n} \sum_{j = 1, i \neq j}^{n} I_{i, j}}

where

I_{i, j} = I [t_{i} < t_{j}, δ_{i} = 1] + I [t_{i} = t_{j}, δ_{i} = 1, δ_{j} = 0]

, and

\hat{β}

is the estimated vector of coefficients. Assano and Hirakawa⁴⁴ noted that the value of

\hat{β} x_{i}

is the same for both cured patients and censored uncured patients, and therefore proposed a modified C-statistics that considers this issue by introducing a cure status weighting for each patient. The weight is defined as

1

for the uncured patients who experienced the event,

0

for the cured patients, and is equal to the estimate of the uncured probability

\hat{π} (z)

for the censored patients. Then, the C-statistic is defined by Assano and Hirakawa⁴⁴ as

\begin{aligned} {\hat{C}}_{C u r e} = \frac{\sum_{i = 1}^{n} \sum_{j = 1, i \neq j}^{n} I (x_{i}^{⊤} \hat{β} > x_{j}^{⊤} \hat{β}) {ϑ_{j} y_{j} + (1 - ϑ_{j}) \hat{π} (z_{j})} I_{i, j}}{\sum_{i = 1}^{n} \sum_{j = 1, i \neq j}^{n} {ϑ_{j} y_{j} + (1 - ϑ_{j}) \hat{π} (z_{j})} I_{i, j}} \end{aligned}

(17)

where

ϑ_{i}

is an indicator function taking values

ϑ_{i} = 1

y_{i} = 0

1

, and

ϑ_{i} = 0

y_{i}

is missing, and where

\hat{π} (z_{j}) = e^{z_{j}^{⊤} \hat{b}} / (1 + e^{z_{j}^{⊤} \hat{b}})

. Notice that

{\hat{C}}_{C u r e}

in (17) reduces to

\hat{C}

when

v_{i} = 1

and

y_{i} = 1

for all patients.

Moreover, we evaluate the performance of the variable selection part of the model by calculating the associated sensitivity, that is the percentage of non-zero coefficients accurately estimated as non-zeros, specificity, that is the percentage of zero coefficients accurately estimated as zero, and false positive rate (FPR), that is the proportion of wrongly selected zero coefficients among those estimated as non-zero coefficients, which is equal to the proportion of mistakenly selected irrelevant variables among those identified as significant. We also evaluate the model prediction performance by considering the relative model error (RME), and the model estimation performance by considering the estimation error (ERR). For the vector of regression coefficients $β$ , these measures are defined as

\begin{aligned} Sensitivity = & \sum_{l = 1}^{P} I (β_{l} \neq 0 \cap {\hat{β}}_{l} \neq 0) / \sum_{l = 1}^{P} I (β_{l} \neq 0) \\ Specificity = & \sum_{l = 1}^{P} I (β_{l} = 0 \cap {\hat{β}}_{l} = 0) / \sum_{l = 1}^{P} I (β_{l} = 0) \\ FPR = & \sum_{l = 1}^{P} I (β_{l} = 0 \cap {\hat{β}}_{l} \neq 0) / \sum_{l = 1}^{P} I (β_{l} = 0) \\ RME = & (\hat{β} - β)^{T} Σ (\hat{β} - β) / ({\hat{β}}^{*} - β)^{T} Σ ({\hat{β}}^{*} - β) \\ ERR = & (\hat{β} - β)^{T} (\hat{β} - β) / ({\hat{β}}^{*} - β)^{T} ({\hat{β}}^{*} - β) \end{aligned}

where

Σ

is the covariance matrix of the covariates,

β

is the true vector of coefficients, and

{\hat{β}}^{*}

is the oracle estimate of the coefficients derived from the model where only the true signals were included and the coefficients for the other covariates were forced to be zero.

Furthermore, according to the model, the probability of being cured is sample-specific: this means that, even when two patients share identical clinical characteristics, their unique genomic profiles can influence the cure probability in different ways. Recall that the uncured probability for the $i$ th subject is computed by using the formula $π (z_{i}) = e^{z_{i}^{⊤} b} / (1 + e^{z_{i}^{⊤} b})$ , for given $b$ coefficients and for the $i$ th subject covariates $z_{i}$ , $i = 1, \dots, n$ . Therefore, the average true uncured probability can be computed as $\sum_{k = 1}^{M} [\sum_{i = 1}^{n} π^{k} (z_{i}) / n] / M$ over $M$ Monte Carlo runs of the EM algorithm, and over the $n$ subjects in the sample, where $π^{k} (z_{i})$ is the true uncured probability for the $i$ th subject in the $k$ th Monte Carlo simulation, with $M$ being the total number of simulated datasets. Consequently, we can also evaluate the accuracy of model estimates in terms of bias and mean squared error (MSE) of the estimated uncured probability $\hat{π} (z)$ , which are calculated as

B i a s (\hat{π} (z)) = \frac{1}{M} \sum_{k = 1}^{M} [\frac{1}{n} \sum_{i = 1}^{n} {{\hat{π}}^{k} (z_{i}) - π^{k} (z_{i})}]

and

M S E (\hat{π} (z)) = \frac{1}{M} \sum_{k = 1}^{M} [\frac{1}{n} \sum_{i = 1}^{n} {{\hat{π}}^{k} (z_{i}) - π^{k} (z_{i})}^{2}]

where

π^{k} (z_{i})

and

{\hat{π}}^{k} (z_{i})

are respectively the true uncured probability and its estimate for the

i

-th subject and the

k

-th Monte Carlo run, similarly to what reported in Pal et al.⁴⁵

4.4. Simulation results

We here report the simulation results for each scenario specified in the simulation design described in Section 4.2, over $M = 100$ simulated datasets, when using the different performance measures as introduced in Section 4.3. We plot the number of non-zero regression coefficients, and the sensitivity, specificity, and FPR, to show the performance of the variable selection for both sets of regression coefficients $β_{p}$ and $b_{p}$ in Figures 1 and 2, respectively. The C-statistics plots for both training and testing data are shown in Figure 3. Here the regular C-statistic, namely $\hat{C}$ , is used for the penCox.1se method (which does not include a cured part), while ${\hat{C}}_{C u r e}$ is used for all other methods. The performance of the uncured rate estimates is also shown in Figure 4, where the absolute Bias and MSE are reported. Finally, we present results that illustrate the prediction and estimation performance of the models, as well as the cure rates, by reporting the RME, ERR, $B i a s (\hat{π} (z))$ and $M S E (\hat{π} (z))$ in Table 1, for the case in which the data are generated with $ρ = 0$ . Corresponding tables for the cases of $ρ = 0.2$ and $0.5$ are included in Tables S1 and S2 of the Supplementary Material. Additionally, we investigate the uncertainty of the non-zero regression coefficients and the uncured rate. We compute the empirical standard error of the parameter estimates for the $20$ non-zero coefficients for both $β_{p}$ and $b_{p}$ , as well as $π (z)$ , across the $100$ repeatedly simulated datasets. We present the average estimates of these parameters, accompanied by one standard error wide bars, for the cases of $ρ = 0$ and $0.5$ , which are included in Figures S1–S4 of the Supplementary Material.

Figure 1.

Results of simulation studies. Variable selection performance in inference for $β_{p}$ .

Figure 2.

Results of simulation studies. Variable selection performance in inference for $b_{p}$ .

Figure 3.

Results of simulation studies. C-statistics plot based on the train (top row) and test (bottom row) data.

Figure 4.

Results of simulation studies. Absolute bias and MSE plots for the uncured rate estimates.

Table 1.

Simulation studies results for $b_{p}, β_{p}$ and $π (z)$ when $ρ = 0$ .

		$β_{p}$		$b_{p}$		$π (z)$
$v$	Method	RME(SD)	ERR (SD)	RME(SD)	ERR (SD)	Bias	MSE
0.5	A1	1.248 (0.268)	1.248 (0.268)	0.491 (0.938)	0.491 (0.938)	0.007	0.067
	A2	1.560 (1.246)	1.560 (1.246)	0.515 (0.977)	0.515 (0.977)	0.008	0.071
	A3	1.747 (1.746)	1.747 (1.746)	0.553 (1.211)	0.553 (1.211)	0.009	0.072
	A4	1.821 (1.912)	1.821 (1.912)	0.559 (1.227)	0.559 (1.227)	0.008	0.072
	B	1.193 (0.056)	1.193 (0.056)	0.496 (0.936)	0.496 (0.936)	0.237	0.130
	C	1.502 (0.322)	1.502 (0.322)	0.461 (0.881)	0.461 (0.881)	−0.022	0.057
	D	1.134 (0.282)	1.134 (0.282)

1	A1	0.801 (0.113)	0.801 (0.113)	1.365 (2.165)	1.365 (2.165)	0.012	0.083
	A2	0.807 (0.149)	0.807 (0.149)	1.525 (2.578)	1.525 (2.578)	0.026	0.099
	A3	0.809 (0.153)	0.809 (0.153)	1.536 (2.604)	1.536 (2.604)	0.030	0.103
	A4	0.811 (0.153)	0.811 (0.153)	1.527 (2.563)	1.527 (2.563)	0.029	0.103
	B	1.056 (0.028)	1.056 (0.028)	1.809 (2.816)	1.809 (2.816)	0.122	0.154
	C	1.075 (0.045)	1.075 (0.045)	1.404 (2.192)	1.404 (2.192)	$- 0.031$	0.086
	D	0.851 (0.085)	0.851 (0.085)

1.5	A1	0.729 (0.089)	0.729 (0.089)	2.855 (5.976)	2.855 (5.976)	0.009	0.077
	A2	0.644 (0.125)	0.644 (0.125)	2.953 (6.416)	2.953 (6.416)	0.034	0.091
	A3	0.635 (0.140)	0.635 (0.140)	2.985 (6.421)	2.985 (6.421)	0.038	0.097
	A4	0.631 (0.137)	0.631 (0.137)	2.989 (6.465)	2.989 (6.465)	0.039	0.097
	B	1.021 (0.019)	1.021 (0.019)	3.873 (8.061)	3.873 (8.061)	0.104	0.180
	C	1.019 (0.026)	1.019 (0.026)	3.177 (6.822)	3.177 (6.822)	−0.038	0.094
	D	0.807 (0.048)	0.807 (0.048)

2	A1	0.748 (0.056)	0.748 (0.056)	3.267 (7.496)	3.267 (7.496)	0.004	0.073
	A2	0.669 (0.093)	0.669 (0.093)	3.290 (7.573)	3.290 (7.573)	0.032	0.092
	A3	0.638 (0.096)	0.638 (0.096)	3.430 (8.095)	3.430 (8.095)	0.053	0.106
	A4	0.636 (0.096)	0.636 (0.096)	3.431 (8.154)	3.430 (8.154)	0.050	0.103
	B	1.012 (0.014)	1.012 (0.014)	4.033 (8.926)	4.033 (8.926)	0.052	0.173
	C	1.013 (0.017)	1.013 (0.017)	3.645 (8.309)	3.645 (8.309)	-0.038	0.111
	D	0.837 (0.037)	0.837 (0.037)

2.5	A1	0.804 (0.047)	0.804 (0.047)	4.657 (11.521)	4.657 (11.521)	−0.005	0.074
	A2	0.740 (0.069)	0.740 (0.069)	4.603 (10.956)	4.603 (10.956)	0.020	0.089
	A3	0.718 (0.077)	0.718 (0.077)	4.458 (10.339)	4.458 (10.339)	0.022	0.090
	A4	0.715 (0.080)	0.715 (0.080)	4.632 (11.091)	4.632 (11.091)	0.030	0.096
	B	1.012 (0.017)	1.012 (0.017)	5.560 (13.218)	5.560 (13.218)	0.068	0.203
	C	1.012 (0.012)	1.012 (0.012)	5.202 (12.640)	5.202 (12.640)	$- 0.037$	0.130
	D	0.872 (0.024)	0.872 (0.024)

Note: Method A1–A4: penMCFM (EM) for $α_{E n e t} = 0.1, 0.5, 0.9, 1$ , B: penMCFM (GMIFS), C: MCM (GMIFS),

D: penCox.1se; the best result appears in bold.

Each panel in the figures shows the average of a given metric over $M$ repetitions for the respective methods under consideration. In the table, mean results of the different methods are listed along with the standard deviation (in parenthesis). It is worth noting that the GMIFS method is independent of the $α_{E n e t}$ parameter, while penCox.1se only includes the lasso case for the penalized Cox regression, and thus results associated to these two methods remain unchanged regardless of the $α_{E n e t}$ values.

To assess the accuracy of variable selection, it is vital to consider sensitivity, specificity, and the FPR plots in combination. Figures 1 and 2 reveal that the proposed penMCFM(EM) method exhibits very similar average results when choosing $α_{E n e t}$ equal to either values $0.5, 0.9, 1$ across all metrics, and moreover, these figures also show that penMCFM(EM) is the method providing best overall results across all simulated scenarios and considered metrics. Notably, when looking specifically at Figure 1, the penMCFM(GMIFS) method consistently selects the fewest variables, often below the true value of $20$ , while conversely the MCM model version, MCM(GMIFS), chooses the largest number of variables compared to other methods. These behaviors obviously reflect in the often suboptimal sensitivity, specificity and FPR values attained by these two methods. The penCox.1se, which utilizes the penalized Cox regression with a lasso penalty, ranks as the second method with regards to the number of selected variables in Figure 1, thus showing large sensitivity but suboptimal specificity and FPR. If we compare the variable selection performance of our proposed penMCFM(EM) method when choosing $α_{E n e t} = 1$ and penCox.1se, they show similar sensitivity plots although penMCFM(EM) selects less variables than penCox.1se. Moreover, penMCFM(EM) attains better performance across $α_{E n e t}$ values than penCox.1se in terms of specificity and FPR across all $v$ and $ρ$ settings. The MCM (GMIFS) method demonstrates better sensitivity performance for $b_{p}$ (Figure 2) as compared to $β_{p}$ (Figure 1). However, similarly to what shown for $β_{p}$ in Figure 1, penMCFM(EM) consistently outperforms other methods across all $α_{E n e t}$ values also for estimating $b_{p}$ (Figure 2), as it selects the fewest variables and frequently approaches the true value of $20$ more closely than other methods, while maintaining large sensitivity and specificity, and exhibiting low FPR.

The overall performance of the methods in terms of variable selection generally deteriorates when the signal becomes weaker, specifically when $v = 0.5$ , and it improves as the signal magnitude increases, as observed in both figures. The effect of the signal parameter $v$ is apparent in the increasing variable selection performance particularly in the number of selected variables and sensitivity. On the other hand, while the other methods seem to have slightly deteriorating performance when the correlation among variables is increasing, the impact of the correlation parameter $ρ$ on the performance of penMCFM(EM) cannot be prominently discerned. We therefore conclude that the penMCFM(EM) method, especially when choosing $α_{E n e t} = 0.5, 0.9,$ or 1, attains better variable selection performance than other methods in terms of the true number of selected variables, high sensitivity and specificity, and low FPR.

From inspection of Figure 3, we note that performance in terms of C-statistics is generally slightly better on the testing set than on the training set for all methods, as also previously observed in a similar model setting by Fu et al.³⁵ As the signal magnitude increases, the penMCFM(EM) and penCox.1se methods as expected tend to attain larger C-statistics values for both the train and test data. penCox.1se shows generally the best performance except for the cases $v = 0.5$ and $1$ , when performance deteriorates more than for other methods. penMCFM(EM) generally shows larger C-statistic values when choosing $α_{E n e t} = 0.5, 0.9, 1$ as compared to $α_{E n e t} = 0.1$ , and with respect to penMCFM(GMIFS) (except for $v = 0.5, 1$ on the test data). Even if the C-statistics values attained by MCM(GMIFS) tend to be larger than those attained by penMCFM(EM), the performance of the two methods becomes more and more similar as the signal magnitude increases, both on the training and on the testing datasets. It is also worth noting that, when calculating the C-statistics for a cure model, ${\hat{C}}_{C u r e}$ in (17), both coefficient vectors $β_{p}$ and $b_{p}$ are considered in the calculations, thus making it plausible that this contributes to the larger values attained by MCM(GMIFS) as compared to penMCFM(EM). Nonetheless, the MCM(GMIFS) method consistently exhibits the highest count of nonzero variables for both $β_{p}$ and $b_{p}$ in all scenarios (Figures 1 and 2, respectively), thus showing a worse variable selection performance. To summarize, we can conclude that generally all the methods show acceptable and comparable C-statistics values for almost all considered scenarios. We thus give more importance to the evaluation of the variable selection performance, which discriminates more among the considered methods, and which impacts in a much more crucial way on the reliable interpretation of the model outputs.

Finally, Figure 4 shows that penMCFM(EM) when choosing $α_{E n e t} = 1$ and $0.5$ generally gives the lowest first and second average absolute bias and MSE values of $\hat{π} (z)$ , respectively. penMCFM(GMIFS) exhibits the least favorable performance among all methods. Additionally, it is worth noting that the penMCFM(EM) method showcases consistent behavior across varying correlation sizes for each $α_{E n e t}$ value.

From Table 1, it can be observed that our proposed method penMCFM(EM) outperforms other methods in terms of the RME and ERR metrics, except for the most difficult case when $v = 0.5$ . In almost all scenarios, penMCFM(EM) with $α_{E n e t} = 0.1$ shows the best prediction and estimation performance for the regression coefficients in the cure part, and therefore we observe similar performance for the estimation of $π (z)$ .

5. Application to RNA-Seq data from TCGA-BRCA

In this section we utilize publicly available omics data from The Cancer Genome Atlas (TCGA) project¹ for showcasing the use of the proposed penMCFM model on a real case application. The overall survival time, demographic and gene expression data from primary invasive BReast CAncer patients in the TCGA database (TCGA-BRCA) were obtained from the Genomic Data Commons Data Portal data release v33.0. We retrieve the RNA-Seq data from the primary tumor of TCGA-BRCA patients, together with the accompanying metadata comprising survival outcomes, as well as clinical and demographic variables. We also use the BCR Biotab files to gather some additional clinical variables related to hormone status, such as estrogen receptor (ER), human epidermal growth receptor 2 (HER2) and progesterone receptor (PR), for the same TCGA-BRCA patients.

The original data set comprises $1111$ samples with $60660$ gene expression features, and $87$ clinical and demographic variables. We specifically consider protein-coding genes, and after eliminating duplicate genes we are left with $19938$ RNA-seq features. We then remove observations which show a survival time of less than one month, leaving $1017$ samples before arrangement of the clinical variables. We use years as unit for the survival time variable in this analysis. We apply the DESeq2 normalization approach for RNA-seq data as implemented in the R/Bioconductor package DESeq2⁴⁶ before further statistical analysis, as suggested by Zhao et al.^47,48

For data preprocessing, we reduce the $19938$ RNA-seq features using the variance filter method implemented in the M3C R package,⁴⁹ thus reducing the features considered to the top $2000$ RNA-seq, which explain around $41 %$ of the variation of the whole data. Since prior knowledge of BRCA plays a crucial role in the identification of potential prognostic biomarkers, we consolidate genes of interest from five distinct sources of prior knowledge as in Li and Liu.^50,51 This selection includes:

the $147$ genes from the KEGG breast cancer pathway,⁵²

the $519$ genes related to BRCA from the top ranked gene ontology (GO) terms sorted in gene ontology annotations (GOA),⁵³

the $70$ genes known as the “MammaPrint” BRCA signatures,⁵⁴

the $128$ genes collected by the online consensus survival analysis web server for breast cancers (OSbrca) previously published BRCA biomarkers,⁵⁵

the $10$ BRCA prognosis signatures selected by the scPrognosis method using single-cell RNA sequencing (scRNA-seq) data.⁵⁶

When we take the intersection of these sources with our already reduced set of

19938

RNA-seq features, and combine them with the top

2000

features, we obtain a total of

2650

RNA-seq features. Lastly, we filter the samples based on the clinical variables listed in Table 2, resulting in a dataset of

828

observations and around

89 %

censoring rate. The remaining portion of the RNA-seq data, which includes

189

observations and has around

70 %

censoring rate, will be used for validating the identified genes. Even if the censoring rate is slightly lower on this validation set (as compared to the training and testing sets), filtering on clinical variables makes this validation dataset at most homogeneous to the training and testing sets with respect to covariates, which is most crucial in calculating the Prognostic Risk Score.

Table 2.
Demographic and clinical characteristics of TCGA-BRCA patients.

Clinical outcomes Mean Range Event rate

Overall survival (year) 3.19 $[0.08, 18.05]$ 0.109

Clinical covariates Levels $n$ Clinical covariates Levels $n$

PAM50 Subtypes Basal 144 Pathological Stage (PS) Stage I 151

Her2 58 Stage II 478

LumA 432 Stage III 180

LumB 164 Stage IV 14

Normal 30 Stage X 5

Lymph nodes (LN) N0 402 Tumor size (TS) T1 221

N1 273 T2 481

M2 87 T3 101

N3 56 T4 24

NX 10 TX 1

Metastases stage (MS) M0 706 HER2 status Equivocal 172

M1 14 Positive 143

MX 108 Negative 513

Pharmaceutical Therapy (PT) Yes 669 ER status Positive 648

No 159 Negative 180

Radiation Therapy (RT) Yes 451 PR status Positive 567

No 377 Negative 261

Age (year)¹ 58.79 $[26.6, 90]$

Clinical outcomes	Mean	Range	Event rate
Overall survival (year)	3.19	$[0.08, 18.05]$	0.109

Clinical covariates	Levels	$n$	Clinical covariates	Levels	$n$
PAM50 Subtypes	Basal	144	Pathological Stage (PS)	Stage I	151
	Her2	58		Stage II	478
	LumA	432		Stage III	180
	LumB	164		Stage IV	14
	Normal	30		Stage X	5
Lymph nodes (LN)	N0	402	Tumor size (TS)	T1	221
	N1	273		T2	481
	M2	87		T3	101
	N3	56		T4	24
	NX	10		TX	1
Metastases stage (MS)	M0	706	HER2 status	Equivocal	172
	M1	14		Positive	143
	MX	108		Negative	513
Pharmaceutical Therapy (PT)	Yes	669	ER status	Positive	648
	No	159		Negative	180
Radiation Therapy (RT)	Yes	451	PR status	Positive	567
	No	377		Negative	261
Age (year)¹	58.79	$[26.6, 90]$

mean and range

We consider two real data scenarios for the choice of the penalized covariates $X_{p}$ and $Z_{p}$ , which are the same for the cure and the survival part: Scenario $1$ , where the penalized covariates only include the top $2000$ RNA-seq features after the variance filtering, and Scenario $2$ , which is built upon the larger set of $2650$ RNA-seq features, combining the same top $2000$ RNA-seq features as in Scenario $1$ with the known genes related to BRCA. In both scenarios, we employ the same set of unpenalized covariates $X_{u}$ and $Z_{u}$ . Unpenalized covariates in $X_{u}$ include the following clinical variables: ER, PR, HER2, PT, RT, Age, while $Z_{u}$ includes PAM50, MS, LN, TS, PS, ER, PR, HER2 (as in De Bin et al.,⁵⁷ Volkmann et al.⁵⁸ and Li and Liu⁵⁰).

5.1. Identification of biomarkers

To identify the relevant features (biomarkers), we initially partition the dataset randomly into a $80 %$ training and a $20 %$ testing set. In order to select the tuning parameter $λ_{E n e t}$ , we employ a $4$ -fold cross-validation on the training dataset to obtain the optimal $λ_{E n e t}$ that maximizes the C-statistic ( $\hat{C}$ for the penCox.1se method and ${\hat{C}}_{C u r e}$ for all other methods). We additionally tune the $α_{E n e t}$ parameter, by considering values in the set ${0.1, 0.3, 0.5, 0.7, 0.9, 1}$ , as the one maximizing the C-statistic value in the testing dataset when employing the $λ_{E n e t}$ parameter selected from the training dataset. We then repeat this random splitting (and subsequent parameter tuning) $20$ times to get several new training and testing datasets, as a part of our experiments. It is crucial to emphasize the importance of such approach to ensure feature selection robustness, as the selected features are consolidated as the union of the genes associated with non-zero coefficients in all $20$ experiments.

We employ our methods in both Scenarios 1 and 2 as detailed above. The $α_{E n e t}$ tuning parameter is determined based on the average values of the C-statistic over the $20$ repeats for the penMCFM(EM) and penCox.1se methods, the only ones allowing this parameter. The results obtained using the optimally selected $α_{E n e t}$ parameters can be seen in Figure 5, which shows the boxplots of the C-statistic for the different methods. The number of identified biomarker genes is different for the different methods: for instance, in Scenario $2$ , the penMCFM(EM) method with $α_{E n e t} = 1$ selects $182$ genes for the latency part of the model, genes which correspondingly show nonzero values in $β_{p}$ , while MCM(GMIFS) selects $208$ genes. These identified sets of biomarker genes show great overlap across different methods, as shown in Figure 6, which illustrates the overlap of the selected genes for the latency part of the model across the four methods. Interestingly, penMCFM(EM) generally does not select any features for the incidence part of the model, meaning that no coefficient in $b_{p}$ is estimated to be significantly different from zero, while the GMIFS method (when used with both models penMCFM and MCM) usually selects some features. A similar result was also observed in Fu et al.³⁵ in an application to acute myeloid leukemia data. A summary of the overlap of the selected genes of these two methods for the incidence part of the model is also given in Figure S5 of the Supplementary Material. Furthermore, the selected sets of biomarker genes for the different methods, along with their frequencies of occurrence obtained over the $20$ repeats of each method, are provided at https://github.com/fatihki/penMCFM as excel files.

Figure 5.

Results of the analysis of the TCGA-BRCA data. Boxplots of the C-statistic, AUC and IBS values obtained on the testing datasets over the $20$ repeated data-splitting processes. (a) Scenario $1$ and (b) Scenario $2$ .

Figure 6.

Results of the analysis of the TCGA-BRCA data. Overlap of the selected gene sets (nonzero $β_{p}$ coefficients) among the four methods: the blue barplots report the frequencies of intersections among the methods, while the bottom green lines report which methods are considered for the overlap. (a) Scenario $1$ and (b) Scenario $2$ .

From inspection of Figure 5 one can see that the proposed method penMCFM(EM) achieves the second-largest C-statistic values, whereas MCM(GMIFS) attains the largest C-statistic. It is worth noting however that the latter method selects a larger number of genes for both parts of the model compared to other methods, which may contribute to its greater C-statistic values. Moreover, it is also possible to consider other metrics, such as the area under the receiver operating characteristic (ROC) curve and the integrated Brier score, for comparing the different methods, for both the incidence and latency parts of the model. To calculate the area under the ROC curve (AUC), each method is used to estimate $5$ -years survival for the subjects in the testing set, and compared to outcome on the corresponding time interval. The integrated Brier score (IBS) is a composite measure of discrimination and calibration, and lower values indicate better model calibration and predictive accuracy. The procedures outlined by Zhao et al.⁴⁸ were followed to compute both the AUC and IBS metrics. Our analysis leads to the conclusion that the penMCFM(EM) method, as depicted in Figures 5(a) and 5(b), exhibits superior AUC and IBS performance compared to the alternative methods for both the incidence and latency components of the model.

For a fair comparison of the methods performance, we carry out more investigations for the identified biomarker genes via gene set enrichment analyses, and by constructing a prognostic risk score, as described in Sections 5.2 and 5.3, respectively. Furthermore, we validate a subset of the top-identified genes derived from penMCFM(EM) when considering the intersection of results from both scenarios, by referring to the existing BRCA literature. This additional validation can be found in Table S3 of the Supplementary Material.

5.2. Functional enrichment analysis

We perform both GO—and the Kyoto encyclopedia of genes and genomes (KEGG)—Enrichment Analyses (EA) on the union of the selected genes obtained in both scenarios to investigate the pathological implications of these identified biomarker genes. To this aim, we utilize the R/Bioconductor package clusterProfiler^59,60 for all the analyses described in this section. First, we perform GO-EA on the genes identified from penMCFM(EM), penMCFM(GMIFS), MCM(GMIFS) and penCox.1se in Scenarios $1$ and $2$ , and then we conduct the KEGG pathway EA (KEGG-EA). We only present here the results obtained when studying the identified genes from the latency part of the model in penMCFM(EM), while the results obtained with other methods (namely penMCFM(GMIFS), MCM(GMIFS), and penCox.1se) are provided in the Supplementary Materials. Moreover, the detailed results of enrichment analyses for each method are given in https://github.com/fatihki/penMCFM as excel files.

In Scenario $1$ , only one biological process (BP) for the GO-EA is significant, the “xenobiotic metabolic process,” and the association between this BP and similar breast cancer evolution processes is examined by Lee et al.⁶¹ Another BP identified by GO-EA is “GABAergic neuron differentiation,” with a p-value of $0.057$ . For KEGG-EA, the only significantly enriched term is “metabolism of xenobiotics by cytochrome P450.” The association between this enzyme expression and cancer risk, progression, metastasis and prognosis has been widely reported in basic and clinical studies (see the review study of Luo et al.⁶² for more details). Also the role of CYP1A1 and CYP2A13 genes in breast and lung cancers has been investigated extensively (see Sneha et al.⁶³ and references therein for more details). The gene network plots related to the significant GO and KEGG pathway terms are presented in Figures 7(a) and 7(b). The EA results for other methods, namely penMCFM(GMIFS) and MCM(GMIFS), are respectively shown in Figures S6–S7 in the Supplementary Materials. While the number of enriched terms in both GO- and KEGG-EA is limited for all the methods, some of the identified terms exhibit relevance to breast cancer.

Figure 7.

Results of the analysis of the TCGA-BRCA data. Gene network plots related to the significant GO and KEGG pathway terms obtained from EA for Scenario $1$ : (a) GO-enriched terms and (b) KEGG-enriched terms.

In Scenario $2$ , we detect totally $28$ enriched BP terms via GO-EA, including some important BPs associated with breast cancer, such as “Notch signaling pathway”,⁶⁴ “canonical Wnt signaling pathway”,^65,66 and “BMP signaling pathway”.⁶⁷ The barplot for some of the enriched terms in GO-EA, and the gene network plot related to the top-significant GO terms, are presented in Figures 8(a) and 8(b) respectively. Concerning results of KEGG-EA in this scenario, Figure 9(a) shows that the “Wnt signaling” and “breast cancer pathways” are two of the most significant KEGG pathways, as confirmed by the growing number of studies in the literature demonstrating that Wnt signaling involves the proliferation, metastasis, immune microenvironment regulation, stemless maintenance, therapeutic resistance, and phenotype shaping of breast cancer (see Xu et al.⁶⁵ and Abreu de Oliveira et al.,⁶⁶ and references therein, for more details). The gene network plot related to the significant KEGG-EA pathway terms is presented in Figure 9(b). Moreover, the enriched KEGG pathway “hsa05224: Breast cancer” is reported in Figure S9 in the Supplementary Materials to more clearly show the associated biological meaning. We observe a greater number of enriched terms in the EA for all methods in Scenario $2$ compared to Scenario $1$ . Notably, the “breast cancer” pathway is identified as one of the most enriched terms in at least one part of the model structure for all methods.

Figure 8.

Results of the analysis of the TCGA-BRCA data. GO-EA for Scenario $2$ : (a) barplot of significantly enriched terms and (b) network plot of enriched GO terms and related selected genes.

Figure 9.

Results of the analysis of the TCGA-BRCA data. KEGG-EA for Scenario $2$ : (a) barplot of significantly enriched terms and (b) network plot of enriched KEGG pathway terms and related selected genes.

5.3. Prognostic risk score construction and validation of identified genes

We establish a prognostic risk score (PRS) to demonstrate the prognostic power of our model and methods. The PRS is constructed similarly to in Li and Liu,⁵⁰ by using the average estimates of the coefficients corresponding to the selected genes over the $20$ repeats. Precisely, the PRS is defined as

P R S = {\hat{θ}}_{1} x_{1} + {\hat{θ}}_{2} x_{2} + \dots + {\hat{θ}}_{P^{*}} x_{P^{*}}

(18)

where

P^{*}

is the total number of selected genes for the considered method,

{\hat{θ}}_{i}

is the average estimate of the coefficient corresponding to the

i

-th gene, and

x_{i}

is the expression value of the

i

-th gene in the data. We dichotomize the prognostic scores based on the median value and subsequently employ a log-rank test to compare the survival curves between the two groups of patients.

We use the validation dataset, which includes $189$ observations with their RNA-seq features and survival time, to validate the selected genes with the proposed PRS. We assign the subjects in the validation dataset to a high and low risk group using the median of the obtained PRS values. This procedure is applied when using the selected genes obtained with penMCFM(EM), penMCFM(GMIFS), MCM(GMIFS) and penCox.1se in Scenario $2$ . The KM curves of these subgroups are presented in Figure 10 for penMCFM(EM) and MCM(GMIFS): in the figure it can be observed that the log-rank test shows significant differences between the survival distributions corresponding to the high- and low-risk groups as obtained by the PRS calculations carried out on the outputs of both penMCFM(EM) and MCM(GMIFS) methods (with p-values $0.018$ and $0.035$ , respectively). Since the $p$ -values of the log-rank tests for penMCFM(GMIFS) and penCox.1se are not significant, these KM plots are only shown in Figure S18(a) and (b) in the Supplementary Materials.

Figure 10.

Results of the analysis of the TCGA-BRCA data. KM curves for the TCGA-BRCA patients in the validation dataset when dichotomized into two groups by the median PRS using the average results of $20$ method repeats in Scenario $2$ : (a) penMCFM(EM) and (b) MCM(GMIFS).

Moreover, the heatmaps of the expression values of the selected genes, based on the penMCFM(EM) and MCM(GMIFS) methods, in the validation dataset are also presented in Figure S19 in the Supplementary Materials. In these heatmaps, only the genes which are selected in at least $2$ out of $20$ re-run of the methods are shown, for visualization purposes. From inspection of these figures it can be observed that the penMCFM(EM) method gives much more visible clusters of genes showing similar expression behaviors than those based on the MCM(GMIFS) method.

6. Discussion and conclusion

In this study, we have introduced a novel method that adapts the mixture cure frailty model to account for high-dimensional covariates in both the incidence and the latency components of the model. The method extends the classical MCFM to higher dimensions using a multi-step adaptive elastic net penalty term inside the EM algorithm, therefore the name penMCFM. The proposed penMCFM allows performing variable selection in the MCFM model in high-dimensional settings where the number of covariates is significantly larger than the sample size. A comparative analysis with alternative survival models, including the penalized mixture cure model and penalized Cox regression, was also carried out. We also applied the proposed model along with alternative approaches to analyze RNAseq data from BRCA samples from TCGA. Finally, we conducted functional enrichment analyses of the identified biomarkers to show the disease mechanisms associated with BRCA, and then validated these biomarkers using an independent validation dataset to demonstrate their effectiveness and usefulness.

Moreover, we conducted a comprehensive investigation comparing our primary methodology, penMCFM(EM), with alternative methods using various metrics including C-statistic, AUC, and IBS, alongside functional enrichment analysis and the construction of prognostic risk scores, on various real data scenarios. Our analysis revealed that penMCFM(EM) and MCM(GMIFS) methods demonstrated comparable performances under certain circumstances, while penMCFM(EM) exhibited superior performance in others. Notably, despite similarities in performance observed in certain scenarios, penMCFM(EM) consistently selected fewer genes than MCM(GMIFS), its main competitor. Moreover, through rigorous simulation studies, penMCFM(EM) consistently outperformed alternative methods across different evaluation metrics. Consequently, we conclude that penMCFM(EM) represents a compelling alternative, offering advantages across diverse data generating scenarios. Additionally, our future research agenda entails applying this method to other cancer datasets, and conducting further comparative analyses against relevant methods.

A few practical issues regarding the proposed algorithm deserve some discussion. One notable computational challenge in our algorithm pertains to the efficient selection of the tuning parameters, specifically $λ_{1, E n e t}$ and $λ_{2, E n e t}$ . While in penMCFM it is assumed that these parameters are identical for both penalties, alternative methods (such as grid search, random search, or Latin hypercube sampling) could be employed to identify optimal different values for $λ_{1, E n e t}$ and $λ_{2, E n e t}$ , however at a much higher computational cost. In one experimental setting on simulated data, using Latin hypercube sampling with $100$ pairs for tuning $λ_{1, E n e t}$ and $λ_{2, E n e t}$ did not improve the results as compared to using the same value for the two parameters, leading us to omit this approach. Moreover, when selecting the optimal tuning parameters $λ_{E n e t}$ and $α_{E n e t}$ via cross-validation, we opted for optimizing the C-statistics, that is ${\hat{C}}_{C u r e}$ and $\hat{C}$ , instead of the commonly used Akaike information criterion (AIC) and Bayesian information criterion (BIC). While AIC and BIC were initially included in the cross-validation tuning process, it was observed that these criteria tended to select only a limited number of variables. Consequently, these results are not presented in the study.

The considered MCFM model has certain limitations. First, we employed a parametric baseline distribution for the latency component of the model, specifically choosing a Weibull distribution. While this choice may limit the flexibility of the model’s baseline structure, it provides several advantages, including mathematical tractability in certain formulations, a flexible hazard function (as demonstrated in recent similar but low-dimensional studies by Jiang and Basu⁶⁸ and Pal et al.⁶⁹), and ease of comparison with other studies in the literature, such as Fu et al.³⁵ Another limitation is the lack of completely independent validation datasets from sources other than TCGA, which are needed to assess the model’s applicability and robustness in practical applications. It would be particularly valuable to utilize completely independent validation datasets, such as the Norwegian breast cancer Oslo2 cohort data from Aure et al.,⁷⁰ as well as datasets from different breast cancer cohorts. Additionally, considering other cancer types separately and incorporating data from these types could provide further insights. Addressing these aspects remains an important focus for future work.

Some potential directions for future research include exploring a semi-parametric approach as an alternative to the current use of parametric distributions in the latency part. Additionally, investigating a Bayesian approach for variable selection could be a fruitful direction for further investigation. Finally, to the best of our knowledge, the literature on cure frailty models for recurrent events in high-dimensional settings has not flourished to the same extent as that based on survival data. This could be another potentially fruitful direction of future research.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802251327687 - Supplemental material for A Weibull mixture cure frailty model for high-dimensional covariates

Supplemental material, sj-pdf-1-smm-10.1177_09622802251327687 for A Weibull mixture cure frailty model for high-dimensional covariates by Fatih Kızılaslan, David Michael Swanson and Valeria Vitelli in Statistical Methods in Medical Research

Footnotes

Acknowledgments

Computer simulations were performed in HPC solutions provided by the Oslo Centre for Epidemiology and Biostatistics, at the University of Oslo.

The results published in the real data study are based upon data generated by the TCGA Research Network: .

The authors would also like to thank Dr. Zhi Zhao for fruitful discussions.

Authors contribution

FK, DMS, and VV jointly contributed to the method conception and development. FK was responsible for the implementation, for all data analyses, and for drafting the paper. VV contributed to the paper drafting and to the interpretation of all results. Both DMS and VV revised the paper for submission.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 801133.

ORCID iDs

David Michael Swanson

Valeria Vitelli

Supplemental material

Supplemental material for this article is available online.

Notes

References

Price

Manatunga

. Modelling survival data with a cured fraction using frailty models. Stat Med 2001; 20: 1515–1527.

Boag

. Maximum likelihood estimates of the proportion of patients cured by cancer therapy. J R Stat Soc Series B Stat Methodol 1949; 11: 15–53.

Berkson

Gage

. Survival curve for cancer patients following treatment. J Am Stat Assoc 1952; 47: 501–515.

Peng

Zhang

. Estimation method of the semiparametric mixture cure gamma frailty model. Stat Med 2008; 27: 5177–5194.

Taylor

. Estimation in a cox proportional hazards cure model. Biometrics 2000; 56: 227–236.

Peng

Dear

. A nonparametric mixture model for cure rate estimation. Biometrics 2000; 56: 237–243.

Dempster

Laird

Rubin

. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol 1977; 39: 1–22.

Peng

Zhang

. Identifiability of a mixture cure frailty model. Stat Probab Lett 2008; 78: 2604–2608.

Cai

Zou

Peng

, et al. SMCURE: an R-package for estimating semiparametric mixture cure models. Comput Methods Programs Biomed 2012; 108: 1255–1260.

10.

Cai

Zou

Peng

, et al. SMCURE: an R-package for estimating semiparametric ph and aft mixture cure models. https://cranr-projectorg/web/packages/smcure/smcurepdf 2022; 2.1.

11.

Rondeau

Schaffner

Corbiere

, et al. Cure frailty models for survival data: application to recurrences for breast cancer and to hospital readmissions for colorectal cancer. Stat Methods Med Res 2013; 22: 243–260.

12.

Tibshirani

. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 1996; 58: 267–288.

13.

Zou

Hastie

. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol 2005; 67: 301–320.

14.

Zou

Zhang

. On the adaptive elastic-net with a diverging number of parameters. Ann Stat 2009; 37: 1733–1751.

15.

Friedman

Hastie

Tibshirani

. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010; 33: 1–22.

16.

Friedman

Hastie

Tibshirani

, et al. GLMNET: lasso and elastic-net regularized generalized linear models. https://cranr-projectorg/web/packages/glmnet/indexhtml 2023; 4.1–7.

17.

Simon

Friedman

Hastie

, et al. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 2011; 39: 1–13.

18.

Tay

Narasimhan

Hastie

. Elastic net regularization paths for all generalized linear models. J Stat Softw 2023; 106: 1–31.

19.

Bühlmann

Meier

. Discussion: one-step sparse estimates in nonconcave penalized likelihood models. Ann Stat 2008; 36: 1534–1541.

20.

Xiao

. Multi-step adaptive elastic-net: reducing false positives in high-dimensional variable selection. J Stat Comput Simul 2015; 85: 3755–3765.

21.

Xiao

. MSAENET: multi-step adaptive estimation methods for sparse regressions. https://cranr-projectorg/web/packages/msaenet/msaenetpdf 2022; 3.1.

22.

Bühlmann

Van De Geer

. Statistics for high-dimensional data: methods, theory and applications. Heidelberg, Berlin: Springer, 2011.

23.

Hastie

Tibshirani

Wainwright

. Statistical learning with sparsity: the lasso and generalizations. New York: CRC press, 2015.

24.

James

Witten

Hastie

, et al. An introduction to statistical learning. New York, NY: Springer, 2021.

25.

Liu

Peng

, et al. Variable selection in semiparametric cure models based on penalized likelihood, with application to breast cancer clinical trials. Stat Med 2012; 31: 2882–2891.

26.

Fan

Liu

Fang

, et al. Promoting structural effects of covariates in the cure rate model with penalization. Stat Methods Med Res 2017; 26: 2078–2092.

27.

Masud

. Variable selection for mixture and promotion time cure rate models. Stat Methods Med Res 2018; 27: 2185–2199.

28.

Beretta

Heuchenne

. Variable selection in proportional hazards cure model with time-varying covariates, application to us bank failures. J Appl Stat 2019; 46: 1529–1549.

29.

Beretta

Heuchenne

. Variable selection in ph cure model with time-varying covariates. https://cranr-projectorg/web/packages/penPHcure/penPHcurepdf 2022; 1.0.2.

30.

Sun

Wang

, et al. Variable selection in semiparametric nonmixture cure model with interval-censored failure time data: an application to the prostate cancer screening study. Stat Med 2019; 38: 3026–3039.

31.

Bussy

Guilloux

Gaïffas

, et al. C-mix: a high-dimensional mixture model for censored durations, with applications to genetic data. Stat Methods Med Res 2019; 28: 1523–1539.

32.

Shi

Huang

. Promoting sign consistency in the cure model estimation and selection. Stat Methods Med Res 2020; 29: 15–28.

33.

Xie

. Mixture cure rate models with neural network estimated nonparametric components. Comput Stat 2021; 36: 2467–2489.

34.

Zhao

, et al. Variable selection for generalized odds rate mixture cure models with interval-censored failure time data. Comput Stat Data Anal 2021; 156: 1–17.

35.

Nicolet

Mrózek

, et al. Controlled variable selection in Weibull mixture cure models for high-dimensional data. Stat Med 2022; 41: 4340–4366.

36.

Hastie

Taylor

Tibshirani

, et al. Forward stagewise regression and the monotone lasso. Electron J Sta 2007; 1: 1–29.

37.

Farewell

. The use of mixture models for the analysis of survival data with long-term survivors. Biometrics 1982; 38: 1041–1046.

38.

Klein

Moeschberger

. Survival analysis: techniques for censored and truncated data. New York: Springer, 2003.

39.

Coppola

Stewart

. LBFGS: efficient l-bfgs and owl-qn optimization in R. https://cranr-projectorg/web/packages/lbfgs/lbfgspdf 2022; 1.2.1.2.

40.

Makowski

Archer

. Generalized monotone incremental forward stagewise method for modeling count data: application predicting micronuclei frequency. Cancer Inform 2015; 14: 97–105.

41.

Hou

Archer

. Regularization method for predicting an ordinal response using longitudinal high-dimensional genomic data. Stat Appl Genet Mol Biol 2015; 14: 93–111.

42.

. Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach. Stat Anal Data Min 2018; 11: 188–197.

43.

Asano

Hirakawa

Hamada

. Assessing the prediction accuracy of cure in the cox proportional hazards cure model: an application to breast cancer data. Pharm Stat 2014; 13: 357–363.

44.

Asano

Hirakawa

. Assessing the prediction accuracy of a cure model for censored survival data with long-term survivors: application to breast cancer data. J Biopharm Stat 2017; 27: 918–932.

45.

Pal

Peng

Aselisewine

. A new approach to modeling the cure rate in the presence of interval censored data. Comput Stat 2024; 39: 2743–2769.

46.

Love

Huber

Anders

. Moderated estimation of fold change and dispersion for RNA-seq data with deseq2. Genome Biol 2014; 15: 1–21.

47.

Zhao

Konaté

, et al. Tpm, fpkm, or normalized counts? A comparative study of quantification measures for the analysis of RNA-seq data from the nci patient-derived models repository. J Transl Med 2021; 19: 1–15.

48.

Zhao

Zobolas

Zucknick

, et al. Tutorial on survival modeling with applications to omics data. Bioinform 2024; 40: 1–15.

49.

John

Watson

Russ

, et al. M3c: Monte Carlo reference-based consensus clustering. Sci Rep 2020; 10: 1–14.

50.

Liu

. Detecting prognostic biomarkers of breast cancer by regularized cox proportional hazards models. J Transl Med 2021; 19: 1–20.

51.

Liu

. Biomarker discovery from high-throughput data by connected network-constrained support vector machine. Expert Syst Appl 2023; 226: 1–12.

52.

Kanehisa

Furumichi

Sato

, et al. Kegg: integrating viruses and cellular organisms. Nucleic Acids Res 2021; 49: D545–D551.

53.

Consortium

TGO

. The gene ontology resource: enriching a gold mine. Nucleic Acids Res 2021; 49: D325–D334.

54.

Cardoso

van’t Veer

Bogaerts

, et al. 70-Gene signature as an aid to treatment decisions in early-stage breast cancer. N Engl J Med 2016; 375: 717–729.

55.

Yan

Wang

Sun

, et al. OSBRCA: a web server for breast cancer prognostic biomarker investigation with massive data from tens of cohorts. Front Oncol 2019; 9: 1–8.

56.

Liu

Goodall

, et al. A novel single-cell based method for breast cancer prognosis. PLoS Comput Biol 2020; 16: 1–20.

57.

De Bin

Sauerbrei

Boulesteix

. Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 2014; 33: 5310–5329.

58.

Volkmann

De Bin

Sauerbrei

, et al. A plea for taking all available clinical information into account when assessing the predictive value of omics data. BMC Med Res Methodol 2019; 19: 1–15.

59.

Wang

Han

, et al. Clusterprofiler: an R package for comparing biological themes among gene clusters. Omics: A J Int Biol 2012; 16: 284–287.

60.

, et al. Clusterprofiler 4.0: a universal enrichment tool for interpreting omics data. The Innovation 2021; 2: 1–11.

61.

Lee

Schuetz

Lai

, et al. Interactions between exposure to polycyclic aromatic hydrocarbons and xenobiotic metabolism genes, and risk of breast cancer. Breast Cancer (Auckl) 2022; 29: 38–49.

62.

Luo

Yan

, et al. Cytochrome p450: implications for human breast cancer (review). Oncol Lett 2021; 22: 1–9.

63.

Sneha

Baker

Green

, et al. Intratumoural cytochrome p450 expression in breast cancer: impact on standard of care treatment and new efforts to develop tumour-selective therapies. Biomedicines 2021; 9: 1–22.

64.

Yousefi

Bahramy

Zafari

, et al. Notch signaling pathway: a comprehensive prognostic and gene expression profile analysis in breast cancer. BMC Cancer 2022; 22: 1282.

65.

Zhang

, et al. Wnt signaling in breast cancer: biological mechanisms, challenges and opportunities. Mol Cancer 2020; 19: 1–35.

66.

Abreu de Oliveira

El Laithy

Bruna

, et al. Wnt signaling in the breast: from development to disease. Front Cell Dev Biol 2022; 10: 1–19.

67.

Ehata

Miyazono

. Bone morphogenetic protein signaling in cancer; some topics in the recent 10 years. Front Cell Dev Biol 2022; 10: 883523.

68.

Jiang

Basu

. Cure models with adaptive activation for modeling cancer survival. Stat Methods Med Res 2024; 33: 227–242.

69.

Pal

Peng

Aselisewine

. A new approach to modeling the cure rate in the presence of interval censored data. Comput Stat 2024; 39: 2743–2769.

70.

Aure

Vitelli

Jernström

, et al. Integrative clustering reveals a novel split in the luminal a subtype of breast cancer with impact on outcome. Breast Cancer Res 2017; 19: 19–44.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

2.75 MB