
Abstract

Standard survival models such as the proportional hazards model contain a single regression component, corresponding to the scale of the hazard. In contrast, we consider the so-called “multi-parameter regression” approach whereby covariates enter the model through multiple distributional parameters simultaneously, for example, scale and shape parameters. This approach has previously been shown to achieve flexibility with relatively low model complexity. However, beyond a stepwise type selection method, variable selection methods are underdeveloped in the multi-parameter regression survival modeling setting. Therefore, we propose penalized multi-parameter regression estimation procedures using the following penalties: least absolute shrinkage and selection operator, smoothly clipped absolute deviation, and adaptive least absolute shrinkage and selection operator. We compare these procedures using extensive simulation studies and an application to data from an observational lung cancer study; the Weibull multi-parameter regression model is used throughout as a running example.

Keywords

Variable selection, multi-parameter regression, Weibull, penalized maximum likelihood, differential evolution algorithm

1. Introduction

The most popular regression model for censored survival data is Cox’s proportional hazards (PH) model,¹ with a hazard function given by $h (t ∣ τ_{i}, γ) = τ_{i} h_{0} (t ∣ γ)$, where $τ_{i}$ is a scale parameter that varies per individual $i = 1, \dots, n$, and $γ$ is a shape parameter common to all individuals, in which case $h_{0}$ is referred to as a baseline hazard function. Its appeal is due to the fact that covariate effects can be estimated in the form of relative risks, without specializing to any particular underlying hazard shape, that is, $h_{0}$ is an unspecified function whose shape is a nuisance parameter. Such is its popularity that the model is often used without critical assessment of the fundamental PH assumption.² One of the most common approaches for dealing with a covariate that does not follow this assumption is through stratification of the hazard function, whereby a different baseline hazard is assumed for different levels of the covariate in question. However, this approach has several issues: it is limited to categorical variables, so that application to continuous variables requires ad-hoc binning; stratification with respect to multiple variables and/or variables with many levels leads to efficiency losses due to the small number of individuals in the sub-groups; and the effects of stratified variables (via relative risks) are not estimated as they are absorbed into the “nuisance” hazard shape (see Therneau and Grambsch³ and references therein).

Rather than treating the hazard shape as a nuisance, we make use of a fully parametric hazard function characterized by multiple distributional parameters. In the two-parameter case that we focus on in this article, the hazard has the form $h (t ∣ τ_{i}, γ_{i}) = τ_{i} h_{0} (t; γ_{i})$ , where $τ_{i}$ is the scale parameter for the hazard (as in the Cox model) and $γ_{i}$ is its shape, which now also varies per individual (hence, $h_{0}$ is no longer a baseline hazard function). Importantly, both the scale and the shape parameters depend on individual covariates (via the subscript $i$ ), an activity that we refer to as multi-parameter regression (MPR) modeling⁴; the approach may also be referred to as distributional regression⁵ (see also Rigby and Stasinopoulos⁶ and Stasinopoulos et al.⁷). The multi-parameter regression nomenclature distinguishes our approach from classical models where there is just one regression component such as the hazard scale parameter in a Cox model, or the location parameter in a generalized linear model.⁸

As a motivating example, Figure 1 displays data from a lung cancer study used in Burke and MacKenzie,⁴ where it is clear that the MPR model has the flexibility to adapt to the different distributional shapes evident across the treatment groups (see Section 5 for a multi-factor analysis of this dataset). Indeed, Burke and MacKenzie⁴ explored the general use of MPR models in the survival context, demonstrating the usefulness of jointly modeling the scale and shape of the Weibull distribution. Earlier examples of MPR models in survival analysis include a location-dispersion extension of the Weibull accelerated failure time (AFT) model,⁹ and first-hitting-time models with covariate-dependent drift and initial-state parameters.^10,11 More recently, MPR survival models have been developed further through the incorporation of frailty effects in interval-censored data,¹² the use of the adapted power generalized Weibull model,¹³ which is more general than the Weibull and has also been extended to handle bivariate data,¹⁴ and a semi-parametric extension of the AFT model.¹⁵

Figure 1.

Kaplan–Meier curves (solid) for different treatment groups with model-based curves overlaid (dashed) for the Cox PH model (left) and Weibull MPR model (right). PH: proportional hazards; MPR: multi-parameter regression.

Regardless of the modeling approach taken (MPR or otherwise), a commonly encountered challenge in statistical applications is the selection of a subset of explanatory variables of interest,^16,17 that is, the elimination of unimportant variables to yield simpler, more explainable models. However, the literature in this area is somewhat lacking for MPR survival models beyond a stepwise procedure developed by Burke and MacKenzie.⁴ Due to the inherent discreteness of stepwise procedures (i.e., covariates are either “in” or “out”), they may be unstable in terms of the selected model.¹⁸ Furthermore, they can be computationally demanding since, with $p$ covariates, there are $2^{p}$ submodels—and this issue is more acute in the MPR setting where there are multiple regression components, for example, in the scale-shape model, there are $2^{2 p}$ submodels. On the other hand, more modern approaches based on penalization carry out estimation and (continuous) model selection simultaneously, and are capable of handling a larger number of covariates in a more stable and efficient manner, for example, the least absolute shrinkage and selection operator (LASSO),¹⁹ the smoothly clipped absolute deviation (SCAD),²⁰ and the adaptive LASSO (ALASSO).²¹

To the best of our knowledge, the use of the aforementioned penalized procedures for MPR survival models is lacking. Given that classical statistical models contain only one regression component, it is not unexpected that the penalized estimation literature is focused on this setting.²² An exception is the work of Groll et al.²³ who develop LASSO-type penalization for generalized additive models for location, scale and shape (GAMLSS), albeit not in a survival context. Therefore, the aim of this article is to develop such procedures for MPR survival models. More specifically, we use the Weibull MPR model as an example, develop gradient-based estimation procedures for LASSO, SCAD, and ALASSO by using a smooth approximation to the absolute value function,^24–26 and investigate the need for a separate tuning parameter for each regression component. Tuning parameter selection is carried out using a Bayesian information criterion (BIC) function, where, due to the computational intensity of grid search when there are multiple regression components (as noted in the GAMLSS context²³), we make use of a differential evolution “global” optimization procedure^27,28 to explore the tuning parameter space.

The remainder of this article is organized as follows. In Section 2, we present the Weibull MPR model, the penalty functions, and the penalized likelihood estimation procedure. Section 3 describes the model estimation and inference procedure along with the algorithm for selecting tuning parameters. Simulation studies are provided in Section 4 where we evaluate the performance of the proposed methods, and these methods are then applied to data from an observational study of patients with lung cancer in Section 5. We conclude with some final remarks in Section 6.

2. Model formulation

2.1. Weibull MPR model

Although the variable selection methods we consider in this article can be applied to any parametric MPR model, it is helpful to focus on a specific example. We, therefore, consider the Weibull MPR model since the Weibull distribution is one of the most popular parametric survival distributions. In this case, the hazard function for survival time ${\tilde{T}}_{i}$ corresponding to the $i$ th individual is given by

$$h (t ∣ τ_{i}, γ_{i}) = τ_{i} γ_{i} t^{γ_{i} - 1}$$

for $i = 1, 2, \dots, n$, where $τ_{i} > 0$ and $γ_{i} > 0$ are the covariate-dependent scale and shape parameters, respectively. This is an MPR model by virtue of both distributional parameters depending on covariates, and we specify these regression components as follows:

$$\log (τ_{i}) = x_{i}^{T} β, \qquad \log (γ_{i}) = z_{i}^{T} α$$

where $x_{i} = (1, x_{i 1}, \dots, x_{i p})^{T}$ and $z_{i} = (1, z_{i 1}, \dots, z_{i q})^{T}$ are scale and shape covariate vectors which may or may not have covariates in common, $β = (β_{0}, β_{1}, \dots, β_{p})^{T}$ and $α = (α_{0}, α_{1}, \dots, α_{q})^{T}$ are the corresponding regression coefficients, and the log link is used to ensure positivity of the parameters.

Without loss of generality, we consider the ratio of hazards under this model for individuals $i$ and $i^{'}$ who we assume have identical covariate profiles apart from the first covariate in the scale and shape vectors, where $x_{i 1} = z_{i 1} = c + 1$ and $x_{i^{'} 1} = z_{i^{'} 1} = c$, that is, the first covariate differs by one unit for these individuals. Given this setup, $τ_{i} = \exp (β_{1}) τ_{i^{'}}$ and $γ_{i} = \exp (α_{1}) γ_{i^{'}}$, and the hazard ratio is

$$\frac{h (t ∣ τ_{i}, γ_{i})}{h (t ∣ τ_{i^{'}}, γ_{i^{'}})} = \exp (β_{1} + α_{1}) \, t^{\exp (z_{i^{'}}^{T} α) \{\exp (α_{1}) - 1\}},$$

where $z_{i^{'}}^{T} = (1, c, z_{i 2}, \dots, z_{i q})$, that is, the vector $z_{i}$ with the element $z_{i 1}$ fixed at the value $c$. The dependence on $z_{i^{'}}$ is typically dealt with through the use of a representative covariate profile such as the empirical modal or mean values.⁴ Importantly, when $α_{1} = 0$, the hazard ratio reduces to $\exp (β_{1})$, which is the typical constant hazard ratio from a PH (Cox) model.
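The hazard ratio expression can be checked numerically. The following is a minimal sketch, assuming hypothetical coefficient values and a hypothetical reference covariate profile (none of these numbers come from the article); it compares the closed-form ratio with a direct ratio of two Weibull MPR hazards.

```python
import math

def weibull_mpr_hazard(t, x, z, beta, alpha):
    """Hazard tau * gamma * t**(gamma - 1) with log-linear scale and shape."""
    tau = math.exp(sum(b * v for b, v in zip(beta, x)))
    gamma = math.exp(sum(a * v for a, v in zip(alpha, z)))
    return tau * gamma * t ** (gamma - 1.0)

def hazard_ratio(t, z_ref, beta1, alpha1, alpha):
    """Closed-form hazard ratio for a one-unit increase in the first covariate.

    z_ref is the reference shape covariate vector (intercept first), i.e. the
    profile of the individual whose first covariate equals c.
    """
    gamma_ref = math.exp(sum(a * v for a, v in zip(alpha, z_ref)))
    return math.exp(beta1 + alpha1) * t ** (gamma_ref * (math.exp(alpha1) - 1.0))

# Hypothetical coefficients (intercept first) and covariate profiles with c = 0.
beta = [-1.5, -1.0, 0.5]
alpha = [0.5, 0.4, -0.2]
ref = [1.0, 0.0, 0.3]        # first covariate at c
plus_one = [1.0, 1.0, 0.3]   # first covariate at c + 1

t = 2.0
direct = (weibull_mpr_hazard(t, plus_one, plus_one, beta, alpha)
          / weibull_mpr_hazard(t, ref, ref, beta, alpha))
closed = hazard_ratio(t, ref, beta[1], alpha[1], alpha)

# With alpha_1 = 0 the ratio no longer depends on t (the PH special case).
ph_ratio = hazard_ratio(t, ref, beta[1], 0.0, [0.5, 0.0, -0.2])
```

Here `direct` and `closed` agree up to rounding, and `ph_ratio` equals `math.exp(beta[1])` regardless of the chosen `t`.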

Parameter estimation within the unpenalized MPR model can be carried out in a standard fashion using maximum likelihood. First, let $T_{i} = \min ({\tilde{T}}_{i}, C_{i})$ be the observed survival time for the $i$th individual, where $C_{i}$ is the censoring time. Then the associated log-likelihood function is given by

$$ℓ_{0} (θ) = \sum_{i = 1}^{n} \left[ δ_{i} \{\log τ_{i} + \log γ_{i} + (γ_{i} - 1) \log t_{i}\} - τ_{i} t_{i}^{γ_{i}} \right] \tag{1}$$

where $θ = (β^{T}, α^{T})^{T}$ is the full parameter vector, $t_{i}$ is the realization of $T_{i}$, and $δ_{i}$ is the censoring indicator which takes the value 0 for censored survival times and 1 for uncensored survival times. Beyond the Weibull case we consider here, the log-likelihood function is $\sum_{i = 1}^{n} \{δ_{i} \log h (t_{i} | x_{i}, z_{i}) - H (t_{i} | x_{i}, z_{i})\}$, where $H (t | x_{i}, z_{i}) = \int_{0}^{t} h (u | x_{i}, z_{i}) \, d u$ is the cumulative hazard function.
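As an illustration of the log-likelihood in (1), the following is a minimal sketch using only the Python standard library. The function name, the data layout (lists of covariate rows with the intercept first), and the toy data are assumptions made for this example rather than anything specified in the article.

```python
import math

def weibull_mpr_loglik(beta, alpha, X, Z, t, delta):
    """Unpenalized Weibull MPR log-likelihood, following the form of (1).

    X, Z  : lists of covariate rows (intercept first) for scale and shape
    t     : observed times; delta : 1 for an event, 0 for a censored time
    """
    total = 0.0
    for x_i, z_i, t_i, d_i in zip(X, Z, t, delta):
        log_tau = sum(b * v for b, v in zip(beta, x_i))
        log_gamma = sum(a * v for a, v in zip(alpha, z_i))
        gamma = math.exp(log_gamma)
        log_hazard = log_tau + log_gamma + (gamma - 1.0) * math.log(t_i)
        cum_hazard = math.exp(log_tau) * t_i ** gamma
        total += d_i * log_hazard - cum_hazard
    return total

# Toy data (hypothetical): one covariate shared by the scale and shape parts.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 1.1]]
Z = X
t = [0.8, 2.1, 1.4, 0.3]
delta = [1, 0, 1, 1]

ll_null = weibull_mpr_loglik([0.0, 0.0], [0.0, 0.0], X, Z, t, delta)
```

With all coefficients equal to zero we have $τ_{i} = γ_{i} = 1$, so the log-hazard is zero and the log-likelihood reduces to $- \sum_{i} t_{i}$, which gives a quick sanity check on the implementation.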

2.2. Penalized likelihood

Penalized MPR estimation can be developed on the basis of maximizing a penalized log-likelihood given by

$$ℓ (θ) = ℓ_{0} (θ) - n \sum_{j = 0}^{p} J_{λ_{β_{j}}} (| β_{j} |) - n \sum_{j = 0}^{q} J_{λ_{α_{j}}} (| α_{j} |) \tag{2}$$

where $ℓ_{0} (θ)$ is the unpenalized log-likelihood, $λ = (λ_{β_{0}}, λ_{β_{1}}, \dots, λ_{β_{p}}, λ_{α_{0}}, λ_{α_{1}}, \dots, λ_{α_{q}})$ is a vector of coefficient-specific tuning parameters, and $J_{λ_{β_{j}}} (\cdot)$ and $J_{λ_{α_{j}}} (\cdot)$ are scale and shape penalty functions which we assume have the same functional form (but differ with respect to the tuning parameter). As is standard practice, we assume that the intercepts are not penalized, and, therefore, define $λ_{β_{0}} \equiv λ_{α_{0}} \equiv 0$. We also assume that covariates are standardized so that penalization is independent of the particular units of measurement; in all of our numerical work, we carry out this standardization internally and present regression coefficients on the original scale.

Although we have defined $λ$ quite generally, we will in fact impose constraints on this vector (beyond fixing $λ_{β_{0}} \equiv λ_{α_{0}} \equiv 0$) by considering the following possibilities (for $j \neq 0$):

(i) single penalty, $λ_{β_{j}} = λ_{α_{j}} = λ$;

(ii) single adaptive penalty, $λ_{β_{j}} = λ w_{β_{j}}$, $λ_{α_{j}} = λ w_{α_{j}}$, where $w_{β_{j}}$ and $w_{α_{j}}$ are predefined weights;

(iii) separate non-adaptive penalties, $λ_{β_{j}} = λ_{β}$, $λ_{α_{j}} = λ_{α}$;

(iv) separate adaptive penalties, $λ_{β_{j}} = λ_{β} w_{β_{j}}$, $λ_{α_{j}} = λ_{α} w_{α_{j}}$.

Options (i) and (ii) are standard approaches where a single penalty, $λ$, applies to the whole vector of parameters. This is reasonable in a standard setting where there is only a $β$ vector. However, in this particular MPR setting, we have two separate distributional parameters, which exist on different scales. For this reason, we investigate methods (iii) and (iv) which apply different penalties to the two regression vectors via $λ_{β}$ and $λ_{α}$.

For the purpose of this article, we consider the most commonly used penalties, namely the LASSO,¹⁹

$$J_{λ_{θ_{j}}} (| θ_{j} |) = λ_{θ_{j}} | θ_{j} |$$

which, although popular, is known to select too many variables²⁹; the non-convex SCAD,²⁰

$$J_{λ_{θ_{j}}} (| θ_{j} |) = \begin{cases} λ_{θ_{j}} | θ_{j} | & \text{if } | θ_{j} | \leq λ_{θ_{j}} \\ \dfrac{2 a λ_{θ_{j}} | θ_{j} | - θ_{j}^{2} - λ_{θ_{j}}^{2}}{2 (a - 1)} & \text{if } λ_{θ_{j}} < | θ_{j} | < a λ_{θ_{j}} \\ \dfrac{λ_{θ_{j}}^{2} (a + 1)}{2} & \text{if } | θ_{j} | \geq a λ_{θ_{j}} \end{cases}$$

where $a = 3.7$; and the ALASSO,²¹

$$J_{λ_{θ_{j}}} (| θ_{j} |) = λ_{θ_{j}} w_{θ_{j}} | θ_{j} |$$

where typically, $w_{θ_{j}} = 1 / | {\hat{θ}}_{0, j} |$ and ${\hat{θ}}_{0, j}$ is an unpenalized estimate of $θ_{j}$. These so-called adaptive weights are used to apply different penalties to different regression coefficients such that a larger amount of shrinkage is applied to the unimportant variables. Here, we use $θ_{j}$ to denote a generic regression coefficient, and $λ_{θ_{j}}$ is the corresponding tuning parameter. Note that the LASSO and SCAD are non-adaptive, and, therefore, relate to options (i) and (iii) above, whereas the ALASSO relates to options (ii) and (iv). SCAD and the ALASSO are known to possess the oracle property, that is, the procedure asymptotically identifies the right subset model and estimates the coefficients and covariance matrix as though the true model was known in advance.²⁰ Fan and Li²⁰ found the choice of $a = 3.7$ to give very good practical performance for various variable selection problems, and as a result this value has been widely used throughout the literature.^30–33
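The three penalties can be written as short functions, which makes their behaviour easy to compare. This is a minimal sketch: the function names and the example values of the tuning parameter are assumptions for illustration, and the ALASSO weight is taken to be the reciprocal of the absolute unpenalized estimate, as described in the text.

```python
def lasso(theta, lam):
    """LASSO penalty: grows linearly in |theta| without bound."""
    return lam * abs(theta)

def scad(theta, lam, a=3.7):
    """SCAD penalty: linear near zero, quadratic blend, then constant."""
    x = abs(theta)
    if x <= lam:
        return lam * x
    if x < a * lam:
        return (2.0 * a * lam * x - x * x - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0

def alasso(theta, lam, theta_unpenalized):
    """Adaptive LASSO penalty with weight 1 / |unpenalized estimate|."""
    return lam * abs(theta) / abs(theta_unpenalized)

lam = 0.5  # hypothetical tuning parameter value

# SCAD is continuous at both breakpoints, |theta| = lam and |theta| = a * lam.
at_first_break = (scad(lam, lam), lam * lam)
at_second_break = (scad(3.7 * lam - 1e-9, lam), scad(3.7 * lam, lam))

# Beyond a * lam the SCAD penalty is flat, whereas the LASSO keeps growing.
flat = (scad(10.0, lam), scad(100.0, lam))
growing = (lasso(10.0, lam), lasso(100.0, lam))

# The ALASSO penalizes a coefficient more heavily when its unpenalized
# estimate is small (0.05) than when it is large (2.0).
heavy, light = alasso(0.1, lam, 0.05), alasso(0.1, lam, 2.0)
```

The flat tail of SCAD and the data-dependent weights of the ALASSO are the features that reduce the shrinkage applied to large coefficients, in contrast to the LASSO.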

3. Penalized estimation procedure

3.1. Model fitting

We define

$$\hat{θ} = \underset{θ}{\arg\max} \; ℓ (θ) \tag{3}$$

where $ℓ (θ)$ is given by (2). The corresponding score functions are given by

$$\begin{aligned} \frac{\partial ℓ}{\partial β} & = \frac{\partial ℓ_{0}}{\partial β} - n V_{β} = X^{T} U_{β} - n V_{β} \\ \frac{\partial ℓ}{\partial α} & = \frac{\partial ℓ_{0}}{\partial α} - n V_{α} = Z^{T} U_{α} - n V_{α} \end{aligned} \tag{4}$$

where $X$ is an $n \times (p + 1)$ matrix whose $i$th row is $x_{i}$; $Z$ is an $n \times (q + 1)$ matrix whose $i$th row is $z_{i}$; $U_{β}$ and $U_{α}$ are vectors of length $n$ such that $U_{β i} = δ_{i} - τ_{i} t_{i}^{γ_{i}}$ and $U_{α i} = δ_{i} (1 + γ_{i} \log t_{i}) - τ_{i} γ_{i} t_{i}^{γ_{i}} \log t_{i}$; $V_{β}$ and $V_{α}$ are vectors of lengths $p + 1$ and $q + 1$, respectively, such that, for $j \geq 0$,

$$V_{β, j + 1} = d J_{λ_{β_{j}}} (| β_{j} |) / d β_{j} = J_{λ_{β_{j}}}^{'} (| β_{j} |) \, d | β_{j} | / d β_{j}$$

and

$$V_{α, j + 1} = d J_{λ_{α_{j}}} (| α_{j} |) / d α_{j} = J_{λ_{α_{j}}}^{'} (| α_{j} |) \, d | α_{j} | / d α_{j}$$
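The unpenalized parts of the score in (4), $X^{T} U_{β}$ and $Z^{T} U_{α}$, can be verified against a numerical gradient of the log-likelihood in (1). The sketch below does this on a small hypothetical dataset; the helper names and the central-difference step size are assumptions chosen for illustration.

```python
import math

def loglik(beta, alpha, X, Z, t, delta):
    """Unpenalized Weibull MPR log-likelihood."""
    total = 0.0
    for x, z, ti, di in zip(X, Z, t, delta):
        log_tau = sum(b * v for b, v in zip(beta, x))
        log_gam = sum(a * v for a, v in zip(alpha, z))
        gam = math.exp(log_gam)
        total += (di * (log_tau + log_gam + (gam - 1.0) * math.log(ti))
                  - math.exp(log_tau) * ti ** gam)
    return total

def score(beta, alpha, X, Z, t, delta):
    """Analytic score vectors X^T U_beta and Z^T U_alpha."""
    s_beta = [0.0] * len(beta)
    s_alpha = [0.0] * len(alpha)
    for x, z, ti, di in zip(X, Z, t, delta):
        tau = math.exp(sum(b * v for b, v in zip(beta, x)))
        gam = math.exp(sum(a * v for a, v in zip(alpha, z)))
        cum = tau * ti ** gam                                  # tau_i * t_i^gamma_i
        u_b = di - cum                                         # U_beta,i
        u_a = di * (1.0 + gam * math.log(ti)) - cum * gam * math.log(ti)  # U_alpha,i
        for j, v in enumerate(x):
            s_beta[j] += v * u_b
        for j, v in enumerate(z):
            s_alpha[j] += v * u_a
    return s_beta, s_alpha

def numeric_grad(f, theta, h=1e-6):
    """Central-difference gradient of f at theta."""
    grad = []
    for j in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[j] += h
        dn[j] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad

# Hypothetical toy data with one covariate in both regression components.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 0.3], [1.0, 1.1], [1.0, -0.4]]
Z = X
t = [0.8, 2.1, 1.4, 0.3, 3.0]
delta = [1, 0, 1, 1, 0]
beta, alpha = [-0.5, 0.3], [0.2, -0.1]

s_beta, s_alpha = score(beta, alpha, X, Z, t, delta)
numeric = numeric_grad(lambda th: loglik(th[:2], th[2:], X, Z, t, delta),
                       beta + alpha)
```

The entries of `s_beta + s_alpha` and `numeric` should agree to several decimal places; a mismatch would indicate an error in either the likelihood or the score code.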

Note, however, that the presence of the absolute value function renders the penalty functions non-differentiable at zero. Various algorithms have been developed to overcome this issue including quadratic programming,¹⁹ least-angle regression (LARS),³⁴ co-ordinate descent,³⁵ and the local quadratic approximation.²⁰ In this article, we take a similar approach to that of Hunter and Li,²⁴ Oelker and Tutz,²⁵ and Lloyd-Jones et al.,²⁶ and use a smooth extension of the absolute value function given by

$$a (x) = \sqrt{x^{2} + ϵ^{2}} - ϵ$$

where $\lim_{ϵ \to 0} a (x) = | x |$. This yields a differentiable penalty so that standard gradient-based optimization algorithms can be applied straightforwardly and transparently. Thus, $a^{'} (x) = x / \sqrt{ϵ^{2} + x^{2}}$ (which is an approximation of the signum function) and $a^{″} (x) = ϵ^{2} / (ϵ^{2} + x^{2})^{3 / 2}$. Smaller values of $ϵ$ bring the approximate penalty closer to the original penalty, but also closer to the penalty being non-differentiable; we have found that fixing $ϵ = 10^{- 4}$ generally works well. As we use smooth $J (\cdot)$ functions, and $a (x)$ in place of $| x |$, (4) is then smooth in the parameters and can therefore be solved using the Newton-Raphson algorithm.
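The behaviour of the smooth approximation is easy to inspect numerically. The following minimal sketch (function names are illustrative assumptions) implements $a (x)$ and its first two derivatives with $ϵ = 10^{- 4}$, and compares the analytic first derivative with a central finite difference.

```python
import math

EPS = 1e-4  # the value of epsilon reported in the text

def smooth_abs(x, eps=EPS):
    """Smooth approximation a(x) to |x|."""
    return math.sqrt(x * x + eps * eps) - eps

def smooth_abs_d1(x, eps=EPS):
    """First derivative a'(x), a smooth version of sign(x)."""
    return x / math.sqrt(eps * eps + x * x)

def smooth_abs_d2(x, eps=EPS):
    """Second derivative a''(x), sharply peaked at x = 0."""
    return eps * eps / (eps * eps + x * x) ** 1.5

# The approximation error is at most eps, and a'(x) is close to sign(x)
# once |x| is large relative to eps.
error_at_half = abs(smooth_abs(0.5) - 0.5)
sign_like = (smooth_abs_d1(-0.3), smooth_abs_d1(0.0), smooth_abs_d1(0.3))

# Central finite difference of a(x) at x = 0.3 for comparison with a'(0.3).
h = 1e-6
fd_d1 = (smooth_abs(0.3 + h) - smooth_abs(0.3 - h)) / (2.0 * h)

# Curvature at zero equals 1 / eps, which is why very small eps values
# make the problem numerically harder.
curvature_at_zero = smooth_abs_d2(0.0)
```

The large curvature at zero (of order $1 / ϵ$) illustrates the trade-off noted above: decreasing $ϵ$ improves the approximation but moves the penalty closer to being non-differentiable.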

We denote by $I (θ)$ the negative of the matrix of second derivatives of $ℓ (θ)$, that is, $I (θ) = - \nabla_{θ} \nabla_{θ}^{T} ℓ (θ)$. Then,

$$I (θ) = I_{0} (θ) + \begin{pmatrix} n Σ_{β} & 0 \\ 0 & n Σ_{α} \end{pmatrix} = \begin{pmatrix} X^{T} W_{β} X + n Σ_{β} & X^{T} W_{α β} Z \\ Z^{T} W_{α β} X & Z^{T} W_{α} Z + n Σ_{α} \end{pmatrix}$$

where $I_{0} (θ) = - \nabla_{θ} \nabla_{θ}^{T} ℓ_{0} (θ)$ is the usual observed information matrix of the unpenalized likelihood; $Σ_{β}$ and $Σ_{α}$ appear due to the penalties, and are diagonal matrices of dimension $(p + 1) \times (p + 1)$ and $(q + 1) \times (q + 1)$, respectively, such that, for $j \geq 0$, $Σ_{β, j + 1, j + 1} = d^{2} J_{λ_{β_{j}}} (| β_{j} |) / d β_{j}^{2}$ and $Σ_{α, j + 1, j + 1} = d^{2} J_{λ_{α_{j}}} (| α_{j} |) / d α_{j}^{2}$; and $W_{β}$, $W_{α}$, and $W_{α β}$ are $n \times n$ diagonal matrices whose $i$th diagonal elements are given by $τ_{i} t_{i}^{γ_{i}}$, $\{τ_{i} t_{i}^{γ_{i}} (γ_{i} \log t_{i} + 1) - δ_{i}\} γ_{i} \log t_{i}$, and $τ_{i} γ_{i} t_{i}^{γ_{i}} \log t_{i}$, respectively. Thus, following Ha et al.,³⁶ the resulting system of Newton-Raphson equations, which are iteratively solved for $θ^{(m + 1)} = ({β^{(m + 1)}}^{T}, {α^{(m + 1)}}^{T})^{T}$, can be written compactly as

$$\begin{pmatrix} X^{T} W_{β}^{(m)} X + n Σ_{β}^{(m)} & X^{T} W_{α β}^{(m)} Z \\ Z^{T} W_{α β}^{(m)} X & Z^{T} W_{α}^{(m)} Z + n Σ_{α}^{(m)} \end{pmatrix} \begin{pmatrix} β^{(m + 1)} - β^{(m)} \\ α^{(m + 1)} - α^{(m)} \end{pmatrix} = \begin{pmatrix} X^{T} U_{β}^{(m)} - n V_{β}^{(m)} \\ Z^{T} U_{α}^{(m)} - n V_{α}^{(m)} \end{pmatrix} \tag{5}$$

where the various elements superscripted by $(m)$ depend on $θ^{(m)}$, but this dependence is suppressed for notational convenience; we use the unpenalized estimates as the initial values in this iterative procedure, that is, $θ^{(0)} = {\hat{θ}}_{0}$. Having obtained the penalized estimates, $\hat{θ}$, the covariance can be estimated using the sandwich formula^20,36–38

$$\widehat{\mathrm{cov}} (\hat{θ}) = \{I (\hat{θ})\}^{- 1} I_{0} (\hat{θ}) \{I (\hat{θ})\}^{- 1} \tag{6}$$

This formula is known to have good accuracy when the sample size is moderate,^20,37 and its performance in our MPR setting is investigated in Section 4 through simulation studies.

Figure 2.

The BIC function evaluated at different tuning parameter values for the Weibull MPR model with the one tuning parameter LASSO penalty for the lung cancer data analyzed in Section 5. The equivalent plot for the two tuning parameter LASSO penalties can be found in the Supplemental Material. BIC: Bayesian information criterion; MPR: multi-parameter regression; LASSO: least absolute shrinkage and selection operator.

3.2. Tuning parameter selection

The selection of the optimal tuning parameter(s) is typically done through the use of data-driven criteria such as generalized cross-validation (GCV), Akaike information criterion (AIC), or BIC. GCV and the AIC are known to be less efficient and selection inconsistent as model selection criteria.^39–41 Wang et al.⁴² provided a formal proof that the shrinkage or tuning parameter selected using GCV may not be able to identify the true model consistently for the SCAD estimator in linear models and partially linear models. Instead, they suggest using the BIC and prove its model selection consistency property. A similar conclusion has been reached by Wang and Leng⁴³ for the ALASSO. Hence, due to its widely reported superior empirical performance in variable selection, we use a BIC-type criterion to determine the values of the tuning parameter(s), where

$$BIC (λ) = - 2 ℓ_{0} (\hat{θ}) + d f \cdot \log n, \tag{7}$$

$ℓ_{0} (\hat{θ})$ is the unpenalized log-likelihood function defined in (1) evaluated at the penalized estimates, $n$ is the sample size, and $d f = tr [\{I (\hat{θ})\}^{- 1} I_{0} (\hat{θ})]$ is the effective degrees of freedom.⁴⁴ The BIC is a function of $λ$ through $\hat{θ}$ and $d f$, but for notational convenience, we suppress this dependence. We define

$$λ^{*} = \underset{λ}{\arg\min} \; BIC (λ) \tag{8}$$

Note that, as described in Section 2.2, $λ^{*}$ will either be one-dimensional (when a common penalty is applied to $β$ and $α$) or two-dimensional (when separate penalties are applied). We define ${\hat{θ}}^{*}$ to be the vector of coefficient values corresponding to $λ^{*}$ from (8).

The simplest method to solve this optimization problem is grid search. While it is straightforward to implement, grid search is known to suffer from the curse of dimensionality, that is, the number of grid points grows exponentially with the dimension. Furthermore, if the grid is too coarse, the minimum may be overlooked. This is especially true in the case of a multi-modal function, such as the BIC objective function as shown in Figure 2. This multi-modality arises in the BIC due to the tradeoff between complexity ($d f$) and data fit ($ℓ_{0} (\hat{θ})$). As an example, Table 1 “zooms in” on a particular portion of the BIC function from Figure 2, wherein the estimated $β_{6}$ coefficient decreases by an order of magnitude (becoming close to zero). Typically, $d f$ decreases rapidly when a given coefficient gets close to zero, but in such a way that the decreasing likelihood leads to a local minimum in the BIC (at $λ = 0.023$ in Table 1).

Table 1.

The degrees of freedom, likelihood function, and BIC value evaluated at different tuning parameter values for the model with the LASSO penalty (one tuning parameter case) for the lung cancer dataset analysed in Section 5.

$λ$	df	$ℓ$	BIC	$β_{0}$	$β_{6}$
0.0221	34.00	−1825.3	3880.1	−2.7096	−0.0152
0.0226	33.50	−1825.9	3877.9	−2.7034	−0.0133
0.0230	33.30	−1826.6	3877.8	−2.6973	−0.0114
0.0234	33.27	−1827.1	3878.8	−2.6915	−0.0095
0.0238	33.23	−1827.7	3879.7	−2.6857	−0.0075
0.0243	33.20	−1828.3	3880.7	−2.6799	−0.0056
0.0247	33.15	−1828.9	3881.7	−2.6742	−0.0037
0.0251	33.00	−1829.6	3882.0	−2.6684	−0.0019

BIC: Bayesian information criterion; df: degrees of freedom; LASSO: least absolute shrinkage and selection operator.
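The entries of Table 1 can be used to illustrate (7) directly. The sketch below transcribes the table, locates the tabulated minimum, and backs out the value of $\log n$ implied by each row; the sample size itself is not stated in this section, so it is not assumed here, and the variable names are illustrative.

```python
import math

# Rows of Table 1: (lambda, df, log-likelihood, BIC).
table1 = [
    (0.0221, 34.00, -1825.3, 3880.1),
    (0.0226, 33.50, -1825.9, 3877.9),
    (0.0230, 33.30, -1826.6, 3877.8),
    (0.0234, 33.27, -1827.1, 3878.8),
    (0.0238, 33.23, -1827.7, 3879.7),
    (0.0243, 33.20, -1828.3, 3880.7),
    (0.0247, 33.15, -1828.9, 3881.7),
    (0.0251, 33.00, -1829.6, 3882.0),
]

def bic(loglik, df, n):
    """BIC as in (7): -2 * log-likelihood + df * log(n)."""
    return -2.0 * loglik + df * math.log(n)

# Tuning parameter value with the smallest tabulated BIC.
best_lambda = min(table1, key=lambda row: row[3])[0]

# Rearranging (7) gives log(n) = (BIC + 2 * loglik) / df for every row.
implied_log_n = [(b + 2.0 * ll) / df for _, df, ll, b in table1]

# Reconstructing the first row's BIC from the implied log(n) of about 6.75.
first_row_bic = bic(-1825.3, 34.00, math.exp(6.75))
```

The implied $\log n$ is stable across rows (about 6.75, up to the rounding of the tabulated values), which confirms that the df and likelihood columns combine as in (7), and the smallest tabulated BIC occurs at the third row.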

Although the BIC’s consistency property has led to its extensive use in tuning-parameter selection, we suggest that such a multi-modal function would be better optimized by a “global” optimizer (rather than grid search as is typically used in the literature). In an empirical comparison of a wide variety of (stochastic and deterministic) algorithms for continuous global optimization, Mullen²⁸ found DEoptim (implemented in R)²⁷ to be among the best. The function implements a differential evolution algorithm, an example of an evolutionary strategy developed by Storn and Price⁴⁵ (see Mullen et al.²⁷ for a detailed overview of the underlying algorithm). As highlighted by an anonymous reviewer, model-based (Bayesian) optimization (MBO) lends itself well to situations where the objective function is computationally burdensome; for example, embedded within each BIC computation in (7) is a model estimated by iteratively solving the equations in (5) for a given $λ$ vector. Although not considered by Mullen,²⁸ our initial testing suggests that DEoptim also outperforms MBO as implemented in the mlrMBO package⁴⁶ (see Supplemental Material).
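To convey the idea behind differential evolution, the following is a minimal pure-Python sketch of one common variant (random base vector, one difference vector, binomial crossover). It is not the DEoptim implementation and makes no claim about DEoptim's defaults; the control values (`pop_size`, `F`, `CR`, `generations`) and the toy multi-modal objective standing in for $BIC (λ)$ are assumptions chosen purely for illustration.

```python
import math
import random

def differential_evolution(f, bounds, pop_size=20, F=0.8, CR=0.9,
                           generations=100, seed=1):
    """Minimize f over a box using a basic differential evolution scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [f(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, all different from member i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated entry
            trial = []
            for j, (lo, hi) in enumerate(bounds):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_cost = f(trial)
            # Greedy selection: a member is replaced only if the trial is no worse.
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]

def toy_bic(lam_vec):
    """Hypothetical multi-modal stand-in for BIC(lambda) in one dimension."""
    lam = lam_vec[0]
    return (lam - 0.3) ** 2 + 0.02 * math.sin(60.0 * lam)

bounds = [(0.0, 1.0)]
best_lam, best_cost = differential_evolution(toy_bic, bounds)
```

Because selection is greedy, the best objective value in the population never increases from one generation to the next, while the mutation step lets the population move between basins of a multi-modal function. In the setting of this article, each evaluation of the objective would involve fitting the penalized model for the given $λ$, which is far more expensive than the toy function used here.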

3.3. Variable selection algorithm

The variable selection algorithm described above is summarized in the following points:

• Initialization. Set $θ^{(0)} = {\hat{θ}}_{0}$ where ${\hat{θ}}_{0}$ is the vector of unpenalized estimates, that is, those which maximize $ℓ_{0} (θ)$ (defined in (1)).
• Optimization.
– Outer. Minimize $BIC (λ)$ with respect to $λ$ using DEoptim, yielding $λ^{*}$ as defined in (8). Convergence occurs when $| BIC (λ^{(r + 1)}) - BIC (λ^{(r)}) |$ is below a prespecified threshold. (Here, $λ^{(r)}$ is the best $λ$ value found at step $r$ of the DEoptim algorithm.)
– Inner. For a given value of $λ$, maximize $ℓ (θ)$ by iteratively re-solving the system of equations given in (5) starting from the initial value, $θ^{(0)}$; this yields $\hat{θ}$. Convergence occurs when $| | θ^{(m + 1)} - θ^{(m)} | |_{\infty}$ is below a prespecified threshold. (Here $| | y | |_{\infty} = \max_{j} | y_{j} |$ is the infinity norm.)
• Output. The estimates ${\hat{θ}}^{*}$ corresponding to $λ^{*}$ are returned from the above procedure, and the corresponding standard errors are calculated by evaluating (6) at ${\hat{θ}}^{*}$.

4. Simulation studies

4.1. Setup

The performance of the proposed variable selection methods is evaluated through simulation studies. The failure time is simulated from a Weibull MPR model with

$$\begin{aligned} \log (τ_{i}) & = x_{i}^{T} (- 1.5, - 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, - 0.8, 0.5, 0.0, 0.0)^{T} \\ \log (γ_{i}) & = z_{i}^{T} (0.5, 0.4, 0.0, 0.0, 0.0, 0.4, - 0.2, 0.0, 0.0, 0.0, 0.0)^{T} \end{aligned}$$

where $x_{i} = z_{i} = (1, x_{i 1}, \dots, x_{i 10})^{T}$ and the covariates $x_{i 1}, \dots, x_{i 10}$ are correlated variables generated from an AR(1) process with a correlation coefficient $ρ = 0.5$. Each variable is marginally standard normal and the correlation between any two variables $x_{i j}$ and $x_{i k}$ is given by $ρ^{| j - k |}$. The corresponding censoring times were generated from a uniform distribution such that the censoring proportion was $p_{cen} = 25\%$. This setup was chosen so as to yield realistic survival data, where the true model is sparse and the covariates are correlated. The results for three different sample sizes ($n$ = 100, 500, and 1000) are presented here. For each scenario, we considered the LASSO, SCAD, and ALASSO penalties with a single tuning parameter or two tuning parameters (i.e., one for each of the two regression components). Each simulation scenario was replicated 1000 times.
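A data-generating process of this kind can be sketched with the Python standard library. The coefficient vectors below are those given above; everything else is an assumption made for illustration, in particular the function names, the seed, and the upper limit `cens_upper` of the uniform censoring distribution, whose value is not reported in this section and would need to be tuned to reach the 25% censoring proportion.

```python
import math
import random

BETA = [-1.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.8, 0.5, 0.0, 0.0]
ALPHA = [0.5, 0.4, 0.0, 0.0, 0.0, 0.4, -0.2, 0.0, 0.0, 0.0, 0.0]

def ar1_covariates(rng, p=10, rho=0.5):
    """Standard normal covariates with corr(x_j, x_k) = rho ** |j - k|."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        innovation = rng.gauss(0.0, 1.0)
        x.append(rho * x[-1] + math.sqrt(1.0 - rho * rho) * innovation)
    return x

def simulate(n, cens_upper, seed=1):
    """Return a list of (observed time, event indicator, covariate vector)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [1.0] + ar1_covariates(rng)
        tau = math.exp(sum(b * v for b, v in zip(BETA, x)))
        gamma = math.exp(sum(a * v for a, v in zip(ALPHA, x)))
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        # Invert the Weibull survivor function S(t) = exp(-tau * t ** gamma).
        failure = (-math.log(u) / tau) ** (1.0 / gamma)
        censor = rng.uniform(0.0, cens_upper)  # hypothetical censoring limit
        data.append((min(failure, censor), int(failure <= censor), x))
    return data

sample = simulate(200, cens_upper=5.0)
censored_share = 1.0 - sum(d for _, d, _ in sample) / len(sample)
```

In practice one would adjust `cens_upper` (for example, by a simple search over a large simulated sample) until `censored_share` is close to the target proportion.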

4.2. Simulation results

The variable selection and estimation procedures described in Sections 2 and 3 are applied to the simulated data and the results are summarized and discussed here. A number of metrics are used to evaluate the performance of the variable selection procedures, namely the average number of true zero coefficients correctly set to zero (C), the average number of true non-zero coefficients incorrectly set to zero (IC), and the probability of choosing the true model (PT); for the oracle model, C = 7 and IC = 0. As a measure of prediction accuracy, we also consider the mean squared error (MSE), given by $MSE (\hat{β}) = (\hat{β} - β)^{T} S_{β} (\hat{β} - β)$ and $MSE (\hat{α}) = (\hat{α} - α)^{T} S_{α} (\hat{α} - α)$, where $S_{β}$ and $S_{α}$, the simulated sample covariance matrices of the covariates, are computed for each simulation replicate.^47,48 These metrics, averaged over simulation replicates for the scenarios with 25% censoring, are reported in Table 2.
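For a single simulation replicate, these metrics could be computed along the following lines. This is a sketch under stated assumptions: the threshold `tol` below which a smoothed estimate is treated as zero is hypothetical (no threshold is given in this section), the intercept is excluded from C and IC, and the matrix passed to `mse` is assumed to match the dimension of the coefficient vectors supplied.

```python
def selection_metrics(estimate, truth, tol=1e-3):
    """Return (C, IC, true_model_selected) for the non-intercept coefficients."""
    correct_zeros = incorrect_zeros = 0
    for est, true in zip(estimate[1:], truth[1:]):
        set_to_zero = abs(est) < tol  # hypothetical zero threshold
        if set_to_zero and true == 0.0:
            correct_zeros += 1
        elif set_to_zero and true != 0.0:
            incorrect_zeros += 1
    n_true_zero = sum(1 for v in truth[1:] if v == 0.0)
    true_model = correct_zeros == n_true_zero and incorrect_zeros == 0
    return correct_zeros, incorrect_zeros, true_model

def mse(estimate, truth, S):
    """Quadratic form (estimate - truth)^T S (estimate - truth)."""
    d = [e - v for e, v in zip(estimate, truth)]
    size = len(d)
    return sum(d[j] * S[j][k] * d[k] for j in range(size) for k in range(size))

# True scale coefficients from the simulation setup and a made-up estimate.
beta_true = [-1.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.8, 0.5, 0.0, 0.0]
beta_hat = [-1.4, -0.9, 0.0, 0.0005, 0.0, 0.0, 0.2, -0.7, 0.0, 0.0, 0.0]

c, ic, true_model = selection_metrics(beta_hat, beta_true)

# Two-coefficient toy example with an identity matrix in place of S.
toy_mse = mse([0.1, 0.2], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this made-up example, six of the seven true zeros are recovered, one true non-zero coefficient is wrongly removed, and the true model is therefore not selected. Averaging `c` and `ic` over replicates gives the C and IC columns, and the proportion of replicates with `true_model` equal to `True` gives PT.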

Table 2.
Selection results: Variable selection metrics averaged over 1000 simulation replicates.

		LASSO				SCAD				ALASSO
$p_{cen}$ = 25%	$n$	C(7)	IC(0)	PT	MSE	C(7)	IC(0)	PT	MSE	C(7)	IC(0)	PT	MSE
One tuning parameter
Scale ($β$)	100	5.38	0.18	0.13	0.35	6.37	0.11	0.53	0.28	6.21	0.08	0.42	0.22
Scale ($β$)	500	5.63	0.00	0.25	0.08	6.96	0.00	0.97	0.02	6.87	0.00	0.88	0.03
Scale ($β$)	1000	5.88	0.00	0.34	0.05	7.00	0.00	1.00	0.01	6.96	0.00	0.96	0.01
Shape ($α$)	100	3.61	0.08	0.02	0.05	4.49	0.09	0.05	0.05	6.37	0.22	0.45	0.04
Shape ($α$)	500	3.46	0.00	0.01	0.01	5.95	0.00	0.34	0.01	6.89	0.00	0.90	0.00
Shape ($α$)	1000	3.48	0.00	0.00	0.00	6.52	0.00	0.63	0.00	6.95	0.00	0.95	0.00
Two tuning parameters
Scale ($β$)	100	4.88	0.10	0.11	0.30	6.28	0.11	0.56	0.28	6.29	0.10	0.46	0.23
Scale ($β$)	500	5.18	0.00	0.17	0.07	6.89	0.00	0.95	0.02	6.88	0.00	0.90	0.02
Scale ($β$)	1000	5.42	0.00	0.21	0.04	6.95	0.00	0.98	0.01	6.96	0.00	0.96	0.01
Shape ($α$)	100	5.08	0.26	0.09	0.06	5.06	0.18	0.10	0.05	6.44	0.26	0.45	0.04
Shape ($α$)	500	5.42	0.00	0.21	0.01	6.10	0.00	0.42	0.01	6.90	0.00	0.91	0.00
Shape ($α$)	1000	5.57	0.00	0.26	0.00	6.51	0.00	0.62	0.00	6.96	0.00	0.96	0.00

C: average correct zeros; IC: average incorrect zeros; PT: the probability of choosing the true model; MSE: the average mean squared error; LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation; ALASSO: adaptive least absolute shrinkage and selection operator.

As the sample size increases, we generally see an improvement across all four metrics, for both the shape and the scale parameters and across all penalties (an exception being the LASSO with one tuning parameter in the shape component, where C and PT do not improve with $n$ in Table 2). However, it is evident that the LASSO penalty does not set enough covariates equal to zero (i.e., it selects an overly complex model). While the LASSO with one tuning parameter outperforms the LASSO with two tuning parameters in the scale component, it has very poor performance in the shape component (and we can also confirm that the BIC values are much higher). In any case, the LASSO over-selects irrespective of whether it has one or two tuning parameters, leading to quite low PT values. SCAD performs better than the LASSO, but still over-selects somewhat in the shape component. The best overall performance comes from the ALASSO penalty, which, for the largest sample size, selects the true scale and shape covariates more than 90% of the time. Interestingly, the ALASSO performs well even with a single tuning parameter (but it does improve with two tuning parameters). In terms of the computation time, SCAD has been found to be slower than the LASSO and ALASSO penalties. Furthermore, the computation times for the cases with two tuning parameters are two to three times longer than those with one tuning parameter.

Figure 3 provides the boxplots of the C and MSE performance metrics over simulation replicates to account for variability in the results (for the penalties with two tuning parameters). Moreover, we additionally include the results for the full unpenalized and true oracle models, respectively, to act as worst-case and best-case benchmarks. It is clear that the C metric tends to be lower in the LASSO than SCAD and the ALASSO. The latter two are comparable in the scale component but the ALASSO outperforms SCAD in the shape, achieving the oracle value of C $= 7$ when $n \geq 500$ with essentially no variation. In terms of MSE, we again see that SCAD and the ALASSO are similar (with lower values than LASSO). However, the ALASSO has slightly lower MSE values in the shape component and, as with the C metric, is very similar to the oracle model for $n \geq 500$ .

Figure 3.

(a) True zero coefficients correctly set to zero (C) and (b) mean squared error (MSE) by model, distributional parameter and sample size across 1000 replicates for the models with two tuning parameters.
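As a minimal sketch of how selection metrics of this kind can be tallied from simulation output, the following assumes that C counts true zero coefficients estimated as zero, that IC counts non-zero coefficients incorrectly set to zero, and that PT is the proportion of replicates in which exactly the true set of covariates is selected. The function name, the toy `truth` vector, and the toy `estimates` matrix are hypothetical and are not taken from the study.

```python
import numpy as np

def selection_metrics(estimates, truth, tol=0.0):
    """Return (mean C, mean IC, PT) over simulation replicates.

    estimates: (replicates, p) array of penalized estimates, intercept excluded.
    truth:     (p,) array of true coefficient values.
    tol:       magnitudes at or below tol are treated as zero (assumption).
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    est_zero = np.abs(estimates) <= tol
    true_zero = truth == 0
    c = (est_zero & true_zero).sum(axis=1)    # zeros correctly identified
    ic = (est_zero & ~true_zero).sum(axis=1)  # non-zeros wrongly removed
    pt = (est_zero == true_zero).all(axis=1)  # exact recovery of the true model
    return c.mean(), ic.mean(), pt.mean()

# Toy example: two true zeros, three replicates (hypothetical values).
truth = np.array([-1.0, 0.0, 0.0, 0.5])
estimates = np.array([
    [-0.9, 0.0, 0.0, 0.4],  # true model recovered
    [-1.1, 0.2, 0.0, 0.6],  # one zero coefficient wrongly kept
    [-0.8, 0.0, 0.0, 0.0],  # one non-zero coefficient wrongly removed
])
c, ic, pt = selection_metrics(estimates, truth)
```

In this toy case only the first replicate recovers the true model, so PT is one third; the second replicate lowers C and the third raises IC.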

In addition to variable selection performance, we also consider parameter inference in terms of estimation bias, the accuracy of the estimated standard error (SEE), computed using the sandwich formula (6), relative to the true standard error (SE), computed as the standard deviation over simulation replicates, and the empirical coverage probability (CP) of a nominal 95% confidence interval. The results for the ALASSO penalty (for the 25% censoring level) are presented in Table 3. Overall, we can see that the estimation bias and SEs reduce with the sample size and the CP values get closer to the nominal 95% level. However, at $n = 100$ , the CP values are more than 10 percentage points lower than the nominal level in many cases. This is due to the parameters being overshrunk and the SEEs underestimating the SEs; this particularly impacts parameters which have smaller magnitudes (e.g., $β_{8}$ and $α_{6}$ ). All results are improved by having two tuning parameters, and, indeed, for $n = 1000$ , the CP values for most parameters are within 2 percentage points of the nominal value of 95%, and within 4 percentage points for $β_{8}$ and $α_{6}$ . We defer LASSO and SCAD results to the Supplemental Material, where we find: the LASSO overshrinks parameters and underestimates the SEs; SCAD has lower bias, but underestimates the SEs such that CP values remain poor even at $n = 1000$ (e.g., 20% for $β_{8}$ and 70% for $α_{6}$ ).

Table 3.

Inferential results: estimates, standard errors, and confidence intervals.

ALASSO
		$n$ = 100				$n$ = 500				$n$ = 1000
$p_{cen}$ = 25%	$θ$	$\hat{θ}$	SE	SEE	CP	$\hat{θ}$	SE	SEE	CP	$\hat{θ}$	SE	SEE	CP
One tuning parameter
$β_{0}$	−1.50	−1.47	0.23	0.21	0.91	−1.48	0.09	0.09	0.93	−1.49	0.06	0.06	0.94
$β_{1}$	−1.00	−0.95	0.21	0.16	0.86	−0.98	0.07	0.07	0.92	−0.99	0.05	0.05	0.96
$β_{7}$	−0.80	−0.73	0.20	0.15	0.85	−0.77	0.06	0.06	0.92	−0.79	0.05	0.04	0.93
$β_{8}$	0.50	0.40	0.19	0.13	0.80	0.47	0.06	0.06	0.89	0.48	0.04	0.04	0.89
$α_{0}$	0.50	0.52	0.11	0.09	0.91	0.50	0.04	0.04	0.95	0.50	0.03	0.03	0.94
$α_{1}$	0.40	0.37	0.07	0.06	0.88	0.39	0.02	0.02	0.94	0.40	0.01	0.01	0.95
$α_{5}$	0.40	0.35	0.08	0.06	0.78	0.39	0.03	0.02	0.91	0.39	0.02	0.02	0.91
$α_{6}$	−0.20	−0.13	0.09	0.05	0.68	−0.18	0.03	0.02	0.87	−0.19	0.02	0.02	0.90
Two tuning parameters
$β_{0}$	−1.50	−1.49	0.24	0.21	0.92	−1.49	0.09	0.09	0.94	−1.49	0.06	0.06	0.95
$β_{1}$	−1.00	−0.97	0.21	0.16	0.87	−0.99	0.07	0.07	0.93	−0.99	0.05	0.05	0.96
$β_{7}$	−0.80	−0.75	0.21	0.15	0.85	−0.78	0.06	0.06	0.94	−0.79	0.05	0.04	0.93
$β_{8}$	0.50	0.41	0.20	0.12	0.80	0.47	0.06	0.06	0.91	0.49	0.04	0.04	0.91
$α_{0}$	0.50	0.53	0.11	0.09	0.90	0.50	0.04	0.04	0.95	0.50	0.03	0.03	0.94
$α_{1}$	0.40	0.37	0.07	0.05	0.87	0.40	0.02	0.02	0.95	0.40	0.01	0.01	0.95
$α_{5}$	0.40	0.35	0.09	0.06	0.77	0.39	0.03	0.02	0.92	0.39	0.02	0.02	0.92
$α_{6}$	−0.20	−0.14	0.10	0.05	0.70	−0.19	0.03	0.02	0.88	−0.19	0.02	0.02	0.91

SE: standard deviation of estimates over 1000 replications; SEE: average of estimated standard errors over 1000 replications; CP: the empirical coverage probability of a nominal 95% confidence interval; ALASSO: adaptive least absolute shrinkage and selection operator.
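The quantities defined in the table note can be computed per parameter from replicate-level output. The sketch below is illustrative only: it assumes a Wald-type interval (estimate ± 1.96 × estimated standard error), and the simulated draws, the function name, and the values 0.20 and 0.12 are hypothetical inputs rather than output of the penalized fits.

```python
import numpy as np

def inference_summary(estimates, std_errors, theta, z=1.96):
    """Summarize replicate-level estimates of a single parameter.

    estimates:  (replicates,) point estimates.
    std_errors: (replicates,) estimated standard errors.
    theta:      true parameter value.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors  # Wald interval (assumption)
    upper = estimates + z * std_errors
    return {
        "bias": estimates.mean() - theta,
        "SE": estimates.std(ddof=1),    # SD of estimates over replicates
        "SEE": std_errors.mean(),       # average estimated standard error
        "CP": np.mean((lower <= theta) & (theta <= upper)),
    }

# Hypothetical, unbiased draws used only to show the effect of a too-small SEE.
rng = np.random.default_rng(1)
theta, true_se, reps = 0.5, 0.20, 1000
estimates = rng.normal(theta, true_se, size=reps)

well_calibrated = inference_summary(estimates, np.full(reps, true_se), theta)
underestimated = inference_summary(estimates, np.full(reps, 0.12), theta)
```

With the standard error reported correctly, the coverage should sit near the nominal 95%; when the reported standard error is smaller than the replicate-to-replicate standard deviation, the intervals are too narrow and coverage drops, which is the mechanism behind low CP values when the SEEs underestimate the SEs.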

Figure 4 displays the boxplots of estimates of $β_{1}$ (non-zero coefficient) and $β_{2}$ (zero coefficient) for the ALASSO over simulation replicates; for comparison, the full unpenalized and true oracle models are shown. It is clear that estimates from the ALASSO penalty (with either one or two tuning parameters) converge to those of the oracle model, and, for $n \geq 500$ , we see that the zero coefficient is correctly set to zero in all but a few outlying cases. The boxplots of the SEEs for $β_{1}$ are also shown in Figure 5, where we again see convergence to the oracle model (but note that SEs are underestimated for $n = 100$ ). Similar boxplots for other parameters are shown in the Supplemental Material.

Figure 4.

Coefficient estimates from the models with adaptive least absolute shrinkage and selection operator (ALASSO) penalties by sample size and across the 1000 replicates: (a) $β_{1}$ coefficient estimates and (b) $β_{2}$ coefficient estimates (the dashed line represents the true coefficient value).

Figure 5.

Boxplots of SEEs for $β_{1}$ along with SEs (dot). SEEs: estimated standard errors; SEs: standard errors.

We have also tested all approaches at the higher censoring proportion of 50% (see Supplemental Material), where performance decreases across all metrics (e.g., reduced selection performance, increased bias, and variability), especially at smaller sample sizes. However, at $n = 1000$ , the two-tuning-parameter ALASSO performs quite favorably, for example, PT $> 90\%$ and CP $\in [88\%, 96\%]$ . We have further tested the ALASSO in additional simulation scenarios where we increased the correlation amongst covariates (from $ρ = 0.5$ to $ρ = 0.8$ ) and number of covariates (from 10 to 20), and decreased the proportion of non-zero effects (from 30% to 10%). Again, these can be found in the Supplemental Material, and, in all cases, the performance of the ALASSO is very favorable and broadly similar to the results already discussed.

5. Lung cancer study

Here we consider data from an observational lung cancer study which was collected by Wilkinson⁴⁹ (see also Burke and MacKenzie⁴). This study includes all individuals, of all ages, diagnosed with lung cancer in Northern Ireland during the one-year period 1 October 1991 to 30 September 1992. Only cases of primary lung cancer were included. The date of diagnosis was taken to be the time origin for an individual and the end point was the earlier of the occurrence of death or the study end date, 30 May 1993. Individuals who were still alive on the study end date were taken to have censored survival times. Individuals who died from another cause or who dropped out of the study were also censored. The final dataset included 855 patients, of which there were 673 deaths and 182 censored times. Besides the survival time and the censoring indicator, a number of other variables were recorded for each of the patients enrolled in the study (reference categories are listed first): age group (< 40-, 50-, 60-, 70-, and > 80), sex (female and male), treatment group (palliative, surgery, chemotherapy, radiotherapy, and combined chemotherapy and radiotherapy), WHO status (normal activity, light work, unable to work, $> 50\%$ walking, and bed/chair bound), cancer cell type (squamous cell, small cell, adenocarcinoma, and other), serum sodium level ( $\geq 136 mmol/L$ , $< 136 mmol/L$ , and missing), serum albumen level ( $\geq 35 g/L$ , $< 35 g/L$ , and missing), metastases (no, yes, and unknown), and smoking status (non-smoker, current smoker, ex-smoker, and missing).

5.1. Adequacy of Weibull

Before considering covariates and variable selection, we first carry out an initial check that a baseline Weibull distribution is appropriate for the lung cancer data. The cumulative hazard function for the Weibull model is given by $H (t) = \int_{0}^{t} h (u) d u = τ t^{γ}$ , and, hence, $\log H (t) = \log τ + γ \log t$ . Therefore, given an estimate $\hat{H} (t)$ , a plot of $\log \hat{H} (t)$ against $\log t$ should produce a straight line. This standard Weibull model check is shown in Figure 6, and, despite a slight deficiency for very small survival times, it appears that the Weibull model is reasonable.

Figure 6.

Weibull model check. Here $\hat{H} (t)$ , along with the 95% confidence intervals, come from the Kaplan–Meier estimator.
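The linearity check can be sketched numerically as follows. This is an illustration on simulated data, not the lung cancer data: it assumes the cumulative hazard is estimated as $\hat{H} (t) = - \log \hat{S} (t)$ from a Kaplan–Meier estimate, and the helper `kaplan_meier`, the sample size, the uniform censoring scheme, and the parameter values are all hypothetical choices.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return distinct event times and the Kaplan-Meier survival estimate at each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    at_risk = np.array([(time >= t).sum() for t in event_times])
    deaths = np.array([((time == t) & event).sum() for t in event_times])
    surv = np.cumprod(1.0 - deaths / at_risk)
    return event_times, surv

# Simulate Weibull survival times with H(t) = tau * t**gamma, plus uniform censoring.
rng = np.random.default_rng(0)
n, tau, gamma = 2000, 0.5, 1.5
t_event = (rng.exponential(size=n) / tau) ** (1.0 / gamma)
t_cens = rng.uniform(0.0, 4.0, size=n)
time = np.minimum(t_event, t_cens)
event = t_event <= t_cens

times, surv = kaplan_meier(time, event)
keep = (surv > 0) & (surv < 1)            # log(-log S) is undefined at 0 and 1
log_t = np.log(times[keep])
log_H = np.log(-np.log(surv[keep]))       # log of estimated cumulative hazard

# Under a Weibull model: log H(t) = log(tau) + gamma * log(t).
slope, intercept = np.polyfit(log_t, log_H, 1)
```

If the Weibull model holds, `slope` should be close to $γ$ and `intercept` close to $\log τ$; marked curvature in a plot of `log_H` against `log_t` would instead suggest that a different baseline distribution is needed.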

5.2. Variable selection results

The variable selection results for the different penalties are summarized in Table 4. In line with the results of the simulation study, the LASSO penalty selects the most complex model and the ALASSO penalty selects the least complex. Both ALASSO penalties (one- and two-tuning-parameter cases) agree on the non-importance of sex and smoking status, and although age group is selected in the scale in the one-tuning-parameter case, it is not significant. Interestingly, the two-tuning-parameter ALASSO selects the same set of covariates as identified by Burke and MacKenzie⁴ using a BIC stepwise procedure (albeit they additionally selected treatment in the shape). We also see that, in the two-tuning-parameter cases, the scale tuning parameter is smaller than the tuning parameter of the one-tuning-parameter case, while the shape tuning parameter is larger. This suggests that the single penalty over-penalizes the scale coefficients and under-penalizes the shape; this is also evident from the scale and shape degrees of freedom. The one-tuning-parameter ALASSO converges in less than half the time of the two-tuning-parameter ALASSO, and achieves similar results. We expect this based on our simulation studies, and also expect the results of the two-tuning-parameter case to be marginally better (albeit it takes longer to converge).

Table 4.
Summary of penalized models (lung cancer data).

	One tuning parameter			Two tuning parameters
	LASSO	SCAD	ALASSO	LASSO	SCAD	ALASSO
Treatment	$β$ , $α$	$β$ , $α$	$β$ , $α$	$β$ , $α$	$β$ , $α$	$β$
Age group	$α$	$α$	$β$	$α$	$α$	–
WHO status	$β$ , $α$	$β$ , $α$	$β$	$β$ , $α$	$β$ , $α$	$β$
Sex	$α$	–	–	–	–	–
Smoking status	$α$	$α$	–	$β$	$β$	–
Cell type	$β$ , $α$	$β$ , $α$	$β$	$β$	$β$ , $α$	$β$
Metastases	$β$ , $α$	$β$ , $α$	$β$	$β$	$β$	$β$
Sodium	$β$ , $α$	$β$ , $α$	$β$	$β$	$β$	$β$
Albumen	$β$ , $α$	$β$ , $α$	$β$	$β$ , $α$	$β$ , $α$	$β$
Tuning parameter(s)	0.026	0.041	0.015	0.014	0.024	0.004
				0.080	0.074	0.045
Degrees of freedom	32.5	27.1	15.5	25.6	24.6	15.2
Scale degrees of freedom	14.2	12.0	13.2	18.3	17.1	14.2
Shape degrees of freedom	18.3	15.0	2.4	7.4	7.6	1.0

$β$ = “selected in scale,” $α$ = “selected in shape,” and those which are non-significant (at the $5\%$ level) are shown in gray. LASSO: least absolute shrinkage and selection operator; SCAD: smoothly clipped absolute deviation; ALASSO: adaptive least absolute shrinkage and selection operator; WHO: World Health Organization.

Table 5 displays the estimated coefficients for both ALASSO penalties along with the unpenalized coefficients (we focus on the ALASSO due to its superior performance in our simulation studies, but similar tables for LASSO and SCAD can be found in the Supplemental Material). Note that the scale coefficients characterize the overall scale of the hazard (a positive value indicates an increase relative to the reference category), while the shape coefficients characterize its time evolution (a positive value indicates a hazard which increases over time relative to the reference category). We clearly see the similarity of the coefficient values for both the one- and two-tuning-parameter ALASSO penalties, and, furthermore, that the selected variables are broadly in line with those which are statistically significant in the unpenalized model. Focusing on the results of the two-tuning-parameter case, we find that all treatments (apart from chemotherapy) have a negative scale coefficient, suggesting that treatment reduces the hazard (relative to palliative care); however, worse WHO status, small cell cancer type, presence of metastases, and reduced sodium and albumen levels increase the hazard; lastly, sex, age group, and smoking status have no significant effect on the hazard. Since no variable appears in the shape component (i.e., all shape coefficients are set to zero), the selected model is a PH model, and exponentiating the scale coefficients yields the hazard ratios, for example, the surgery hazard ratio is $\exp (- 0.98) = 0.375$ so that the risk of death is approximately 37.5% that of a patient receiving palliative care.
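The hazard ratio calculation above extends directly to the other selected scale coefficients. The short sketch below exponentiates a few two-tuning-parameter scale coefficients copied from Table 5; the dictionary and its labels are introduced here purely for illustration, and the reading as hazard ratios relies on the selected model being a PH model, as noted above.

```python
import math

# Two-tuning-parameter ALASSO scale coefficients, copied from Table 5.
scale_coef = {
    "Surgery": -0.98,
    "Radiotherapy": -0.21,
    "Chemo. and radio.": -0.63,
    "Metastases: yes": 0.84,
}

# In a PH model, exp(coefficient) is the hazard ratio relative to the reference category.
hazard_ratios = {name: math.exp(beta) for name, beta in scale_coef.items()}

for name, hr in hazard_ratios.items():
    print(f"{name}: hazard ratio = {hr:.3f}")
```

Values below one indicate a lower hazard than the reference category (palliative care for the treatments, no metastases for the metastases indicator), and values above one indicate a higher hazard.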

Table 5.

Coefficient estimates and standard errors for the ALASSO penalties (lung cancer data).

		Scale			Shape
Covariate		No penalty	One tuning	Two tuning	No penalty	One tuning	Two tuning
Intercept		−3.38 (0.66)	−3.12 (0.17)	−3.15 (0.17)	−0.16 (0.22)	0.04 (0.03)	0.04 (0.03)
Treatment	Surgery	−1.69 (0.83)	−0.89 (0.21)	−0.98 (0.22)	0.11 (0.21)	0.00 (0.00)	0.00 (0.00)
	Chemotherapy	−0.33 (0.37)	0.00 (0.00)	0.00 (0.00)	−0.03 (0.15)	0.00 (0.00)	0.00 (0.00)
	Radiotherapy	−0.85 (0.21)	−0.16 (0.10)	−0.21 (0.10)	0.22 (0.08)	0.00 (0.00)	0.00 (0.00)
	Chemo. and radio.	−3.83 (0.98)	−2.30 (0.89)	−0.63 (0.22)	0.77 (0.20)	0.51 (0.21)	0.00 (0.00)
Age group	$50 -$	−0.90 (0.43)	0.00 (0.00)	0.00 (0.00)	0.39 (0.16)	0.00 (0.00)	0.00 (0.00)
	$60 -$	−0.94 (0.39)	0.00 (0.00)	0.00 (0.00)	0.40 (0.15)	0.00 (0.00)	0.00 (0.00)
	$70 -$	−0.77 (0.39)	0.02 (0.08)	0.00 (0.00)	0.31 (0.15)	0.00 (0.00)	0.00 (0.00)
	$> 80$	−0.78 (0.42)	0.00 (0.00)	0.00 (0.00)	0.31 (0.17)	0.00 (0.00)	0.00 (0.00)
WHO status	Light work	−0.02 (0.45)	0.00 (0.00)	0.00 (0.00)	0.02 (0.12)	0.00 (0.00)	0.00 (0.00)
	Unable to work	0.84 (0.43)	0.41 (0.10)	0.44 (0.10)	−0.10 (0.13)	0.00 (0.00)	0.00 (0.00)
	$> 50\%$ walking	1.31 (0.44)	0.99 (0.11)	0.97 (0.11)	−0.13 (0.14)	0.00 (0.00)	0.00 (0.00)
	Bed/chair bound	1.78 (0.50)	1.28 (0.28)	1.54 (0.25)	−0.03 (0.20)	0.00 (0.00)	0.00 (0.00)
Sex	Male	0.03 (0.14)	0.00 (0.00)	0.00 (0.00)	−0.03 (0.05)	0.00 (0.00)	0.00 (0.00)
Smoking status	Current smoker	0.10 (0.22)	0.00 (0.00)	0.00 (0.00)	0.15 (0.08)	0.00 (0.00)	0.00 (0.00)
	Ex-smoker	−0.05 (0.23)	0.00 (0.00)	0.00 (0.00)	0.17 (0.09)	0.00 (0.00)	0.00 (0.00)
	Missing	0.29 (0.40)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Cell type	Small cell	0.83 (0.26)	0.31 (0.12)	0.43 (0.13)	−0.05 (0.10)	0.00 (0.00)	0.00 (0.00)
	Adenocarcinoma	0.28 (0.28)	0.00 (0.00)	0.00 (0.00)	0.03 (0.10)	0.00 (0.00)	0.00 (0.00)
	Other	0.32 (0.20)	0.00 (0.00)	0.09 (0.09)	−0.04 (0.07)	0.00 (0.00)	0.00 (0.00)
Metastases	Yes	1.35 (0.28)	0.89 (0.12)	0.84 (0.12)	−0.19 (0.08)	0.00 (0.00)	0.00 (0.00)
	Unknown	0.83 (0.30)	0.53 (0.13)	0.41 (0.13)	−0.14 (0.09)	0.00 (0.00)	0.00 (0.00)
Sodium level	$< 136 mmol/L$	0.33 (0.14)	0.14 (0.08)	0.24 (0.08)	−0.01 (0.05)	0.00 (0.00)	0.00 (0.00)
	Missing	−0.77 (0.45)	0.00 (0.00)	0.00 (0.00)	0.32 (0.16)	0.00 (0.00)	0.00 (0.00)
Albumen level	$< 35 g/L$	0.65 (0.16)	0.36 (0.09)	0.37 (0.09)	−0.10 (0.06)	0.00 (0.00)	0.00 (0.00)
	Missing	0.59 (0.28)	0.00 (0.00)	0.27 (0.14)	0.09 (0.15)	0.00 (0.00)	0.00 (0.00)

ALASSO: adaptive least absolute shrinkage and selection operator; WHO: World Health Organization. Bold indicates statistically significant at the $5\%$ level.

6. Discussion

The MPR approach results in flexible models which extend standard models, but the presence of multiple regression components means that variable selection is necessarily more challenging than in standard settings where there is only a single regression component. In this article, we have proposed a penalized variable selection procedure for the simultaneous selection of variables in the scale and shape parameters of a Weibull MPR model in the survival analysis setting. The favorable performance of these methods was examined using simulation studies, and an analysis of lung cancer data was presented. While we have considered the Weibull model example in this article, the proposed variable selection procedures can be applied straightforwardly to other MPR survival models by adapting the likelihood function.

Given that we model different distributional parameters (a scale and a shape parameter), there is no reason to assume that variable selection can be achieved with a single penalty applied to both regression components; hence, we also investigated the need for a separate tuning parameter for each regression component. We have found that the ALASSO performs very well in terms of identifying the true subset of covariates and coverage of calculated confidence intervals. This is true even with a single tuning parameter; however, the results are improved when there are two tuning parameters (albeit this is more computationally intensive). On the other hand, SCAD does not perform well in the MPR setting, selecting an overly complex model and with poor confidence interval coverage for shape parameters.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802231203322 for Penalized variable selection in multi-parameter regression survival modeling by Fatima-Zahra Jaouimaa, Il Do Ha and Kevin Burke in Statistical Methods in Medical Research

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author was funded by the Irish Research Council. The second author was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF2020R1F1A1A01056987). The third author was supported by the Confirm Smart Manufacturing Centre funded by Science Foundation Ireland (Grant Number: 16/RC/3918).

Supplemental material

Supplemental material for this article is available online.

References

1. Cox. Regression models and life-tables. J R Stat Soc: Ser B (Methodological) 1972; 34: 187–202.

2. Schemper. Cox analysis of survival data with non-proportional hazard functions. J R Stat Soc: Ser D (The Statistician) 1992; 41: 455–465.

3. Therneau, Grambsch. Modeling survival data: extending the Cox model. New York: Springer, 2000.

4. Burke, MacKenzie. Multi-parameter regression survival modeling: an alternative to proportional hazards. Biometrics 2017; 73: 678–686.

5. Stasinopoulos, Rigby, Bastiani. Gamlss: a distributional regression approach. Stat Modelling 2018; 18: 248–273.

6. Rigby, Stasinopoulos. Generalized additive models for location, scale and shape. J R Stat Soc: Ser C (Applied Statistics) 2005; 54: 507–554.

7. Stasinopoulos, Rigby, et al. Generalized additive models for location scale and shape (gamlss) in R. J Stat Softw 2007; 23: 1–46.

8. McCullagh, Nelder. Generalized linear models. 2nd ed. London: Chapman & Hall/CRC, 1989.

9. Anderson. A nonproportional hazards Weibull accelerated failure time regression model. Biometrics 1991; 47: 281–288.

10. Lee MLT, Whitmore. Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Statist Sci 2006; 21: 501–513.

11. Aalen, Borgan, Gjessing. Survival and event history analysis: a process point of view. New York: Springer Science & Business Media, 2008.

12. Peng, MacKenzie, Burke. A multiparameter regression model for interval-censored survival data. Stat Med 2020; 39: 1903–1918.

13. Burke, Jones, Noufaily. A flexible parametric modelling framework for survival analysis. J R Stat Soc: Ser C (Applied Statistics) 2020; 69: 429–457.

14. Jones, Noufaily, Burke. A bivariate power generalized Weibull distribution: a flexible parametric model for survival analysis. Stat Methods Med Res 2020; 29: 2295–2306.

15. Burke, Eriksson, Pipper. Semiparametric multiparameter regression survival modeling. Scand J Stat 2019; 47: 555–571.

16. Thompson. Selection of variables in multiple regression: Part I. A review and evaluation. Int Statist Review 1978; 46: 1–19.

17. George. The variable selection problem. J Am Statist Ass 2000; 95: 1304–1308.

18. Breiman. Heuristics of instability and stabilization in model selection. Ann Stat 1996; 24: 2350–2383.

19. Tibshirani. Regression shrinkage and selection via the LASSO. J R Statist Soc B 1996; 58: 267–288.

20. Fan. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Ass 2001; 96: 1348–1360.

21. Zou. The adaptive LASSO and its oracle properties. J Am Statist Ass 2006; 101: 1418–1429.

22. Fan. A selective overview of variable selection in high dimensional feature space. Stat Sinica 2010; 20: 101–148.

23. Groll, Hambuckers, Kneib, et al. LASSO-type penalization in the framework of generalized additive models for location, scale and shape. Comput Stat Data Anal 2019; 140: 59–73.

24. Hunter. Variable selection using MM algorithms. Ann Stat 2005; 33: 1617.

25. Oelker, Tutz. A uniform framework for the combination of penalties in generalized structured models. Adv Data Anal Classif 2017; 11: 97–120.

26. Lloyd-Jones, Nguyen, McLachlan. A globally convergent algorithm for LASSO-penalized mixture of linear regression models. Comput Stat Data Anal 2018; 119: 19–38.

27. Mullen, Ardia, Gil, et al. DEoptim: an R package for global optimization by differential evolution. J Stat Softw 2011; 40: 1–26.

28. Mullen. Continuous global optimization in R. J Stat Softw 2014; 60: 1–45.

29. Radchenko, James. Variable inclusion and shrinkage algorithms. J Am Stat Ass 2008; 103: 1304–1315.

30. Wang. Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics 2009; 65: 564–571.

31. Liu. Variable selection in quantile regression. Stat Sin 2009; 19: 801–817.

32. Benner, Zucknick, Hielscher, et al. High-dimensional Cox models: the choice of penalty as part of the model building process. Biometrical J 2010; 52: 50–69.

33. Xin, You. Model determination and estimation for the growth curve model via group SCAD penalty. J Multivar Anal 2014; 124: 199–213.

34. Efron, Hastie, Johnstone, et al. Least angle regression. Ann Stat 2004; 32: 407–499.

35. Friedman, Hastie, Höfling, et al. Pathwise coordinate optimization. Ann Appl Stat 2007; 1: 302–332.

36. Pan, et al. Variable selection in general frailty models using penalized h-likelihood. J Comp Graph Stat 2014; 23: 1044–1060.

37. Fan. Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat 2002; 30: 74–99.

38. Park. Penalized variable selection for accelerated failure time models with random effects. Stat Med 2019; 38: 878–892.

39. Shao. An asymptotic theory for linear model selection. Stat Sinica 1997; 7: 221–242.

40. Yang. Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation. Biometrika 2005; 92: 937–950.

41. Wang, Leng. Shrinkage tuning parameter selection with a diverging number of parameters. J R Stat Soc B 2009; 71: 671–683.

42. Wang, Tsai. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007; 94: 553–568.

43. Wang, Leng. Unified LASSO estimation by least squares approximation. J Am Stat Ass 2007; 102: 1039–1048.

44. Lee, MacKenzie. Model selection for multi-component frailty models. Stat Med 2007; 26: 4790–4807.

45. Storn, Price. Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces. J Global Optim 1997; 11: 341–359.

46. Bischl, Richter, Bossek, et al. mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions. https://arxiv.org/abs/1703.03373.

47. Zhang. Adaptive LASSO for Cox’s proportional hazards model. Biometrika 2007; 94: 691–703.

48. Tibshirani. The LASSO method for variable selection in the Cox model. Stat Med 1997; 16: 385–395.

49. Wilkinson. Lung cancer in Northern Ireland 1991–1992. PhD thesis, Queen’s University Belfast, 1995.

		LASSO				SCAD				ALASSO
$p_{cen}$ = 25%	$n$	C(7)	IC(0)	PT	MSE	C(7)	IC(0)	PT	MSE	C(7)	IC(0)	PT	MSE
One tuning parameter
Scale ( $β$ )	100	5.38	0.18	0.13	0.35	6.37	0.11	0.53	0.28	6.21	0.08	0.42	0.22
Scale ( $β$ )	500	5.63	0.00	0.25	0.08	6.96	0.00	0.97	0.02	6.87	0.00	0.88	0.03
Scale ( $β$ )	1000	5.88	0.00	0.34	0.05	7.00	0.00	1.00	0.01	6.96	0.00	0.96	0.01
Shape ( $α$ )	100	3.61	0.08	0.02	0.05	4.49	0.09	0.05	0.05	6.37	0.22	0.45	0.04
Shape ( $α$ )	500	3.46	0.00	0.01	0.01	5.95	0.00	0.34	0.01	6.89	0.00	0.90	0.00
Shape ( $α$ )	1000	3.48	0.00	0.00	0.00	6.52	0.00	0.63	0.00	6.95	0.00	0.95	0.00
Two tuning parameters
Scale ( $β$ )	100	4.88	0.10	0.11	0.30	6.28	0.11	0.56	0.28	6.29	0.10	0.46	0.23
Scale ( $β$ )	500	5.18	0.00	0.17	0.07	6.89	0.00	0.95	0.02	6.88	0.00	0.90	0.02
Scale ( $β$ )	1000	5.42	0.00	0.21	0.04	6.95	0.00	0.98	0.01	6.96	0.00	0.96	0.01
Shape ( $α$ )	100	5.08	0.26	0.09	0.06	5.06	0.18	0.10	0.05	6.44	0.26	0.45	0.04
Shape ( $α$ )	500	5.42	0.00	0.21	0.01	6.10	0.00	0.42	0.01	6.90	0.00	0.91	0.00
Shape ( $α$ )	1000	5.57	0.00	0.26	0.00	6.51	0.00	0.62	0.00	6.96	0.00	0.96	0.00