
Abstract

The cause-specific hazard Cox model is widely used in analyzing competing risks survival data, and the partial likelihood method is a standard approach when survival times contain only right censoring. In practice, however, interval-censored survival times often arise, and this means the partial likelihood method is not directly applicable. Two common remedies in practice are (i) to replace each censoring interval with a single value, such as the middle point; or (ii) to redefine the event of interest, such as the time to diagnosis instead of the time to recurrence of a disease. However, the mid-point approach can cause biased parameter estimates. In this article, we develop a penalized likelihood approach to fit semi-parametric cause-specific hazard Cox models, and this method is general enough to allow left, right, and interval censoring times. Penalty functions are used to regularize the baseline hazard estimates and also to make these estimates less affected by the number and location of knots used for the estimates. We will provide asymptotic properties for the estimated parameters. A simulation study is designed to compare our method with the mid-point partial likelihood approach. We apply our method to the Aspirin in Reducing Events in the Elderly (ASPREE) study, illustrating an application of our proposed method.

Keywords

Cause-specific Cox model, constrained optimization, penalized likelihood, Gaussian quadrature

1. Introduction

Competing risks consider the situation where there are multiple events, or multiple causes for a single event, that compete with each other. “Competing” here entails that these risks or events are mutually exclusive (e.g. Putter et al.¹). Traditionally for competing risks data analysis, people only consider right censoring, where, if an event occurs, one will observe the event time with the corresponding risk type, and if no event has occurred, one will only observe a right censoring time without any risk type. A nice feature of a right-censored competing risks problem is that the regression coefficients of a cause-specific hazard (CSH) Cox model can be easily computed using the partial likelihood method.² More specifically, for a given risk and its CSH Cox model, one can simply set the event times and right censoring times of all the other risks as right censoring times for the risk under consideration and then implement the partial likelihood method to estimate the Cox model regression coefficients; see Putter et al.¹ and Prentice et al.³ However, this feature will be lost when there are interval-censored event times.

Interval-censored data can arise in competing risks. For example, when a competing risk is related to a chronic disease whose symptoms progress slowly, the onset of the disease can only be confirmed to have occurred within a time interval. In this article, we consider semi-parametric CSH Cox models, where the observed event times from competing risks are partly interval-censored (e.g. Kim⁴ for partly interval censoring), which means these observed times can include competing risks event times, as well as left, right and interval censoring times. When partly interval censoring arises, one may opt to maximize the full log-likelihood to estimate the model parameters. Its computation, however, is intensive⁵ and hence will not be useful in practice without improved computational efficiency.

In this article, we will develop a Gaussian-quadrature based fast maximum penalized likelihood (MPL) method for fitting CSH Cox models, where the unknown baseline hazards are approximated using M-splines. Our proposed method provides estimates of both the regression coefficients and approximated baseline hazards, where the non-negative constraint on the baseline hazards is respected. In this method, penalty functions are employed for two reasons: to regularize estimates of the baseline hazards and to ease the requirement for optimal location and number of knots⁶ needed for approximating the baseline hazards.

For competing risks data under interval censoring, there exist a few likelihood-based methods, particularly Li⁵ and Joly et al.⁷ The latter is in fact for the illness-death model without covariates, but it can be applied to competing risks. The method by Li⁵ allows covariates, but its optimization is carried out through an existing R nonlinear optimization function “nlminb,” which can be inefficient and slow. Another issue with Li⁵ is that it ignores the fact that some of the non-negativity constraints on the parameters of the approximated baseline hazards can be active, leading to possible undesirable consequences such as negative variances. Our proposed method properly addresses the issue of active constraints.

The remainder of this article is organized as follows. Section 2 defines the likelihood under competing risks CSH Cox models with partly interval censoring. Section 3 provides the details on the computation of the MPL estimates of regression coefficients and baseline hazards, and it also discusses how to automatically select the optimal smoothing parameter. Asymptotic results of the MPL estimates are presented in Section 4. Section 5 contains simulation results and also an application of our method to the ASPREE study. Finally, concluding remarks are given in Section 6.

2. Model and likelihood

Partly interval censoring (e.g. Kim⁴) refers to the situation where observed survival times are flexible, and they can include event times, and also left, right or interval censoring times. In this article, we study competing risks CSH models with partly interval censoring. Particularly, we propose a Gaussian quadrature based penalized likelihood approach to estimate semiparametric CSH models, where both regression coefficients and baseline hazards are estimated.

Let $Y_{i}$ denote the event time for individual $i$, but due to censoring, $Y_{i}$ may not be fully observed. Partly interval censoring means that $Y_{i}$ can be fully observed or left-, right-, or interval-censored. For competing risks, when an event has occurred (i.e. an event time or left or interval censoring time is observed), the event is then linked with only one of the risks. When an event does not occur (i.e. a right censoring time is observed), there is no link with any of the risks. Therefore, we can define an indicator variable $κ_{i r}$ to link an event with risk $r$: $κ_{i r} = 1$ if $Y_{i}$ is associated with risk $r$ and $κ_{i r} = 0$ otherwise. Let the vector $κ_{i} = (κ_{i 1}, \dots, κ_{i g})^{⊤}$. When $Y_{i}$ is right-censored, then $κ_{i} = 0_{g \times 1}$, a zero vector with length $g$.

To accommodate interval censoring, we define $t_{i}^{L}$ and $t_{i}^{R}$ (where $t_{i}^{L} \leq t_{i}^{R}$ ) as a pair of times for each $i$ such that $Y_{i} \in [t_{i}^{L}, t_{i}^{R}]$ . If $t_{i}^{L} = t_{i}^{R} = t_{i}$ , then $Y_{i}$ is fully observed with $Y_{i} = t_{i}$ . If $t_{i}^{L} = 0$ and $t_{i}^{R}$ is finite, then $Y_{i}$ is left censored at $t_{i}^{R}$ . If $t_{i}^{R} = + \infty$ and $t_{i}^{L} \neq 0$ , then $Y_{i}$ is right censored at $t_{i}^{L}$ . For other cases, $Y_{i}$ is interval censored. We also need indicators for censoring types. Let $δ_{i}, δ_{i}^{R}, δ_{i}^{L}$ , and $δ_{i}^{I}$ denote event, right, left, and interval censoring, respectively. Clearly, $δ_{i} + δ_{i}^{R} + δ_{i}^{L} + δ_{i}^{I} = 1$ . We need both $δ_{i}, δ_{i}^{R}, δ_{i}^{L}, δ_{i}^{I}$ , and $κ_{i}$ to define the event, censoring, and the associated risk when the event occurs. Let $κ_{i 0} = 1 - \sum_{r = 1}^{g} κ_{i r}$ . Clearly, a right-censored event can be represented by either $κ_{i 0} = 1$ or $δ_{i}^{R} = 1$ .
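The mapping from an observed pair $(t_{i}^{L}, t_{i}^{R})$ to the four censoring indicators can be illustrated with a minimal Python sketch. The function name `censoring_indicators` and the example observations are hypothetical and are not taken from the article; the rules are applied in the order stated above.

```python
import math

def censoring_indicators(t_left, t_right):
    """Return (delta, delta_R, delta_L, delta_I) for one observation.

    Equal endpoints mean an exact event time, t_left == 0 with a finite
    t_right is left censoring, an infinite t_right with t_left != 0 is
    right censoring, and every other case is interval censoring.
    """
    if t_left > t_right:
        raise ValueError("t_left must not exceed t_right")
    if t_left == t_right:
        return (1, 0, 0, 0)
    if t_left == 0 and math.isfinite(t_right):
        return (0, 0, 1, 0)
    if math.isinf(t_right) and t_left != 0:
        return (0, 1, 0, 0)
    return (0, 0, 0, 1)

# Hypothetical observations: event, left, right and interval censored.
observations = [(2.5, 2.5), (0.0, 1.2), (3.0, math.inf), (1.0, 4.0)]
indicators = [censoring_indicators(lo, hi) for lo, hi in observations]
```

Each returned tuple contains exactly one 1, which mirrors the identity $δ_{i} + δ_{i}^{R} + δ_{i}^{L} + δ_{i}^{I} = 1$.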

For each individual $i$ , the observed values can be denoted as $(t_{i}^{L}, t_{i}^{R}, δ_{i}, δ_{i}^{R}, δ_{i}^{L}, δ_{i}^{I}, κ_{i}^{⊤}, X_{i}^{⊤})$ , where $i = 1, \dots, n$ ( $n$ denotes the sample size), and $X_{i}$ is a vector for values of $p$ covariates. Let $h_{r} (t)$ denote the CSH for risk $r$ , which quantifies the chance of an event with risk $r$ occurring at time $t$ when an event of any risk has not yet occurred up to time $t$ . In this article, we consider semi-parametric CSH Cox regression models. Thus, for each risk $r$ , $r = 1, \dots, g$ , we consider the following CSH Cox model:

h_{r} (t | X_{i}) = h_{0 r} (t) \exp \{X_{i}^{⊤} β_{r}\}

(1)

where $β_{r} = (β_{r 1}, \dots, β_{r p})^{⊤}$ is a $p$-vector of regression coefficients of risk $r$ and $h_{0 r} (t) \geq 0$ is an unspecified baseline hazard function for risk $r$.

Let $H_{r} (t) = \int_{0}^{t} h_{r} (w) d w$ denote the cumulative cause-specific hazard at time $t$. From all $h_{r} (t)$, the overall hazard for an event is given by $h (t) = \sum_{r = 1}^{g} h_{r} (t)$, and the overall survival can be computed from $S (t) = \prod_{r = 1}^{g} S_{r} (t)$, where $S_{r} (t) = \exp \{- H_{r} (t)\}$.

In competing risks analysis, a cumulative incidence function (CIF), which quantifies the probability of an event with a particular risk occurring before a time $t$ , is often computed and reported. Let ${\bar{F}}_{r} (t)$ denote the CIF for competing risk $r$ at time $t$ , and this function is given by

{\bar{F}}_{r} (t) = \int_{0}^{t} h_{r} (w) S (w) d w

(2)

The cumulative distribution function for the event at time $t$ is $F (t) = \sum_{r = 1}^{g} {\bar{F}}_{r} (t)$, implying that $S (t)$ can also be recovered from ${\bar{F}}_{r} (t)$ through $S (t) = 1 - \sum_{r = 1}^{g} {\bar{F}}_{r} (t)$.
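The relation between the CIFs in (2) and the overall survival can be checked numerically. The sketch below is only an illustration: it assumes two hypothetical constant cause-specific hazards (not values from the article), for which the CIF has the closed form $(h_{r} / h) (1 - S (t))$, and it uses a simple composite midpoint rule rather than the quadrature adopted later in the article.

```python
import math

# Hypothetical constant cause-specific hazards for g = 2 risks (illustration only).
hazards = [0.3, 0.1]

def overall_survival(t):
    """S(t) = prod_r exp(-H_r(t)), with H_r(t) = h_r * t for constant hazards."""
    return math.exp(-sum(hazards) * t)

def cif(r, t, n_steps=2000):
    """Composite midpoint-rule approximation of int_0^t h_r(w) S(w) dw."""
    step = t / n_steps
    total = 0.0
    for k in range(n_steps):
        w = (k + 0.5) * step
        total += hazards[r] * overall_survival(w)
    return total * step

t = 2.0
cifs = [cif(r, t) for r in range(len(hazards))]
# Closed form available only because the hazards are constant.
closed_form = [h / sum(hazards) * (1.0 - overall_survival(t)) for h in hazards]
```

Under these assumptions, `sum(cifs)` agrees with `1 - overall_survival(t)` up to the integration error.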

Since each $h_{0 r} (t)$ is non-parametric, it is an infinite-dimensional parameter. Estimation of $h_{0 r} (t)$ from a finite number of observations is an ill-posed problem. One common approach to alleviate this issue is to approximate $h_{0 r} (t)$ using a finite number of non-negative basis functions, where the number of basis functions is allowed to vary with the sample size at a slow rate. This method is known as the method of sieves.⁸ Using basis functions, the approximated $h_{0 r} (t)$ is given by

h_{0 r} (t) = \sum_{u = 1}^{m_{r}} θ_{r u} ψ_{r u} (t)

(3)

where $ψ_{r u} (t) \geq 0$ are basis functions (e.g. Ma et al.⁹ and Joly et al.¹⁰). Note both $m_{r}$ and $ψ_{r u} (t)$ are associated with risk $r$. Some examples of basis functions include M-splines (e.g. Ramsay¹¹), Gaussian densities (e.g. Ma et al.⁹ and Li and Ma¹²) or even indicator functions. The conventional non-parametric maximum likelihood estimate (NPMLE) of a baseline hazard (e.g. Anderson et al.¹³) is equivalent to adopting indicator basis functions using event times. We can now enforce the constraint $h_{0 r} (t) \geq 0$ more easily by requiring $θ_{r u} \geq 0$ for all $u$. We comment that although a polynomial function can be used to approximate $h_{0 r} (t)$, it is not ideal when compared with splines. To capture complex nonlinearities in a baseline hazard, a polynomial may require a high order, making it less flexible. In contrast, splines are more efficient as they utilize low-order polynomials between knots, allowing them to capture local nonlinear patterns more effectively.
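To make the basis expansion in (3) concrete, the following Python sketch uses indicator basis functions, the simplest of the choices listed above, rather than the M-splines adopted in this article. The knots and coefficients are hypothetical. It shows that non-negative coefficients give a non-negative baseline hazard, and that the cumulative baseline hazard follows from the cumulative basis functions.

```python
# Hypothetical knots and coefficients for one risk (illustration only).
knots = [0.0, 1.0, 2.0, 4.0]   # three intervals: [0, 1), [1, 2), [2, 4)
theta = [0.2, 0.0, 0.5]        # non-negative; the middle coefficient is on the boundary

def basis(u, t):
    """Indicator basis psi_u(t): 1 on [knots[u], knots[u + 1]), else 0."""
    return 1.0 if knots[u] <= t < knots[u + 1] else 0.0

def cumulative_basis(u, t):
    """Psi_u(t) = int_0^t psi_u(w) dw: the part of interval u lying below t."""
    return min(max(t - knots[u], 0.0), knots[u + 1] - knots[u])

def baseline_hazard(t):
    """h_0(t) = sum_u theta_u * psi_u(t)."""
    return sum(th * basis(u, t) for u, th in enumerate(theta))

def cumulative_baseline_hazard(t):
    """H_0(t) = sum_u theta_u * Psi_u(t)."""
    return sum(th * cumulative_basis(u, t) for u, th in enumerate(theta))
```

For example, `cumulative_baseline_hazard(3.0)` adds 0.2 from the first interval, nothing from the second, and 0.5 from one unit of the third.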

Let $θ_{r} = (θ_{r 1}, \dots, θ_{r m_{r}})^{⊤}$ . For each $i$ , we assume the censoring interval $(t_{i}^{L}, t_{i}^{R})$ is independent of the event time $Y_{i}$ given the covariates. Let vectors $β = (β_{1}^{⊤}, \dots, β_{g}^{⊤})^{⊤}$ and $θ = (θ_{1}^{⊤}, \dots, θ_{g}^{⊤})^{⊤}$ . The length of $β$ is $g p$ and the length of $θ$ is $\sum_{r = 1}^{g} m_{r}$ . The log-likelihood from the partly interval censored competing risks data is

\begin{aligned} l (β, θ) = \sum_{i = 1}^{n} \sum_{r = 1}^{g} \Big\{ & κ_{i r} [δ_{i} \log h_{r} (t_{i}) + δ_{i}^{L} \log ({\bar{F}}_{r} (t_{i}^{R})) + δ_{i}^{I} \log ({\bar{F}}_{r} (t_{i}^{R}) - {\bar{F}}_{r} (t_{i}^{L}))] \\ & - (δ_{i} (1 - κ_{i 0}) + κ_{i 0}) H_{r} (t_{i}) - δ_{i}^{R} H_{r} (t_{i}^{L}) \Big\} \end{aligned}

(4)

When there are neither left nor interval censoring times (so only events and right censoring), expression (4) will contain only $h_{r} (t)$ and $H_{r} (t)$, and the log-likelihood is fully separated for different $r$. Thus, in this context, (4) can be maximized separately for each risk $r$, similar to the partial likelihood method for competing risks, by setting the event times from other risks as the right censoring times. However, when left or interval censoring is present, the log-likelihood in (4) is in general difficult to maximize. This is because left or interval censoring causes the log-likelihood to include the CIFs ${\bar{F}}_{r} (t)$, and each CIF involves the cause-specific survival functions of all the risks, which means the parameters of one competing risk depend on the parameters of all the other risks. In the next section, we will discuss how to estimate $β$ and $θ$ by maximizing a penalized log-likelihood function where penalty functions are used to restrain the $θ$ estimate.

3. Penalized likelihood estimation

3.1. Computation

We wish to estimate parameters $β$ and $θ$ by the maximum penalized likelihood (MPL) approach, where penalty functions are used for two reasons: (i) to smooth the estimates of the baseline hazards $h_{0 r} (t)$; and (ii) to make the $h_{0 r} (t)$ estimates less sensitive to the number and location of knots. Specifically, the MPL estimates of $β$ and $θ$ are given by

(\hat{β}, \hat{θ}) = \underset{β, θ}{\arg\max} \left\{ Φ (β, θ) = l (β, θ) - \sum_{r = 1}^{g} λ_{r} J_{r} (θ_{r}) \right\}

(5)

subject to $θ \geq 0$. Here, $λ_{r} > 0$ are smoothing parameters and $J_{r} (\cdot)$ are penalty functions used to restrain $θ_{r}$. A common choice of the penalty $J_{r} (\cdot)$ is the roughness penalty,¹⁴ given by

J_{r} (θ_{r}) = \int (h_{0 r}^{″} (t))^{2} d t = θ_{r}^{⊤} R_{r} θ_{r}

(6)

where matrix $R_{r}$ has the dimension of $m_{r} \times m_{r}$ and its $(u, v)$-th element is $\int ψ_{r u}^{″} (t) ψ_{r v}^{″} (t) d t$.

This is a difficult constrained optimization problem. Employing a general-purpose R or MATLAB optimization solver to find the solutions is not ideal. This is because (i) these solvers can be limited in the number of constraints (and thus the size of $θ$) they can handle; and (ii) a general-purpose solver can be slow, leading to unnecessarily long computation times. Li,⁵ for example, reported that when sample size $n = 500$ it took close to 10 min for the proposed R program to find the constrained optimal solution under a fixed smoothing value, and this computational time extended to 7 h when leave-one-out cross validation was used to select the smoothing parameter. In this article, we aim to develop a practically feasible computation procedure to estimate $β$, $θ$ and also to estimate the smoothing values $λ_{1}, \dots, λ_{g}$. We first discuss how to estimate $β$ and $θ$ (where $θ \geq 0$) when the smoothing parameters are fixed. Then, in Section 3.2, we will develop a marginal likelihood based approach to estimate the smoothing parameters.

For given smoothing parameters $λ_{1}, \dots, λ_{g}$ , the Karush-Kuhn-Tucker (KKT) necessary conditions for the constrained optimization problem (5) are defined by the following set of equations: for $r = 1, \dots, g$ , $u = 1, \dots, m_{r}$ , and $j = 1, \dots, p$ ,

\begin{aligned} \frac{\partial Φ (β, θ)}{\partial β_{r j}} & = 0 \\ \frac{\partial Φ (β, θ)}{\partial θ_{r u}} & = 0, \quad \text{if } θ_{r u} > 0 \\ \frac{\partial Φ (β, θ)}{\partial θ_{r u}} & < 0, \quad \text{if } θ_{r u} = 0 \end{aligned}

We solve these KKT equations using an alternating algorithm, in which $β$ and $θ$ are updated alternately. Specifically, we use a quasi-Newton step with line search to update $β$ in each iteration, and then adopt a multiplicative-iterative (MI) step (e.g. Chan and Ma¹⁵) with line search to update $θ$, ensuring that the updated $θ$ values are non-negative.

The derivative of $Φ$ with respect to $β$ or $θ$ (details of these derivatives are given in Supplemental Material of this article) involves the derivative of ${\bar{F}}_{q} (t_{i})$ , $q = 1, \dots, g$ , with respect to $β$ or $θ$ , where the latter derivatives are given by

\begin{aligned} \frac{\partial {\bar{F}}_{q} (t_{i})}{\partial β_{r j}} & = A_{q r i} (t_{i}) x_{i j} \end{aligned}

(7)

\begin{aligned} \frac{\partial {\bar{F}}_{q} (t_{i})}{\partial θ_{r u}} & = (B_{q r u i 1} (t_{i}) - B_{q r u i 2} (t_{i})) e^{X_{i} β_{r}} \end{aligned}

(8)

Within the expressions in (7) and (8), we have

\begin{aligned} A_{q r i} (t) & = \int_{0}^{t} h_{q} (w | X_{i}) S (w | X_{i}) (1_{\{q = r\}} - H_{r} (w | X_{i})) d w \\ B_{q r u i 1} (t) & = \int_{0}^{t} 1_{\{q = r\}} ψ_{r u} (w) S (w | X_{i}) d w \\ B_{q r u i 2} (t) & = \int_{0}^{t} h_{q} (w | X_{i}) Ψ_{r u} (w) S (w | X_{i}) d w \end{aligned}

where $Ψ_{r u} (t) = \int_{0}^{t} ψ_{r u} (w) d w$, representing the cumulative basis function. Note that the number of these quantities can be large, depending on $n$, $g$, and $m$. More specifically, the sizes of $A_{q r i} (t_{i})$, $B_{q r u i 1} (t_{i})$, and $B_{q r u i 2} (t_{i})$ are, respectively, $g \times g \times n$, $g \times m \times n$, and $g \times g \times m \times n$. Unless there are closed-form expressions for the integrals, evaluation of these quantities is a tremendous computational burden, especially as they need to be re-evaluated each time $β$ and $θ$ are updated. Therefore, to design a fast algorithm, these integrals must be approximated so that fast evaluations become possible. To this end, we adopt Gaussian quadrature approximations; see, for example, Golub and Welsch.¹⁶

Consider a general integral $c = \int_{0}^{a} e (w) d w .$ Let $κ_{1}, \dots, κ_{V}$ be predetermined quadrature points and $w_{1}, \dots, w_{V}$ be the corresponding weights, then the Gaussian quadrature approximation to this integral is given by

c \approx \sum_{v = 1}^{V} e (κ_{v}) w_{v}

(9)

Computation of (9) is typically straightforward and yields accurate results. The quadrature points and weights are determined by the chosen approach, such as the Legendre-Gauss quadrature employed in this article.
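The approximation in (9) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in this article: it hard-codes the standard 5-point Legendre-Gauss nodes and weights on $[- 1, 1]$, maps them to $[0, a]$, and uses a hypothetical integrand $h (w) S (w)$ with constant hazard 0.5 so that the exact value is known.

```python
import math

# Standard 5-point Gauss-Legendre nodes and weights on [-1, 1].
NODES = [-0.9061798459386640, -0.5384693101056831, 0.0,
         0.5384693101056831, 0.9061798459386640]
WEIGHTS = [0.2369268850561891, 0.4786286704993665, 0.5688888888888889,
           0.4786286704993665, 0.2369268850561891]

def gauss_legendre(e, a):
    """Approximate int_0^a e(w) dw by mapping the nodes from [-1, 1] to [0, a]."""
    half = a / 2.0
    return half * sum(wt * e(half * (x + 1.0)) for x, wt in zip(NODES, WEIGHTS))

# Hypothetical integrand h(w) * S(w) with constant hazard 0.5,
# whose exact integral over [0, a] is 1 - exp(-0.5 * a).
def integrand(w):
    return 0.5 * math.exp(-0.5 * w)

a = 2.0
approx = gauss_legendre(integrand, a)
exact = 1.0 - math.exp(-0.5 * a)
```

Because only $V$ function evaluations are needed per integral, the same nodes can be reused for every individual and every basis function, which is what makes repeated re-evaluation inside an iterative algorithm affordable.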

Before explaining the algorithm, we first introduce some notation. In this article, we let $a^{(k)}$ represent a parameter estimate at iteration $k$. Let $[a]^{+} = \max \{0, a\}$ and $[a]^{-} = \min \{0, a\}$. We employ an alternating algorithm where each iteration comprises a pseudo Newton step to update $β$ and an MI step to update $θ$. Thus, we call this the Newton–MI algorithm. Convergence properties of this algorithm can be found in Chan and Ma.¹⁵ In iteration $k + 1$, $β_{r}$ and $θ_{r}$, $r = 1, \dots, g$, are updated by pseudo Newton and MI schemes, respectively:

\begin{aligned} β_{r}^{(k + 1)} & = β_{r}^{(k)} + ω_{1 r}^{(k)} [X^{T} V_{r} ({\tilde{β}}_{[r]}^{(k)}, θ^{(k)}) X]^{- 1} \frac{\partial Φ ({\tilde{β}}_{[r]}^{(k)}, θ^{(k)})}{\partial β_{r}} \end{aligned}

(10)

\begin{aligned} θ_{r}^{(k + 1)} & = θ_{r}^{(k)} + ω_{2 r}^{(k)} S_{r} (β^{(k + 1)}, {\tilde{θ}}_{[r]}^{(k)}) \frac{\partial Φ (β^{(k + 1)}, {\tilde{θ}}_{[r]}^{(k)})}{\partial θ_{r}} \end{aligned}

(11)

where $ω_{1 r}$ and $ω_{2 r}$ are the line search step sizes. In (10), $V_{r} (β, θ)$ is a diagonal matrix given by

\begin{aligned} V_{r} (β, θ) = diag \Big( & (δ_{i} (1 - κ_{i 0}) + κ_{i 0}) {\tilde{H}}_{r} (t_{i}) + δ_{i}^{L} \sum_{q} κ_{i q} \frac{A_{q r i}^{2} (t_{i})}{{\bar{F}}_{q}^{2} (t_{i})} \\ & + δ_{i}^{I} \sum_{q} κ_{i q} \frac{(A_{q r i} (t_{i}^{R}) - A_{q r i} (t_{i}^{L}))^{2}}{({\bar{F}}_{q} (t_{i}^{R}) - {\bar{F}}_{q} (t_{i}^{L}))^{2}} \Big) \end{aligned}

and ${\tilde{β}}_{[r]}^{(k)} = (β_{< r}^{(k + 1) ⊤}, β_{\geq r}^{(k) ⊤})^{⊤}$, where $β_{< r} = (β_{1}^{⊤}, \dots, β_{r - 1}^{⊤})^{⊤}$ and $β_{\geq r} = (β_{r}^{⊤}, \dots, β_{g}^{⊤})^{⊤}$. In (11), matrix $S_{r} (β, θ)$ is also a diagonal matrix, given by $S_{r} (β, θ) = diag (θ_{r u} / (τ_{r u} + ε))$, where

\begin{aligned} τ_{r u} = \sum_{i} e^{X_{i} β_{r}} \Big( & (δ_{i} (1 - κ_{i 0}) + κ_{i 0}) Ψ_{r u} (t_{i}) + δ_{i}^{L} \sum_{q} κ_{i q} \frac{B_{q r u i 2} (t_{i})}{{\bar{F}}_{q} (t_{i})} \\ & + δ_{i}^{I} \sum_{q} κ_{i q} \frac{B_{q r u i 1} (t_{i}^{L}) + B_{q r u i 2} (t_{i}^{R})}{{\bar{F}}_{q} (t_{i}^{R}) - {\bar{F}}_{q} (t_{i}^{L})} \Big) + [R_{r} θ_{r}]^{+} \end{aligned}

and ${\tilde{θ}}_{[r]}^{(k)}$ is defined similarly to ${\tilde{β}}_{[r]}^{(k)}$. It is not difficult to verify that when adopting the MI scheme to update $θ$, for any $r$, if $θ_{r}^{(k)} \geq 0$ then we have $θ_{r}^{(k + 1)} \geq 0$. From the above expressions, we can see that fast evaluation of the $A_{q r i}$, $B_{q r u i 1}$, $B_{q r u i 2}$, and ${\bar{F}}_{q}$ function values is a key requirement for our algorithm (or any computing procedure to solve this problem) to be practically useful.
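The non-negativity property of the MI update in (11) can be illustrated on a much simpler problem than the competing risks likelihood. The Python sketch below is a toy analogue only: it assumes a hypothetical non-negative design matrix `A` and counts `y`, an unpenalized Poisson-type objective, and a fixed step size of 1. As in (11), the gradient is split into a positive part and a negative part, and the size of the negative part plays the role of $τ_{r u}$.

```python
import math

# Hypothetical non-negative design matrix and counts (illustration only).
A = [[1.0, 0.2, 0.1],
     [0.3, 1.0, 0.2],
     [0.1, 0.4, 1.0],
     [0.5, 0.5, 0.5]]
y = [4.0, 1.0, 6.0, 5.0]
m = len(A[0])

def fitted(theta):
    return [sum(a * th for a, th in zip(row, theta)) for row in A]

def objective(theta):
    """Toy concave objective: sum_i (y_i log mu_i - mu_i) with mu = A theta."""
    return sum(yi * math.log(mu) - mu for yi, mu in zip(y, fitted(theta)))

def mi_step(theta, omega=1.0, eps=1e-8):
    """MI update: theta_u + omega * theta_u / (tau_u + eps) * gradient_u."""
    mu = fitted(theta)
    new_theta = []
    for u in range(m):
        positive = sum(A[i][u] * y[i] / mu[i] for i in range(len(A)))
        tau = sum(A[i][u] for i in range(len(A)))  # size of the negative gradient part
        gradient = positive - tau
        new_theta.append(theta[u] + omega * theta[u] / (tau + eps) * gradient)
    return new_theta

theta = [1.0, 1.0, 1.0]
start_value = objective(theta)
for _ in range(100):
    theta = mi_step(theta)
end_value = objective(theta)
```

With step size 1 the update equals `theta[u] * (positive + eps) / (tau + eps)`, a product of non-negative factors, so a non-negative starting value can never become negative. Smaller step sizes give a convex combination of the old and new values and preserve the same property.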

We comment that the line search step sizes $ω_{1}^{(k)}$ and $ω_{2}^{(k)}$ can be computed quickly using, for example, Armijo's inexact line search method (e.g. Luenberger¹⁷). Following the arguments of Chan and Ma,¹⁵ it can be demonstrated that, under certain regularity conditions, this Newton–MI algorithm produces a convergent sequence $\{(β^{(k)}, θ^{(k)})\}$, and moreover, the solution obtained at convergence satisfies the KKT conditions.
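A generic Armijo backtracking rule for a maximization problem might look as follows. This is a minimal sketch with hypothetical names and tuning constants (`shrink`, `c`, `max_halvings`), applied to a toy concave function; it is not the code used for the results in this article.

```python
def armijo_step(objective, point, direction, gradient,
                initial=1.0, shrink=0.5, c=1e-4, max_halvings=30):
    """Backtracking (Armijo) step size for maximizing `objective`.

    Accepts the first omega satisfying
        objective(point + omega * direction)
            >= objective(point) + c * omega * <gradient, direction>.
    """
    slope = sum(g * d for g, d in zip(gradient, direction))
    base = objective(point)
    omega = initial
    for _ in range(max_halvings):
        trial = [p + omega * d for p, d in zip(point, direction)]
        if objective(trial) >= base + c * omega * slope:
            return omega
        omega *= shrink
    return 0.0  # no acceptable step found; keep the current point

# Toy concave objective with its maximum at (1, -2).
def toy(x):
    return -(x[0] - 1.0) ** 2 - 10.0 * (x[1] + 2.0) ** 2

x = [0.0, 0.0]
grad = [2.0, -40.0]  # gradient of `toy` at x, also used as the search direction
omega = armijo_step(toy, x, grad, grad)
x_new = [xi + omega * gi for xi, gi in zip(x, grad)]
```

Only objective evaluations are needed inside the loop, so the cost of the line search is governed by how quickly the objective itself can be evaluated.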

3.2. Automatic smoothing parameter selection

Our penalized likelihood approach includes an essential feature: automatic selection of the smoothing parameters. This procedure is especially valuable for users who may not be familiar with the penalized likelihood method. The smoothing parameter selection method is devised by treating the penalty functions as log-prior density functions and viewing $λ_{r}$ as related to the variances of these prior distributions; these variances can then be estimated by maximizing a marginal likelihood.¹⁸,¹⁹ The roughness penalty on $h_{0 r} (t)$ can be expressed as a quadratic function of $θ_{r}$; see (6). Thus, we can relate the penalty $λ_{r} J_{r} (θ_{r})$ to the log density of $N (0_{m \times 1}, σ_{r}^{2} R_{r}^{- 1})$, where $σ_{r}^{2} = 1 / (2 λ_{r})$. The corresponding log-posterior density is

l_{p} (β, θ) = - \frac{m}{2} \sum_{r = 1}^{g} \log σ_{r}^{2} + l (β, θ) - \sum_{r = 1}^{g} \frac{1}{2 σ_{r}^{2}} θ_{r}^{T} R_{r} θ_{r}

(12)

The log-marginal likelihood for $σ_{1}^{2}, \dots, σ_{g}^{2}$ (after integrating out $β$ and $θ$) is

l_{m} (σ_{1}^{2}, \dots, σ_{g}^{2}) = - \frac{m}{2} \sum_{r = 1}^{g} \log σ_{r}^{2} + \log \int \exp (l (β, θ) - \sum_{r = 1}^{g} \frac{1}{2 σ_{r}^{2}} θ_{r}^{T} R_{r} θ_{r}) d β d θ

(13)

To overcome the infeasibility of directly maximizing the high-dimensional integral in equation (13), we employ Laplace’s method as an approximation technique. Let $\hat{β}$ and $\hat{θ}$ denote, respectively, the $β$ and $θ$ values maximizing $l_{p} (β, θ)$ with fixed $σ_{r}^{2}$ values (which is equivalent to maximizing $Φ (β, θ)$ with fixed smoothing values). Then, using Laplace’s approximation, we have

l_{m} (σ_{1}^{2}, \dots, σ_{g}^{2}) \approx - \frac{m}{2} \sum_{r = 1}^{g} \log σ_{r}^{2} + l (\hat{β}, \hat{θ}) - \frac{1}{2} \sum_{r = 1}^{g} [{\hat{θ}}_{r}^{T} R_{r} {\hat{θ}}_{r} / σ_{r}^{2} + \log | {\hat{G}}_{r} + Q (σ_{r}^{2}) |]

(14)

where ${\hat{G}}_{r} = - \partial^{2} l (\hat{β}, {\hat{θ}}_{r}) / \partial β \partial θ_{r}$ and matrix

Q (σ_{r}^{2}) = \begin{pmatrix} 0_{p \times p} & 0_{p \times m_{r}} \\ 0_{m_{r} \times p} & R_{r} / σ_{r}^{2} \end{pmatrix}

Maximizing (14) gives the solutions satisfying

{\hat{σ}}_{r}^{2} = \frac{{\hat{θ}}_{r}^{T} R_{r} {\hat{θ}}_{r}}{m_{r} - {\hat{ν}}_{r}}

(15)

where ${\hat{ν}}_{r}$ is given by

{\hat{ν}}_{r} = tr \{ ({\hat{G}}_{r} + Q ({\hat{σ}}_{r}^{2}))^{- 1} Q ({\hat{σ}}_{r}^{2}) \}

If active constraints of $θ_{r} \geq 0$ are taken into consideration, then ${\hat{ν}}_{r}$ can be modified using the technique specified in the next section, where a matrix $U_{r}$ is constructed, similar to (17) and with the dimension of $(m_{r} + p) \times (m_{r} + p - d_{r})$. Here, $d_{r}$ is the number of active constraints of $θ_{r} \geq 0$. The modified $ν_{r}$ is given by

{\hat{ν}}_{r} = tr \{ U_{r} (U_{r}^{T} ({\hat{G}}_{r} + Q ({\hat{σ}}_{r}^{2})) U_{r})^{- 1} U_{r}^{T} Q ({\hat{σ}}_{r}^{2}) \}

We note that the estimate of $σ_{r}^{2}$ provided in equation (15) is only an approximate solution for maximizing the approximate marginal likelihood (14). These approximations will make it difficult to obtain asymptotic properties for ${\hat{σ}}_{r}^{2}$. However, this lack of asymptotic properties for the $σ_{r}^{2}$ estimate does not pose an issue, as our primary focus lies on the estimation of $β$ and $h_{0 r} (t)$, for which consistency results are provided in Theorem 1.

Note that both $β$ and $θ$ depend on all $σ_{r}^{2}$ , which means that their estimation requires different iterative procedures. Specifically, our algorithm involves inner and outer iterations. In each round of inner–outer iterations, $β$ and $θ$ are first updated in inner iterations using the Newton–MI algorithm described above with $σ_{r}^{2}$ (or equivalently $λ_{r}$ ) fixed at its current value. Then, with $β$ and $θ$ fixed at their current estimates, all $σ_{r}^{2}$ are updated using (15) in the outer iterations. This process continues until the degrees-of-freedom $ν_{r}$ are stabilized, which means that the differences between their values in consecutive iterations are less than 1. The simulation study reported in Section 5 reveals that this procedure converges quickly, where the M-splines basis functions are adopted.

Cross-validation (CV) is another commonly adopted approach for estimating the smoothing parameters, and it aims at minimizing prediction errors. As explained by Wood,¹⁹ for semiparametric generalized linear model estimation, marginal likelihood might be preferable to CV due to its resistance to overfitting, lower smoothing parameter variability, and reduced tendency towards multiple minima. Our experience reveals that the proposed approximate marginal likelihood approach has much less computational burden than the CV or the generalized CV method.

4. Large sample properties

In this section, we provide some large sample results for the MPL estimates of the CSH Cox model. Let $η = (θ^{⊤}, β^{⊤})^{⊤}$ and let $μ_{r n} = λ_{r} / n$ . The penalized log-likelihood in (5) can be expressed as

Φ (η) = \sum_{i = 1}^{n} (l_{i} (η) - \sum_{r = 1}^{g} μ_{r n} J_{r} (η))

(16)

where $l_{i} (η)$ can be obtained from (4) and $J_{r} (η) = J_{r} (θ_{r})$. When maximizing $Φ (η)$ subject to $θ \geq 0$, it is common for some estimated elements of $θ$ to be zero, which means they are active constraints. Ignoring active constraints may result in negative asymptotic variances.

4.1. Asymptotics when all $m_{r} \to \infty$ and all $μ_{r n} \to 0$

Consistency of the MPL estimates is considered here under the assumption that, when $n \to \infty$, all $m_{r} \to \infty$ (but $m_{r} / n \to 0$) and all $μ_{r n} \to 0$. We assume specifically that $μ_{r n} = o (n^{- 1 / 2})$. For a given sample of size $n$, let $\hat{θ}$ be the MPL estimate of $θ$ from this sample, so that the estimated baseline hazard for cause $r$ is ${\hat{h}}_{0 r} (t) = \sum_{u = 1}^{m_{r}} {\hat{θ}}_{r u} ψ_{r u} (t)$.

Consistency of $\hat{β}$ and ${\hat{h}}_{0 r} (t)$ ( $r = 1, \dots, g$ ) can be established using similar arguments as in the Supplemental Material of Ma et al.⁹

Theorem 1
Assume Assumptions A1–A3 stated in the Supplemental Material hold and assume all $h_{0 r} (t)$ have the first $c \geq 1$ derivatives. Let $a = \min_{i} \{t_{i}\}$ and $b = \max_{i} \{t_{i}\}$. Take $μ_{r n} = o (n^{- 1 / 2})$ and $m_{r} = n^{ρ_{r}}$, where $1 / (2 (1 + c)) \leq ρ_{r} \leq 1 / (2 c)$. Then, when $n \to \infty$:
$‖ \hat{β} - β_{0} ‖ \to 0$.

For all $r$, $\sup_{t \in [a, b]} | {\hat{h}}_{0 r} (t) - h_{0 r}^{0} (t) | \overset{a . s .}{\to} 0$.
Here, $β_{0}$ and $h_{0 r}^{0} (t)$ denote the true regression coefficient vector and true baseline hazard for risk $r$, respectively.

The proofs of these results can be given directly by following the proofs that appear in the Supplemental Material of Ma et al.⁹

The results in Theorem 1 establish consistency for both $\hat{β}$ and ${\hat{h}}_{0 r} (t)$, but they have limited practical usefulness because (i) they do not specify the asymptotic covariance of ${\hat{h}}_{0 r} (t)$, and (ii) in data analysis, $n$ is always finite, so the $m_{r}$ are never infinite and the $μ_{r n}$ are never zero.

In the next section, we develop a large sample normality property for $\hat{β}$ and $\hat{θ}$, but with fixed $m_{r}$’s and nonzero $μ_{r n}$’s. This result also addresses the problem of active constraints.

4.2. Large sample normality when all $m_{r}$ are finite

In practice, since the $m_{r}$ ’s are never infinite, it is necessary to develop large sample normality when $m_{r}$ ’s are finite. Let $β_{0}$ and $θ_{0}$ be the true parameters corresponding to the fixed $m_{r}$ ’s and $λ_{r}$ ’s. For each constraint $θ_{r} \geq 0$ , it is possible that some components of the MPL estimate of $θ_{r}$ are actively constrained. Our large sample normality result will be able to accommodate active constraints. Asymptotic properties for constrained maximum likelihood estimates under parametric models are available in, for example, Moore et al.²⁰ We will follow this reference in the following discussions.

Assume, without loss of generality, that the first $d_{r}$ elements of each $θ_{r} \geq 0$ constraint are active in the MPL solution. Define matrix $U_{r}$ as

\begin{aligned} U_{r} = [0_{(m_{r} + p - d_{r}) \times d_{r}}, I_{(m_{r} + p - d_{r}) \times (m_{r} + p - d_{r})}]^{⊤} \end{aligned}

(17)

where $0$ is a matrix of zeros and $I$ is an identity matrix. Note that $U_{r}^{⊤} U_{r} = I_{(m_{r} + p - d_{r}) \times (m_{r} + p - d_{r})}$. Let $U = diag (U_{1}, \dots, U_{g})$ be a block-diagonal matrix; then $U^{⊤} U$ is also an identity matrix with $\sum_{r = 1}^{g} m_{r} + g p - \sum_{r = 1}^{g} d_{r}$ diagonals. We need the following assumptions before giving the large sample normality result.

Theorem 2

Assume Assumptions A1, A3–A7 stated in the Supplemental Material hold. Assume in each $θ_{r} \geq 0$ there are $d_{r}$ active constraints, $r = 1, \dots, g$. Corresponding to these active constraints, the matrix $U_{r}$ is defined similarly to (17) and a block-diagonal matrix $U$ is constructed as above. Then, when $n$ is large, $\sqrt{n} (\hat{η} - η_{0})$ is approximately multivariate normal with $0$ mean vector and covariance matrix $\tilde{F} (η_{0})^{- 1} G (η_{0}) [\tilde{F} (η_{0})^{- 1}]^{⊤}$, where $\tilde{F} (η)^{- 1} = U (U^{⊤} F (η) U)^{- 1} U^{⊤}$.

Again, the proofs of these results can be obtained by directly following the proofs of the similar results that appear in the Supplemental Material of Ma et al.⁹

We remark that matrix $\tilde{F} (η)^{- 1}$ is easy to compute. This is because $U^{⊤} F (η) U$ is simply given by deleting the rows and columns of $F (η)$ where these row and column indices are determined by the positions of the active constraints. Then $\tilde{F} (η)^{- 1}$ is obtained by adding zeros to the inverse of $U^{⊤} F (η) U$ at the positions of the deleted rows and columns. In applications, the unknown $η_{0}$ can be replaced by the MPL estimates $\hat{η}$ . The simulation results reported in Section 5 demonstrate that standard errors of the MPL estimates are generally accurate.
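The delete-invert-pad recipe just described can be sketched in plain Python as follows. The matrix `F`, the set of active indices, and the helper names `invert` and `constrained_inverse` are hypothetical and serve only to illustrate the computation of $U (U^{⊤} F U)^{- 1} U^{⊤}$. Production code would normally rely on a numerical linear algebra library instead of the small Gauss-Jordan routine used here.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small dense matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def constrained_inverse(F, active):
    """Return U (U^T F U)^{-1} U^T, where `active` holds the constrained indices."""
    size = len(F)
    free = [k for k in range(size) if k not in active]
    reduced = [[F[i][j] for j in free] for i in free]   # delete rows and columns
    reduced_inv = invert(reduced)
    padded = [[0.0] * size for _ in range(size)]        # zeros at deleted positions
    for a, i in enumerate(free):
        for b, j in enumerate(free):
            padded[i][j] = reduced_inv[a][b]
    return padded

# Hypothetical 3 x 3 matrix F with the first parameter actively constrained.
F = [[4.0, 1.0, 0.5],
     [1.0, 2.0, 0.0],
     [0.5, 0.0, 1.0]]
F_tilde_inv = constrained_inverse(F, active={0})
```

The row and column of the actively constrained parameter are zero in the result, so that parameter contributes no variance to the sandwich covariance matrix in Theorem 2.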

5. Results

5.1. Simulation

Monte Carlo simulations were undertaken to compare the performance of the MPL method with an ad-hoc method commonly used in practice denoted as midpoint Cox. This involves applying a Cox regression to each risk separately, while right censoring all other risks. For interval censored observations, the midpoint of the interval was treated as an exact observed time, and for left censored data, the observed time was divided by 2. We generated survival times, $T$ , from the following Weibull distribution for two risks ( $r = 1, 2$ ).

\begin{aligned} h_{0 r} (t) = λ_{r} ρ_{r} t^{ρ_{r} - 1} \end{aligned}

(18)

Three studies were conducted. In study 1, $T$ was generated from the Weibull distribution with scale parameters $λ_{1} = 1$ and $λ_{2} = 0.5$, and the shape parameters $ρ_{1} = ρ_{2} = 3$. Covariates from two variables were simulated from $x_{i 1} \sim N (0, 1)$ and $x_{i 2} \sim N (0, 1)$, with coefficient values $β_{11} = - 1, β_{12} = 0.5$ for risk $1$, and $β_{21} = 1, β_{22} = - 0.5$ for risk $2$.

In study 2, $T$ was generated from the same scale and shape parameters, as well as covariates from study 1. The coefficient values were $β_{11} = 1, β_{12} = 0.5$ for risk $1$, and $β_{21} = 0.5, β_{22} = 0.5$ for risk $2$.

In study 3, $T$ was generated from $λ_{1} = 1$ and $λ_{2} = 0.5$ with shape parameter $ρ_{1} = ρ_{2} = 2$. The covariates from two variables were simulated from $x_{i 1} \sim N (0, 1)$ and a binary variable with $x_{i 2} \sim B (1, 0.5)$. The coefficient values were $β_{11} = - 1, β_{12} = 0.5$ for risk $1$, and $β_{21} = 1, β_{22} = - 0.5$ for risk $2$.

Observed survival times $(t_{i}^{L}, t_{i}^{R})$ for right censored, event, left censored, and interval censored observations were generated as follows. Let $T_{i}$ denote a simulated event time as described above and let $π^{E}$ be a chosen proportion of event times. Generate $U_{i}^{L}$ and $U_{i}^{E}$ from $unif (0, 1)$ and $U_{i}^{R}$ from $unif (U_{i}^{L}, 1)$ . Let $γ_{L}$ and $γ_{R}$ (with $γ_{L} \leq γ_{R}$ ) be two positive scalars; the width of the randomly generated intervals can be increased by selecting $γ$ values that are further apart. If $U_{i}^{E} < π^{E}$ , we have an event time, so that $t_{i}^{L} = t_{i}^{R} = T_{i}$ . Otherwise, if $γ_{L} U_{i}^{L} \leq T_{i} \leq γ_{R} U_{i}^{R}$ , we have interval censoring with $t_{i}^{L} = γ_{L} U_{i}^{L}$ and $t_{i}^{R} = γ_{R} U_{i}^{R}$ ; if $T_{i} < γ_{L} U_{i}^{L}$ , we have left censoring with $t_{i}^{L} = 0$ and $t_{i}^{R} = γ_{L} U_{i}^{L}$ ; and if $γ_{R} U_{i}^{R} < T_{i}$ , we have right censoring with $t_{i}^{L} = γ_{R} U_{i}^{R}$ and $t_{i}^{R} = \infty$ . The probability of risk group membership, $p_{i r}$ , is given by the following ratio of hazards:

p_{i r} = \frac{h_{r} (t)}{h (t)}

(19)

We generated a random variable $W_{i}$ from $unif (0, 1)$ , and if $W_{i} < p_{i 1}$ then $i$ belongs to risk group 1, else $i$ belongs to risk group 2.
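The data generation steps above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function name `simulate_study` is hypothetical, both covariates are drawn from $N (0, 1)$ as in studies 1 and 2, and it is assumed that $T_{i}$ is drawn by inverting the overall cumulative hazard, which has a closed form because $ρ_{1} = ρ_{2}$ in all three studies. Under the same assumption, the ratio in (19) does not depend on $t$ . Right-censored subjects are given risk type 0 because their risk type is not observed.

```python
import numpy as np

def simulate_study(n, lam, rho, beta, pi_e, gamma_l, gamma_r, rng):
    """Simulate partly interval-censored competing risks data (two risks).

    lam  : scale parameters (lambda_1, lambda_2)
    rho  : common Weibull shape (assumes rho_1 = rho_2)
    beta : 2 x 2 array, row r holds the coefficients of risk r + 1
    """
    lam = np.asarray(lam, dtype=float)
    beta = np.asarray(beta, dtype=float)
    x = rng.standard_normal((n, 2))
    # Cause-specific hazard: lam_r * rho * t**(rho - 1) * exp(x @ beta_r).
    rate = lam * np.exp(x @ beta.T)            # shape (n, 2)
    total = rate.sum(axis=1)
    # The overall cumulative hazard is total * t**rho; invert it at an Exp(1) draw.
    t = (rng.standard_exponential(n) / total) ** (1.0 / rho)
    # Risk membership: compare W_i with p_i1 = h_1(t) / h(t).
    p1 = rate[:, 0] / total
    risk = np.where(rng.uniform(size=n) < p1, 1, 2)

    u_e = rng.uniform(size=n)
    u_l = rng.uniform(size=n)
    u_r = rng.uniform(u_l, 1.0)                # unif(U_L, 1), elementwise
    lo, hi = gamma_l * u_l, gamma_r * u_r
    event = u_e < pi_e
    left = ~event & (t < lo)
    right = ~event & (t > hi)
    t_left = np.where(event, t, np.where(left, 0.0, np.where(right, hi, lo)))
    t_right = np.where(event, t, np.where(left, lo, np.where(right, np.inf, hi)))
    risk = np.where(right, 0, risk)            # risk type unobserved
    return x, t_left, t_right, risk

# Study 1 parameter values with gamma_L = 0.5 and gamma_R = 0.91.
rng = np.random.default_rng(2024)
x, t_left, t_right, risk = simulate_study(
    n=200, lam=[1.0, 0.5], rho=3.0,
    beta=[[-1.0, 0.5], [1.0, -0.5]],
    pi_e=0.05, gamma_l=0.5, gamma_r=0.91, rng=rng,
)
```

Whether this sketch reproduces the censoring proportions reported for each study depends on how closely the assumed event time generation matches the one actually used.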

The simulation study was designed to examine the estimation method under varying sample sizes and censoring proportions. We ran Monte Carlo simulations with sample sizes of $n = 200$ and $n = 1000$ . For all studies, the event proportion was fixed at $π^{E} = 5\%$ . Within each study, we examined two sets of censoring proportions for each sample size. The first set had an approximate right censoring proportion of $47.5\%$ and an approximate interval censoring (including left censoring) proportion of $47.5\%$ . These correspond to gamma values of $γ_{L} = 0.5$ and $γ_{R} = 0.91$ for study 1, $γ_{L} = 0.5$ and $γ_{R} = 1.47$ for study 2, and $γ_{L} = 0.5$ and $γ_{R} = 0.74$ for study 3. The second set had an approximate right censoring proportion of $20\%$ and an approximate interval censoring (including left censoring) proportion of $75\%$ . These correspond to gamma values of $γ_{L} = 0.5$ and $γ_{R} = 1.34$ for study 1, $γ_{L} = 0.5$ and $γ_{R} = 1.64$ for study 2, and $γ_{L} = 0.5$ and $γ_{R} = 1.27$ for study 3. Bias was calculated as the average difference between the parameter estimates and the true values. The asymptotic standard deviation (std asymp) was calculated as the average of the estimated standard deviations. The Monte Carlo standard deviation (std mc) was calculated as the standard deviation of the Monte Carlo parameter estimates. The coverage probability (cov prob) was calculated as the proportion of Monte Carlo runs in which the true parameter was contained within the confidence interval.
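The four summary measures defined above can be computed from the per-run estimates and standard errors as in the sketch below. The function name `summarise_runs` is hypothetical, and two details are assumptions rather than statements from the text: the confidence intervals are taken to be 95% Wald intervals (estimate ± 1.96 standard errors), and the Monte Carlo standard deviation uses the sample (n − 1) divisor.

```python
import numpy as np

def summarise_runs(estimates, std_errors, true_value, z=1.96):
    """Summarise Monte Carlo runs for a single regression coefficient."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "bias": float(np.mean(estimates - true_value)),
        "std asymp": float(np.mean(std_errors)),
        "std mc": float(np.std(estimates, ddof=1)),
        "cov prob": float(np.mean(covered)),
    }
```

A biased estimator with small standard errors yields narrow intervals centred in the wrong place, which is why the coverage of the midpoint approach in Tables 1 and 2 deteriorates as the sample size grows.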

Table 1.

Cox (midpoint t) and MPL regression parameter estimation for study 1.

				$β_{11} = - 1$		$β_{12} = 0.5$		$β_{21} = 1$		$β_{22} = - 0.5$
n	int cens (%)	right cens (%)		Cox	MPL	Cox	MPL	Cox	MPL	Cox	MPL
200	47.5	47.5	bias	−0.053	0.033	0.033	−0.01	0.019	−0.048	−0.002	0.032
			std asymp	0.148	0.169	0.134	0.146	0.191	0.208	0.173	0.182
			std mc	0.154	0.17	0.142	0.15	0.204	0.22	0.181	0.189
			cov prob	0.92	0.957	0.927	0.949	0.937	0.945	0.936	0.95
200	75	20	bias	−0.223	0.024	0.114	−0.012	0.121	−0.038	−0.054	0.026
			std asymp	0.119	0.173	0.109	0.145	0.15	0.203	0.136	0.17
			std mc	0.135	0.176	0.115	0.144	0.165	0.209	0.142	0.175
			cov prob	0.502	0.947	0.792	0.954	0.828	0.952	0.912	0.939
1000	47.5	47.5	bias	−0.081	0.016	0.039	−0.01	0.044	−0.013	−0.02	0.009
			std asymp	0.064	0.074	0.058	0.064	0.081	0.09	0.074	0.078
			std mc	0.066	0.076	0.06	0.066	0.081	0.089	0.074	0.08
			cov prob	0.721	0.941	0.873	0.945	0.912	0.948	0.946	0.942
1000	75	20	bias	−0.245	0.02	0.119	−0.012	0.141	−0.015	−0.07	0.01
			std asymp	0.056	0.078	0.047	0.064	0.064	0.091	0.059	0.075
			std mc	0.056	0.078	0.049	0.066	0.07	0.092	0.058	0.074
			cov prob	0.01	0.95	0.299	0.946	0.4	0.945	0.763	0.955

Note: MPL: maximum penalized likelihood. Cox (midpoint t) refers to the midpoint of interval censored times being treated as event times in a Cox regression; std asymp: asymptotic standard deviations; std mc: Monte Carlo standard deviations; cov prob: coverage probability.

Table 2.

Cox (midpoint t) and MPL regression parameter estimation for study 2.

				$β_{11} = 1$		$β_{12} = 0.5$		$β_{21} = 0.5$		$β_{22} = 0.5$
n	int cens (%)	right cens (%)		Cox	MPL	Cox	MPL	Cox	MPL	Cox	MPL
200	47.5	47.5	bias	0.12	−0.002	0.072	0.003	0.094	0.039	0.05	0.008
			std asymp	0.134	0.158	0.125	0.14	0.201	0.211	0.191	0.198
			std mc	0.141	0.158	0.129	0.142	0.2	0.219	0.202	0.211
			cov prob	0.822	0.953	0.891	0.951	0.93	0.931	0.931	0.938
200	75	20	bias	0.39	0.02	0.228	0.011	0.341	0.069	0.199	0.028
			std asymp	0.103	0.158	0.1	0.141	0.148	0.181	0.147	0.171
			std mc	0.111	0.164	0.106	0.143	0.148	0.186	0.147	0.174
			cov prob	0.073	0.931	0.371	0.949	0.361	0.923	0.696	0.939
1000	47.5	47.5	bias	0.136	−0.022	0.076	−0.015	0.101	0.016	0.06	0.007
			std asymp	0.058	0.071	0.055	0.063	0.087	0.095	0.082	0.087
			std mc	0.061	0.074	0.057	0.065	0.086	0.097	0.081	0.086
			cov prob	0.346	0.936	0.704	0.941	0.784	0.949	0.879	0.956
1000	75	20	bias	0.398	−0.009	0.23	−0.007	0.342	−0.023	0.203	−0.017
			std asymp	0.045	0.079	0.044	0.067	0.065	0.095	0.063	0.081
			std mc	0.048	0.085	0.046	0.068	0.063	0.095	0.063	0.082
			cov prob	0	0.932	0.001	0.939	0.001	0.951	0.111	0.941

Note: MPL: maximum penalized likelihood. Cox (midpoint t) refers to the midpoint of interval censored times being treated as event times in a Cox regression; std asymp: asymptotic standard deviations; std mc: Monte Carlo standard deviations; cov prob: coverage probability.

In studies 1 and 2, the absolute bias of the MPL estimates was below 0.1 and, in the majority of cases, lower than that of the Cox regression with midpoint t (Tables 1 and 2). The asymptotic standard errors for the MPL were close to the Monte Carlo estimates, and were larger than those estimated by the Cox regression. However, the apparent gain in precision for the Cox regression came with relatively poor coverage probabilities because of its large bias. The MPL coverage probabilities were close to the nominal value of 95%. For the same event proportion, both methods improved in bias and coverage probability when a smaller proportion of left and interval censored times was observed. This is partly due to the decreased width of the intervals that is necessary to produce a smaller proportion of left and interval censored data for the same set of randomly generated $t_{i}$ under our data generation process. As the sample size increases, the coverage probabilities of the MPL method remain around the 0.95 level, whereas those of the Cox with midpoint t method decrease. In study 3 (Supplemental Table 1), the Cox regression performs better, with reduced bias and coverage probabilities close to the nominal level of 0.95. However, some parameters are still estimated poorly, particularly the coefficients for risk 1, whose coverage probabilities decrease as the sample size increases. For MPL, all coverage probabilities are close to the nominal 0.95 level.

Figure 1(A) and (B) show that the average estimated baseline hazards for scenario 1, with a sample size of 200 and a 47.5% right censoring proportion, are close to their true values for risks 1 and 2, respectively. The true baseline hazards for both risks are contained within the confidence bands. Figure 1(C) and (D) show the coverage probabilities of the baseline hazards over $t$ for the two risks. In general, the coverage probability is maintained close to the nominal 0.95 level and remains above 0.8 for the majority of the range. The coverage probabilities are closest to the nominal 95% value when $t$ is in $(0.3, 0.6)$ for both risks, and decrease as $t$ approaches 0 and 1. This occurs because most generated survival times are located near the middle of $(0, 1)$ , and survival times become scarce when $t$ is close to 0 or 1. Figure 2(A) and (B) show the baseline cumulative incidence functions (CIFs) for risks 1 and 2, respectively. For both risks, the true CIF is close to the average of the estimated CIFs and lies within the estimated confidence bands.
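For reference, a true baseline CIF of the kind plotted against the estimates can be evaluated numerically from the Weibull baseline hazards in (18). The sketch below assumes the standard definition $F_{r} (t) = \int_{0}^{t} h_{0 r} (s) \exp \{- \sum_{k} λ_{k} s^{ρ_{k}}\} d s$ and uses Gauss-Legendre quadrature; the function name `baseline_cif`, the 0-based risk index, and the choice of 15 nodes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def baseline_cif(t, r, lam, rho, n_nodes=15):
    """Baseline CIF of risk r (0-based index) at time t by Gauss-Legendre quadrature."""
    lam = np.asarray(lam, dtype=float)
    rho = np.asarray(rho, dtype=float)
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    s = 0.5 * t * (nodes + 1.0)                 # map [-1, 1] onto [0, t]
    hazard_r = lam[r] * rho[r] * s ** (rho[r] - 1.0)
    # Overall survival: exp(-sum_k lam_k * s**rho_k), evaluated at every node.
    cum_hazard = np.sum(lam[:, None] * s[None, :] ** rho[:, None], axis=0)
    integrand = hazard_r * np.exp(-cum_hazard)
    return 0.5 * t * np.sum(weights * integrand)

# Study 1 values; with equal shapes the CIF also has a closed form to compare against.
lam_1, rho_1 = [1.0, 0.5], [3.0, 3.0]
cif_quad = baseline_cif(1.0, 0, lam_1, rho_1)
cif_exact = (1.0 / 1.5) * (1.0 - np.exp(-1.5 * 1.0 ** 3))
```

When the shape parameters are equal, the closed form provides a convenient check on the quadrature; with unequal shapes only the numerical route is available.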

Figure 1.

(A) and (B) The average baseline hazards for 1000 Monte Carlo runs, the true baseline hazards and the confidence bands for risks 1 and 2, respectively. (C) and (D) The coverage of the baseline hazards for risks 1 and 2, respectively.

Figure 2.

(A) and (B) The average of cumulative incidence functions (solid blue line) for 1000 Monte Carlo runs for scenario 1 with $n = 200$ . Black lines represent the true cumulative incidence function (CIF) and confidence bands from the Monte Carlo runs are dashed lines.

Figure 3.

Baseline hazards for death (blue) and dementia (red) and their confidence bands based on the MPL in the ASPREE study. MPL: maximum penalized likelihood; ASPREE: Aspirin in Reducing Events in the Elderly.

Figure 4.

Cumulative incidence functions for death (blue) and dementia (red) and their confidence bands based on MPL in the ASPREE study. MPL: maximum penalized likelihood; ASPREE: Aspirin in Reducing Events in the Elderly.

The simulations were conducted on a 2021 MacBook Pro with an Apple M1 Pro chip and 16 GB RAM. For study 1, with approximate right censoring and interval censoring (including left censoring) proportions of $47.5\%$ each and $n = 200$ , estimation for a simulated dataset took between 1 and 60 s (average 17 s). For the $n = 1000$ sample size, estimation took between 23 and 777 s (average 165 s).

5.2. Application to dementia dataset

This study was initiated to explore the time to dementia in elderly patients while accounting for the competing risk of death and baseline covariates such as age, gender, education level, country of birth, and living arrangements (living alone or with others). The data were sourced from the “ASPirin in Reducing Events in the Elderly” (ASPREE) trial, a clinical trial conducted in Australia and the United States to determine whether daily use of 100 mg of enteric-coated aspirin would prolong the healthy lifespan of older adults.²¹ The trial included 19,114 relatively healthy older individuals, with 9525 randomized into the intervention group (aspirin) and 9589 into the control group (placebo). The participants were followed up every year for a maximum of 7 years. We selected dementia as an appropriate interval censored outcome since its diagnosis typically occurs over an extended period and, in this trial, it was assessed every 2 years.

There were 963 (5%) deaths (event data) and 575 (3%) dementia cases (interval censored), and the remaining 17,576 (92%) participants were right censored. The hazard ratios estimated under both methods are largely in agreement for both outcomes. As observed in the simulation study, the standard errors of all parameters tend to be slightly larger for the MPL. Conclusions from significance testing at the 0.05 level also agree between the two methods. We suspect that the large proportion of right censored observations in this dataset explains the similarity of these results. For death, the MPL results show that participants who were older (HR 1.12, $p < 0.001$ ), male (HR 1.73, $p < 0.001$ ), less educated (HR 1.23, $p = 0.009$ ), born in the US (HR 1.29, $p = 0.026$ ) or living with others (HR 1.27, $p = 0.004$ ) were more likely to die (Table 3). For dementia, participants who were older (HR 1.14, $p < 0.001$ ), male (HR 1.41, $p < 0.001$ ) or born in the US (HR 1.38, $p = 0.019$ ) were more likely to be identified with dementia.

Table 3.

MPL and Cox (midpoint t) regression parameter estimates for death and dementia in the ASPREE trial.

	MPL			Cox (midpoint t)
Death outcome	HR	SE (coef)	p-value	HR	SE (coef)	p-value
Age	1.123	0.007	<0.001	1.119	0.006	<0.001
Male versus female	1.73	0.078	<0.001	1.851	0.068	<0.0001
<13 years versus 13+ year education	1.233	0.08	0.009	1.215	0.069	0.005
COB: US versus Australia	1.287	0.113	0.026	1.253	0.098	0.0212
Living situation: with others versus alone	1.268	0.081	0.004	1.297	0.07	<0.001
Dementia outcome
Age	1.138	0.009	<0.001	1.129	0.008	<0.001
Male versus female	1.41	0.098	<0.001	1.339	0.087	<0.001
<13 years versus 13+ year education	1.11	0.1	0.2966	1.122	0.089	0.1952
COB: US versus Australia	1.382	0.138	0.0187	1.445	0.12	0.0022
Living situation: with others versus alone	1.202	0.102	0.071	1.167	0.09	0.0864

Note: Living situation, “with others” includes living at home with family, in a residential home (with or without supervised care). “Alone” includes living alone at home. MPL: maximum penalized likelihood. Cox (midpoint t) refers to the midpoint of interval censored times being treated as event times in a Cox regression; COB: country of birth; HR: hazard ratio; SE: standard error; ASPREE: Aspirin in Reducing Events in the Elderly.
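The columns of Table 3 are linked through the regression coefficient, since the hazard ratio is the exponential of the coefficient. The sketch below shows how a z statistic, a p-value and a confidence interval for the hazard ratio could be recovered from a reported HR and SE (coef). It assumes a two-sided Wald test and a 95% Wald interval, which the text does not state explicitly; the function name `wald_summary` is hypothetical. Because the table entries are rounded, recovered values only approximate the reported ones.

```python
import math

def wald_summary(hr, se_coef, z_crit=1.96):
    """Wald z statistic, two-sided p-value and CI for a hazard ratio."""
    coef = math.log(hr)                          # HR = exp(coef)
    z = coef / se_coef
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (math.exp(coef - z_crit * se_coef),
          math.exp(coef + z_crit * se_coef))
    return z, p_value, ci

# Education row of the death outcome under MPL: HR 1.233, SE (coef) 0.08.
z_edu, p_edu, ci_edu = wald_summary(1.233, 0.08)
```

Under these assumptions the education row gives a p-value close to the reported 0.009, which suggests, but does not confirm, that the reported p-values are of this Wald type.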

Figure 3 shows the baseline hazards of death and dementia. The baseline hazard of death is greater than that of dementia throughout the study period. Figure 4 shows the baseline cumulative incidence functions for the two outcomes. The incidence is higher for death than for dementia; however, judging by their confidence bands, the difference is not significant. Figure 5 shows the survival curves for death (Figure 5(A)) and dementia (Figure 5(B)) by sex. For both outcomes, the survival curves of males are significantly lower than those of females.

Figure 5.

Survival curves for death (A) and dementia (B) by sex and their confidence bands based on MPL in the ASPREE study. MPL: maximum penalized likelihood; ASPREE: Aspirin in Reducing Events in the Elderly.

6. Conclusions

Compared with the method of sieves for semi-parametric models, an advantage of the penalty approach is that it makes the choice of the number of basis functions and knots less important, since the smoothing parameter can correct under- or over-smoothing. The MPL method is implemented in R and is currently available on GitHub at https://github.com/josephdescallar/phcshMPL.

In this article, we also develop asymptotic properties of the constrained MPL estimates. Any semi-parametric PH regression model contains a baseline hazard as the non-parametric component and regression coefficients as the parameters of interest.

For inference in semi-parametric proportional hazard (PH) regression models, one weakness of Cox’s partial likelihood and its modifications is that they cannot provide accurate risk predictions for individuals, which can severely limit applications of PH models. Likelihood based methods can supply accurate individual predictions because they estimate the baseline hazard and the regression coefficients together, but their asymptotic results are more difficult to derive because some constraints may be active.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802241262526 for Cause-specific hazard Cox models with partly interval censoring – Penalized likelihood estimation using Gaussian quadrature by Joseph Descallar, Jun Ma, Houying Zhu, Stephane Heritier and Rory Wolfe in Statistical Methods in Medical Research

Footnotes

Acknowledgements

The authors thank the ASPREE study group for use of the ASPREE trial data, and Le Thi Phuong Tao for assistance with the dataset.

Data availability

The code used to generate the simulated data and its results is available on GitHub. The ASPREE data can be accessed upon request from Monash University.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

ORCID iD

Joseph Descallar

Supplemental material

Supplemental material for this article, containing the asymptotic properties and details of the score vectors and Hessian matrices of the proposed estimators, is available online.

References

1. Putter, Fiocco, Geskus. Tutorial in biostatistics: competing risks and multi-state models. Stat Med 2007; 26: 2389–2430.

2. Cox. Partial likelihood. Biometrika 1975; 62: 269–276.

3. Prentice, Kalbfleisch, Peterson, et al. The analysis of failure times in the presence of competing risks. Biometrics 1978; 34: 541–554.

4. Kim. Maximum likelihood estimation for the proportional hazards model with partly interval-censored data. J R Stat Soc B 2003; 65: 489–502.

5. Cause-specific hazard regression for competing risks data under interval censoring and left truncation. Comput Stat Data An 2016; 104: 197–208.

6. Ruppert, Wand, Carroll. Semiparametric regression. Cambridge: Cambridge University Press, 2003.

7. Joly, Commenges, Helmer, et al. A penalized likelihood approach for an illness-death model with interval-censored data: application to age-specific incidence of dementia. Biostatistics 2002; 3: 433–443.

8. Grenander. Abstract inference. New York: J. Wiley, 1981.

9. Ma, Couturier, Heritier, et al. Penalized likelihood estimation of the proportional hazards model for survival data with interval censoring. Int J Biostat 2022; 18: 553–575.

10. Joly, Commenges, Letenneur. A penalized likelihood approach for arbitrarily censored and truncated data: application to age-specific incidence of dementia. Biometrics 1998; 54: 185–194.

11. Ramsay. Monotone regression splines in action. Stat Sci 1988; 3: 425–441.

12. On hazard-based penalized likelihood estimation of accelerated failure time model with partly interval censoring. Stat Methods Med Res 2020; 29: 3804–3817.

13. Andersen, Ørnulf, Richard, et al. Statistical models based on counting processes. New York: Springer, 1992.

14. Green, Silverman. Nonparametric regression and generalized linear models – a roughness penalty approach. London: Chapman and Hall, 1994.

15. Chan. A multiplicative iterative algorithm for box-constrained penalized likelihood image restoration. IEEE Trans Image Process 2012; 21: 3168–3181.

16. Golub, Welsch. Calculation of Gauss quadrature rules. Math Comput 1969; 23: 221–230.

17. Luenberger. Linear and nonlinear programming. 2nd ed. New York: J. Wiley, 1984.

18. Cai, Betensky. Hazard regression for interval-censored data with penalized spline. Biometrics 2003; 59: 570–579.

19. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc B 2011; 73: 3–36.

20. Moore, Sadler, Kozick. Maximum-likelihood estimation, the Cramér-Rao bound, and the method of scoring with parameter constraints. IEEE Trans Signal Process 2008; 56: 895–908.

21. ASPREE study group. Study design of ASPirin in Reducing Events in the Elderly (ASPREE): a randomized, controlled trial. Contemp Clin Trials 2013; 36: 555–564.
