Inference on latent factor models for informative censoring

Abstract

This work discusses the problem of informative censoring in survival studies. A joint model for the time to event and the time to censoring is presented. Their hazard functions include a latent factor in order to identify this joint model without sacrificing the flexibility of the parametric specification. Furthermore, a fully Bayesian formulation with a semi-parametric proportional hazard function is provided. Similar latent variable models have been described in literature, but here the emphasis is on the performance of the inferential task of the resulting mixture model with unknown number of components. The posterior distribution of the parameters is estimated using Hamiltonian Monte Carlo methods implemented in Stan. Simulation studies are provided to study its performance and the methodology is implemented for the analysis of the ACTG175 clinical trial dataset yielding a better fit. The results are also compared to the non-informative censoring case to show that ignoring informative censoring may lead to serious biases.

Keywords

Survival models mixture models bayesian inference proportional hazard HMC model selection

Introduction

Right-censored survival times are very common in event time studies. Right-censoring is non-informative if the censoring times do not depend on the event of interest. For instance, units whose event of interest has not occurred by the end of the clinical study (type I censoring). Conversely, units may drop out from the study for reasons which depend on the event of interest. For example, in a clinical study on the efficacy of a new treatment, a patient may withdraw due to the worsening of their medical conditions.

In case of informative censoring, we need to consider a model for the joint distribution of the censoring time $C$ and the time to event $T$ . Not addressing this dependency between $T$ and $C$ may result in a biased estimation of the distribution of $T$ . However, this joint distribution is not identifiable given the data, since we only observe the minimum between $C$ and $T$ (see¹,² and³ for a more detailed account of this issue). Thus, we have to make untestable assumptions if we wish to study the joint distribution of $T$ and $C$ .

Some examples from the literature are the works of⁴, who proposed and analysed the use of a bivariate Weibull model for $(T, C)$ , and the work of⁵ who analyse the consistency of the estimator of the marginal survival function of $T$ and $C$ based on a copula model with known dependence parameters. These parametric models are typically used to investigate sensitivity with respect to the dependency parameter.

Scharfstein and Robins⁶ consider a hazard function specification for the censoring time which has a multiplicative relationship with a function of $T$ , while⁷ specify a semi-parametric hazard function (⁸) for $T$ , which has a step-change associated with the censoring event. In this way, they avoid modeling the marginal distribution of $C$ . The limitation of these approaches is their need to fix the parameter which calibrates the degree of dependence between $T$ and $C$ , since it cannot be estimated on the basis of available data.

Huang and Wolfe⁹ proposed the use of a latent variable to account for the dependency between $T$ and $C$ . Given this latent variable, then $T$ and $C$ are independently distributed. They assume that this latent variable is normally distributed with mean zero and unknown variance. This latent variable operates on a cluster level (e.g., clinics with patients) where each cluster has its own frailty, which is shared by all units in the cluster. The limitation of this approach is that the parametric class of distributions for the latent variable must be known.

Rowley et al.¹⁰ extended the work of⁹ by considering a latent factor with a finite but unknown number of possible outcomes. The resulting joint distribution of $T$ and $C$ turns out to be a mixture distribution with an unknown number of components. The authors suggested to estimate this joint distribution with the Maximum-a-Posteriori (MAP) estimator and they recommended to determine the number of levels with the Bayesian Information Criterion (BIC).

Among all options of modeling the joint distribution of T and C, the latent variable approach seems to be the most promising, since it has the least limitations compared to alternative parametric hazard function specifications.

However, statistical models including latent factors, such as those proposed in¹⁰, contain singularities in their parameter space, which means that there is not a one-to-one mapping from the parameter space to a probability distribution. As a consequence, the Fisher Information Matrix may not be invertible (see¹¹ and¹²). Thus, estimators like maximum likelihood (ML) and MAP do not have asymptotic normal distributions and may yield divergent parameter estimates when applied to real datasets. The asymptotic distribution of the parameter estimators deviates from being Gaussian in ill-posed models, particularly when the estimator reaches the boundary of the parameter space (¹³). When dealing with mixture distributions this problem occurs when the mixture weight is close to zero, when two mixture components are very close to one another, or more generally when a mixture in $K$ components can be equivalently obtained as a mixture in $K^{'}$ components, with $K^{'} < K$ .

The problem of model selection between distributions with different number of mixture components turns out to be extremely critical in this context. Classical information criteria such as the AIC (¹⁴) and the BIC (¹⁵) cannot be used (see¹⁶ for further details).

Another criticism towards the use of ML and MAP estimators is that for a mixture model with $K$ components the likelihood function is most likely multimodal. Fitting a $K$ -component mixture model on $n$ observations, results in a likelihood function of the form of a sum of $K^{n}$ terms, each corresponding to the likelihood function obtainable under the possible realizations of the latent factor. Thus, there may exist multiple roots, and the observed solution can be affected by the choice of the starting values. Finally, the numerical optimization can require high computational time, due to the large number of parameters and the complexity of the objective function, and there would not be any guarantee that all roots will be found.

This work focuses on a joint model for $T$ and $C$ , where they are independently distributed conditional on a latent class or factor $H$ , similar as in¹⁰. A point-identifying assumption for this distribution is specified, without sacrificing the flexibility of the parametric model.

We propose the use of a fully Bayesian analysis, where instead of focusing on a point estimate, we try to estimate the posterior distribution of the parameters. The contribution of this work is thus to address the inferential challenges posed by these mixture models for addressing informative censoring in survival models. We will discuss identifiability, the Bayesian inference for this model, including prior specification and parametrization of the mixture model, and the model selection problem for mixture distributions with unknown number of components.

Section 2 will discuss these model formulations, while Section 3 addresses the estimation problem within the Bayesian inferential paradigm. Section 4 contains a simulation study to assess the performance of the inferential strategy described in Section 3. As an illustrative example the ACTG175 clinical trial (¹⁷) is analysed, and the results are illustrated in Section 5. Extension of the use of joint models with latent factors are discussed in Section 6. Finally, Section 7 provides a summary and a discussion.

Modeling non-informative censoring with latent factors

In our model we assume that the event time $T$ and the censoring time $C$ are conditionally independent given latent factor $H$ . Thus the joint density of $(T, C)$ is given by:

f (t, c) = \int_{h} f_{T ∣ H} (t ∣ h) f_{C ∣ H} (c ∣ h) d F_{H} (h)

(1)

where

f (\cdot)

denotes the density function while

F (\cdot)

denotes the distribution function. We refer to the latent factor

H

as the frailty.

Furthermore, we allow that the distribution of $T$ and $C$ may depend on a vector of time varying covariates at time $u \in [0, \infty)$ , denoted by $x^{T} (u) = (x_{1}^{T} (u), \dots, x_{P_{T}}^{T} (u))$ and $x^{C} (u) = (x_{1}^{C} (u), \dots, x_{P_{C}}^{C} (u))$ respectively. In this work, we include covariates by means of the Cox proportional hazard model (⁸). We assume that the distribution of $H$ is independent of $x^{T} (\cdot)$ and $x^{C} (\cdot)$ .

Model

We assume that the latent factor $H$ is a discrete variable which takes $K + 1$ possible values, denoted for simplicity by the set of integers ${0, 1, \dots, K}$ . $K$ is considered to be fixed but unknown. We set $H = 0$ as the baseline level, and write $h = (h_{1}, \dots, h_{K})$ , such that $h_{k} = 1_{[H = k]}$ .

The following models for the hazard function of $T$ and $C$ are considered:

μ^{T} (u ∣ x^{T} (u), h; β^{T}, γ^{T}) = μ_{0}^{T} (u) \exp (β^{T^{'}} x^{T} (u) + γ^{T^{'}} h)

(2)

μ^{C} (u ∣ x^{C} (u), h; β^{C}, γ^{C}) = μ_{0}^{C} (u) \exp (β^{C^{'}} x^{C} (u) + γ^{C^{'}} h)

(3)

where

β^{ℓ}

is the vector of regression coefficients capturing the effect of the covariates

x^{ℓ} (\cdot)

, and

μ_{0}^{ℓ} (\cdot)

is the baseline hazard function. The unknown parameters

γ^{ℓ} = (γ_{0}^{ℓ}, \dots, γ_{K}^{ℓ})

capture the proportional effect of each outcome of

H

on the the hazard (

ℓ \in {T, C}

). We assume that

γ_{0}^{ℓ} = 0

As result, now the frailty component has a discrete distribution with sample space $(1, \exp (γ_{1}^{ℓ}), \dots, \exp (γ_{K}^{ℓ}))$ , for $ℓ \in {T, C}$ .

¹⁰ also allow for the possibility of regression parameters $β$ which depend on $H$ , and the distribution of $H$ may depend on $X$ . An even more structured model, would allow $H$ to be time-varying.

The latent factor $H$ can be interpreted as an underlying health state or situation of a (group of) patient(s). For example, whether a patient used zidovudine for HIV Type I infection. In this case, the number of levels $K$ for latent factor $H$ is known and equal to two.

The resulting joint distribution of $T$ and $C$ still corresponds to the integral of equation (1), where the frailty component is dominated by the counting measure, hence the density can be rewritten as follows:

\begin{aligned} f_{T, C ∣ X^{T}, X^{C}} & (t, c ∣ x^{T} (t), x^{C} (c); β^{T}, γ^{T}, β^{C}, γ^{C}, ζ) \\ = \sum_{k = 0}^{K} ζ_{k} f_{T ∣ X^{T}, H} (t ∣ x^{T} (t), h_{k}) f_{C ∣ X^{C}, H} (c ∣ x^{C} (c), h_{k}) \end{aligned}

(4)

where

ζ = (ζ_{0}, \dots, ζ_{K})

and

ζ_{k} = Pr (H = k)

. Equation (4) is in fact a mixture distribution with

K + 1

components induced by the presence of a latent factor

H

ζ

can be considered as the non-parametric distribution of

H

(see^18,^19,²⁰).

The mixture formulation also accounts for non-informative censoring, which occurs when either one or both conditions hold:

i) $γ_{0}^{T} = \dots = γ_{K}^{T} = 0$ ;

ii) $γ_{0}^{C} = \dots = γ_{K}^{C} = 0$ ;

In particular, in situation i) we obtain the Cox proportional hazards model without heterogeneity between patients in hazard for the time to event $T$ , while in ii) the hazard function for $T$ is still characterized by sources of heterogeneity which are not explained by the censoring mechanism.

Identifiability

General conditions for the identifiability of this modelling approach have been analysed in²¹,²² and²³.

Nevertheless, given the finite mixture nature of our approach, two additional conditions are needed to ensure its global identifiability¹ (see also²⁵ and²⁶):

For every value of $(t, c, x^{T} (t), x^{C} (c))$ , different values of $H$ should result into different p.d.f.s:

\begin{aligned} f (t, c ∣ x^{T} (t), x^{C} (c), H = j; β^{T}, β^{C}, γ_{j}^{T}, γ_{j}^{C}) \\ \neq f (t, c ∣ x^{T} (t), x^{C} (c), H = k; β^{T}, β^{C}, γ_{k}^{T}, γ_{k}^{C}) for j \neq k \end{aligned}

and thus two different hazard functions for

T

or two different hazard function for

C

given their conditional independence;

$0 < ζ_{k} < 1$ for $k = 0, \dots, K$ . This condition is needed because if for any $k$ we have $ζ_{k} = 0$ , then $γ_{k}^{T}$ and $γ_{k}^{C}$ can take any value without affecting the mixture distribution.

However, the joint probability distribution of

(T, C)

outlined so far is not locally identifiable: by permuting the labels of

H

, we obtain exactly the same joint p.d.f.:

\begin{aligned} \sum_{k = 0}^{K} ζ_{k} [\exp (- \int_{0}^{t} μ^{T} (s ∣ x^{T} (s), h; β^{T}, γ_{k}^{T}) d s) μ^{T} (t ∣ x^{T} (t), h; β^{T}, γ_{k}^{T})] \\ \times [\exp (- \int_{0}^{c} μ^{C} (s ∣ x^{C} (s), h; β^{C}, γ_{k}^{C}) d s) μ^{C} (c ∣ x^{C} (c), h; β^{C}, γ_{k}^{C})] \\ = \sum_{k = 0}^{K} ζ_{ρ (k)} [\exp (- \int_{0}^{t} μ^{T} (s ∣ x^{T} (s), h; β^{T}, γ_{ρ (k)}^{T}) d s) μ^{T} (t ∣ x^{T} (t), h; β^{T}, γ_{ρ (k)}^{T})] \\ \times [\exp (- \int_{0}^{c} μ^{C} (s ∣ x^{C} (s), h; β^{C}, γ_{ρ (k)}^{C}) d s) μ^{C} (c ∣ x^{C} (c), h; β^{C}, γ_{ρ (k)}^{C})] \end{aligned}

where

ρ (k)

represents any permutation of the indexing given by

k

, and

\sum_{k = 0}^{K} ζ_{k} = 1

This problem can be easily overcome by restricting the parameter space of the hazard function. In Section 3.2 we address this issue in the perspective of improving the efficiency of the Hamiltonian Monte Carlo sampler.

Inference

For the reasons described in Section 1, in this work we carry out our inferential exercise using Bayesian techniques. Let $p (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ K)$ denote the $K$ -dependent prior distribution of the parameter vector $(β^{T}, γ^{T}, β^{C}, γ^{C}, ζ)$ . The learning object is thus given by the following posterior distribution of the vector $(β^{T}, γ^{T}, β^{C}, γ^{C}, ζ)$ :

\begin{aligned} p (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ data, K) \propto p (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ K) L (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ data, K) \\ = p (γ^{T}, γ^{C}, ζ ∣ K, β^{T}, β^{C}) p (β^{T}, β^{C}) L (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ data, K) \end{aligned}

(5)

where

L (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ data, K)

denotes the likelihood as a function of the parameters, given the data and

K

. In order to avoid the use of a cumbersome notation

p

will denote the density function of the parameters.

In equation (5) we remarked the dependence of the posterior distribution on $K$ , since it determines the dimension of the parameter space, particularly for $γ^{T}$ , $γ^{C}$ and $ζ$ .

Likelihood function

Let us first distinguish $C$ from $C^{*}$ , where the former denotes the time to informative censoring, and the latter denotes the time to non-informative censoring (e.g. administrative censoring). Let $T^{'} = min (T, C, C^{*})$ , $d_{i} = 1_{[t_{i}^{'} = t_{i}]}$ and $d_{i}^{*} = 1_{[t_{i}^{'} = c_{i}]}$ . Thus, if a time to event is subject to non-informative censoring, then $d_{i} = d_{i}^{*} = 0$ .

The latent factor $H$ is never observed, hence the $i$ th individual likelihood contribution must be marginalized with respect to $H$ :

\begin{aligned} L_{i} (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ) \propto f (t_{i}^{'} ∣ x_{i}^{T} ([0, t_{i}^{'}]), x_{i}^{C} ([0, t_{i}^{'}]); β^{T}, γ^{T}, β^{C}, γ^{C}, ζ) \\ = \sum_{k = 0}^{K} ζ_{k} [\exp (- \int_{0}^{t_{i}^{'}} μ_{0}^{T} (s) \exp (β^{T^{'}} x_{i}^{T} (s) + γ_{k}^{T}) d s) {(μ_{0}^{T} (t_{i}^{'}) \exp (β^{T} x_{i}^{T} (t_{i}^{'}) + γ_{k}^{T}))}^{d_{i}}] \\ \times [\exp (- \int_{0}^{t_{i}^{'}} μ_{0}^{C} (s) \exp (β^{C^{'}} x_{i}^{C} (s) + γ_{k}^{C}) d s) {(μ_{0}^{C} (t_{i}^{'}) \exp (β^{C^{'}} x_{i}^{C} (t_{i}^{'}) + γ_{k}^{C}))}^{d_{i}^{*}}] \end{aligned}

(6)

where

γ_{0}^{ℓ} = 0

and

x_{i}^{ℓ} (s)

for

ℓ \in {T, C}

is the set of covariates value at time

s \in [0, t_{i}^{'}]

In equation (6) the covariates are assumed as continuously observed throughout the whole time span $[0, t_{i}^{'}]$ . In practice, covariates are observed at discrete time points, and several assumptions can be made for their value at intermediate time points. For example, in the simulation study (Section 4) and in the empirical analysis (Section 5) the covariates are assumed constant between two observation times, and equal to the previously observed value.

The resulting likelihood function of the parameters, given the data $L (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ data) = \prod_{i} L_{i} (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ)$ is the sum of $(K + 1)^{n}$ terms, each corresponding to a realization of the latent factor for each unit in the sample.

Prior specification and model reparametrization

In general, the Bayesian analysis of a mixture model is particularly challenging. For this reason, the prior should be chosen accurately, and the model needs to be parametrized in a way that makes the sampling process efficient.

First of all, we assume for convenience that the vectors $β^{T}$ , $β^{C}$ , $γ^{T}$ , $γ^{C}$ and $ζ$ are pairwise independently distributed, that is:

p (β^{T}, γ^{T}, β^{C}, γ^{C}, ζ ∣ K) = p (β^{T}) p (β^{C}) p (γ^{T} ∣ K) p (γ^{C} ∣ K) p (ζ ∣ K)

(7)

where each density depends on some hyperparameters.

The specification of the prior distribution for the regression coefficient parameters $β^{T}$ and $β^{C}$ does not represent a problem, hence it is possible to specify a non-informative prior in order to reduce the extent of subjective assumptions on the inference.

The prior distribution of $γ^{T}$ , $γ^{C}$ and $ζ$ must be specified with more care. The use of prior distributions which are at least weakly informative is appropriate, in order to enhance the efficiency of the sampler.

For example, with the use of an exchangeable prior on $(γ_{1}^{T}, \dots, γ_{K}^{T})$ , $(γ_{1}^{C}, \dots, γ_{K}^{C})$ and of a uniform prior on $ζ$ ², then their posterior distribution will maintain the permutation invariance induced by the mixture likelihood (²⁷ and²⁸).

This multimodality hinders the efficiency of the sampler, because it would take a large number of iterations to fully explore the posterior distribution. Indeed, the sampler is likely to get stuck in an area of the parameter space corresponding to one mode of the posterior distribution (see²⁹).

In Bayesian inference the identifiability of a mixture distribution turns out to be more crucial. In Section 2.2 we mentioned how a constraint allows for a model to be identifiable. In our case we also need to follow a strategy which allows for an efficient sampling for the posterior distribution.

For the joint distribution of equation (4), we need to place an ordering constraint on the set $[(γ_{1}^{T}, γ_{1}^{C}), \dots, (γ_{K}^{T}, γ_{K}^{C})]$ , which can be done by ordering either $(γ_{1}^{T}, \dots, γ_{K}^{T})$ or $(γ_{1}^{C}, \dots, γ_{K}^{C})$ . For example, we can set:

0 < γ_{1}^{T} < γ_{2}^{T} < \dots < γ_{K}^{T}

(8)

where we take into account also the component corresponding to

γ_{0}^{T} = 0

. An account of the geometric interpretation of this constraint can be found in²⁸. This constraint also turns exchangeable priors into nonexchangeable ones due to the restriction of the parameter space into its subset, singling out one mode of the posterior distribution, as induced by the label permutation. Conversely, the exchangeability in the prior distribution will be inherited also by the posterior distribution.

However, when the mixture components are not well separated (i.e. these are very close to one another in value), then also the posterior distribution will be affected by the imposed constraint. For example, suppose we set an ordering as in equation (8), and there is a nonnegative probability that $γ_{1}^{T} > γ_{2}^{T}$ (all else being equal, which is more likely if $γ_{1}^{T}$ is very close to $γ_{2}^{T}$ ), then the true posterior distribution cannot be captured by the posterior distribution satisfying such constraint.

In the extreme case where the mixture components overlap, then the posterior distribution will be more difficult to explore (²⁸), as may occur when the value of $K$ is larger than actually needed to explain the heterogeneity among the units. In this case the components in excess will collapse into those which are necessary and have been already included in the model.

This justifies our inferential strategy of analyzing increasingly complex models in terms of needed mixture components, as we describe in Section 3.3.

Choice of the number of mixture components

The value of $K$ is inferred by solving a model selection problem, where models with different values of $K$ are separately fitted and then compared to one another using appropriate information criteria. Algebraic geometry tools have been proven useful in order to understand the asymptotic behavior of the posterior distribution of the parameters when dealing with singular statistical models. For example,¹¹ proposes the Widely Applicable Information Criterion (WAIC), which generalizes the AIC to the analysis of singular models.

The WAIC is computed as follows (notation simplified for ease of exposition):

\begin{aligned} WAIC = & - \sum_{i = 1}^{n} \log [\frac{1}{M} \sum_{m = 1}^{M} f (t_{i}^{'} ∣ x_{i}^{T} (s), x_{i}^{C} (s), d_{i}, d_{i}^{*}; β^{T (m)}, β^{C (m)}, γ^{T (m)}, γ^{C (m)}, ζ^{(m)})] + p_{WAIC}; \\ p_{WAIC} = 2 \sum_{i = 1}^{n} [\log (\frac{1}{M} \sum_{m = 1}^{M} f (t_{i}^{'} ∣ x_{i}^{T} (s), x_{i}^{C} (s), d_{i}, d_{i}^{*}; β^{T (m)}, β^{C (m)}, γ^{T (m)}, γ^{C (m)}, ζ^{(m)})) \\ - \frac{1}{M} \sum_{m = 1}^{M} \log f (t_{i}^{'} ∣ x_{i}^{T} (s), x_{i}^{C} (s), d_{i}, d_{i}^{*}; β^{T (m)}, β^{C (m)}, γ^{T (m)}, γ^{C (m)}, ζ^{(m)})] \end{aligned}

where

M

is the number of sampled values from the posterior distribution of the parameters, and

p_{WAIC}

is a penalization term which can be interpreted as the effective number of parameters, and measures the fluctuation of the posterior distribution (see¹⁶).

Our strategy is to first analyze the model with $K = 0$ ( $T$ and $C$ independently distributed given $x^{T} (\cdot)$ and $x^{C} (\cdot)$ ) and then we fit models with increasing $K$ until the WAIC does not decrease, as we aim at its minimization.

Similar information criterion have been proposed to generalize the BIC, such as the singular BIC of¹² and the Widely Applicable Bayesian Information Criterion (WBIC) of³⁰. The former is of harder implementation in practice, while we found in a simulation study (not discussed here) that the WBIC shows a slightly better performance than the WAIC in selecting the true model, particularly when the mixture components are well separated. The WAIC is hereby preferred over the BIC generalization for two practical reasons: i) it provides a measure of model dimension through $p_{WAIC}$ ; ii) it turns out to be more practical since it allows to use the same output from the posterior sampling process.

Estimation

For the estimation of the posterior distribution of equation (5) we use the Hamiltonian Monte Carlo (HMC) sampler (see³¹). This algorithm belongs to the general Metropolis-Hastings family. HMC uses Hamiltonian dynamics which allow for a faster exploration of the parameter space. It turns out to be more efficient than more traditional Gibbs and random-walk Metropolis samplers for posterior distributions showing high correlations among the parameters, as it occurs for mixture distributions.

The HMC sampler is implemented by using the Stan software package (³²) and its R interface (³³), by means of the package rstan. In its default implementation Stan uses the No-U-Turn sampler of³⁴ which allows for an automatic tuning of the sampler. Furthermore, the rstan can automatically initialize the sampling process.

Simulation study

Set up

Simulation studies are carried out to assess the performance of the methodology outlined in this work. We focus on the posterior distribution of the parameters $β^{T}$ , $β^{C}$ , $γ^{T}$ and $γ^{C}$ , and thus on whether the WAIC allows to chose the true model. We look at whether the 95% credible intervals of the posterior distribution include the true value of the parameters, and we analyse whether the posterior mean can be considered as an unbiased point estimate of the parameters.

We compare these results with the parameter estimates and the 95% coverage intervals obtainable from the Cox proportional hazards model. This gives us the opportunity to quantify the need for our method in certain settings.

For each simulation run, we generate samples of 500 potentially censored lifetimes.

For each patient we assume the vector of time-varying binary covariates $X (t) = X^{T} (t) = X^{C} (t) = (X_{1} (t), X_{2} (t))$ (for example treatment and disease) with the probability distribution shown in Table 1.

Table 1.

Joint probability distribution of $(X_{1} (t), X_{2} (t))$ .

		$X_{2} (t)$
		0	1	Mar. $X_{1} (t)$
$X_{1} (t)$	0	0.37	0.23	0.6
	1	0.33	0.07	0.4
	Mar. $X_{2} (t)$	0.7	0.3

The binary covariate variables $(X_{1} (t), X_{2} (t))$ are independent throughout time.

We consider an observational period of 8 years, and a follow up period of 4 years for simplicity such that the researcher observes $X (t)$ for $t = 0, 4, 8$ . In medical practice, these values can change at very high frequency over time, for example whether a patient has a high or low blood pressure, or the time-dependent treatment, although applied statistical analyses contain simplifications to overcome the issue. In this simulation study we assume that $X (t)$ is constant between one observation and the other.

For each unit, $H$ is generated from a $Mult (1, ζ)$ distribution.

For $K = 1$ we consider two scenarios: the case of well separated mixture components (WS), and the case of poorly separated mixture components (PS).

In the WS scenario $T$ and $C$ are generated under the following hazard functions:

μ^{T} (u ∣ x^{T} (u), h; β^{T}, γ^{T}) = \exp (- 1.2 - 0.9 x_{1} (u) + 0.8 x_{2} (u) + 2 h_{1})

(9)

μ^{C} (u ∣ x^{C} (u), h; β^{C}, γ^{C}) = \exp (- 2 - x_{1} (u) + 0.5 x_{2} (u) + 3.5 h_{1})

(10)

Conversely, in case of poorly separated mixture components, we specify:

μ^{T} (u ∣ x^{T} (u), h; β^{T}, γ^{T}) = \exp (- 0.2 - 0.9 x_{1} (u) + 0.8 x_{2} (u) + 0.5 h_{1})

(11)

μ^{C} (u ∣ x^{C} (u), h; β^{C}, γ^{C}) = \exp (- 0.5 - x_{1} (u) + 0.5 x_{2} (u) + h_{1})

(12)

For the case of two mixture components we set

ζ = (0.65, 0.35)

and

γ^{C}

is chosen in order to keep a stochastic ordering in the hazard function of the censoring time due to higher values of

H

When $K = 2$ , then $T$ and $C$ are generated according to the following hazard functions:

μ^{T} (u ∣ x^{T} (u), h; β^{T}, γ^{T}) = \exp (- 1 + 0.3 x_{1} (u) - x_{2} (u) + 3 h_{1} + 2 h_{2})

(13)

μ^{C} (u ∣ x^{C} (u), h; β^{C}, γ^{C}) = \exp (- 1.6 - 0.7 x_{1} (u) + 0.5 x_{2} (u) + 2.5 h_{1} + 4 h_{2})

(14)

and

ζ = (0.48, 0.32, 0.2)

. In this case, we consider only the case of well separated mixture components due to the smaller sample size.

The analysis of models with more than three mixture components requires ideally larger samples than those analysed in this work, while the case $K = 0$ can be more efficiently fitted with a Cox proportional hazard model. For each value of $K$ we generate 100 datasets. In all cases we ensure that the datasets have around 40% censored units and a negligible percentage of type I censoring cases.

Drawing a value for (T,U)

A general property of survival models is that the integrated hazard function $\int_{0}^{t} μ (s) d s$ is a random variable with $Exponential (1)$ distribution.

Therefore, conditional on a standard uniform random variable $Z \sim Unif (0, 1)$ , a set of covariates $X (t)$ and a factor $H$ , a sampled value of $T$ , denoted by $t$ , is given by the solution of the following equation:

- \log z = \int_{0}^{t} μ^{T} (s ∣ x (s), h; β^{T}, γ^{T}) d s

(15)

Given the set-up described in Section 4.1 and the hazard function specification of equations (9)-(14), then equation (15) has a closed form solution. In the same way we can sample a value for the censoring time

C

Then, once $t$ and $c$ are obtained, a value for the triplet $(t^{'}, d, d^{*})$ is calculated as follows:

t^{'} = min (t, c)

(16)

d = 1_{[t \leq c \leq 8]}

(17)

d^{*} = 1_{[c < t \leq 8]}

(18)

Implementation

We use weakly informative prior distributions for each parameter, instead of fully non-informative ones in order to foster the convergence of the sampler. In this way, the posterior distribution will not be strongly affected by prior assumptions. Indeed, with enough observations, the effect of the prior should be negligible (¹⁶).

Furthermore, we assume that parameters are a priori independently distributed.

We model the baseline hazard function of $T$ and $C$ using a piecewise constant hazard function as follows:

μ_{0}^{ℓ} (t) = {\begin{matrix} \exp (λ_{00}^{ℓ}) & if t < 4 \\ \exp (λ_{01}^{ℓ}) & if t \geq 4 \end{matrix}

where

ℓ \in {T, C}

, in order to ease the representation, considering a time span divided into two sub-periods of 4 years.

The posterior modeling involves $λ_{0}^{ℓ} = (λ_{00}^{ℓ}, λ_{01}^{ℓ})$ , since in this way we do not need to constrain the parameter space, and the sampler works more efficiently.

The piecewise constant baseline hazard function represents a useful yet flexible specification, mainly if the number of knots (or intervals) increases. Alternative models can be given for example by a Gamma Process for the integrated baseline hazard function (see³⁵), or the use of splines (³⁶). Further alternatives are discussed in the book of³⁷.

For the prior distribution we assume that $λ_{0}^{T}$ , $λ_{0}^{C}$ , $β^{T}$ and $β^{C}$ are vectors of pairwise independent and normally distributed variables with mean 0 and variance equal to 10. When fitting a model with $K = 1$ we assume that $γ_{1}^{T} \sim N (0.5, 100)$ , $γ_{1}^{C} \sim N (1, 100)$ truncated at the lower bound of 0 and $ζ \sim Dirichlet (30, 20)$ . When $K = 2$ , the prior of $γ_{1}^{T}$ and $γ_{1}^{C}$ are as before, while $γ_{2}^{T} \sim N (1, 100)$ , $γ_{2}^{C} \sim N (2, 100)$ truncated at the lower bound of $γ_{1}^{C}$ and $ζ \sim Dirichlet (21, 17, 12)$ . In a similar fashion, when fitting a model with $K = 3$ we assume $γ_{3}^{T} \sim N (2, 100)$ , $γ_{3}^{C} \sim N (3, 100)$ truncated at the lower bound of $γ_{2}^{C}$ and $ζ \sim Dirichlet (20, 15, 10, 5)$ . In all cases, when choosing the Dirichlet distribution parameters, we kept a prior sample size (the sum of the Dirichlet distribution parameters) of 50.

The HMC sampler is run for 5,000 iterations and the first half draws are discarded in order to allow for burn-in and fine-tune the No-U-Turn sampler. Indeed, given our simulation settings, 5,000 iterations are sufficient to allow for the chains to mix and obtain a stationary posterior distribution, which ensures convergence of the sampler.

Results

The results in Table 2 for $K = 1$ show that the 95% credible intervals from the posterior distribution of the parameters include their true value with a coverage close to normal when fitting the true model $K = 1$ and the mixture components are well separated. In case of lesser separated mixture components, the 95% credible intervals still show a coverage larger than 90% for $β^{T}$ , $β^{C}$ and $ζ$ . However, the coverage sensibly decreases for $γ^{T}$ and $γ^{C}$ , whose posterior mean turns out to be divergent from their true values.

Table 2.

Parameter estimates with the Cox proportional hazard model (Cox PH), posterior mean estimate for the models with $K = 0, 1, 2$ , coverage of the 95% confidence interval (for the Cox model) and coverage of the 95% credible intervals.

Par.	True v.	Par. est	Post. mean			95% CI Cov.	$95 %$ Cred. int. Cov.
		Cox PH	K=0	K=1	K=2	Cox PH	K=0	K=1	K=2
Well separated mixture components
$β_{1}^{T}$	$- 0.9$	$- 0.74$	$-$ 0.79	$-$ 0.89	$-$ 0.90	75	83	94	95
$β_{2}^{T}$	0.8	$0.68$	0.78	0.80	0.83	85	95	98	94
$γ_{1}^{T}$	$2$	–	–	1.97		–	–	95	–
$β_{1}^{C}$	$- 1$	$- 0.55$	$-$ 0.75	$-$ 1.02	$-$ 1.04	22	64	92	93
$β_{2}^{C}$	$0.5$	$0.21$	0.43	0.49	0.47	56	82	93	93
$γ_{1}^{C}$	$3.5$	–	–	3.50	–	–	–	98	–
$ζ_{0}$	$0.65$	–	–	–	0.65	–	–	97	–
$ζ_{1}$	$0.35$	–	–	–	0.35	–	–	97	–
Poorly separated mixture components
$β_{1}^{T}$	$- 0.9$	$- 0.90$	$-$ 0.88	$-$ 0.89	$-$ 0.89	92	93	92	94
$β_{2}^{T}$	0.8	$0.76$	0.83	0.82	0.85	89	94	95	94
$γ_{1}^{T}$	0.5	–	–	$-$ 4.12		–	–	67	–
$β_{1}^{C}$	$- 1$	$- 0.89$	$-$ 0.96	$-$ 0.97	$-$ 0.98	92	96	94	96
$β_{2}^{C}$	$0.5$	$0.45$	0.49	0.50	0.46	94	90	90	91
$γ_{1}^{C}$	1	–	–	2.00	–	–	–	79	–
$ζ_{0}$	$0.65$	–	–	–	0.61	–	–	99	–
$ζ_{1}$	$0.35$	–	–	–	0.39	–	–	99	–

When the mixture components are well separated, the 95% credible intervals have a coverage below nominal for almost all parameters when fitting the model assuming that $T$ and $C$ are independently distributed (K=0): $β_{1}^{T}$ and $β_{2}^{T}$ have a coverage of 83% and 95% respectively, and $β_{1}^{C}$ and $β_{2}^{C}$ are included in the credible intervals with 64% and 82% probability respectively. As expected, in case of poorly separated mixture components, without a very large sample size, the model with $K = 0$ tends to show a coverage close to nominal for the parameters $β^{T}$ and $β^{C}$ . Indeed, the true model tends to be closer to the model with $K = 0$ .

The results for the posterior mean follow from those of the credible intervals. We see that the regression parameters $β^{T}$ and $β^{C}$ are in line with respect to their true value when $K = 1, 2$ and the mixture components are well separated. Same applies for the baseline hazard (not shown) when fitting the true model with $K = 1$ , as well as for $γ^{T}$ , $γ^{C}$ and $ζ$ .

Analogous results are obtained for the model with $K = 2$ as shown in Table A1 in Appendix A.1. For this particular case, with a sample of 500 units we note that the posterior mean of $γ_{2}^{T}$ is different compared to its true value, despite the credible intervals show a coverage of 92%. A closer inspection of this result showed that this is the consequence of the small number of events for those units with $H = 2$ (around 5% of the observations), which causes $γ_{2}^{T}$ to be estimated with larger uncertainty.

Table 3 shows that when the mixture components are well separated in all cases the WAIC excludes the lack of heterogeneity, ruling out the model with $K = 0$ . In 90 cases out of 100 it selects the model with K=1, while in 10 cases it selects the model with $K = 2$ . In this latter case, the WAIC for the model $K = 2$ is never higher than 2 compared to the model with $K = 1$ . This means that if we chose the more parsimonious model with $K = 1$ , we do not lose much information. Clearly, these results depend on the values of $γ^{T}$ and $γ^{C}$ , which we purposely chose in order to ensure that the mixture components are well separated. Conversely, when the mixture components are poorly separated, in case of smaller sample sizes the models with $K = 0, 1, 2$ tend to behave similarly. Indeed, when the WAIC picks the model with $K = 0$ , then it is never larger than 1, compared to the true model with $K = 1$ . Similarly, when the WAIC picks the model with $K = 2$ , the difference with the WAIC for the true model is never larger than 3.2.

Table 3.

Number of times out of 100 the WAIC selects the true model based on $K$ .

True	Sample	Mixture	Fitted models
model (K)	size	sep.	K=0	K=1	K=2	K=3
$K = 1$	500	PS	41	$19$	40	0
$K = 1$	500	WS	0	$90$	10	0
$K = 2$	500	WS	0	54	$39$	7
$K = 2$	1,000	WS	0	18	$60$	22
$K = 2$	5,000	WS	0	7	$83$	10

Given our choice of the parameters in equations (13)–(14) related to the case $K = 2$ we could see that the WAIC chooses the true model in 39 cases, while in 54 cases it chose the model with $K = 1$ and in 7 cases the model with $K = 3$ . The model with $K = 0$ is always ruled out. When the model with $K = 3$ is chosen the difference in WAIC is never higher than 1.79, and thus we reach the same conclusions as the previous cases.

These results do not necessarily mean that our approach to model selection has a poor performance as the true $K$ increases: the more complex the model, the larger the sample size should be in order to capture the heterogeneity in the units. Indeed when using a sample size of 1,000 and of 5,000 units, then the true model is selected in 60 and 83 cases out of 100 respectively. In addition, as aforementioned, also the true values of the parameter $γ^{T}$ and $γ^{C}$ play a fundamental role to capture heterogeneity among the units.

This simulation study showed how the model is always capable to spot the heterogeneity in the distribution of $T$ induced by informative censoring. Another key point is that failing to account for this heterogeneity yields a bias in the estimation of the regression coefficients, as shown by their smaller coverage. The probability of selecting the true model increases with sample size, as we can expect in highly parametrized models as analysed in this work. Parsimony is also preserved, since even when fitting a larger model which shows a better performance in terms of WAIC, then selecting a smaller model does not result in a great loss of information. Compared to the use of the Cox proportional hazard model, the results of this work are even more robust if we consider that in our model we needed to specify a baseline hazard function.

Analysis of ACTG 175 dataset

The AIDS Clinical Trial Group (ACTG) 175 study (¹⁷) is a double-blind randomised clinical trial where a total of 2,467 adults infected with HIV type I and CD4 cell counts between 200 and 500/mm $^{3}$ are randomly assigned to four different treatments: i) zidovudine, ii) didandosine, iii) zidovudine plus didandosine and iv) zidovudine plus zalcitabine.

Enrolment run from December 1991 until October 1992, while patients were scheduled to stay under analysis until November 1994. In particular, they are examined at weeks 2, 4, 8 and every 12 weeks afterwards.

For each patient we could observe baseline covariates, such as age, gender, ethnicity, CD4 count, Karnofsky score, prior use of antiretroviral therapy, haemophilia and whether the HIV was symptomatic or not; as well as some other information like the primary endpoint, and the reason for drop out from the study³ . CD4 cell counts are checked at the initial visit and then from week 8 onwards.

The primary endpoints (the time to the event of interest we want to model in this work) of the study were a 50% decrease in CD4 cell counts with respect to the baseline, AIDS or death. At the end date of the observational study there are some patients whose primary end point did not occur, hence these can be considered as non-informatively type 1 censored.

However, some other patients can drop out from the study earlier than the end date without reaching the primary endpoint for some reasons, such as toxicity of the therapy, request of the patients themselves or of the investigator, while some others just do not show up at the next planned visit (loss to follow up). In these cases it is reasonable to assume that these drop out causes are related to the primary endpoint, since for example a patient may ask to discontinue a therapy because she feels better, therefore the death event is less likely to occur (other things being equal). In this case, the Kaplan-Meier estimate of the survival function (which considers censoring as non informative) is likely to be pessimistic about patient survival.

Similar to the work of⁶ and³⁹ we focus on the analysis of those patients treated with zidovudine only. We thus have 614 patients of which 195 experienced the primary endpoint (151 had 50% CD4 reduction, 14 died and 30 developed AIDS).

Hence, of the remaining 419 patients, 183 are Type I censored and 236 are considered as subject to informative censoring.

Furthermore,¹⁷ observed that throughout the study ”younger patients, those reporting injection-drug use and those with lower CD4 cell counts, lower Karnofsky score and symptoms of HIV infection at enrollment are more likely to discontinue the treatment before the study ends”.

For illustration we consider the following covariates vector for $T$ and $C$ :

\begin{aligned} X^{T} (t) = CD4 (t) \\ X^{C} (t) = (age, iv, kar, sym, CD4 (t)) \end{aligned}

We chose

X^{T} (t)

based on the WAIC, while

X^{C} (t)

is chosen on the basis of the aforementioned considerations of¹⁷.

X^{C} (t)

is the same vector of covariates chosen in⁶ and³⁹.

In particular, age denotes the age of the patient at time of randomization, iv is an indicator of intravenous drug use, kar indicates the Karnofsky score, sym is a binary variable indicating the presence of symptoms of HIV infection and CD4 $(t)$ indicates the CD4 cells count at each visit $(t)$ . We assume for simplicity that the CD4 cell count is constant between one visit and the other.

From a look at the Kaplan-Meier curve of the integrated hazard function, shown in Figure 1 we could have a rough idea about the shape of the baseline hazard function which seems approximately piecewise linear for both $T$ and $C$ . For this reason we specify a piecewise constant baseline hazard function for $μ_{T} (\cdot)$ and $μ_{C} (\cdot)$ with two breakpoints. As stated in Section 4.3 other specifications for the non-parametric baseline hazard function can be possible. For this dataset we noted that the use of a Gamma process would rather slower the HMC sampler without returning better results.

Figure 1.

Kaplan Meier estimate of the cumulative hazard function for T (solid line) and for the censoring (dotted line).

The breakpoints have been chosen by fitting a piecewise linear function with two breakpoints where the cumulative hazard function obtained from the Kaplan-Meier estimator is regressed against the time to event using the package segmented (⁴⁰). The breakpoint vector for $μ_{0}^{T}$ is $(368.5, 1006.2)$ , while the vector for $μ_{0}^{C}$ is $(841, 958.2)$ .

In a similar fashion with respect to the simulation study we chose weakly informative prior distributions. Hence, we assume that $λ_{0}^{T}$ , $λ_{0}^{U}$ , $β^{T}$ and $β^{C}$ are vectors of independent and normal random variables with mean 0 and standatd deviation equal to 10. $γ^{T}$ and $γ^{C}$ are chosen in the same way as in the simulation study, while assuming a prior sample size for $ζ$ equal to 25 for all fitted models. For $K = 1, 2, 3$ different prior distribution have been specified for $ζ$ . This does not exclude the possibility for the researcher to use expert judgement when specifying the prior distribution. The choice of the best model is again based on the WAIC.

For each value of $K$ we run 4 parallel chains, each for 20,000 iterations (10,000 used as warm-up).

In addition, we compare these results with those obtainable from fitting a Cox proportional hazard model, which assumes independence between $T$ and $C$ .

Results

We observe that for the analysed models the HMC sampler converged towards the posterior distribution, as we can see for example from the traceplots of Figure A1, where we showed only the first two chains in Section A.2.1, and the $\hat{R}$ statistic of⁴¹ for all parameters which is always equal to 1. The marginal posterior density of each parameter is nearly symmetric, except for $γ_{1}^{C}$ due to the chosen local identifiability constraint (Figure A2).

The WAIC leads to the choice of the model with $K = 1$ (Table 5), whose results are shown in Table 4. As emphasized in the simulation study of Section 4 the true value of $K$ may be higher than 1 due to the relatively small sample size, although we did not have a great loss of information. As we can see, the WAIC progressively increase with $K$ after hitting its lowest point at the value of 1.

Table 4.

Cox proportional hazard model estimate of the parameters (Cox PH) and summary of the posterior distribution of the parameters for the model with $K = 1$ .

Par.	Cox PH	Post. mean	Post. quantile
	(est. s.err.)	(st. dev.)	2.5-th%	50-th%	97.5-th%
$λ_{00}^{T}$	–	$-$ 3.85(0.32)	$-$ 4.48	$-$ 3.85	$-$ 3.21
$λ_{01}^{T}$	–	$-$ 3.17(0.35)	$-$ 3.84	$- 3.18$	$-$ 2.45
$λ_{02}^{T}$	–	$-$ 3.78(0.59)	$-$ 4.96	$-$ 3.77	$-$ 2.66
$λ_{00}^{C}$	–	$-$ 4.40(1.03)	$-$ 6.53	$-$ 4.39	$-$ 2.45
$λ_{01}^{C}$	–	$-$ 3.08(1.04)	$-$ 5.20	$-$ 3.06	$-$ 1.13
$λ_{02}^{C}$	–	$-$ 1.93(1.03)	$-$ 4.03	$-$ 1.91	0.03
$β_{CD4( t)}^{T}$	$- 0.0104$ (0.0007)	$- 0.0139 (0.00104)$	$- 0.016$	$- 0.0139$	$- 0.0119$
$γ_{1}^{T}$	–	$- 2.03 (0.33)$	$- 2.68$	$- 2.03$	$- 1.36$
$β_{age}^{C}$	$- 0.046$ (0.0089)	$- 0.05$ (0.01)	$- 0.07$	$- 0.05$	$- 0.04$
$β_{iv}^{C}$	0.74(0.1678)	$0.76$ (0.19)	$0.39$	$0.76$	$1.12$
$β_{kar}^{C}$	$- 0.019$ (0.0107)	$- 0.03$ (0.01)	$- 0.05$	$- 0.03$	$- 0.01$
$β_{sym}^{C}$	0.25(0.1770)	$0.18$ (0.19)	$- 0.19$	$0.19$	$0.55$
$β_{CD4( t)}^{C}$	$- 0.00069$ (0.0005)	$- 0.00085$ (0.0005)	$- 0.0019$	$- 0.00081$	$- 0.000057$
$γ_{1}^{C}$	–	$1.07 (0.62)$	$0.08$	$1.02$	$2.43$
$ζ_{0}$	–	0.45(0.08)	0.29	0.45	0.61
$ζ_{1}$	–	0.55(0.08)	0.39	0.55	0.71

Table 5.

WAIC and effective number of parameters ( ${\,p}_{WAIC}$ ) for the models with $K = 0, 1, 2, 3$ .

	K=0	K=1	K=2	K=3
WAIC	3,462.76	$3, 456.4$	3,457.34	3,458.13
${\,p}_{WAIC}$	12.30	13.83	14.90	15.48

For all values of $K$ we see that lower time to primary endpoints are associated with lower CD4 cell counts, coherently to our expectations. This is because it is an indicator of progression of the HIV and of immunologic health.

Again, when analysing the time to censoring, we can see that our results are consistent with the observations of¹⁷ mentioned in the previous section, hence with the evidences obtained by⁶.

The standard errors of the parameter estimates obtained with the Cox PH model are smaller than the standard deviations of the posterior distribution of the parameters from our Bayesian modelling approach. This is presumably due to the fact that the Cox model tends to yield lower uncertainty about parameter estimates than the joint model developed in this work since the latter has more parameters, as we specify a baseline hazard and include a latent factor.

The covariates intravenous drug use and the presence of symptoms do not show statistical significance, which may be explained by the role of CD4 cell counts. For example,⁴² in the analysis of the progression of the HIV between hard drug users and other subjects in the Women’s Interagency HIV Study, noted that hard drug users tend to drop out from the study earlier, and that those subjects have lower mean of CD4 cell counts. Thus the effect of intravenous drug use may be mediated by CD4 cell count. Additionally, the CD4 count may be a confounder for the effect of symptoms and the direct effect of symptoms is relatively small when corrected for CD4 cell count.

For this particular dataset the summaries of the posterior distribution of $β^{T}$ and $β^{C}$ from the model with $K = 1$ are very close to those for the models with $K = 0, 2, 3$ , as we can see in Tables A2–A4 in Appendix A.2.2, although we can see some major differences with reference to the baseline hazard functions (from 16 to 40% in absolute value), $β_{C D 4 (t)}^{T}$ (25% higher than the same parameter for the model with $K = 0$ ) and $β_{C D 4 (t)}^{C}$ (34% lower). As we could see also from the simulation study, the assumption of independence may lead to considerable risk of wrong parameter estimates following model misspecification. When comparing the results of the latent factor model with $K = 1$ with the Cox proportional hazard model which leaves the baseline hazard function unspecified (Table 4), once again we see that the heterogeneity may have a considerable effect on the value of parameter estimates which may be biased if this further heterogeneity is not taken into account. For example, the Cox model underestimates the effect of the CD4 cell counts.

The parameter $γ_{1}^{T}$ , has a marginal posterior distribution such that values around zero are very unlikely. This means that there is further heterogeneity in $T$ which may be captured from other factors. When coupled with the distribution of $γ_{1}^{C}$ (Figure A2) we can see that this heterogeneity may be due to the presence of informative censoring (posterior mean equal to 1.07). A further sign of convergence is that large probability mass lies far from the boundary of the sample space, as determined by the constraint.

In addition, we compare these two models in terms of goodness of fit, which can help in the choice between these two models. At this purpose we look at the Cox-Snell residuals, $r_{i}^{C S}$ (⁴³), defined as the integrated hazard function. A well known result in mathematical statistics states that these residuals have an Exp(1) distribution if the model has been properly specified.

For the Cox PH model, these can be derived as a by product of the estimation process by using the package survival (⁴⁴), while for the latent factor model of this work, $r_{i}^{C S}$ is calculated as follows:

\begin{aligned} r_{i}^{C S} & = E [- \ln S_{T} (t_{i}^{'} ∣ x_{i}^{T} (\cdot), x_{i}^{C} (\cdot), c_{i}, d_{i}^{*}, λ^{T}, λ^{C}, β^{T}, β^{C}, ζ)] \\ = E [- \ln \sum_{k = 0}^{K} A_{k} + \ln \sum_{k = 0}^{K} B_{k}] \end{aligned}

(19)

where

A_{k} = \exp [- \int_{0}^{t_{i}^{'}} μ^{T} (s ∣ x_{i}^{T} (s), h; β^{T}, γ^{T}) d s] B_{k}

(20)

B_{k} = \exp [- \int_{0}^{t_{i}^{'}} μ^{C} (s ∣ x_{i}^{C} (s), h; β^{C}, γ^{C}) d s] μ^{C} {(t_{i}^{'} ∣ x_{i}^{C} (t_{i}^{'}), h; β^{C}, γ^{C})}^{d_{i}^{*}}

(21)

In other words, unlike the Cox PH model where we have a point estimate for the integrated hazard function, we average the integrated hazard function for the latent factor model over the draws from the posterior distribution. In addition, since we jointly model the time to event and the censoring time, we need to consider the probability distribution of

T

conditional on the censoring time (other than the covariates).

In order to check whether $r_{i}^{C S}$ has Exp(1) distribution, we then calculate the Kaplan-Meier estimate of such residuals, and we plot $r_{i}^{C S}$ against their Kaplan-Meier estimate of the integrated hazard function, as shown in Figure 2.

Figure 2.

Cox-Snell residuals for the latent factor model with $K = 1$ (LF K1, o) and for the Cox PH model (x) plotted against their cumulative (or integrated) hazard function.

We see that the Cox-Snell residuals of the latent factor approach of this work follows more closely the 45-degrees straight line, compared to the Cox PH model, meaning that the former returns a better fit for the distribution of the time to event $T$ .

A by-product of this analysis is the possibility to analyse the profile of the two subgroups (since K=1) arising from this modelling approach. Using Bayes’ theorem we can calculate the posterior distribution of $H$ for each patient $q_{i h}$ :

\begin{aligned} q_{i h} = Pr (H = h ∣ t_{i}^{'}, x_{i}^{T} (\cdot), x_{i}^{C} (\cdot)) & = \frac{Pr (H = h) Pr (t_{i}^{'}, x_{i}^{T} (\cdot), x_{i}^{C} (\cdot) ∣ H = h)}{\sum_{k = 0}^{K} Pr (H = k) Pr (t_{i}^{'}, x_{i}^{T} (\cdot), x_{i}^{C} (\cdot) ∣ H = k)} \\ = \frac{ζ_{h} A_{h}^{'}}{\sum_{k = 0}^{K} ζ_{k} A_{k}^{'}} \end{aligned}

(22)

where

\begin{aligned} A_{h}^{'} = & A_{h} μ^{T} {(t_{i}^{'} ∣ x_{i}^{T} (t_{i}^{'}), h; β^{T}, γ^{T})}^{d_{i}} \\ = & \exp [- \int_{0}^{t_{i}^{'}} μ^{T} (s ∣ x_{i}^{T} (s), h; β^{T}, γ^{T}) d s] μ^{T} {(t_{i}^{'} ∣ x_{i}^{T} (t_{i}^{'}), h; β^{T}, γ^{T})}^{d_{i}} \\ \times \exp [- \int_{0}^{t_{i}^{'}} μ^{C} (s ∣ x_{i}^{C} (s), h; β^{C}, γ^{C}) d s] μ^{C} {(t_{i}^{'} ∣ x_{i}^{C} (t_{i}^{'}), h; β^{C}, γ^{C})}^{d_{i}^{*}} \end{aligned}

(23)

The patient can be hard-assigned to each group by using the Bayes’ rule: let

κ_{i}

denote to which group the patient has been assigned. According to this rule we obtain that

κ_{i} = h

q_{i h} \geq q_{i j}

for

j = 0, \dots, K

, that is, the patient is assigned to the group for which

q_{i \cdot}

is highest.

For this dataset we could observe for example that 234 out of 236 censored patients are classified as corresponding to the group with $H = 1$ , which is characterized by a lower hazard function for the primary event (since $γ_{1}^{T} < 0$ ) and (obviously) a higher hazard for the censoring event. In addition, patients with lower baseline CD4 cell count are more likely to be classified in the group with $H = 0$ (mean 345.64 against a mean of 365.28 for $H = 1$ ). Further analysis with other covariates are likewise possible. We hereby take into account the CD4 as it is part of the hazard function specification for $T$ and $C$ .

Extensions

Analysis of competing risks

We can extend the approach described so far to the analysis of the joint distribution of $(T_{1}, \dots, T_{M})$ where $T_{j}$ is the time to event for the $j$ th cause of decrement ( $j = 1, \dots, M$ ). Within a competing risks framework only $T^{'} = min (T_{1}, \dots, T_{M})$ can be observed.

Suppose that conditional on the latent factor $H$ , $(T_{1}, \dots, T_{M})$ is a vector of pairwise independently distributed times to event, whose distribution is characterized by the following hazard function:

μ^{m} (t ∣ x^{m} (t), h; β^{m}, γ^{m}) = μ_{0}^{m} (t) \exp (β^{m^{'}} x^{m} (t) + γ_{1}^{m} h_{1} + \dots + γ_{K}^{m} h_{K})

(24)

for

m = 1, \dots, M

In this way, we can also account for further independence assumptions: for example, if $γ^{m} = (γ_{1}^{m}, \dots, γ_{K}^{m}) = 0$ , then $T_{m}$ is independently distributed with respect to the vector $(T_{1}, \dots, T_{m - 1}, T_{m + 1}, \dots, T_{M})$ .

More generally, instead of having one common latent factor for all competing risks, we can have $H_{m}$ ( $m = 1, \dots, M$ ), such that the statistical association among time to events is characterized by the joint distribution of $(H_{1}, \dots, H_{M})$ .

Clustered units

Suppose we deal with units divided into clusters, such as patients in different clinics, each corresponding to a level of the categorical variable $G$ , whose value is known for each patient.

It is possible to account for the specific cluster either by including $G$ as a factor in the hazard function, or by letting $G$ to explain the heterogeneity in the distribution of $H$ for each unit.

In the latter case, we can reasonably assume that $G$ is independently distributed with respect to $T$ and $C$ conditional on $x^{T}$ , $x^{C}$ and $H$ . Then, the joint distribution of $(T, C)$ for the $i$ th individual becomes:

\begin{aligned} f (t, c ∣ x_{i}^{T} (t), x_{i}^{C} (c); β^{T}, γ^{T}, β^{C}, γ^{C}) \\ = \sum_{k = 0}^{K} Pr (H = k ∣ g_{i}) [\exp (- \int_{0}^{t} μ_{0}^{T} (s) \exp (β^{T^{'}} x_{i}^{T} (s) + γ_{k}^{T}) d s) \\ \times μ_{0}^{T} (t) \exp (β^{T^{'}} x_{i}^{T} (t) + γ_{k}^{T})] \\ \times [\exp (- \int_{0}^{c} μ_{0}^{C} (s) \exp (β^{C^{'}} x_{i}^{C} (s) + γ_{k}^{C}) d s) (μ_{0}^{C} (c) \exp (β^{C^{'}} x_{i}^{C} (c) + γ_{k}^{C}))] \end{aligned}

(25)

This modelling framework allows also to compare clusters. One possibility is the use of the Hellinger distance, which is used for example in topic models to qualitatively compare the topical content of two documents (⁴⁵). Let

CL1

and

CL2

be any two different known clusters for the units in the sample: the Hellinger distance

d_{H} (CL1, CL2)

d_{H} (CL1, CL2) = \sum_{k = 0}^{K} {(\sqrt{Pr (H = k ∣ G = CL1)} - \sqrt{Pr (H = k ∣ G = CL2)})}^{2}

(26)

When Bayesian techniques are used for inference, as we do in the present work, then the Hellinger distance is given by the expectation of

d_{H} (\cdot, \cdot)

with respect to the posterior distribution of the parameters.

Conclusions

This work illustrated the potential of the use of latent factors to accomodate for the possibility of informative censoring in survival analysis. We rejoined existing literature, and described how we can generalize the specification of the heterogeneity component of the hazard function. We also emphasized how this methodology can be extended to the analysis of competing risks (Section 6.1) and of data from different known clusters (Section 6.2).

The inferential challenges of this type of ill-posed models have been addressed by means of a fully Bayesian approach and we described how modern computational tools such as Hamiltonian Monte Carlo methods are strongly recommended in these circumstances.

We applied those methodologies to the analysis of the ACTG175 clinical trial and we found that modelling the heterogeneity and the presence of informative censoring turns out to improve the model fit in terms of information criterion when compared with standard approaches such as the assumption of non-informative censoring.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

ORCID iD

Francesco Ungolo

Notes

Appendix

References

Tsiatis

. A nonidentifiability aspect of the problem of competing risks. Proc Natl Acad Sci U.S.A 1975; 72: 20–22.

Crowder

. On assessing independence of competing risks when failure times are discrete. Lifetime Data Anal 1996; 2: 195–209.

Crowder

. A test for independence of competing risks with discrete failure times. Lifetime Data Anal 1997; 3: 215.

Emoto

Matthews

. A weibull model for dependent censoring. Ann Statist 1990; 18: 1556–1577.

Zheng

Klein

. Estimates of marginal survival for dependent competing risks based on an assumed copula. Biometrika 1995; 82: 127–138.

Scharfstein

Robins

. Estimation of the failure time distribution in the presence of informative censoring. Biometrika 2002; 89: 617–634.

Jackson

White

Seaman

, et al. Relaxing the independent censoring assumption in the cox proportional hazards model using multiple imputation. Stat Med 2014; 33: 4681–4694.

Cox

. Regression models and life-tables. Journal of the Royal Statistical Society Series B (Methodological) 1972; 34: 187–220.

Huang

Wolfe

. A frailty model for informative censoring. Biometrics 2002; 58: 510–520.

10.

Rowley

Garmo

Van Hemelrijck

, et al. A latent class model for competing risks. Stat Med 2017; 36: 2100–2119.

11.

Watanabe

. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res 2010; 11: 3571–3594.

12.

Drton

Plummer

. A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2017; 79: 323–380.

13.

Pereyra

. Maximum-a-posteriori estimation with Bayesian confidence regions. SIAM J Imaging Sci 2017; 10: 285–302.

14.

Akaike

. A new look at the statistical model identification. IEEE Trans Automat Contr 1974; 19: 716–723.

15.

Schwarz

. Estimating the dimension of a model. Ann Statist 1978; 6: 461–464.

16.

Gelman

Carlin

Stern

, et al. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013.

17.

Hammer

Katzenstein

Hughes

, et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N Engl J Med 1996; 335: 1081–1090.

18.

Lindsay

. Properties of the Maximum Likelihood Estimator of a Mixing Distribution. Dordrecht: Springer Netherlands, 1981. ISBN 978-94-009-8552-0, 1981. pp. 95109. DOI: 10.1007/978-94-009-8552-0 8.

19.

Lindsay

. The geometry of mixture likelihoods: A general theory. Ann Statist 1983a; 11: 86–94.

20.

Lindsay

. the geometry of mixture likelihoods, part II: The exponential family. Ann Statist 1983b; 11: 783–792.

21.

Heckman

Singer

. A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 1984; 52: 271–320.

22.

Heckman

Honorè

. The identifiability of the competing risks model. Biometrika 1989; 76: 325–330.

23.

Abbring

Van Den Berg

. The identifiability of the mixed proportional hazards competing risks model. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2003; 65: 701–710.

24.

Catchpole

Morgan

BJT

. Detecting parameter redundancy. Biometrika 1997; 84: 187–196.

25.

McLachlan

Peel

. Finite mixture models. Wiley Series in Probability and Statistics, New York, 2000.

26.

Titterington

Smith

AFM

Makov

. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.

27.

Marin

J-M

Mengersen

Robert

. Bayesian modelling and inference on mixtures of distributions. In Dey, D. and Rao, C., editors, Handbook of Statistics: Volume 25. Elsevier, 2005.

28.

Betancourt

. Identifying Bayesian Mixture Models, https://betanalpha.github.io/assets/case˙studies/identifying˙mixture˙models.html, 2017.

29.

Stan Development Team, Stan reference manual. http://mc-stan.org/, version 2.18, 2018.

30.

Watanabe

. A widely applicable Bayesian information criterion. J Mach Learn Res 2013; 14: 867–897.

31.

Neal

. MCMC using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2010; 54: 113–162.

32.

Stan Development Team. RStan: the R interface to Stan. R package version 2.17.3, http://mc-stan.org/, 2018.

33.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/, 2013.

34.

Hoffman

Gelman

. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian monte carlo. J Mach Learn Res 2014; 15: 1593–1623.

35.

Kalbfleisch

. Non-parametric Bayesian analysis of survival time data. Journal of the Royal Statistical Society Series B (Methodological) 1978; 40: 214–221.

36.

Bremhorst

Lambert

. Flexible estimation in cure survival models using Bayesian P-splines. Comput Stat Data Anal 2016; 93: 270–284. DOI: 10.1016/j.csda.2014.05.009. http://www.sciencedirect.com/science/article/pii/S0167947314001492 .

37.

Ibrahim

Chen

Sinha

. Bayesian Survival Analysis, Berlin, Heidelberg, New York, London, Paris, Tokyo, Hong Kong: Springer Verlag, 2001, pp. 479, ISBN: 0-387-95277-2, 2001.

38.

Elashoff

. Joint Modeling of Longitudinal and Time-to-Event Data. Chapman and Hall/CRC, DOI:10.1201/9781315374871, 2016.

39.

Rotnitzky

Farall

Bergesio

, et al. Analysis of failure time data under competing censoring mechanisms. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2007; 69: 307–327. DOI: 10.1111/j.1467-9868.2007.00590.x.

40.

Muggeo

. segmented: an R package to fit regression models with broken-line relationships. R News 2008; 8: 20–25. https://cran.r-project.org/doc/Rnews/ .

41.

Gelman

Rubin

. Inference from iterative simulation using multiple sequences. Statist Sci 1992; 7: 457–472. DOI: DOI:10.1214/ss/1177011136.

42.

Moore

Carlson

MaWhinney

, et al. A dirichlet process mixture model for non-ignorable dropout. Bayesian Analysis 2020; 15: 1139–1167. DOI: 10.1214/19-BA1181.

43.

Cox

Snell

. A general definition of residuals. Journal of the Royal Statistical Society Series B (Methodological) 1968; 30: 248–275.

44.

Therneau

. A Package for Survival Analysis in R. R package version 3.1–12, 2020. https://CRAN.R-project.org/package=survival .

45.

Blei

Lafferty

. A correlated topic model of science. Ann Appl Stat 2007; 1: 17–35.