
Abstract

Background

The Health Technology Assessment agencies typically require an economic evaluation considering a lifetime horizon for interventions affecting survival. However, survival data are often censored and are typically analyzed assuming the censoring mechanism independent of the event process. This assumption may lead to biased results when the censoring mechanism is informative.

Methods

We propose a flexible approach to jointly model the participants experiencing an event and censored participants by incorporating the pattern-mixture (PM) model in the fractional polynomial (FP) model within the network meta-analysis (NMA) framework. We introduce the informative censoring hazard ratio parameter that quantifies the departure from the censored at random assumption. The FP-PM model is exemplified in an NMA of the overall survival from non-small cell lung carcinoma studies using Bayesian methods.

Results

The results on hazard ratio and survival from the FP-PM model are similar to those from the FP model. However, the posterior standard deviation of the hazard ratio is slightly greater when censored data are modeled because the uncertainty induced by censoring is naturally accounted for in the FP-PM model. The between-study standard deviation is almost identical in both models due to the low censoring rate across the studies. At the end of the corresponding studies, the informative censoring hazard ratio demonstrated a possible departure from the censored at random assumption for gefitinib and best supportive care.

Conclusions

The proposed method offers a comprehensive sensitivity analysis framework to examine the robustness of the NMA results to clinically plausible censoring scenarios.

Keywords

Censoring, fractional polynomials, network meta-analysis, pattern-mixture model, survival analysis

Background

The Health Technology Assessment agencies typically require an economic evaluation considering a lifetime horizon for interventions affecting survival. Decision-making additionally requires comparisons of all relevant competing interventions. When multiple studies form a network of evidence comparing different treatments, directly and indirectly, the available evidence can be synthesized employing network meta-analysis (NMA).¹ Network meta-analysis provides an internally coherent set of relative treatment effects for all possible pairwise comparisons, assisting the interested stakeholders in deciding which intervention(s) should be considered for a specific disease.² However, accurately estimating the survival benefit associated with the new intervention compared to all relevant competing interventions becomes problematic when censoring occurs.

In the absence of individual participant data (IPD), the authors of systematic reviews resort to the reported hazard ratios (HR) and standard errors. These are usually estimated using either the Cox proportional hazards (PH) model, an approximation of the HR using log-rank analysis, or reconstructed data from published Kaplan-Meier (KM) curves.^3,4 Cox PH regression is the most widely used model for time-to-event data for the simplicity of its assumptions: the HR is assumed to be constant over time (the PH assumption), and the relationship between the response variable in the logarithmic scale and the predictors is linear.⁵ However, these assumptions are difficult to defend in practice, and the lack of their validity may lead to biased results.

Several methods have been proposed in the literature to synthesize time-to-event data as an alternative to the PH assumption, which provides one HR per pairwise comparison in the network. Initially, the use of parametric models was proposed,⁶ which then was extended to the family of fractional polynomial (FP) models offering a multi-dimensional treatment effect approach.^5,7 Saramago et al.⁸ extended the NMA model to allow simultaneous synthesis of IPD and aggregate survival data, assuming that event times are Weibull distributed. However, the distribution assumption can only be assessed in IPD studies. A recent article proposed using restricted cubic splines without making distributional assumptions.⁹ All the models proposed are applied in the Bayesian framework.

In the present study, we focus on the FP model of Jansen⁷ as probably the most commonly used flexible survival model in the NMA framework.¹⁰ The FP approach of Jansen⁷ relies on aggregate data from digitized survival curves divided into multiple consecutive time intervals over the follow-up period.⁷ The observed number of event times in each interval is then modeled as a function of the number of participants alive at time t, assuming all censoring occurs before death in the consecutive follow-up intervals. Censoring is a form of the missing participant outcome data (MOD) problem and is typically handled assuming the censoring mechanism independent of the event process, namely, non-informative censoring.¹¹ This is the censored at random (CAR) assumption.¹¹ Censored at random is an untestable assumption. It may result in imprecise absolute survival estimates, especially when the amount of censoring is substantial.¹² In practice, censoring is usually informative, introducing bias to the results when not addressed appropriately in the analysis.¹³ For instance, exclusion and imputation of the censored participants are popular approaches for their simplicity. However, they are conceptually and statistically suboptimal for taking exclusion and imputation at face value and discounting the uncertainty induced by censoring, increasing the risk of biased conclusions. Therefore, it is imperative to investigate the robustness of the conclusions to different yet clinically plausible scenarios about the censoring mechanism while accounting for the uncertainty and maintaining the randomized sample size.

Statistical modeling of MOD has received attention in recent years for aggregate binary and continuous outcomes as an elegant framework that acknowledges the uncertainty about the assumed missingness scenarios (e.g., Mavridis et al.,¹⁴ Spineli et al.,¹⁵ and Turner et al.¹⁶). This can be achieved under a model that reflects the distribution of the outcome in completers and missing participants, known as the pattern-mixture (PM) model.¹⁷ Modeling MOD via the PM model provides bias-adjusted results and a thorough investigation of the underlying missingness mechanisms across different studies and interventions.^15,16 The PM model incorporates an informative missingness parameter that quantifies the departure from the missing at random assumption, indicating whether the missingness mechanism may be informative.^14–16 To our knowledge, the PM model has not been implemented in aggregate time-to-event data.

The present study aims to fill the methodological gap on properly addressing censoring in aggregate time-to-event data when access to IPD is not granted to allow for more sophisticated analyses. Specifically, we extend the FP modeling framework proposed by Jansen⁷ for NMA to allow for the joint synthesis of observed and censored time-to-event data via the PM model, referred to as the FP-PM model. This extension designates a sensitivity framework that allows 'learning' about the censoring mechanisms in the network and provides bias-adjusted NMA results while accounting for the uncertainty induced by censoring.¹⁰ The article is organized as follows. First, we introduce the motivating example, and present the FP and Bayesian random-effects NMA models for a time-to-event outcome without censored participants (as described in Jansen⁷). Then, we expand the model to incorporate censoring through the PM model (the FP-PM model). We illustrate the FP model after excluding the censored participants and the FP-PM model under the ‘on average' CAR assumption. We discuss the findings and limitations of the study and conclude with recommendations.

Motivating example

As a motivating example, we revisit the example considered in Jansen.⁷ The author compared four therapies (best supportive care (BSC), gefitinib, pemetrexed, and docetaxel) in participants with non-small cell lung cancer (Figure 1). The example included seven two-arm studies^18–24 with 3,288 participants (median: 342, interquartile range (IQR): 156−530). The median follow-up time was 11 months, with IQR from 9 to 18. The censoring in the network was relatively low across the studies (median %: 6, IQR: 5.6 − 7.0) and balanced in the compared arms (median difference in %: 0.1, IQR: −3.5 − 4.4). An HR<1 (and equivalently, logHR<0) indicated a beneficial effect of the first intervention in the comparison for increasing the overall (and progression-free) survival. We considered the random-effects NMA for naturally encompassing the statistical heterogeneity expected in the included studies.²⁵

Figure 1.

The network of randomized controlled studies. The thickness of lines and the size of nodes are proportional to the number of studies in the corresponding comparisons and treatments, respectively. BSC, Best Supportive Care.

For each included study, we used the method developed by Guyot et al.²⁶ to derive the reconstructed IPD: the number of randomized and censored participants and the number of observed events in intervals of 2 months (dt=2). The method uses iterative numerical methods to solve inverted KM equations. We digitized the KM curves using DigitizeIt v2.3.²⁷ It should be noted that the dataset in Jansen⁷ provides incorrect numbers for the study comparing docetaxel with BSC.²⁴ This study²⁴ enrolled 100 participants in the BSC arm and 104 in the docetaxel arm (49 at 100 mg/m², 55 at 75 mg/m²) rather than 222 and 441 participants, respectively, as provided in Jansen.⁷
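The interval counts described above (participants at risk, censored participants, and observed events per 2-month interval) could be tabulated from reconstructed IPD along the following lines. This is a minimal sketch, not the reconstruction algorithm of Guyot et al.; the function name `tabulate_intervals` and the example data are hypothetical, and a participant whose follow-up ends exactly on an interval boundary is assigned, by assumption, to the interval that starts there.

```python
from math import floor

def tabulate_intervals(times, events, dt=2.0):
    """Count at-risk (n), events (r), and censored (m) per interval [t, t+dt).

    times  : follow-up time of each participant (months)
    events : 1 if the event was observed at that time, 0 if censored
    """
    n_intervals = floor(max(times) / dt) + 1
    rows = [{"t": j * dt, "n": 0, "r": 0, "m": 0} for j in range(n_intervals)]
    for time, event in zip(times, events):
        last = floor(time / dt)        # interval in which follow-up ends
        for j in range(last + 1):      # at risk at the start of each interval up to it
            rows[j]["n"] += 1
        if event:
            rows[last]["r"] += 1
        else:
            rows[last]["m"] += 1
    return rows

# Hypothetical reconstructed data for one arm (not taken from any included study)
times = [0.5, 1.8, 2.2, 3.9, 4.1, 5.0, 5.5]
events = [1, 1, 0, 1, 1, 0, 1]
table = tabulate_intervals(times, events)
for row in table:
    print(row)
```

In each row, the number at risk in the next interval equals the current number at risk minus the events and the censored participants of the current interval.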

Methods

Fractional polynomial models

Royston and Altman²⁸ introduced the concept of FP models as a flexible alternative to the standard parametric models for time-to-event data to capture a wide range of shapes of the survival curve.¹⁰ Specifically, the hazard of an intervention is modeled as a function of the PH model and power transformations of time to reflect the change in the hazard over time. A first-order FP model to estimate the natural logarithm of the hazard in arm k of a two-arm randomized controlled study at time t is defined as

\log(h_{kt}) = β_{0k} + β_{1k} t^{p_{1}}

and similarly, for the second-order FP model:

\log(h_{kt}) = β_{0k} + β_{1k} t^{p_{1}} + β_{2k} t^{p_{2}}

The powers p₁ and p₂ can be chosen from the set {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, where different choices correspond to different hazard functions, thus allowing a range of different shapes (e.g., monotonically increasing, monotonically decreasing, or constant) that can fit the data more closely than the simpler PH model.^7,28 Researchers should bear in mind two considerations when selecting the pair of powers (p₁, p₂): (i) the clinical relevance of the model in terms of the investigated health condition and interventions (namely, the shape of the hazard function over time), and (ii) the data availability in association with the number of model parameters that need to be estimated to prevent overfitting. We elaborate on our perspectives on selecting the values for p₁ and p₂ in the Discussion section in the context of our motivating example.

The log hazard in arm k is a function of the log hazard in the baseline arm of the study and the log HR,

\begin{pmatrix} β_{0k} \\ β_{1k} \\ β_{2k} \end{pmatrix} = \begin{cases} \begin{pmatrix} u_{0} \\ u_{1} \\ u_{2} \end{pmatrix} & \text{for arm A} \\ \begin{pmatrix} u_{0} \\ u_{1} \\ u_{2} \end{pmatrix} + \begin{pmatrix} d_{0} \\ d_{1} \\ d_{2} \end{pmatrix} & \text{for arm B} \end{cases}

where each component of the vector $(u_{0}, u_{1}, u_{2})^{T}$ refers to the log hazard of the baseline arm in the study under the PH model, first and second orders, respectively, and each component of the vector $(d_{0}, d_{1}, d_{2})^{T}$ is the log HR of B relative to A under the PH model, first and second orders, respectively. Specifically, d₁ and d₂ reflect the change in the log HR over time under the corresponding orders. Under the PH assumption, d₁ and d₂ equal 0, and d₁,d₂≠0 suggests a departure from the PH assumption. For the first-order FP model, the parameters β_2k, u₂ and d₂ are dropped.
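To make the parameterization concrete, the log hazard in each arm and the resulting time-varying log HR can be evaluated as in the sketch below. The coefficient values are hypothetical. The sketch also assumes the usual fractional polynomial conventions of Royston and Altman, which are not spelled out above: a power of 0 is read as log(t), and a repeated power (p₁ = p₂) multiplies the second term by log(t).

```python
import math

def fp_term(t, p):
    """Power transformation of time; p = 0 is read as log(t) (assumed FP convention)."""
    return math.log(t) if p == 0 else t ** p

def log_hazard(t, beta, powers):
    """Log hazard under a first-order (one power) or second-order (two powers) FP model."""
    value = beta[0] + beta[1] * fp_term(t, powers[0])
    if len(powers) == 2:
        second = fp_term(t, powers[1])
        if powers[0] == powers[1]:
            second *= math.log(t)   # repeated-powers convention (assumed)
        value += beta[2] * second
    return value

# Hypothetical coefficients, for illustration only
u = (-2.5, 0.8, 0.02)    # baseline arm A: (u0, u1, u2)
d = (0.3, -0.4, 0.01)    # log HR components of B relative to A: (d0, d1, d2)
beta_A = u
beta_B = tuple(ui + di for ui, di in zip(u, d))
powers = (-2, 1)

t = 6.0  # months
log_hr = log_hazard(t, beta_B, powers) - log_hazard(t, beta_A, powers)
print(round(log_hr, 4))  # equals d0 + d1 * t**-2 + d2 * t
```

Because the baseline terms cancel, the log HR at time t depends only on (d₀, d₁, d₂) and the chosen powers; setting d₁ = d₂ = 0 returns a constant log HR, i.e. the PH model.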

Bayesian random-effects NMA model without censoring

Consider a network of N studies investigating different sets of T interventions for a time-to-event outcome. The KM curve of the outcome of interest in arm k=1, 2, …,a_i of study i= 1, 2,…,N, divided into consecutive time intervals [t,t+dt] illustrates the cumulative proportion of participants without the outcome at each time interval. Then the number of events in arm k of study i in the interval [t,t+dt] is assumed to follow a binomial distribution,

r_{ikt} \sim \mathrm{Bin}(p_{ikt}, n_{ikt})

where n_ikt is the number of participants at risk in arm k of study i at timepoint t, and p_ikt is the cumulative risk of an event in that interval expressed as a function of the hazard rate,

p_{ikt} = 1 - \exp(-h_{ikt} \, dt)

with dt being the length of the investigated interval. Then, using the log link function, the random-effects NMA model for the log HR between the arm k and baseline arm of study i under the second-order FP model is defined as follows,

\begin{pmatrix} β_{0ik} \\ β_{1ik} \\ β_{2ik} \end{pmatrix} = \begin{cases} \begin{pmatrix} u_{0i} \\ u_{1i} \\ u_{2i} \end{pmatrix} & \text{for } k = 1 \\ \begin{pmatrix} u_{0i} \\ u_{1i} \\ u_{2i} \end{pmatrix} + \begin{pmatrix} θ_{0ik} \\ θ_{1ik} \\ θ_{2ik} \end{pmatrix} & \text{for } k > 1 \end{cases}

with

\begin{pmatrix} θ_{0ik} \\ θ_{1ik} \\ θ_{2ik} \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} d_{0 c_{ik} c_{i1}} \\ d_{1 c_{ik} c_{i1}} \\ d_{2 c_{ik} c_{i1}} \end{pmatrix}, Σ \right) \qquad (1)

and

Σ = \begin{pmatrix} τ_{0}^{2} & ρ_{01} τ_{0} τ_{1} & ρ_{02} τ_{0} τ_{2} \\ ρ_{10} τ_{1} τ_{0} & τ_{1}^{2} & ρ_{12} τ_{1} τ_{2} \\ ρ_{20} τ_{2} τ_{0} & ρ_{21} τ_{2} τ_{1} & τ_{2}^{2} \end{pmatrix}

where each component of the vector $(θ_{0ik}, θ_{1ik}, θ_{2ik})^{T}$ denotes the underlying log HR of the arm k versus the baseline arm 1 in study i under the PH model, first and second orders, respectively, $τ_{j}^{2}$ is the between-study variance under model j=0,1,2 assumed common across the observed comparisons, and $ρ_{jj'}$ is the correlation coefficient between θ_jik and $θ_{j'ik}$ with j,j′∈{0,1,2} and j≠j′. The index c_ik refers to the intervention investigated in arm k of study i.
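The likelihood contribution of a single interval can be sketched as follows, assuming a constant hazard within the interval so that the event probability is 1 − exp(−h·dt). The counts and hazard value are hypothetical, and the function names are illustrative rather than part of any package.

```python
import math

def event_probability(log_h, dt):
    """Probability of an event within an interval of length dt at constant hazard exp(log_h)."""
    return 1.0 - math.exp(-math.exp(log_h) * dt)

def binomial_loglik(r, n, p):
    """Log-likelihood of r events among n participants at risk with event probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
    return log_choose + r * math.log(p) + (n - r) * math.log(1.0 - p)

# Hypothetical interval: 120 at risk, 9 events, dt = 2 months, hazard 0.04 per month
p = event_probability(math.log(0.04), dt=2.0)
loglik = binomial_loglik(r=9, n=120, p=p)
print(round(p, 4), round(loglik, 3))
```

In the full model, `log_h` would be the FP linear predictor for arm k of study i at time t, and the log-likelihood would be summed over all intervals, arms, and studies.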

Under the consistency assumption (i.e., the agreement between direct and more than one indirect route of evidence), the summary effect of intervention m against l can be estimated indirectly through the reference treatment A as follows

\begin{pmatrix} d_{0ml} \\ d_{1ml} \\ d_{2ml} \end{pmatrix} = \begin{pmatrix} d_{0mA} \\ d_{1mA} \\ d_{2mA} \end{pmatrix} - \begin{pmatrix} d_{0lA} \\ d_{1lA} \\ d_{2lA} \end{pmatrix}

with m,l∈{B,C,…,T}, m≠l and d_AA=0. The description of the model in the presence of multi-arm trials can be found in the Supplementary material.
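The consistency equation amounts to a component-wise subtraction of basic parameters, as the sketch below shows. It uses the FP-model posterior medians reported in Table 2 (p₁=−2, p₂=1) as point values purely for illustration; in a Bayesian analysis the subtraction would be applied to each posterior draw, so the result here need not equal the posterior median of the contrast.

```python
# Basic parameters (d0, d1, d2) of each intervention relative to reference A (docetaxel).
# Point values copied from Table 2 (FP model, p1 = -2, p2 = 1), for illustration only.
d_vs_A = {
    "A": (0.0, 0.0, 0.0),        # d_AA = 0
    "B": (0.14, 0.03, -0.01),    # gefitinib
    "C": (1.38, -1.10, -0.06),   # best supportive care
    "D": (0.20, -0.60, -0.01),   # pemetrexed
}

def contrast(m, l):
    """Consistency equation d_ml = d_mA - d_lA, applied to each FP component."""
    return tuple(dm - dl for dm, dl in zip(d_vs_A[m], d_vs_A[l]))

d_CB = contrast("C", "B")   # best supportive care versus gefitinib
print(tuple(round(x, 2) for x in d_CB))
```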

Assuming random effects only for the PH model implies that the between-study variance does not change over time; therefore, equation (1) is replaced by $θ_{0 i k} \sim N (d_{0 c_{i k} c_{i 1}}, τ_{0}^{2})$ , while θ_1ik and θ_2ik are fixed to $d_{1 c_{i k} c_{i 1}}$ and $d_{2 c_{i k} c_{i 1}}$ , respectively. A random-effects model only for the higher orders (here, 1 or 2) would disregard the heterogeneity in the true treatment effects due to effect-modifiers, as it would consider statistical heterogeneity to be only a function of the time.⁷

Pattern-mixture model

Consider that in the interval [t,t+dt], m_ikt participants were censored (lost to follow-up for reasons related or not to the design and conduct of the study) in arm k of study i with a risk of censoring, q_ikt. Among those $n_{i k t}^{o} = n_{i k t} - m_{i k t}$ participants who endured by the end of this interval (called completers), $r_{i k t}^{o}$ experienced the studied outcome with a cumulative risk of an event $p_{i k t}^{o}$ , which is a function of the hazard rate conditional on the completers, $h_{i k t}^{o} .$ It follows that the number of censored participants and the number of observed events in arm k of study i in the interval [t,t+dt] are realizations from the respective binomial distributions:

m_{ikt} \sim \mathrm{Bin}(q_{ikt}, n_{ikt}) \quad \text{and} \quad r_{ikt}^{o} \sim \mathrm{Bin}(p_{ikt}^{o}, n_{ikt}^{o})

Under the PM model, the underlying hazard in arm k of study i at timepoint t is decomposed into the hazard rate among the completers and the hazard rate among the censored participants as follows:

h_{i k t} = h_{i k t}^{o} \cdot (1 - q_{i k t}) + h_{i k t}^{m} \cdot q_{i k t}

where $h_{ikt}^{m}$ is the censoring parameter that indicates the hazard rate conditional on censored participants. The parameters $h_{ikt}^{o}$ and q_ikt can be estimated directly from the data, while we need a proper prior distribution on $h_{ikt}^{m}$. The hazard rate conditional on the randomized sample is a weighted average of the hazard rates among completers and censored participants. Hence, the FP-PM model maintains the randomized sample.
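As a small numerical illustration of the weighted average, the sketch below combines a hazard among completers and a hazard among censored participants using the risk of censoring as the weight. All input values are hypothetical.

```python
def mixture_hazard(h_obs, h_cens, q):
    """Pattern-mixture hazard: h = h_obs * (1 - q) + h_cens * q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    return h_obs * (1.0 - q) + h_cens * q

# Hypothetical interval: 6% censored, hazards per month
h = mixture_hazard(h_obs=0.05, h_cens=0.08, q=0.06)
print(round(h, 4))
```

With a low risk of censoring, the overall hazard stays close to the hazard among completers even when the hazard among censored participants differs markedly, which is consistent with the small differences between the FP and FP-PM results reported for this network.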

The informative censoring hazard ratio parameter

Alternatively, $h_{i k t}^{m}$ can be replaced with the informative censoring hazard ratio (ICHR) parameter to quantify the association between informative censoring and a time-to-event outcome. The ICHR parameter in arm k of study i at interval [t,t+dt] is defined as the ratio of the hazard rate given the censored participants to the hazard rate given the completers:

δ_{i k t} = h_{i k t}^{m} / h_{i k t}^{o}

with

\log (δ_{i k t}) = φ_{i k t} \sim N (ω_{i k}, σ_{i k}^{2})

After replacing $h_{i k t}^{m} = δ_{i k t} h_{i k t}^{o}$ in the PM model and re-arranging, we obtain the following linking equation for $h_{i k t}^{o}$ :

h_{i k t}^{o} = \frac{h_{i k t}}{1 - q_{i k t} (1 - δ_{i k t})}

The ICHR parameter expresses the relationship between $h_{i k t}^{m}$ and $h_{i k t}^{o}$ which lies in one of the following possible cases:

(i) The hazard rate given the censored participants equals the hazard rate given the completers at timepoint t (i.e., $h_{i k t}^{m} = h_{i k t}^{o}$ ) suggesting CAR assumption (i.e., δ_ikt=1 and φ_ikt=0);

(ii) The hazard rate given the censored participants is greater or lower than the hazard rate given the completers at timepoint t (i.e., $h_{ikt}^{m} > h_{ikt}^{o}$ or $h_{ikt}^{m} < h_{ikt}^{o}$, respectively). Both cases suggest deviations from the CAR assumption (informative censoring).
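The linking equation and the two cases above can be checked numerically. The sketch below computes the hazard among completers from the overall hazard, the risk of censoring, and the log ICHR, and then verifies that the weighted average of the two conditional hazards recovers the overall hazard. The input values are hypothetical.

```python
import math

def completer_hazard(h, q, phi):
    """Linking equation: h_obs = h / (1 - q * (1 - delta)), with delta = exp(phi)."""
    delta = math.exp(phi)
    return h / (1.0 - q * (1.0 - delta))

h, q = 0.05, 0.06   # hypothetical overall hazard and risk of censoring
results = {}
for phi in (-1.0, 0.0, 1.0):
    h_obs = completer_hazard(h, q, phi)
    h_cens = math.exp(phi) * h_obs          # hazard among censored participants
    recovered = h_obs * (1.0 - q) + h_cens * q
    results[phi] = (h_obs, h_cens, recovered)
    print(phi, round(h_obs, 5), round(h_cens, 5), round(recovered, 5))
```

For φ = 0 both conditional hazards equal the overall hazard (case (i)); for φ > 0 the hazard among censored participants exceeds the hazard among completers, and for φ < 0 the opposite holds (case (ii)).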

The ICHR parameter can be further structured to be common or independent across the study-arms, as well as study-specific or intervention-specific, and additional assumptions may concern the specification of the prior distribution for φ_ikt being fixed, exchangeable, or unrelated (Table S1 in the Supplementary material). Note that we have assumed the censoring mechanism to change over time which aligns with the concept of the FP model. In the present work, we have considered independent, unrelated φ_ikt∼N(0, 1) that indicates that CAR holds ‘on average' (ω_ik=0) with variance $σ_{i k}^{2} = 1$ in each arm of every study and at each interval. This is the preferred assumption in the relevant literature when we have not consulted expert opinion to determine the value of ω_ik that aligns with the investigated outcome and interventions.^14–16
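To give a sense of how much departure from CAR the N(0, 1) prior allows, the sketch below computes the central 95% prior interval of φ and back-transforms it to the ICHR scale. It uses only the Python standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

prior = NormalDist(mu=0.0, sigma=1.0)   # phi ~ N(omega = 0, sigma^2 = 1)
phi_low, phi_high = prior.inv_cdf(0.025), prior.inv_cdf(0.975)
ichr_interval = (math.exp(phi_low), math.exp(phi_high))

print(round(phi_low, 2), round(phi_high, 2))
print(round(ichr_interval[0], 2), round(ichr_interval[1], 2))
```

On the ICHR scale the interval runs from roughly 0.14 to 7.1, so the prior is centred on CAR while still allowing the hazard among censored participants to be several times lower or higher than among completers.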

Model implementation

We fitted two Bayesian random-effects NMA models using the FP model: (i) after excluding the censored participants from each arm of every study (this boils down to Jansen's original FP model without the PM model), and (ii) with the incorporation of the PM model into each study's arm assuming φ_ikt∼N(0, 1). In essence, we considered the CAR assumption ‘at face value' for the FP model. We assumed that CAR ‘holds' on average for the FP-PM model as the recommended starting point according to the relevant published literature.^14–16 In line with Jansen,⁷ we fitted 44 FP and FP-PM models for different values of p₁ and p₂ and we tabulated the corresponding results on the posterior mean of residual deviance, deviance information criterion (DIC; a measure of model fit that penalizes complexity²⁹), and the effective number of parameters (Table 1). For the calculation of the DIC, we used the formula provided in Spiegelhalter et al.²⁹ (also considered in WinBUGS), which is different from the formula used in JAGS.³⁰ We considered the pair p₁=−2, p₂=1 used in Jansen⁷ as the primary analysis. In a secondary analysis, we considered the pairs of powers p₁=−2, p₂=2, and p₁=−1, p₂=2, selected based on clinical relevance, successful convergence, and DIC value.

Table 1.

Goodness-of-fit measures for random-effects fractional polynomial (FP) models and random-effects fractional polynomial models with pattern-mixture (FP-PM) for different powers p₁ and p₂.

Powers		FP model			FP-PM model
p ₁	p ₂	pD	${\bar{D}}_{r e s}$	DIC	pD	${\bar{D}}_{r e s}$	DIC
-2	-	25.3	422.8	448.2	44.6	255.5	300.1
-1	-	25.1	429.5	454.6	45.7	258.1	303.8
-0.5	-	24.9	434.0	458.9	45.1	257.5	302.6
0	-	24.6	435.5	460.2	45.3	257.8	303.0
0.5	-	24.6	431.5	456.1	45.4	258.7	304.1
1	-	24.4	422.2	446.6	45.3	256.4	301.7
2	-	24.2	402.1	426.3	45.5	257.9	303.4
3	-	24.4	390.6	415.0	45.9	257.2	303.1
-2	-2	36.0	365.3	401.3	52.9	249.5	302.4
-2	-1	35.9	345.2	381.1	52.1	244.7	296.8
-2	-0.5	35.5	322.4	357.9	51.3	238.2	289.5
-2	0	35.9	334.1	370.0	52.0	242.1	294.2
-2	0.5	35.8	318.4	354.2	51.9	238.1	290.0
-2^a	1	35.5	306.8	342.3	51.5	234.7	286.2
-2	2	35.3	291.1	326.4	50.8	228.4	279.2
-2	3	35.2	282.7	317.9	50.1	222.8	273.0
-1	-1	36.0	342.0	377.9	52.1	244.8	296.9
-1	-0.5	35.6	327.0	362.6	51.8	242.6	294.4
-1	0	35.6	325.7	361.3	51.8	242.8	294.6
-1	0.5	35.9	314.2	350.1	51.8	241.1	292.9
-1	1	35.5	305.9	341.4	51.2	238.9	290.1
-1	2	35.4	295.5	330.8	50.6	234.4	285.0
-1	3	35.1	290.1	325.3	50.2	230.4	280.6
-0.5	-0.5	36.4	328.3	364.7	51.8	244.2	296.1
-0.5	0	35.5	320.8	356.3	52.2	244.6	296.9
-0.5	0.5	35.9	312.4	348.3	51.7	243.1	294.9
-0.5	1	35.6	306.5	342.1	50.7	241.5	292.3
-0.5	2	35.2	299.7	335.0	50.7	239.1	289.8
-0.5	3	35.1	296.8	331.9	50.3	235.8	286.1
0	0	35.7	316.1	351.8	51.4	244.2	295.6
0	0.5	35.5	310.9	346.4	51.0	245.3	296.3
0	1	35.2	307.8	343.1	50.9	245.4	296.4
0	2	35.3	305.7	341.0	50.8	243.6	294.4
0	3	34.8	305.0	339.8	50.9	240.4	291.3
0.5	0.5	35.2	310.7	345.9	51.5	246.7	298.2
0.5	1	35.6	310.8	346.4	50.4	248.3	298.7
0.5	2	35.1	312.4	347.5	51.6	245.7	297.3
0.5	3	35.1	314.5	349.6	51.9	241.6	293.5
1	1	35.2	312.6	347.7	50.8	249.7	300.5
1	2	35.0	319.3	354.3	52.5	244.7	297.2
1	3	34.8	323.0	357.8	52.8	237.7	290.5
2	2	35.0	328.7	363.7	53.0	236.4	289.3
2	3	34.7	336.1	370.8	51.8	232.9	284.7
3	3	34.8	342.3	377.1	50.9	231.7	282.6

Bold values refer to the two pairs of powers that comprised the secondary analyses.

^aThis pair of powers was used in Jansen.⁷

${\bar{D}}_{r e s}$ : posterior mean of residual deviance; DIC: deviance information criterion; FP: fractional polynomial; FP-PM: fractional polynomial pattern-mixture; pD: effective number of parameters.
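As a quick arithmetic check of Table 1, the DIC in each row should equal the posterior mean of residual deviance plus the effective number of parameters, up to rounding. The sketch below verifies this for the primary and secondary pairs of powers using values copied from the table, and reports the DIC difference between the FP and FP-PM models.

```python
# (p1, p2): ((pD, mean residual deviance, DIC) for FP, same triple for FP-PM), from Table 1
rows = {
    (-2, 1): ((35.5, 306.8, 342.3), (51.5, 234.7, 286.2)),
    (-2, 2): ((35.3, 291.1, 326.4), (50.8, 228.4, 279.2)),
    (-1, 2): ((35.4, 295.5, 330.8), (50.6, 234.4, 285.0)),
}

max_gap = 0.0     # largest |pD + deviance - DIC| across the checked rows
dic_gain = {}     # DIC(FP) - DIC(FP-PM) per pair of powers
for powers, (fp, fp_pm) in rows.items():
    for p_d, d_res, dic in (fp, fp_pm):
        max_gap = max(max_gap, abs(p_d + d_res - dic))
    dic_gain[powers] = round(fp[2] - fp_pm[2], 1)

print(round(max_gap, 1), dic_gain)
```

The largest discrepancy is 0.1, which is attributable to rounding to one decimal place in the table.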

We assigned a non-informative normal prior distribution for the location parameters, with a mean of 0 and a variance of 10 000. We assigned a half-normal prior distribution with a scale parameter equal to 1.0 (median: 0.67; IQR: 0.03 − 2.24) on τ₀ to allow for a more accurate and precise estimation of the parameters due to the limited number of studies per comparison suggested by relevant literature.³¹ We ran three chains of different initial values with 80 000 iterations, 20 000 burn-in, and thinning equal to three. We used the Gelman–Rubin convergence diagnostic and visual inspection of trace plots to assess the convergence.³² To implement the models, we used the R statistical software³³ and JAGS³⁰ via the R package R2jags.³⁴ We used the R-package ggplot2³⁵ to create the figures and the R package pcnetmeta³⁶ to draw the network plot.
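The Gelman–Rubin diagnostic compares between-chain and within-chain variability, with values close to 1 indicating convergence. The sketch below is a basic textbook version of the potential scale reduction factor applied to simulated draws, not the implementation used by R2jags or coda. It also assumes that the 80 000 iterations include the burn-in, in which case 20 000 draws per chain are retained after thinning.

```python
import random
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter from equal-length chains."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])           # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])      # B: between-chain variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Retained draws per chain, assuming the iteration count includes the burn-in
n_iter, n_burnin, n_thin, n_chains = 80_000, 20_000, 3, 3
kept_per_chain = (n_iter - n_burnin) // n_thin

# Simulated draws (hypothetical): three well-mixed chains versus three chains stuck apart
random.seed(2024)
mixed = [[random.gauss(0.0, 1.0) for _ in range(2_000)] for _ in range(n_chains)]
stuck = [[random.gauss(shift, 1.0) for _ in range(2_000)] for shift in (0.0, 3.0, 6.0)]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(kept_per_chain, round(rhat_mixed, 3), round(rhat_stuck, 2))
```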

Results

The model-fit measures for the different FP and FP-PM models are presented in Table 1. The FP-PM model systematically yielded a larger effective number of parameters than the FP model because it estimates more parameters (the difference ranged from 14.8 to 21.5). However, the FP-PM model yielded a systematically lower posterior mean of residual deviance than FP for all pairs of powers, leading to a consistently lower DIC (the difference ranged from 44.7 to 157.2). According to the Gelman–Rubin convergence diagnostic, convergence was achieved for all model parameters (range: 1.001−1.002 for $d_{{j t}_{i k}, A}$ , τ₀, and φ_ikt with j=0, 1, 2). Figure S1 (Supplementary material) illustrates the leverage plot for the FP and FP-PM models for the primary analysis (i.e., p₁=−2, p₂=1). In both models, the same data points found outside the red parabola contributed a DIC value larger than 3. However, this contribution was more prominent in the FP model than in the FP-PM model, which explained the strikingly higher DIC in the former (Table 1). Most outlying points corresponded to zero events and zero non-events (i.e., the number of the observed events equals the number of the completers).

Network meta-analysis results

The results on the log HR (posterior median and 95% credible interval (CrI)) for comparisons with the reference intervention of the network (docetaxel) are presented in Table 2 (primary and secondary analyses). The posterior median of log HR from the FP-PM model was very similar to that from the FP model across the three pairs of powers. The FP-PM model yielded slightly wider CrI than the FP model across the three pairs of powers. The estimates of the expected survival time were similar across both models (Table 2). In both models, the expected survival time with docetaxel was associated with the narrowest 95% CrI, whilst the expected survival time with pemetrexed was associated with the widest 95% CrI. Both models yielded almost identical results for $τ_{0}^{2}$ due to the low amount of censoring in the network.

Table 2.

Posterior median and 95% CrI of log HR for comparisons with docetaxel, between-study standard deviation, and expected survival time.

	p ₁=−2, p ₂=1		p ₁=−2, p ₂=2		p ₁=−1, p ₂=2
	FP	FP − PM	FP	FP − PM	FP	FP − PM
Gefitinib (B) versus docetaxel (A)
d_0BA	0.14 (-0.10, 0.42)	0.15 (-0.10, 0.41)	0.10 (-0.12, 0.35)	0.09 (-0.12, 0.33)	0.08 (-0.18, 0.35)	0.08 (-0.19, 0.36)
d_1BA	0.03 (-0.32, 0.38)	0.02 (-0.24, 0.37)	0.08 (-0.23, 0.40)	0.07 (-0.26, 0.40)	0.09 (-0.28, 0.46)	0.08 (-0.31, 0.47)
d_2BA	-0.01 (-0.01, 0.00)	-0.01 (-0.02, 0.00)	0.00 (-0.00, 0.00)	0.00 (-0.00, 0.00)	0.00 (-0.00, 0.00)	0.00 (-0.00, 0.00)
BSC (C) versus docetaxel (A)
d_0CA	1.38 (0.70, 2.10)	1.41 (0.70, 1.11)	1.16 (0.64, 1.68)	1.18 (0.68, 1.70)	1.67 (0.95, 2.41)	1.70 (0.97, 2.45)
d_1CA	-1.10 (-2.01, -0.21)	-1.14 (-2.08, -0.22)	-0.92 (-1.71, -0.13)	-0.95 (-1.75, -0.18)	-1.48 (-2.51, -0.48)	-1.52 (-2.56, -0.48)
d_2CA	-0.06 (-0.12, 0.00)	-0.06 (-0.12, 0.00)	-0.01 (-0.02, 0.00)	-0.01 (-0.02, 0.00)	-0.01 (-0.02, -0.00)	-0.01 (-0.02, -0.00)
Pemetrexed (D) versus docetaxel (A)
d_0DA	0.20 (-0.43, 0.81)	0.17 (-0.48, 0.83)	0.14 (-0.35, 0.64)	0.12 (-0.40, 0.64)	0.29 (-0.35, 0.92)	0.31 (-0.38, 1.00)
d_1DA	-0.60 (-1.38, 0.16)	-0.57 (-1.39, 0.23)	-0.55 (-1.25, 0.12)	-0.54 (-1.26, 0.18)	-0.65 (-1.53, 0.22)	-0.68 (-1.62, 0.24)
d_2DA	-0.01 (-0.01, 0.00)	-0.01 (-0.05, 0.04)	0.00 (-0.00, 0.00)	0.00 (-0.01, 0.01)	-0.00 (-0.01, 0.00)	-0.00 (-0.01, 0.01)
Between-study standard deviation
τ₀	0.09 (0.01, 0.45)	0.09 (0.01, 0.44)	0.10 (0.00, 0.47)	0.09 (0.01, 0.44)	0.09 (0.00, 0.45)	0.09 (0.00, 0.45)
Expected survival (in months)
Docetaxel	10.7 (9.9, 11.6)	10.6 (9.8, 11.5)	9.6 (8.8, 11.1)	9.6 (8.8, 11.4)	9.5 (8.7, 11.4)	9.5 (8.7, 11.8)
BSC	5.0 (3.2, 7.6)	4.9 (3.2, 7.6)	7.5 (4.0, 14.3)	7.8 (4.1, 14.5)	7.1 (3.9, 13.0)	7.2 (3.8, 13.8)
Pemetrexed	10.1 (7.0, 14.7)	10.0 (6.9, 14.6)	9.4 (6.8, 17.1)	9.2 (6.6, 17.6)	9.4 (6.7, 17.2)	9.2 (6.5, 17.9)
Gefitinib	9.8 (8.1, 11.7)	9.7 (8.1, 11.5)	9.0 (7.6, 11.3)	9.1 (7.6, 11.6)	9.0 (7.6, 11.6)	9.1 (7.5, 12.0)

All numbers are expressed as median (95% CrI) unless indicated otherwise.

BSC: Best Supportive Care, CrI: Credible Interval, FP: fractional polynomial, FP-PM: fractional polynomial pattern-mixture.

Overall, the development of HR over time had a similar pattern in both models but differed for different powers (Figure 2, panels a and b for the primary analysis; Figure S2 in the Supplementary material for the secondary analyses). Models with a second-order power of 2 showed implausible results for extrapolation purposes: the upper bound of the 95% CrI reached very large values (Figure S2 in the Supplementary material). The same pattern of survival curves was also shared between the FP-PM and FP models (Figure 2, panels c and d for the primary analysis; Figure S3 in the Supplementary material for the secondary analyses). Nevertheless, the curves demonstrated slightly greater uncertainty in estimating the hazard over time under the FP-PM model than under the FP model. When modeling survival time with either model, the curves of docetaxel, gefitinib and pemetrexed were hardly distinguishable. Gefitinib showed better results on average, and BSC showed the poorest results with a plateau after month 20 (Figure 2, panels c and d for the primary analysis; Figure S3 in the Supplementary material for the secondary analyses).
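The way the HR changes over time follows directly from the second-order FP parameterization, log HR(t) = d₀ + d₁t^{p₁} + d₂t^{p₂}. The sketch below evaluates this expression for BSC versus docetaxel using the FP-model posterior medians from Table 2 (p₁=−2, p₂=1) as plug-in values. This is only an approximation for illustration: the curves in Figure 2 summarize the full posterior distribution at each timepoint, which plug-in medians do not reproduce exactly.

```python
import math

def log_hr(t, d, p1=-2, p2=1):
    """log HR(t) = d0 + d1 * t**p1 + d2 * t**p2 (valid here since p1 != p2 and neither is 0)."""
    d0, d1, d2 = d
    return d0 + d1 * t ** p1 + d2 * t ** p2

# BSC versus docetaxel, FP model, posterior medians from Table 2 used as plug-in values
d_CA = (1.38, -1.10, -0.06)
hr_curve = {t: math.exp(log_hr(t, d_CA)) for t in (2, 6, 12, 20)}
for t, hr in hr_curve.items():
    print(t, round(hr, 2))
```

With these plug-in values the HR of BSC relative to docetaxel is above 1 at every evaluated timepoint and is lower at 20 months than at 2 months.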

Figure 2.

First row: hazard ratio over time for each intervention relative to docetaxel under the FP-PM model for the CAR assumption (panel (a)) and FP model (panel (b)). Second row: survival function over time for each intervention under the FP-PM model for the CAR assumption (panel (c)) and FP model (panel (d)). Both models have been fitted for p₁=−2 and p₂=1 (primary analysis). The solid and the dot-dashed lines represent the posterior distribution's median and 95% credible interval, respectively.

Learning about the censoring mechanisms

The posterior distribution of log ICHR for each study's arm in the network and at each timepoint is shown in Figure 3. For some of the study-arms, the posterior mean of log ICHR deviated from zero, and the bounds of the 95% CrI protruded from the interval of the prior distribution on φ_ikt, indicating a departure from the CAR assumption; hence, the censoring process may be informative. For those study-arms with a positive posterior mean of log ICHR, the hazard of death may be higher among the censored participants than among the completers, whilst the opposite holds for a negative posterior mean of log ICHR. Also, for two study-arms referring to gefitinib¹⁹ and BSC,²⁴ respectively (highlighted in red in Figure 3), the 95% CrI of log ICHR excluded zero and protruded from the interval of the prior distribution of log ICHR, which was a strong indication of informative censoring. This was an expected finding resulting from censoring at the end of the follow-up.

Figure 3.

Interval plot of the posterior distribution of log ICHR for all study-arms of the network and timepoints. The vertical lines indicate the prior distribution for ICHR as obtained using the FP-PM (p₁=−2, p₂=1) network meta-analysis model under the CAR assumption.

Discussion

The present article addressed a methodological gap in handling censoring in aggregate survival data. We proposed a straightforward approach to model aggregate time-to-event data with censoring by extending the FP model to incorporate the PM model and obtain bias-adjusted results. We introduced the ICHR parameter to investigate possible departures from the CAR assumption and ‘learn' about the censoring mechanisms in the network.

In the absence of IPD, applying the PM model to reconstructed IPD is advantageous over imputation and exclusion because it accounts for the uncertainty induced by censoring through assumptions about the distribution of the ICHR parameter. Ideally, the researcher would invite expert opinion to determine a series of clinically plausible assumptions about the distribution of the ICHR parameter to investigate the sensitivity of the results to censoring. For instance, the present work could be extended to consider the approach of White et al.³⁷ on eliciting expert opinions about the degree of departure from the CAR assumption.

An indication of informative censoring was observed at the end of the KM curves in Cufer et al.¹⁹ and Shepherd et al.²⁴ However, no strong indication of informative censoring was observed in any other time interval, suggesting that the follow-up ended at a pre-specified date when most participants had not yet experienced the event. In principle, we should distinguish censoring due to losses to follow-up from administrative censoring. In oncology trials, loss to follow-up is relatively rare, and most censoring tends to be administrative and unlikely to cause bias. Administrative censoring may lead to overly precise or imprecise results if censored observations are imputed or excluded, leading to possibly spurious conclusions depending on the amount of censoring in the data. However, it is impossible to distinguish between losses to follow-up and administrative censoring when IPD are unavailable. Therefore, regardless of the censoring type, our approach is advantageous and should be preferred to imputation or exclusion because it maintains the randomized sample and incorporates the uncertainty about the censoring mechanism in the results. Our proposed method is also relevant to observational studies or studies examining more chronic conditions, as they are not immune to losses to follow-up and their implications.³⁸

The models with a power equal to 2 for the second order did not provide clinically plausible results for the log HRs over time (Figure S2 in the Supplementary material), which may impact the extrapolation of the KM curves. Furthermore, we observed greater variability in the results across different powers than between the FP and FP-PM models (Table 2). Selecting the proper set of powers for the dataset may require a heuristic procedure considering the trade-off between clinical plausibility and model fit. Ideally, the selected powers should not result in overfitting, as it may compromise the clinical plausibility of the extrapolation.¹⁰ Royston has developed a function in Stata to assist researchers in selecting the powers p₁ and p₂ that result in a parsimonious FP model through a model selection process.³⁹ To determine the model with the best representation of hazards beyond the study period, researchers should consider whether the extrapolation is realistic, seeking external data sources and clinical expert opinion.⁴⁰
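The sensitivity of the extrapolation to the choice of powers can be illustrated with the usual second-order fractional polynomial form for the log hazard, log h(t) = β₀ + β₁t^p₁ + β₂t^p₂, where a power of 0 is read as log t and a repeated power p₁ = p₂ = p uses the terms t^p and t^p·log t. The sketch below is illustrative only: the coefficients are hypothetical and not estimates from this analysis, and the function names are assumptions made for the example.

```python
import math

def fp_term(t, p):
    """Fractional polynomial transform: t**p, with p = 0 read as log(t)."""
    return math.log(t) if p == 0 else t ** p

def fp2_log_hazard(t, p1, p2, b0, b1, b2):
    """Second-order fractional polynomial for the log hazard at time t > 0."""
    x1 = fp_term(t, p1)
    if p1 == p2:
        # Repeated power: the second term is multiplied by log(t).
        x2 = fp_term(t, p2) * math.log(t)
    else:
        x2 = fp_term(t, p2)
    return b0 + b1 * x1 + b2 * x2

# Hypothetical coefficients, chosen only to show how the terms scale with time.
B0, B1, B2 = -3.0, 0.5, 0.01

for months in (6, 12, 24, 60, 120):
    lh_p2_1 = fp2_log_hazard(months, -2, 1, B0, B1, B2)
    lh_p2_2 = fp2_log_hazard(months, -2, 2, B0, B1, B2)
    print(f"t = {months:>3} months: p2=1 -> {lh_p2_1:7.2f}, p2=2 -> {lh_p2_2:7.2f}")
```

With the same coefficients, the term with a power of 2 grows quadratically in time, so small differences within the follow-up period can translate into very different log hazards once the curve is extrapolated.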

The limitations of our proposed method are primarily those inherent to KM digitization. Digitizing published KM curves tends to pool data over different covariates that might (or might not) affect survival, possibly resulting in biased treatment effects, with the extent of bias being proportional to the strength of the covariate effect.²⁶ However, this limitation may be less of a concern if the studies considered in the NMA are comparable for essential effect modifiers. The reliability of the reconstructed data also strongly depends on the information in the published reports. There is high interest in reconstructing KM curves, and the recently developed R Shiny app might address the limitations of digitization.⁴¹

Another limitation of our study is that incorporating the PM model in the FP framework did not materially change the conclusions. This may be primarily attributed to the relatively low censoring rate across the studies (typical in pivotal oncology trials), which was accommodated in an aggregate form without information on the exact censoring times and any essential covariates that would have allowed for more sophisticated analyses to handle censoring. However, information on the exact censoring times and important covariates would require access to IPD. Nonetheless, having studies with very low censoring (3% and 1% in Cufer et al.¹⁹ and Shepherd et al.,²⁴ respectively) allowed us to learn about the censoring mechanisms (the prior distribution for φ_ikt was updated in some study-arms and timepoints).¹⁶ In studies examining more chronic conditions, loss to follow-up can be substantial, with serious implications for decision-making. Without access to IPD, the proposed FP-PM model would safeguard against spurious conclusions by naturally increasing the uncertainty around the model parameters. However, substantial censoring may compromise the ability to learn about the censoring mechanisms.

Moreover, we did not perform a sensitivity analysis to investigate whether the primary analysis results (under CAR) are sensitive to different informative censoring scenarios. Given the low censoring observed across the studies, one may presume that the results would have been robust to different structures and assumptions about the ICHR parameter. However, a recent study on aggregate binary outcome data in NMA revealed that bias may arise even for low attrition, particularly when event frequency is low, the sample size is small, and statistical heterogeneity is substantial.⁴²

Other survival approaches can be extended straightforwardly to incorporate the PM model to handle censoring,⁴⁰ which falls beyond the scope of this work. Selecting among the different survival models (e.g., FP and parametric models) requires careful consideration of the necessary assumptions and the scope of the application. Generally, relying solely on the goodness-of-fit measures for model selection may be misleading because the best-fitting model may not have a clinically meaningful interpretation. The structure of the network should also be taken into account. For instance, sparse networks are characterized by few connections between the treatments informed mainly by a handful of small studies. In such networks, estimating many parameters may result in convergence problems.

Conclusions

When the collated studies report the KM curves, the researchers should opt for the reconstructed time-to-event and censoring data. Applying the FP-PM NMA model under clinically plausible scenarios about the ICHR parameter would then provide bias-adjusted results and safeguard against spurious conclusions.

Supplemental Material

Supplemental Material for Dealing with censoring in a network meta-analysis of time-to-event data by Chrysostomos Kalyvas, Katerina Papadimitropoulou, William Malbecq and Loukia M. Spineli in Research Methods in Medicine & Health Sciences

Footnotes

Acknowledgments

CK and WM are employed by Merck Sharp & Dohme. The authors alone are responsible for the views expressed in this article, and they should not be construed as the views, decisions, or policies of the institutions they are affiliated with.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Loukia M Spineli

Data Availability Statement

All functions and data related to this manuscript are publicly available at .

Supplemental Material

Supplemental material for this article is available online.

References

1. Ades. Combination of direct and indirect evidence in mixed treatment comparisons. Stat Med 2004; 23(20): 3105–3124.

2. Caldwell. An overview of conducting systematic reviews with network meta-analysis. Syst Rev 2014; 3: 109.

3. Higgins JPT, Deeks. Chapter 6: Choosing effect measures and computing estimates of effect. In: Higgins JPT, Thomas, Chandler, et al. (eds) Cochrane Handbook for Systematic Reviews of Interventions version 6.3. Cochrane, 2022. (updated February 2022) www.training.cochrane.org/handbook.

4. Tierney, Stewart, Ghersi, et al. Practical methods for incorporating summary time-to-event data into meta-analysis. Trials 2007; 8: 16.

5. Hosmer, Lemeshow. Applied survival analysis, regression modeling of time to event data. New York: John Wiley and Sons, 1999.

6. Ouwens, Philips, Jansen. Network meta-analysis of parametric survival curves. Res Synth Methods 2010; 1(3–4): 258–271.

7. Jansen. Network meta-analysis of survival data with fractional polynomials. BMC Med Res Methodol 2011; 11: 61.

8. Saramago, Chuang, Soares. Network meta-analysis of (individual patient) time to event data alongside (aggregate) count data. BMC Med Res Methodol 2014; 14: 105.

9. Freeman, Carpenter. Bayesian one-step IPD network meta-analysis of time-to-event data using Royston-Parmar models. Res Synth Methods 2017; 8(4): 451–464.

10. Welton, Phillippo, Owen, et al. CHTE2020 sources and synthesis of evidence; update to evidence synthesis methods. DSU report. March 2020. https://www.sheffield.ac.uk/nice-dsu/methods-development/chte2020-sources-and-synthesis-evidence

11. Atkinson, Kenward, Clayton, et al. Reference-based sensitivity analysis for time-to-event data. Pharm Stat 2019; 18(6): 645–658.

12. Ibrahim, Chu, Chen. Missing data in clinical studies: issues and methods. J Clin Oncol 2012; 30(26): 3297–3303.

13. Carpenter, Kenward. Missing data in randomised controlled trials: a practical guide. Birmingham: Health Technology Assessment Methodology Programme, 2007, http://researchonline.lshtm.ac.uk/id/eprint/4018500. Accessed 24 March 2022.

14. Mavridis, White, Higgins, et al. Allowing for uncertainty due to missing continuous outcome data in pairwise and network meta-analysis. Stat Med 2015; 34(5): 721–741.

15. Spineli, Kalyvas, Papadimitropoulou. Continuous(ly) missing outcome data in network meta-analysis: A one-stage pattern-mixture model approach. Stat Methods Med Res 2021; 30(4): 958–975.

16. Turner, Dias, Ades, et al. A Bayesian framework to account for uncertainty due to missing binary outcome data in pairwise meta-analysis. Stat Med 2015; 34(12): 2062–2080.

17. Little RJA. Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 1993; 88(421): 125–134.

18. Chang, Parikh, Thongprasert, et al. Gefitinib (IRESSA) in patients of Asian origin with refractory advanced non-small cell lung cancer: subset analysis from the ISEL study. J Thorac Oncol 2006; 1(8): 847–855.

19. Cufer, Vrdoljak, Gaafar, SIGN Study Group. Phase II, open-label, randomized study (SIGN) of single-agent gefitinib (IRESSA) or docetaxel as second-line therapy in patients with advanced (stage IIIb or IV) non-small-cell lung cancer. Anti Cancer Drugs 2006; 17(4): 401–409.

20. Hanna, Shepherd, Fossella, et al. Randomized phase III trial of pemetrexed versus docetaxel in patients with non-small-cell lung cancer previously treated with chemotherapy. J Clin Oncol 2004; 22(9): 1589–1597.

21. Kim, Hirsh, Mok, et al. Gefitinib versus docetaxel in previously treated non-small-cell lung cancer (INTEREST): a randomised phase III trial. Lancet 2008; 372(9652): 1809–1818.

22. Lee, Park, Kim, et al. Randomized Phase III trial of gefitinib versus docetaxel in non-small cell lung cancer patients who have previously received platinum-based chemotherapy. Clin Cancer Res 2010; 16(4): 1307–1314.

23. Maruyama, Nishiwaki, Tamura, et al. Phase III study, V-15-32, of gefitinib versus docetaxel in previously treated Japanese patients with non-small-cell lung cancer. J Clin Oncol 2008; 26(26): 4244–4252.

24. Shepherd, Dancey, Ramlau, et al. Prospective randomized trial of docetaxel versus best supportive care in patients with non-small-cell lung cancer previously treated with platinum-based chemotherapy. J Clin Oncol 2000; 18(10): 2095–2103.

25. Higgins. Commentary: Heterogeneity in meta-analysis should be expected and appropriately quantified. Int J Epidemiol 2008; 37(5): 1158–1160.

26. Guyot, Ades, Ouwens, et al. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med Res Methodol 2012; 12: 9.

27. DigitizeIt. Digitizer software – digitize a scanned graph or chart into (x,y)-data. version 2.3. https://www.digitizeit.xyz/.

28. Royston, Altman. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Statist 1994; 43(3): 429–453.

29. Spiegelhalter, Best, Carlin, et al. Bayesian measures of model complexity and fit. J R Stat Soc Ser B 2002; 64(4): 583–639.

30. Plummer. JAGS: Just Another Gibbs Sampler. version 4.3.0 user manual. 1–74, 2017.

31. Friede, Röver, Wandel, et al. Meta-analysis of two studies in the presence of heterogeneity with applications in rare diseases. Biom J 2017; 59(4): 658–671.

32. Gelman, Rubin. Inference from Iterative Simulation Using Multiple Sequences. Stat Sci 1992; 7(4): 457–472.

33. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2022, https://www.r-project.org.

34. Yajima. R2jags: using R to Run ‘JAGS’. R package version 0.5–7. 2015, https://cran.r-project.org/package=R2jags.

35. Wickham. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag, 2009.

36. Lin, Zhang, Hodges, et al. Performing arm-based network meta-analysis in R with the pcnetmeta package. J Stat Softw 2017; 80: 5.

37. White, Carpenter, Evans, et al. Eliciting and using expert opinions about dropout bias in randomized controlled trials. Clin Trials 2007; 4(2): 125–139.

38. Howe, Tilling, Galobardes, et al. Loss to follow-up in cohort studies: bias in estimates of socioeconomic inequalities. Epidemiology 2013; 24(1): 1–9.

39. Royston. Model selection for univariable fractional polynomials. STATA J 2017; 17(3): 619–629.

40. Rutherford, Lambert, Sweeting, et al. NICE DSU technical support document 21: flexible methods for survival analysis. 2020. Report by the Decision Support Unit. http://www.nicedsu.org.uk.

41. Liu, Zhou, Lee. IPDfromKM: reconstruct individual patient data from published Kaplan-Meier survival curves. BMC Med Res Methodol 2021; 21: 111.

42. Spineli, Papadimitropoulou, Kalyvas. Pattern-mixture model in network meta-analysis of binary missing outcome data: one-stage or two-stage approach? BMC Med Res Methodol 2021; 21: 12.
