
Abstract

We investigate the effect of the choice of parameterisation of meta-analytic models and related uncertainty on the validation of surrogate endpoints. Different meta-analytical approaches take into account different levels of uncertainty which may impact on the accuracy of the predictions of treatment effect on the target outcome from the treatment effect on a surrogate endpoint obtained from these models. A range of Bayesian as well as frequentist meta-analytical methods are implemented using illustrative examples in relapsing–remitting multiple sclerosis, where the treatment effect on disability worsening is the primary outcome of interest in healthcare evaluation, while the effect on relapse rate is considered as a potential surrogate to the effect on disability progression, and in gastric cancer, where the disease-free survival has been shown to be a good surrogate endpoint to the overall survival. Sensitivity analysis was carried out to assess the impact of distributional assumptions on the predictions. Also, sensitivity to modelling assumptions and performance of the models were investigated by simulation. Although different methods can predict mean true outcome almost equally well, inclusion of uncertainty around all relevant parameters of the model may lead to less certain and hence more conservative predictions. When investigating endpoints as candidate surrogate outcomes, a careful choice of the meta-analytical approach has to be made. Models underestimating the uncertainty of available evidence may lead to overoptimistic predictions which can then have an effect on decisions made based on such predictions.

Keywords

Meta-analysis, surrogate endpoints, Bayesian statistics, bivariate meta-analysis, meta-regression

1 Introduction

Biomarkers and surrogate endpoints are increasingly being investigated as candidate endpoints in clinical trials where measuring a primary outcome of interest may be too costly, too difficult or require a long follow-up time. Use of surrogate endpoints in clinical trial design has advantages in overcoming these difficulties by choosing more convenient, cheaper or shorter term endpoints. Such endpoints are also becoming increasingly important in health technology assessment (HTA) and in particular in the early stages of drug development when conditional licensing based on a biomarker takes place and evidence on treatment effectiveness on a target outcome may be limited. Suitable methods need to be identified that would incorporate data on surrogate outcomes most efficiently in evidence synthesis as part of HTA.

Validating candidate outcomes as surrogate endpoints to target outcomes requires the correlation between the candidate endpoint and the target outcome on the individual level as well as the correlation between the treatment effect measured by the surrogate endpoint and the treatment effect measured by the target outcome to be established.¹ Methods for evaluating surrogacy on the individual level include, for example, Prentice's criteria,² proportion of treatment explained³ and adjusted association (between the endpoints adjusted for the treatment).⁴ For the evaluation to be valid in a general context of a particular disease area, it needs to be performed on a number of studies rather than based on a single trial. Meta-analysis serves the purpose of combining evidence from a number of trials and also provides a convenient tool for evaluating the association between treatment effects on the surrogate and final outcome on the study level. A number of meta-analytical methods have been proposed that aim to validate such surrogate endpoints.^1,5,6 For example, Daniels and Hughes proposed a Bayesian model for a joint synthesis of correlated outcomes, focused on summary data where partially available patient data can contribute to determining the within-study correlation.⁶ Buyse et al., on the other hand, designed a frequentist meta-analytic model based on patient-level data from a number of studies in the form of a mixed effects model with two measures of surrogacy derived: on the patient level and the study level.⁵ Part of the validation process, beyond establishing the correlations on both levels, involves investigating whether the treatment effect measured by the target outcome can be predicted from the treatment effect measured by the surrogate endpoint (from a model built based on treatment effect on both outcomes measured in historical trials) by comparing the predicted effect with the observed effect on a target endpoint in a validation study. 
Methods used for prediction include linear regression (for example, proposed by Buyse et al. to predict the log hazard ratio measured by overall survival from the log hazard ratio measured by progression-free survival in colorectal cancer⁷), weighted linear regression (for example, by Sormani et al.⁸ in a study in relapsing–remitting multiple sclerosis (RRMS)), error-in-variables regression methods¹ (for example, used by Burzykowski et al. in a metastatic breast cancer study⁹ or Oba et al. in a gastric cancer study¹⁰), meta-regression (for example, used by Gabler et al. investigating 6-min walk distance as a surrogate endpoint for the development of clinical events in pulmonary arterial hypertension¹¹), or bivariate meta-analysis methods, such as those of Daniels and Hughes in a Bayesian framework developed to evaluate CD4 cell count as a candidate surrogate endpoint for the treatment effect on the development of AIDS or death.⁶

Different meta-analytical approaches take into account different levels of uncertainty which may impact on the accuracy of the validation and predictions. The aim of this study was to investigate the effect of the choice of parameterisation of meta-analytic models and related uncertainty (that these models are able to incorporate) on the predictions obtained from those models. Bayesian methods are most suitable for this purpose as they are flexible in modelling the uncertainty. This study is concerned with predictive models for normally distributed treatment effects that are based on the summary data only. A range of Bayesian meta-analytical methods (using summary data) is implemented in order to investigate the impact of the choice of a model and level of uncertainty on the model predictions. When simple meta-regression is used to validate a candidate surrogate endpoint, the treatment effect on such an endpoint is included in the model as a covariate and hence is incorporated with no uncertainty, while the effect of treatment on each endpoint, including the surrogate, is in fact measured with error. Two approaches to meta-regression (described in Section 3.1) are investigated here: a standard use of mean trend with fixed coefficients estimated from the fixed effects meta-regression model (FEMR) and a random effects approach where between-study variability is taken into account when making predictions. In contrast to meta-regression, the model proposed by Daniels and Hughes⁶ (described in Section 3.2) includes the treatment effect on the surrogate endpoint with uncertainty by modelling it as a response (rather than a covariate). Alternatively, this can be achieved using bivariate meta-analytic methods^12–14 (Sections 3.3 and 3.4) which allow one to simultaneously model the estimates of treatment effects on both the surrogate and the final endpoint by taking into account the between- and within-study correlations.
Models are implemented using WinBUGS.¹⁵ While, as noted above, Bayesian methods are most suited to flexibly model the uncertainty, similar differences in the way uncertainty is taken into account, and its impact on predictions, can also be demonstrated using frequentist methods. We illustrate this by the use of meta-regression and bivariate meta-analysis in Stata.¹⁶

In the remainder of this paper, illustrative examples in RRMS and gastric cancer are introduced in Section 2, followed by the details of each model described in the Bayesian framework in Section 3, with additional details of the use of frequentist methods in Section 3.7 and methods for surrogate endpoint validation and model comparison in Section 3.8. Results are then presented and differences between the models discussed in Section 4 which are complemented by a simulation study in Section 5 aiming to test the performance of each method and its sensitivity to the distributional assumptions. The paper is concluded by a discussion section. WinBUGS coding for each of the models, R code for the simulation and Stata code for the frequentist approach are included in Appendix 1.

2 Illustrative examples

2.1 Multiple sclerosis

Sormani et al.⁸ showed that in studies investigating treatment effect in patients with multiple sclerosis, the treatment effect on relapse rate can potentially be used as a surrogate endpoint to the treatment effect on the disability progression rate. We use data from this study as an illustrative example to investigate the effect of the choice of modelling technique and corresponding level of uncertainty which is allowed to be included in each of the models. We refer to these data as the ‘Sormani data’ in the remainder of this paper.

The annualised relapse rate ratio, the ratio between the relapse rate in the experimental and the control arms, was used as the summary estimate of the treatment effect on relapses (the surrogate endpoint measuring the treatment effect). The disability progression rate ratio, the ratio between the proportion of patients with a disability progression in the experimental and the control arms at year 2 (or at year 3 for trials of longer follow-up time which do not report the outcome at year 2), was used as the summary estimate of the treatment effect on disability progression, which was the target endpoint. Details of the specific treatment regimens are included in Table 1. Figure 1 shows data on both outcomes graphically, revealing similar heterogeneity patterns between the studies for both outcomes, implying a strong correlation between the effects on these outcomes. The studies are grouped as placebo-controlled and active-treatment-controlled.

Table 1.

Studies in the ‘Sormani data’ reporting the annualised relapse rate ratio and the disability progression rate ratio.

Study	Contrast	Number of patients	Follow-up (months)	Annualised relapse rate ratio	Disability progression rate ratio
Paty (1) 1993	IFNbeta-1b 1.6 MIU vs PBO	248	24	0.92 (0.82, 1.03)	1.00 (0.67, 1.49)
Paty (2) 1993	IFNbeta-1b 8 MIU vs PBO	247	24	0.66 (0.58, 0.75)	0.71 (0.46, 1.12)
Miligan 1994	Methylprednisolone vs PBO	26	24	0.81 (0.50, 1.30)	1.14 (0.26, 5.03)
Johnson 1995	GA vs PBO	251	24	0.71 (0.61, 0.82)	0.88 (0.57, 1.35)
Jacobs 1996	IFNbeta-1a 6 MIU vs PBO	172	24	0.68 (0.57, 0.81)	0.63 (0.38, 1.04)
Fazekas 1997	IVIg vs PBO	150	24	0.41 (0.34, 0.49)	0.70 (0.36, 1.35)
Millefiorini 1997	Mitoxantrone vs PBO	51	24	0.34 (0.24, 0.47)	0.19 (0.05, 0.78)
Achiron 1998	IVIg vs PBO	40	24	0.37 (0.27, 0.52)	0.82 (0.19, 3.50)
Li (1) 1998	IFNbeta1a 22 μg vs PBO	376	24	0.71 (0.64, 0.78)	0.81 (0.61, 1.08)
Li (2) 1998	IFNbeta1a 44 μg vs PBO	371	24	0.68 (0.62, 0.75)	0.73 (0.54, 0.99)
Baumhackl 2005	Hydrolytic enzymes vs PBO	306	24	0.85 (0.74, 0.97)	1.08 (0.74, 1.57)
Polman 2006	NAT vs PBO	942	24	0.32 (0.29, 0.36)	0.59 (0.46, 0.75)
Comi (1) 2009	Cladribine 3.5 mg/kg vs PBO	870	24	0.42 (0.36, 0.49)	0.69 (0.52, 0.93)
Comi (2) 2009	Cladribine 5.25 mg/kg vs PBO	893	24	0.45 (0.39, 0.52)	0.73 (0.55, 0.97)
Sorensen 2009	IFNbeta-1a and oral methylprednisolone vs IFNbeta-1a and PBO	130	24	0.37 (0.27, 0.50)	0.64 (0.32, 1.28)
Clanet 2002	IFNbeta-1a 60 μg vs 30μg	802	36	1.05 (0.99, 1.12)	1.00 (0.84, 1.20)
Durelli 2002	IFNbeta1b vs IFNbeta1a	188	24	0.71 (0.59, 0.86)	0.43 (0.24, 0.78)
Rudick 2006	NAT + IFNbeta-1a vs IFNbeta-1a	1171	24	0.45 (0.41, 0.49)	0.79 (0.65, 0.96)
Coles (1) 2008	ALE 12 mg vs IFNbeta-1a	223	36	0.31 (0.24, 0.40)	0.35 (0.16, 0.73)
Coles (2) 2008	ALE 24 mg vs IFNbeta-1a	221	36	0.22 (0.16, 0.30)	0.38 (0.19, 0.76)
Mikol 2008	IFNbeta vs GA	764	24	1.03 (0.90, 1.17)	1.34 (0.88, 2.06)
Havrdova (1) 2009	IFNbeta-1a 30 μg plus AZA 50 mg vs IFNbeta-1a 30 μg	118	24	0.87 (0.73, 1.04)	1.23 (0.58, 2.62)
Havrdova (2) 2009	IFNbeta-1a 30 μg IM plus AZA 50 mg plus prednisone 10 mg vs IFNbeta-1a 30 μg	123	24	0.70 (0.58, 0.85)	1.04 (0.48, 2.27)
O'Connor (1) 2009	IFNbeta-1b 250 μg vs GA	1345	24	1.06 (0.97, 1.16)	1.05 (0.84, 1.31)
O'Connor (2) 2009	IFNbeta-1b 500 μg vs GA	1347	24	0.97 (0.88, 1.06)	1.10 (0.88, 1.37)

AZA: azathioprine; GA: glatiramer acetate; IFNβ: interferon-β; IVIg: IV immunoglobulin; PBO: placebo.

Figure 1.

Summary of the ‘Sormani data’.

2.2 Gastric cancer

Oba et al.¹⁰ investigated disease-free survival (DFS) as a surrogate endpoint for the overall survival (OS) in patients with curative gastric cancer. The study included randomised clinical trials that compared adjuvant chemotherapy with surgery alone. DFS was defined as the time to cancer recurrence, second cancer or death from any cause. DFS and OS hazard ratios were estimated with five years of follow-up.

We use data from Oba et al.¹⁰ as a second illustrative example to investigate the effect of the choice of a modelling technique and corresponding level of uncertainty on predictions. Data are presented in detail in Table 2 and graphically in Figure 2. We refer to these data as the ‘Oba data’ in the remainder of this paper. As in Oba et al., the studies are grouped as historical and validation studies. They are used in two sets of validation analyses: cross-validation, in which the effect on OS is taken out from one study at a time and predicted from the effect on DFS using a model based on the data on both outcomes from the remaining historical trials, and external validation, in which predictions are made for each of the validation trials using a model developed from the data from all the historical trials. As can be seen in Figure 2, the effects on DFS and OS have similar heterogeneity patterns between the studies, suggesting a strong association between the effects on those outcomes.

Table 2.

Studies in the ‘Oba data’ reporting the hazard ratio measured by the disease-free survival (DFS) and overall survival (OS).

Study	Number of patients (chemotherapy)	Number of patients (surgery)	Follow-up (years)	DFS HR (95% CI)	OS HR (95% CI)
Historical trials
FFCD-8801	133	136	8.1	0.83 (0.61, 1.12)	0.84 (0.62, 1.14)
NSAS-GC	95	95	6.0	0.49 (0.29, 0.83)	0.51 (0.29, 0.90)
JCOG-9206-1	128	124	5.9	0.62 (0.33, 1.17)	0.60 (0.31, 1.17)
JCOG-8801	272	264	6.7	0.79 (0.52, 1.20)	0.82 (0.53, 1.26)
SWOG-7804	107	112	16.6	0.88 (0.66, 1.17)	0.93 (0.70, 1.24)
EORTC-40813	152	154	6.5	0.76 (0.57, 1.01)	0.85 (0.64, 1.13)
Tsavaris	44	44	4.9	0.55 (0.34, 0.89)	0.55 (0.33, 0.90)
ICCG-1/81	133	148	13	0.87 (0.65, 1.16)	0.85 (0.64, 1.13)
ITMO	135	136	6.2	0.90 (0.65, 1.24)	0.98 (0.70, 1.37)
GITSG-8174	90	88	12.1	0.73 (0.52, 1.02)	0.74 (0.53, 1.04)
NCTTG-794151	62	64	15.6	0.95 (0.64, 1.41)	1.02 (0.69, 1.51)
ECCOG-EST3275	91	89	16.5	0.89 (0.64, 1.23)	0.94 (0.68, 1.30)
EORTC-40905	103	103	7.0	0.88 (0.60, 1.29)	0.93 (0.64, 1.36)
ICCG	89	97	6.9	1.05 (0.74, 1.48)	1.05 (0.74, 1.49)
Validation trials
A-Cirera	520	515	2.8	0.55 (0.36, 0.84)	0.60 (0.39, 0.93)
B-CLASSIC	76	72	3.1	0.56 (0.44, 0.72)	0.72 (0.52, 1.00)
E-GOIM-9602	112	113	5.0	0.88 (0.66, 1.17)	0.91 (0.69, 1.21)
F-GOIRC	130	128	6.1	0.92 (0.65, 1.30)	0.90 (0.64, 1.26)

Details of chemotherapy regimens can be found in the supplementary material of Oba et al.¹⁰

Figure 2.

Summary of the ‘Oba data’.

3 Methods for evaluating surrogate endpoints

In this section, the technical details of the meta-analytic models are listed with emphasis on the use of such methods to predict a treatment effect measured by a target outcome of interest from the effect measured by a surrogate endpoint. The prediction is based on the association between the treatment effects on the two outcomes evaluated by a model developed based on the data in a ‘training set’, usually data from historical studies available for both outcomes from which a model ‘learns’ the relationship between them.

The methods in a Bayesian framework are described in Sections 3.1 to 3.4. To investigate the impact of the choice of parameterisation on the uncertainty around the predicted effects, we start with the simplest model allowing for a minimum variability, the FEMR. We then increase the allowed variability in the model by the use of random effects meta-regression (REMR) and further by introducing bivariate meta-analytic models which allow for the measurement error of the treatment effect on the surrogate endpoint. Sensitivity analyses to prior distributions and the distributional assumptions are discussed in Sections 3.5 and 3.6, respectively. Some frequentist approaches are then discussed in Section 3.7. Strategies for the validation of surrogate endpoints and model comparison are described in Section 3.8.

3.1 Meta-regression

3.1.1 Fixed-effects meta-regression

Linear or weighted regression models have been used to evaluate surrogate endpoints with regard to predictions,^7,8 by including the treatment effect on a surrogate endpoint in the meta-analysis as a covariate. In the meta-analytic context, this approach can be described by the FEMR which in the Bayesian framework for normally distributed outcomes has the form

$$Y_{2i} \sim N(μ_{2i}, σ_{2i}^{2}), \qquad μ_{2i} = λ_{0} + λ_{1} Y_{1i} \tag{1}$$

with prior distributions $λ_{0}, λ_{1} \sim N(0.0, 1000000)$. Y_1i and Y_2i are the estimates of the treatment effects on the surrogate and the final outcomes, respectively, with standard deviation σ_2i corresponding to the effect on the final outcome in each study i. The normally distributed effects Y_2i estimate underlying true effects μ_2i. The intercept λ₀ and slope λ₁ define the relationship between the effects on the two outcomes.

Having estimated the parameters λ₀ and λ₁, this model can be used to predict the treatment effect on the target outcome based on the observed treatment effect on the surrogate endpoint. If for a new study j, the observed treatment effect on the surrogate outcome is Y_1j then, based on model (1), prediction is made using the regression equation

$$\hat{μ}_{2j} = λ_{0} + λ_{1} Y_{1j}. \tag{2}$$

In this model, uncertainty around the predicted effect on the target outcome is related only to the uncertainty around the regression coefficients, the intercept λ₀ and the slope λ₁, whereas the treatment effect on the surrogate endpoint is treated as a fixed covariate.
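As a rough illustration of model (1) and equation (2), the sketch below fits the fixed-coefficient trend by inverse-variance weighted least squares to a handful of rows copied from Table 1. It is not the WinBUGS implementation referred to in this paper; it is a frequentist stand-in that should approximate the posterior under the vague priors above. Two assumptions are made here and are not taken from the text: effects are modelled on the log scale, and standard errors are back-calculated from the reported intervals as if they were symmetric 95% intervals on the log scale. All function names, and the new study with a relapse rate ratio of 0.45, are hypothetical.

```python
import math

# Subset of Table 1: (study, relapse rate ratio, disability progression
# rate ratio, lower and upper limit of the disability progression interval).
ROWS = [
    ("Paty (2) 1993", 0.66, 0.71, 0.46, 1.12),
    ("Johnson 1995", 0.71, 0.88, 0.57, 1.35),
    ("Jacobs 1996", 0.68, 0.63, 0.38, 1.04),
    ("Polman 2006", 0.32, 0.59, 0.46, 0.75),
    ("Comi (1) 2009", 0.42, 0.69, 0.52, 0.93),
    ("Clanet 2002", 1.05, 1.00, 0.84, 1.20),
]


def log_se_from_interval(lower, upper, z=1.96):
    """Assumption: the interval is a symmetric 95% interval on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def fit_femr(y1, y2, se2):
    """Weighted least squares for y2 = lambda0 + lambda1 * y1, weights 1/se2^2."""
    w = [1.0 / s ** 2 for s in se2]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, y1))
    swxx = sum(wi * x * x for wi, x in zip(w, y1))
    swy = sum(wi * y for wi, y in zip(w, y2))
    swxy = sum(wi * x * y for wi, x, y in zip(w, y1, y2))
    det = sw * swxx - swx ** 2
    lam0 = (swxx * swy - swx * swxy) / det
    lam1 = (sw * swxy - swx * swy) / det
    # Elements of (X'WX)^{-1}: var(lambda0), var(lambda1), cov(lambda0, lambda1).
    return {"lam0": lam0, "lam1": lam1,
            "var0": swxx / det, "var1": sw / det, "cov01": -swx / det}


def predict(fit, y1_new):
    """Mean trend of equation (2) and its standard error; y1_new is treated as fixed."""
    mean = fit["lam0"] + fit["lam1"] * y1_new
    var = fit["var0"] + 2.0 * y1_new * fit["cov01"] + y1_new ** 2 * fit["var1"]
    return mean, math.sqrt(var)


y1 = [math.log(r[1]) for r in ROWS]
y2 = [math.log(r[2]) for r in ROWS]
se2 = [log_se_from_interval(r[3], r[4]) for r in ROWS]
fit = fit_femr(y1, y2, se2)
mean, se = predict(fit, math.log(0.45))  # hypothetical new study
print(f"predicted log ratio {mean:.3f} (SE {se:.3f}), ratio {math.exp(mean):.2f}")
```

Because the surrogate effect of the new study enters as a fixed number, the standard error returned by `predict` reflects only the joint uncertainty in the two regression coefficients.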

3.1.2 Random effects meta-regression

A REMR model can be used to evaluate surrogate endpoints.¹⁷ The model allows for between-study variability by assuming that the treatment effects Y_2i estimate different underlying true effects μ_2i (regardless of the value of the covariate) in each study i. In a Bayesian framework, meta-regression can be formulated as in Sutton and Abrams¹⁸ in the following way using the random effects approach

$$Y_{2i} \sim N(μ_{2i}, σ_{2i}^{2}), \qquad μ_{2i} = λ_{0i} + λ_{1} Y_{1i}, \qquad λ_{0i} \sim N(β, ψ^{2}) \tag{3}$$

where Y_1i are the summary measures of the treatment effect on the candidate surrogate outcome and Y_2i represent the summary measures of the treatment effect on the target outcome with corresponding standard deviations σ_2i from each study i. The normally distributed Y_2i are estimates of the underlying true effects μ_2i. The λ_0i are the true effects at value zero of the treatment effect on the surrogate endpoint and they follow a common normal distribution with mean β and standard deviation ψ, representing the between-study heterogeneity. The regression coefficient λ₁ represents the relationship between the treatment effects on the target and the surrogate outcomes. In this Bayesian framework, all parameters are given prior distributions: $β \sim N(0.0, 1000)$, $λ_{1} \sim N(0.0, 1000000)$ and $ψ \sim N(0, 100)\,I(0,)$ (a half-normal distribution truncated at zero).

The prediction can be made by

$$\hat{μ}_{2j} = λ_{0j} + λ_{1} Y_{1j}, \tag{4}$$

where λ_0j is obtained from the model, by the use of Markov chain Monte Carlo (MCMC) simulation, with data that include the new study but with the target outcome coded as missing (NA in WinBUGS).

An alternative approach is also possible by centring the values of the effect on the surrogate, Y_1i. In this case, the interpretation would change and the intercept would represent the true treatment effect on the final outcome at the average value of the effect on the surrogate endpoint. This approach could have an advantage when external information is available to construct an informative prior distribution to be placed on the intercept. Also, the centring of the effect on the surrogate may help to reduce the autocorrelation when conducting the MCMC simulation. However, for the purpose of predicting the effect for a new study, which is central to the evaluation of surrogate endpoints, the effect would have to be ‘un-centred’.

WinBUGS code corresponding to this model is included in Appendix 1.1.
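To make the contrast with the fixed-coefficient prediction explicit, the note below writes out what equation (4) implies for a new study j once the parameters are held fixed. It is a reading of model (3) rather than a formula stated in this paper, and the symbols $\bar{Y}_{1}$ (the mean of the Y_1i) and $λ^{c}_{0j}$ (the intercept under centring) are notation introduced here for illustration.

```latex
% Conditional on (\beta, \lambda_1, \psi), the intercept of a new study j
% is a fresh draw from the between-study distribution:
\lambda_{0j} \mid \beta, \psi \sim N(\beta, \psi^{2})
\quad\Longrightarrow\quad
\hat{\mu}_{2j} \mid \beta, \lambda_{1}, \psi
  \sim N\!\left(\beta + \lambda_{1} Y_{1j},\; \psi^{2}\right).

% With the surrogate effects centred at \bar{Y}_{1}, the same prediction reads
\hat{\mu}_{2j} = \lambda^{c}_{0j} + \lambda_{1}\,(Y_{1j} - \bar{Y}_{1}),
\qquad
\lambda^{c}_{0j} = \lambda_{0j} + \lambda_{1}\,\bar{Y}_{1}.
```

The MCMC output additionally averages over the posterior distribution of (β, λ₁, ψ), so the between-study variance ψ² adds to the parameter uncertainty already present in equation (2).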

3.2 Meta-analysis by Daniels and Hughes

In a model proposed by Daniels and Hughes,⁶ the estimates of the treatment effects measured by the surrogate endpoint Y_1i and the target outcome Y_2i are assumed to come from a bivariate normal distribution and they estimate the underlying true effects on the surrogate and target outcomes μ_1i and μ_2i, respectively, from each study i with corresponding within-study standard deviations σ_1i and σ_2i and within-study correlation ρ_wi

$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim MVN\left(\begin{pmatrix} μ_{1i} \\ μ_{2i} \end{pmatrix}, \begin{pmatrix} σ_{1i}^{2} & σ_{1i} σ_{2i} ρ_{wi} \\ σ_{1i} σ_{2i} ρ_{wi} & σ_{2i}^{2} \end{pmatrix}\right), \qquad μ_{2i} \mid μ_{1i} \sim N(λ_{0} + λ_{1} μ_{1i}, ψ^{2}), \tag{5}$$

where the underlying true effects μ_1i measured by the surrogate endpoint are assumed to be fixed effects and to have a linear relationship with the true effect on the target outcome μ_2i. Prior distributions are given to all parameters: $μ_{1i} \sim N(0, 1000)$, $λ_{0} \sim N(0.0, 1000)$, $λ_{1} \sim N(0.0, 1000)$ and $ψ \sim N(0, 100)\,I(0,)$.

In this model, estimates of the treatment effects on both the target as well as the surrogate endpoints are treated as response variables and therefore the uncertainty around the treatment effect on the surrogate outcome is taken into account in this model. If for a study j the observed treatment effect on the surrogate outcome is Y_1j, then the treatment effect on the target outcome Y_2j can be predicted from the model by assuming that this outcome is missing at random. By assuming that the two effects are correlated and follow a common bivariate distribution, the missing effect (on the target outcome in this case) is estimated automatically by the MCMC simulation, from the model which takes into account the correlation between the effects on the two outcomes. WinBUGS code for this model is listed in Appendix 1.2.
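A short calculation indicates how the measurement error in the surrogate effect feeds into the prediction from model (5). It is a sketch under two simplifying assumptions that are not stated in this paper: the vague N(0, 1000) prior on μ_1j is treated as flat, and the parameters (λ₀, λ₁, ψ) are held fixed.

```latex
% Step 1: with Y_{2j} missing and a flat prior, the true surrogate effect satisfies
\mu_{1j} \mid Y_{1j} \sim N\!\left(Y_{1j},\; \sigma_{1j}^{2}\right).

% Step 2: the second line of model (5) gives
\mu_{2j} \mid \mu_{1j} \sim N\!\left(\lambda_{0} + \lambda_{1}\mu_{1j},\; \psi^{2}\right).

% Step 3: integrating out \mu_{1j}, conditional on (\lambda_0, \lambda_1, \psi),
\mu_{2j} \mid Y_{1j} \sim
  N\!\left(\lambda_{0} + \lambda_{1} Y_{1j},\; \psi^{2} + \lambda_{1}^{2}\sigma_{1j}^{2}\right).
```

The extra term $λ_{1}^{2} σ_{1j}^{2}$ is absent when the surrogate effect is treated as a fixed covariate, which is one way to see why this model tends to give wider predictive intervals than meta-regression.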

3.3 Bivariate random effects meta-analysis (BRMA)

Bivariate meta-analytic methods have been proposed for joint modelling of correlated outcomes^12,19 and included approaches in a Bayesian framework.^20,21 BRMA is discussed here in the form described by van Houwelingen et al.¹² and Riley et al.,¹³ where estimates of treatment effect on both outcomes Y_1i and Y_2i are assumed to be normally distributed

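$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim MVN\left(\begin{pmatrix} μ_{1i} \\ μ_{2i} \end{pmatrix}, Σ_{i}\right), \qquad Σ_{i} = \begin{pmatrix} σ_{1i}^{2} & σ_{1i} σ_{2i} ρ_{wi} \\ σ_{1i} σ_{2i} ρ_{wi} & σ_{2i}^{2} \end{pmatrix} \tag{6}$$

$$\begin{pmatrix} μ_{1i} \\ μ_{2i} \end{pmatrix} \sim MVN\left(\begin{pmatrix} β_{1} \\ β_{2} \end{pmatrix}, T\right), \qquad T = \begin{pmatrix} τ_{1}^{2} & τ_{1} τ_{2} ρ_{b} \\ τ_{1} τ_{2} ρ_{b} & τ_{2}^{2} \end{pmatrix}. \tag{7}$$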

In this model, the treatment effect on the surrogate endpoint Y_1i and the treatment effect on the target outcome Y_2i are assumed to estimate the correlated true effects μ_1i and μ_2i with corresponding within-study variances $σ_{1i}^{2}$ and $σ_{2i}^{2}$ of the estimates and the within-study correlation ρ_wi between them. These true study-level effects follow a bivariate normal distribution with means $(β_{1}, β_{2})$, between-study variances $τ_{1}^{2}$ and $τ_{2}^{2}$ and a between-study correlation ρ_b in this hierarchical framework. Equation (6) represents the within-study model and equation (7) is the between-study model. To implement the model in the Bayesian framework, an inverse Wishart prior distribution is placed on the between-study covariance matrix by specifying a Wishart distribution for its inverse, $T^{-1} \sim Wishart\left(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, 3\right)$, where the degrees of freedom parameter was set to 3 (the dimension of the matrix plus 1) to induce a uniform prior distribution for the between-study correlation ρ_b.²² Non-informative prior distributions are placed on the within-study correlations using uniform distributions $ρ_{wi} \sim U(-1, 1)$ and on the mean effects $β_{1}, β_{2} \sim N(0, 10000)$.

As in the model (5) by Daniels and Hughes, the treatment effect on the target outcome in a study j can be predicted from the treatment effect on the surrogate endpoint observed by this study, by assuming that the effect on the target outcome is missing at random and assuming exchangeability of the treatment effects. In contrast to model (5), the BRMA model allows an estimation of the pooled effects measured by both outcomes (rather than only the pooled effect of the target endpoint in equation (5) which is only possible when centring the effect on the surrogate outcome on the mean). Although the ability to estimate the pooled effect does not impact on the validation process, it can be advantageous when modelling treatment effects on surrogate and target outcomes jointly to combine all available evidence in the assessment of the effectiveness. However, to make it possible, stronger distributional assumptions about the true effects are made in this model in comparison with model (5). WinBUGS code for this model is listed in Appendix 1.3.

3.4 BRMA in product normal formulation (BRMA PNF)

The BRMA models (6) and (7) can be parameterised in an alternative form where instead of placing a prior distribution on the between-study covariance matrix as a whole, the between-study model (7) is represented in the PNF^14,23 (a product of univariate conditional normal distributions), whereas the within-study model remains the same

$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim MVN\left(\begin{pmatrix} μ_{1i} \\ μ_{2i} \end{pmatrix}, Σ_{i}\right), \qquad Σ_{i} = \begin{pmatrix} σ_{1i}^{2} & σ_{1i} σ_{2i} ρ_{wi} \\ σ_{1i} σ_{2i} ρ_{wi} & σ_{2i}^{2} \end{pmatrix} \tag{8}$$

$$\begin{cases} μ_{1i} \sim N(η_{1}, ψ_{1}^{2}) \\ μ_{2i} \mid μ_{1i} \sim N(η_{2i}, ψ_{2}^{2}) \\ η_{2i} = λ_{0} + λ_{1} μ_{1i}. \end{cases} \tag{9}$$

As for the BRMA model, Y_1i and Y_2i are the estimates of the treatment effects measured by the surrogate and target endpoints, respectively, and the μ_1i and μ_2i are the true effects in the population which are correlated and modelled here through a linear relationship. Prior distributions are placed on the following parameters: $ρ_{wi} \sim U(-1, 1)$, $λ_{0} \sim N(0.0, 1000)$, $η_{1} \sim N(0.0, 1000)$, $ψ_{1} \sim N(0, 100)\,I(0,)$, $ψ_{2} \sim N(0, 100)\,I(0,)$ and $ρ_{b} \sim U(-1, 1)$. The between-study variances are $τ_{1}^{2} = ψ_{1}^{2}$ and $τ_{2}^{2} = ψ_{2}^{2} + λ_{1}^{2} ψ_{1}^{2}$ and hence the implied prior distribution is placed on

$$λ_{1} = \frac{ψ_{2}}{ψ_{1}} \frac{ρ_{b}}{\sqrt{1 - ρ_{b}^{2}}}.$$¹⁴

The PNF provides better control over the prior distributions placed on specific parameters of the model (compared to BRMA with Wishart prior distribution), helping to ensure that they are non-informative when this is required or allowing for informative prior distributions, based on external evidence, to be placed directly on the desirable parameters of the model.¹⁴ WinBUGS code corresponding to this model is included in Appendix 1.4.
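The correspondence between the two parameterisations can be checked numerically. The sketch below converts between the BRMA between-study parameters of equation (7) and the PNF parameters of equation (9), using the variance relations quoted above. The identities η₁ = β₁ and λ₀ = β₂ − λ₁β₁ follow from matching the means of the two formulations; they are derived here rather than quoted from the text. The function names and the numerical values are hypothetical.

```python
import math


def brma_to_pnf(beta1, beta2, tau1, tau2, rho_b):
    """Map the BRMA between-study parameters to the product normal formulation."""
    lam1 = rho_b * tau2 / tau1  # slope of mu2 on mu1
    return {
        "eta1": beta1,
        "psi1": tau1,
        "psi2": tau2 * math.sqrt(1.0 - rho_b ** 2),  # conditional sd of mu2 given mu1
        "lam1": lam1,
        "lam0": beta2 - lam1 * beta1,
    }


def pnf_to_brma(eta1, lam0, lam1, psi1, psi2):
    """Inverse map, using tau2^2 = psi2^2 + lam1^2 * psi1^2."""
    tau2 = math.sqrt(psi2 ** 2 + lam1 ** 2 * psi1 ** 2)
    return {
        "beta1": eta1,
        "beta2": lam0 + lam1 * eta1,
        "tau1": psi1,
        "tau2": tau2,
        "rho_b": lam1 * psi1 / tau2,
    }


# Illustrative values only (not estimates from the Sormani or Oba data).
pnf = brma_to_pnf(beta1=-0.5, beta2=-0.3, tau1=0.4, tau2=0.3, rho_b=0.8)
back = pnf_to_brma(pnf["eta1"], pnf["lam0"], pnf["lam1"], pnf["psi1"], pnf["psi2"])

# The slope agrees with the expression for lambda_1 given in the text.
slope_from_text = (pnf["psi2"] / pnf["psi1"]) * 0.8 / math.sqrt(1.0 - 0.8 ** 2)
print(pnf)
print(back)
print(slope_from_text)
```

Writing the mapping out this way also shows which prior is implied on which quantity: a prior placed on ψ₂ is a prior on the conditional standard deviation of the target effect, not on τ₂ itself.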

3.5 Sensitivity analysis: Prior distributions

When investigating the impact of parameterisation and the related uncertainty on the precision of the predicted estimates, we carried out sensitivity analysis using a range of prior distributions for the heterogeneity parameters (ψ in meta-regression and model by Daniels and Hughes and $ψ_{1, 2}$ in BRMA (PNF)). The following distributions were included:

Prior I: $ψ \sim N(0, 100)\,I(0,)$

Prior II: $ψ \sim N(0, 10)\,I(0,)$

Prior III: $\frac{1}{ψ^{2}} \sim Gamma(0.001, 0.001)$

Prior IV: $ψ \sim Uniform(0, 2)$.

Other examples of non-informative prior distributions can be found in the simulation study by Lambert et al.²⁴ Sensitivity analysis was also carried out to investigate the impact of the choice of the parameters of the inverse Wishart prior distribution on the implied prior distributions for the heterogeneity parameters (while maintaining the implied uniform prior distribution on the between-study correlation). Wishart prior distributions with the following parameters were tested:

Wishart A: $T^{-1} \sim Wishart\left(\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, 3\right)$

Wishart B: $T^{-1} \sim Wishart\left(\begin{pmatrix} 0.1 & 0 \\ 0 & 0.1 \end{pmatrix}, 3\right)$.

Figure 3 shows the prior distributions for the standard deviations overlaid (distributions I, II and IV used directly and distributions obtained from priors III, Wishart A and B by transformation to the standard deviation scale). Prior distributions I–III have large variances and hence are non-informative. The uniform prior distribution IV is locally non-informative on the scale of the modelled data. The implied prior distributions on the standard deviations obtained from the Wishart distributions placed on the between-study precision matrix are both quite informative (as mentioned above, the corresponding implied distribution on the between-study correlation is uniform on the range of values between –1 and 1).

Figure 3.

Prior distributions for the standard deviations used in the sensitivity analysis.
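The two properties claimed for the Wishart A prior, a uniform implied prior on the between-study correlation and a fairly concentrated implied prior on the standard deviations, can be checked by simulation. The sketch below is a hedged illustration using only the Python standard library, not the code used to produce Figure 3. It draws the precision matrix as a sum of three outer products of standard normal vectors, which gives a Wishart distribution with identity scale matrix and 3 degrees of freedom; with an identity matrix the standard and WinBUGS parameterisations coincide. The function names, seed and number of draws are arbitrary choices made here.

```python
import math
import random


def draw_wishart_2x2(df, rng):
    """Wishart(I, df) draw for integer df: sum of df outer products of N(0, I) vectors."""
    a = b = c = 0.0
    for _ in range(df):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        a += x * x
        b += x * y
        c += y * y
    return a, b, c  # precision matrix [[a, b], [b, c]]


def implied_sd_and_correlation(n_draws=20000, df=3, seed=2024):
    """Implied prior draws of tau_1 and rho_b when T^{-1} ~ Wishart(I, df)."""
    rng = random.Random(seed)
    tau1, rho = [], []
    for _ in range(n_draws):
        a, b, c = draw_wishart_2x2(df, rng)
        det = a * c - b * b
        # T = inverse of the precision matrix = [[c, -b], [-b, a]] / det
        tau1.append(math.sqrt(c / det))
        rho.append(-b / math.sqrt(a * c))
    return tau1, rho


tau1, rho = implied_sd_and_correlation()
share_mid = sum(abs(r) < 0.5 for r in rho) / len(rho)
tau1_sorted = sorted(tau1)
print(f"mean correlation       {sum(rho) / len(rho):+.3f}  (a uniform prior would give 0)")
print(f"share with |rho| < 0.5 {share_mid:.3f}  (a uniform prior would give 0.5)")
print(f"median implied sd      {tau1_sorted[len(tau1_sorted) // 2]:.2f}")
```

Comparing the spread of the simulated standard deviations with a half-normal prior such as Prior I gives a numerical sense of how much more informative the Wishart-implied prior is.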

3.6 Sensitivity analysis: Relaxing the normality assumption

The methods considered here are models with random effects to reflect the assumption that the modelled treatment effects are different between the studies. The differences in the effects may be due to the varying populations, different treatments under investigation in those studies or perhaps heterogeneity in the definitions of the outcomes.²⁵ Typically, the normal distribution of the between-study random effects is assumed to reflect the similarity of the effects. The assumption that the true treatment effects on both outcomes (such as log relative risk and log rate ratio for the example in RRMS or log hazard ratio on OS and DFS in gastric cancer) are normally distributed may, however, not always be reasonable. When dealing with departures from normality of the modelled data, this assumption can lead to limitations of modelling and restricted inferences.²⁶ For example, as discussed by Marshall and Spiegelhalter, inappropriate use of the normality assumption about the random effects may lead to ‘overshrinkage’ of the true effects and hence to misleading inferences.²⁷

One way of relaxing this assumption is to use a t-distribution as recommended, for example, by Lee and Thompson²⁶ or Smith et al.²⁸ In contrast to the normal distribution, the t-distribution places more weight in the tails and is therefore likely to be better at modelling extreme effects such as outlying observations.²⁷ We apply the t-distribution to the random effect in the BRMA model by adapting its PNF. In the product of t-distributions formulation (PTDF), the between-study model can be formulated as

$$\begin{cases} μ_{1i} \sim t(η_{1}, ν_{1}, df) \\ μ_{2i} \mid μ_{1i} \sim t(η_{2i}, ν_{2}, df) \\ η_{2i} = λ_{0} + λ_{1} μ_{1i} \end{cases} \tag{10}$$

with prior distributions placed on the parameters, $λ_{0}, λ_{1} \sim N(0.0, 1000)$ and $η_{1} \sim N(0.0, 1000)$. Placing non-informative prior distributions on the between-study standard deviations corresponding to the true effects μ_1i and μ_2i, $τ_{1} \sim N(0, 100)\,I(0,)$ and $τ_{2} \sim N(0, 100)\,I(0,)$, gives implied prior distributions on the corresponding parameters, $ν_{1} = τ_{1}^{2}(df - 2)/df$ and $ν_{2} = τ_{2}^{2}(df - 2)/df$. WinBUGS code corresponding to this model is included in Appendix 1.5.
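The relation between ν and τ can be verified with a small simulation. The sketch below reads ν as the squared scale of a location-scale t-distribution, which is an interpretation made here (WinBUGS itself parameterises the t-distribution by a precision-like quantity). Under that reading a scaled t variable has variance ν·df/(df − 2), which equals τ² when ν = τ²(df − 2)/df. The values τ = 0.3 and df = 10, and the function names, are hypothetical.

```python
import math
import random


def squared_scale_from_sd(tau, df):
    """nu = tau^2 * (df - 2) / df, as in the text; requires df > 2."""
    if df <= 2:
        raise ValueError("the variance of a t-distribution is finite only for df > 2")
    return tau ** 2 * (df - 2) / df


def draw_scaled_t(location, nu, df, rng):
    """Location-scale t draw: location + sqrt(nu) * Z / sqrt(chi2_df / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return location + math.sqrt(nu) * z / math.sqrt(chi2 / df)


rng = random.Random(1)
tau, df = 0.3, 10  # illustrative values only
nu = squared_scale_from_sd(tau, df)
draws = [draw_scaled_t(0.0, nu, df, rng) for _ in range(100000)]
mean = sum(draws) / len(draws)
sample_var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(f"nu = {nu:.4f}, target variance tau^2 = {tau ** 2:.4f}, simulated = {sample_var:.4f}")
```

The rescaling keeps τ interpretable as a between-study standard deviation, so the same half-normal priors can be used for the normal and the t-distributed random effects.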

3.7 Frequentist approaches

The above models for evaluation of surrogate endpoints differ in the way they take into account the uncertainty around the model parameters. The Bayesian framework gives a flexible environment for modelling of uncertainty. Some of the models, however, can also be implemented in a frequentist approach using software such as Stata. To compare the different degrees of uncertainty allowed by different frequentist models, two models are compared here: the meta-regression and the bivariate meta-analysis.

3.7.1 Meta-regression

Suppose Y_1i is the estimate of the treatment effect on the candidate surrogate outcome and Y_2i represents the estimate of the treatment effect on the target outcome with corresponding within-study variance v_2i in study i ( $i = 1, \dots, n$ ). In the frequentist framework, meta-regression for the association between the effects on the surrogate and the target endpoints can be written following the formulation by Sharp²⁹ in the following form

$$Y_{2} \sim N(Y_{1} λ, V) \qquad (11)$$

where $Y_{2} = (Y_{21}, \dots, Y_{2n})^{T}$ is the $n \times 1$ vector of the treatment effect on the final outcome, $Y_{1}$ is the $n \times 2$ design matrix with ith row $(1, Y_{1i})$, $λ = (λ_{0}, λ_{1})^{T}$ is the vector of parameters and V is a diagonal n × n variance matrix with ith diagonal element $v_{2i} + τ^{2}$, where τ² represents the between-study variability for the random effects model. Maximum likelihood methods are used to estimate the parameters λ and τ² and in Stata this can be achieved by using the command metareg. The predictions are made using the linear predictor, and in Stata using the post-estimation command predict.
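To make the structure of model (11) concrete, the following Python sketch fits the meta-regression by maximising the likelihood over a grid of τ² values, with the coefficients obtained by weighted least squares for each candidate τ². This is an illustrative implementation under stated assumptions, not a description of how metareg works internally; the function names, the search grid and the data are hypothetical.

```python
import numpy as np


def fit_meta_regression(y1, y2, v2, tau2_grid=None):
    """Fit Y2 ~ N(X lambda, V) with V = diag(v2 + tau2) by profile likelihood.

    For each candidate tau2 the coefficients are the weighted least squares
    solution; the tau2 with the highest log-likelihood is kept.
    """
    y1, y2, v2 = (np.asarray(a, dtype=float) for a in (y1, y2, v2))
    X = np.column_stack([np.ones_like(y1), y1])      # ith row is (1, Y_1i)
    if tau2_grid is None:
        tau2_grid = np.linspace(0.0, 1.0, 1001)      # hypothetical search range
    best = None
    for tau2 in tau2_grid:
        w = 1.0 / (v2 + tau2)                        # inverse diagonal of V
        lam = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y2))
        resid = y2 - X @ lam
        # Log-likelihood up to an additive constant
        loglik = -0.5 * np.sum(np.log(v2 + tau2) + w * resid**2)
        if best is None or loglik > best[0]:
            best = (loglik, lam, float(tau2))
    return best[1], best[2]


def predict_effect(lam, y1_new):
    """Linear predictor lambda_0 + lambda_1 * Y_1 for a new study."""
    return lam[0] + lam[1] * y1_new


# Made-up log-scale effects lying exactly on a line (for illustration only)
y1 = np.array([-0.6, -0.3, -0.1, 0.0, 0.2])
y2 = 0.1 + 0.5 * y1
v2 = np.array([0.04, 0.02, 0.05, 0.03, 0.06])
lam_hat, tau2_hat = fit_meta_regression(y1, y2, v2)
print(lam_hat, tau2_hat, predict_effect(lam_hat, -0.4))
```

Note that the prediction uses only the point value of the effect on the surrogate endpoint, which enters the model as a fixed covariate.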

3.7.2 Bivariate meta-analysis

As in the Bayesian framework, the random effects bivariate meta-analysis can be described in the hierarchical framework

$$\begin{pmatrix} Y_{1i} \\ Y_{2i} \end{pmatrix} \sim MVN\left(\begin{pmatrix} μ_{1i} \\ μ_{2i} \end{pmatrix}, Σ_{i}\right), \quad Σ_{i} = \begin{pmatrix} σ_{1i}^{2} & σ_{1i} σ_{2i} ρ_{wi} \\ σ_{1i} σ_{2i} ρ_{wi} & σ_{2i}^{2} \end{pmatrix} \qquad (12)$$

$$\begin{pmatrix} μ_{1i} \\ μ_{2i} \end{pmatrix} \sim MVN\left(\begin{pmatrix} β_{1} \\ β_{2} \end{pmatrix}, T\right), \quad T = \begin{pmatrix} τ_{1}^{2} & τ_{1} τ_{2} ρ_{b} \\ τ_{1} τ_{2} ρ_{b} & τ_{2}^{2} \end{pmatrix} \qquad (13)$$

with the treatment effect on the surrogate endpoint Y_1i and the treatment effect on the target outcome Y_2i in each study i and corresponding within-study variances of the estimates $σ_{1i}^{2}$ and $σ_{2i}^{2}$ and the within-study correlation ρ_wi between them. The correlated true effects μ_1i and μ_2i follow bivariate normal distribution with means $(β_{1}, β_{2})$, between-study variances $τ_{1}^{2}$ and $τ_{2}^{2}$ and a between-study correlation ρ_b. In Stata, the model can be implemented using the command mvmeta.³⁰ In the Bayesian framework, the predicted estimates for the final endpoint assumed missing at random are obtained from a MCMC simulation. Here, we obtain the estimate of the true effect on the final outcome for study j as follows

$$E(μ_{j} \mid Y_{j}, β, T) = β + (Σ_{j} + T)^{-1} T (Y_{j} - β) \qquad (14)$$

$$var(μ_{j} \mid Y_{j}, β, T) = (Σ_{j} + T)^{-1} T Σ_{j}, \qquad (15)$$

where $Y_{j}$, $μ_{j}$ and $β$ are two-dimensional vectors and $Σ_{j}$ and T are 2 × 2 matrices.

Stata code for the model predictions using the meta-regression and the BRMA is included in Appendix 1.6.
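The two-level structure of equations (12) and (13) can be illustrated by simulating data from it. The Python sketch below is an assumption-laden illustration rather than the Stata implementation referred to above: it builds the 2 × 2 covariance matrices from standard deviations and correlations, draws the true effects from the between-study model and then draws the observed estimates from the within-study model. All function names and numerical values are hypothetical.

```python
import numpy as np


def covariance_2x2(sd1, sd2, rho):
    """2 x 2 covariance matrix from two standard deviations and a correlation."""
    cov = rho * sd1 * sd2
    return np.array([[sd1**2, cov], [cov, sd2**2]])


def simulate_brma(beta, tau, rho_b, sigma, rho_w, seed=0):
    """Draw true effects (eq. 13) and observed estimates (eq. 12) for each study."""
    rng = np.random.default_rng(seed)
    T = covariance_2x2(tau[0], tau[1], rho_b)            # between-study covariance
    n = len(sigma)
    mu = rng.multivariate_normal(beta, T, size=n)        # true effects per study
    y = np.empty((n, 2))
    for i, ((s1, s2), rw) in enumerate(zip(sigma, rho_w)):
        Sigma_i = covariance_2x2(s1, s2, rw)             # within-study covariance
        y[i] = rng.multivariate_normal(mu[i], Sigma_i)   # observed estimates
    return mu, y


# Illustrative values on the log scale (not estimates from the paper)
T = covariance_2x2(0.4, 0.3, 0.8)
mu, y = simulate_brma(
    beta=[-0.5, -0.3],
    tau=(0.4, 0.3),
    rho_b=0.8,
    sigma=[(0.20, 0.25), (0.15, 0.30), (0.35, 0.40)],
    rho_w=[0.3, 0.3, 0.3],
)
print(mu)
print(y)
```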

3.8 Cross-validation procedure and model comparison

Evaluation of surrogate endpoints at the study level, that is, assessing whether the treatment effect on the final outcome can be predicted from the treatment effect on the surrogate endpoint, can be carried out using the take-one-out approach in a cross-validation procedure, as described by Daniels and Hughes.⁶ This procedure aims to establish the goodness of fit of the meta-analytic prediction model. For one study at a time, the effect on the final outcome is assumed unknown and is then predicted from the effect on the surrogate endpoint, conditional on the data on the treatment effects on both outcomes from the remaining studies and on the parameters of the model.

Ultimately we want to draw inferences about predicting the true effect on the final outcome μ_2j in a future study j. However, in a real data scenario (as opposed to simulated data) we do not know what the true effect is. Hence for the purpose of the cross-validation, we predict the ‘observed estimate’ ${\hat{Y}}_{2 j}$ . For this purpose, we assume σ_2j known and hence effectively only the true effect μ_2j is predicted. We then check if the observed value of Y_2j falls within the predicted interval of ${\hat{Y}}_{2 j}$ with the standard deviation equal to $\sqrt{σ_{2 j}^{2} + var ({\hat{μ}}_{2 j} | Y_{1 j}, σ_{1 j}, Y_{1 (- j)}, Y_{2 (- j)})}$ , where $Y_{1 (- j)}$ and $Y_{2 (- j)}$ denote the data from the remaining studies without the validation study j.
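The check described above can be sketched in Python as follows. This is a minimal illustration assuming a normal approximation for the 95% predicted interval (hence the multiplier 1.96); intervals obtained from MCMC output need not be symmetric. The function name and the numerical values are hypothetical.

```python
import math


def observed_within_prediction(y2_obs, mu2_pred, var_mu2_pred, sigma2, z=1.96):
    """Check whether the observed Y_2j lies inside the predicted interval.

    The predictive standard deviation combines the known within-study
    standard deviation sigma_2j with the variance of the predicted true effect.
    """
    sd = math.sqrt(sigma2**2 + var_mu2_pred)
    lower, upper = mu2_pred - z * sd, mu2_pred + z * sd
    return lower <= y2_obs <= upper, (lower, upper)


# Hypothetical log-scale values for a single validation study
inside, interval = observed_within_prediction(
    y2_obs=-0.84, mu2_pred=-0.13, var_mu2_pred=0.01, sigma2=0.30
)
print(inside, interval)
```

Repeating this check with each study in turn left out gives the take-one-out cross-validation.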

To investigate the impact of the uncertainty on predictions, we compare the models with respect to the predicted intervals. To compare how the choice of parameterisation affects the uncertainty of predictions, we compare the widths of the intervals of the predicted ${\hat{Y}}_{2 j}$ and predicted true effects ${\hat{μ}}_{2 j}$ across the models. To do so, we summarise the ratios $w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$ of the widths of the intervals for ${\hat{Y}}_{2 j}$ to the widths of the intervals for Y_2j to investigate how this varies across the models and the ratios $w_{{\hat{μ}}_{2 j}^{CM}} / w_{{\hat{μ}}_{2 j}^{FEMR}}$ of the widths of the predicted true effects ${\hat{μ}}_{2 j}$ from each current model (CM) to the width of the predicted interval for ${\hat{μ}}_{2 j}$ obtained from the FEMR.
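As a sketch of how such a ratio of interval widths can be computed, the Python snippet below assumes that widths are measured on the log scale of the ratio measures. The function names are hypothetical; the interval limits are the rounded values reported in Table 4 for the study Paty (1), observed and predicted by the REMR.

```python
import math


def log_width(lower: float, upper: float) -> float:
    """Width of an interval for a ratio measure, on the log scale (assumed)."""
    return math.log(upper) - math.log(lower)


def width_ratio(pred_interval, ref_interval) -> float:
    """Ratio of the width of a predicted interval to that of a reference interval."""
    return log_width(*pred_interval) / log_width(*ref_interval)


# Paty (1): REMR predicted interval against the observed interval (Table 4)
ratio = width_ratio((0.64, 1.53), (0.67, 1.49))
print(round(ratio, 2))
```

A ratio above one indicates that the predicted interval is wider than the interval around the observed estimate. The same function applies to the ratio of the predicted true effects from the current model to those from the FEMR.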

4 Results

4.1 Results from Bayesian models: multiple sclerosis

To compare the models, estimates of the pooled effects on both outcomes, the relapse rate ratio and the disability progression rate ratio, were first obtained from all the models. Owing to the large heterogeneity of the control arms between the studies (and the fact that an intervention used as the control arm in one study may be the experimental arm in another), only placebo-controlled studies were included in this particular estimation. Including all studies would not give clinically interpretable results; combining evidence from all the trials in a sensible way would require a network meta-analysis, which is beyond the scope of this paper. Note that the whole data set (including both placebo- and active-controlled trials) is used for the remaining analyses that focus on the predictions for the purpose of evaluation of surrogate endpoints. The results shown in Table 3 are for the comparison of models only. Both forms of BRMA allowed for the estimation of the pooled effects on both outcomes, in contrast to the meta-regression and the model by Daniels and Hughes, which allowed estimation of the pooled effect on disability progression only. The pooled effect on the surrogate endpoint, the relapse rate ratio, was the same using both forms of BRMA. The point estimate of the pooled effect on the target endpoint, the disability progression rate ratio, was the same for all models but was obtained with different precision from different models. The largest uncertainty around the estimate was obtained from the BRMA model with the Wishart A prior distribution placed on the between-study precision matrix. Effectiveness estimates of the highest precision were obtained from the meta-regression and the model by Daniels and Hughes. Relatively high precision of the pooled effect was also obtained from BRMA PNF.

Table 3.

Summary results for placebo-controlled studies for the treatment effects on the risk of disability progression and relapse rate ratio.

	Relapse incidence rate ratio			Disability relative risk
Model	Mean	95% CrI	$τ_{1}$ ^a (sd)	Mean	95% CrI	ψ₂ (sd)	τ₂ (sd)
REMR				0.75^b	[0.67; 0.84]	0.07 (0.06)
D&H^c				0.75^b	[0.66; 0.84]	0.07 (0.06)
BRMA	0.57	[0.44; 0.72]	0.44 (0.09)	0.75	[0.58; 0.95]		0.38 (0.09)
BRMA PNF	0.57	[0.46; 0.70]	0.36 (0.09)	0.75	[0.65; 0.87]	0.10 (0.06)	0.15 (0.08)

^a $ψ_{1} = τ_{1}$ in BRMA PNF.

^b Obtained by centring the effects on surrogate endpoint on the mean. ^c D&H refers to the model by Daniels & Hughes.

All four models were then applied to make predictions in a cross-validation procedure. For each of the 25 studies in turn, the treatment effect on the final outcome (disability progression rate ratio) was assumed unknown, so that the study became a validation study, and was then predicted from the effect on the surrogate endpoint (relapse rate ratio) by each model.

Table 4 lists the predictions made by all of the models for all of the studies (using prior distribution I for the heterogeneity parameter and Wishart A for the between-study precision matrix). For most studies, all models gave predicted ${\hat{Y}}_{2 j}$ with intervals containing the corresponding observed estimates. The exception was the study by Durelli, for which only the interval obtained from BRMA with Wishart prior A contained the observed estimate of the treatment effect. Most intervals obtained from BRMA with Wishart prior A were largely inflated, apart from the interval for the study by Miligan, which was the smallest study and had the largest intervals for the treatment effects on both outcomes.

Table 4.

Predictions obtained from all models for all studies in the ‘Sormani data’.

	Disability progression rate ratio, mean (95% CrI)
	Paty (1)	Paty (2)	Miligan	Johnson	Jacobs/Simon
Observed	1.00 (0.67, 1.49)	0.71 (0.45, 1.12)	1.14 (0.26, 5.03)	0.88 (0.57, 1.35)	0.63 (0.38, 1.05)
Meta-regression (FE)	0.99 (0.66, 1.48)	0.84 (0.53, 1.33)	0.93 (0.21, 4.11)	0.87 (0.56, 1.35)	0.85 (0.51, 1.42)
Meta-regression (RE)	0.99 (0.64, 1.53)	0.84 (0.52, 1.35)	0.92 (0.21, 4.13)	0.87 (0.54, 1.38)	0.85 (0.50, 1.45)
Daniels & Hughes	0.99 (0.63, 1.54)	0.84 (0.51, 1.37)	0.93 (0.20, 4.31)	0.87 (0.54, 1.41)	0.85 (0.50, 1.46)
BRMA (Wishart)	1.00 (0.47, 2.13)	0.81 (0.39, 1.68)	0.83 (0.16, 4.29)	0.81 (0.36, 1.82)	0.82 (0.36, 1.87)
BRMA (PNF)	0.97 (0.60, 1.57)	0.83 (0.49, 1.40)	0.86 (0.19, 3.95)	0.86 (0.52, 1.42)	0.83 (0.47, 1.48)
	Fazekas	Millefiorini	Achiron	Li (1)	Li (2)
Observed	0.70 (0.36, 1.35)	0.19 (0.05, 0.79)	0.82 (0.19, 3.50)	0.81 (0.61, 1.08)	0.73 (0.54, 0.99)
Meta-regression (FE)	0.66 (0.34, 1.29)	0.61 (0.14, 2.55)	0.63 (0.15, 2.69)	0.87 (0.65, 1.17)	0.86 (0.63, 1.17)
Meta-regression (RE)	0.65 (0.33, 1.30)	0.60 (0.14, 2.53)	0.62 (0.14, 2.67)	0.87 (0.62, 1.21)	0.85 (0.60, 1.20)
Daniels & Hughes	0.65 (0.32, 1.32)	0.60 (0.14, 2.60)	0.62 (0.14, 2.73)	0.87 (0.62, 1.22)	0.86 (0.60, 1.23)
BRMA (Wishart)	0.70 (0.28, 1.76)	0.65 (0.14, 3.16)	0.64 (0.13, 3.16)	0.85 (0.43, 1.68)	0.84 (0.39, 1.79)
BRMA (PNF)	0.67 (0.33, 1.38)	0.65 (0.15, 2.81)	0.67 (0.15, 2.97)	0.86 (0.58, 1.25)	0.84 (0.57, 1.24)
	Clanet	Durelli	Baumhackl	Polman	Rudick
Observed	1.00 (0.83, 1.20)	0.43 (0.24, 0.78)	1.08 (0.74, 1.57)	0.59 (0.46, 0.75)	0.79 (0.65, 0.96)
Meta-regression (FE)	1.08 (0.87, 1.34)	0.88 (0.48, 1.59)*	0.94 (0.64, 1.39)	0.58 (0.43, 0.78)	0.66 (0.53, 0.83)
Meta-regression (RE)	1.08 (0.82, 1.43)	0.87 (0.48, 1.61)*	0.94 (0.62, 1.43)	0.57 (0.40, 0.80)	0.66 (0.51, 0.86)
Daniels & Hughes	1.10 (0.84, 1.44)	0.88 (0.47, 1.64)*	0.94 (0.60, 1.46)	0.57 (0.39, 0.82)	0.66 (0.51, 0.87)
BRMA (Wishart)	1.04 (0.60, 1.79)	0.84 (0.35, 2.01)	0.91 (0.42, 1.95)	0.56 (0.27, 1.15)	0.69 (0.30, 1.59)
BRMA (PNF)	1.07 (0.77, 1.48)	0.85 (0.45, 1.61)*	0.91 (0.58, 1.44)	0.57 (0.37, 0.88)	0.67 (0.48, 0.94)
	Coles (1)	Coles (2)	Mikol	Comi (1)	Comi (2)
Observed	0.35 (0.16, 0.74)	0.38 (0.19, 0.77)	1.34 (0.88, 2.06)	0.69 (0.52, 0.93)	0.73 (0.55, 0.97)
Meta-regression (FE)	0.58 (0.27, 1.26)	0.49 (0.24, 1.01)	1.03 (0.66, 1.60)	0.66 (0.48, 0.91)	0.69 (0.51, 0.93)
Meta-regression (RE)	0.58 (0.26, 1.26)	0.48 (0.23, 1.01)	1.03 (0.65, 1.63)	0.65 (0.46, 0.93)	0.68 (0.48, 0.95)
Daniels & Hughes	0.58 (0.26, 1.30)	0.49 (0.23, 1.05)	1.04 (0.64, 1.69)	0.64 (0.42, 0.99)	0.67 (0.45, 1.00)
BRMA (Wishart)	0.64 (0.23, 1.75)	0.60 (0.22, 1.58)	0.92 (0.43, 1.97)	0.59 (0.28, 1.22)	0.71 (0.30, 1.67)
BRMA (PNF)	0.63 (0.28, 1.41)	0.55 (0.25, 1.21)	0.97 (0.59, 1.59)	0.68 (0.45, 1.04)	0.69 (0.46, 1.05)
	Havrdova (1)	Havrdova (2)	Sorensen	O'Connor (1)	O'Connor (2)
Observed	1.23 (0.58, 2.62)	1.04 (0.48, 2.27)	0.64 (0.32, 1.28)	1.05 (0.84, 1.31)	1.10 (0.88, 1.37)
Meta-regression (FE)	0.96 (0.45, 2.05)	0.86 (0.39, 1.88)	0.63 (0.31, 1.27)	1.06 (0.83, 1.37)	1.00 (0.78, 1.27)
Meta-regression (RE)	0.96 (0.44, 2.07)	0.86 (0.39, 1.89)	0.62 (0.30, 1.27)	1.07 (0.78, 1.45)	1.00 (0.75, 1.34)
Daniels & Hughes	0.96 (0.43, 2.10)	0.86 (0.38, 1.92)	0.62 (0.29, 1.31)	1.06 (0.79, 1.42)	0.99 (0.75, 1.32)
BRMA (Wishart)	0.93 (0.34, 2.51)	0.81 (0.30, 2.19)	0.63 (0.24, 1.65)	0.84 (0.43, 1.65)	0.95 (0.48, 1.87)
BRMA (PNF)	0.93 (0.42, 2.07)	0.84 (0.37, 1.92)	0.66 (0.31, 1.42)	1.01 (0.70, 1.47)	0.98 (0.68, 1.39)

The discrepancies between the observed and predicted values were obtained for all studies (by taking the absolute difference between the observed estimate of the treatment effect and the predicted effect) and summarised in Table 5, which also summarises the degree of uncertainty around the predicted estimate compared to the uncertainty around the observed value (by calculating the ratio $w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$ of the length of the 95% predicted interval to the length of the 95% confidence interval of the observed estimate, shown in the second to last column of the table). Note that the intervals of the predicted ${\hat{Y}}_{2 j}$ were inflated compared to those corresponding to the observed effects Y_2j due to the additional between-study variability. To compare uncertainty of predicted true effects across models, the ratio $w_{{\hat{μ}}_{2 j}^{CM}} / w_{{\hat{μ}}_{2 j}^{FEMR}}$ of the length of the 95% credible interval around ${\hat{μ}}_{2 j}$ obtained from the CM to the length of that interval from the FEMR was calculated and presented in the last column of Table 5.
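The absolute discrepancy can be sketched in Python as below, assuming that the difference is taken on the log scale of the ratio measures. The function name is hypothetical; the point estimates are the rounded values reported in Table 4 (observed and FEMR-predicted) for three of the studies.

```python
import math
import statistics


def abs_log_discrepancy(observed: float, predicted: float) -> float:
    """Absolute difference between observed and predicted effects on the log scale."""
    return abs(math.log(observed) - math.log(predicted))


# (observed, predicted by the FEMR) point estimates taken from Table 4
studies = {
    "Paty (1)": (1.00, 0.99),
    "Millefiorini": (0.19, 0.61),
    "Durelli": (0.43, 0.88),
}
discrepancies = {name: abs_log_discrepancy(o, p) for name, (o, p) in studies.items()}
for name, d in discrepancies.items():
    print(name, round(d, 2))
print("median of these three:", round(statistics.median(discrepancies.values()), 2))
```

Under this assumption the Millefiorini study gives a discrepancy of roughly 1.17 from the rounded table values, which is close to the upper end of the range reported for the FEMR in Table 5.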

Table 5.

Results of the comparison of the models for predicting the treatment effect on disability progression from the treatment effect on relapse rate.

		Absolute discrepancy	$w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$	$w_{{\hat{μ}}_{2 j}^{CM}} / w_{{\hat{μ}}_{2 j}^{FEMR}}$
Model	Prior	Median (range)	Median (range)	Median (range)
FEMR		0.16 (0.01, 1.16)	1.02 (1.00, 1.21)
REMR	I	0.15 (0.01, 1.15)	1.07 (1.00, 1.54)	1.96 (1.36, 2.56)
REMR	II	0.16 (0.01, 1.15)	1.07 (1.01, 1.52)	1.95 (1.34, 2.53)
REMR	III	0.15 (0.01, 1.15)	1.07 (1.01, 1.51)	1.91 (1.36, 2.43)
REMR	IV	0.16 (0.01, 1.15)	1.07 (1.01, 1.51)	1.93 (1.37, 2.58)
Daniels & Hughes	I	0.16 (0.01, 1.15)	1.11 (1.02, 1.50)	2.44 (1.65, 5.14)
Daniels & Hughes	II	0.17 (0.02, 1.16)	1.11 (1.02, 1.56)	2.28 (1.62, 5.78)
Daniels & Hughes	III	0.16 (0.01, 1.15)	1.11 (1.02, 1.59)	2.43 (1.61, 5.15)
Daniels & Hughes	IV	0.16 (0.01, 1.16)	1.11 (1.02, 1.45)	2.43 (1.51, 5.11)
BRMA PNF	I	0.14 (0.02, 1.23)	1.16 (1.02, 1.83)	2.95 (1.95, 4.85)
BRMA PNF	II	0.16 (0.01, 1.23)	1.18 (1.02, 1.73)	2.88 (2.02, 4.68)
BRMA PNF	III	0.15 (0.00, 1.23)	1.11 (1.02, 1.52)	2.26 (1.45, 4.48)
BRMA PNF	IV	0.15 (0.01, 1.24)	1.17 (1.02, 1.86)	2.90 (1.74, 4.92)
BRMA	Wishart A	0.16 (0.00, 1.24)	1.78 (1.10, 4.27)	7.00 (3.48, 10.07)
BRMA	Wishart B	0.13 (0.00, 1.23)	1.23 (1.03, 1.95)	3.28 (2.09, 5.60)

CM: current model in each row.

The accuracy of predictions for the point estimate was similar across models, but the uncertainty around the predicted effects varied depending on the parameterisation. Using the meta-regression equation (2) the effect on the target outcome was predicted with much increased precision compared to other models. For example, when using prior distribution I the interval for the predicted true effect ${\hat{μ}}_{2 j}$ from the REMR was almost twice as wide (on log relative risk scale) compared to the interval obtained from the FEMR. The results obtained from the models by Daniels and Hughes and BRMA PNF were much more conservative with moderately reduced precision (with intervals, respectively, 2.44 and 2.95 times wider than those obtained from the FEMR). When applying the BRMA model with a Wishart prior distribution, the results were sensitive to the parameters of the prior distribution. In the case of Wishart A distribution with identity matrix, the predicted intervals were largely inflated (most likely due to implied prior distributions on the between-study variances not being suitably non-informative). Using the Wishart B prior distribution led to predictions comparable to those obtained from BRMA PNF with slightly more inflated intervals. The use of the REMR approach, as in equation (4), resulted in increased uncertainty around the predicted effect on the disability progression (compared to predictions obtained when using the FEMR approach) of similar magnitude to the results obtained from models by Daniels and Hughes and BRMA PNF. This uncertainty in the predictions obtained from REMR can be related to the number of studies in the set or the level of the between-study heterogeneity and hence precision can be gained when using a larger set of studies. The same scenario applies to some extent to other models as well. 
This is mostly the case for the model by Daniels and Hughes which has a form similar to the REMR, but in addition the uncertainty in this model is related to the uncertainty around the effect on the surrogate endpoint, while this is not the case when using meta-regression which includes the effect on the surrogate endpoint as a fixed covariate. Similarly, BRMA PNF gives predictions with uncertainty related to both the size and heterogeneity of the data set (as well as the uncertainty around the effect on the surrogate outcome); however, perhaps less so because of strong distributional assumptions about the between-study heterogeneity which lead to a greater effect of ‘borrowing of strength’ across the studies and the outcomes. Sensitivity analysis in relation to the choice of the prior distribution placed on the standard deviations (ψ in the meta-regression and the model by Daniels and Hughes, and ψ₁ and ψ₂ in the BRMA PNF) was carried out as described in Section 3.5. The sensitivity analyses using prior distributions I–IV gave very similar results as can be seen in Table 5. As mentioned above, predictions were sensitive to the parameters of the Wishart prior distribution.

The results suggest that predictions of true effects obtained from the FEMR (and potentially also the REMR) can be overly optimistic and artificially precise, with intervals that are likely not to contain the true value, because the between-study variability and the measurement error in the treatment effect on the surrogate endpoint (relapse rate ratio in this case) are underestimated. However, the success of the prediction may also be affected by the strong assumptions about the distribution of the data made in the models, such as the exchangeability assumption in BRMA PNF. To investigate this further, a simulation study was conducted, which is presented in Section 5.

4.1.1 Discussion of the results for RRMS

Based on our results we cannot conclude that relapse rate is a good surrogate for disability progression as the prediction did not give good results for all of the studies (it failed for the study by Durelli using all methods apart from the BRMA with Wishart prior (A) which largely inflated the variance of predictions). The study by Durelli differs from the rest of the set in that the effect on the disability progression is much larger than the effect on the relapse rate, with the ratio of the relative effects on those outcomes (the effect on progression to the effect on relapse) equal to 0.6. In most of the remaining studies, this ratio is usually higher than 1.0 (it ranges between 0.94 and 2.16) owing to the fact that disability progression is a longer term outcome and the effect measured on this outcome at the same follow-up time as the effect on the relapse rate will be less due to relatively few events occurring for this outcome on this time scale. The only other study with that ratio below one was the study by Millefiorini, with the ratio of 0.56. The cross-validation did not fail for this study likely because it is a small study with estimates of the treatment effects on both outcomes having large variances (included in the predicted intervals for the cross-validation).

In the Millefiorini study, the patients were relatively young compared to the other studies with a relatively high baseline disability score which can explain the extreme treatment effect on disability of the mitoxantrone relative to the effect of the placebo. The baseline relapse rate was more representative of other studies and hence the effect on this outcome was less extreme (albeit still substantial). There does not seem to be anything, however, in the population of the study by Durelli that would explain the opposite relationship in the magnitude of the effects on the two outcomes. The patients were slightly older compared to other studies and the average baseline disability score was relatively low. This may suggest that the treatment effect on annualised relapse rate may not be a perfect predictor of the effect on the disability progression rate. However, the predictions overwhelmingly worked for the remaining studies which would encourage further research. Note that the effect on the final outcome in the data set investigated here is measured at the same time point as the effect on the surrogate endpoint. Since the disability progression is considered a long-term endpoint, when measured early it is measured with a relatively large uncertainty due to low number of events. Further research is required to establish whether the relapse rate is a good surrogate endpoint and in particular an early marker of disability progression. Such further research should include disability progression reported later compared to relapse rate, but potentially also consider both outcomes on alternative scales such as the hazard ratio for the time to disability progression. Sormani et al. already point out the limitations of using the summary data alone to evaluate the surrogate outcomes. To properly establish the surrogacy, outcomes on an individual level need to be investigated ideally based on data from all of the clinical trials.

4.2 Results from Bayesian models: gastric cancer

As in the case of RRMS, in the first instance pooled effects were obtained using the historical trials data set to compare the models. The data were then used to perform the cross-validation of the surrogate endpoints. ‘Oba data’ also included another group of studies, the validation trials, which were then used for external validation. Pooled effects obtained from all of the models are shown in Table 6 for comparison. As noted in the previous section on RRMS, only the two forms of BRMA allowed for the estimation of the pooled treatment effects on both outcomes. The pooled effect measured by the surrogate endpoint, DFS, had higher uncertainty in BRMA Wishart (A) model compared to BRMA PNF model. The point estimate of the pooled treatment effect measured by the target endpoint, OS, was similar for all models. Moreover, all models gave estimates with similar precisions except for the BRMA model with inverse Wishart (A) prior which resulted in estimates with a remarkably higher uncertainty.

Table 6.

Summary results for treatment effect on overall survival and disease-free survival.

	Disease-free survival			Overall survival
Model	Mean	95% CrI	$τ_{1} (sd)$ ^a	Mean	95% CrI	ψ₂ (sd)	τ₂ (sd)
REMR				0.81^b	[0.73; 0.90]	0.05 (0.04)
D&H^c				0.82^b	[0.74; 0.91]	0.05 (0.04)
BRMA	0.84	[0.67; 1.02]	0.35 (0.08)	0.80	[0.64; 0.98]		0.35 (0.07)
BRMA PNF	0.87	[0.79; 0.95]	0.05 (0.04)	0.84	[0.76; 0.91]	0.04 (0.04)	0.05 (0.05)

^a $ψ_{1} = τ_{1}$ in BRMA PNF.

^b Obtained by centring the effects on surrogate endpoint on the mean. ^c D&H refers to the model by Daniels & Hughes.

When applying the four models to cross-validation, the effect on OS in each historical study in turn was assumed unknown, so that the study became a validation study, and was then predicted from the effect on DFS by each model. The predicted effects on OS with corresponding intervals obtained for each historical study from each model are presented in Table 7 along with the predictions obtained for the validation studies. For one study (B-CLASSIC), the predicted effects on OS obtained from both meta-regression models were statistically significant while the observed effect was only borderline significant (predictions marked in bold font). One possible explanation is that the effect on DFS is likely to be measured with higher precision, because more events are observed for this outcome than for OS, so that higher precision can be expected when predicting the treatment effect on OS from the effect on DFS. However, this occurred only when using meta-regression and not with the other methods, and hence was more likely due to uncertainty being underestimated because the measurement error in the treatment effect on DFS was not included when making the predictions. As in the RRMS example, most intervals obtained from BRMA with Wishart prior A were largely inflated.

Table 7.

Predictions obtained from all models for all studies in the ‘Oba data’.

Overall survival, mean (95% CrI)
	Historical trials
	FFCD-8801	NSAS-GC	JCOG-9206-1	JCOG-8801	SWOG-7804
Observed	0.84 (0.62, 1.14)	0.51 (0.29, 0.90)	0.60 (0.31, 1.18)	0.82 (0.54, 1.27)	0.93 (0.70, 1.24)
Meta-regression	0.87 (0.63, 1.19)	0.50 (0.25, 1.01)	0.65 (0.32, 1.30)	0.82 (0.53, 1.27)	0.91 (0.67, 1.24)
Meta-regression 2	0.86 (0.61, 1.23)	0.50 (0.24, 1.03)	0.64 (0.31, 1.31)	0.82 (0.52, 1.30)	0.91 (0.65, 1.28)
Daniels & Hughes	0.86 (0.55, 1.33)	0.62 (0.30, 1.31)	0.73 (0.32, 1.67)	0.85 (0.48, 1.51)	0.90 (0.60, 1.33)
BRMA (Wishart)	0.90 (0.45, 1.80)	0.84 (0.32, 2.17)	0.82 (0.31, 2.16)	0.72 (0.30, 1.74)	0.84 (0.39, 1.82)
BRMA (PNF)	0.87 (0.61, 1.25)	0.87 (0.48, 1.57)	0.87 (0.43, 1.72)	0.88 (0.56, 1.38)	0.86 (0.60, 1.21)
	EORTC-40813	Tsavaris	ICCG-1/81	ITMO	GITSG-8174
Observed	0.85 (0.64, 1.14)	0.55 (0.33, 0.89)	0.85 (0.64, 1.13)	0.98 (0.70, 1.37)	0.74 (0.53, 1.04)
Meta-regression	0.78 (0.57, 1.06)	0.58 (0.32, 1.03)	0.91 (0.67, 1.24)	0.93 (0.65, 1.33)	0.76 (0.53, 1.09)
Meta-regression 2	0.78 (0.56, 1.10)	0.57 (0.31, 1.05)	0.91 (0.65, 1.28)	0.93 (0.64, 1.36)	0.76 (0.52, 1.12)
Daniels & Hughes	0.79 (0.52, 1.19)	0.67 (0.35, 1.32)	0.91 (0.59, 1.40)	0.92 (0.59, 1.44)	0.78 (0.49, 1.25)
BRMA (Wishart)	0.81 (0.41, 1.63)	0.81 (0.33, 1.97)	0.88 (0.39, 1.97)	0.87 (0.35, 2.16)	0.83 (0.38, 1.80)
BRMA (PNF)	0.86 (0.62, 1.21)	0.87 (0.51, 1.46)	0.87 (0.62, 1.22)	0.87 (0.61, 1.24)	0.87 (0.60, 1.27)
	NCTTG-794151	ECCOG-EST3275	EORTC-40905	ICCG
Observed	1.02 (0.69, 1.51)	0.94 (0.68, 1.30)	0.93 (0.64, 1.37)	1.05 (0.74, 1.49)
Meta-regression	0.99 (0.65, 1.49)	0.92 (0.66, 1.30)	0.91 (0.62, 1.36)	1.11 (0.74, 1.66)
Meta-regression 2	0.99 (0.64, 1.53)	0.93 (0.64, 1.34)	0.91 (0.60, 1.39)	1.11 (0.72, 1.71)
Daniels & Hughes	0.95 (0.55, 1.64)	0.91 (0.57, 1.44)	0.89 (0.53, 1.50)	0.99 (0.62, 1.59)
BRMA (Wishart)	0.87 (0.38, 2.02)	0.80 (0.38, 1.70)	0.92 (0.42, 2.01)	0.89 (0.42, 1.92)
BRMA (PNF)	0.86 (0.56, 1.32)	0.86 (0.59, 1.24)	0.87 (0.56, 1.33)	0.86 (0.59, 1.25)

Validation trials
	A-cirera	B-CLASSIC	E-GOIM-9602	F-GOIRC
Observed	0.60 (0.39, 0.93)	0.72 (0.52, 1.00)	0.91 (0.69, 1.21)	0.90 (0.64, 1.26)
Meta-regression	0.57 (0.34, 0.94)	0.58 (0.38, 0.88)	0.92 (0.68, 1.23)	0.96 (0.67, 1.37)
Meta-regression 2	0.57 (0.34, 0.96)	0.58 (0.37, 0.89)	0.92 (0.66, 1.27)	0.96 (0.65, 1.40)
Daniels & Hughes	0.62 (0.33, 1.16)	0.62 (0.38, 1.02)	0.90 (0.61, 1.32)	0.93 (0.59, 1.48)
BRMA (Wishart)	0.79 (0.32, 1.94)	0.70 (0.32, 1.55)	0.84 (0.41, 1.73)	0.80 (0.34, 1.84)
BRMA (PNF)	0.86 (0.54, 1.36)	0.80 (0.53, 1.20)	0.87 (0.63, 1.20)	0.87 (0.60, 1.26)

Discrepancies between observed and predicted estimates of the treatment effect on OS, summarised by the absolute difference and the ratio of the width of the predicted interval $w_{{\hat{Y}}_{2 j}}$ to the width of the interval corresponding to the observed estimate $w_{Y_{2 j}}$, are presented in Table 8 (column three and second to last, respectively). The absolute discrepancies were highest when using bivariate meta-analysis (both PNF and Wishart), which may suggest that the exchangeability assumption about the true treatment effects was too strong for these data. As expected, the predicted intervals of ${\hat{Y}}_{2 j}$ are inflated (compared to the intervals of Y_2j) due to the between-study variability in addition to the sampling variance. Intervals from the model by Daniels and Hughes were wider compared to those obtained from the REMR, likely due to the measurement error around the treatment effect on the surrogate endpoint (DFS in this case) taken into account in this model. This is also seen in the ratios of the widths of the predicted intervals of the true effects obtained from each model $w_{{\hat{μ}}_{2 j}^{CM}}$ to the width of the predicted interval $w_{{\hat{μ}}_{2 j}^{FEMR}}$ obtained from the FEMR (last column in Table 8), which suggests that predictive intervals obtained from the FEMR may be underestimated due to the ignored uncertainty. This is further investigated by a simulation study in Section 5. The results are in agreement with those obtained for the RRMS example in Section 4.1. However, unlike in the example in RRMS, the predicted intervals obtained from BRMA PNF are narrower compared to those obtained from the model by Daniels and Hughes. The inclusion of measurement error around the treatment effect on the surrogate endpoint is balanced by the ‘borrowing of strength’ across studies by the exchangeability assumption which in this case is likely to cause ‘overshrinkage’, as discussed in Section 3.6. This is consistent with the absolute discrepancies being larger when using the BRMA models compared to, for example, the model by Daniels and Hughes which does not make the assumption of exchangeability. As already noted in Section 4.1, this issue is explored by the simulation in Section 5. The BRMA with inverse Wishart prior distribution gave much inflated intervals for Wishart A, but not for Wishart B prior distribution, which confirms the sensitivity of the results to the parameters of the Wishart distribution as already observed in the RRMS example. Sensitivity analyses in relation to the choice of the prior distribution placed on the standard deviations (ψ in the meta-regression and the model by Daniels and Hughes, and ψ₁ and ψ₂ in the BRMA PNF) were carried out as described in Section 3.5. The sensitivity analyses using prior distributions I–IV gave very similar results as can be seen in Table 8.

Table 8.

Results of the comparison of the models for predicting the treatment effect on OS from the treatment effect on DFS.

		Absolute discrepancy	$w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$	$w_{{\hat{μ}}_{2 j}^{CM}} / w_{{\hat{μ}}_{2 j}^{FEMR}}$
Model	Prior	Median (range)	Median (range)	Median (range)
FEMR		0.03 (0.00, 0.09)	1.06 (1.03, 1.23)
REMR	I	0.03 (0.00, 0.08)	1.15 (1.07, 1.27)	1.59 (1.11, 1.76)
REMR	II	0.03 (0.00, 0.09)	1.15 (1.07, 1.27)	1.60 (1.10, 1.78)
REMR	III	0.03 (0.00, 0.09)	1.15 (1.07, 1.27)	1.61 (1.15, 1.77)
REMR	IV	0.03 (0.00, 0.09)	1.15 (1.07, 1.26)	1.59 (1.08, 1.73)
Daniels & Hughes	I	0.06 (0.02, 0.20)	1.38 (1.23, 1.52)	2.70 (1.15, 3.89)
Daniels & Hughes	II	0.05 (0.03, 0.18)	1.39 (1.24, 1.48)	2.58 (1.38, 3.79)
Daniels & Hughes	III	0.05 (0.01, 0.17)	1.36 (1.28, 1.43)	2.68 (1.15, 3.96)
Daniels & Hughes	IV	0.06 (0.01, 0.21)	1.37 (1.19, 1.46)	2.64 (1.25, 3.13)
BRMA PNF	I	0.11 (0.01, 0.53)	1.10 (1.03, 1.22)	1.46 (0.47, 1.95)
BRMA PNF	II	0.11 (0.01, 0.53)	1.11 (1.03, 1.18)	1.57 (0.43, 1.83)
BRMA PNF	III	0.11 (0.01, 0.52)	1.14 (1.05, 1.24)	1.75 (0.51, 2.07)
BRMA PNF	IV	0.10 (0.01, 0.53)	1.10 (1.03, 1.18)	1.48 (0.47, 1.81)
BRMA	Wishart A	0.12 (0.01, 0.49)	2.24 (1.44, 2.83)	5.97 (2.17, 8.44)
BRMA	Wishart B	0.11 (0.01, 0.49)	1.37 (1.11, 1.55)	2.85 (0.89, 3.60)

4.2.1 Discussion of the results for gastric cancer

The cross-validation of the predictions of the treatment effect on OS from the effect on DFS confirmed the results of Oba et al., who recommended DFS as a good surrogate endpoint for OS in patients with curable gastric cancer. One of the limitations of this case study was the absence of any delay between the measurement of the effect on the surrogate endpoint and the final outcome. Ideally, one would be interested in establishing whether DFS measured early could be used to predict long-term OS in new trials. Sensitivity analysis conducted by Oba et al. was inconclusive as to whether the treatment effect on DFS measured as early as at two years of follow-up can be a good predictor of the treatment effect on OS estimated with five years of follow-up.¹⁰

4.3 Results of sensitivity analysis with t-distribution

As discussed in Section 3.6, sensitivity analysis was carried out to investigate the effect of the distributional assumptions by using the t-distribution on the random effect. Tables 9 and 10 show results of applying the PTDF model to the ‘Sormani data’ for the example in RRMS. Sensitivity analyses were carried out by varying the degrees of freedom parameter using values 4, 15 and 30. The results are presented alongside those obtained from BRMA PNF with comparable prior distributions (the same prior distributions as for PTDF in Section 3.6). The models with the t-distribution gave very similar results across all values of the degrees of freedom parameter and also when compared to the results obtained from BRMA PNF. The only noticeable, but still very small, difference was for the model with df = 4, where the uncertainty around the pooled effect on relapse rate was slightly higher and the estimate of the heterogeneity parameter for the effect on this endpoint was also higher and more uncertain (results in Table 9). All models gave very similar discrepancies in terms of the absolute difference, the ratios of the widths of the intervals comparing predicted and observed effects, $w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$, and the widths of the intervals of the predicted true effects from PTDF models compared to the predicted intervals from BRMA PNF, $w_{{\hat{μ}}_{2 j}^{PTDF}} / w_{{\hat{μ}}_{2 j}^{PNF}}$, as shown in Table 10. Consistent with the results in Table 9, the intervals obtained from the PTDF model with df = 4 were slightly wider than those obtained from BRMA PNF and PTDF with df = 15 or 30.
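To illustrate why the degrees of freedom matter most at df = 4, the following minimal sketch estimates by simulation how much probability a standard t-distribution places beyond three units from its centre for the three values used in the sensitivity analysis. It is not part of the analysis reported here: the threshold of 3, the number of draws, the seed and the function names are assumptions made for illustration.

```python
import math
import random


def draw_t(df, rng):
    """Draw one Student-t variate as Z / sqrt(V / df), with V ~ chi-square(df)."""
    z = rng.gauss(0.0, 1.0)
    # chi-square(df) is a gamma distribution with shape df/2 and scale 2
    v = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(v / df)


def tail_fraction(df, threshold=3.0, n_draws=50_000, seed=1):
    """Monte Carlo estimate of P(|T| > threshold) for a t-distribution with df degrees of freedom."""
    rng = random.Random(seed)
    hits = sum(abs(draw_t(df, rng)) > threshold for _ in range(n_draws))
    return hits / n_draws


# Degrees of freedom used in the sensitivity analysis
tails = {df: tail_fraction(df) for df in (4, 15, 30)}
for df, frac in tails.items():
    print(f"df = {df:2d}: estimated P(|T| > 3) = {frac:.4f}")
```

The estimated tail mass falls as the degrees of freedom increase, which is in line with the models with df = 15 and 30 behaving almost identically to the normal random-effects model.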

Table 9.

Summary results for placebo-controlled studies for the treatment effects on the risk of disability progression and the relapse rate ratio in RRMS, using models with t-distributions and BRMA PNF for comparison.

	Relapse incidence rate ratio			Disability relative risk
Model	Mean (SD)	95% CrI	ψ₁	Mean (SD)	95% CrI	ψ₂
BRMA PNF	0.57 (0.06)	[0.46; 0.70]	0.37 (0.09)	0.75 (0.05)	[0.67; 0.86]	0.07 (0.06)
BRMA PTDF (4 df)	0.58 (0.07)	[0.46; 0.72]	0.47 (0.14)	0.75 (0.05)	[0.66; 0.85]	0.08 (0.07)
BRMA PTDF (15 df)	0.57 (0.06)	[0.45; 0.71]	0.39 (0.10)	0.75 (0.05)	[0.66; 0.85]	0.08 (0.06)
BRMA PTDF (30 df)	0.57 (0.06)	[0.45; 0.71]	0.38 (0.10)	0.75 (0.05)	[0.67; 0.85]	0.07 (0.06)

Table 10.

Results of the comparison of the models for predicting the treatment effect on the risk of disability progression from the treatment effect on relapse rate in RRMS, using models with t-distributions and BRMA PNF for comparison.

	Absolute discrepancy	$w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$	$w_{{\hat{μ}}_{2 j}^{PTDF}} / w_{{\hat{μ}}_{2 j}^{PNF}}$
Model	Median (range)	Median (range)	Median (range)
BRMA PNF	0.16 (0.01, 1.22)	1.10 (1.02, 1.58)
BRMA PTDF (4 df)	0.16 (0.01, 1.22)	1.12 (1.02, 1.64)	1.04 (0.97, 1.15)
BRMA PTDF (15 df)	0.16 (0.01, 1.21)	1.10 (1.02, 1.57)	1.01 (0.96, 1.06)
BRMA PTDF (30 df)	0.16 (0.00, 1.22)	1.11 (1.02, 1.55)	1.01 (0.97, 1.08)

As can be seen in Tables 11 and 12, the results from the models applied to the ‘Oba data’ for the example in gastric cancer were also very similar across the range of values of the degrees of freedom. The median interval ratio comparing the predicted to the observed effects was highest for df = 4, but still comparable with the results for the other values of the degrees of freedom and those from BRMA PNF. Predicted intervals of the true effects from the PTDF model with df = 4 were wider than those obtained from BRMA PNF, with the median ratio of the widths $w_{{\hat{μ}}_{2 j}^{PTDF}} / w_{{\hat{μ}}_{2 j}^{PNF}} = 1.06$, but less so when df = 15 or 30, as expected. All predictions for both data sets are included in Tables A 2.1 and A 2.2 in Appendix 2. The results were similar to those obtained from the BRMA PNF model, leading to the same conclusions.

Table 11.

Summary results for treatment effects on overall survival and disease-free survival in gastric cancer, using models with t-distributions and BRMA PNF for comparison.

	Disease-free survival			Overall survival
Model	Mean (SD)	95% CrI	ψ₁	Mean (SD)	95% CrI	ψ₂
BRMA PNF	0.83 (0.04)	[0.76; 0.92]	0.03 (0.04)	0.87 (0.04)	[0.79; 0.95]	0.05 (0.04)
BRMA PTDF (4 df)	0.83 (0.04)	[0.76; 0.91]	0.03 (0.05)	0.87 (0.04)	[0.79; 0.94]	0.05 (0.05)
BRMA PTDF (15 df)	0.83 (0.04)	[0.76; 0.90]	0.03 (0.05)	0.86 (0.04)	[0.79; 0.94]	0.05 (0.04)
BRMA PTDF (30 df)	0.83 (0.04)	[0.76; 0.90]	0.03 (0.04)	0.86 (0.04)	[0.79; 0.94]	0.05 (0.04)

Table 12.

Results of the comparison of the models for predicting treatment effect on OS from treatment effect on DFS, using models with t-distributions and BRMA PNF for comparison.

	Absolute discrepancy	$w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$	$w_{{\hat{μ}}_{2 j}^{PTDF}} / w_{{\hat{μ}}_{2 j}^{PNF}}$
Model	Median (range)	Median (range)	Median (range)
BRMA PNF	0.11 (0.02, 0.52)	1.18 (1.05, 1.27)
BRMA PTDF (4 df)	0.11 (0.02, 0.52)	1.21 (1.06, 1.34)	1.06 (0.98, 1.19)
BRMA PTDF (15 df)	0.11 (0.01, 0.52)	1.17 (1.04, 1.27)	1.00 (0.93, 1.10)
BRMA PTDF (30 df)	0.11 (0.01, 0.52)	1.17 (1.05, 1.29)	1.01 (0.92, 1.08)

4.4 Results from the frequentist models

Table 13 shows the discrepancies between the predicted and observed values of the effect on the final outcome (in terms of the median absolute difference between the estimates and the median ratio of the width of the 95% predicted interval to the width of the 95% confidence interval corresponding to the observed effect) for the ‘Sormani data’ and the ‘Oba data’. The absolute discrepancies are comparable with those obtained from the Bayesian models. The effect of the model choice on the uncertainty of predictions is represented by the ratios $w_{{\hat{μ}}_{2 j}^{BRMA}} / w_{{\hat{μ}}_{2 j}^{FEMR}}$ of the width of the predicted intervals for the true effects obtained from the BRMA model to the interval obtained from the FEMR. The differences in the width of the predicted intervals between the models are consistent with the conclusions from the Bayesian analysis; the predictive interval is inflated when using BRMA (with the median ratio $w_{{\hat{μ}}_{2 j}^{BRMA}} / w_{{\hat{μ}}_{2 j}^{FEMR}} = 1.69$ in the RRMS example and $w_{{\hat{μ}}_{2 j}^{BRMA}} / w_{{\hat{μ}}_{2 j}^{FEMR}} = 1.41$ for gastric cancer data) which allows the inclusion of the uncertainty on the effects on both outcomes alongside all other parameters.
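The summaries reported in the comparison tables (a median and range of absolute discrepancies and of interval-width ratios across validation studies) can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function names and all input numbers are hypothetical, and the scale on which the discrepancies are computed is assumed.

```python
import statistics


def summarise(values):
    """Return the median and the (min, max) range of a list of values."""
    return statistics.median(values), (min(values), max(values))


def absolute_discrepancy(predicted, observed):
    """Absolute difference between predicted and observed effects, study by study."""
    return [abs(p - o) for p, o in zip(predicted, observed)]


def width_ratio(intervals_a, intervals_b):
    """Ratio of interval widths (a over b), study by study."""
    return [(ua - la) / (ub - lb)
            for (la, ua), (lb, ub) in zip(intervals_a, intervals_b)]


# Hypothetical cross-validation output for three studies (illustration only)
observed = [-0.30, -0.10, -0.45]
predicted = [-0.25, -0.20, -0.40]
observed_ci = [(-0.60, 0.00), (-0.50, 0.30), (-0.85, -0.05)]
predicted_int = [(-0.58, 0.08), (-0.64, 0.24), (-0.82, 0.02)]

disc_median, disc_range = summarise(absolute_discrepancy(predicted, observed))
ratio_median, ratio_range = summarise(width_ratio(predicted_int, observed_ci))
print(f"absolute discrepancy: median {disc_median:.2f}, range {disc_range}")
print(f"width ratio: median {ratio_median:.2f}, range {ratio_range}")
```

The same `width_ratio` helper applies to the comparison of two models by passing the predicted intervals of the true effects from each model instead of the predicted and observed intervals.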

Table 13.

Results of the comparison of the frequentist models for predicting the treatment effect on disability progression from treatment effect on relapse in RRMS and the treatment effect on OS from the treatment effect on DFS in gastric cancer.

	Absolute discrepancy	$w_{{\hat{Y}}_{2 j}} / w_{Y_{2 j}}$	$w_{{\hat{μ}}_{2 j}^{BRMA}} / w_{{\hat{μ}}_{2 j}^{FEMR}}$
Model	Median (range)	Median (range)	Median (range)
RRMS
FEMR	0.16 (0.01, 1.16)	1.02 (1.00, 1.21)
BRMA	0.16 (0.00, 1.24)	1.06 (1.06, 1.12)	1.69 (0.52, 4.90)
Gastric cancer
FEMR	0.04 (0.00, 0.09)	1.08 (1.03, 1.25)
BRMA	0.10 (0.02, 0.52)	1.10 (1.01, 1.15)	1.41 (0.20, 1.71)

Tables A 3.1 and A 3.2 in Appendix 3 list predicted estimates on the final outcome (disability progression in RRMS and OS in gastric cancer). When using meta-regression, the predictions were obtained with narrower intervals than those obtained from BRMA. As in the Bayesian analysis, the predicted interval for one study (B-CLASSIC) in the example in gastric cancer indicated a significant effect (numbers in bold) when using FEMR (but not BRMA), while the observed effect was only borderline significant. Note that in the frequentist analysis, the within-study correlation is fixed (rather than being given a prior distribution, as in the Bayesian analysis). The results in Tables 13, A 3.1 and A 3.2 were obtained from models with $ρ_{wi} = 0.5$. Sensitivity analysis using correlations $ρ_{wi} = 0, 0.25, 0.75$ gave very similar results.

5 Simulation

The models considered in this paper allow for different levels of uncertainty on the parameters and make distributional assumptions of different strength, both of which can affect the accuracy of predictions. The models by Daniels and Hughes and the BRMA PNF seemed to predict the treatment effect on the target outcome equally well, giving conservative predictions (in comparison with meta-regression) because uncertainty around all the model parameters is taken into account, but without overly inflated intervals. The two models, however, differ in the strength of their distributional assumptions. Considering, for example, a scenario where a new study measures a treatment effect much larger than the effects observed in the historical studies (training set), the assumption in the BRMA PNF (that the true effects measured by both outcomes come from a common distribution) may be too strong. Sensitivity to this assumption, along with the performance of all the models, is tested here by simulation.

5.1 Methods

To carry out the simulation, data were simulated for both the validation studies and the ‘training set’ to ensure control over the distributional assumptions of the data (the ‘Sormani data’ did not satisfy the assumption of normality well). The validation data and the training set data were simulated using the BRMA PNF model (8) and (9) in a number of scenarios where the mean of the effect in the validation set is shifted by δ relative to the mean of the training set:

$$\begin{pmatrix} Y_{1 i} \\ Y_{2 i} \end{pmatrix} \sim MVN \left( \begin{pmatrix} μ_{1 i} \\ μ_{2 i} \end{pmatrix}, Σ_{i} \right), \quad Σ_{i} = \begin{pmatrix} σ_{1 i}^{2} & σ_{1 i} σ_{2 i} ρ_{wi} \\ σ_{1 i} σ_{2 i} ρ_{wi} & σ_{2 i}^{2} \end{pmatrix} \tag{16}$$

$$\begin{cases} μ_{1 i} \sim N (η_{1} + δ, ψ_{1}^{2}) \\ μ_{2 i} \mid μ_{1 i} \sim N (η_{2 i}, ψ_{2}^{2}) \\ η_{2 i} = λ_{0} + λ_{1} μ_{1 i} \end{cases} \tag{17}$$

using a range of values of δ: 0, ψ₁, 2ψ₁, 3ψ₁ and 5ψ₁. The higher the δ, the more the ‘new study’ differs from the training set. Parameters for the simulation were obtained by fitting the model to the ‘Sormani data’, which gave $ψ_{1} = 0.36$, $ψ_{2} = 0.15$, $η_{1} = -0.5253$, $λ_{0} = 0.01$ and $λ_{1} = 0.4793$. The within-study correlations $ρ_{wi}$ were sampled from a uniform distribution with limits obtained from the confidence interval of the mean of estimated within-study correlations, $ρ_{wi} \sim U (-0.11, 0.186)$.

The within-study variances were generated by sampling the corresponding precisions (inverse variances) from the gamma distribution: $σ_{1 i}^{2} = 1 / P_{1 i}$ and $σ_{2 i}^{2} = 1 / P_{2 i}$, with $P_{1 i} \sim Γ (α_{1}, θ_{1})$ and $P_{2 i} \sim Γ (α_{2}, θ_{2})$, where α₁ and α₂ are the shape parameters and θ₁ and θ₂ the scale parameters. These were obtained using the method of moments, $E (P_{1, 2}) = α_{1, 2} / ξ_{1, 2}$ and $V (P_{1, 2}) = α_{1, 2} / ξ_{1, 2}^{2}$, where $ξ_{1, 2} = 1 / θ_{1, 2}$ is a rate parameter. By summarising the inverse variances from the ‘Sormani data’, the following parameters were obtained: $E (P_{1}) = 112.6$, $E (P_{2}) = 32.2$, $V (P_{1}) = 11172.49$ and $V (P_{2}) = 1062.76$, giving the following shape and rate parameters: $α_{1} = 1.13$, $ξ_{1} = 0.01$, $α_{2} = 0.97$ and $ξ_{2} = 0.03$. Because of the structure of the gamma distribution, some of the simulated precisions were very close to zero, resulting in very large variances. This led to some problems with the estimation. To overcome this issue, a constraint was placed on the simulated value of the precision by discarding the precisions resulting in variances larger than 3 (this number was taken as an arbitrary cut-off, large enough to be much larger than the variances in the ‘Sormani data’ and hence to include all plausible variances in the population, but small enough not to produce problems with the estimation). The number of participants in each study was drawn from a uniform distribution with limits 25 and 100 (giving sample sizes of the studies comparable to those in the ‘Sormani data’).
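The data-generating steps described above can be sketched in Python as follows. This is not the R code of Appendix 1.7 but an illustrative reconstruction under stated assumptions: the function names, the seed and the number of training studies (20) are hypothetical, the precisions are treated as inverse variances, and the rejection step redraws any precision whose implied variance exceeds 3.

```python
import math
import random


def gamma_shape_rate(mean, variance):
    """Method of moments for a gamma distribution: mean = a / r, variance = a / r**2."""
    rate = mean / variance
    shape = mean * rate
    return shape, rate


def draw_variance(shape, rate, rng, max_variance=3.0):
    """Sample a precision from Gamma(shape, scale = 1 / rate); redraw while the variance exceeds the cut-off."""
    while True:
        precision = rng.gammavariate(shape, 1.0 / rate)
        if precision > 0 and 1.0 / precision <= max_variance:
            return 1.0 / precision


def simulate_study(delta, rng, eta1=-0.5253, psi1=0.36, psi2=0.15,
                   lambda0=0.01, lambda1=0.4793,
                   gamma1=(1.13, 0.01), gamma2=(0.97, 0.03)):
    """Simulate one study: true effects from (17), observed effects from (16)."""
    # Between-study level: true effects, with the surrogate mean shifted by delta
    mu1 = rng.gauss(eta1 + delta, psi1)
    mu2 = rng.gauss(lambda0 + lambda1 * mu1, psi2)
    # Within-study level: correlation and variances
    rho = rng.uniform(-0.11, 0.186)
    var1 = draw_variance(gamma1[0], gamma1[1], rng)
    var2 = draw_variance(gamma2[0], gamma2[1], rng)
    # Correlated bivariate normal draw built from two independent standard normals
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y1 = mu1 + math.sqrt(var1) * z1
    y2 = mu2 + math.sqrt(var2) * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return {"y1": y1, "y2": y2, "mu1": mu1, "mu2": mu2,
            "var1": var1, "var2": var2, "rho": rho}


# Reproduce the method-of-moments step from the reported means and variances
shape1, rate1 = gamma_shape_rate(112.6, 11172.49)
shape2, rate2 = gamma_shape_rate(32.2, 1062.76)

rng = random.Random(2024)                                   # arbitrary seed
training = [simulate_study(0.0, rng) for _ in range(20)]    # 20 studies is an assumption
validation = simulate_study(2 * 0.36, rng)                  # scenario with delta = 2 * psi1
```

In a full replication, the value `validation["y2"]` would be withheld from the model and `validation["mu2"]` kept as the target against which the predicted interval is checked.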

Each model was fitted by adding a validation study to the training set (one at a time) assuming the effect on the target outcome (disability progression) unknown (coded as NA), which was then predicted by each model from the effect on the relapse rate given for this study. The predicted true effect ${\hat{μ}}_{2}$ was compared with the simulated ‘observed’ true effect μ₂ by checking if the credible interval of the predicted effect on the target outcome contained the observed mean effect. The whole process was repeated 1000 times and the percentage of predicted outcomes whose credible intervals covered the observed value was reported as the average performance of the credible interval of the model. The R code used to simulate the data is included in Appendix 1.7.
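The performance measure described above is the percentage of replications in which the interval for the predicted effect contains the simulated true effect. A minimal sketch of that calculation, with a hypothetical function name and made-up intervals in place of model output, is:

```python
def coverage_percent(intervals, true_values):
    """Percentage of (lower, upper) intervals that contain the corresponding true value."""
    covered = sum(lower <= truth <= upper
                  for (lower, upper), truth in zip(intervals, true_values))
    return 100.0 * covered / len(true_values)


# Hypothetical predicted intervals and simulated true effects for four replications
predicted_intervals = [(-0.6, 0.1), (-0.4, 0.2), (-0.9, -0.3), (-0.2, 0.4)]
true_effects = [-0.2, 0.3, -0.5, 0.0]

print(coverage_percent(predicted_intervals, true_effects))  # 3 of 4 covered -> 75.0
```

In the simulation, the two lists would each hold 1000 entries per model and per value of δ.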

5.2 Results

Table 14 lists the average performance of the predicted credible interval for each model and for the range of values of δ. Moving the ‘new study’ (validation study) away from the ‘training set’ (by increasing δ) resulted in reduced performance of the BRMA PNF, while the model by Daniels and Hughes performed better (owing to the lack of the strong distributional assumption of exchangeability of the true effects made in the BRMA PNF). Performance of BRMA PTDF remained unchanged because the t-distribution is better at modelling extreme effects, as noted in Section 3.6.

Table 14.

Comparison of the performance of the models in terms of the coverage of the predictive interval.

	Average performance of credible interval
Model	δ = 0	$δ = 1 ψ_{1}$	$δ = 2 ψ_{1}$	$δ = 3 ψ_{1}$	$δ = 5 ψ_{1}$
FEMR	39%	41%	49%	56%	60%
REMR	95%	93%	93%	92%	90%
Daniels & Hughes	95%	94%	95%	94%	93%
BRMA (Wishart)	97%	96%	96%	94%	90%
BRMA (PNF)	96%	95%	93%	91%	85%
BRMA PTDF (4 df)	96%	95%	96%	95%	95%

The BRMA model with the Wishart prior distribution showed slightly excessive coverage for δ = 0 (97% in Table 14), which was related to the overly inflated predictive intervals. FEMR performed worst because ignoring the estimation error of the treatment effect on the surrogate endpoint artificially reduces the uncertainty. For FEMR, the performance appears to improve as the validation set moves away from the training set; this is due to the predicted interval widening further away from the data, as in linear regression.
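The linear regression analogy can be made explicit with the textbook result for unweighted simple linear regression. This is an illustration only: the FEMR model is a weighted meta-regression, and the symbols $n$, $x_{0}$, $\bar{x}$, $S_{xx}$ and $σ^{2}$ are introduced here and do not appear elsewhere in the paper. For a line fitted to $n$ points with residual variance $σ^{2}$, the variance of the fitted mean at a new covariate value $x_{0}$ is

$$\mathrm{Var} \left( \hat{y} (x_{0}) \right) = σ^{2} \left[ \frac{1}{n} + \frac{(x_{0} - \bar{x})^{2}}{S_{xx}} \right], \qquad S_{xx} = \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2},$$

so the interval around the fitted value is narrowest at $x_{0} = \bar{x}$ and widens as $x_{0}$ moves away from the centre of the data, which would make coverage of a distant true value more likely even when the interval is too narrow near the centre.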

6 Discussion

When investigating endpoints as candidate surrogate outcomes, a careful choice of the meta-analytical approach has to be made. The level of uncertainty taken into account by the model can affect the precision of the predictions of the true effect on the final outcome ${\hat{μ}}_{2 j}$ from the effect on the surrogate endpoint. Models underestimating uncertainty, such as FEMR, can lead to overly precise predictions of the treatment effect on the final outcome in a new study. Reduced uncertainty around the predicted treatment effect on a target endpoint may give the illusion that this is a desirable consequence of the larger number of events measured on the shorter term surrogate endpoint, whilst in fact it may be due to ignoring uncertainty and, in the case of some models, between-study variability. Models underestimating the uncertainty of available evidence may lead to over-optimistic predictions, which can then affect decisions based on such predictions, resulting for example in underpowered clinical trials or unrealistic cost-effectiveness outcomes.

In the models by Daniels and Hughes and BRMAs, the treatment effect on the surrogate endpoint is treated as a response variable and its uncertainty is taken into account in the model in contrast to the meta-regression model where the effect on the surrogate was a fixed covariate. BRMA with the inverse Wishart prior distribution on the between-study covariance matrix seems an unreliable approach because it does not allow the analyst to easily control the prior distributions on the specific elements of the covariance matrix. Results obtained from the model are sensitive to the parameters of the Wishart distribution. For example, setting parameters of the Wishart distribution that lead to a desirable non-informative uniform distribution induced on the between-study correlation can give undesirably informative prior distributions for the between-study standard deviations, which depending on the parameters can lead to inflated intervals for pooled or predicted estimates. For the illustrative examples considered here, this led to the inflation of the uncertainty around the predicted target outcome when using the Wishart distribution with the identity matrix and degrees of freedom equal to three. The BRMA PNF and Daniels–Hughes models predict the target outcome better, but make different distributional assumptions that need to be considered when making a choice between these methods. While the Daniels–Hughes model makes less strong distributional assumptions and may perform better when the new study differs from the historical data in the meta-analysis data set, the BRMA PNF has an advantage over it by allowing the estimation of pooled effects for both outcomes when combining data reported on one or both of them, which can be desirable when the pooled effectiveness estimates are of interest as is often the case in HTA. 
In circumstances when the distributional assumptions are plausible in BRMA PNF, this model has an additional advantage of allowing the analyst to incorporate external information (based on external evidence or expert opinions) in the form of informative prior distributions with the potential to reduce uncertainty around the estimate of interest.^14,31
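The sensitivity to the Wishart parameters discussed above can be explored by sampling from the prior itself. The sketch below draws 2 × 2 covariance matrices whose inverse follows a Wishart distribution with an identity scale matrix and three degrees of freedom, and records the implied between-study standard deviation and correlation. It is an illustration using only the Python standard library, not the WinBUGS specification used in the analysis: the function names, seed and number of draws are assumptions, and the construction relies on the standard result that a Wishart matrix with identity scale and integer degrees of freedom is a sum of outer products of standard normal vectors.

```python
import math
import random
import statistics


def draw_inverse_wishart_2x2(rng, df=3):
    """Draw a 2x2 covariance matrix whose inverse is Wishart(identity, df), for integer df >= 2."""
    # Precision matrix: sum of df outer products of standard normal vectors
    w11 = w12 = w22 = 0.0
    for _ in range(df):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w11 += x * x
        w12 += x * y
        w22 += y * y
    det = w11 * w22 - w12 * w12
    # Invert the 2x2 precision matrix to obtain (var1, cov12, var2)
    return w22 / det, -w12 / det, w11 / det


def prior_draws(n_draws=20_000, df=3, seed=7):
    """Return prior draws of the first between-study SD and of the between-study correlation."""
    rng = random.Random(seed)
    sds, correlations = [], []
    for _ in range(n_draws):
        var1, cov12, var2 = draw_inverse_wishart_2x2(rng, df)
        sds.append(math.sqrt(var1))
        correlations.append(cov12 / math.sqrt(var1 * var2))
    return sds, correlations


sds, correlations = prior_draws()
print("median prior SD:", round(statistics.median(sds), 2))
print("mean |correlation|:", round(statistics.mean([abs(r) for r in correlations]), 2))
```

Comparing the prior draws of the standard deviation with the heterogeneity estimates in the results tables gives a direct view of how informative this prior is for the standard deviations, even though the induced prior on the correlation is flat.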

When using meta-analytic methods to predict the treatment effect on a target outcome of interest from the treatment effect measured by a surrogate endpoint, modelling assumptions need to be considered alongside the uncertainty, particularly around the surrogate endpoint. While Bayesian methods allow for a great flexibility in modelling uncertainty, the frequentist methods have also been used to account for the uncertainty around the surrogate endpoint by using an error-in-variables linear regression model,^9,10 which is an alternative for analysts with a preference for a frequentist approach. We have illustrated the importance of uncertainty by using frequentist methods of meta-regression and bivariate meta-analysis.

In this paper, to investigate the impact of uncertainty on predictions, we focused on a number of different parameterisations of normally distributed effects. The assumption of normality is not always reasonable and when it is not, alternative approaches need to be investigated. In our further work (to be published elsewhere) we investigate, for example, modelling of relapse rate using a Poisson distribution and the relative risk of disability progression by assuming that outcomes come from a binomial distribution. Meta-analytic methods using these types of outcomes have already been proposed, for example by Stijnen et al., who propose binomial-normal and Poisson-normal bivariate models (with binomial or Poisson distributions for the within-study variability).³² We have investigated the normality assumption on the random effect by sensitivity analysis where we replaced the normal distribution with the t-distribution. This approach has the limitation of only improving the modelling when there are more data in the tails (such as outlying observations) than a normal distribution would capture properly. If the distribution of the data is, for example, bimodal or skewed, other approaches can be investigated, such as a convolution of normal distributions³³ or a skewed t-distribution as proposed by Lee and Thompson.²⁶ The issue of non-normality of the random effect has been discussed by Higgins et al.,²⁵ who also review non-parametric alternatives to the meta-analytic methods that can be applied to non-normally distributed effects (such as non-parametric maximum likelihood procedures^34–37 and Bayesian semiparametric random-effects distributions based on Dirichlet process priors^38–40). However, as Higgins et al. discuss, although these methods can accommodate outliers, they are not suitable for making predictions because of the unusual shape of the discrete distributions. As such, they are unlikely to be suitable for the purpose of evaluating surrogate endpoints, where predictions are of crucial importance.

The methods discussed in this paper do not fully cover all aspects of the surrogate evaluation process. As already mentioned in Section 1, the individual level association between outcomes needs to be explored and to do so, individual patient data is required on a number (preferably all) of the studies included in the meta-analysis. Although this was beyond the scope of this paper, the availability of individual level data could help to model uncertainty. For example, individual data can be used to obtain the within-study correlation between the treatment effects. Daniels and Hughes have used individual level data from a subset of studies in their meta-analysis to obtain the correlation between the treatment effects by bootstrapping⁶ while Bujkiewicz et al. performed a double bootstrap analysis on individual level data from a single study to obtain the correlation between the treatment effects in the form of an empirical distribution.¹⁴ A range of methods for obtaining the within-study correlation from individual level data was explored by Riley et al. who used a joint linear regression for multiple continuous outcomes and bootstrapping methods for a range of other outcomes.⁴¹ The availability of individual level data can also be desirable when taking into account the information on covariates which in the aggregate form is subject to ecological bias. When investigating surrogacy, the inclusion of covariates could help explain some heterogeneity or explore the effect of baseline risk. Further research is required to explore the advantages of individual level data in modelling uncertainty and exploring the impact of covariates.

Footnotes

Acknowledgements

The authors thank Ian White for his comments on the earlier version of the manuscript and for sharing his expertise on Stata coding for extending the use of the mvmeta command in Stata. We also thank the two anonymous reviewers for their comments which helped to improve the quality of the manuscript.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Medical Research Council (grant no. MR/L009854/1 awarded to SB). KRA is supported by the UK National Institute for Health Research (grant no. NF-SI-0512-10159).

Appendix 1

Appendix 2 Predictions from sensitivity analysis using t-distribution

Appendix 3 Predictions from the frequentist models

References

1. Burzykowski, Molenberghs, Buyse. The evaluation of surrogate endpoints, New York, NY: Springer, 2005.

2. Prentice. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989; 8: 431–440.

3. Freedman, Graubard, Schatzkin. Statistical validation of intermediate endpoints for chronic diseases. Stat Med 1992; 11: 167–178.

4. Buyse, Molenberghs. Criteria for the validation of surrogate endpoints in randomized experiments. Biometrics 1998; 54: 1014–1029.

5. Buyse, Molenberghs, Burzykowski, et al. The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics 2000; 1: 49–67.

6. Daniels, Hughes. Meta-analysis for the evaluation of potential surrogate markers. Stat Med 1997; 16: 1965–1982.

7. Buyse, Burzykowski, Carroll, et al. Progression-free survival is a surrogate for survival in advanced colorectal cancer. J Clin Oncol 2007; 25: 5218–5224.

8. Sormani, Bonzano, Roccatagliata, et al. Surrogate endpoints for EDSS worsening in multiple sclerosis: a meta-analytic approach. Neurology 2010; 75: 302–309.

9. Burzykowski, Buyse, Piccart-Gebhart, et al. Evaluation of tumor response, disease control, progression-free survival, and time to progression as potential surrogate end points in metastatic breast cancer. J Clin Oncol 2008; 26: 1987–1992.

10. Oba, Paoletti, Alberts, et al. Disease-free survival as a surrogate for overall survival in adjuvant trials of gastric cancer: a meta-analysis. J Natl Cancer Inst 2013; 105: 1600–1607.

11. Gabler, French, Strom, et al. Validation of 6-minute walk distance as a surrogate end point in pulmonary arterial hypertension trials. Circulation 2012; 126: 349–356.

12. van Houwelingen, Arends, Stijnen. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat Med 2002; 21: 589–624.

13. Riley, Abrams, Lambert, et al. An evaluation of bivariate random-effects meta-analysis for the joint synthesis of two correlated outcomes. Stat Med 2007; 26: 78–97.

14. Bujkiewicz, Thompson, Sutton, et al. Multivariate meta-analysis of mixed outcomes: a Bayesian approach. Stat Med 2013; 32: 3926–3943.

15. Lunn, Thomas, Best, et al. WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 2000; 10: 325–337.

16. StataCorp. Stata Statistical Software: Release 12. College Station, TX: StataCorp LP, 2011.

17. Berkey, Hoaglin, Antczak-Bouckoms, et al. Meta-analysis of multiple outcomes by regression with random effects. Stat Med 1998; 17: 2537–2550.

18. Sutton, Abrams. Bayesian methods in meta-analysis and evidence synthesis. Stat Methods Med Res 2001; 10: 277–303.

19. Gail, Pfeiffer, van Houwelingen, et al. On meta-analytic assessment of surrogate outcomes. Biostatistics 2000; 1: 231–246.

20. Nam I-S, Mengersen, Garthwaite. Multivariate meta-analysis. Stat Med 2003; 22: 2309–2333.

21. Arends, Vokó, Stijnen. Combining multiple outcome measures in a meta-analysis: an application. Stat Med 2003; 22: 1335–1353.

22. Gelman, Hill. Data analysis using regression and multilevel/hierarchical models (analytical methods for social research), New York: Cambridge University Press, 2007.

23. Spiegelhalter. Bayesian graphical modelling: a case-study in monitoring health outcomes. Appl Stat J R Stat Soc Ser C 1998; 47: 115–133.

24. Lambert, Sutton, Burton, et al. How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS. Stat Med 2005; 24: 2401–2428.

25. Higgins JPT, Thompson, Spiegelhalter. A re-evaluation of random-effects meta-analysis. Appl Stat J R Stat Soc Ser A 2009; 172: 137–159.

26. Lee, Thompson. Flexible parametric models for random-effects distributions. Stat Med 2007; 27: 418–434.

27. Marshall, Spiegelhalter. Comparing institutional performance using Markov chain Monte Carlo methods. In: Everitt, Dunn (eds). Statistical analysis of medical data: new developments, London: Arnold, 1998, pp. 229–249.

28. Smith, Spiegelhalter, Thomas. Bayesian approaches to random effects meta-analysis: a comparative study. Stat Med 1995; 14: 2685–2699.

29. Sharp. Meta-analysis regression. Stata Tech Bull 1998; 42: 16–22.

30. White. Multivariate random-effects meta-analysis. Stata J 2009; 9: 40–56.

31. Bujkiewicz, Thompson, Sutton, et al. Use of Bayesian multivariate meta-analysis to estimate HAQ for mapping onto EQ-5D in rheumatoid arthritis. Value Health 2014; 17: 109–115.

32. Stijnen T, Hamza TH and Özdemir P. Random effects meta-analysis of event outcome in the framework of the generalized linear mixed model with applications in sparse data. Stat Med 2008; 29: 3046–3067.

33. Carroll, Roeder, Wasserman. Flexible parametric measurement error models. Biometrics 1999; 55: 44–54.

34. Laird, Louis. Empirical Bayes confidence intervals for a series of related experiments. Biometrics 1989; 45: 481–495.

35. Böhning. Meta-analysis: a unifying meta-likelihood approach framing unobserved heterogeneity, study covariates, publication bias, and study quality. Meth Inform Med 2005; 44: 127–135.

36. Van Houwelingen, Zwinderman, Stijnen. A bivariate approach to meta-analysis. Stat Med 1993; 12: 2273–2284.

37. Stijnen, Van Houwelingen. Empirical Bayes methods in clinical trials meta-analysis. Biometr J 1990; 32: 335–346.

38. Burr, Doss, Cooke, et al. A meta-analysis of studies on the association of the platelet PlA polymorphism of glycoprotein IIIa and risk of coronary heart disease. Stat Med 2003; 22: 1741–1760.

39. Burr, Doss. A Bayesian semiparametric model for random-effects meta-analysis. J Am Statist Ass 2005; 100: 242–251.

40. Ohlssen, Sharples, Spiegelhalter. Flexible random-effects models using Bayesian semi-parametric models: applications to institutional comparisons. Stat Med 2007; 26: 2088–2112.

41. Riley, Price, Jackson, et al. Multivariate meta-analysis using individual participant data. Res Synth Methods 2015; 6: 157–174.

Uncertainty in the Bayesian meta-analysis of normally distributed surrogate endpoints
