Sage Journals: Discover world-class research

Abstract

Effect sizes typically vary among studies of the same intervention. In a random effects meta-analysis, this source of variation is taken into account, at least to some extent. However, when we have only one study, the heterogeneity remains hidden and unaccounted for. Treating the study-level effect as if it is the population-level effect leads to underestimation of the uncertainty. We propose an empirical Bayesian approach to address this problem. We start by estimating the distribution of the population-level effects and heterogeneity among 1635 meta-analyses from the Cochrane Database of Systematic Reviews. Using both synthetic data and cross-validation, we assess the consequences of using these estimated distributions as prior information for the analysis of single trials. We find that our Bayesian “meta-analyses of single studies” perform much better than naively assuming non-varying effects. The prior on the heterogeneity results in better quantification of the uncertainty. The prior on the treatment effect substantially reduces the mean squared error both for estimating the study-level and population-level effects. For the latter, this reduction is equivalent to doubling the sample size.

Keywords

Meta-analysis heterogeneity Bayes empirical Bayes

1. Introduction

It is common practice to interpret the observed treatment effect of a clinical trial as an estimate of an underlying true treatment effect. This is reasonable. We claim that it is also common to tacitly, perhaps even unconsciously, assume that this true effect is an immutable property of the treatment. Thus, one might speak of “the” effect of a particular treatment or intervention, and expect the same effect in another study about the same treatment. However, there are many good reasons to expect variation among the underlying effects from differences between study populations, application of the treatment and measurement protocols among other factors. There is also much empirical evidence of effect heterogeneity from random effects meta-analyses.^1–3 Failure to take this into account when interpreting the results of a trial will lead to underestimation of the uncertainty about the treatment effect.

Recall that meta-analysis is a quantitative approach for combining inferences from multiple studies. Besides providing an estimate of the average or “population-level” effect of the treatment, together with some measure of uncertainty such as a standard error or confidence interval, a meta-analysis also provides an estimate of the variation of underlying effects of the individual studies. Understanding and quantifying heterogeneity is an important aspect of meta-analysis, as it can influence the interpretation of the overall results.

The Cochrane Database of Systematic Reviews (CDSR) is a globally respected collection of evidence-based healthcare information, comprising rigorous and comprehensive systematic reviews on diverse medical topics, with many of those reviews including meta-analyses.⁴ The database is continuously updated, offering the latest evidence to inform clinical practice, policy decisions, and research priorities. It adheres to strict quality standards, undergoes peer review, and includes open-access summaries for wider accessibility.

We will assume the standard random effects meta-analysis model. That is, we assume the following two-part hierarchical (or multilevel) model for the $j$ -th study in a collection of studies of the same treatment

\begin{aligned} β_{j} & = μ + u_{j} \end{aligned}

(1)

\begin{aligned} b_{j} & = β_{j} + ε_{j} \end{aligned}

(2)

The error terms

u_{j} \sim normal (0, τ)

and

ε_{j} \sim normal (0, s_{j})

are assumed to be independent. The first part (1) models the variation among studies of the same treatment. Here,

μ

is the population-level average treatment effect in the (hypothetical) superpopulation of similar studies and

β_{j}

is the study-level effect in the

j

-th individual study. The variance

τ^{2}

is referred to as the heterogeneity. The second part (2) models the uncertainty of the estimates from each individual study. We assume that the estimate

b_{j}

from the

j

-th study is unbiased and normally distributed with standard error

s_{j}

When we have only one study, we cannot separate the two error terms. Without additional assumptions, there is nothing more to do than estimate $β_{1}$ by $b_{1}$ with confidence interval $b_{1} \pm 1.96 s_{1}$ . If we make the extra assumption that there is no heterogeneity, i.e. $τ = 0$ , then we can also estimate the population average effect $μ$ by $b_{1}$ with confidence interval $b_{1} \pm 1.96 s_{1}$ . As we discussed, we believe that this extra assumption is commonly—if implicitly—made when interpreting the result of a single trial. However, $τ$ is typically not zero, so this is not correct.

Instead of assuming that $τ$ is zero, we propose a Bayesian approach with the prior information based on a hierarchical model fit to the CDSR. First, we estimate the distributions of both $μ$ and $τ$ across the CDSR and use those as empirical priors. Estimating one or both of these distributions from collections of meta-analysis is not new, see for instance.^2,3,5 We then proceed to use the R package baggr⁶ to use these distributions as prior information for Bayesian inference. We call this meta-analysis with a single trial. The baggr package is based on Stan, a state-of-the-art platform for statistical modeling and high-performance statistical computation.⁷

We compare the performance across the trials of the CDSR of our Bayesian approach to the naive approach of assuming that $τ$ is zero. To do this, we construct a “synthetic copy” of the CDSR and also use a kind of leave-one-out cross-validation.

2. Meta-analysis with a single trial

2.1. Estimating the distributions of $μ$ and $τ$

Many of the systematic reviews in the CDSR are accompanied by meta-analyses. These data have been processed and made available by Schwab (2020).⁸ Since the trials in the same meta-analysis are evaluating the same intervention, we can use the data from the CDSR to study how the effects of the same treatment vary across multiple studies.

We start by selecting trials with either a binary or numerical primary efficacy outcome; these comprise 97% of the trials in the CDSR. To make the effect sizes of the binary and numerical outcomes comparable, we quantify the treatment effect on the probit scale for all binary outcomes, and as the standardized mean difference (SMD) for all continuous outcomes. Next, we select all meta-analyses with at least 5 individual studies. Meta-analyses with fewer studies have little information about heterogeneity and discarding them reduces computation time. This leaves 18,368 unique trials from 1635 meta-analyses. We consider the following hierarchical model. For the $j$ -th individual study in the $i$ -th meta-analysis, we assume

\begin{aligned} β_{i j} & = μ_{i} + u_{i j} \end{aligned}

(3)

\begin{aligned} b_{i j} & = β_{i j} + ε_{i j} \end{aligned}

(4)

where

u_{i j} \sim normal (0, τ_{i})

and

ε_{i j} \sim normal (0, s_{i j})

. All the

u_{i j}

and

ε_{i j}

are assumed to be independent. From the CDSR, we obtain the pairs

(b_{i j}, s_{i j})

. We define the

z

-values for each pair as

z_{i j} = b_{i j} / s_{i j}

We want to estimate the distribution of the $μ_{i}$ and $τ_{i}$ across the CDSR. We assume that the $μ_{i}$ follow a generalized $t$ distribution and that the $τ_{i}$ are lognormally distributed. Moreover, we assume that the $μ_{i}$ and $τ_{i}$ are independent.

To estimate the five parameters of our model (the mean, scale and degrees of freedom of the generalized $t$ -distribution and the mean and standard deviation of the normal distribution) we use a Bayesian approach with uniform priors on the two means, the degrees of freedom, and the logarithm of the two scale parameters. The posterior distributions of the parameters are approximately normal, so that the posterior means are approximately equal to the maximum likelihood estimates. We show these estimates in the top row of Table 1.

Table 1.

Estimated parameters of the $t$ distribution of $μ$ and the normal distribution of $\log τ$ . In the bottom row, we restrict the mean of $μ$ to be zero. These estimates are based on 18,368 unique trials from 1635 meta-analyses.

	$t$ distribution of $μ$			Normal distribution of $\log τ$
	Center	Scale	df	Mean	Std Dev
Unrestricted	$- 0.11$	0.37	5.30	$- 1.82$	0.89
Zero mean	0.00	0.37	5.18	$- 1.82$	0.90

We can provide some context for these estimated distributions by recalling the tentative classification of effect sizes by Cohen according to which SMD values of 0.2–0.5 are considered small, 0.5–0.8 are considered medium, and >0.8 are considered large.^9,10 The estimated $t$ -distribution of $μ$ implies that the median of the absolute value of $μ$ is 0.28 (IQR: 0.13–0.50). In other words, 75% of the population-level average effects may be considered to be small. The median of $τ$ is 0.16 (IQR: 0.09–0.30). So, we find that the heterogeneity is roughly on the order of half the effect size.

We also fit the model restricting the center of the distribution of effect sizes $μ_{i}$ to be 0, and show the resulting estimates in the bottom row of Table 1. We will use this distribution as our prior, because it ensures that we treat positive and negative effect estimates equally. This seems fair because, to some extent, the sign of the effect estimate is arbitrary. For example, one could either consider the proportion of patients alive or dead after one year, but this choice should not have any material effect on our inferences about the effectiveness of the treatment.

2.2. An example

We demonstrate our approach with a small example. Suppose that we have a single trial with a numerical outcome, with estimated SMD of $b = 0.7$ . Suppose that the standard error also 0.7, so that the $z$ -value is 1 and the $p$ -value is 0.32. The 95% confidence interval is $(- 0.7, 2.1)$ .

We can use the R package baggr to incorporate the prior information that is represented in Table 1. We first use a flat prior for the population-level effect $μ$ and an informative prior for the heterogeneity.

We find that the posterior distribution of the study-level effect (the true effect in the trial) $β$ remains numerically the same. That is, the posterior mean is $\hat{β} = 0.7$ with 95% posterior interval from $-$ 0.7 to 2.1. Now we also obtain an estimate of the population-level effect (the average effect in similar trials) $μ$ . The posterior mean is $\hat{μ} = 0.7$ with 95% posterior interval $(- 0.9, 2.2)$ . The prior information about the heterogeneity is reflected in a wider interval.

Next, we also incorporate information about the population-level effect $μ$ . We use the following call to baggr.

The zero-mean prior for $μ$ induces considerable shrinkage. We find that the posterior mean of $β$ is now $\hat{β} = 0.24$ with 95% posterior interval from $- 0.5$ to 1.1. The posterior mean of $μ$ becomes $\hat{μ} = 0.17$ with 95% posterior interval $(- 0.5, 1.0)$ . Moreover, using the code from the Supplemental Appendix, we find that the posterior probability that $β$ is positive is 0.72, while the posterior probability that $μ$ is positive is 0.67.

2.3. The probability of the correct sign

Next, we use baggr to perform all 18,368 meta-analyses with one trial and compute the posterior probabilities that the observed $b_{i j}$ has the same sign as $β_{i j}$ and $μ_{i}$ . We plot the observed absolute $z$ -values $| z_{i j} | = | b_{i j} / s_{i j} |$ versus these probabilities in Figure 1. We fitted smooth regression curves which can be interpreted as conditional probabilities given the observed absolute $z$ -value. We see that a single trial never provides certainty about the sign of $μ$ .

Figure 1.

Conditional probabilities of $b$ having the same sign as $β$ or $μ$ given the observed absolute $z$ -value.

3. Performance of the method

3.1. Building a synthetic CDSR

We construct a “synthetic” CDSR to evaluate the performance of our Bayesian approach and compare it to the naive approach of assuming that $τ$ is 0. To generate the synthetic database we perform the following steps:

Sample $μ_{i}^{*}$ and $τ_{i}^{*}$ ( $i = 1, 2, \dots, 1635$ ) from the estimated distribution in the top row of Table 1. We are not restricting the $μ^{*}$ to be 0 on average.

To induce dependence between the pairs $μ_{i}^{*}$ and $τ_{i}^{*}$ , sort them in the same order as the maximum likelihood estimates of $μ_{i}$ and $τ_{i}$ , which we obtain by performing the meta-analyses of the CDSR data. Break any ties at random.

Sample independent $β_{i j}^{*}$ ( $i = 1, 2, \dots, 1635$ and $j = 1, 2, \dots, n_{i}$ ) from the normal distribution with mean $μ_{i}^{*}$ and standard deviation $τ_{i}^{*}$ .

Sample $b_{i j}^{*}$ from the normal distribution with mean $β_{i j}^{*}$ and the observed standard deviation $s_{i j}$ . Set $z_{i j}^{*} = b_{i j}^{*} / s_{i j}$ .

For $i = 1, 2, \dots, 1635$ and $j = 1, 2, \dots, n_{i}$ , we now have simulated sets $(μ_{i}^{*}, τ_{i}^{*})$ and $(β_{i j}^{*}, b_{i j}^{*}, s_{i j})$ that should be similar to the original CDSR. We can confirm this to some extent by comparing the distribution of the observed $b_{i j}$ and $z_{i j}$ to the simulated $b_{i j}^{*}$ and $z_{i j}^{*}$ in Figure 2.

Figure 2.

Observed and simulated distributions of 18,368 estimates and $z$ -values.

3.2. Estimating the effect in the trial

Once again we use baggr to perform all meta-analyses with one trial in the simulated dataset to estimate $β_{i j}^{*}$ and $μ_{i j}^{*}$ by their respective posterior means ${\hat{β}}_{i j}^{*}$ and ${\hat{μ}}_{i j}^{*}$ . We also construct uncertainty (“credible”) limits. We then compare to the naive approach where we estimate both quantities by $b_{i j}^{*}$ with interval $(b_{i j}^{*} \pm 1.96 \times s_{i j})$ .

In Table 2, we show the root mean squared error (RMSE), bias of the magnitude, and coverage of these two estimators. To be precise, for the naive approach we compute

\begin{aligned} mean squared error & = \frac{1}{18, 368} \sum_{i, j} (b_{i j}^{*} - β_{i j}^{*})^{2} \end{aligned}

(5)

\begin{aligned} bias of the magnitude & = \frac{1}{18, 368} \sum_{i, j} | b_{i j}^{*} | - | β_{i j}^{*} | \end{aligned}

(6)

\begin{aligned} coverage & = \frac{1}{18, 368} \sum_{i, j} 1_{| b_{i j}^{*} - β_{i j}^{*} | < 1.96 s_{i j}} \end{aligned}

(7)

Table 2.

Performance of the unbiased estimators $b_{i j}^{*}$ and Bayesian estimators ${\hat{β}}_{i j}^{*}$ of the effects in the trial $β_{i j}$ . The right side of the table shows the average over the statistically significant trials, that is, $| z_{i j}^{*} | > 1.96$ .

	Unconditional			Statistically significant
Method	RMSE	Bias	Coverage	RMSE	Bias	Coverage
Unbiased	0.38	0.10	0.95	0.42	0.21	0.90
Bayes	0.29	$- 0.06$	0.95	0.29	$- 0.01$	0.95

RMSE: root mean squared error.

We compute the analogous quantities for the Bayesian approach. The left side Table 2 displays the three performance measures for all trials and the right side by averaging only over the statistically significant trials with $| z_{i j}^{*} | > 1.96$ .

Over all trials, the bias of the magnitude for the naive estimator is 0.1, which is due to Jensen’s inequality; recall that the absolute value is a convex function. As expected, the coverage of the usual confidence interval equals its nominal level. The right side of the table shows that selection on significance increases the upward bias of the magnitude to 0.21. This is sometimes called the “winner’s curse.” This bias also causes the mean squared error to increase. Moreover, the usual confidence interval no longer reaches nominal coverage.

When we turn to our Bayesian approach, we find that the RMSE is substantially reduced compared to the unbiased estimator: the MSE is reduced from ${0.38}^{2} = 0.14$ to ${0.29}^{2} = 0.08$ , which would be equivalent to an increase of the sample size of more than 71%! Conditionally on statistical significance, the reduction of the RMSE is even greater.

The reductions in the RMSE are due to the “shrinkage” that is induced by the zero-mean prior for $μ$ which implies a zero-mean prior for $β$ . This pulls the unbiased naive estimate towards zero, that is $| {\hat{β}}_{i j}^{*} | < | b_{i j}^{*} |$ . We also see the effect of shrinkage in the reduction of the bias of the magnitude. On average across all the trials, the bias of the magnitude of the Bayesian estimator is $- 0.06$ . When we condition on statistical significance, however, this bias is almost exactly offset by the winner’s curse. The coverage of the Bayesian uncertainty interval is nominal, both unconditionally and conditionally. The superior performance of a similar Bayesian estimator was seen by van Zwet et al.¹¹

3.3. Estimating the population average effect

We also have two estimators for the population average effect $μ_{i}^{*}$ , namely the naive estimator $b_{i j}^{*}$ and the Bayesian estimator ${\hat{μ}}_{i j}^{*}$ . Again, we compute the RMSE, bias of the magnitude, and coverage both conditionally and unconditionally on statistical significance. We show the results in Table 3.

Table 3.
Performance of the unbiased estimators $b_{i j}^{}$ and Bayesian estimators ${\hat{μ}}_{i j}^{}$ of the population average effect $μ_{i}^{}$ . The right side of the table shows the average over the statistically significant trials, that is, $| z_{i j}^{} | > 1.96$ .

Unconditional Statistically significant

Method RMSE Bias Coverage RMSE Bias Coverage

Unbiased 0.53 0.15 0.82 0.72 0.41 0.63

Bayes 0.35 $- 0.11$ 0.93 0.38 $- 0.03$ 0.94

	Unconditional	Statistically significant
Unbiased	0.53	0.15	0.82	0.72	0.41	0.63
Bayes	0.35	$- 0.11$	0.93	0.38	$- 0.03$	0.94

RMSE: root mean squared error.

Essentially, the same observations apply to the results Table 3 as to those in Table 2. The reduction in the RMSE of the Bayesian estimator compared to the unbiased estimator is even more extreme. The MSE is reduced by more than a factor of 2 from ${0.53}^{2} = 0.28$ to ${0.35}^{2} = 0.12$ which is equivalent to more than doubling the sample size. Also, the bias of $| b_{i j}^{*} |$ as an estimator of $| μ_{i j}^{*} |$ is large, especially for statistically significant trials. This bias is completely absent for the Bayesian estimator.

The coverage of the usual confidence interval is far below 95% both with and without conditioning on statistical significance. This is a direct result of not taking the heterogeneity into account. The coverage of the Bayesian uncertainty interval is close to nominal in both cases.

3.4. Graphical comparison

Tables 2 and 3 provide a broad overview of the performance of the naive and Bayesian estimators. We will now study the performance in some more detail both from the frequentist and Bayesian points of view. The frequentist point of view means that we condition on the true effects $β_{i j}^{*}$ and $μ_{i}^{*}$ . The Bayesian point of view, on the other hand, means that we condition on the observed effect $b_{i j}^{*}$ .

We first consider the bias. The naive estimator $b_{i j}^{*}$ is unbiased both for $β_{i j}^{*}$ and $μ_{i}^{*}$ so we do not need to study this further. We focus on the bias of the Bayesian estimators ${\hat{β}}_{i j}^{*}$ and ${\hat{μ}}_{i j}^{*}$ which are both shrinkage estimators, in the sense that $| {\hat{μ}}_{i j}^{*} | < | {\hat{β}}_{i j}^{*} | < | b_{i j}^{*} |$ .

In the top left panel of Figure 3, we plot the estimation errors ${\hat{β}}_{i j}^{*} - β_{i j}^{*}$ versus the true effects in the trials $β_{i j}^{*}$ . In the bottom left panel, we plot the errors ${\hat{μ}}_{i j}^{*} - μ_{i}^{*}$ versus the true pooled effects $μ_{i j}^{*}$ versus. In the two right panels, we plot the same estimation errors, but in both cases we put the observed effects $b_{i j}^{*}$ on the $x$ -axis. We added the loess regression curves to each of the four plots, which estimate the following conditional expectations:

\begin{aligned} E ({\hat{β}}_{i j}^{*} - β_{i j}^{*} ∣ β_{i j}^{*}) (top left panel of Figure 3) \\ E ({\hat{μ}}_{i j}^{*} - μ_{i}^{*} ∣ μ_{i}^{*}) (bottom left panel of Figure 3) \\ E ({\hat{β}}_{i j}^{*} - β_{i j}^{*} ∣ b_{i j}^{*}) (top right panel of Figure 3) \\ E ({\hat{μ}}_{i j}^{*} - μ_{i}^{*} ∣ b_{i j}^{*}) (bottom right panel of Figure 3) \end{aligned}

The Bayesian estimators

{\hat{β}}_{i j}^{*}

and

{\hat{μ}}_{i j}^{*}

are both biased in the frequentist sense, that is, conditional on the true parameter value. This bias is apparent in the two left panels of Figure 3.

Figure 3.

Bias of the proposed Bayesian estimators. The two top panels show ${\hat{β}}_{i j}^{*} - β_{i j}^{*}$ . The two bottom panels show ${\hat{μ}}_{i j}^{*} - μ_{i}^{*}$ . The two panels on the left side show the frequentist bias, while the two panels on the right side show the Bayesian bias.

The two right panels paint a different picture. They show the bias in the Bayesian sense, that is, conditional on the observed effect. In this sense, the bias is negligible! The small bias that remains is due to our choice of a zero-mean prior while the average of the $b_{i j}^{*}$ is slightly negative.

Figure 4 shows the difference of the squared errors between the naive and Bayesian estimators. To be specific, in the top row we show these differences for estimating the $β_{i j}^{*}$

(b_{i j}^{*} - β_{i j}^{*})^{2} - ({\hat{β}}_{i j}^{*} - β_{i j}^{*})^{2}

(8)

and in the bottom row we show them for estimating the

μ_{i}^{*}

(b_{i j}^{*} - μ_{i}^{*})^{2} - ({\hat{μ}}_{i j}^{*} - μ_{i}^{*})^{2}

(9)

In both cases, positive values favor the Bayesian estimators. Again, we added the loess regression curves, which now estimate the following conditional expectations:

\begin{aligned} E ((b_{i j}^{*} - β_{i j}^{*})^{2} - ({\hat{β}}_{i j}^{*} - β_{i j}^{*})^{2} ∣ β_{i j}^{*}) (top left panel of Figure 4) \\ E ((b_{i j}^{*} - μ_{i}^{*})^{2} - ({\hat{μ}}_{i j}^{*} - μ_{i}^{*})^{2} ∣ μ_{i}^{*}) (bottom left panel of Figure 4) \\ E ((b_{i j}^{*} - β_{i j}^{*})^{2} - ({\hat{β}}_{i j}^{*} - β_{i j}^{*})^{2} ∣ b_{i j}^{*}) (top right panel of Figure 4) \\ E ((b_{i j}^{*} - μ_{i}^{*})^{2} - ({\hat{μ}}_{i j}^{*} - μ_{i}^{*})^{2} ∣ b_{i j}^{*}) (bottom right panel of Figure 4) \end{aligned}

Figure 4.

Difference in squared error of the naive and Bayesian estimators. Top row: difference in squared errors for $β_{i j}^{*}$ from (8). Bottom row: difference in squared errors for $μ_{i}^{*}$ , from (9). The two panels on the left side show the frequentist perspective, while the two panels on the right side show the Bayesian perspective.

The two panels on the left side of Figure 4 show the frequentist perspective, where we condition on the true parameter values. We see that the Bayesian estimator performs better for small values of the parameter, while the naive estimator performs better for large values. The two panels on the right side of Figure 4 show the Bayesian perspective, where we condition on the observed $b_{i j}^{*}$ . Unsurprisingly, from the Bayesian perspective, the Bayesian estimators always perform better.

3.5. Cross-validation

We have done our best to make sure that the synthetic CDSR closely resembles the true CDSR, so that the results of the previous sections also apply to the true CDSR. However, we cannot exclude the possibility that there are some systematic differences between the synthetic and true datasets which affect the relative performance of the Bayesian and naive estimators. However, we can conduct some additional checks that do not require synthetic CDSR.

First, since the estimates $b_{i j}$ are unbiased for the $β_{i j}$ the RMSE is approximately $\sqrt{\frac{1}{18, 368} \sum_{i j} s_{i j}^{2}} = 0.38$ , which agrees with the value of 0.38 from Table 2. The $b_{i j}$ are also unbiased for the $μ_{i}$ , so the RMSE can be estimated by $\sqrt{\frac{1}{18, 368} \sum_{i j} s_{i j}^{2} + τ_{i, mle}^{2}} = 0.50$ , which agrees reasonably well with the value of 0.53 in Table 3.

Second, the first column in Table 3 provides an approximation of the difference in mean squared errors across the CDSR,

\frac{1}{N} \sum_{i j} (b_{i j} - μ_{i})^{2} - ({\hat{β}}_{i j} - μ_{i})^{2}

(10)

by the difference across the synthetic CDSR,

\frac{1}{N} \sum_{i j} (b_{i j}^{*} - {\hat{μ}}_{i})^{2} - ({\hat{β}}_{i j}^{*} - {\hat{μ}}_{i})^{2} = 0.28 - 0.12 = 0.16

(11)

Despite $μ_{i}$ not being observed, there is an alternative, direct way to estimate (10). We start by constructing a third estimator of $μ_{i}$ which is unbiased and independent of both $b_{i j}$ and ${\hat{μ}}_{i j}$ . We leave out study $j$ from meta-analysis $i$ and run a simple fixed effects meta-analysis on the remaining $n_{i} - 1$ studies to obtain an estimate ${\hat{μ}}_{i}^{- j}$ of the population average effect $μ_{i}$ . This estimator is unbiased for $μ_{i}$ because the individual study estimates are. Moreover, it is independent of both $b_{i j}$ and ${\hat{μ}}_{i j}$ because it is based on different studies. According to van Zwet et al.,¹¹ we proved the following proposition.

Proposition 1

Consider three estimators $T_{0}$ , $T_{1}$ , and $T_{2}$ of a parameter $θ$ . Suppose that, conditionally on $θ$ , $T_{0}$ is unbiased and independent of $T_{1}$ and $T_{2}$ . Then

E (T_{1} - T_{0})^{2} - E (T_{2} - T_{0})^{2} = E (T_{1} - θ)^{2} - E (T_{2} - θ)^{2}

(12)

where the expectations are with respect to arbitrary distributions of

T_{0}

T_{1}

T_{2}

, and

θ

(as long as the expectations are well-defined and finite).

If $T_{0} = {\hat{μ}}_{i}^{- j}$ , $T_{1} = b_{i j}$ , $T_{2} = {\hat{β}}_{i j}$ , and $θ = μ_{i}$ , then it follows that (as $N$ approaches infinity)

\frac{1}{N} \sum_{i j} (b_{i j} - μ_{i})^{2} - ({\hat{β}}_{i j} - μ_{i})^{2} \approx \frac{1}{N} \sum_{i j} (b_{i j} - {\hat{μ}}_{i}^{- j})^{2} - ({\hat{β}}_{i j} - {\hat{μ}}_{i}^{- j})^{2}

(13)

We can compute the right side directly from the original CDSR. We find that it equals 0.19. This is reasonably close to the difference of 0.16 from Table 3. This strengthens our confidence in the results from the synthetic CDSR.

4. Discussion

4.1. Using these results in applied research

Between-study variation of the treatment effect is often present in systematic reviews. Such heterogeneity may be due to differences in study populations, methodologies, or measurement techniques. In a random effects, meta-analysis the uncertainty due to between-study variation can be accounted for, but it remains hidden when we have only a single trial. In that case, without external information, there is no choice but to identify the within-study effect with the population-level effect. This leads to underestimation of the uncertainty about the population-level effect.

We propose a Bayesian approach, which we refer to as a “meta-analysis of a single trial,” where we estimate the distribution of treatment effects and heterogeneity across 1635 meta-analyses from the CDSR. Taking these estimated distributions as prior information provides a substantial improvement in performance both for estimating the effect in the trial (Table 2) and for estimating the population average effect among similar trials (Table 3). The Bayesian meta-analysis of a single trial can easily be done in R by using package baggr.⁶

The Bayesian approach results in a large reduction of the RMSE across the trials of the CDSR compared to the usual unbiased estimator. This is to be expected for shrinkage estimation, and we have previously obtained similar results.¹¹ Here, we want to draw special attention to the substantial lack of coverage of the usual confidence interval for the population average effect; see Table 3. This is due to failure to account for the heterogeneity. In contrast, the coverage of the Bayesian uncertainty interval is equal to its nominal level.

Figure 1 shows that a single trial essentially never provides certainty about the sign of population average effect. This is a strong argument for the need for replication studies.

Since our prior distributions refer to the population of trials in the CDSR, our posterior statements can be interpreted in terms of random sampling from the CDSR. So, for example, we can say that if we randomly select a trial from the CDSR (or from the population of all the trials that could be in the CDSR) and we observe that the estimated treatment effect is $b = 0.7$ with standard error 0.7, then the probability that the true effect in that trial is also positive is 72% (cf. Section 2.2).

Now suppose we are interested in a particular trial where the estimated treatment effect is $b = 0.7$ with standard error 0.7. To which extent does that probability of 72% apply? The trial has not been randomly selected from the CDSR. By considering it as such, we are essentially ignoring all of its distinguishing features. An assessment will have to be made if those features should substantially change the 72% probability.

We stress that we do not propose that our method should replace the usual estimate and its standard error. It is important that those are reported so that they may be combined with other information such as a meta-analysis at some later time.

About 75% of the meta-analyses in the CDSR have five or fewer studies.⁴ When a meta-analysis consists of so few studies, it is clear that the heterogeneity cannot be estimated reliably without additional information.¹ Similarly as in the case of a single study that we outlined here, we should expect that doing a Bayesian meta-analysis with informative priors will improve inference. We intend to further evaluate the performance of this approach in a separate study.

The method we propose is especially well suited to decision making. Consider a decision maker choosing between interventions on the basis of a meta-analysis of a single or just a few studies. A model example would be the health technology assessment (HTA) process, which uses meta-analytic estimates to calculate cost-effectiveness. Both priors we propose play a role here. First, the prior on the mean effect induces shrinkage which helps avoid exaggeration of effect sizes. This leads to better rank ordering of treatments. Next, the prior on the heterogeneity improves the predictive distributions of effects, which can then be used by the decision maker to calculate fully Bayesian estimates of cost effectiveness. This matches the approach implemented in, for example, the UK by NICE (Dias et al.¹²).

4.2. Bayesian meta-analysis

Statistical practice—Bayesian and otherwise—has incoherence with respect to the number of studies $K$ in a meta-analysis or, more generally, the number of groups in a multilevel model. We can see this by starting with a large $K$ and then seeing what happens as it decreases.

When $K$ is large, say larger than 10, the study-level variance and thus the optimal shrinkage factor can be well estimated from the data, or if the meta-analysis includes individual and study-level predictors, the unexplained group-level variance can be well estimated.

As $K$ becomes smaller, there is more uncertainty in the study-level variance parameter, and prior information on that parameter becomes more relevant to determining the amount of shrinkage. When $K$ is between 5 and 10, the prior on $τ$ can make more of a difference: first by providing area-specific prior information and second through the regularization properties of weakly informative priors that (probabilistically) constrain the low and high ends of the distribution. A regularizing prior on the high end can be necessary to reduce the upper tail of the posterior for $τ$ ; in full Bayesian inference with a flat prior, the resulting long tail manifests itself by giving some probability of essentially no shrinkage, leading to wide uncertainties for the effects in individual studies. If $τ$ is estimated using a marginal posterior mode, it can also be helpful to use zero-avoiding priors¹³ as otherwise the point estimate for $τ$ can be noisy (zero in some cases and high in others), leading to meta-analyses that uncontrollably swing between complete pooling and little shrinkage in otherwise similar cases. Another advantage of an informative prior is that it reduces the influence of one or two outlying studies.

When $K$ is small, between 3 and 5, it is still possible to perform Bayesian inference on $τ$ with a flat prior, but the resulting posterior mode is noisy and the full Bayesian posterior for $τ$ will have a long right tail,¹⁴ and so in practice an informative prior for the group-level variance is necessary to avoid the meta-analysis procedure yielding unreasonable results. Setting $τ$ equal to zero to perform a so-called fixed or common effects meta-analysis amounts to using a extremely strong and unrealistic prior.

With $K = 2$ , the mathematical situation changes: a flat prior on $τ$ yields an improper posterior distribution with an infinite right tail. This is related to the result from James-Stein theory that the no-shrinkage estimate is admissible when $K < 3$ . In practice, though, there is no sharp boundary between two and three studies, as in either case we want to be using a strong prior for $τ$ . At $K = 2$ , we should expect the prior for $τ$ to dominate to the extent that the amount of shrinkage is determined much more by the prior—that is, by the population of studies being considered as the reference class—than by the observed spread in the data.

When $K = 1$ , the problem doesn’t look like meta-analysis at all: it is just inference from a single study. This leads to the paradox that removing information can be expected to decrease reported uncertainty. The paradox is resolved by recognizing that the estimand has changed. The solution is to consider questions of meta-analysis and between-study variation even in a single study. In other words, this means placing the problem in a hierarchical context: even when multiple data sources are not available, we can still include a prior on between-study variation. This was already going to be necessary with $K = 2$ or 3, so why not do this with $K = 1$ also?

Finally, $K = 0$ corresponds to the setting where no studies are available on a problem of interest, so that the posterior is determined entirely by the prior. This can be viewed as a sort of thought experiment, representing the information being assumed from nothing but the general class of problems under study.

5. Reproducibility

The results in this article are fully reproducible with the R code provided in GitHub repository at github.com/wwiecek/singletrial. The data are publicly available at https://osf.io/xjv9g/. Below we provide the code snippet for calculating probability of positive effects in a trial or population; see the example in Section 2.2.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802251380628 - Supplemental material for Meta-analysis with a single study

Supplemental material, sj-pdf-1-smm-10.1177_09622802251380628 for Meta-analysis with a single study by Erik van Zwet, Witold Wiȩcek and Andrew Gelman in Statistical Methods in Medical Research

Footnotes

Acknowledgements

We thank Simon Schwab and Rebecca Turner for their comments on an early draft of the article. Witold Wiȩcek received funding from FTX Future Fund regranting program to pursue research into meta-analytic methods. Andrew Gelman’s work was supported by ONR grant N000142212648.

ORCID iD

Erik van Zwet

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Witold Wiȩcek received funding from FTX Future Fund regranting program to pursue research into meta-analytic methods. Andrew Gelman’s work was supported by ONR grant N000142212648.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

Kontopantelis

Springate

Reeves

. A re-analysis of the Cochrane Library data: The dangers of unobserved heterogeneity in meta-analyses. PLoS ONE 2013; 8: e69930.

Pullenayegum

. An informed reference prior for between-study heterogeneity in meta-analyses of binary outcomes. Stat Med 2011; 30: 3082–3094.

Turner

Davey

Clarke

, et al. Predicting the extent of heterogeneity in meta-analysis, using empirical data from the Cochrane database of systematic reviews. Int J Epidemiol 2012; 41: 818–827.

Davey

Turner

Clarke

, et al. Characteristics of meta-analyses and their component studies in the Cochrane database of systematic reviews: a cross-sectional, descriptive analysis. BMC Med Res Methodol 2011; 11: 1–11.

Higgins

JPT

Whitehead

. Borrowing strength from external trials in a meta-analysis. Stat Med 1996; 15: 2733–2749.

Wiecek

Meager

. baggr: Bayesian aggregate treatment effects. R package version 0.6.9, 2021. https://CRAN.R-project.org/package=baggr.

Stan Development Team. Stan modeling language user’s guide. 2023. https://doi.org/10.17605/OSF.IO/XJV9G.

Schwab

. Re-estimating 400,000 treatment effects from intervention studies in the Cochrane database of systematic reviews [data set]. Open Science Framework 2020. https://doi.org/10.17605/OSF.IO/XJV9G .

Andrade

. Mean difference, standardized mean difference (SMD), and their use in meta-analysis: As simple as it gets. J Clin Psych 2020; 81: 11349.

10.

Cohen

. Statistical Power Analysis for the Behavioral Sciences, Revised Edition. New York: Routledge, 2013.

11.

van Zwet

Tian

Tibshirani

. Evaluating a shrinkage estimator for the treatment effect in clinical trials. Stat Med 2024; 43: 855–868.

12.

Dias

Welton

Sutton

, et al. A generalised linear modelling framework for pairwise and network meta-analysis of randomised controlled trials. NICE DSU Tech Supp Document 2011; 2: 607–617.

13.

Chung

Rabe-Hesketh

Gelman

, et al. A nondegenerate estimator for hierarchical variance parameters via penalized likelihood estimation. Psychometrika 2013; 78: 685–709.

14.

Gelman

. Prior distributions for variance parameters in hierarchical models. Bayes Anal 2006; 1: 515–533.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

7.00 MB