The use of external data in clinical trials offers numerous advantages, such as reducing enrollment, increasing study power, and shortening trial duration. In Bayesian inference, the information in external data can be transferred into an informative prior for future borrowing (i.e., prior synthesis). However, multisource external data often exhibit heterogeneity, which can cause information distortion during prior synthesis. Clustering helps identify this heterogeneity, enhancing the congruence between the synthesized prior and the external data. Obtaining an optimal clustering is challenging due to the trade-off between congruence with the external data and robustness to future data. We introduce two overlapping indices: the overlapping clustering index and the overlapping evidence index. Using these indices alongside a K-means algorithm, the optimal clustering result can be identified by balancing this trade-off and applied to construct a prior synthesis framework that effectively borrows information from multisource external data. By incorporating the (robust) meta-analytic predictive (MAP) prior within this framework, we develop (robust) Bayesian clustering MAP priors. Simulation studies and a real-data analysis demonstrate their advantages over commonly used priors in the presence of heterogeneity. Because the Bayesian clustering priors are constructed without requiring data from the prospective study, they can be applied to both study design and data analysis in clinical trials.
Incorporating multisource external data into the design and analysis of clinical trials has become a field of significant interest. The U.S. Food and Drug Administration (FDA) has issued the guidance, “Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices,”1 to encourage the integration of external data in new studies. This practice aims to enhance the efficiency of clinical trials by leveraging real-world data (RWD) to inform study parameters, potentially reducing sample size, increasing the statistical power and precision of testing or estimating study outcomes, and accelerating trial timelines.2 In the Bayesian framework, a critical component of incorporating RWD is the construction of informative priors from external data, a process commonly referred to as evidence synthesis. Methods of synthesizing informative priors have been extensively studied. For instance, the meta-analytic predictive (MAP) prior3,4 uses meta-analysis to summarize information from external data into an informative prior. The power prior (PP)5,6 adjusts the influence of external data on the analysis of new data by applying a likelihood discounting approach based on the relevance and reliability of the external data. Commensurate priors7,8 and multisource exchangeability models (MEMs)9 modulate the degree of information borrowing from external data according to their relevance and congruence to the new data. The elastic prior10 dynamically borrows information from external data using a monotonic function of a congruence measure between external and new data.
The use of prior information is critical in both trial design and data analysis. During the trial design stage, where new trial data is not yet available, prior information is typically derived from domain knowledge or external data. In the data analysis stage, once new trial data is available, priors can be refined by evaluating the similarity between the external and new trial data. Constructing an informative prior that is suitable for both trial design and data analysis is challenging. For example, with the exception of the MAP prior, all of the aforementioned priors require the new trial data, which limits their applicability during the trial design stage.
Another challenge arises from the diversity and heterogeneity of multiple data sources, including differences in study design, population characteristics, eligibility criteria, outcome measures, etc.4 Such variability complicates the accurate transfer or borrowing of information from external data into a prior, potentially leading to information distortion and adversely affecting the analysis of the new trial. Therefore, it is crucial to accurately identify the heterogeneity across external datasets. The MAP (or rMAP) prior accommodates this heterogeneity by using a linear mixed model with a random effect parameter.11 However, a single parameter may not adequately capture complex heterogeneity structures, such as scenarios where external datasets comprise multiple clusters with varying degrees of homogeneity. The MEM approach attempts to address this challenge by measuring pairwise exchangeability among the random parameters of interest (associated with external datasets). However, it does not explicitly construct a prior that adapts to the heterogeneity structure. In this paper, we focus on clustering methods, where heterogeneity is identified by partitioning the random parameters of interest (associated with external datasets) into distinct clusters. Bayesian nonparametric clustering with the Dirichlet process12 is a widely used approach. For example, based on it, Chen and Lee13 proposed a Bayesian clustering hierarchical model to dynamically partition sub-trials into clusters for efficient information borrowing in basket trials. Nevertheless, in their model, the number of clusters is determined by a hyperparameter, making it challenging to establish an interpretable and unified criterion for selecting the optimal hyperparameter value across different applications.14 Additionally, to the best of our knowledge, this approach has not been applied to prior synthesis with multisource external data.
To address these challenges, we propose a novel approach for synthesizing informative priors from multisource external data, leveraging the concept of overlapping coefficients.15,16 Specifically, we introduce the overlapping clustering index (OCI) and employ a K-means algorithm to identify heterogeneity across external datasets. To measure the congruence between the synthesized prior and external data, we define an overlapping evidence index (OEI). A higher OEI indicates more accurate information transferred from external data to the prior, reflecting stronger congruence between them. There is, however, an inherent trade-off between maximizing OEI and ensuring the robustness of the prior to new data. To address this, we propose an OEI-based criterion that balances this trade-off, enabling accurate heterogeneity identification through optimal clustering. Compared to existing nonparametric clustering methods,17,18 our criterion is more interpretable and maintains a consistent standard across different applications. Using the optimal clustering results, we introduce the Bayesian clustering prior, a flexible framework for prior synthesis. By integrating the MAP and robust MAP priors within this framework, we develop the Bayesian clustering MAP (BCMAP) and robust Bayesian clustering MAP (rBCMAP) priors. They exhibit desirable properties and are applicable in both trial design and data analysis stages.
Section 2 introduces the notations, assumptions, and key challenges of Bayesian evidence synthesis with multisource external data. In Section 3, we propose two overlapping indices, explore the trade-off between evidence congruence and robustness, and present a K-means algorithm to achieve optimal clustering for accurate heterogeneity identification. Section 4 details the construction of Bayesian clustering priors, provides a sensitivity analysis and OEI-based threshold selection guidelines, and includes an illustrative example. Simulation studies comparing our method with existing approaches are presented in Section 5. Section 6 demonstrates the application of the proposed method to two real-world external datasets, featuring binary and continuous endpoints. Section 7 concludes with a brief discussion. To facilitate reading and referencing, Table 5 in the Appendix lists the symbols and abbreviations used in this paper.
Bayesian evidence synthesis from multisource external data
Let denote external data from multiple sources. Assume to be the common parameter of interest. Bayesian evidence synthesis aims to create an informative prior of from , denoted as . Then, for any new data , the inference of can borrow information from through , where is the likelihood function. In this paper, we assume is unknown; in other words, has no effect on the construction of . In Bayesian inference, the prior distribution represents the beliefs or information about a parameter before observing the new data. Avoiding “use the data twice”19 is fundamental for maintaining the integrity of the Bayesian updating process. Compared to most existing methods, such as the MEM and power prior, which use first in the prior and then in the likelihood, our approach strictly follows the rule of not using the data twice. This makes applicable in both trial design (without ) and data analysis (with ). Another assumption in this paper is the exclusion of covariate information. This assumption stems from the practical challenges of obtaining such information. For instance, patient-level data may be restricted from public access, even for research purposes. Even when this information is accessible, the available covariates often vary across data sources. For example, data source 1 includes covariates , data source 2 includes , and data source 3 includes , leaving only as a common covariate. In such cases, inference relying solely on may lead to questionable conclusions.
In practice, heterogeneity often exists among data sources, such as multi-regional data or data from multiple health centers. One approach to address this issue is to filter out certain datasets to achieve homogeneity. However, without knowledge of covariate information or the new data , this method risks losing valuable information; furthermore, it is hard to determine which data sources should be excluded. An alternative approach is to include all external datasets. In this scenario, the evidence synthesis model used to construct the prior must be capable of effectively identifying and handling the heterogeneity across the various data sources. Otherwise, it may distort the information transferred from to . The information of in is contained in where , , is obtained through , and can be either a weakly informative or an informative prior of . (Note: Instead of using , we utilize the posteriors to reflect the heterogeneity of the external data. This notation emphasizes the congruence between the external data and the synthesized prior .) The information distortion can be illustrated by the example shown in Panels (a-1) and (a-2) of Figure 1. The example includes six external datasets with corresponding posteriors, as shown in Panel (a-1). These datasets are clearly heterogeneous and can be partitioned into two clusters. In Panel (a-2), we examine information transfer at three points: , , and . The transferred information is quantified using the likelihood under priors and . The corresponding likelihoods are for and for . According to the posteriors , the likelihoods at and should be greater than the likelihood at . However, behaves in the opposite way, with . This occurs because the synthesizing method of fails to identify the heterogeneity, thus distorting the information in the external data. Conversely, correctly captures the heterogeneity and accurately reflects the information in the external data, with .
Information distortion and the trade-off between evidence congruence and robustness. Panel (a-1) shows an example with the posterior distributions of six external datasets. Panel (a-2) illustrates the distortion of information due to heterogeneity, demonstrating that prior , which correctly accounts for heterogeneity, is more appropriate than prior . Panels (b-1) and (b-2) demonstrate the trade-off between evidence congruence and robustness. In Panel (b-1), the prior has stronger evidence congruence but weaker robustness. In Panel (b-2), the prior has weaker evidence congruence but stronger robustness.
As discussed above, the quality of a synthesized prior can be evaluated by the congruence of the information about between and . We refer to this congruence as the evidence congruence of with respect to . In Panel (a-2) of Figure 1, it is clear that has stronger evidence congruence than . However, higher evidence congruence is not always better. Robustness is another crucial criterion for evaluating the quality of a synthesized prior.4 As shown in Panels (b-1) and (b-2), is constructed by identifying four clusters in the external datasets, whereas assumes two clusters. It is easy to check that the evidence congruence of is greater than that of . But we prefer because it is more robust. In sum, both criteria, evidence congruence and robustness, are closely related to heterogeneity identification. Since contain all the information about , an accurate clustering of can strike a good balance between evidence congruence and robustness, thereby helping to create a high-quality informative prior.
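The distortion illustrated in Figure 1 can be mimicked with a small numerical sketch. All densities below are hypothetical stand-ins (a pooled Gaussian versus a two-component mixture over clusters centered at 0.2 and 0.6), not the paper's actual priors:

```python
import numpy as np

def norm_pdf(x, mu, sd):
    """Gaussian density evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# A prior pooled over both clusters (ignores heterogeneity): one wide Gaussian.
pooled = lambda t: norm_pdf(t, 0.4, 0.21)
# A cluster-aware prior: equal mixture of the two cluster components.
mixture = lambda t: 0.5 * norm_pdf(t, 0.2, 0.06) + 0.5 * norm_pdf(t, 0.6, 0.06)

# Distortion: the pooled prior is most informative at 0.4, where no external
# dataset is located, while the mixture correctly favors 0.2 and 0.6.
```

Evaluating the two densities at 0.2, 0.4, and 0.6 reproduces the qualitative pattern of Panel (a-2): the pooled prior assigns its highest density to the empty middle region, whereas the mixture does not.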
Overlapping indices
The overlapping coefficient (OVL) is a measure of the intersection area between two probability density or mass functions. Let and be two random variables with probability density or mass functions and , respectively. is the common support of and . OVL can be defined by equation (1). (Note: the integral expression is used in this paper without loss of generality.)
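A minimal numerical sketch of the OVL in equation (1), integrating min(f, g) on a shared grid (the grid and the two normal densities are illustrative choices, not part of the paper's setup):

```python
import numpy as np

def ovl(pdf_f, pdf_g, grid):
    """Numerical OVL: the integral of min(f(x), g(x)) over the common
    support, approximated with a uniform-grid Riemann sum."""
    dx = grid[1] - grid[0]
    return float(np.minimum(pdf_f(grid), pdf_g(grid)).sum() * dx)

def normal_pdf(mu, sd):
    return lambda x: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

grid = np.linspace(-10.0, 10.0, 20001)
same = ovl(normal_pdf(0.0, 1.0), normal_pdf(0.0, 1.0), grid)   # identical densities -> 1
apart = ovl(normal_pdf(0.0, 1.0), normal_pdf(4.0, 1.0), grid)  # 2*Phi(-2), about 0.0455
```

OVL lies in [0, 1]: it equals 1 for identical distributions and approaches 0 as the two distributions separate.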
Based on the concept of OVL, we propose two overlapping indices for the clustering of to address the challenges discussed in Section 2.
Overlapping clustering index and K-means clustering
Let , , be the probability density function (pdf) or probability mass function (pmf) of the posterior distributions obtained from the external datasets . The corresponding random variables are denoted as , . A partition groups them into clusters , . For each cluster , , let be a Gaussian random variable with density function , whose mean and variance are the maximum likelihood estimates (MLEs) obtained from the samples of the random variables in cluster . Then, the overlapping clustering index (OCI) of this partition is defined as follows:
where is an indicator function. The random variables in the same cluster are assumed to be exchangeable. denotes the centroid of cluster , and it is modeled as a Gaussian random variable. This assumption is justified by the central limit theorem, which implies that as the size of increases, the mean of all posteriors within the cluster will approximate a Gaussian distribution, regardless of the differences among , .
measures the overall within-cluster homogeneity of the K-partition. Based on it, we can define the optimal K-partition as follows:
With , we can calculate the corresponding :
An optimal clustering can be found through the following K-means algorithm:
where is a distance measure between two random variables and . The equivalence between and is shown as follows:
The above-mentioned K-means algorithm closely resembles the standard K-means algorithm. The differences lie in the definitions of the centroid and the distance . A detailed description of the proposed K-means algorithm, in the form of pseudo-code, can be found in Algorithm 1 in the Appendix.
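The algorithm above can be sketched as follows. This is a simplified stand-in, not the paper's Algorithm 1: each posterior is summarized by a Gaussian (mean, sd), the centroid of a cluster is a Gaussian with the pooled (mixture) moments of its members in place of an MLE fit to posterior samples, the distance is 1 − OVL, and the seeding rule (objects spread across the sorted means) is an assumption for determinism:

```python
import numpy as np

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def ovl_norm(m1, s1, m2, s2, grid):
    """OVL between two Gaussians via a uniform-grid Riemann sum."""
    dx = grid[1] - grid[0]
    return float(np.minimum(norm_pdf(grid, m1, s1), norm_pdf(grid, m2, s2)).sum() * dx)

def ovl_kmeans(mus, sds, K, n_iter=50):
    """K-means over Gaussian posterior summaries with distance
    d(X, C) = 1 - OVL(X, C); assigning by max OVL is equivalent."""
    mus, sds = np.asarray(mus, float), np.asarray(sds, float)
    grid = np.linspace(mus.min() - 6 * sds.max(), mus.max() + 6 * sds.max(), 4001)
    # Deterministic seeding: start centroids at objects spread across the range.
    order = np.argsort(mus)
    seed = order[np.round(np.linspace(0, len(mus) - 1, K)).astype(int)]
    cm, cs = list(mus[seed]), list(sds[seed])
    labels = None
    for _ in range(n_iter):
        # Assignment step: each object joins the centroid it overlaps most.
        new = np.array([
            int(np.argmax([ovl_norm(m, s, cm[k], cs[k], grid) for k in range(K)]))
            for m, s in zip(mus, sds)
        ])
        if labels is not None and np.array_equal(new, labels):
            break
        labels = new
        # Update step: recompute each centroid's pooled mean and variance.
        for k in range(K):
            idx = np.flatnonzero(labels == k)
            if idx.size:
                cm[k] = mus[idx].mean()
                cs[k] = np.sqrt((sds[idx] ** 2 + (mus[idx] - cm[k]) ** 2).mean())
    return labels

# Six hypothetical posteriors forming two well-separated clusters.
labels = ovl_kmeans([0.18, 0.20, 0.22, 0.58, 0.60, 0.62], [0.05] * 6, K=2)
```

With well-separated clusters the assignments stabilize after a couple of iterations, splitting the six posteriors into the two groups around 0.2 and 0.6.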
Overlapping evidence index and trade-off between evidence congruence and robustness
For any fixed , the heterogeneity among ,…, can be effectively identified through . As demonstrated by Figure 1 in Section 2, the evidence congruence increases as increases, and the information distortion is reduced. However, the robustness of the synthesized prior weakens as increases, because the clusters become increasingly specific and the amount of data in each cluster tends to decrease. It is desirable to find an optimal that strikes a good balance in this trade-off. To help identify the optimal , we introduce the overlapping evidence index (OEI).
Let denote the random variable corresponding to the synthesized prior . The OEI of is defined as the weighted sum of overlapping coefficients between the random variable and each for .
where is the sample size of dataset and .
lies within the interval . It measures the congruence of the synthesized prior with the information in the external data. The higher the , the more congruent the information transferred to from the external data, and the less information distortion occurs. An OEI-based criterion for selecting the optimal number of clusters , aimed at balancing evidence congruence and robustness, is introduced in Section 4.
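The OEI definition above can be sketched numerically. The posteriors, sample sizes, and the two candidate priors (a cluster-aware mixture versus a single pooled Gaussian) are illustrative assumptions:

```python
import numpy as np

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def oei(prior_pdf, posteriors, sizes, grid):
    """OEI = sum_s w_s * OVL(prior, posterior_s), with weights
    w_s = n_s / sum_t n_t given by the external sample sizes."""
    dx = grid[1] - grid[0]
    w = np.asarray(sizes, float) / np.sum(sizes)
    p = prior_pdf(grid)
    return float(sum(
        ws * np.minimum(p, norm_pdf(grid, mu, sd)).sum() * dx
        for ws, (mu, sd) in zip(w, posteriors)
    ))

# Hypothetical two-cluster posteriors (mean, sd) and their sample sizes.
posteriors = [(0.18, 0.05), (0.20, 0.05), (0.22, 0.05),
              (0.58, 0.05), (0.60, 0.05), (0.62, 0.05)]
sizes = [50, 55, 60, 50, 55, 60]
grid = np.linspace(-0.5, 1.3, 3601)

mixture = lambda t: 0.5 * norm_pdf(t, 0.2, 0.06) + 0.5 * norm_pdf(t, 0.6, 0.06)
pooled = lambda t: norm_pdf(t, 0.4, 0.21)

oei_mixture = oei(mixture, posteriors, sizes, grid)
oei_pooled = oei(pooled, posteriors, sizes, grid)
```

As expected, the cluster-aware mixture prior attains a higher OEI than the pooled prior, reflecting stronger congruence with the heterogeneous external data.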
Bayesian clustering prior
The formulation of the Bayesian clustering prior
For a fixed , let us assume the optimal partition is denoted as follows:
where is the number of random variables in cluster , .
Based on , the Bayesian clustering prior with clusters can be constructed as a weighted sum of informative priors synthesized from clusters, , .
where is the number of observations in cluster , and is the size of all external data. Moreover, the prior in equation (7) can be made more robust by adding a weighted weakly informative prior. We refer to this as the robust Bayesian clustering prior:
where is a weakly informative prior, is the weight of .
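The mixture construction in equations (7) and (8) can be sketched as a density builder. The per-cluster priors, cluster sizes, weakly informative component, and the robustness weight of 0.5 are all illustrative assumptions:

```python
import numpy as np

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def clustering_prior(cluster_pdfs, cluster_sizes, weak_pdf=None, w_robust=0.0):
    """Bayesian clustering prior: a mixture of per-cluster informative
    priors weighted by each cluster's share of external observations.
    With weak_pdf and w_robust > 0, a weakly informative component is
    mixed in, giving the robust Bayesian clustering prior."""
    sizes = np.asarray(cluster_sizes, float)
    wk = sizes / sizes.sum()
    def pdf(x):
        dens = sum(w * p(x) for w, p in zip(wk, cluster_pdfs))
        if weak_pdf is not None and w_robust > 0.0:
            dens = (1.0 - w_robust) * dens + w_robust * weak_pdf(x)
        return dens
    return pdf

# Two hypothetical cluster priors, with 300 and 360 external observations.
components = [lambda t: norm_pdf(t, 0.2, 0.06), lambda t: norm_pdf(t, 0.6, 0.06)]
bc_prior = clustering_prior(components, [300, 360])
rbc_prior = clustering_prior(components, [300, 360],
                             weak_pdf=lambda t: norm_pdf(t, 0.4, 1.0),
                             w_robust=0.5)

grid = np.linspace(-6.0, 7.0, 13001)
dx = grid[1] - grid[0]
```

Because each component is a proper density and the weights sum to one, both mixtures integrate to one.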
In equations (7) and (8), can be estimated through various methods, such as traditional Bayesian hierarchical models (BHM), MAP, or power priors. In this article, we choose MAP and robust MAP (rMAP) priors because they can be used in both trial design and data analysis stages. We refer to the resulting priors as Bayesian clustering MAP (BCMAP) and robust Bayesian clustering MAP (rBCMAP), respectively. Weber et al.11 developed the R package “RBesT” to implement the sampling of MAP and rMAP through Markov chain Monte Carlo (MCMC). To represent the MAP prior in parametric form, an expectation maximization (EM) algorithm is conducted to approximate the MCMC samples with a parametric mixture distribution. Since BCMAP (rBCMAP) is a weighted sum of MAPs, it can naturally be represented by a parametric mixture distribution as well. Thus, when conjugate MAP priors exist, a mixture of conjugate BCMAP priors can be applied (see the real data example in Section 6.1.1).
The BCMAP prior can be constructed based on for each . Then, we can obtain a sequence of , , which monotonically increases as increases, that is . The proof of monotonicity can be found in Theorem 1 in the Appendix. We can scale the sequence by the maximum and refer to , , as . The sequence of , , denotes the percentage of evidence congruence under each relative to the extreme case where each distribution is its own cluster. A threshold balancing the trade-off of maximizing evidence congruence and minimizing the number of clusters for robustness can be used to determine the optimal . Then, is the finalized optimal Bayesian clustering prior (see an example in Section 4.3 below). The main advantages of this OEI-based threshold approach are two-fold: (i) it is straightforward and easy to interpret, as the threshold reflects a balance between the congruence of the synthesized prior with the external data and its robustness to new data, and (ii) a fixed threshold offers a consistent and unified interpretation across various applications. This contrasts with Bayesian nonparametric clustering approaches, where the values of the hyperparameters used to determine the number of clusters are often challenging to select and interpret, and may lack consistency across different applications.
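The threshold rule above amounts to scaling the monotone OEI sequence by its maximum and taking the smallest number of clusters that reaches the threshold. A sketch, with a hypothetical OEI sequence:

```python
def select_num_clusters(oei_seq, threshold=0.6):
    """Scaled-OEI (SOEI) rule: divide the monotone OEI sequence by its
    maximum and return the smallest K whose SOEI reaches the threshold,
    balancing evidence congruence against robustness."""
    soei = [v / oei_seq[-1] for v in oei_seq]
    for K, s in enumerate(soei, start=1):
        if s >= threshold:
            return K, soei
    return len(oei_seq), soei

# Hypothetical OEI values for K = 1, 2, 3, 4 (monotone increasing in K).
K_opt, soei = select_num_clusters([0.35, 0.62, 0.68, 0.70], threshold=0.6)
```

Here the scaled sequence is (0.50, 0.89, 0.97, 1.00), so a threshold of 0.6 selects K = 2: the smallest clustering that already captures most of the attainable evidence congruence.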
Sensitivity analysis and practical guidelines
To provide guidelines for selecting an appropriate threshold, it is essential to understand how the threshold affects clustering performance. We answer this question through a simulation study. The simulation setup consists of two clusters representing low and high parameter values, each containing five subgroups. Specifically, the data are generated as follows: and for . We fixed and varied to create increasing levels of separation between the two clusters, . To examine the effects of the threshold under different sample sizes, three settings for and are considered, where values are drawn from , , and , respectively. We evaluate five threshold values: . For each configuration (i.e., a combination of , sample size, and threshold), we simulate 1000 datasets and compare the average clustering accuracy rates.
The clustering accuracy rate is defined as follows. Suppose there are objects partitioned into clusters. For each pair of objects, we assign a label of 1 if they belong to the same cluster and 0 otherwise, resulting in a total of labels for any given clustering. We then compare the labels from the estimated clustering (based on a given threshold) to those from the true clustering. Let denote the number of matching labels between the two. The accuracy rate is then calculated as .
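The pairwise accuracy rate described above (equivalently, the Rand index) can be computed as follows; the two example labelings are hypothetical:

```python
from itertools import combinations

def pairwise_accuracy(est, true):
    """Accuracy over all B(B-1)/2 object pairs: a pair counts as matched
    when the estimated and true clusterings agree on whether its two
    objects share a cluster."""
    pairs = list(combinations(range(len(true)), 2))
    matched = sum(
        (est[i] == est[j]) == (true[i] == true[j]) for i, j in pairs
    )
    return matched / len(pairs)

# Hypothetical labelings of five objects: 6 of the 10 pairs agree.
acc = pairwise_accuracy([0, 0, 1, 1, 1], [0, 0, 0, 1, 1])
```

Note that only the pair structure matters, not the cluster labels themselves: relabeling the clusters of either partition leaves the accuracy unchanged.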
Several insights emerge from the simulation results in Figure 2. (a) Accuracy improves as the sample size increases. (b) Low thresholds (0.5, 0.6) tend to reduce accuracy when the clusters are close together and/or the sample size is small. (c) High thresholds (0.9) decrease accuracy when the clusters are well separated. Based on these insights, the general guidelines for selecting the threshold are:
Choose a threshold in the range ;
Use values closer to 0.7 when the sample size of each study, or the average sample size, is small () or when the gap between clusters is small ();
Use values closer to 0.6 when the sample size of each study, or the average sample size, is large () or when the gap between clusters is large ().
Sensitivity analysis of the scaled OEI (SOEI) threshold.
An illustrative example
The following example provides an intuitive understanding of two questions: how is BCMAP constructed, and how and why does it work? In this example, 12 external datasets were generated using model (9). A summary of these datasets is provided in Table 6 in the Appendix. The heterogeneity among these datasets is easy to see: they form two clusters centered at 0.2 and 0.6.
The posteriors , , are shown in Panel (a) of Figure 3. The weakly informative prior is constructed as: , where and . Given that the average size of the external data is 55, we set the threshold at 0.6 in accordance with the proposed guidelines. Thus, the optimal number of clusters is identified as , as shown in Panel (b). The BCMAP prior is presented in Panel (c), where the corresponding MAP prior is also provided for comparison. The BCMAP prior is congruent with the external data, that is, the RWD. Therefore, if the new data is also congruent with the RWD, which is reasonable, the estimation of will benefit from information borrowing with the BCMAP prior. To illustrate this, two examples are shown in Panels (d-1) and (d-2), with the new datasets (black dots) generated from . (black vertical line) is generated following the defined in (9), maintaining congruence between the new data and the external data. We compare the performance of BCMAP and MAP in estimating . In Panel (d-1), is near the center at 0.2. The posterior with the BCMAP prior outperforms that with the MAP prior in estimating both location (bias) and scale (variance). In Panel (d-2), is near the center at 0.6. The BCMAP prior again outperforms the MAP prior in both location and scale estimation. However, when the new data is incongruent with the external data in a certain way, BCMAP may perform worse than MAP, as illustrated in Panel (d-3). Notably, this does not imply that BCMAP consistently underperforms in all incongruent scenarios. For example, in Panel (d-4), BCMAP still outperforms MAP despite the presence of incongruence. In Panel (d-5), although MAP provides a more accurate estimate of the location, BCMAP exhibits a smaller variance.
The example of BCMAP and comparisons with MAP. Panel (a) shows the posteriors , , with a weakly informative prior . Panel (b) presents the identification of the optimal number of clusters . Panel (c) exhibits the clustering result and the corresponding BCMAP prior, where the MAP prior is also provided for comparison. Panels (d-1) through (d-5) show the posteriors obtained by using new data with varying degrees of congruence and incongruence with the external data. Panels (d-1) and (d-2) demonstrate cases where the new data is congruent with the external data. Panels (d-3), (d-4), and (d-5) demonstrate cases where the new data is incongruent with the external data.
The example highlights the strengths and limitations of the proposed Bayesian clustering prior. Specifically, the superiority of this approach depends on two key conditions, which are commonly met in practice.
Heterogeneity among external datasets: The Bayesian clustering MAP (BCMAP) prior is particularly advantageous when there is heterogeneity among the external datasets. In the absence of heterogeneity, BCMAP typically identifies a single cluster, causing the BCMAP prior to degenerate to the MAP prior (see Figure 8 in Appendix).
Congruence between new data and external data: For the BCMAP to perform well, the new data must be congruent with the external data. Specifically, the data generating process of should align with that of the external datasets, even if it is not identical. For example, may follow a distribution that corresponds to one of the components in the mixture distribution of the external data. If this condition is not met, the BCMAP may not work well. In fact, none of the priors will work well if there are severe prior-data conflicts. In this case, the rBCMAP can provide a partial remedy by incorporating robustness into the prior construction (see Section 5.1).
Simulation studies
In this section, we conduct comprehensive simulation studies to demonstrate the advantages of BCMAP (rBCMAP) in both parameter estimation and hypothesis testing compared to commonly used priors, such as MAP, rMAP, NPP (Normalized Power Prior), and MEM. Note: for both rBCMAP and rMAP, the weight for robustness (weight of weakly informative component) is set to 0.5.
Parameter estimation
The simulation study for parameter estimation follows the prior comparison framework described in Section 4.3. We examine external data under three different heterogeneity scenarios: one, two, and three clusters. For each scenario, we conduct 1000 simulation runs. In each run, we generate new data samples and compute the posterior distributions using various priors.
In the two-cluster scenario, we use the 12 external datasets described in Section 4.3, whose posteriors are displayed in Panel (a) of Figure 3. The new data consists of 10 observations generated from a normal distribution , where is drawn from the model (9). The posteriors are then calculated based on using different priors. An intuitive comparison can be conducted by considering the posterior estimates of the location (bias) and scale (standard deviation) of . Let us examine three examples shown in Panels (a), (b), and (c) of Figure 4. In Panels (a) and (b), is close to the centers 0.2 or 0.6, indicating congruence of the new data with the external data. BCMAP and rBCMAP outperform the other methods, with less bias and lower variance of the estimate . Panel (c), however, exhibits the opposite scenario, where lies midway between the two centers, indicating incongruence between the new data and the external data. This results in worse performance of BCMAP compared to the other methods. However, scenarios such as that in Panel (c) are rare under the assumption that the new data is congruent with the external data, i.e., the RWD. A valuable observation in Panel (c) is that rBCMAP can enhance the robustness of BCMAP when the assumption of congruence is violated. The results from 1000 simulation runs under both congruent and incongruent scenarios are presented in Panels (d-1) through (e-2). Panels (d-1) and (d-2) correspond to the congruent scenario. Panel (d-1) shows the empirical distribution of the absolute bias, , where is the mode of the posterior. In this setting, BCMAP and rBCMAP exhibit lower bias than the other methods. Panel (d-2) displays the empirical distribution of the standard deviation, . Excluding NPP, BCMAP and rBCMAP perform better than all other methods. Although NPP has the lowest standard deviation, it has the highest bias in Panel (d-1).
In the incongruent scenarios shown in Panels (e-1) and (e-2), BCMAP exhibits the highest bias among all methods. However, with a robustness weight of 0.5, rBCMAP effectively reduces this bias, consistent with the observation in Panel (c).
Estimation of new from posteriors computed with different priors. In Panels (a) and (b), is congruent with the external data, located near the cluster centers at 0.2 and 0.6, respectively. In contrast, Panel (c) illustrates an incongruent scenario, where lies between the two cluster centers. Panels (d-1) and (d-2) present the bias and standard deviation of the estimator under the congruent cases. Panels (e-1) and (e-2) show the corresponding results for the incongruent case, where is positioned between the two centers. All results in (d-1) through (e-2) are based on 1000 simulations, each with 10 observations of .
The summary of the estimation results for both the congruent and incongruent scenarios is presented in Table 1. Regarding the root mean square error (RMSE) criterion, both BCMAP and rBCMAP show smaller RMSEs (0.077 and 0.076) than MAP and rMAP (0.088 and 0.092) in the congruent scenario, aligning with the results shown in Panels (d-1) and (d-2) of Figure 4. Sometimes the influence of outliers, as highlighted in Panel (d-1) of Figure 4 (indicated by the blue dashed rectangle), may inflate the RMSE (see Tables 7 and 8 in the Appendix). In these cases, we can remove the top 5% or 10% of the data (outliers) to reduce the inflation. Regarding the coverage rate of the 95% highest (posterior) density interval (HDI), in the congruent scenario both BCMAP and rBCMAP outperform all other methods. In the incongruent scenario, however, the HDI coverage of BCMAP drops significantly, falling below that of the other methods. Nevertheless, rBCMAP provides a substantial remedy for this issue.
Comparison of RMSE and the HDI coverage rate under the congruent and incongruent scenarios shown in Panels (d-1), (d-2) and (e-1), (e-2) of Figure 4. “Remove top 5%” means the RMSE is calculated after removing the top 5% of the data (outliers). “95% HDI CR” is the 95% highest (posterior) density interval (HDI) coverage rate for the true parameter.
Scenarios      Criteria     Data           BCMAP   MAP     rBCMAP  rMAP    NPP     MEM
Congruence     RMSE         All            0.077   0.088   0.076   0.092   0.156   0.105
                            Remove top 5%  0.047   0.073   0.052   0.073   0.145   0.082
               95% HDI CR   All            0.972   0.955   0.972   0.963   0.807   0.953
Incongruence   RMSE         All            0.124   0.071   0.107   0.089   0.063   0.136
                            Remove top 5%  0.115   0.059   0.098   0.067   0.053   0.104
               95% HDI CR   All            0.906   0.953   0.938   0.946   0.995   0.928
The simulation presented in Figure 4 focuses solely on the two-cluster heterogeneity scenario for the external datasets and fixes the new data at 10 observations. To enable a more comprehensive comparison, we extend the study in two directions: (1) in addition to the two-cluster scenario, we examine cases with one and three clusters; (2) we evaluate the effect of varying the size of , considering sizes of 5, 10, 15, 20, 25, and 30. In the one-cluster scenario, 10 external datasets are generated from model (10), representing homogeneity among the datasets.
The three-cluster scenario includes 25 external datasets generated from model (11), reflecting a heterogeneity structure different from that of the two-cluster scenario.
A summary of the three scenarios, including the number of clusters, number of datasets, and sample sizes within each dataset, is provided in Table 9 in the Appendix. The posteriors (with a weakly informative prior), clustering results, and synthesized BCMAP and rBCMAP priors for the one-cluster and three-cluster scenarios are presented in Figures 8 and 9, respectively, in the Appendix.
The simulation results are presented in Figure 5. The panels in the left and right columns illustrate the comparison of bias and variance, respectively. In the one-cluster scenario, the MAP and rMAP priors are identical to the BCMAP and rBCMAP priors, respectively, as shown in Panels (a-1) and (a-2). In the two- and three-cluster scenarios, when the new data is congruent with the external data, the BCMAP and rBCMAP priors outperform the other methods. Notably, the NPP exhibits poor performance, characterized by the smallest variance but the largest bias. An interesting observation is that as the size of the new data increases, all methods tend to converge to similar results because the new data begins to dominate the prior. The summary (RMSE and 95% HDI coverage rate) for the two- and three-cluster scenarios (excluding the one-cluster scenario, as BCMAP and rBCMAP are equivalent to MAP and rMAP in this case) is presented in Tables 7 and 8, respectively.
Comparison of parameter estimation under different heterogeneous scenarios with various sizes of new observations. Panels in the left and right columns present the bias and standard deviation of the estimate, respectively. Panels in the top row show results under the homogeneous scenario. Panels in the middle row present results under the two-cluster scenario, where the true parameter of the new data is near the cluster centers 0.2 or 0.6. Panels in the bottom row exhibit results under the three-cluster scenario, where the true parameter of the new data is near the cluster centers 0.2, 0.6, or 1.
In summary, if the new data is congruent with external data that exhibits heterogeneity, the BCMAP and rBCMAP priors provide more accurate parameter estimation, with lower bias and variance, than commonly used methods. When the new data is incongruent with the external data, both BCMAP and rBCMAP may perform worse than alternative methods. However, rBCMAP can provide a partial remedy, as demonstrated in Panel (e-1) of Figure 4 and Tables 1, 7, and 8.
Hypothesis testing
In this section, we compare the performance of different priors in hypothesis testing. All priors are constructed from the external datasets used in the illustrative example (see Panel (a) of Figure 3) in Section 4.3. We design a prospective trial with two groups: control and treatment. Observations are generated from and , where and . Since the possible values of are centered at 0.2 and 0.6 by the model (9), we consider two scenarios: and . Correspondingly, we are interested in two hypothesis tests:
.
In the simulation, we investigate both frequentist and Bayesian methods. The frequentist method is a one-sided two-sample t-test, for which we consider two control vs. treatment recruitment ratios. For the Bayesian methods, to study the effect of information borrowing from synthesized priors, we use the same data with the smaller control vs. treatment ratio and evaluate the gain in power from incorporating external control data. Corresponding to the two hypothesis tests, the decision rule (reject the null) and operating characteristics for the Bayesian methods are defined as follows:
Decision rule:
;
Type 1 error rate:
;
Power:
.
Decision rule:
;
Type 1 error rate:
;
Power:
,
where is an adjustable threshold used to control the type 1 error rate under 5%. In the calculation of and , we need to find the joint posterior , where and . Since and are independent, the joint posterior can be expressed as follows:
where is a weakly informative prior indicating trivial prior knowledge about the treatment group, and denotes the synthesized priors from the 12 external datasets.
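Because the control and treatment posteriors are independent, the posterior probability in the decision rule can be estimated by pairing independent Monte Carlo draws from the two marginals. The sketch below assumes Beta posteriors with hypothetical response counts purely for illustration; in the paper, the control posterior would instead be built from the synthesized prior and the treatment posterior from a weakly informative prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Beta posterior samples for the control and treatment
# response probabilities (counts below are illustrative, not the paper's).
theta_c = rng.beta(6 + 1, 24 + 1, size=100_000)  # e.g. 6/30 control events
theta_t = rng.beta(3 + 1, 27 + 1, size=100_000)  # e.g. 3/30 treatment events

# Independence lets us estimate the joint posterior probability by pairing
# independent draws from the two marginals.
prob = float(np.mean(theta_c > theta_t))
reject = prob > 0.95  # threshold tuned so the type 1 error stays under 5%
```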
The simulation results are shown in Table 2. For the frequentist method, both hypothesis tests (1) and (2) show a dramatic reduction in power as the size of the control group decreases from 30 to 10. (Note: the frequentist method does not involve clustering or information borrowing.) Regarding the Bayesian methods, except for NPP in hypothesis test (1), information borrowing from synthesized priors improves the test power compared to the frequentist 10:30 trial. However, the simulation results also illustrate that the quality of the prior plays a critical role. NPP performs best in test (2) but worst in test (1). MEM and rMAP perform similarly, offering only modest gains in power. MAP provides a greater improvement but remains less effective than rBCMAP and BCMAP. Overall, BCMAP achieves the highest performance, increasing power by more than 35%, and is comparable to the frequentist 30:30 trial. In sum, Bayesian clustering priors enable the incorporation of accurate information from heterogeneous external data. When the new data is generally congruent with the external data, borrowing information from Bayesian clustering priors can effectively improve the operating characteristics of hypothesis testing.
Comparison of operating characteristics in hypothesis testing. Two hypothesis tests: (1) and (2). For the Bayesian methods, the control vs. treatment ratio is 10:30, with external control data incorporated in the construction of the prior distribution.
| Methods | Test (1) Type 1 Error | Test (1) Power | Test (2) Type 1 Error | Test (2) Power |
|---|---|---|---|---|
| **Control : Treatment 30 : 30** | | | | |
| Frequentist | 0.044 | 0.810 | 0.055 | 0.817 |
| **Control : Treatment 10 : 30** | | | | |
| Frequentist | 0.055 | 0.567 | 0.053 | 0.550 |
| **Bayesian** | | | | |
| MEM | 0.055 | 0.610 | 0.054 | 0.580 |
| NPP | 0.004 | 0.439 | 0.008 | 0.835 |
| MAP | 0.049 | 0.663 | 0.051 | 0.693 |
| rMAP | 0.053 | 0.568 | 0.051 | 0.608 |
| BCMAP | 0.050 | 0.783 | 0.051 | 0.773 |
| rBCMAP | 0.048 | 0.668 | 0.051 | 0.727 |
Studies on acupoint P6 stimulation for preventing nausea. The variables are: id: trial id number; study (author): first author of the study; year: study year; ai: number of patients experiencing nausea in the treatment (wrist acupuncture point P6) group; n1i: total number of patients in the treatment group; in cluster: assignment to the corresponding cluster.
| id | study (author) | year | ai | n1i | in cluster |
|---|---|---|---|---|---|
| **External/historical data** | | | | | |
| 1 | Dundee | 1986 | 3 | 25 | 2 |
| 2 | Gieron | 1993 | 11 | 30 | 3 |
| 3 | Allen | 1994 | 9 | 23 | 3 |
| 4 | Andrzejowski | 1996 | 11 | 18 | 3 |
| 5 | Ferrera-Love | 1996 | 1 | 30 | 1 |
| 6 | Ho | 1996 | 1 | 30 | 1 |
| 7 | Duggal | 1998 | 69 | 122 | 3 |
| 8 | Alkaissi | 1999 | 9 | 20 | 3 |
| 9 | Harmon | 1999 | 7 | 44 | 2 |
| 10 | Agarwal | 2000 | 18 | 100 | 2 |
| 11 | Harmon | 2000 | 4 | 47 | 1 |
| 12 | Zarate | 2001 | 28 | 110 | 2 |
| **Current data** | | | | | |
| 13 | Agarwal | 2002 | 5 | 50 | – |
| 14 | Alkaissi | 2002 | 32 | 135 | – |
| 15 | Rusy | 2002 | 24 | 40 | – |
| 16 | Wang | 2002 | 16 | 50 | – |
Real data examples
Acupuncture trials
Postoperative nausea and vomiting are common complications following surgery and anesthesia. As an alternative to drug therapy, acupuncture has been studied as a potential treatment in several trials.20 The dataset "dat.lee2004" in the R package "metadat"21 contains the results from 16 clinical trials examining the effectiveness of wrist acupuncture point P6 treatment for preventing postoperative nausea. Patient-level (covariate) information is not available. A detailed description of the dataset is provided in Table 3. The columns "ai" and "n1i" give, respectively, the number of patients who experienced postoperative nausea and the total number of patients in each study. The data are modeled using a Binomial distribution, where the parameter represents the probability of a patient experiencing nausea. Four studies (id 13, 14, 15, and 16) were conducted in 2002. To illustrate the application of the proposed prior in data analysis, we treat each of these four datasets as the current data and the other 12 studies conducted before 2002 as external (historical) data.
Studies on the effectiveness of wrist acupuncture point P6 treatment for preventing postoperative nausea. Panel (a): the optimal number of clusters. Panel (b): the corresponding BCMAP, rBCMAP, MAP, and rMAP priors. Panels (c-1) to (c-4): the posteriors under the different priors and the current data.
Prior for study design
In trial design, our goal is to construct an informative prior from the external data to provide useful prior information for the new trial. To achieve this goal, we begin by examining the posteriors of the external data using a vague conjugate prior; the posterior distributions follow directly by conjugacy, as shown in Panel (b) of Figure 6. The results reveal substantial heterogeneity across the studies, making this dataset well suited for the Bayesian clustering prior. Next, we determine the optimal number of clusters. Given that the average sample size is approximately 50, we set the threshold at 0.6 in accordance with the guidelines outlined in Section 4.2. The optimal number of clusters is identified as 3, as shown in Panel (a). The corresponding BCMAP and rBCMAP priors are presented in Panel (b), alongside the MAP and rMAP priors for comparison. Their parameterized forms are listed below:
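As a sketch of this conjugate update, the per-study posteriors for the 12 external datasets in Table 3 can be computed in closed form. A uniform Beta(1, 1) vague prior is assumed here for illustration; the paper's exact hyperparameters are not reproduced:

```python
# (y_i, n_i): nausea counts and sample sizes for the 12 external studies (Table 3)
external = [(3, 25), (11, 30), (9, 23), (11, 18), (1, 30), (1, 30),
            (69, 122), (9, 20), (7, 44), (18, 100), (4, 47), (28, 110)]

# Vague conjugate prior Beta(a0, b0); a0 = b0 = 1 (uniform) is an assumption.
a0, b0 = 1.0, 1.0

# Conjugacy gives p_i | y_i ~ Beta(a0 + y_i, b0 + n_i - y_i)
posteriors = [(a0 + y, b0 + n - y) for y, n in external]
means = [a / (a + b) for a, b in posteriors]  # posterior mean of Beta(a, b)
```

Plotting these 12 Beta densities side by side reproduces the kind of spread seen in Panel (b), which motivates clustering before synthesizing a single prior.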
Based on equations (13) and (15), compared to MAP prior, the BCMAP prior has the advantage that its effective sample size (ESS) can be estimated in a finer way. From equation (13), we know that the ESS of over the entire region [0,1] is . Let us check the ESS of from equation (15) and Panel (b): in region 1, ; in region 2, ; in region 3, . This offers valuable prior information for the design of future trials. For example, acupuncture treatment appears more likely to reduce the probability of postoperative nausea to below 30%, with a concentration in the range of 15% to 30%. For the rMAP and rBCMAP priors, the effect of robustification results in an ESS of .
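As a rough illustration of this finer ESS accounting, recall that a conjugate Beta(a, b) prior carries an ESS of a + b; for a mixture prior, a simple region-wise summary weights each component's ESS by its mixture weight. The weights and parameters below are hypothetical, not the fitted values from equations (13) and (15):

```python
# (w_k, a_k, b_k): hypothetical mixture weights and Beta parameters
components = [(0.40, 8.0, 32.0), (0.35, 12.0, 18.0), (0.25, 10.0, 10.0)]

# ESS of each Beta component is a_k + b_k; a crude overall summary is the
# weight-averaged component ESS.
ess_by_component = [(w, a + b) for w, a, b in components]
ess_overall = sum(w * (a + b) for w, a, b in components)
```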
Prior for data analysis
Applying the constructed priors to the current datasets (id = 13, 14, 15, and 16), we can obtain the posteriors of as shown in Panels (c-1) to (c-4). Overall, the resulting posteriors are not substantially affected by the choice of prior, likely because the relatively large sample sizes in the current datasets dominate the posterior inference. However, in Panel (c-4), the BCMAP prior noticeably shifts the posterior toward the middle cluster (around 0.2), while the rBCMAP prior slightly counteracts this effect, drawing the posterior closer to those obtained under the MAP and rMAP priors.
Potassium supplementation to reduce diastolic blood pressure
In this section, we consider continuous endpoint data with the normal distribution. We use the dataset “dat.curtin2002” from the R package “metadat”, which includes 21 cross-over studies evaluating the effect of potassium supplementation on reducing diastolic blood pressure. This dataset does not contain patient-level information. We perform a preliminary data cleaning step by removing two outlier studies with mean values of and . The resulting cleaned dataset is provided in Table 4.
Studies on potassium supplementation to reduce diastolic blood pressure. The variables are: id: trial id number. study (author): first author of the study. year: study year. N: number of patients in each study. mean: the mean of each study. SE: the standard error of each study. in cluster: partition into the corresponding cluster.
| id | study (author) | year | N | mean | SE | in cluster |
|---|---|---|---|---|---|---|
| **External/historical data** | | | | | | |
| 1 | Skrabal | 1981 | 20 | –4.5 | 2.1 | 2 |
| 2 | Skrabal | 1981 | 20 | –0.5 | 1.7 | 4 |
| 3 | MacGregor | 1982 | 23 | –4.0 | 1.9 | 2 |
| 4 | Khaw | 1982 | 20 | –2.4 | 1.1 | 3 |
| 5 | Richards | 1984 | 12 | –1.0 | 3.4 | 4 |
| 6 | Smith | 1985 | 20 | 0.0 | 1.9 | 4 |
| 7 | Kaplan | 1985 | 16 | –5.8 | 1.6 | 1 |
| 8 | Zoccali | 1985 | 23 | –3.0 | 3.0 | 3 |
| 9 | Matlou | 1986 | 36 | –3.0 | 1.5 | 3 |
| 10 | Barden | 1986 | 44 | –1.5 | 1.4 | 4 |
| 11 | Poulter a | 1986 | 19 | 2.0 | 2.2 | 5 |
| 12 | Grobbee | 1987 | 40 | –0.3 | 1.5 | 4 |
| 13 | Mullen | 1990 | 24 | 3.0 | 2.0 | 6 |
| 14 | Mullen | 1990 | 24 | 1.4 | 2.0 | 5 |
| 15 | Valdes | 1991 | 24 | –3.0 | 2.0 | 3 |
| 16 | Barden | 1991 | 39 | –0.6 | 0.6 | 4 |
| 17 | Overlack | 1991 | 12 | 3.0 | 2.0 | 6 |
| **Current data** | | | | | | |
| 18 | Smith | 1992 | 22 | –1.7 | 2.5 | – |
| 19 | Fotherby | 1992 | 18 | –6.0 | 2.5 | – |
Similar to the acupuncture example, to simulate data analysis we select the two studies from 1992 (id = 18 and 19) as the current data and those conducted before 1992 as external data (17 external datasets). Given a conjugate normal prior, the corresponding posteriors can be computed as follows:
where the three data quantities denote the sample size, sample mean, and sample variance of each dataset, respectively. We set the prior mean and variance so as to specify a weakly informative prior. The resulting posterior distributions are shown in Panel (b) of Figure 7.
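A minimal sketch of this normal conjugate update follows, treating the sampling variance as known and using hypothetical weakly informative hyperparameters (prior mean 0, prior variance 100); since Table 4 reports the standard error SE = s / sqrt(N), the sample variance is recovered as s^2 = N * SE^2:

```python
def normal_posterior(xbar, n, s2, mu0, tau2):
    """Conjugate normal update for a mean, with the sampling variance s2
    treated as known (a common approximation for moderate n).
    Prior: N(mu0, tau2). Returns the posterior mean and variance."""
    prec = 1.0 / tau2 + n / s2          # posterior precision
    var = 1.0 / prec
    mean = var * (mu0 / tau2 + n * xbar / s2)
    return mean, var

# Study id 18 in Table 4: N = 22, mean = -1.7, SE = 2.5, so s2 = N * SE^2
n, xbar, se = 22, -1.7, 2.5
mu_post, var_post = normal_posterior(xbar, n, n * se ** 2, mu0=0.0, tau2=100.0)
```

The posterior mean is shrunk from the sample mean toward the prior mean, with the amount of shrinkage governed by the relative precisions.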
Studies on potassium supplementation to reduce diastolic blood pressure. Panel (a): the optimal number of clusters. Panel (b): the corresponding BCMAP, rBCMAP, MAP, and rMAP priors. Panels (c-1) and (c-2): the posteriors under the different priors and the current data.
Since the posterior distributions are well spread, indicating substantial separation between potential clusters, we select a threshold of 0.6 in accordance with the guidelines in Section 4.2. The optimal number of clusters is identified as 6, as shown in Panel (a) of Figure 7. The corresponding BCMAP and rBCMAP priors are shown in Panel (b), alongside MAP and rMAP for comparison. Compared to MAP and rMAP, the BCMAP and rBCMAP priors exhibit more pronounced concentration around the cluster centers, particularly near –0.7 and –2.8. This prior information suggests that future studies are more likely to observe pressure-reduction values centered around these two points.
The posterior distributions of , based on the constructed priors and current data, are displayed in Panels (c-1) and (c-2). Unlike the acupuncture example, the current data in this case are relatively small, which leads the prior to exert a stronger influence on the posterior. The rBCMAP serves to moderate the impact of the BCMAP, yielding a smoother posterior distribution closer to MAP and rMAP.
Discussion
Clustering plays a crucial role in synthesizing informative priors from heterogeneous multisource external data. Leveraging the concept of the overlapping coefficient, we introduce the OCI and OEI, a K-Means-based algorithm, and an OEI-based criterion to identify the optimal clustering. Based on the optimal clustering, a Bayesian clustering prior can be constructed and applied during both the trial design and data analysis stages. Simulation studies validate its advantages.
Effectively borrowing information from external (historical) data remains an active area of research. The proposed Bayesian clustering prior represents an effort to address the challenges of information borrowing from heterogeneous multisource external data. Several potential research directions of this approach merit further investigation:
First, this study focuses on the case where the parameter is one-dimensional. Since the definition of the overlapping coefficient (OVL) extends naturally to higher dimensions, we believe the proposed method can also be extended to the multi-dimensional case. However, the computational aspects of OVL in higher dimensions, which affect the calculation of OCI and OEI, require more careful consideration.
Second, we do not incorporate covariate information in the current study. However, in the era of precision medicine, covariates are becoming increasingly important. A straightforward way to integrate covariates into the Bayesian clustering prior is through their inclusion in posterior estimation, for instance via the MAP approach. Further research is needed to explore more advanced strategies for incorporating covariates, such as regression-based adjustments or propensity score matching.
Third, this paper adopts a fixed weight of 0.5, representing an equal balance between the informative prior derived from external data and a weakly informative component. While convenient, this choice may not be optimal across all scenarios. Data-driven methods for estimating the weight dynamically could offer greater flexibility and potentially enhance performance.
Finally, beyond the K-Means-based approach used here, alternative clustering techniques, such as Gaussian mixture models, could be employed to simultaneously perform clustering and prior construction. These alternatives deserve further exploration and comparative evaluation.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs
Xuetao Lu
J Jack Lee
Appendix
Description of symbols and abbreviations.

| Symbol / Abbreviation | Description |
|---|---|
| | External datasets. |
| | Posterior distribution of the parameter given an external dataset. |
| | The random variable corresponding to the posterior. |
| | Constructed prior distribution of the parameter given the external data. |
| | The random variable corresponding to the prior. |
| OVL | Overlapping coefficient between two random variables. |
| OCI | Overlapping clustering index of a partition. |
| | K-partition of the external posteriors. |
| | A cluster in the partition. |
| | The centroid random variable of a cluster. |
| | The optimal partition identified by maximizing the OCI. |
| | The optimal partition identified by the K-Means clustering algorithm. |
| OEI | Overlapping evidence index of a prior. |
| BCMAP | The Bayesian clustering prior with clusters given the external data. |
| rBCMAP | The robust Bayesian clustering prior with clusters given the external data. |
| | The sequence of scaled overlapping evidence index values. |
| | The value of the parameter and the corresponding random sample in the future (new) study. |
Monotonicity of
For the BCMAP priors, the sequence of scaled overlapping evidence index values increases monotonically with the number of clusters K.
Without loss of generality, let us compare the partitions with K and K+1 clusters. Correspondingly, we need to consider the difference between their scaled OEI values. For an arbitrary cluster in the K-partition, we split it into two clusters, yielding a new (K+1)-partition. In each cluster, we assume that the random variables are homogeneous and use them to construct a MAP prior, with a corresponding MAP random variable for the original cluster and for each of the two subclusters. Since the homogeneity of each subcluster is greater than that of the original cluster, it follows that:
By equation (6) in the definition of OEI, we know that
By definition of the Overlapping Clustering Index (OCI), denotes the partition that maximizes the overall within-cluster homogeneity among all K+1 partitions (including ). This implies:
Therefore, the monotonicity result holds. Note: this proof applies to the rBCMAP as well.
References
1. US Food and Drug Administration. Use of real-world evidence to support regulatory decision-making for medical devices: guidance for industry and Food and Drug Administration staff. Silver Spring, 2017.
2. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian approaches to clinical trials and health-care evaluation. Statistics in Practice. Chichester: John Wiley & Sons, 2004.
3. Neuenschwander B, Capkun-Niggli G, Branson M, et al. Summarizing historical information on controls in clinical trials. Clinical Trials 2010; 7: 5–18.
4. Schmidli H, Gsteiger S, Roychoudhury S, et al. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics 2014; 70: 1023–1032.
5. Chen MH, Ibrahim JG, Shao QM. Power prior distributions for generalized linear models. J Stat Plan Inference 2000; 84: 121–137.
6. Ibrahim JG, Chen MH, Sinha D. On optimality properties of the power prior. J Am Stat Assoc 2003; 98: 204–213.
7. Hobbs B, Sargent D, Carlin B. Commensurate priors for incorporating historical information in clinical trials using general and generalized linear models. Bayesian Anal 2012; 7: 639–674.
8. Hobbs BP, Carlin BP, Mandrekar SJ, et al. Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics 2011; 67: 1047–1056.
9. Kaizer AM, Koopmeiners JS, Hobbs BP. Bayesian hierarchical modeling based on multisource exchangeability. Biostatistics 2017; 19: 169–184.
10. Jiang L, Nie L, Yuan Y. Elastic priors to dynamically borrow information from historical data in clinical trials. Biometrics 2023; 79: 49–60.
11. Weber S, Li Y, Seaman JW III, et al. Applying meta-analytic-predictive priors with the R Bayesian evidence synthesis tools. J Stat Softw 2021; 100: 1–32.
12. Gershman SJ, Blei DM. A tutorial on Bayesian nonparametric models. J Math Psychol 2012; 56: 1–12.
13. Chen N, Lee JJ. Bayesian cluster hierarchical model for subgroup borrowing in the design and analysis of basket trials with binary endpoints. Stat Methods Med Res 2020; 29: 2717–2732.
14. Lu X, Lee JJ. Overlapping indices for dynamic information borrowing in Bayesian hierarchical modeling. 2023, https://arxiv.org/abs/2305.17515v1.
15. Schmid F, Schmidt A. Nonparametric estimation of the coefficient of overlapping: theory and empirical application. Comput Stat Data Anal 2006; 50: 1583–1596.
16. Weitzman MS. Measures of overlap of income distributions of white and negro families in the United States. Technical Report 22, US Department of Commerce, 1970.
17. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 2002; 97: 611–631.
18. Teh YW, Jordan MI, Beal MJ, et al. Hierarchical Dirichlet processes. J Am Stat Assoc 2006; 101: 1566–1581.
19. Carlin BP, Louis TA. Empirical Bayes: past, present and future. J Am Stat Assoc 2000; 95: 1286–1289.
20. Lee A, Done M. Stimulation of the wrist acupuncture point P6 for preventing postoperative nausea and vomiting. Cochrane Database Syst Rev 2004. DOI: 10.1002/14651858.CD003281.pub2.