Abstract

Determining sample size is crucial in research study design. The hierarchical structure of the data in cluster-randomized trials (CRTs) complicates this process, thereby necessitating the determination of the sample size at each level. Most methods for these trials are based on null hypothesis significance testing, which has numerous pitfalls. Using the Bayes factor may avoid these drawbacks, but existing methods are limited to trials without a multilevel structure. This study presents a method to determine the sample size for a one-period two-treatment parallel CRT using the Bayes factor. We introduce the implementation of this method in an R package. Simulation results show that the required sample size increases with decreasing effect sizes and with increasing intraclass correlation and Bayes factors.

Keywords

Bayes factor, sample size determination, cluster-randomized trials, sample size, multilevel model

In the initial stages of the design of a research study, a key step is the determination of the sample size. Neglecting this key step may result in underpowered studies due to insufficient sample size, thereby potentially diminishing the ability to detect clinically relevant effects and leading to unethical use of participants. Furthermore, the publication of underpowered studies aggravates the crisis of replicability of research findings, as the replicability of a study is related to the statistical power of its design (Oakes, 1987). Determining sample size also prevents the use of more subjects than necessary, thereby reducing waste of resources and unethical participant recruitment.

Numerous elements come into play in determining the required sample size, with variations depending on the selected statistical model and the framework employed for hypothesis testing. The complexity of the interaction between the elements is especially intensified when dealing with multilevel models, given the hierarchical structure of the data. An example of multilevel data is found in cluster-randomized trials (CRTs), where complete groups, such as schools or families, are randomly assigned to treatment conditions. This design is widely used in social, behavioral, and biomedical sciences for the evaluation of treatments, programs, or interventions (Campbell & Walters, 2014; Donner & Klar, 2010; Eldridge & Kerry, 2012; Hayes & Moulton, 2009; Murray, 1998). Considering the hierarchical structure of the data, with subjects nested within clusters, the researcher must determine the required sample size at both levels: the cluster size and the number of clusters.

The conventional framework for hypothesis testing is based on null hypothesis significance testing (NHST), which is a combination of the significance testing approach of Fisher and the hypothesis testing approach of Neyman and Pearson (Balluerka et al., 2005). NHST involves the null hypothesis, that is, the absence of an effect, and the alternative hypothesis, that is, the presence of an effect. This approach to hypothesis testing assumes that the null hypothesis is true in the population, and subsequently, the researchers decide to either reject or retain the hypothesis using p-values. In the context of a CRT, previous studies have identified several factors that influence the determination of sample size within this framework, including intraclass correlation, effect size, error rates, cluster size, and number of clusters (Moerbeek et al., 2000; Raudenbush, 1997). Researchers determine the sample size using these elements in equations that illustrate the relation between sample size and power, or available software such as Software for Power Analysis of trials with Multilevel data (SPA-ML) (Moerbeek & Teerenstra, 2016) and the Shiny CRT Calculator (Hemming & Kasza, n.d.).

An alternative approach to hypothesis testing is based on the Bayes factor, which is a quantification of the relative support of the data for one hypothesis over another (Heck et al., 2022; Hoijtink, 2012; Hoijtink, Mulder, et al., 2019; Kass & Raftery, 1995). In recent years, there has been an increase in the use of the Bayes factor in cluster-randomized trials; some examples are Beeres et al. (2022), Conigrave et al. (2024), Hitchcock and Westwell (2017), Rosário et al. (2022), Sanchez et al. (2021), So et al. (2025), Dienes et al. (2018), Troncoso and Humphrey (2021), and Umbach et al. (2018). As a result of the increased use of the Bayes factor, tools implementing hypothesis testing with the Bayes factor have been developed (Bürkner, 2018; Gu et al., 2019; JASP Team, 2025; Makowski et al., 2019; Mulder et al., 2019; Veenman et al., 2024).

In general, Bayesian sample size determination involves an expected behavior of the posterior distribution and a user-specified minimum value; however, various criteria exist in the literature such as the average coverage criterion, average length criterion, average posterior variance criterion, and so on (a comprehensive overview can be found in Gelfand & Wang, 2002; M’Lan et al., 2006; Pezeshk, 2003; Wang & Gelfand, 2002; Weiss, 1997). Methods using the Bayes factor in sample size determination use a threshold for the Bayes factor to control the error rates or the probability of misleading or weak evidence (De Santis, 2004; Weiss, 1997). However, in the context of CRT, the methodology for determining the Bayesian sample size is scarce. Wilson (2022) proposed a method to calculate the total number of participants using Monte Carlo simulations, but this method was proposed for Bayesian parameter estimation instead of hypothesis testing and assumes that the number of clusters is fixed beforehand, whereas in some CRTs, the cluster size is fixed beforehand, so that the number of clusters needs to be determined.

This study aims to present a method to determine the sample size in CRTs using the Bayesian approach for hypothesis testing. The method for sample size determination relies on simulation studies, for which we created functions in R that can either determine the required number of clusters given a fixed cluster size or, vice versa, the cluster size that is required for a fixed number of clusters. The next section introduces the data generation model. Subsequently, we discuss the shortcomings of NHST and the advantages of using the Bayes factor. We then explore the details of the determination of the sample size for CRT, explaining each essential factor and the underlying algorithm. Subsequently, we present the results of a simulation and the sample size required for realistic scenarios. To conclude, we present the limitations of the methodology and offer advice to researchers.

Cluster-Randomized Trials

The data from a CRT have a so-called multilevel structure, with variables measured on individuals at the first level and variables measured on clusters at the second level (Goldstein, 2011; Hox et al., 2017; Lazega & Snijders, 2016; Raudenbush & Bryk, 2010). An example of this design is the study by Ausems et al. (2002), in which the objective was to test the additional effect of an out-of-school smoking prevention intervention. In this study, elementary schools were randomly assigned to four treatment conditions: an in-school smoking prevention program, a computer-based out-of-school smoking prevention program, a combined approach (in-school and out-of-school conditions), and a control condition. The students filled out a questionnaire twice, once before the intervention and once afterward. The researchers expected that students within the same school would mutually influence each other’s smoking behavior; hence, the multilevel model was used to account for dependencies in the outcome variables.

Randomization at the cluster level rather than the individual level results in a decrease in statistical power, given the dependency of outcome measures within the same cluster. In other words, the CRT does not provide the same amount of information as an individually randomized trial with the same number of participants. Despite this drawback, the CRT is widely used for ethical and logistical reasons. One of the main advantages that this design provides is that it helps to avoid or reduce contamination of the control condition that may occur if multiple treatment conditions are available within each cluster, and information leaks from the intervention condition to the control. This leakage may occur when the intervention relies on providing new information, procedures, or guidelines to the participants (Moerbeek, 2005).

In this article, the continuous outcome $Y_{ij}$ for an individual $i = 1, \ldots, n_{1}$ in cluster $j = 1, \ldots, n_{2}$ is a function of the treatment condition:

Y_{ij} = μ_{C} I_{Cj} + μ_{T} I_{Tj} + u_{j} + e_{ij},

(1)

where $μ_{C}$ and $μ_{T}$ represent the mean of the control and treatment conditions, respectively. The indicator variable $I_{Cj}$ takes the value of 1 if cluster $j$ is in the control condition and 0 otherwise; similarly, $I_{Tj}$ indicates whether cluster $j$ is in the treatment condition. In addition, two random terms are included, $u_{j} \sim N (0, σ_{u}^{2})$ at the cluster level and $e_{ij} \sim N (0, σ_{e}^{2})$ at the individual level, which are assumed to be independent of each other. The sum of the between-cluster variance $σ_{u}^{2}$ and the within-cluster variance $σ_{e}^{2}$ results in the total variance, denoted as $σ^{2}$. The two variances also define the intraclass correlation coefficient (ICC), $ρ = σ_{u}^{2} / (σ_{e}^{2} + σ_{u}^{2})$, which is the proportion of total variance attributable to the cluster level.

The standardized treatment effect, denoted as $δ$ and also known as the effect size, is defined as

δ = \frac{{\bar{μ}}_{T} - {\bar{μ}}_{C}}{σ},

(2)

where $σ$ is the standard deviation of the outcome variable. The variance of the treatment effect is expressed as

\frac{4 σ^{2} [1 + (n_{1} - 1) ρ]}{n_{1} n_{2}},

(3)

where $n_{1}$ represents the number of individuals per cluster, and $n_{2}$ represents the total number of clusters.
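Equation 3 can be recovered from the model in Equation 1 in a few steps. The sketch below assumes equal allocation, with $n_{2} / 2$ clusters per condition, and a common cluster size $n_{1}$. The cluster-mean notation $\bar{Y}_{j}$ and the estimator notation $\hat{μ}_{T}$, $\hat{μ}_{C}$ are introduced here only for illustration. It uses $σ_{u}^{2} = ρ σ^{2}$ and $σ_{e}^{2} = (1 - ρ) σ^{2}$, which follow from the definition of the ICC.

```latex
\begin{aligned}
\operatorname{Var}(\bar{Y}_{j}) &= σ_{u}^{2} + \frac{σ_{e}^{2}}{n_{1}}
  = σ^{2} \left[ ρ + \frac{1 - ρ}{n_{1}} \right]
  = \frac{σ^{2} [1 + (n_{1} - 1) ρ]}{n_{1}}, \\
\operatorname{Var}(\hat{μ}_{T}) = \operatorname{Var}(\hat{μ}_{C})
  &= \frac{\operatorname{Var}(\bar{Y}_{j})}{n_{2} / 2}
  = \frac{2 σ^{2} [1 + (n_{1} - 1) ρ]}{n_{1} n_{2}}, \\
\operatorname{Var}(\hat{μ}_{T} - \hat{μ}_{C})
  &= \frac{4 σ^{2} [1 + (n_{1} - 1) ρ]}{n_{1} n_{2}} .
\end{aligned}
```

The last line follows because the two condition means are estimated from independent sets of clusters, so their variances add.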

All of these elements play a crucial role, along with statistical power, in the estimation of the sample size. The statistical power is denoted as $1 - β$ , where $β$ is the probability of committing a Type II error. According to Moerbeek and Teerenstra (2016), in CRTs, the definition of statistical power when the null hypothesis ( $H_{0} : δ = 0$ ) is tested is given by the combination of Equations 2 and 3:

1 - β = \frac{δ}{\sqrt{\frac{4 σ^{2} [1 + (n_{1} - 1) ρ]}{n_{1} n_{2}}}} .

(4)

This equation shows that the power decreases as $ρ$ increases, especially when the common cluster size $n_{1}$ is high. Therefore, the researcher must balance the cluster sizes with the number of clusters to obtain the minimum sample required to detect a specific treatment effect.

One approach to determine the sample size is using the design effect:

DE = 1 + (n_{1} - 1) ρ,

(5)

which considers the effect of clustering. The total number of subjects is calculated based on the sample size obtained for an individually randomized design and then is inflated by the design effect with the fixed cluster size $n_{1}$ (e.g., Campbell & Walters, 2014; Moerbeek & Teerenstra, 2016).
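The design-effect approach amounts to a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the individually randomized sample size passed in is assumed to come from a separate calculation.

```python
import math


def design_effect(n1: int, rho: float) -> float:
    """Design effect of Equation 5 for common cluster size n1 and ICC rho."""
    return 1 + (n1 - 1) * rho


def inflate_total_sample_size(n_individual: int, n1: int, rho: float) -> int:
    """Inflate the sample size of an individually randomized design.

    The product is rounded to 9 decimals before taking the ceiling so that
    floating-point noise does not add a spurious extra subject.
    """
    return math.ceil(round(n_individual * design_effect(n1, rho), 9))


def clusters_from_total(n_total: int, n1: int) -> int:
    """Number of clusters of size n1 needed to hold n_total subjects."""
    return math.ceil(n_total / n1)


# Illustrative values (not taken from the article's simulations).
total = inflate_total_sample_size(n_individual=100, n1=10, rho=0.05)
clusters = clusters_from_total(total, n1=10)
```

In practice the number of clusters would usually be rounded up further to allow equal allocation across conditions.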

An alternative approach to determine the sample size uses the factors that influence power. Moerbeek and Teerenstra (2016) presented formulas describing the relation between statistical power, effect size, Type I error rate, ICC, and sample size. Equation 4 can be rewritten so that the number of clusters becomes a function of cluster size, Type I error rate, ICC, power, and effect size:

n_{2} = 4 \frac{1 + (n_{1} - 1) ρ}{n_{1}} {(\frac{z_{1 - α} + z_{1 - β}}{δ})}^{2},

(6)

where $z$ denotes the percentile from the standard normal distribution and $α$ the significance level. The formula makes it evident that increasing the common cluster size $n_{1}$ leads to a smaller number of clusters, while increasing the intraclass correlation $ρ$ leads to a larger number of clusters. Alternatively, we can also formulate a function for the cluster size:

n_{1} = 4 \frac{1 - ρ}{{(\frac{δ}{z_{1 - α} + z_{1 - β}})}^{2} n_{2} - 4 ρ} .

(7)

Here, it can be seen that, for a small number of clusters, the desired power level may not always be achieved, even when the cluster size increases to infinity (Hemming et al., 2011).
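Equations 6 and 7 can be evaluated directly. The Python sketch below is a minimal illustration with hypothetical function names; the defaults for $α$ and power are assumptions chosen for the example, and the functions return unrounded values. The second function returns `None` when the denominator of Equation 7 is not positive, which is the case where no cluster size achieves the desired power.

```python
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """Return z_{1 - alpha} + z_{1 - beta}, with power = 1 - beta."""
    z = NormalDist().inv_cdf
    return z(1 - alpha) + z(power)


def number_of_clusters(n1, rho, delta, alpha=0.05, power=0.8):
    """Equation 6: total number of clusters for a fixed cluster size n1."""
    return 4 * (1 + (n1 - 1) * rho) / n1 * (_z_sum(alpha, power) / delta) ** 2


def cluster_size(n2, rho, delta, alpha=0.05, power=0.8):
    """Equation 7: cluster size for a fixed number of clusters n2.

    Returns None when the denominator is not positive, i.e., when the
    desired power cannot be reached with n2 clusters of any size.
    """
    denominator = (delta / _z_sum(alpha, power)) ** 2 * n2 - 4 * rho
    if denominator <= 0:
        return None
    return 4 * (1 - rho) / denominator


# Illustrative values: medium effect, ICC of .05, ten subjects per cluster.
n2_needed = number_of_clusters(n1=10, rho=0.05, delta=0.5)
n1_needed = cluster_size(n2=16, rho=0.05, delta=0.5)
too_few_clusters = cluster_size(n2=4, rho=0.05, delta=0.5)
```

Feeding the result of `cluster_size` back into `number_of_clusters` returns the original number of clusters, which is a quick consistency check between the two equations.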

However, it is important to note that the methods for estimating the sample size discussed so far were developed for the NHST framework, which comes with notable limitations. These limitations are explored in greater detail in the next section.

Hypothesis Testing

Limitations and Criticisms of NHST

Despite the widespread use of NHST, criticism of this approach has grown over the past few decades. Hoijtink, Mulder, et al. (2019) provide an extensive account of numerous issues associated with NHST. One criticism in their article is related to the use of the NHST approach in research. The excessive emphasis on p-values has contributed to publication bias, as studies yielding statistically significant results are more likely to be published, leading to the file drawer phenomenon. Furthermore, this emphasis on p-values has led some researchers aiming to advance their careers to engage in questionable research practices, such as p-hacking, hypothesizing after the results are known, and cherry-picking. A second criticism is that the use of a dichotomous decision rule based on $α = 0.05$, or another appropriate value, narrows the focus of the investigation to reporting whether the null hypothesis is rejected.

A third criticism is the question of whether one is even interested in testing the null hypothesis. The null hypothesis indicates that there is zero effect, or in other words, that two groups have exactly the same means on a continuous outcome variable. This hypothesis is operationalized as $H_{0} : μ_{1} = μ_{2}$ , meaning that the mean outcome in condition 1 is equal to the mean outcome in condition 2. However, the likelihood of this scenario in reality is very low, rendering the test practically unnecessary (Gu et al., 2014). A fourth criticism is that the NHST approach is focused on the null hypothesis, and when this hypothesis is rejected, the conclusion remains limited to asserting that the effect is not zero. When comparing more than two treatment means, post hoc tests are required to understand which condition means differ significantly from each other.

Beyond Null Hypothesis Testing

Diverse types of hypotheses can be of interest to researchers; to illustrate some of these types, we consider the study presented in the “Cluster-Randomized Trials” section, where Ausems et al. (2002) collected data on four treatment conditions in a smoking prevention intervention. For the sake of simplicity, in the formulation of the hypotheses, we will only use two of the treatment conditions, namely the in-school and out-of-school interventions. However, it should be noted that the hypotheses that will be presented can include more than two conditions. An informative hypothesis specifies an order or an expected relationship between the treatment condition means, using equality and order constraints (Hoijtink, 2012). This type of hypothesis is created based on the findings in the literature or the actual expectations of the researchers. In the study by Ausems et al. (2002), a possible informative hypothesis is

\begin{matrix} H_{i} : μ_{in} > μ_{out} \end{matrix},

where the mean of the in-school smoking prevention program is expected to be larger than the mean of the out-of-school smoking prevention program. On the other hand, a hypothesis without any constraint is called an unconstrained hypothesis

\begin{matrix} H_{u} : μ_{in}, μ_{out} . \end{matrix}

Such a hypothesis implies that the researcher does not have any a priori expectations concerning the group means. A final hypothesis that is used in this article is the complement hypothesis, which, in the case of $H_{i}$, is

\begin{matrix} H_{c} : μ_{in} < μ_{out} . \end{matrix}

This hypothesis covers all possible parameter values that are not covered in $H_{i}$ .

Recognizing the widespread use of the null hypothesis in social sciences, and considering that the project aims to provide researchers with the necessary tools, we included the null hypothesis as a possible hypothesis of study. In this article, we provide a methodology for sample size determination for two-group comparisons. The first pair of competing hypotheses includes the equality-constrained hypothesis $H_{0}$ and an informative hypothesis $H_{1}$ , and is referred to as Hypothesis set 1:

\begin{matrix} H_{0} : μ_{in} = μ_{out} \\ H_{1} : μ_{in} < μ_{out}, \end{matrix}

while Hypothesis set 2 contains a pair of informative hypotheses:

\begin{matrix} H_{1} : μ_{in} < μ_{out} \\ H_{2} : μ_{in} > μ_{out} . \end{matrix}

Hypothesis set 2 can be used when one has good reason to believe one of the treatments is performing better than the other ( $H_{1}$ ) and one wants to test this versus its complement ( $H_{2}$ ). These two hypothesis sets can also be formulated in more general terms, as in Equation 1, if one replaces $μ_{in}$ and $μ_{out}$ with $μ_{C}$ and $μ_{T}$ , where the latter two are the means in a control and treatment condition, respectively.

An example where Hypothesis set 1 clearly represents the researchers’ expectations is a study of the effect of cognitive behavioral therapy (CBT) on reducing alcohol consumption. The null hypothesis $H_{0}$ would state that there is no difference in the reduction in alcohol consumption between individuals receiving CBT and those receiving standard counseling, whereas $H_{1}$ would state that CBT leads to a greater reduction in alcohol consumption compared to standard counseling. Hypothesis set 2, in comparison, is better suited for comparing two treatments or interventions, for instance, when comparing the effect of a medication-assisted treatment with CBT on opioid use. Hypothesis $H_{1}$ would state that individuals following the medication-assisted treatment exhibit lower levels of opioid use than those following CBT. In contrast, $H_{2}$ would state the opposite, that individuals receiving CBT show lower levels of opioid use compared to those receiving medication-assisted treatment. However, it must be emphasized that the selection between the two sets of hypotheses should be driven by the theory and expectations of researchers.

Bayes Factor

The Bayes factor is a quantification of the relative support of the data for one hypothesis over another (Heck et al., 2022). It is also known as the ratio of two marginal likelihoods, or the marginal probability of the data $X$ under $H_{i}$ or $H_{i^{'}}$ (Heck et al., 2022; Kass & Raftery, 1995). This quantification is represented as

B F_{i i^{'}} = \frac{P (X | H_{i})}{P (X | H_{i^{'}})} .

However, when comparing an (in)equality-constrained hypothesis with the unconstrained hypothesis $H_{u}$, the formulation can be simplified by using the so-called encompassing prior approach, where the prior for $H_{i}$ is constructed as a truncation of the unconstrained prior (Gu et al., 2018; Klugkist et al., 2005). Given that the likelihood of $H_{i}$ can be expressed as the truncation of the likelihood of the unconstrained hypothesis at a specific value, the Bayes factor can be computed as the Savage-Dickey density ratio (Mulder et al., 2022; Wagenmakers et al., 2010). The Bayes factor for Hypothesis set 1 is thus computed by

B F_{01} = \frac{B F_{0 u}}{B F_{1 u}} = \frac{\frac{f_{0}}{c_{0}}}{\frac{f_{1}}{c_{1}}},

(8)

where $f$ is the relative fit and $c$ is the relative complexity of the hypothesis under consideration compared to the unconstrained hypothesis. In the case of comparing the hypotheses in Hypothesis set 2, the Bayes factor is computed by

B F_{12} = \frac{B F_{1 u}}{B F_{2 u}} = \frac{\frac{f_{1}}{c_{1}}}{\frac{1 - f_{1}}{1 - c_{1}}},

(9)

where the Bayes factor is the ratio of the Bayes factor for the informative hypothesis ( $H_{1}$ ) compared to the unconstrained hypothesis ( $H_{u}$ ) and the Bayes factor of the complement hypothesis ( $H_{2}$ ).

Following the Savage-Dickey density ratio, the computation of the fit and complexity of the null hypothesis ( $H_{0}$ ) is carried out by using densities (i.e., heights) of the treatment effect $δ$ of zero in the posterior and prior distributions, respectively. However, in the case of evaluating informative hypotheses such as $H_{1}$ , the fit is the integration over the posterior distribution that aligns with the hypothesis, while the complexity is the integration over the prior distribution supported by the hypothesis (Hoijtink, Mulder, et al., 2019; Mulder et al., 2022; Wagenmakers et al., 2010). The left panel of Figure 1 depicts the prior distribution of the treatment effect ( $δ$ ), with a vertical dotted line marking the treatment effect size under $H_{0}$ . The complexity of $H_{0}$ corresponding to the density at zero is marked with a horizontal dotted line ( $c_{0} = . 398$ ), whereas the complexity of $H_{1}$ is equal to the gray-shaded area ( $c_{1} = . 5$ ). The right panel displays the posterior distribution of the treatment effect, which is the distribution reflecting the update to the prior after incorporating the information in the data. The fit of $H_{0}$ is also indicated by the horizontal dotted line ( $f_{0} = . 217$ ), while the fit of $H_{1}$ is represented by the gray-shaded area ( $f_{1} = . 969$ ).

FIGURE 1.

Prior and posterior distribution of the treatment effect $δ$. The gray area represents the complexity and fit of $H_{1}$, while the horizontal dotted lines mark the complexity and fit of $H_{0}$. The vertical dotted line marks the treatment effect size of zero. The prior was constructed following $δ \sim N (0, 1)$ and the posterior distribution follows $δ \sim N (0.91, 0.62)$.
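When the prior and posterior of $δ$ are both normal, the fit and complexity described above reduce to density heights and tail areas. The Python sketch below illustrates this under stated assumptions: $H_{1}$ is taken to be $δ > 0$, the function name is hypothetical, and the posterior parameters are made-up illustrative values rather than those of Figure 1.

```python
from statistics import NormalDist


def fit_and_complexity(prior: NormalDist, posterior: NormalDist):
    """Return (f0, c0, f1, c1) for H0: delta = 0 and H1: delta > 0.

    f0 and c0 are the posterior and prior density heights at zero
    (Savage-Dickey); f1 and c1 are the posterior and prior probability
    mass in the region delta > 0.
    """
    c0 = prior.pdf(0.0)
    f0 = posterior.pdf(0.0)
    c1 = 1.0 - prior.cdf(0.0)
    f1 = 1.0 - posterior.cdf(0.0)
    return f0, c0, f1, c1


prior = NormalDist(mu=0.0, sigma=1.0)        # standard normal prior
posterior = NormalDist(mu=0.5, sigma=0.25)   # hypothetical posterior
f0, c0, f1, c1 = fit_and_complexity(prior, posterior)
```

With a prior centered at zero, the complexity of a one-sided hypothesis is one half regardless of the prior variance, whereas the complexity of $H_{0}$ depends on the prior's height at zero and therefore on its variance.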

The present article uses the approximated adjusted fractional Bayes factor (AAFBF), which uses a fraction of the information in the data, denoted as the fraction $b$, to specify the prior distribution. Given that the fraction $b$ determines the variance of the prior distribution and hence its height, the AAFBF is sensitive to the parameter $b$ when evaluating a hypothesis with equality constraints such as $H_{0}$. Therefore, it is crucial to perform a sensitivity analysis by computing the Bayes factor using different values of parameter $b$ in order to assess its robustness (Gu et al., 2018). In comparison, in the evaluation of a hypothesis that only includes inequality constraints, such as $H_{1}$ or $H_{2}$, the AAFBF is stable regardless of the fraction of information $b$ (Mulder, 2014). Further explanation of $b$ can be found in Appendix A.

One of the advantages of the Bayes factor is that it is easy to interpret. Given that the Bayes factor is a quantification of the support for one hypothesis over the other, the interpretation is how much relative support the data have for one hypothesis over the other. For instance, if $B F_{10} = 10$, then the relative support for hypothesis 1 ( $H_{1}$ ) is ten times larger than for the null hypothesis ( $H_{0}$ ). The same conclusion can be drawn if $B F_{01} = 1 / 10 = 0.1$. In the case where the Bayes factor is 1, there is not enough evidence of a preference between the hypotheses under consideration. Using the values obtained in Figure 1, $B F_{10} = (.969 / .5) / (.217 / .398) = 3.554$, indicating that the support for hypothesis 1 in the observed data is 3.554 times larger than the support for the null hypothesis. Initially, Jeffreys (1983) proposed a threshold of 3.2 to declare that there is “positive” evidence for one hypothesis. Later, Kass and Raftery (1995) proposed more thresholds to distinguish between positive, strong, and very strong evidence in favor of one hypothesis. However, another advantage of the Bayes factors is that there are no strict thresholds because their interpretation is relative to the hypotheses of the study, and the Bayes factors can take on positive values to infinity. For these reasons, avoiding the use of cut-off values for interpretation is strongly advised.
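The arithmetic behind Equations 8 and 9 can be checked with a few lines of Python. The function names below are hypothetical; the inputs are the fit and complexity values reported for Figure 1.

```python
def bf_01(f0: float, c0: float, f1: float, c1: float) -> float:
    """Equation 8: Bayes factor of H0 against H1 from fit and complexity."""
    return (f0 / c0) / (f1 / c1)


def bf_12(f1: float, c1: float) -> float:
    """Equation 9: Bayes factor of H1 against its complement H2."""
    return (f1 / c1) / ((1 - f1) / (1 - c1))


# Fit and complexity values reported for Figure 1.
f0, c0, f1, c1 = 0.217, 0.398, 0.969, 0.5

bf01 = bf_01(f0, c0, f1, c1)
bf10 = 1 / bf01          # support for H1 relative to H0
bf12 = bf_12(f1, c1)     # support for H1 relative to its complement
```

Because $B F_{10}$ and $B F_{01}$ are reciprocals, only one of the two needs to be computed.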

Methodology for Sample Size Determination

Fu et al. (2021) proposed a method to determine the sample size using the Bayes factor for hypothesis testing. The sample size is determined by the probability ( $η$ ) that the Bayes factor exceeds a threshold ( $B F_{thresh}$ ), given that the hypothesis is true. Thus, Bayesian power is defined as

P (B F_{i i^{'}} > B F_{thresh, i} | H_{i}) \geq η_{i},

(10)

where $i$ and $i^{'}$ represent competing hypotheses of Hypothesis set 1 or Hypothesis set 2. Equation 10 is evaluated for each of the hypotheses under consideration (i.e., $B F_{i i^{'}}$ and $B F_{i^{'} i}$ ). This formulation allows for unequal levels of evidence ( $B F_{thresh, i}$ ) and power $(η_{i})$ for hypotheses $H_{i}$ and $H_{i^{'}}$. In practice, the thresholds are often chosen to be reciprocal.

The probabilities and the thresholds are specified by the researcher, taking into account the objective of the study, and they may take different values for each hypothesis in the set. In cases where the study is high-stakes and the aim is to obtain compelling evidence with high probabilities, the thresholds and the probabilities are relatively large. To illustrate, consider a study of a new medication for treating post-traumatic stress disorder. This new medication may possess severe and harmful side effects, including severe drowsiness or risk of dependency, and the costs associated with its development can be substantial. In such cases, researchers must be confident that the new medication offers clear benefits over the standard treatment by choosing high thresholds and probabilities for sample size determination, such as 10 and 0.9. This will lead to a high probability of finding a large Bayes factor. In comparison, in cases where the side effects of the treatment are less severe, researchers may opt for lower probabilities and thresholds. For instance, consider a pilot study designed to compare the effect of CBT alone with the effect of using an app with AI, developed to interact with patients with anxiety disorders, as a complement to CBT. Considering the exploratory nature of the study and the noncritical implications of using the app in conjunction with CBT, researchers may use low thresholds of, say, 3 and probabilities of .8. Furthermore, if researchers do not aim to collect more evidence for one of the hypotheses under study, we suggest maintaining the same probability and threshold for both.

As mentioned earlier, in a CRT, two sample sizes need to be determined: the cluster size and the number of clusters. The strategy proposed in this paper fixes one of these sample sizes and determines the other. The algorithm is shown in Figure 2.

FIGURE 2.

Algorithm of function for sample size determination in cluster-randomized trials when using the Bayes factor to test informative hypotheses, including the null hypothesis.

The first step in this method is to generate data sets corresponding to a CRT with two parallel treatment conditions, with a given number of clusters and cluster size. The second step is to fit the multilevel model in Equation 1 to the data with the function lmer from the R package lme4 (Bates et al., 2015). The third step is to use the estimates of the means and the variances of the means for both treatment conditions to calculate the Bayes factors for both hypotheses in Hypothesis set 1 or 2. The fourth step is to calculate the proportion of generated data sets for which the Bayes factors exceed the threshold and to evaluate the Bayesian power criterion. The fifth step, which occurs when the power criterion in Equation 10 is not met, is to change the sample size. Rather than increasing the sample size by only one, the algorithm incorporates a binary search to efficiently find the required sample size and reduce the computation time (see Appendix B). In the case of Hypothesis set 1, a sensitivity analysis is carried out, which means that the aforementioned first five steps are repeated for different choices of fraction of information $b$ as specified by the researchers. When the power criterion is met and the sensitivity analysis has finished, the results are displayed in a table where the researcher can find the hypotheses under consideration, the number of clusters, the cluster sizes, and the probabilities of Bayes factors exceeding the threshold.
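The five steps can be sketched in a few dozen lines. The Python code below is a simplified illustration and not the implementation in the R package: all function names are hypothetical; it covers only Hypothesis set 2 and evaluates the criterion under $H_{1}$ ( $δ > 0$ ) only; it replaces the lmer fit by an analysis of cluster means, which is assumed adequate for equal cluster sizes; it simulates cluster means directly with total variance 1; and it uses a normal approximation of the posterior with complexity $c_{1} = .5$, so the fraction $b$ plays no role.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


def simulate_bf12(delta, rho, n1, n2, rng):
    """Steps 1-3 for one data set: simulate, estimate, compute BF12."""
    k = n2 // 2  # clusters per condition (equal allocation assumed)
    # With total variance 1, a cluster mean has variance rho + (1 - rho) / n1.
    sd_cluster_mean = (rho + (1 - rho) / n1) ** 0.5
    control = [rng.gauss(0.0, sd_cluster_mean) for _ in range(k)]
    treated = [rng.gauss(delta, sd_cluster_mean) for _ in range(k)]
    mean_c, var_c = mean_and_variance(control)
    mean_t, var_t = mean_and_variance(treated)
    estimate = mean_t - mean_c
    std_error = ((var_c + var_t) / k) ** 0.5
    # Fit of H1: approximate posterior mass on delta > 0.
    fit = STD_NORMAL.cdf(estimate / std_error)
    fit = min(max(fit, 1e-12), 1 - 1e-12)  # guard against division by zero
    return fit / (1 - fit)  # Equation 9 with c1 = 0.5


def bayesian_power(delta, rho, n1, n2, bf_thresh, ndatasets, rng):
    """Step 4: proportion of data sets with BF12 above the threshold."""
    hits = sum(
        simulate_bf12(delta, rho, n1, n2, rng) > bf_thresh
        for _ in range(ndatasets)
    )
    return hits / ndatasets


def find_number_of_clusters(delta, rho, n1, bf_thresh=3, eta=0.8,
                            ndatasets=1000, min_n2=6, max_n2=1000, seed=2024):
    """Step 5: binary search over even n2 for a value meeting Equation 10.

    min_n2 and max_n2 must be even. Returns None if the criterion is not
    met at max_n2. The result is subject to Monte Carlo error.
    """
    rng = random.Random(seed)
    low, high = min_n2, max_n2
    if bayesian_power(delta, rho, n1, high, bf_thresh, ndatasets, rng) < eta:
        return None
    while low < high:
        mid = (low + high) // 2
        mid -= mid % 2  # keep the number of clusters even
        if bayesian_power(delta, rho, n1, mid, bf_thresh, ndatasets, rng) >= eta:
            high = mid
        else:
            low = mid + 2
    return low
```

A call such as `find_number_of_clusters(delta=0.2, rho=0.05, n1=10)` searches for the number of clusters at a fixed cluster size of 10. Fixing the number of clusters and searching over the cluster size would follow the same pattern with the roles of `n1` and `n2` exchanged.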

In the repository and in the R package, two functions are available to determine the sample size for a trial with two parallel treatment conditions. The function SSD_crt_null determines the sample size when one of the hypotheses has an equality constraint, that is, for Hypothesis set 1. Meanwhile, the function SSD_crt_inform determines the sample size for Hypothesis set 2.

The arguments necessary to determine the sample size are the following:

eff_size is a numeric value corresponding to the standardized mean difference between the treatment and control conditions.

n1 is a numeric value that specifies the sizes of the clusters. All clusters are assumed to have the same size. The default value is 15 individuals in each cluster.

n2 is a numeric value that specifies the total number of clusters. The default is 30 clusters: 15 in the experimental condition and 15 in the control condition.

BF.thresh1 is a numeric value that specifies the desired minimum of the Bayes factor under the informative hypothesis $H_{1}$. This value indicates how much relative support the data should show for the informative hypothesis $H_{1}$ when it is true. The default value is 3.

eta1 is a numeric value that indicates the probability of finding a Bayes factor equal to or larger than the threshold, given that the informative hypothesis $H_{1}$ is true. The default value is .8.

ndatasets is a numeric value that indicates how many data sets are generated to evaluate the power criterion. The default is 5,000 data sets.

rho is a numeric value that specifies the intraclass correlation.

fixed is a string that specifies which sample size is fixed. When the number of clusters is fixed (fixed = “n2”), the function determines the cluster sizes. If the cluster sizes are fixed (fixed = “n1”), then the function determines the number of clusters. The default setting is “n2”.

max is a numeric value that indicates the maximum sample size that is used by the algorithm: if the algorithm reaches this sample size, it stops. By default, the maximum sample size is 1,000.

batch_size is a numeric value that indicates the batch size in the multilevel model fitting, which is a strategy to improve memory usage efficiency and computational performance, given that the data sets might become very large and require a considerable amount of computational effort for model fitting. The default is 100 models at the same time.

The function SSD_crt_null has the following additional arguments:

b_fract is a numeric value that specifies the maximum value that the fraction of information b is multiplied by in the sensitivity analysis. A sensitivity analysis is carried out for all integer values ranging from 1 to b_fract. This means that the fraction of information taken from the data increases from $1 / N_{eff}$ up to and including b_fract $\times 1 / N_{eff}$. For further information, the reader can refer to Appendix A. By default, b_fract is equal to 3.

BF.thresh0 is a numeric value that specifies the desired minimum of the Bayes factor under the null hypothesis $H_{0}$ . This value indicates how much relative support the data should show for the null hypothesis in Hypothesis set 1. The default value is 3.

eta0 is a numeric value that indicates the probability of finding a Bayes factor equal to or larger than the threshold, given that the null hypothesis is true. The default value is .8.

The outputs are different for SSD_crt_null and SSD_crt_inform. However, for both functions, the output includes the hypotheses under consideration, the sample size required, whether the number of clusters or the cluster size was fixed, the probabilities that the Bayes factor is higher than the threshold, and data sets containing the Bayes factors calculated during the simulation. For SSD_crt_null, the output also incorporates the results for different choices of b.

Simulation Study

Design

Four simulations were carried out to provide sample sizes for various realistic scenarios. Two of the simulations, one for each hypothesis set, aimed to determine the number of clusters given a fixed cluster size, while the other two aimed to determine the cluster size for a fixed number of clusters. The common factors in the design were the following:

Intraclass correlation: .025, .05, .1.

Effect size: 0, .2, .5, .8.

Bayes factor threshold: 1, 3, 5.

Probability: .8.

Maximum sample size: 1,000.

Note that, for simplicity and because of space limitations, we restrict ourselves to equal Bayes factor thresholds and equal probabilities for the hypotheses under consideration.

Determining the Number of Clusters

To determine the number of clusters, the cluster size was fixed to the following values:

Cluster sizes: 5, 10, 40.

Together with the factors that were common for both simulations, 81 combinations were formed. For each of these combinations, 5,000 data sets were generated.
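The count of 81 design cells can be checked with a short sketch. It assumes that the effect size of 0 is the data-generating value under the null hypothesis rather than a separate crossed factor level, so that only the three nonzero effect sizes are crossed with the other factors; the variable names are illustrative and are not taken from the package.

```python
from itertools import product

# Factor levels listed in the Design subsection.
icc_values = [0.025, 0.05, 0.1]
# Assumption: effect size 0 only generates data under H0 and is not crossed.
effect_sizes = [0.2, 0.5, 0.8]
bf_thresholds = [1, 3, 5]
cluster_sizes = [5, 10, 40]  # fixed cluster sizes when solving for the number of clusters

# Fully crossed design: one tuple per simulation condition.
design = list(product(icc_values, effect_sizes, bf_thresholds, cluster_sizes))
print(len(design))
```

Under that assumption the crossing gives 3 × 3 × 3 × 3 conditions, which matches the number reported in the text.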

Determining the Cluster Size

To determine the cluster sizes, the number of clusters was fixed to the following values:

Number of clusters: 30, 60, 90.

The total of formed combinations of the factors was 81, and for each of these combinations, 5,000 data sets were generated.

The minimum sample size was set inside the functions SSD_crt_null and SSD_crt_inform; for the cluster size, it was 5, and for the number of clusters, it was 6, while the maximum was set to 1,000 in the functions’ argument. The reason for using a maximum sample size was to provide a stopping point, given that with a small number of clusters, sufficient power is not always achieved, even in cases where the cluster size increases to infinity (Hemming et al., 2011). In addition, to test Hypothesis set 1, a sensitivity analysis was carried out for each combination with fractions of information $b$ , $2 b$ , and $3 b$ .

Results

Taking into account the limited space, this subsection presents a selection of the required sample sizes in tables. Readers interested in exploring additional results not displayed here can access them in the following Shiny app (https://utrecht-university.shinyapps.io/BayesSamplSizeDet-CRT/).

Determining the Number of Clusters for Hypothesis Set 1

Table 1 presents the required number of clusters and the probability of exceeding the threshold $B F_{thresh}$ for Hypothesis set 1. For instance, for $B F_{thresh} = 1$ , $n_{1} = 5$ , $ICC = .025$ , and effect size $δ = .2$ , the required total number of clusters is 126, with 63 in the control and 63 in the treatment condition. This number results in probabilities of 80.7% for $H_{1}$ and 98.2% for $H_{0}$ of obtaining a Bayes factor larger than 1. Thus, in this specific case, the desired threshold is exceeded more often for $H_{0}$ than it is for $H_{1}$ . However, that is not necessarily the case for all other combinations of design factors in Table 1.

TABLE 1.

Required Total Number of Clusters With a Fraction of Information Equal to 1 and Probability ( $η$ ) of .8 for Hypothesis Set 1

$B F_{thresh}$	$n_{1}$	Hypothesis	ICC = .025				ICC = .050				ICC = .100
			Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5
			$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$
1	5	H0	126	.982	16	.942	138	.982	18	.941	158	.981	20	.931
		H1		.807		.808		.810		.820		.806		.815
1	10	H0	72	.984	12	.943	80	.980	14	.942	110	.982	14	.927
		H1		.816		.891		.800		.896		.801		.834
1	40	H0	30	.981	8	.943	42	.977	8	.927	70	.983	10	.920
		H1		.821		.976		.802		.913		.804		.854
3	5	H0	160	.944	26	.813	174	.944	28	.802	204	.946	32	.805
		H1		.804		.872		.808		.873		.801		.870
3	10	H0	92	.940	14	.806	110	.943	18	.809	140	.945	24	.809
		H1		.810		.857		.813		.887		.802		.896
3	40	H0	38	.937	8	.822	54	.940	10	.809	90	.943	16	.818
		H1		.828		.941		.803		.903		.800		.900
5	5	H0	176	.904	68	.802	188	.898	76	.804	228	.905	88	.804
		H1		.808		.998		.800		.998		.805		.998
5	10	H0	98	.892	42	.811	118	.897	48	.802	154	.900	58	.802
		H1		.801		.999		.801		.999		.811		.998
5	40	H0	40	.889	16	.804	60	.898	24	.807	100	.896	42	.804
		H1		.805		.998		.810		.999		.802		1.000

Note. $n_{1}$ represents the cluster size while $n_{2}$ represents the number of clusters. ICC = intraclass correlation; BF = Bayes factor.

The results in Table 1 show that the required number of clusters to meet the power criterion increases as the ICC increases. This increase is expected, given that the higher the correlation between the individuals within a cluster, the larger the dependency among the individuals and, hence, the lower the effective sample size (Hox et al., 2017). This expectation can also be inferred from Equation 6. Table 1 further shows a trade-off between the two sample sizes: the required number of clusters decreases if the cluster size increases. This decrease is expected, given that the larger the cluster size, the more information is available within each cluster and, hence, the fewer clusters are needed. The results also show an inverse relationship between the effect size and the number of clusters, which is not surprising, as larger effect sizes are easier to detect than smaller effect sizes. With respect to the Bayes factor threshold, the relationship with the number of clusters is positive: increasing the threshold requires a larger number of clusters to achieve the desired support for the correct hypothesis.
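The link between the ICC and the effective sample size can be made concrete with the standard design effect, 1 + (n₁ − 1)ρ, which is the same factor that appears in the frequentist sample size formula used in the Practical Example. The following is a minimal sketch with illustrative function names; it is not code from the package.

```python
def design_effect(n1, icc):
    """Variance inflation due to clustering: 1 + (n1 - 1) * icc."""
    return 1 + (n1 - 1) * icc


def effective_sample_size(n1, n2, icc):
    """Total sample size n1 * n2 deflated by the design effect."""
    return n1 * n2 / design_effect(n1, icc)


# Same total sample (30 clusters of 40 individuals), increasing ICC.
for icc in (0.025, 0.05, 0.1):
    print(icc, round(effective_sample_size(n1=40, n2=30, icc=icc), 1))
```

With the total sample held at 1,200 individuals, the effective sample size drops as the ICC grows, which is why more clusters are needed to reach the same criterion.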

As the reader can verify from our Shiny app (https://utrecht-university.shinyapps.io/BayesSamplSizeDet-CRT/), the number of clusters required generally increases with the fraction of information $b$ when the effect size is .5 or larger. One explanation of this finding can be found in Fu et al. (2021), who indicated that when $b$ gets larger, the prior variances decrease, while the complexity $c_{0}$ increases in Equation 8, resulting in smaller values of the Bayes factor when $H_{0}$ is true. However, if the effect size is small, then, depending on the Bayes factor threshold, the required number of clusters either decreases or first decreases and then increases.

Determining the Cluster Sizes for Hypothesis Set 1

Table 2 presents the required cluster sizes as a function of the number of clusters, effect size, ICC, and Bayes factor thresholds. This table is similar in format to Table 1, but now the number of clusters rather than the cluster size appears in the second column, while the other columns show the required cluster sizes and corresponding Bayesian power.

TABLE 2.

Required Number of Individuals per Cluster With a Fraction of Information Equal to 1 and Probability ( $η$ ) of .8 for Hypothesis Set 1

$B F_{thresh}$	$n_{2}$	Hypothesis	ICC = .025				ICC = .050				ICC = .100
			Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5
			$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$
1	30	H0	37	.981	6	.961	304	.981	6	.956	1,000	.971	6	.950
		H1		.808		.983		.802		.974		.546		.947
1	60	H0	12	.985	6	.973	18	.980	6	.969	137	.981	6	.967
		H1		.803		1.000		.816		1.000		.800		.998
1	90	H0	8	.984	6	.979	9	.981	6	.978	16	.981	6	.975
		H1		.836		1.000		.811		1.000		.801		1.000
3	30	H0	66	.938	6	.847	1,000	.933	6	.834	1,000	.897	6	.810
		H1		.806		.953		.692		.929		.390		.880
3	60	H0	17	.942	6	.907	29	.938	6	.894	1,000	.938	6	.883
		H1		.804		.999		.803		.998		.703		.994
3	90	H0	10	.939	6	.927	14	.943	6	.924	40	.943	6	.914
		H1		.803		1.000		.818		1.000		.800		.999
5	30	H0	85	.898	14	.801	1,000	.871	22	.806	1,000	.789	1,000	.789
		H1		.805		.999		.637		.999		.332		.998
5	60	H0	20	.897	6	.802	39	.893	7	.811	1,000	.874	10	.806
		H1		.809		.999		.806		.998		.649		.998
5	90	H0	12	.899	6	.865	16	.897	6	.855	81	.893	6	.830
		H1		.823		1.000		.805		1.000		.800		.999

Note. $n_{1}$ represents the cluster size while $n_{2}$ represents the number of clusters. ICC = intraclass correlation; BF = Bayes factor.

As shown in Table 2, the desired probability $η = .8$ for both hypotheses was not always achieved. The maximum specified cluster size of 1,000 was reached for one or both hypotheses without meeting the power criterion, especially in cases of small effect sizes and medium-to-high ICC ( $ρ = .05$ and $ρ = .1$ , respectively). This outcome was expected, given that increasing the number of individuals per cluster yields only a limited increase in power (Hemming et al., 2011). This finding also shows that increasing the cluster size has a weaker effect on Bayesian power than increasing the number of clusters, which is a characteristic also observed in the frequentist approach, as illustrated in Equation 7.
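The plateau described above can be illustrated in the frequentist setting with a normal-approximation power function. This is a sketch under stated assumptions (standardized outcome, equal allocation, two-sided α = .05, and the same variance expression that underlies the sample size formula in the Practical Example); it does not compute Bayesian power, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def frequentist_power(n1, n2, icc, delta, alpha=0.05):
    """Approximate power for a two-arm CRT with n2 clusters in total."""
    z = NormalDist()
    # Standard error of the standardized treatment effect estimate.
    se = sqrt(4 * (1 + (n1 - 1) * icc) / (n1 * n2))
    return z.cdf(delta / se - z.inv_cdf(1 - alpha / 2))


def limiting_power(n2, icc, delta, alpha=0.05):
    """Power as n1 goes to infinity: the variance tends to 4 * icc / n2."""
    z = NormalDist()
    return z.cdf(delta / sqrt(4 * icc / n2) - z.inv_cdf(1 - alpha / 2))


# 30 clusters, ICC = .1, effect size = .2: power levels off well below .8.
for n1 in (10, 100, 1000):
    print(n1, round(frequentist_power(n1, 30, 0.1, 0.2), 3))
print("limit", round(limiting_power(30, 0.1, 0.2), 3))
```

Because the between-cluster variance does not shrink with the cluster size, the power is bounded by a limit that depends only on the number of clusters, the ICC, and the effect size.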

In contrast, when the effect size is .5, an increase in the ICC has little impact on the cluster size; a slight increase in the cluster size appears only under certain conditions, specifically a threshold of 5 and a number of clusters of 30 or 60. We found similar patterns for an effect size of .8. The reader can use the Shiny app to verify that, for an effect size of .8, cluster sizes tend to be 6 and increase slightly with the ICC under specific conditions. However, in some cases, the maximum cluster size is reached, particularly when the ICC is high ( $ρ = .1$ ), the number of clusters is low ( $n_{2} = 30$ ), and the threshold is large ( $B F_{thresh} = 5$ ).

The relationships among the variables in Table 2 are similar to those described for Table 1. Larger cluster sizes are required when the effect size and the number of clusters decrease. There is a notable difference according to the effect size: with a small effect size, an increase in the ICC leads to a considerable increase in the required cluster size.

In general, larger cluster sizes were required with increasing Bayes factor threshold and fraction $b$ . However, for conditions with a small effect size, the cluster size decreased or initially decreased but then increased as the fraction $b$ increased. These findings are similar to those for determining the number of clusters. In the cases of medium and large effect sizes ( $δ = .5$ and $δ = .8$ , respectively), the cluster sizes increased or remained the same as the fraction $b$ increased. In particular, the cluster sizes stayed constant for a large number of clusters ( $n_{2} = 90$ ).

Determining the Number of Clusters for Hypothesis Set 2

Table 3 differs from the tables presented for Hypothesis set 1. With hypotheses that contain only inequality constraints, data need to be simulated and tested under one hypothesis only, given that, in this case, we are testing one hypothesis (i.e., $H_{1}$ ) against its complement. Moreover, as there is no equality constraint, the fraction of information $b$ plays no role and no sensitivity analysis is needed. Regarding all the other factors, the interpretation of the table is similar to the interpretation of Table 1.

TABLE 3.

Required Total Number of Clusters With Probability ( $η$ ) of .8 for Hypothesis Set 2

		ICC = .025				ICC = .050				ICC = .100
		Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5
$B F_{thresh}$	$n_{1}$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$	$n_{2}$	$P (BF > B F_{thresh})$
1	30	6	.969	6	1.000	6	.959	6	1.000	6	.942	6	1.000
1	60	6	.994	6	1.000	6	.992	6	1.000	6	.987	6	1.000
1	90	6	.999	6	1.000	6	.998	6	1.000	6	.996	6	1.000
3	30	6	.867	6	1.000	6	.854	6	1.000	6	.811	6	.999
3	60	6	.967	6	1.000	6	.957	6	1.000	6	.936	6	1.000
3	90	6	.992	6	1.000	6	.987	6	1.000	6	.978	6	1.000
5	30	7	.830	6	1.000	8	.818	6	.999	12	.806	6	.998
5	60	6	.937	6	1.000	6	.922	6	1.000	6	.887	6	1.000
5	90	6	.983	6	1.000	6	.975	6	1.000	6	.959	6	1.000

Note. $n_{1}$ represents the cluster size while $n_{2}$ represents the number of clusters. ICC = intraclass correlation; BF = Bayes factor.

The results indicate that, in almost all conditions and regardless of the Bayes factor threshold, ICC, effect size, and cluster size, the required total number of clusters is 6, with 3 in the control and 3 in the treatment condition. The cases that deviate from this tendency have the largest thresholds, the smallest cluster sizes, and the smallest effect sizes. This tendency means that, while the number of clusters may be larger in specific conditions, overall, the power criterion is easily met, with a number of clusters close to the minimum specified in the design of the simulation study.

Determining the Cluster Size for Hypothesis Set 2

Table 4 presents the cluster size as a function of the number of clusters, ICC, effect sizes, and Bayes factor thresholds. From the table, it can be inferred that, regardless of the ICC and threshold, the required cluster size is 8 when the effect size is .5. However, in the cases where the effect size is .2, larger cluster sizes are necessary when the number of clusters is low, the ICC increases, or the threshold increases.

TABLE 4.

Required Number of Individuals per Cluster With Probability ( $η$ ) of .8 for Hypothesis Set 2

		ICC = .025				ICC = .050				ICC = .100
		Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5		Eff. size = .2		Eff. size = .5
$B F_{thresh}$	$n_{2}$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$	$n_{1}$	$P (BF > B F_{thresh})$
1	5	8	.810	8	.984	10	.816	8	.980	10	.807	8	.972
1	10	8	.867	8	.998	8	.842	8	.995	8	.809	8	.989
1	40	8	.963	8	1.000	8	.928	8	1.000	8	.873	8	.998
3	5	26	.810	8	.924	28	.809	8	.908	32	.807	8	.886
3	10	14	.800	8	.983	18	.820	8	.974	22	.809	8	.944
3	40	8	.863	8	1.000	10	.837	8	.999	16	.832	8	.983
5	5	36	.803	8	.867	42	.817	8	.851	46	.801	8	.817
5	10	22	.828	8	.965	24	.806	8	.943	32	.805	8	.898
5	40	10	.845	8	1.000	14	.835	8	.997	20	.802	8	.967

Note. $n_{1}$ represents the cluster size while $n_{2}$ represents the number of clusters. ICC = intraclass correlation; BF = Bayes factor.

The comparison of the results for Hypothesis sets 1 and 2 indicates that including hypotheses with equality constraints, such as $H_{0}$ , while keeping all the other factors equal, leads to larger sample sizes. This result was not surprising, since the evaluation of Hypothesis set 1 requires meeting the Bayesian power criterion for two scenarios ( $H_{0}$ and $H_{1}$ ), whereas for Hypothesis set 2, the evidence collected is for only one of the hypotheses ( $H_{1}$ ) given that the other one is its complement ( $H_{2}$ ).

Another difference between the results of the two hypothesis sets is that the required sample size for Hypothesis set 2 hardly depends on the ICC, effect size, and Bayes factor threshold. The results for Hypothesis set 1 are more consistent with the effects of the factors in the frequentist framework, which are expressed in Equations 6 and 7 and proved in Moerbeek and Teerenstra (2016). For Hypothesis set 2, the same relationship between the factors and the sample size is observed only under specific conditions, such as the smallest effect size, the largest thresholds, and the smallest fixed sample sizes.

Practical Example

This section uses the example of the CRT carried out by Ausems et al. (2002), presented above. In this study, schools were assigned randomly to four treatment conditions to evaluate two interventions and their interaction. One of the variables measured was the attitude toward the disadvantages of smoking, computed as the sum score of an 11-item scale in which each item is rated on a 5-point Likert scale ranging from 1 = very negative to 5 = very positive.

Suppose that a researcher wants to replicate the study of Ausems et al. (2002) but is only interested in the effect of the out-of-school condition versus the control. Attitude toward the disadvantages of smoking is the outcome variable for which a power analysis is to be performed. The pair of hypotheses to consider is

$$\begin{matrix} H_{0} : μ_{out} = μ_{control} \\ H_{1} : μ_{out} > μ_{control} . \end{matrix}$$

The researcher performs sample size calculations. Following Moerbeek (2006), the between-cluster variance is equal to $σ_{u}^{2} = 3.5$ and the within-cluster variance is equal to $σ_{e}^{2} = 45$ , which means that the total variance is $σ^{2} = 48.5$ and the ICC is $ρ = \frac{3.5}{48.5} = .0721$ . The unstandardized effect size that the researcher expects to detect is 1.39, corresponding to the standardized effect size $δ = 1.39 / \sqrt{48.5} = .19$ . The desired significance level is $α = .05$ and the statistical power is $1 - β = .8$ . If the cluster size is $n_{1} = 30$ students per cluster, then, using Equation 6, the required total number of schools is

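$$n_{2} = 4 \, \frac{1 + (30 - 1) \times 0.0721}{30} \left( \frac{1.96 + 0.84}{0.19} \right)^{2} = 81.24714, \tag{11}$$

which is rounded up to 82. The inputs are displayed in rounded form; the reported result is obtained from the unrounded values of $ρ$ , $δ$ , and the normal quantiles. If the statistical power is increased to .9, the required total number of schools is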

$$n_{2} = 4 \, \frac{1 + (30 - 1) \times 0.0721}{30} \left( \frac{1.96 + 1.28}{0.19} \right)^{2} = 108.7669, \tag{12}$$

which is rounded up to 109.
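The two frequentist calculations above can be reproduced with a short sketch that evaluates the same formula using unrounded inputs. The function name `required_clusters` is illustrative and not part of the package; the normal quantiles are taken from Python's standard library.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_clusters(n1, icc, delta, alpha=0.05, power=0.8):
    """Total number of clusters from the formula used in Equations 11 and 12."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 4 * (1 + (n1 - 1) * icc) / n1 * (z_sum / delta) ** 2


# Variance components and raw effect from the practical example.
sigma2_u, sigma2_e = 3.5, 45.0
icc = sigma2_u / (sigma2_u + sigma2_e)      # about .0721
delta = 1.39 / sqrt(sigma2_u + sigma2_e)    # about .1996, reported as .19

n2_80 = required_clusters(30, icc, delta, power=0.8)
n2_90 = required_clusters(30, icc, delta, power=0.9)
print(round(n2_80, 3), ceil(n2_80))   # 81.247 82
print(round(n2_90, 3), ceil(n2_90))   # 108.767 109
```

Plugging in the rounded inputs shown in the equations ( $δ = .19$ , quantiles 1.96 and 0.84) would instead give roughly 89.5 for Equation 11, so the unrounded values are needed to recover the reported numbers.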

However, the researcher is also open to using the Bayes factor as a method to test the hypotheses. For this reason, the researcher performs a sample size calculation to test the hypotheses with different thresholds and probability values. Considering that the researcher wants to confirm the effects that have been studied before, the Bayes factor thresholds are 3 and 5, and the probabilities are .8 and .9.

The results presented in Table 5 show that the required number of clusters varies from 76 to 252 and tends to increase with the threshold ( $B F_{thresh}$ ) and the probability ( $η$ ). When the threshold is 3 and the fraction $b$ is 2 or 3, the required number of clusters is lower than the number required under the frequentist approach. However, when the threshold is increased to 5, this no longer holds, and the number of clusters is larger in the Bayesian approach. This outcome aligns with the simulation results: a higher threshold demands stronger support for a hypothesis and hence a larger sample size. Depending on the effect size and $B F_{thresh}$ , increasing the fraction of information $b$ may increase, decrease, or leave unchanged the required number of clusters; in this example ( $δ = .19$ ), Table 5 shows both decreasing and increasing patterns.

The original study by Ausems et al. (2002) included 143 schools, which exceeds most of the required numbers of schools listed in Table 5. This implies that a study with the same number of schools could have ensured an 80% probability that the Bayes factor would be larger than 3 or 5, as well as a 90% probability of obtaining a Bayes factor larger than 3.

TABLE 5.

Required Number of Clusters for Evaluation of Smoking Prevention Programs With the Bayes Factor When the Cluster Size Equals 30 Students

$η$	$B F_{thresh}$	$b$	$n_{2}$	$P (B F_{01} > B F_{thresh} ∣ H_{0})$	$P (B F_{10} > B F_{thresh} ∣ H_{1})$
.8	3	1	84	.948	.804
.8	3	2	78	.913	.801
.8	3	3	76	.879	.806
.9	3	1	110	.951	.902
.9	3	2	104	.925	.905
.9	3	3	100	.906	.903
.8	5	1	94	.912	.811
.8	5	2	88	.845	.808
.8	5	3	96	.801	.854
.9	5	1	120	.920	.908
.9	5	2	170	.900	.988
.9	5	3	252	.902	1.000

Note. $η$ represents the probability that must be reached to meet the power criterion. $B F_{thresh}$ represents the Bayes factor threshold. In addition, $n_{2}$ stands for the required number of clusters to reach the Bayesian power criterion.

Discussion

This article presents an innovative method for determining the sample size for CRTs based on Bayesian power. The method is designed to determine the number of clusters or the cluster sizes when one of them is fixed. The Bayesian power criterion specifies that the sample size is determined so that it ensures a probability ( $η$ ) of obtaining a Bayes factor larger than a Bayes factor threshold for either hypothesis under consideration.

To facilitate this approach, we have developed two functions, SSD_crt_null and SSD_crt_inform, that are implemented and freely available in an R package called SSD_Bayes_ML. The first function determines the required sample size for the evaluation of Hypothesis set 1, where $H_{0} : μ_{C} = μ_{T}$ and $H_{1} : μ_{C} < μ_{T}$ are evaluated. The second function determines the required sample size to evaluate Hypothesis set 2, which is $H_{1} : μ_{C} < μ_{T}$ against its complement $H_{2} : μ_{C} > μ_{T}$ . In addition, the results from the simulations can be consulted and explored in the Shiny app Sample Size Determination for Cluster Randomized Trials with Bayes Factor (https://utrecht-university.shinyapps.io/BayesSamplSizeDet-CRT/).

The results showed the effect of ICC, effect size, and fixed sample size on the determination of the required sample size. Larger sample sizes are required when the ICC increases, the fixed sample size is small, or the effect size is small. These results align with the effect of these same factors on the sample size determination in the frequentist approach (Moerbeek & Teerenstra, 2016; Raudenbush, 1997). The simulation also showed the trade-off between the number of clusters and cluster sizes. In addition, as anticipated (Hemming et al., 2011), the desired probability $η$ could not always be achieved for a limited number of clusters, even when the cluster size was as large as 1,000.

The effects of the Bayes factor threshold and fraction of information $b$ on the sample size indicate that, in general, larger sample sizes were required for larger thresholds and larger fractions of information. However, it is important to note that the direction of the effect of the fraction of information on the required sample size also depends on the effect size, the Bayes factor threshold, and the value of the fixed sample size. In scenarios where the effect size is small, the sample size may decrease while the fraction $b$ increases. In addition, if the Bayes factor threshold is large, the sample size first decreases and then increases. These results confirm the importance of carrying out a sensitivity analysis in the sample size determination. The fraction $b$ has an effect only when a hypothesis with an equality constraint is evaluated, since the fraction of information is not used in the specification of the prior distribution for inequality-constrained hypotheses, such as in Hypothesis set 2.

In comparison, our findings for Hypothesis set 2 suggest that the required sample sizes tended to be small and varied little, regardless of the factor levels used in the simulation study. This tendency lends additional support to questioning the testing of the null hypothesis (for further arguments against the use of NHST, see: Anderson et al., 2000; de Schoot et al., 2011; Gu et al., 2014; Hoijtink, Mulder, et al., 2019; Klugkist et al., 2011).

The simulations exhibited the differences in reaching the Bayesian power criterion for different hypotheses. The power criterion was easily met when evaluating Hypothesis set 2 for all conditions. Moreover, the required sample size had a small range of values and varied little, especially for cases with medium and large effect sizes. In comparison, when evaluating Hypothesis set 1, the required sample size varied considerably, and in cases with small effect sizes and a small number of clusters, the power criterion was not met, even after reaching 1,000 individuals per cluster.

Throughout the article, this sample size determination method was compared with the frequentist approach to show the differences between the two approaches and to highlight the advantages of hypothesis testing with the Bayes factor. In NHST, statistical power is defined as the probability of rejecting the null hypothesis when there is a treatment effect, thus avoiding a Type II error. Researchers perform a priori power analysis or sample size calculation to ensure a certain statistical power in their studies. The methodology for sample size determination proposed here is similar in that it aims to ensure a probability; however, it goes beyond avoiding a Type II error. In the proposed method, we ensure the probability of obtaining a certain amount of relative support for a hypothesis when said hypothesis is true. In this way, although the error rate is not the primary focus of the methodology, researchers may consider $1 - \Pr (B F_{i} > B F_{thresh} | H_{i})$ as the rate of preferring the false hypothesis or providing insufficient support to the correct hypothesis.
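As a minimal illustration of this quantity, the probability of exceeding the threshold can be estimated as the proportion of simulated Bayes factors above it, and the complementary rate follows directly. The Bayes factor values below are made-up toy numbers for illustration only; they are not results from the simulation study, and the function name is hypothetical.

```python
def bayesian_power(bayes_factors, threshold):
    """Proportion of simulated Bayes factors that exceed the threshold."""
    exceed = sum(bf > threshold for bf in bayes_factors)
    return exceed / len(bayes_factors)


# Toy Bayes factors BF_10 from ten hypothetical data sets generated under H1.
bf_under_h1 = [0.4, 2.1, 3.5, 7.9, 12.0, 4.2, 0.9, 6.3, 15.4, 3.1]

power = bayesian_power(bf_under_h1, threshold=3)
error_rate = 1 - power          # rate of insufficient support for the true hypothesis
criterion_met = power >= 0.8    # compare against the desired probability eta
print(power, round(error_rate, 2), criterion_met)
```

In the method described in the article, this proportion would be computed from thousands of simulated data sets per candidate sample size rather than from a handful of values.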

An additional advantage of the method presented in this paper is that it allows the testing of different hypotheses beyond the null hypothesis. The hypotheses of the study presented in this article are limited to the null and informative hypotheses; however, the Bayes factor can also be used to test interval hypotheses. Future research may focus on expanding the present method to test interval hypotheses, which would open the possibility to different designs such as the superiority design, non-inferiority design, and equivalence design (Heck et al., 2022). Likewise, the number of hypotheses under study may be larger than two, which can be another topic for future research and expansion of this method.

The method and the software presented in this paper implement the AAFBF, which is only one type of Bayes factor. While one of the advantages of the Bayes factor is the incorporation of prior information, for this type of Bayes factor, the user only has to indicate the fraction of information used to specify the prior distribution. However, the AAFBF is sensitive to different fractions of the sample in the case of equality-constrained hypotheses such as $H_{0}$ . In addition, the software presented in this paper is specifically tailored for determining the sample sizes in a parallel CRT with only two treatment conditions and evaluating two hypotheses. The method is also restricted to equal allocation ratios, which can be shown to be optimal when costs and variances do not vary across treatment conditions (Schouten, 1999).

Additional elements in CRT design that also influence the determination of sample size but were not considered in this study include unequal cluster sizes, uncertainty surrounding the intraclass correlation, and non-inferiority and equivalence designs (Rutterford et al., 2015). Researchers considering unequal cluster sizes should be aware that, in NHST, the loss of efficiency due to variation in cluster sizes has been shown to rarely exceed 10%; therefore, adding 11% more clusters is usually sufficient to compensate (van Breukelen et al., 2007).
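The step from a 10% efficiency loss to 11% more clusters is simple arithmetic: dividing by the remaining efficiency of .9 inflates the number of clusters by a factor of 1/.9 ≈ 1.11. The sketch below applies this to the 82 schools from the practical example; the function name is illustrative, and whether the same correction carries over unchanged to the Bayes factor setting is not established in the text.

```python
from math import ceil


def inflate_clusters(n2_equal, efficiency_loss=0.10):
    """Clusters needed after compensating for a given relative efficiency loss."""
    # Rounding guards against floating-point noise before taking the ceiling.
    return ceil(round(n2_equal / (1 - efficiency_loss), 9))


inflation = 1 / (1 - 0.10) - 1        # about 0.111, i.e., 11% more clusters
print(round(inflation, 3), inflate_clusters(82))
```

For 82 clusters under equal cluster sizes, the compensated number is 82 / .9 ≈ 91.1, which is rounded up to the next whole cluster.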

In general, the methods for sample size determination require an educated guess of the ICC. Researchers may use values obtained from the literature or expert knowledge. For instance, Table 11.1 in Moerbeek and Teerenstra (2016) shows a summary of papers that report estimates of ICCs in CRTs across various scientific fields. Researchers may consider a sensitivity analysis and determine the sample size for a range of plausible values of the ICC to study the degree to which sample size depends on the ICC. Future research could further explore sample size determination when variances vary across treatment conditions.

Another consideration when using the provided functions is computational cost, since our method to determine the required sample size relies on simulations. Across the conditions we ran, the shortest running time was approximately 5 minutes, while the longest was 35 hours. These results were obtained with a 16-core GPU with 250 GB of RAM. To improve efficiency and reduce computation time, the functions in the package employ a binary search algorithm to find the required sample size. The combinations with the longest running times always corresponded to the evaluation of Hypothesis set 1. Most likely, advances in technology will reduce this computational cost in the near future.
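The idea of a binary search over the sample size can be sketched as follows. This is not the package's implementation: to keep the example self-contained and fast, the simulation-based Bayesian power check is replaced by a closed-form frequentist power approximation, and the search assumes that the criterion, once met, stays met for all larger sample sizes. All function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def frequentist_power(n1, n2, icc, delta, alpha=0.05):
    """Stand-in criterion: normal-approximation power for a two-arm CRT."""
    z = NormalDist()
    se = sqrt(4 * (1 + (n1 - 1) * icc) / (n1 * n2))
    return z.cdf(delta / se - z.inv_cdf(1 - alpha / 2))


def smallest_sufficient_size(criterion_met, low, high):
    """Smallest integer in [low, high] satisfying a monotone criterion, else None."""
    if not criterion_met(high):
        return None              # maximum reached without meeting the criterion
    while low < high:
        mid = (low + high) // 2
        if criterion_met(mid):
            high = mid           # mid is sufficient; look for something smaller
        else:
            low = mid + 1        # mid is insufficient; discard the lower half
    return low


icc, delta = 3.5 / 48.5, 1.39 / sqrt(48.5)

# Number of clusters for cluster size 30 (practical example, power .8).
n2 = smallest_sufficient_size(lambda n: frequentist_power(30, n, icc, delta) >= 0.8, 6, 1000)
# Cluster size for 30 clusters, ICC = .1, effect size = .2: never sufficient.
n1 = smallest_sufficient_size(lambda n: frequentist_power(n, 30, 0.1, 0.2) >= 0.8, 5, 1000)
print(n2, n1)
```

Each step halves the candidate range, so about ten evaluations cover sample sizes up to 1,000, which matters when every evaluation requires simulating and analyzing thousands of data sets. The sketch returns `None` when the maximum is insufficient, whereas the tables in the article report the maximum of 1,000 in that situation.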

Considering the growing popularity of Bayes factors for hypothesis testing in psychology (Heck et al., 2022), our method for sample size determination in CRTs is an important advance in research. We have previously discussed the disadvantages of NHST and how Bayes factors provide an alternative approach to hypothesis testing that can avoid these disadvantages. One of the drawbacks of NHST is that, in practice, the decision to reject or maintain the null hypothesis relies on an arbitrary level of significance. To avoid a similar misinterpretation of the thresholds in our method, it is important to note that the Bayes factor quantifies the evidence for the hypotheses under consideration and may take values from 0 to infinity. Although the thresholds used in this article are often seen in practice, we encourage researchers to choose thresholds based on the degree of support they aim for.

This article introduced a method for sample size determination in CRTs tailored for hypothesis testing using Bayes factors. Moreover, we provided the practical implementation of the method through functions in the R package SSD_Bayes_ML. To our knowledge, this was the first contribution to Bayesian sample size determination for CRTs. This article is part of a larger 4-year project; in the years to come, we aim to focus on trials with more than two treatment conditions (thereby extending the article by Fu [2022] to the multilevel setting) and cluster-randomized crossover trials. We also consider alternative measures for the evaluation of hypotheses, such as the Generalized Order-Restricted Information Criterion Approximation (GORICA) (Altinisik et al., 2021; Vanbrabant et al., 2020).

Appendix A

Appendix B

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by the Netherlands Organisation of Scientific Research (NWO), grant number 406.21.GO.006.

ORCID iD

Camila Natalia Barragán Ibáñez

Authors

CAMILA NATALIA BARRAGÁN IBÁÑEZ is a PhD candidate at Utrecht University, Padualaan 14, 3584 CH Utrecht, The Netherlands. E-mail: c.n.barraganibanez@uu.nl. Her research interests are statistical power on multilevel models, hypotheses evaluation, and Bayesian statistics.

MIRJAM MOERBEEK is an associate professor at Utrecht University, Padualaan 14, 3584 CH Utrecht, The Netherlands. E-mail: m.moerbeek@uu.nl. Her research interests are statistical power analysis and optimal experimental design, especially for hierarchical and survival data.

References

Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64(4), 912–923. https://doi.org/10.2307/3803199

Altinisik, Van Lissa, C. J., Hoijtink, Oldehinkel, A. J., & Kuiper, R. M. (2021). Evaluation of inequality constrained hypotheses using a generalization of the AIC. Psychological Methods, 26(5), 599–621. https://doi.org/10.1037/met0000406

Ausems, Mesters, van Breukelen, & De Vries (2002). Short-term effects of a randomized computer-based out-of-school smoking prevention trial aimed at elementary schoolchildren. Preventive Medicine, 34(6), 581–589. https://doi.org/10.1006/pmed.2002.1021

Balluerka, Gómez, & Hidalgo (2005). The controversy over null hypothesis significance testing revisited. Methodology, 1(2), 55–70. https://doi.org/10.1027/1614-1881.1.2.55

Bates, Mächler, Bolker, & Walker (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Beeres

Arnö

Pulkki-Brännström

A.-M.

Nilsson

Galanti

M. R.

(2022). Evaluation of the Swedish school-based program “tobacco-free DUO” in a cluster randomized controlled trial (TOPAS study). Results at 2-year follow-up. Preventive Medicine, 155, Article 106944. https://doi.org/10.1016/j.ypmed.2021.106944

Binary Search Algorithm. (2024). WikipediaPage Version ID: 1218941965.

Bürkner

P.-C.

(2018). Advanced Bayesian multilevel modeling with the R Package brms. The R Journal, 10(1), 395. https://doi.org/10.32614/RJ-2018-017

Campbell

M. J.

Walters

S. J.

(2014, April). How to design, analyse and report cluster randomised trials in medicine and health related research (1st ed.). Wiley.

10.

Conigrave

J. H.

Lee

K. S. K.

Dobbins

Wilson

Padarian

Ivers

Morley

Haber

P. S.

Vnuk

Marshall

Conigrave

(2024). No improvement in AUDIT-C screening and brief intervention rates among wait-list controls following support of Aboriginal Community Controlled Health Services: Evidence from a cluster randomised trial. BMC Health Services Research, 24(1), 813. https://doi.org/10.1186/s12913-024-11214-6

11.

De Santis

(2004). Statistical evidence and sample size determination for Bayesian hypothesis testing. Journal of Statistical Planning and Inference, 124(1), 121–144. https://doi.org/10.1016/S0378-3758(03)00198-8

12.

de Schoot

R. V.

Hoijtink

Jan-Willem

(2011). Moving beyond traditional null hypothesis testing: Evaluating expectations directly. Frontiers in Psychology, 2, Article 24. https://doi.org/10.3389/fpsyg.2011.00024

13.

Dienes

Coulton

Heather

(2018). Using Bayes factors to evaluate evidence for no effect: Examples from the SIPS project. Addiction, 113(2), 240–246. https://doi.org/10.1111/add.14002

14.

Donner

Klar

(2010). Design and analysis of cluster randomization trials in health research. Wiley.

15.

Eldridge

Kerry

(2012, January). A practical guide to cluster randomised trials in health services research (1st ed.). Wiley.

16.

(2022, March). Sample size determination for Bayesian informative hypothesis testing [Doctoral dissertation, Utrecht University]. https://doi.org/10.33540/1221

17.

Hoijtink

Moerbeek

(2021). Sample-size determination for the Bayesian t test and Welch’s test using the approximate adjusted fractional Bayes factor. Behavior Research Methods, 53(1), 139–152. https://doi.org/10.3758/s13428-020-01408-1

18.

Gelfand

A. E.

Wang

(2002). A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science, 17(2), 193–208. https://doi.org/10.1214/ss/1030550861

19.

Goldstein

(2011). Multilevel statistical models (4th ed.). Wiley.

20.

Hoijtink

Mulder

Van Lissa

C. J.

(2019, February). Bain: Bayes factors for informative hypotheses [Institution: Comprehensive R Archive Network Pages: 0.2.11]. Retrieved January 7, 2025, from https://CRAN.R-project.org/package=bain

21.

Mulder

Deković

Hoijtink

(2014). Bayesian evaluation of inequality constrained hypotheses. Psychological Methods, 19(4), 511–527. https://doi.org/10.1037/met0000017

22.

Mulder

Hoijtink

(2018). Approximated adjusted fractional Bayes factors: A general method for testing informative hypotheses. British Journal of Mathematical and Statistical Psychology, 71(2), 229–261. https://doi.org/10.1111/bmsp.12110

23.

Hayes

R. J.

Moulton

L. H.

(2009). Cluster randomised trials. CRC Press.

24.

Heck

D. W.

Boehm

Böing-Messing

Bürkner

P.-C.

Derks

Dienes

Karimova

Kiers

H. A. L.

Klugkist

Kuiper

R. M.

Lee

M. D.

Leenders

Leplaa

H. J.

Linde

Meijerink-Bosman

Moerbeek

. . . Hoijtink

(2022). A review of applications of the Bayes factor in psychological research. Psychological Methods, 28(3), 558–579. https://doi.org/10.1037/met0000454

25.

Hemming

Girling

A. J.

Sitch

A. J.

Marsh

Lilford

R. J.

(2011). Sample size calculations for cluster randomised controlled trials with a fixed number of clusters. BMC Medical Research Methodology, 11(1), Article 102. https://doi.org/10.1186/1471-2288-11-102

26.

Hemming

Kasza

(n.d.). The Shiny CRT Calculator: Power and Sample size for Cluster Randomised Trials. https://clusterrcts.shinyapps.io/rshinyapp/

27.

Hitchcock

Westwell

M. S.

(2017). A cluster-randomised, controlled trial of the impact of Cogmed Working Memory Training on both academic performance and regulation of social, emotional and behavioural challenges. Journal of Child Psychology and Psychiatry, 58(2), 140–150. https://doi.org/10.1111/jcpp.12638

28.

Hoijtink

(2012). Informative hypotheses: Theory and practice for behavioral and social scientists. CRC.

29.

Hoijtink

Mulder

(2019). Bayesian evaluation of informative hypotheses for multiple populations. British Journal of Mathematical and Statistical Psychology, 72(2), 219–243. https://doi.org/10.1111/bmsp.12145

30.

Hoijtink

Mulder

van Lissa

(2019). A tutorial on testing hypotheses using the Bayes factor. Psychological Methods, 24(5), 539–556. https://doi.org/10.1037/met0000201

31.

Hox

J. J.

Moerbeek

van de Schoot

(2017, September). Multilevel analysis: Techniques and applications (3rd ed.). Routledge.

32.

JASP Team. (2025). JASP (Version 0.19.3) [Computer software]. https://jasp-stats.org/

33.

Jeffreys

(1983). Theory of probability (3rd ed). Oxford University Press.

34.

Kass

R. E.

Raftery

A. E.

(1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572

35.

Klugkist

Laudy

Hoijtink

(2005). Inequality constrained analysis of variance: A Bayesian approach. Psychological Methods, 10(4), 477–493. https://doi.org/10.1037/1082-989X.10.4.477

36.

Klugkist

van Wesel

Bullens

(2011). Do we know what we test and do we test what we want to know? International Journal of Behavioral Development, 35(6), 550–560. https://doi.org/10.1177/0165025411425873

37.

Lazega

Snijders

T. A. B.

(Eds.). (2016). Multilevel network analysis for the social sciences: Theory, methods and applications. Springer.

38.

Makowski

Lüdecke

Ben-Shachar

M. S.

Patil

Wilson

M. K.

Wiernik

B. M.

(2019, April). bayestestR: Understand and Describe Bayesian Models and Posterior Distributions [Institution: Comprehensive R Archive Network Pages: 0.15.0]. Retrieved January 7, 2025, from https://CRAN.R-project.org/package=bayestestR

39.

M’Lan

C. E.

Joseph

Wolfson

D. B.

(2006). Bayesian sample size determination for case-control studies. Journal of the American Statistical Association, 101(474), 760–772.

40.

Moerbeek

(2005). Randomization of clusters versus randomization of persons within clusters: Which is preferable? The American Statistician, 59(1), 72–78. https://doi.org/10.1198/000313005X20727

41.

Moerbeek

(2006). Power and money in cluster randomized trials: When is it worth measuring a covariate? Statistics in Medicine, 25(15), 2607–2617. https://doi.org/10.1002/sim.2297

42.

Moerbeek

Teerenstra

(2016). Power analysis of trials with multilevel data. CRC Press, Taylor & Francis Group.

43.

Moerbeek

van Breukelen

G. J. P.

Berger

M. P. F.

(2000). Design issues for experiments in multilevel populations. Journal of Educational and Behavioral Statistics, 25(3), 271. https://doi.org/10.2307/1165206

44.

Mulder

(2014). Prior adjusted default Bayes factors for testing (in)equality constrained hypotheses. Computational Statistics & Data Analysis, 71, 448–463. https://doi.org/10.1016/j.csda.2013.07.017

45.

Mulder

Van Lissa

Williams

D. R.

Olsson-Collentine

Boeing-Messing

Fox

J.-P.

(2019, October). BFpack: Flexible Bayes factor testing of scientific expectations [Institution: Comprehensive R Archive Network Pages: 1.4.0]. Retrieved January 7, 2025, from https://CRAN.R-project.org/package=BFpack

46.

Mulder

Wagenmakers

E.-J.

Marsman

(2022). A generalization of the savage–dickey density ratio for testing equality and order constrained hypotheses. The American Statistician, 76(2), 102–109.

47.

Murray

D. M.

(1998). Design and analysis of group-randomized trials. Oxford University Press.

48.

Oakes

M. W.

(1987). Statistical inference: A commentary for the social and behavioural sciences (Reprint). Wiley.

49.

Pezeshk

(2003). Bayesian techniques for sample size determination in clinical trials: A short review. Statistical Methods in Medical Research, 12(6), 489–504.

50.

Raudenbush

S. W.

(1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2), 173–185. https://doi.org/10.1037/1082-989X.2.2.173

51.

Raudenbush

S. W.

Bryk

A. S.

(2010). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage Publication.

52.

Rosário

Vasiljevic

Pas

Angus

Ribeiro

Fitzgerald

(2022). Efficacy of a theory-driven program to implement alcohol screening and brief interventions in primary health-care: A cluster randomized controlled trial. Addiction, 117(6), 1609–1621. https://doi.org/10.1111/add.15782

53.

Rutterford

Copas

Eldridge

(2015). Methods for sample size determination in cluster randomized trials. International Journal of Epidemiology, 44(3), 1051–1067. https://doi.org/10.1093/ije/dyv113

54.

Sanchez

Z. M.

Valente

J. Y.

Galvão

P. P.

Gubert

F. A.

Melo

M. H. S.

Caetano

S. C.

Mari

J. J.

Cogo-Moreira

(2021). A cluster randomized controlled trial evaluating the effectiveness of the school-based drug prevention program #Tamojunto2.0. Addiction, 116(6), 1580–1592. https://doi.org/10.1111/add.15358

55.

Schouten

H. J. A.

(1999). Sample size formula with a continuous outcome for unequal group sizes and unequal variances. Statistics in Medicine, 18(1), 87–91. https://doi.org/10.1002/(SICI)1097-0258(19990115)18:1<87::AID-SIM958>3.0.CO;2-K

56.

Kariyama

Oyamada

Matsushita

Nishimura

Tezuka

Sunami

Furukawa

T. A.

Sahker

Kawaguchi

Kobashi

Nishina

Otsuka

Tsujimoto

Horie

Yoshiji

Yuzuriha

Nouso

(2025). Effectiveness of screening and ultra-brief intervention for hazardous drinking in primary care: Pragmatic cluster randomised controlled trial. BMJ, 390, e083985. https://doi.org/10.1136/bmj-2024-083985

57.

Troncoso

Humphrey

(2021). Playing the long game: A multivariate multilevel non-linear growth curve model of long-term effects in a randomized trial of the Good Behavior Game. Journal of School Psychology, 88, 68–84. https://doi.org/10.1016/j.jsp.2021.08.002

58.

Umbach

Raine

Leonard

N. R.

(2018). Cognitive decline as a result of incarceration and the effects of a CBT/MT intervention: A cluster-randomized controlled trial. Criminal Justice and Behavior, 45(1), 31–55. https://doi.org/10.1177/0093854817736345

59.

van Breukelen

G. J. P.

Candel

M. J. J. M.

Berger

M. P. F.

(2007). Relative efficiency of unequal versus equal cluster sizes in cluster randomized and multicentre trials. Statistics in Medicine, 26(13), 2589–2603. https://doi.org/10.1002/sim.2740

60.

Vanbrabant

Van Loey

Kuiper

R. M.

(2020). Evaluating a theory-based hypothesis against its complement using an AIC-type information criterion with an application to facial burn injury. Psychological Methods, 25(2), 129–142. https://doi.org/10.1037/met0000238

61.

Veenman

Stefan

A. M.

Haaf

J. M.

(2024). Bayesian hierarchical modeling: An introduction and reassessment. Behavior Research Methods, 56(5), 4600–4631.

62.

Wagenmakers

E.-J.

Lodewyckx

Kuriyal

Grasman

(2010). Bayesian hypothesis testing for psychologists: A tutorial on the savage–dickey method. Cognitive Psychology, 60(3), 158–189.

63.

Wang

Gelfand

A. E.

(2002). A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science, 17(2), 193–208.

64.

Weiss

(1997). Bayesian sample size calculations for hypothesis testing. Journal of the Royal Statistical Society Series D: The Statistician, 46(2), 185–191.

65.

Wikipedia contributors. (2025, August 9). Binary search - Wikipedia. https://en.wikipediaorg/wiki/Binary_search

66.

Wilson

K. J.

(2022, August). Bayesian design and analysis of two-arm cluster randomised trials using assurance. arXiv. arXiv:2208.12509. https://doi.org/10.48550/arXiv.2208.12509
