Abstract
Sample-size justification is an essential aspect of rigorous research in the behavioral and social sciences and helps to ensure studies are adequately powered, minimize resource waste, and reduce participant burden. However, researchers often face challenges in navigating the array of sample-size-planning methods available, particularly when balancing inferential goals and statistical frameworks. The SampleSizePlanner (SSP), originally developed to assist researchers in selecting appropriate sample-size determination methods for two-group designs, has been expanded to address 2 × 2 analysis-of-variance (ANOVA) designs. In this article, we introduce novel 2 × 2 design extensions to the SSP, including tools for Bayesian methods, such as the Bayes factor equivalence interval and the region of practical equivalence, and a frequentist approach. The SSP offers an accessible ShinyApp interface and R package, enabling researchers to streamline decision-making and apply various sample-size-planning methods with minimal computational overhead. Ready-to-use reporting templates foster transparency in sample-size justification. In the article, we address the practical application of these tools through comprehensive examples, demonstrating their relevance to scenarios such as interaction testing and equivalence estimation. By providing a standardized and accessible approach to sample-size planning, this work supports researchers in conducting reproducible and well-powered studies while addressing gaps in sample-size planning for 2 × 2 ANOVA designs.
Social and behavioral scientists are increasingly expected to plan and justify study sample sizes (Maxwell, 2004). Sample-size planning and justification help avoid underpowered studies, which provide low informational value and risk biasing empirical findings (Kovacs et al., 2022). They optimize resources, such as time and data-collection costs (Lakens, 2022), and minimize unnecessary participant burden. For example, careful sample-size planning can help balance the number and frequency of instances of data collection against the anticipated informational gain from additional data collection, thereby promoting both study efficiency and ethical rigor (Bacchetti et al., 2005).
Despite these advantages, researchers still struggle with planning sample sizes and providing justifications for their choices. To plan their sample size, researchers are expected to conduct formal power calculations before conducting a study (Carter et al., 2017; Papalouka et al., 2023). Various stand-alone programs, packages, and online applets are available to facilitate this process (e.g., MorePower 6.0, Campbell & Thompson, 2012; G*Power 3.1, Faul et al., 2009; BayesianPower, Klaassen, 2020; BFDA ShinyApp, Stefan et al., 2019). However, many find the calculations associated with sample-size planning complicated and continue to resort to heuristics or common field practices (Bakker et al., 2016; Washburn et al., 2018). This reliance on informal practices reflects improper sample-size planning and contributes to a lack of transparent, methodologically sound sample-size justifications (Tripathi et al., 2020), ultimately undermining the informational value of the collected data.
One contributing factor is the overwhelming array of tools, which, combined with a lack of practical guidance, makes navigating among the many sample-size methods challenging (Kovacs et al., 2022). Furthermore, although these tools perform calculations related to sample-size planning reliably, they offer limited support for applying results in context or reporting sample-size justifications clearly. In practice, this leads to a disconnect between the mechanical execution of sample-size planning and its integration into coherent research planning and reporting.
Against this backdrop, the SampleSizePlanner (SSP; Kovacs et al., 2022) was developed to reduce the complexity of navigating among sample-size methods and to guide users toward those appropriate for their goals. SSP provides a bird’s-eye view across different types of statistical inference, helping researchers determine which inferential goal is best suited to their own research design. This bird’s-eye view also offers pedagogical value by clarifying the choice of inference and highlighting the specific parameters required for sample-size planning under each of those types of inference. Coupled with customizable reporting templates, SSP promotes more transparent and rigorous reporting practices (Lakens, 2022). Furthermore, with growing interest in Bayesian methods, SSP addresses the need for accessible tools that support Bayesian sample-size planning, which remains comparatively underdeveloped (Jevremov & Pajić, 2024).
In this article, we describe an extended version of the SSP designed to provide practical guidance for researchers by addressing key decision points in selecting appropriate sample-size methods for 2 × 2 factorial designs in addition to the existing modules for two-group designs. Building on the original SSP (Kovacs et al., 2022), this extension incorporates additional decision points that guide researchers through selecting sample-size methods. Specifically, it emphasizes research design—2 × 2 group design or two-group design—as a first step, followed by decisions on inferential goals—whether estimating a parameter or testing a hypothesis—and the choice of statistical framework—frequentist or Bayesian (see Fig. S1 in the Supplemental Material available online).
By mapping these key decision points onto a diagram, researchers can use the SSP to quickly identify and access the appropriate sample-size-determination methods for their given study design. For example, if researchers want to conduct a hypothesis test that two different populations have different means within the Bayesian framework, then they will be directed to choose between the “Bayes Factor design analysis” or “predetermined sample size with Bayes Factor” options (see Stefan et al., 2019). If, on the other hand, they opt for the frequentist framework, they can consider using the traditional power-analysis approach or a power-curve approach, depending on whether they want to test a single effect size or a range of effect sizes. The traditional power analysis estimates the minimum sample size needed to detect a specified effect size with a given Type I error rate and statistical power (Cohen, 1992).
We introduce additional ready-to-use analysis code and a ShinyApp designed to help researchers apply and report key sample-size-determination techniques specifically for 2 × 2 factorial designs. We develop new modules, encompassing a frequentist module based on the traditional power approach and innovative Bayesian modules, which feature (a) Bayes factor equivalence interval for testing interval null hypotheses, (b) region of practical equivalence (ROPE) for Bayesian estimation inference, and (c) Bayes factor for testing point null hypotheses, all in the context of the 2 × 2 analysis-of-variance (ANOVA) design.
A novel contribution of this article is the development of sample-size-planning modules in the context of Bayes factor equivalence interval and ROPE in 2 × 2 factorial designs. The Bayes factor equivalence interval quantifies evidence for the hypothesis that an effect falls within a specified equivalence interval (versus outside that same interval) using the Bayes factor (Rouder et al., 2009; van Ravenzwaaij et al., 2019). ROPE tests whether the highest density region of the posterior distribution of an effect size falls fully within a region of practical equivalence (Kruschke, 2018). Although these methods have been extensively discussed in the literature (Kruschke, 2018; Kruschke & Liddell, 2018; Linde et al., 2023; Morey & Rouder, 2011; Rouder et al., 2009; van Ravenzwaaij et al., 2019), their practical application in sample-size planning has remained underdeveloped. This SSP extension further contributes to this gap by introducing specialized modules tailored to these approaches.
How to Use This Guide
We borrow the structure from Kovacs et al. (2022) to make it easier for the reader to navigate through both articles. In the upcoming sections, we demonstrate the application of our ShinyApp extension and our R package for several techniques of sample-size determination for 2 × 2 ANOVA designs. In the ShinyApp, one can adjust primary parameters using a point-and-click interface, and the R package allows for more detailed customization through additional preset parameters. Both allow for saving or copying a text template of sample-size results. Suggested justification template texts are provided at decision points in the ShinyApp, but users must supplement them with additional details specific to their study design. These justifications should be detailed and theoretically supported, serving as a basis for describing sample-size choices in articles, preregistrations, or grant proposals.
We use the following example study to guide the reader through the methods. Anna, a psychologist, wants to test whether dog ownership is associated with anxiety in adults. Furthermore, she wants to examine if there are sex differences regarding anxiety. To evaluate these, Anna sets up a study with two factors. The first factor is dog ownership, in which participants are classified as either current dog owners or nonowners. The second factor is sex, and participants are categorized as either male or female. This results in a 2 × 2 ANOVA design, with dog ownership (owner vs. nonowner) and sex (male vs. female) as the factors A and B, respectively. Anna’s outcome variable is participants’ anxiety levels measured by the seven-item Generalized Anxiety Disorder (GAD-7) scale (Spitzer et al., 2006) from 0 (minimal anxiety) to 21 (severe anxiety). Justification for each sample-size-planning method will be based on Anna’s decisions in this research scenario.
In some methods, Anna hypothesizes the presence of a main or interaction effect (target population: employed adults), whereas in others, she expects a null effect (target population: retired adults), allowing us to demonstrate different sample-size planning methods. Note that in reality, researchers should base their hypotheses a priori on a body of evidence or plausible theoretical considerations.
The browser-based ShinyApp (https://doi.org/10.5281/zenodo.17362197) or the R command devtools::install_github("marton-balazs-kovacs/SampleSizePlanner") can be used to access the various methods for determining sample size in a 2 × 2 ANOVA design.
Method
We use a simulation-based approach in R (R Core Team, 2022; RStudio Team, 2020) to determine the required sample size needed for each of the newly introduced methods. For each candidate sample size, we simulate data under a 2 × 2 factorial design, sampling observations for four groups from normal distributions defined by the unstandardized means and a common standard deviation. Subsequently, we apply the target inference procedure to each data set. We record whether the effect of interest is detected under that method’s criterion and compute the true positive rate (TPR) as the proportion of detections over iterations. An optimization routine then adjusts the sample size between a lower (N = 4) and upper bound (N = MaxN; set by the user) until the specified TPR is reached. To make Bayesian calculations accessible, we precomputed results on a high-performance cluster after standardizing and deduplicating parameter combinations. The ShinyApp locates the nearest precomputed scenario via least squares matching and returns its unstandardized means, sample-size recommendation, and expected TPR (for details about the procedure, see Appendix; for R code, see the GitHub Repository, https://doi.org/10.5281/zenodo.17362197). Certain parameters are common to most of the provided sample-size-planning methods and warrant detailed clarification.
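Concretely, the simulation loop can be sketched as follows. This is an illustrative Python sketch, not the package's R implementation: the function names are ours, the linear scan over N stands in for the optimization routine, and the decision rule shown (a z-test on the interaction contrast) is a simplified stand-in for each method's actual criterion.

```python
import random
import statistics

def simulate_tpr(decide, mu, sigma, n, iters=2000, seed=1234):
    """Estimate the TPR: the proportion of simulated 2 x 2 data sets
    in which `decide` flags the effect of interest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(iters):
        # one sample of n observations per cell, normal with common sigma
        cells = [[rng.gauss(m, sigma) for _ in range(n)] for m in mu]
        hits += decide(cells)
    return hits / iters

def find_sample_size(decide, mu, sigma, tpr_target=0.8, max_n=500):
    """Smallest per-group N between 4 and max_n reaching the target TPR
    (a linear scan for clarity; SSP adjusts N with an optimization routine)."""
    for n in range(4, max_n + 1):
        if simulate_tpr(decide, mu, sigma, n) >= tpr_target:
            return n
    return None

def interaction_z_test(cells, crit=1.96):
    """Illustrative decision rule (not SSP's exact test): a two-sided
    z-test on the interaction contrast m11 - m12 - m21 + m22."""
    n = len(cells[0])
    means = [statistics.mean(c) for c in cells]
    s2 = statistics.mean([statistics.variance(c) for c in cells])  # pooled
    contrast = means[0] - means[1] - means[2] + means[3]
    se = (4 * s2 / n) ** 0.5
    return abs(contrast / se) > crit
```

For example, `find_sample_size(interaction_z_test, mu=(5, 7, 9, 9), sigma=4)` would return the smallest per-group N at which this simplified interaction test reaches a TPR of 0.8 under the assumed means.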
The mean vector (Mu) contains the four unstandardized group means “m11,” “m12,” “m21,” and “m22” of the dependent variable, which correspond to the subgroups A1|B1, A1|B2, A2|B1, and A2|B2 of factors A|B, respectively (see Table 1). For instance, if factor A represents dog ownership (with levels dog owner and nonowner) and factor B represents sex (with levels male and female), then m11 corresponds to the mean anxiety score for males in the dog owner condition. Based on the unstandardized group means, the marginal means are calculated as the average of the group means for each level of a factor, averaged across the levels of the other factor (see Table 1). For example, the marginal mean for factor level A1 is calculated as the average of the group means across the levels of factor B (m11 and m12).
Table 1. Group Means and Marginal Means for 2 × 2 ANOVA Designs

                  B1               B2               Marginal mean
A1                m11              m12              (m11 + m12)/2
A2                m21              m22              (m21 + m22)/2
Marginal mean     (m11 + m21)/2    (m12 + m22)/2

Note: In this table, we present the group means m11, m12, m21, and m22 and their respective marginal means, as used in this article. ANOVA = analysis of variance.
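As a quick illustration of these definitions, the marginal means can be computed from the four cell means as follows (a Python sketch; the function name is ours, not part of the package):

```python
def marginal_means(mu):
    """Marginal means from the cell means (m11, m12, m21, m22),
    where mij is the mean of subgroup Ai|Bj."""
    m11, m12, m21, m22 = mu
    return {
        "A1": (m11 + m12) / 2, "A2": (m21 + m22) / 2,
        "B1": (m11 + m21) / 2, "B2": (m12 + m22) / 2,
    }
```

For instance, `marginal_means((5, 7, 9, 9))` yields marginal means of 6.0 and 9.0 for the levels of factor A, and 7.0 and 8.0 for the levels of factor B.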
All groups are assumed to have equal size N. The standard deviation is assumed to be the same in all groups. The TPR is the probability of detecting an effect when it truly exists (commonly known as power in frequentist hypothesis testing). We term it “TPR”—rather than power—to also capture the long-run proportion of simulated data sets (under the assumed model) meeting the chosen Bayesian evidence threshold (e.g., Bayes factor > 10 or highest density interval [HDI] ⊂ ROPE), thereby highlighting that any hit-rate metric depends on the selected decision boundary. The equivalence band (EqBand) is the margin of the effect-size range around zero, quantified as a standardized mean difference. Thus, a standardized interval around zero is specified as [–EqBand; EqBand]. Effect sizes within this range are not considered to be practically relevant.
In the following, we discuss four sample-size-planning methods for a 2 × 2 ANOVA: (a) the traditional frequentist ANOVA, (b) a Bayesian ANOVA testing a point null hypothesis against a composite alternative, (c) a Bayesian ANOVA testing an interval null hypothesis against its complementary alternative, and (d) a Bayesian ANOVA using ROPE. Relative evidence in the Bayesian approaches is quantified using Bayes factors with default priors (Rouder et al., 2012) for Methods b and c above (Morey & Rouder, 2011; Rouder et al., 2012; van Ravenzwaaij et al., 2019), whereas Method d above relies on evidence quantified using a ROPE (Kruschke, 2018; Kruschke & Liddell, 2018). Note that for each separate method, the SSP requests as input a desired long-run probability of obtaining a statistical result above or below some cutoff score (called “TPR”). The TPR for the calculated sample size will often match the desired TPR exactly, but sometimes, small deviations may occur.
Testing
Frequentist 2 × 2 ANOVA (effect ≠ 0)
Description
The technique returns a minimum sample size for given parameters based on a Type I error fixed to 0.05 using a frequentist 2 × 2 ANOVA design. The ANOVA assesses both main and interaction effects, compares the variance between groups for the targeted effect to the variance within groups, and makes inference based on F statistics and their associated p values. For more details, refer to Cardinal and Aitken (2013).
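For a balanced design, these F statistics follow from a simple sums-of-squares decomposition, which the following Python sketch illustrates (our illustration; SSP itself relies on R's standard ANOVA routines):

```python
import statistics

def anova_2x2(cells):
    """Balanced 2 x 2 ANOVA: cells = [a1b1, a1b2, a2b1, a2b2], each a list
    of n observations. Returns F statistics (df = 1 each) for main effect A,
    main effect B, and the interaction."""
    n = len(cells[0])
    means = [statistics.mean(c) for c in cells]
    grand = sum(means) / 4
    a = [(means[0] + means[1]) / 2, (means[2] + means[3]) / 2]  # marginal A
    b = [(means[0] + means[2]) / 2, (means[1] + means[3]) / 2]  # marginal B
    ss_a = 2 * n * sum((m - grand) ** 2 for m in a)
    ss_b = 2 * n * sum((m - grand) ** 2 for m in b)
    ss_cells = n * sum((m - grand) ** 2 for m in means)
    ss_ab = ss_cells - ss_a - ss_b  # interaction = between-cells remainder
    ss_err = sum(sum((x - m) ** 2 for x in c) for c, m in zip(cells, means))
    mse = ss_err / (4 * (n - 1))  # error df = 4(n - 1)
    # each effect has df = 1, so MS equals SS
    return ss_a / mse, ss_b / mse, ss_ab / mse
```

Each F value is then compared against the F(1, 4(n − 1)) distribution to obtain a p value.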
Study context
Anna would like to know what sample size she needs to have a 0.8 probability of finding a significant difference of anxiety (a) in the dog-owner versus nonowner group, (b) in males versus females, and (c) regarding an interaction between dog ownership and sex. Therefore, she hypothesizes that the alternative hypotheses are true. She would like to test these hypotheses in a frequentist framework and would like to know what sample size would be needed given the expected unstandardized group means m11 = 5 (male dog owners), m12 = 7 (female dog owners), m21 = 9 (male dog nonowners), m22 = 9 (female dog nonowners), and SD = 4 of the GAD-7 anxiety scores (see Table 2). These values are informed by a preliminary pilot study conducted with the target population of employed adults. She constrains the maximum number of participants in a group to MaxN = 500 (more would not be feasible given Anna’s study budget).
Table 2. Expected Means and Standard Deviations of the Scores on the GAD-7 Anxiety Scale Across Groups Among Employed Adults

                  Male, M (SD)    Female, M (SD)
Dog owner         5 (4)           7 (4)
Nonowner          9 (4)           9 (4)

Note: The table contains expected unstandardized group means and standard deviations for the example study context, which are informed by a preliminary pilot study. GAD-7 = seven-item Generalized Anxiety Disorder scale.
Parameters
Effect: the effect of interest (main effect A, main effect B, interaction effect);
Mu: the unstandardized mean of the dependent variable for each group;
Sigma: the standard deviation of the dependent variable for the groups;
TPR: the desired long-run probability of obtaining a significant result, given the means;
MaxN: the maximum group size;
Iter: the number of iterations to calculate the TPR;
Alpha: the level of significance.
How to use the package
One can use the ShinyApp and set the given parameters or run the following code in R: ssp_power_traditional_anova(effect = "Interaction Effect", iter = 5000, max_n = 500, mu = c(5, 7, 9, 9), sigma = 4, tpr = 0.8, alpha = 0.05, seed = 1234).
How to report your sample-size estimation
We conducted a power analysis with an alpha of .05 to estimate the required sample size. We set the target power at .80 because it is (a) the common standard in the field or (b) the journal publishing requirement. The expected group means were 5, 7, 9, and 9 for subgroups 1|1, 1|2, 2|1, and 2|2 of factors A|B, respectively, and we assumed a common standard deviation of 4. Based on these parameters, a minimum per-group sample size of (a) 15, (b) 126, and (c) 127 was required to achieve the target power .80. The effective power was (a) .81, (b) .80, and (c) .80 for the (a) main effect A, (b) main effect B, and (c) interaction effect, respectively.
Bayesian 2 × 2 ANOVA (effect ≠ 0)
Description
The technique returns a minimum sample size for given parameters using a Bayesian 2 × 2 ANOVA design. In the ShinyApp, the method uses a default Bayes factor for ANOVA designs (Rouder et al., 2012), which is calculated by using a Cauchy prior centered at 0 and with a scale fixed to 1 / sqrt(2) for the standardized effect size Cohen’s d (it is possible to customize this value in the R package). The Bayes factor is a likelihood ratio for observing the data given a point null hypothesis (H0; Cohen’s d = 0) and a composite alternative hypothesis (H1; Cohen’s d ≠ 0); therefore, the Bayes factor provides relative evidence for the alternative hypothesis over the point null hypothesis. A high Bayes factor (larger than the threshold Thresh, which is set in the SSP) supports the alternative hypothesis, indicating the effect is likely to be different from zero and, thus, the marginal means are different. For more details, refer to Rouder et al. (2012).
Study context
Anna would like to know what sample size she needs to have a 0.8 probability of obtaining a Bayes factor larger than 10 using a default Cauchy prior with a scale of 1 / sqrt(2). She would like to determine the sample size for hypothesized differences (a) in the dog-owner versus nonowner group, (b) in males versus females, and (c) regarding an interaction between dog ownership and sex using a Bayesian framework. Therefore, she hypothesizes that the alternative hypotheses are true. She would like to know what sample size would be needed given the expected unstandardized group means m11 = 5 (male dog owners), m12 = 7 (female dog owners), m21 = 9 (male dog nonowners), m22 = 9 (female dog nonowners), and SD = 4 of the GAD-7 anxiety scores (see Table 2). These values are informed by a preliminary pilot study conducted with the target population of employed adults. The maximum number of participants per group is constrained to MaxN = 500 (more would not be feasible given Anna’s study budget).
Parameters
Effect: the effect of interest (main effect A, main effect B, interaction effect);
Mu: the unstandardized mean of the dependent variable for each group;
Sigma: the standard deviation of the dependent variable for the groups;
TPR: the desired long-run probability of obtaining a Bayes factor higher than Thresh given the means;
Thresh: the threshold of the Bayes factor, which is fixed to 10 in the ShinyApp;
PriorScale: the scale of the Cauchy prior, which is fixed to 1 / sqrt(2) in the ShinyApp;
MaxN: the maximum group size, which is fixed to 500 in the ShinyApp;
Iter: the number of iterations to calculate the TPR, which is fixed to 5,000 in the ShinyApp.
How to use the package
You can use the ShinyApp and set the given parameters. The Bayesian ANOVA method is highly computationally intensive; as a result, we performed precalculations for frequently occurring parameter combinations, which are retrieved by the ShinyApp. To match any set of user-entered means (e.g., reaction times in the hundreds or thousands) to our precomputed lookup table, we first standardize the input (shift the lowest mean to 0 and scale the standard deviation to 1) and then find the nearest precalculated set of group means. We report the sample size, TPR, and the means from that closest match, rescaled back to the user’s original units (for details, see Appendix). To obtain exact calculations for values that have not been precalculated, one can use the following R command: ssp_anova_bf(effect = "Main Effect 1", mu = c(5, 7, 9, 9), sigma = 4, iter = 5000, tpr = 0.8, thresh = 10, prior_scale = 1 / sqrt(2), max_n = 500, seed = NULL).
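The standardize-and-match step can be sketched as follows (illustrative Python; the lookup table shown is hypothetical, and the real precomputed grid is shipped with the SSP):

```python
def standardize(mu, sigma):
    """Shift the lowest mean to 0 and scale by the standard deviation."""
    lo = min(mu)
    return tuple((m - lo) / sigma for m in mu)

def nearest_match(mu, sigma, table):
    """Find the precomputed scenario (keyed by standardized group means)
    closest to the user's input in the least-squares sense, then return its
    recommendation with means rescaled to the user's original units."""
    target = standardize(mu, sigma)

    def dist(key):
        return sum((k - t) ** 2 for k, t in zip(key, target))

    key = min(table, key=dist)
    rec = table[key]
    lo = min(mu)
    return {
        "mu": tuple(lo + k * sigma for k in key),  # back to original units
        "n": rec["n"],
        "tpr": rec["tpr"],
    }
```

When the user's standardized means fall exactly on a precomputed scenario, the rescaled means equal the input; otherwise the reported means reveal which neighboring scenario the recommendation is based on.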
How to report your sample-size estimation
We conducted a Bayesian ANOVA (Rouder et al., 2012) to estimate the required sample size using a Bayes factor threshold of 10 and a Cauchy prior distribution centered at zero with scale parameter 1 / sqrt(2). We set the target TPR at 0.8 because it is (a) the common standard in the field or (b) the journal publishing requirement. Expected group means were 5, 7, 9, and 9 for subgroups 1|1, 1|2, 2|1, and 2|2 of factors A|B, respectively, and we assumed a common standard deviation of 4. Based on these parameters, a minimum per-group sample size of (a) 28, (b) 297, and (c) 246 was required to achieve the target TPR .80. The effective TPR was (a) 0.83, (b) 0.81, and (c) 0.81 for the (a) main effect A, (b) main effect B, and (c) interaction effect, respectively.
Bayes factor equivalence interval (effect = 0)
Description
The Bayes-factor-equivalence-interval method compares two models, one under the null hypothesis and one under the alternative, to assess whether an effect falls within a specified equivalence interval (Morey & Rouder, 2011; Rouder et al., 2012; van Ravenzwaaij et al., 2019) defined by the EqBand. The Bayes factor is the ratio of how likely the data are under the interval null hypothesis (Cohen’s d lies within the interval) versus the complementary alternative hypothesis (Cohen’s d lies outside the interval; Rouder et al., 2009). For equivalence testing, the method calculates the ratio of posterior odds inside versus outside the interval and divides it by the prior odds using Bayes’ rule (Linde et al., 2023). A high Bayes factor supports the interval null hypothesis, indicating the effect is likely within the equivalence range, and thus, the marginal means are considered practically equivalent. The method uses a Cauchy prior centered at 0 and with a scale fixed to 1 / sqrt(2) for the standardized effect size Cohen’s d (it is possible to customize this value in the R package).
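The odds-ratio logic can be illustrated with Monte Carlo draws. The Python sketch below is a simplified stand-in that assumes posterior samples of Cohen's d are already available; SSP's R implementation computes the posterior itself.

```python
import math
import random

def interval_bf(posterior_samples, eq_band, prior_scale=1 / math.sqrt(2),
                prior_draws=100000, seed=1):
    """Monte Carlo sketch of the interval Bayes factor: posterior odds that
    Cohen's d lies in [-eq_band, eq_band], divided by the corresponding
    prior odds under a Cauchy(0, prior_scale) prior."""
    rng = random.Random(seed)
    inside = lambda xs: sum(-eq_band <= x <= eq_band for x in xs) / len(xs)
    p_post = inside(posterior_samples)
    if p_post == 1.0:
        return math.inf  # all posterior mass inside the band
    # Cauchy draws via the inverse-CDF transform of uniforms
    prior = [prior_scale * math.tan(math.pi * (rng.random() - 0.5))
             for _ in range(prior_draws)]
    p_prior = inside(prior)
    post_odds = p_post / (1 - p_post)
    prior_odds = p_prior / (1 - p_prior)
    return post_odds / prior_odds
```

A posterior concentrated inside the equivalence band produces a large Bayes factor in favor of the interval null, whereas a posterior resembling the prior produces a Bayes factor near 1.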
Study context
Anna would like to know what sample size she needs to have a 0.8 probability of obtaining a Bayes factor larger than 10 for the interval null hypotheses relative to the alternative hypotheses, using a default Cauchy prior with a scale of 1 / sqrt(2). She uses a Bayesian framework to determine sample size for testing that the marginal means between (a) the dog-owner versus nonowner group and (b) males versus females are practically equivalent. Therefore, she hypothesizes that the interval null hypotheses are true. She would like to know what sample size would be needed given the expected unstandardized group means m11 = 4 (male dog owners), m12 = 5 (female dog owners), m21 = 4.5 (male dog nonowners), m22 = 4 (female dog nonowners), and SD = 2 of the GAD-7 anxiety scores (see Table 3). These values are informed by a preliminary pilot study conducted with the target population of retired adults. Anna considers standardized differences in marginal means from −0.3 to +0.3 (EqBand = 0.3) as practically equivalent. The maximum number of participants per group is constrained to MaxN = 500 (more would not be feasible given Anna’s study budget).
Table 3. Expected Means and Standard Deviations of Scores on the GAD-7 Anxiety Scale Across Groups Among Retired Adults

                  Male, M (SD)    Female, M (SD)
Dog owner         4 (2)           5 (2)
Nonowner          4.5 (2)         4 (2)

Note: The table contains expected unstandardized group means and standard deviations for the example study context, which are informed by a preliminary pilot study. GAD-7 = seven-item Generalized Anxiety Disorder scale.
Parameters
Effect: the effect of interest (main effect A, main effect B);
Mu: the unstandardized mean of the dependent variable for each group;
Sigma: the standard deviation of the dependent variable for the groups;
TPR: the desired long-run probability of obtaining a Bayes factor higher than Thresh given the means;
EqBand: the margin of the standardized equivalence region;
Thresh: the threshold of the Bayes factor, which is fixed to 10 in the ShinyApp;
PriorScale: the scale of the Cauchy prior, which is fixed to 1 / sqrt(2) in the ShinyApp;
MaxN: the maximum group size, which is fixed to 500 in the ShinyApp;
Iter: the number of iterations to calculate the TPR, which is fixed to 5,000 in the ShinyApp;
PostIter: the number of iterations to estimate the posterior distribution, which is fixed to 5,000 in the ShinyApp.
How to use the package
You can use the ShinyApp and set the given parameters. The Bayes-factor-equivalence-interval method is highly computationally intensive, and as a result, we ran precalculations for frequently occurring parameter combinations, which are retrieved by the ShinyApp. To match any set of user-entered means (e.g., reaction times in the hundreds or thousands) to our precomputed lookup table, we first standardize the input (shift the lowest mean to 0 and scale the standard deviation to 1) and then find the nearest precalculated set of group means. We report the sample size, TPR, and the means from that closest match, rescaled back to the user’s original units (for details, see Appendix). To obtain exact calculations for values that have not been precalculated, one can use the following R command: ssp_anova_eq(effect = "Main Effect 1", eq_band = 0.3, mu = c(4, 5, 4.5, 4), sigma = 2, iter = 5000, tpr = 0.8, thresh = 10, prior_scale = 1 / sqrt(2), max_n = 500, seed = NULL).
How to report your sample-size estimation
We conducted a Bayes factor equivalence-interval analysis (Morey & Rouder, 2011; Rouder et al., 2012; van Ravenzwaaij et al., 2019) to estimate the required sample size, using a Bayes factor threshold of 10. We considered all effect sizes within −0.3 and +0.3 equivalent to zero because (a) previous studies reported a similar region of practical equivalence, or (b) of the following substantive reasons: . . . , or (c) other. . . . A Cauchy prior distribution centered at zero with scale parameter 1 / sqrt(2) was specified, and we set the target TPR at 0.80 because it is (a) the common standard in the field or (b) the journal publishing requirement. Expected group means were 4, 5, 4.5, and 4 for subgroups 1|1, 1|2, 2|1, and 2|2 of factors A|B, respectively, with a common standard deviation of 2. Based on these parameters, a minimum per-group sample size of (a) 72 and (b) 72 was required to achieve the target TPR .80. The effective TPR was (a) 0.81 and (b) 0.80 for the (a) main effect A and (b) main effect B, respectively.
Bayesian estimation
ROPE (effect = 0)
Description
The ROPE technique is used within Bayesian analysis to determine whether an effect is practically equivalent to the null value (Kruschke, 2018; Kruschke & Liddell, 2018). A ROPE represents the range of effect-size values that are not considered practically or clinically relevant. The highest density interval (HDI) is the interval containing a specified percentage (here, 95%) of the posterior distribution's mass at the highest density (Linde et al., 2023). First, we define the ROPE, and second, we calculate the 95% HDI of the posterior distribution. Finally, we assess whether the HDI falls fully within the ROPE, indicating the effect is likely within the equivalence range, and thus, the marginal means are considered practically equivalent. The method uses a Cauchy prior centered at 0 and with a scale fixed to 1 / sqrt(2) for the standardized effect size Cohen’s d. For more details, we refer to Kruschke (2018).
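The containment check itself is simple once posterior samples are available. A minimal Python sketch (our illustration, using the shortest-interval definition of the HDI, applied to already-drawn posterior samples of Cohen's d):

```python
import math

def hdi(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    xs = sorted(samples)
    k = max(2, math.ceil(mass * len(xs)))  # number of points inside
    spans = [(xs[i + k - 1] - xs[i], i) for i in range(len(xs) - k + 1)]
    _, i = min(spans)
    return xs[i], xs[i + k - 1]

def hdi_in_rope(samples, eq_band, mass=0.95):
    """Decision rule: does the HDI fall fully inside [-eq_band, eq_band]?"""
    lo, hi = hdi(samples, mass)
    return -eq_band <= lo and hi <= eq_band
```

The TPR for this method is then the long-run proportion of simulated data sets for which `hdi_in_rope` returns True.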
Study context
Anna wants to use parameter estimation and would like to know what sample size she needs to have a long-run probability of 0.8, where the ROPE interval contains the 95% HDI, using a default Cauchy prior with a scale of 1 / sqrt(2). She expects that the marginal means between (a) the dog owner versus nonowner group and (b) males versus females are practically equivalent. She would like to know what sample size would be needed given the expected unstandardized group means m11 = 4 (male dog owners), m12 = 5 (female dog owners), m21 = 4.5 (male dog nonowners), m22 = 4 (female dog nonowners), and SD = 2 of the GAD-7 anxiety scores (see Table 3). These values are informed by a preliminary pilot study conducted with the target population of retired adults. Anna considers standardized differences in marginal means from −0.3 to +0.3 (EqBand = 0.3) as practically equivalent. The maximum number of participants per group is constrained to MaxN = 500 (more would not be feasible given Anna’s study budget).
Parameters
Effect: the effect of interest (main effect A, main effect B).
Mu: the unstandardized mean of the dependent variable for each group.
Sigma: the standard deviation of the dependent variable for the groups.
TPR: the desired long-run probability of the HDI fully falling inside the ROPE, given the means.
EqBand: the margin of the standardized ROPE interval.
HDI: the probability mass of the HDI, which is fixed to 0.95 in the ShinyApp.
PriorScale: the scale of the Cauchy prior, which is fixed to 1 / sqrt(2) in the ShinyApp.
MaxN: the maximum group size, which is fixed to 500 in the ShinyApp.
Iter: the number of iterations to calculate the TPR, which is fixed to 5,000 in the ShinyApp.
PostIter: the number of iterations to estimate the posterior distribution, which is fixed to 5,000 in the ShinyApp.
How to use the package
You can use the ShinyApp and set the given parameters. The ROPE method is highly computationally intensive, and as a result, we ran precalculations for frequently occurring parameter combinations, which are retrieved by the ShinyApp. To match any set of user-entered means (e.g., reaction times in the hundreds or thousands) to our precomputed lookup table, we first standardize the input (shift the lowest mean to 0 and scale the standard deviation to 1) and then find the nearest precalculated set of group means. We report the sample size, TPR, and the means from that closest match, rescaled back to the user’s original units (for details, see Appendix). To obtain exact calculations for values that have not been precalculated, one can use the following R command: ssp_anova_rope(effect = "Main Effect 1", eq_band = 0.3, mu = c(4, 5, 4.5, 4), sigma = 2, iter = 5000, tpr = 0.8, ci = 0.95, prior_scale = 1 / sqrt(2), max_n = 500, seed = NULL).
How to report your sample-size estimation
We conducted a ROPE analysis (Kruschke, 2018; Kruschke & Liddell, 2018) to estimate the required sample size. We considered all effect sizes within −0.3 and +0.3 equivalent to zero because (a) previous studies reported a similar region of practical equivalence, or (b) of the following substantive reasons: . . . , or (c) other. . . . We used an HDI of 0.95 and specified a Cauchy prior distribution centered at zero with scale parameter 1 / sqrt(2). The target TPR was set at 0.80 because it is (a) the common standard in the field or (b) the journal publishing requirement. Expected group means were 4, 5, 4.5, and 4 for subgroups 1|1, 1|2, 2|1, and 2|2 of factors A|B, respectively, with a common standard deviation of 2. Based on these parameters, a minimum per-group sample size of (a) 242 and (b) 242 was required to achieve the target TPR 0.80. The effective TPR was (a) 0.80 and (b) 0.81 for the (a) main effect A and (b) main effect B, respectively.
Discussion
In this article, we expand the functionality of the SSP, originally developed by Kovacs et al. (2022), by introducing new methods for sample-size determination for 2 × 2 ANOVA designs. These methods help researchers plan studies across multiple statistical frameworks and pursue nuanced inferential goals, such as testing for equivalence or practical insignificance, alongside traditional hypothesis testing. To resolve a methodological gap in the application of sample-size planning to 2 × 2 ANOVA designs, we introduce two Bayesian approaches—Bayes factor equivalence interval and ROPE. These additions provide researchers with greater flexibility in addressing the desired inferential goals in 2 × 2 factorial designs.
As with any planning tool, our methodology involves certain caveats that should be considered in application. One important consideration is that our ANOVA modules require unstandardized means and standard deviations as input, and applied researchers may not always have an intuition for what appropriate values would be. Fortunately, exploratory projects (or, on a smaller scale, pilot studies) can be conducted to discover the approximate values of descriptive statistics, such as means and standard deviations, for the population(s) under study. We believe such an approach is much more informative than power calculations based on default effect sizes (e.g., f²). Thus, we advocate for a more thoughtful process of sample-size planning than relying on heuristics.
Another caveat of our ANOVA modules is that the frequentist methods do not automatically account for Type I error inflation in factorial designs involving multiple comparisons. We advise researchers to adjust alpha levels (e.g., with a Bonferroni or Tukey correction) to control the family-wise error rate or to apply other corrections as necessary (for discussion, see Rubin, 2021). In contrast, Bayesian ANOVA, including Bayes factor equivalence interval and ROPE, quantifies evidence for competing hypotheses without explicitly controlling for Type I error. Although this reflects a conceptual difference rather than a flaw, it may require additional clarification for researchers transitioning from frequentist frameworks.
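To make the inflation concrete: with m independent tests each conducted at level α, the probability of at least one false positive is 1 − (1 − α)^m, and the Bonferroni correction simply tests each hypothesis at α/m. The sketch below (illustrative only; it assumes the three F tests of a 2 × 2 ANOVA, i.e., two main effects and the interaction, treated as independent) shows the arithmetic:

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

# Three F tests in a 2 x 2 ANOVA: main effect A, main effect B, interaction.
inflated = familywise_error(0.05, 3)                       # ~0.143
controlled = familywise_error(bonferroni_alpha(0.05, 3), 3)  # ~0.049
```

Uncorrected, three tests at α = .05 yield a family-wise error rate of roughly .14; the Bonferroni-adjusted per-test level of .05/3 brings it back under .05.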
Furthermore, we note that Bayes factor estimation can vary across repeated runs and between different estimation algorithms (Pfister, 2021, 2024). SSP relies on the Bayes factor for ANOVA designs implemented in the BayesFactor R package (Rouder et al., 2012). The recommended sample size will likely differ for alternative estimation methods. Although precalculated parameter combinations enhance computational efficiency, they limit the utility of the outputs when user-specified parameters substantially deviate from precalculated values. For highly customized analyses, manual computations with the provided R code may be required.
Although the ShinyApp interface simplifies the process, applying these methods effectively still requires a degree of statistical expertise, particularly in understanding differences between frequentist and Bayesian frameworks or the implications of equivalence testing. This reliance on user expertise highlights a broader opportunity for future development: enhancing modularity and guidance to reduce barriers for less-experienced users. Future versions of SSP could incorporate more adaptive guidance, workflow customization, or support for more advanced models, such as complex factorial designs, mixed-effect models, or adaptive Bayesian procedures. This would further broaden the SSP’s applicability to address increasingly complex research designs (Cramer et al., 2016).
Notwithstanding these caveats, we believe that the SSP extension is a practical and versatile tool for applied researchers, offering an accessible ShinyApp interface and R package to assist with decision-making and application of various methods with minimal computational or programming effort. The inclusion of ready-to-use templates for reporting sample-size justifications fosters transparent communication of methodological decisions, aligning with the broader goals of open science (Fecher & Friesike, 2014). This standardization enables researchers to allocate resources effectively, mitigate the risks of underpowered or overpowered studies, and facilitate replication and evaluation of methodological choices. The SSP extension and accompanying R code reduce barriers to adopting rigorous, reproducible sample-size planning, marking a significant step toward accessible and standardized practices in applied science.
Supplemental Material
sj-jpg-1-amp-10.1177_25152459251398805: Supplemental material for “Sample-Size Planning for Frequentist and Bayesian 2 × 2 Analysis-of-Variance Designs” by Sebastian A. J. Lortz, Andrew Setiono, Marton Kovacs, and Don van Ravenzwaaij, Advances in Methods and Practices in Psychological Science.
Appendix
Acknowledgements
We thank Marton Aron Varga for his help in running the precalculations. Ethical approval was not required for this study because it involved only simulated data and did not include human participants or real-world data collection. The data and R scripts are publicly available on the GitHub repository.
Transparency
Action Editor: Yasemin Kisbu-Sakarya
Editor: David A. Sbarra
Author Contributions
S. A. J. Lortz and A. Setiono are shared first authors.
References