Abstract
Planning sample size often requires researchers to identify a statistical technique and to make several choices during their calculations. Currently, there is a lack of clear guidelines for researchers to find and use the applicable procedure. In the present tutorial, we introduce a web app and R package that offer nine different procedures to determine and justify the sample size for independent two-group study designs. The application highlights the most important decision points for each procedure and suggests example justifications for them. The resulting sample-size report can serve as a template for preregistrations and manuscripts.
Social and behavioral sciences are known to be plagued by undersampling (Ioannidis, 2005). In the traditional statistical framework, even when an effect exists, undersampled studies yield either nonsignificant results or significant results that overestimate the size of the effect. Because nonsignificant results are less likely to reach publication than significant ones, the results of undersampled studies either remain unpublished or impose a substantial bias on the body of published empirical findings. In addition, the low informational value of undersampled studies may not justify the cost or potential risk they induce (Halpern et al., 2002). To mitigate these issues, authors are increasingly expected to plan and justify the sample size of their studies (Maxwell, 2004). However, such sample-size justifications are meaningful only if they provide readers with sufficient information to judge the adequacy of the authors’ decisions.
In the statistical literature, a number of methods have been proposed to determine and justify sample size. In practice, however, authors have few practical guides for navigating among the different sample-size methods. The aim of our tutorial is to point out, for each method, the essential decision points that a researcher faces during this process. We provide a short description of each method and the corresponding parameters, but we avoid listing their advantages and disadvantages. Because experts in the field disagree about the correct use of some of the methods, we intentionally remain impartial and do not favor any of the presented methods. Researchers who want to know more about each method can find a number of useful references in its description. We also provide a collection of ready-to-use analysis code and a ShinyApp that helps researchers use and report the main sample-size-estimation techniques for different scenarios. The tutorial focuses exclusively on the comparison of two independent groups (i.e., the independent t-test design) with a one-sided test. The most important terms related to sample-size planning are defined in the glossary at the end of this article.
Sample-Size Determination and Justification
A lot of factors go into the determination of the sample size for an independent two-group study design. In this section, we first provide a bird’s-eye view of the most important decisions. Next, we go into more detail on the specific inference tool that results from the combination of the larger choices.
It is crucial to not just state how one determined a planned sample size but to also give the reader insight into the reasons behind one’s choices. In a recent overview, Lakens (2021) listed six types of general approaches to justify sample size in quantitative empirical studies: (a) measure entire population, (b) resource constraints, (c) a priori power analysis, (d) accuracy, (e) heuristics, and (f) no justification. For the first approach, no quantitative justification is necessary, and for the second approach, the researcher has no freedom to increase the sample size. Power analysis, or more generally, the estimation of true positive rate, is used when one plans to conduct hypothesis testing; accuracy justifications are used when one plans to conduct parameter estimation. Our tutorial mainly focuses on the resource constraints, a priori power analysis, and accuracy approaches and is aimed at providing a hands-on approach for the mechanical part of the sample-size determination (i.e., the calculation). For a deeper discussion of justification of these approaches or for other approaches (i.e., using heuristics or not providing justification), see Lakens (2021).
Choosing a method in case of sample-size justification
In an ideal world, the choice for the number of participants would be solely determined by scientific considerations, and depending on the chosen technique, the collection of data would continue until either the desired sample size or a desired outcome has been reached. In practice, researchers are limited by time (collecting data is quite demanding), money (participants or people collecting the data may be paid, and the same may hold for renting space or equipment), or availability of participants (the population may be relatively small and/or the participation rate quite low).
When constrained by limited resources, it is important to be transparent about those limitations. It is also important to be open about scientific considerations. Depending on the nature of the study (perhaps it is an initial exploration?), small sample sizes need not be a deal breaker. So although more data are always preferred from an informational point of view, by owning the limitations of a study, researchers improve future readers’ understanding of the process leading up to the eventual article and preempt readers who think the chosen sample size was insufficient.
Whether or not authors have limited resources, two important choices need to be made: (a) whether they are interested in statistical testing or in parameter estimation and (b) whether they want to conduct their statistical inference within the frequentist framework or within the Bayesian framework. Starting with the first decision, statistical testing is the primary framework when one is interested in establishing whether an underlying population effect is equal to, different from, larger than, or smaller than a certain value. In essence, statistical testing lends itself to binary decision-making. Typically, testing is concerned with a fixed-point null hypothesis (e.g., there is no difference between two groups), although using intervals for testing is also possible. Alternatively, one might be interested in parameter estimation that is less interested in establishing the existence of a difference and instead is concerned with establishing the magnitude of the difference.
The second important decision concerns the statistical framework. Choosing to conduct statistical tests within a frequentist framework, one is usually interested in balancing the Type I (false positive) and Type II (false negative) error rates. Practitioners choosing to conduct statistical tests within a Bayesian framework are typically interested in being able to quantify the relative probability of hypotheses or models being true given the data and in including prior information.
Within the realm of statistical testing, some other factors affect the preferred inference tool: Do you prefer to test for equivalence (no difference in means) or for superiority (mean of one group larger than mean of the other group)? Are you interested in calculating a required sample size for a specific hypothetical effect size or for a range of possible values? And do you wish to employ sequential testing (applicable to Bayesian testing)? In case of testing, some of the methods are designed to find support for the null hypothesis (e.g., two one-sided tests [TOST], region of practical equivalence [ROPE]), whereas others are designed to find support for the alternative hypothesis (e.g., traditional null-hypothesis testing), and some methods are designed to find support for either (e.g., Bayes’s factor design analysis [BFDA]). For frequentist estimation, the preferred inference tool might differ depending on whether one evaluates uncertainty for each group separately or jointly. We describe these specific factors when we go into detail about each of the preferred methods. A flow chart representing all of these choices is given in Figure 1.

Figure 1. The decisions that one faces when choosing among sample-size-estimation methods. The nine sample-size-estimation methods discussed in this article are listed in the bottom row. Some decisions are determined by the investigated question and the design of the study, whereas others are based on the preferred statistical framework.
How to use this guide
In the next section, we illustrate the specific inference tools and resulting sample-size calculations in more detail using a ShinyApp and an R package we have developed. Throughout this section, we recurrently use two terms that have different meanings for different techniques: the true positive rate (TPR) and the equivalence band (EqBand). The TPR reflects the long-run probability of concluding there is an effect, given that it does exist. For traditional null-hypothesis testing, this is typically referred to as power, but related concepts exist for different inference tools. The EqBand refers to an effect-size region, typically around zero, that is deemed clinically insignificant or irrelevant. Different names are given to this region depending on the technique that employs it, such as the smallest effect size of interest (SESOI) or the ROPE. For both TPR and EqBand, we explain the specific meaning in the context of the relevant inference tool below.
For each method, only the main parameters can be adjusted with a certain range of values in the ShinyApp by using a slider. These parameters are presented in the text in bold. Other parameters are set to preset values in the application but can be adjusted in the accompanying R package to any sensible value. These parameters are highlighted in italics in the tutorial. Both the app and the package allow the users to save or copy a text template with the results of the sample-size determination. We offer a list of possible justifications at the decision points for each method (indicated between brackets), but users are able to provide their own justification as free text. Note that the listed justifications are meant to provide guidance for the user, and they are not sufficient without further details provided by the researcher in the context of the given study. For example, previously reported values should always be accompanied by a theoretical justification of why these values make sense. The provided justification text could serve as a stub for the description of the chosen sample size in an article, a preregistration or registered report, or a grant proposal.
Throughout, we use the example story of Mary, the educational psychologist. Mary has come up with a new set of games that challenge spatial insight. She would like to test whether distributed and targeted engagement with these games for a period of 6 months for children in the age range of 8 to 12 will lead to lasting improvements on their IQ score as measured through Raven’s progressive matrices test (population mean = 100, SD = 15). Mary collects data for a control sample that gets regular education and for an experimental sample and plans to compare those samples. Mary has good reason to be skeptical about the effectiveness of training on increasing performance because there are several studies questioning the existence of such effects (Owen et al., 2010; Simons et al., 2016). For illustrative purposes, in some of the upcoming examples, Mary expects a null effect, and in others, Mary expects a positive effect to highlight the different research scenarios for each sample-size-planning method. We also present a justification text for each sample-size-planning method based on Mary’s choices described in the example research scenario for the given method.
The ShinyApp is available at https://martonbalazskovacs.shinyapps.io/SampleSizePlanner, and the R package can be installed by running the following command in R: devtools::install_github("marton-balazs-kovacs/SampleSizePlanner"). More information about the R package and the ShinyApp is available at https://github.com/marton-balazs-kovacs/SampleSizePlanner and https://marton-balazs-kovacs.github.io/SampleSizePlanner/.
Testing
Effect size = 0
Two one-sided tests (TOST)
Study context
Mary would like to know what sample size she needs for a power of .80 to study whether the mean IQ score of the experimental group’s population is practically equivalent to the mean IQ score of the control group. She tests this assumption in a frequentist framework and considers a population effect size between −0.2 and 0.2 to be “practically equivalent” to no difference. This would correspond to IQ scores between 97 (100 + 15 × −.2) and 103 (100 + 15 × .2).
Description
TOST is a frequentist equivalence-testing approach that adopts two one-sided hypotheses to designate an interval hypothesis (Schuirmann, 1987). The lower and upper boundaries of the interval are determined by the EqBand (i.e., the SESOI) around the expected population effect size (e.g., 0). Lakens et al. (2018) listed several methods that can be used to determine the SESOI. In case of TOST, the two null hypotheses state that the effect size is equal to the lower and upper EqBand values, whereas the alternative hypotheses state that the effect size is smaller than the upper EqBand value and larger than the lower EqBand value. If both one-sided tests reject the null hypothesis at a given significance level, the group means are considered practically equivalent. See Lakens et al. (2018) for further reading.
Parameters
Alpha: The level of significance. The α level in the application is preset to .05.
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
To calculate an appropriate sample size for testing whether the two groups are practically equivalent, we used the TOST (Schuirmann, 1987) method. We used an α of .05. We set the aimed TPR to be 0.8 because [1) it is the common standard in the field; 2) it is the journal publishing requirement]. We consider all effect sizes smaller than 0.2 in absolute value equivalent to zero because [1) previous studies reported the choice of a similar equivalence band; 2) of the following substantive reasons: . . . ]. The expected delta was 0 because [1) we expected no difference between the groups]. Given these parameters, a sample size of 429 per group was estimated to reach a TPR of 0.8 with our design.
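As a rough cross-check of such a calculation, the per-group sample size for TOST can be approximated in closed form with a normal approximation. The Python sketch below is illustrative only (the package itself is written in R and may use exact t-based routines); it reproduces the 429-per-group figure from the example.

```python
from math import ceil
from scipy.stats import norm

def tost_n(eq_band, tpr=0.8, alpha=0.05, delta=0.0):
    """Per-group n for TOST equivalence, normal approximation.

    The aimed power is split over the two one-sided tests, so the
    beta quantile uses (1 + tpr) / 2."""
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf((1 + tpr) / 2)
    # distance from the true effect to the nearer equivalence bound
    margin = eq_band - abs(delta)
    return ceil(2 * (z_a + z_b) ** 2 / margin ** 2)

print(tost_n(eq_band=0.2))  # 429
```

A true effect closer to one of the bounds (delta != 0) shrinks the margin and drives the required sample size up sharply.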
Equivalence interval Bayes’s factor
Study context
Mary would like to know what sample size she needs to have a long-run probability of .80 of obtaining a Bayes’s factor (BF) larger than 10. Mary would like to test whether the mean IQ score of the experimental group’s population is practically equivalent to the mean IQ score of the control group. Mary hypothesizes that there is no difference (i.e., H0 is true). Mary tests this assumption in a Bayesian framework. Mary considers a population effect size between −0.2 and 0.2 to be practically equivalent. This would correspond to IQ scores between 97 (100 + 15 × −.2) and 103 (100 + 15 × .2).
Description
Equivalence interval BFs contrast an equivalence hypothesis with a nonequivalence hypothesis and quantify the evidence with BFs. Typically, H0 constitutes the equivalence interval (comparable with the SESOI in the TOST framework), and Ha constitutes the complementary nonequivalence regions. Formally, the BF is calculated by dividing the posterior odds (the posterior area inside the interval divided by the posterior area outside the interval) by the prior odds (the prior area inside the interval divided by the prior area outside the interval). The resulting value quantifies how much more likely it is that the data occurred under a population effect size deemed equivalent than under a population effect size deemed nonequivalent. The current implementation uses a default Cauchy prior on effect size with a possible scale parameter of medium (r = 1/√2), wide (r = 1), or ultrawide (r = √2).
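The odds-ratio definition above can be illustrated numerically. The Python sketch below is a simplified stand-in that assumes a normal approximation to the posterior for delta rather than the exact Cauchy-prior posterior the package uses; the function name and inputs are ours, not the package’s.

```python
from scipy.stats import cauchy, norm

def interval_bf(post_mean, post_sd, band=0.2, prior_scale=2 ** -0.5):
    """BF for 'delta inside the equivalence band' vs. 'outside':
    posterior odds divided by prior odds. The prior on delta is a
    Cauchy(0, prior_scale); the posterior is approximated as normal."""
    prior_in = (cauchy.cdf(band, scale=prior_scale)
                - cauchy.cdf(-band, scale=prior_scale))
    post_in = (norm.cdf(band, post_mean, post_sd)
               - norm.cdf(-band, post_mean, post_sd))
    prior_odds = prior_in / (1 - prior_in)
    post_odds = post_in / (1 - post_in)
    return post_odds / prior_odds

# A precise posterior concentrated near zero strongly favors equivalence:
print(interval_bf(0.0, 0.05))
```

As the posterior concentrates inside (or outside) the band relative to the prior, the BF grows (or shrinks) accordingly.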
Parameters
Threshold: Critical threshold for the BF. The threshold level in the application can be set to 10, 6, or 3.
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
To estimate the sample size, we used the equivalence interval BF (Morey & Rouder, 2011; van Ravenzwaaij et al., 2019) method. We used a Cauchy prior distribution centered on 0 with a scale parameter of 1/√2.
Effect size > 0 (frequentist)
Classical power analysis
Study context
Mary would like to know what sample size she needs for a power of .80 to study whether the mean IQ score of the experimental group’s population is significantly higher than the mean IQ score of the control group. She tests this assumption in a frequentist framework for a hypothetical population effect size of 0.5. This corresponds to a mean IQ score of 107.5 in the experimental group (100 + 15 × .5), assuming a mean IQ score of 100 in the control group.
Description
The classical power analysis approach allows one to calculate the required sample size to obtain a significant result for the null-hypothesis test a certain proportion of times in the long run, given an assumed population effect size.
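The calculation can be sketched as a search for the smallest per-group n whose one-sided two-sample t test reaches the aimed TPR, using the noncentral t distribution. The Python code below illustrates the logic rather than the package’s R implementation; it reproduces the 51-per-group figure from Mary’s example (delta = 0.5, α = .05, TPR = .8).

```python
from math import sqrt
from scipy.stats import nct, t as t_dist

def power_one_sided(n, delta, alpha=0.05):
    """Power of a one-sided two-sample t test with n per group."""
    df = 2 * n - 2
    crit = t_dist.ppf(1 - alpha, df)
    ncp = delta * sqrt(n / 2)  # noncentrality for equal group sizes
    return nct.sf(crit, df, ncp)

def required_n(delta, tpr=0.8, alpha=0.05):
    """Smallest per-group n with power >= tpr."""
    n = 2
    while power_one_sided(n, delta, alpha) < tpr:
        n += 1
    return n

print(required_n(0.5))  # 51
```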
Parameters
Alpha: The level of significance. Alpha is preset to .05 in the application.
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
We used a power analysis to estimate the sample size. We used an α of .05. We set the aimed TPR at 0.8 because [1) it is the common standard in the field; 2) it is the journal publishing requirement]. The expected delta was 0.5 because [1) previous results published in . . . ; 2) of the following substantive reasons: . . . ]. Given these parameters, a minimal sample size of 51 per group was estimated to reach 0.8 TPR for our design.
Power curve
Study context
Mary would like to know what sample size she needs for a power of .80 to study whether the mean IQ score of the experimental group’s population is significantly higher than the mean IQ score of the control group. She tests this assumption in a frequentist framework. However, she is reluctant to commit to a single hypothetical population effect size a priori, preferring to calculate required sample size for a range of hypothetical deltas between 0.1 and 0.9.
Description
The power curve method is similar to a classical power analysis, but instead of calculating the appropriate sample size for one hypothesized population effect size, it calculates the required sample size for a range of plausible population effect sizes.
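A power curve can be sketched by repeating the noncentral-t sample-size search over a grid of deltas. Again, this Python code illustrates the underlying logic, not the package’s implementation; the grid values match Mary’s range of 0.1 to 0.9.

```python
from math import sqrt
from scipy.stats import nct, t as t_dist

def power_one_sided(n, delta, alpha=0.05):
    """Power of a one-sided two-sample t test with n per group."""
    df = 2 * n - 2
    return nct.sf(t_dist.ppf(1 - alpha, df), df, delta * sqrt(n / 2))

def power_curve(deltas, tpr=0.8, alpha=0.05):
    """Required per-group n for each hypothetical delta."""
    def required_n(d):
        n = 2
        while power_one_sided(n, d, alpha) < tpr:
            n += 1
        return n
    return {d: required_n(d) for d in deltas}

curve = power_curve([round(0.1 * k, 1) for k in range(1, 10)])
print(curve)
```

Plotting n against delta gives the power curve: small hypothetical effects demand dramatically larger samples.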
Parameters
Alpha: The level of significance. Alpha is preset to .05 in the application.
How to use the package
To determine the sample sizes for each delta, see
To plot the power curve,
How to report your sample-size estimation
We used a power analysis to estimate the sample size. We used an α of .05. We set the aimed TPR at 0.8 because [1) it is the common standard in the field; 2) it is the journal publishing requirement]. Because [1) we have no clear expectation of the magnitude of delta 2) we expected the delta to be around . . . ], we include power calculations for delta ranging from 0.1 to 0.9. Given these parameters, minimal sample sizes per group for different hypothetical effect sizes to reach 0.8 TPR can be found in Figure 2.

Figure 2. The resulting power curve created by the application. The x-axis shows the range of deltas from the example, and the y-axis shows the corresponding sample sizes determined by the power curve method.
Effect size > 0 (Bayesian)
Predetermined sample size with Bayes’s factor
Study context
Mary would like to test whether the mean IQ score of the experimental group’s population is higher than the mean IQ score of the control group. She would like to know what sample size she needs to have for a long-run probability of .80 of obtaining a BF larger than 10. Mary plans to collect all her data in one batch without testing sequentially. Mary expects the population effect size to be 0.5. This corresponds to a mean IQ score of 107.5 (100 + 15 × .5) in the experimental group, assuming a mean IQ score of 100 in the control group.
Description
The present method calculates the corresponding default BF for a t-test statistic with a Cauchy prior distribution centered on 0 and a scale parameter of either 1/√2 (medium), 1 (wide), or √2 (ultrawide).
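The default BF itself can be computed by numerical integration, following the two-sample Jeffreys-Zellner-Siow formula of Rouder et al. (2009). The Python sketch below illustrates this ingredient of the method; it is our own implementation, not the package’s, and sample-size planning would wrap it in a search over n.

```python
from math import exp, inf, pi, sqrt
from scipy.integrate import quad

def jzs_bf10(t, n1, n2, r=1 / sqrt(2)):
    """Default (JZS) two-sample BF10 from a t statistic, with a
    Cauchy(0, r) prior on delta, via the g-prior integral."""
    nu = n1 + n2 - 2
    N = n1 * n2 / (n1 + n2)  # effective sample size

    def integrand(g):
        # marginal likelihood under Ha for a given g, weighted by the
        # InvGamma(1/2, r^2/2) density on g (equivalent to the Cauchy prior)
        like = ((1 + N * g) ** -0.5
                * (1 + t ** 2 / ((1 + N * g) * nu)) ** (-(nu + 1) / 2))
        prior = sqrt(r ** 2 / (2 * pi)) * g ** -1.5 * exp(-r ** 2 / (2 * g))
        return like * prior

    marginal_alt, _ = quad(integrand, 0, inf)
    marginal_null = (1 + t ** 2 / nu) ** (-(nu + 1) / 2)
    return marginal_alt / marginal_null

print(jzs_bf10(4.0, 50, 50))  # strong evidence for an effect
```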
Parameters
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
We used the Jeffreys-Zellner-Siow BF method to estimate the sample size. We used a Cauchy prior distribution centered on 0 with a scale parameter of 1/√2.
Bayes’s factor design analysis (BFDA)
Study context
Mary would like to know what sample size she needs to have a long-run probability of .80 of obtaining a BF larger than 10. Mary would like to test whether the mean IQ score of the experimental group’s population is higher than the mean IQ score of the control group in a Bayesian framework. Mary plans to collect all her data incrementally and thus is interested in using the advantage of not testing more than strictly necessary offered by sequential testing in her Bayesian analysis. Mary expects the population effect size to be 0.5. This corresponds to a mean IQ score of 107.5 in the experimental group (100 + 15 × .5), assuming a mean IQ score of 100 in the control group.
Description
The description of the BFDA method is functionally identical to the one provided in the Predetermined Sample Size With BF section but gains in TPR because of the addition of sequential testing. In the app, H0 and Ha indicate the proportion of times sequential testing leads to BFs providing evidence with the given threshold for the null hypothesis and for the alternative hypothesis, respectively. Users of the Shiny app and R package should set delta to 0 if they wish to determine the sufficient sample size for rejecting an effect and use delta > 0 if they wish to find support for the existence of an effect. For further reading, see Schönbrodt and Wagenmakers (2018) and Schönbrodt et al. (2017).
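The sequential logic can be sketched with a small Monte Carlo simulation: generate data, evaluate the BF at interim looks, and record how often the threshold is reached. To keep the sketch self-contained, we substitute a simplified Savage-Dickey BF with a normal prior for the default JZS BF; the seed, batch size, and stopping settings are illustrative assumptions, not the package’s defaults.

```python
import numpy as np
from scipy.stats import norm

def bf10_approx(x1, x2, prior_sd=1.0):
    """Savage-Dickey BF10 for delta under a N(0, prior_sd^2) prior,
    treating the observed Cohen's d as normal with known variance."""
    n = len(x1)
    d = (x1.mean() - x2.mean()) / np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    se2 = 2 / n + d ** 2 / (4 * n)       # approximate sampling variance of d
    post_var = 1 / (1 / prior_sd ** 2 + 1 / se2)
    post_mean = post_var * d / se2
    bf01 = norm.pdf(0, post_mean, np.sqrt(post_var)) / norm.pdf(0, 0, prior_sd)
    return 1 / bf01

def bfda_tpr(delta, threshold=10, n_start=20, n_max=200, step=10,
             sims=200, seed=1):
    """Share of simulated studies in which sequential sampling
    reaches BF10 >= threshold before n_max per group."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        x1 = rng.normal(delta, 1, n_max)
        x2 = rng.normal(0, 1, n_max)
        if any(bf10_approx(x1[:n], x2[:n]) >= threshold
               for n in range(n_start, n_max + 1, step)):
            hits += 1
    return hits / sims
```

Running `bfda_tpr(0.5)` versus `bfda_tpr(0.0)` contrasts the rate of correctly crossing the threshold under Ha with the rate of misleading evidence under H0.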
Parameters
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
We used the BFDA method to estimate the sample size. We used a Cauchy prior distribution centered on 0 with a scale parameter of 1/√2.
Estimation
Frequentist
Accuracy in parameter estimation (AIPE)
Study context
Mary would like to know what sample size she needs so that the 95% confidence interval for the population effect size has an expected width of 0.4. She estimates the population effect size to be 0.3.
Description
Accuracy in parameter estimation (AIPE) aims to determine the sample size sufficient to obtain a confidence interval with a desired width (precision) around the expected effect size (Kelley & Rausch, 2006). Note that the width of the calculated confidence interval depends on the sample variance, so for any given sample the realized interval may turn out wider than desired. Thus, the AIPE method targets the expected width of the confidence interval, which corresponds to roughly a 50% long-run probability of obtaining a confidence interval no wider than the specified width.
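Under a large-sample normal approximation to the sampling variance of Cohen’s d, the AIPE sample size has a closed form. The sketch below is not the exact noncentral-t computation of Kelley and Rausch (2006), but it reproduces the 195-per-group figure of the example.

```python
from math import ceil
from scipy.stats import norm

def aipe_n(width, delta, conf=0.95):
    """Smallest per-group n whose expected CI width for Cohen's d is
    at most `width`, using var(d) ~= 2/n + d^2/(4n)."""
    z = norm.ppf((1 + conf) / 2)
    target_se = width / (2 * z)          # half-width = z * SE
    return ceil((2 + delta ** 2 / 4) / target_se ** 2)

print(aipe_n(width=0.4, delta=0.3))  # 195
```

Halving the desired width roughly quadruples the required sample size, which is the familiar precision-cost trade-off.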
Parameters
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
To estimate the sample size, we used the accuracy in parameter estimation (AIPE; Kelley & Rausch, 2006) method. We aimed for a 95% confidence level because [1) it is the common standard in the field; 2) it is the journal publishing requirement]. The desired width was 0.4 because [1) previous studies reported the choice of a similar width; 2) of the following substantive reasons: . . . ]. We expected an underlying population effect size of 0.3 because [1) previous results published in . . . ; 2) of the following substantive reasons: . . . ]. Given these parameters, a minimal sample size of 195 per group was estimated for our design.
A priori precision (APP)
Study context
Mary would like to know the sample size for which she will have a 95% long-run probability that the sample means in both the experimental and the control group lie within 0.2 SD (3 IQ points) of the true population mean.
Description
A priori precision (APP) aims to determine the sample size needed to have a certain long-run probability of both sample means being within a certain range of their respective population means, expressed in terms of standard deviations (Trafimow & MacDonald, 2017). As a result, APP is not reliant on the expected effect size.
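The APP sample size follows directly from this definition: each of the k = 2 independent sample means must land within f standard deviations of its population mean with probability .95^(1/2), which pins down n through the normal quantile. The Python sketch below is our reading of that formula; it reproduces the 126-per-group figure of the example.

```python
from math import ceil
from scipy.stats import norm

def app_n(f, conf=0.95, k=2):
    """Per-group n so that all k sample means fall within f SDs of
    their population means with joint probability conf. Because the
    groups are independent, each mean must succeed with probability
    conf**(1/k); the mean's SD is sigma / sqrt(n)."""
    z = norm.ppf((1 + conf ** (1 / k)) / 2)
    return ceil((z / f) ** 2)

print(app_n(f=0.2))  # 126
```

Note that, consistent with the description above, no expected effect size enters the calculation, only the desired precision f and confidence level.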
Parameters
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
To estimate the sample size, we used the a priori precision (APP; Trafimow & MacDonald, 2017) method. Before data collection, we wanted to be 95% confident that both sample means lie within 0.2 SD of the true population means. Given these parameters, the resulting minimum sample size was 126 per group for our design.
Bayesian testimation
Region of practical equivalence (ROPE)
Study context
Mary would like to conduct parameter estimation to see whether the mean IQ score of her experimental group’s population is practically equivalent to 100. She would like to know what sample size she needs to have a long-run probability of .80 of obtaining a 95% highest density interval (HDI) that is contained within her predefined ROPE. Mary hypothesizes that there is no difference (i.e., H0 is true). She considers a population effect size between −0.2 and 0.2 to be practically equivalent. This would correspond to IQ scores between 97 (100 + 15 × −.2) and 103 (100 + 15 × .2).
Description
The HDI-ROPE (often referred to as just ROPE) shares some features with the equivalence interval BF procedure. Both define an equivalence interval, construct a prior for the population effect size, and update to a posterior after the data come in. The equivalence interval BF procedure then focuses on the posterior and prior odds under complementary hypotheses. The ROPE procedure, on the other hand, identifies the 95% HDI (other percentages are permissible as well) and determines whether the HDI is fully contained within the equivalence interval. For further reading, see Kruschke (2018) and Kruschke (2011).
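The core decision rule (is the HDI fully inside the ROPE?) can be sketched as follows, assuming a normal posterior for delta, for which the 95% HDI coincides with the central 95% interval. Determining the sample size then amounts to simulating how often this check succeeds; the function and inputs here are illustrative, not the package’s.

```python
from scipy.stats import norm

def hdi_in_rope(post_mean, post_sd, rope=(-0.2, 0.2), mass=0.95):
    """Is the HDI of a normal posterior fully inside the ROPE?
    For a symmetric normal posterior, the HDI equals the central
    `mass` interval."""
    z = norm.ppf((1 + mass) / 2)
    lower, upper = post_mean - z * post_sd, post_mean + z * post_sd
    return rope[0] <= lower and upper <= rope[1]

print(hdi_in_rope(0.0, 0.05))  # True: precise posterior near zero
print(hdi_in_rope(0.0, 0.15))  # False: interval too wide for the ROPE
```

Larger samples shrink the posterior SD, which is what eventually lets the HDI fit inside the ROPE when the true effect is near zero.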
Parameters
How to use the package
To use this method in R, run the following code:
How to report your sample-size estimation
To estimate the sample size, we used the region of practical equivalence (ROPE; Kruschke, 2018) method. We used a Cauchy prior distribution centered on 0 with a scale parameter of 1/√2.
Summary
Justifying the decisions made during the sample-size planning process presents valuable information when one evaluates the inferences drawn from a study. The Shiny app and R package presented in this article aim to help researchers to choose and employ their sample-size estimation method. In addition, the tool provides assistance in reporting the process and justification behind sample-size choices. We encourage users and experts of the field to provide feedback and recommendations toward further developments.
Footnotes
Transparency
Action Editor: Alexa Tullett
Editor: Daniel J. Simons
Author Contributions
M. Kovacs and D. van Ravenzwaaij are shared first authors. Conceptualization: M. Kovacs, D. van Ravenzwaaij, R. Hoekstra, and B. Aczel; methodology: D. van Ravenzwaaij; project administration: M. Kovacs; software: M. Kovacs and D. van Ravenzwaaij; supervision: B. Aczel; writing, original draft preparation: M. Kovacs, D. van Ravenzwaaij, R. Hoekstra, and B. Aczel; writing, review and editing: M. Kovacs, D. van Ravenzwaaij, R. Hoekstra, and B. Aczel. All of the authors approved the final manuscript for submission.
