Abstract
Critical-effect-size values represent the smallest effect size that can reach statistical significance given a specific sample size, alpha level, and test statistic. It can be useful to calculate the critical effect size when designing a study and to evaluate whether effects of that magnitude are plausible. Reporting critical-effect-size values may be useful when the sample size has not been planned a priori, when there is uncertainty about the sample size that can be collected, or whenever researchers plan to analyze the data with a statistical hypothesis test. To assist researchers in calculating critical-effect-size values, we developed an R package that allows researchers to report critical-effect-size values for group comparisons, correlations, linear regressions, and meta-analyses. Reflecting on critical-effect-size values could benefit researchers during the planning phase of a study by helping them understand the limitations of their research design. Critical-effect-size values are also useful when evaluating studies performed by other researchers for which a priori power analyses were not conducted, especially when nonsignificant results are observed.
The “critical effect size” refers to the smallest effect size that can be statistically significant given the test performed, the sample size, and the alpha level (Lakens, 2022). Consider a study that tests a bivariate correlation with a fixed sample size: Only correlations at least as large as the critical effect size can yield a significant result, regardless of the true effect in the population.
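As a quick illustration in base R (n = 30 and a two-sided alpha of .05 are illustrative values, not taken from the article):

# Smallest correlation that can reach significance with n = 30
# observations and a two-sided alpha of .05.
n     <- 30
alpha <- .05
tc    <- qt(1 - alpha / 2, df = n - 2)   # critical t value
tc / sqrt(tc^2 + n - 2)                  # critical r, about .36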
Null hypothesis significance testing (NHST) remains the most prominent approach for statistical inference in science even though there are widespread concerns about the misuse of hypothesis tests (Chow, 1988; Cohen, 1994; Cortina & Dunlap, 1997; Gigerenzer et al., 2004; Hagen, 1997; Haig, 2017; Krueger, 2001; Lakens, 2021; Miller, 2017; Nickerson, 2000). Over the past decades, numerous proposals have emerged to improve the use of hypothesis testing. These include complementing hypothesis tests with effect sizes and their confidence intervals, preregistering hypotheses before the data are collected, and conducting power analyses to determine the sample size a priori based on a smallest effect size of interest (American Psychological Association [APA], 2020; Cumming et al., 2012). Researchers often publish studies with low power for medium to small effect sizes (Szucs & Ioannidis, 2017) and use resource limitations as the main justification for the sample size (Lakens, 2022). However, low power makes it challenging to distinguish signal from noise, and this, combined with the selective reporting of statistically significant results in the published literature, leads to an overestimation of effect sizes (Altoè et al., 2020). Certain types of research questions can be studied by relying on online data collection, which has made it cheaper and more feasible to collect very large samples. Studies with very large samples are well powered to detect even small effect sizes, but they also require researchers to carefully consider the possibility that an effect might be statistically significant but practically insignificant (Anvari et al., 2023; Kirk, 1996). The risk is that researchers will heuristically or opportunistically accept all effects as potentially important. For this reason, “researchers need to explicitly state the mechanisms that can amplify the importance of an observed effect size and the mechanisms that counteract it” (Anvari et al., 2023, p. 504).
We believe there are two clear benefits to reporting critical-effect-size values for a corresponding test. In the following sections, we illustrate common research scenarios, discuss the main problems that arise in each, and highlight how critical-effect-size values can help address them.
Small Sample Sizes and Uncertain Sample-Size Determination
First, when sample sizes are small, the critical-effect-size values inform readers about whether the effect sizes that could lead to rejecting the null hypothesis are in line with realistic expectations. If the sample size is small, only very large effects would yield a statistically significant result, and the underlying mechanism under examination is unlikely to produce such large effect sizes, researchers will realize they are not able to collect sufficient data to perform an informative hypothesis test. An a priori power analysis would typically lead to a similar conclusion, but reporting critical-effect-size values focuses attention more strongly on which effect sizes are reasonable to expect. For already conducted studies with small sample sizes, it could be argued that retrospective design analysis would be more informative (Altoè et al., 2020), but this requires knowledge of the plausible effect size and/or the smallest effect size of interest, that is, the minimum effect that could be considered meaningful based on practical relevance, theoretical importance, and/or specific research interests (Mesquida & Lakens, 2024; Riesthuis, 2024), which are not always easy to determine. Critical-effect-size values offer a simple and efficient way to evaluate which findings could not have reached significance given the sample-size limitations and the complexity of the test. Power analysis provides information about how likely it is to detect a specific effect size if it truly exists in the population. However, it does not indicate the minimum effect size required to reach statistical significance, which can provide additional insight into the strengths and limitations of a study design.
Consider the following two scenarios as examples. First, imagine researchers who conduct a study involving a between-groups comparison. Because of severe resource constraints, they are able to collect only a limited sample. Before collecting any data, they can compute the critical effect size for their planned test and ask whether a standardized difference of that magnitude is a plausible outcome of the mechanism under study.
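A minimal base-R sketch of this reasoning, with 15 participants per group as an assumed, illustrative sample size:

# Critical Cohen's d for a two-sample t test with small groups
# (n = 15 per group is illustrative, not taken from the article).
n1 <- 15; n2 <- 15
alpha <- .05
tc <- qt(1 - alpha / 2, df = n1 + n2 - 2)
tc * sqrt(1 / n1 + 1 / n2)   # about 0.75: only large effects can be significant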
To provide a more tangible illustration of the practical applications of critical effect sizes, we examine a real-world instance drawn from published research. The study of interest was conducted on musicians and nonmusicians to detect differences in working memory; a modest sample size of 57 participants (42 musicians and 15 control subjects) was collected (Weiss et al., 2014). The two groups were compared on a variety of tasks related to verbal working memory and auditory skills. Regarding verbal working memory, a syllable-span task was administered, and scores were computed for the maximal span and the total number of sequences correctly repeated. A two-tailed t test was used to compare the two groups on these scores.
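Using the reported group sizes, and assuming a two-sided alpha of .05, the critical standardized mean difference for this design can be approximated in base R as follows:

# Critical d for Weiss et al. (2014): 42 musicians vs. 15 controls,
# assuming a two-sided test with alpha = .05.
n1 <- 42; n2 <- 15
tc <- qt(.975, df = n1 + n2 - 2)
tc * sqrt(1 / n1 + 1 / n2)   # roughly d = 0.60; smaller true effects
                             # could not have reached significance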
Beyond critical-effect-size values, researchers should consider, before conducting a study, which conclusions can realistically be drawn from their experimental design. For example, when interaction effects are examined in addition to main effects, power may decrease: Interactions are generally associated with smaller effect sizes, may require corrections for multiple tests, and their coefficients have larger standard errors than main-effects coefficients. Such a reduction in statistical power may require a larger sample size to detect the effects of interest reliably. If this is not feasible and sample-size constraints make it impossible to reach significance for a plausible effect because it is smaller than the critical-effect-size value, alternative data-collection strategies, such as multisite studies, should be explored (Byers-Heinlein et al., 2020; Jarke et al., 2022; Moshontz et al., 2018; Sirois et al., 2023). Finally, if participating in or leading a multisite study is not an option, researchers should evaluate whether collecting the sample for exploratory analysis, without making inferences, might still be valuable. Such data could later be included in a meta-analysis.
Large Sample Sizes and Meta-Analysis
When sample sizes are very large, the critical-effect-size values will make it clear that trivially small effect sizes will be statistically significant. Reporting critical-effect-size values will focus the attention of researchers on the difference between statistical significance and practical significance and raise awareness of the importance of interpreting the size of the effect. This is especially important in correlational studies with large sample sizes in psychology, in which systematic but uncontrolled sources of variability may lead to significant but very small and not meaningfully interpretable nonzero effects, a phenomenon referred to as the “crud factor” (Orben & Lakens, 2020).
Imagine researchers who gain access to a very large archival data set. With tens of thousands of observations, even correlations close to zero will exceed the critical effect size, so a significant result by itself carries little information about the practical importance of the effect.
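As a base-R sketch, assuming a hypothetical archival sample of n = 100,000:

# With very large archival data sets the critical correlation is tiny.
n  <- 1e5                     # hypothetical sample size
tc <- qt(.975, df = n - 2)
tc / sqrt(tc^2 + n - 2)       # about r = .006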
Consider now another real-world instance involving the study by Kramer et al. (2014), which explored the impact of emotional content on Facebook users’ experiences. With a notably large sample size of 689,003 users, even minuscule effects could reach statistical significance, and the critical effect size for this design makes that explicit.
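A rough base-R check, under the simplifying assumptions of two equal-sized groups and a two-sided alpha of .05 (the actual design was more complex):

# Critical d for a design of the size of Kramer et al. (2014),
# assuming two equal groups and alpha = .05 for illustration.
n  <- 689003
n1 <- n2 <- n / 2
tc <- qt(.975, df = n1 + n2 - 2)
tc * sqrt(1 / n1 + 1 / n2)   # about d = 0.005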
A similar scenario may arise in a meta-analysis. Despite potential loss of precision because of substantial heterogeneity across effect sizes in different studies, meta-analyses typically synthesize a large amount of evidence, and as a consequence, even very small average effect sizes can reach statistical significance. Although the focus of meta-analysis is generally on estimating effect sizes with uncertainty, statistical significance is routinely reported and interpreted. Signaling the critical-effect-size values beforehand can serve as a clear warning that statistically significant results should not automatically be interpreted as practically significant, thus urging caution when interpreting the results (Anvari et al., 2023). For this reason, the ongoing call to specify which effects constitute an “important difference” (Boring, 1919; Hodges & Lehmann, 1954; Kirk, 1996) has become especially urgent.
Finally, routinely reasoning about critical effect sizes, alongside commonly known practices for enhancing the quality of research (i.e., power analysis, design analysis, data simulation), will benefit researchers by making it clear that significance is strictly tied to the sample size, the alpha level, and the statistical test used to analyze the data. Moreover, in educational settings, critical-effect-size values will help students better grasp these concepts and thereby adopt a more critical approach both to published research and to their own future studies.
How to Compute Critical-Effect-Size Values
In this section, we provide guidance and formulas for computing standardized critical effect sizes, with examples for frequently encountered effect sizes, including standardized mean differences (Cohen’s d and Hedges’s g), correlations (Pearson’s r), regression coefficients, and meta-analytic effects.
Summary of the Critical-Effect-Size Equations (the equations are given in the sections that follow)
Note: See text for details. When possible, we also include the equation to calculate the critical effect size without computing the standard error. For the paired t test, the correlation between the two measurements also enters the computation (see the Paired-sample t test section).
The t test
For the $t$ test, the null hypothesis is rejected whenever the observed statistic is more extreme than the critical value, that is, the quantile of the $t$ distribution determined by the alpha level and the degrees of freedom:

$$t_c = F_t^{-1}(1 - \alpha/2, \nu) \qquad \text{(Equation 1)}$$

where $F_t^{-1}$ is the quantile function of the $t$ distribution, $\nu$ are the degrees of freedom of the test, and $\alpha/2$ is used for a two-sided test ($\alpha$ for a one-sided test). Equation 2 describes a general formulation of a standardized effect-size measure, where a difference between means is divided by a standard deviation:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s} \qquad \text{(Equation 2)}$$

The critical effect size $d_c$ is the value of $d$ that corresponds exactly to the critical value $t_c$: Any observed effect that is smaller in absolute value cannot be statistically significant. Another way of conceptualizing the critical effect size is by removing the sample size from the $t$ statistic. Because the $t$ statistic can be written as the standardized effect size multiplied by a function of the sample size, dividing the critical value $t_c$ by that function expresses it on the effect-size scale.
One-sample t test
For the one-sample $t$ test, the statistic can be written as $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = d\sqrt{n}$, where $d = (\bar{x} - \mu_0)/s$. Removing the sample-size term gives the critical effect size: $d_c = \frac{t_c}{\sqrt{n}}$.
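A minimal base-R sketch of this computation (the function name and defaults are ours, not part of any package):

# Critical Cohen's d for a one-sample t test.
critical_d_one <- function(n, alpha = .05, two_sided = TRUE) {
  a  <- if (two_sided) alpha / 2 else alpha
  tc <- qt(1 - a, df = n - 1)
  tc / sqrt(n)
}
critical_d_one(n = 20)   # about 0.47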
Two-sample t test
We can apply the one-sample approach to the two-sample case. When assuming homogeneity of the variances between the two groups, we have $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}} = \frac{d}{\sqrt{1/n_1 + 1/n_2}}$, where $s_p$ is the pooled standard deviation. The critical effect size is therefore $d_c = t_c\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, with $\nu = n_1 + n_2 - 2$ degrees of freedom.
When relaxing the assumption of equal variances (i.e., using the Welch’s $t$ test), the standard error becomes $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ and the degrees of freedom are approximated with the Welch-Satterthwaite equation. The critical value of the mean difference is then $t_c$ multiplied by this standard error; expressing it as a standardized effect size requires the observed group standard deviations, so it can no longer be computed from the sample sizes alone.
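A base-R sketch of both versions (function names and the illustrative inputs are ours):

# Pooled-variance version: depends only on the group sizes.
critical_d_two <- function(n1, n2, alpha = .05) {
  tc <- qt(1 - alpha / 2, df = n1 + n2 - 2)
  tc * sqrt(1 / n1 + 1 / n2)
}
critical_d_two(20, 20)   # about 0.64

# Welch version: the critical *mean difference* needs the observed
# group standard deviations (values below are illustrative).
critical_diff_welch <- function(s1, s2, n1, n2, alpha = .05) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  df <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
  qt(1 - alpha / 2, df) * se
}
critical_diff_welch(s1 = 1, s2 = 1.5, n1 = 20, n2 = 20)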
Paired-sample t test
For the paired-sample case, the situation is less straightforward. The reason is that the $t$ statistic is computed using the standard deviation of the difference scores, $t = \frac{\bar{x}_{diff}}{s_{diff}/\sqrt{n}}$, whereas the effect size is often standardized using the standard deviation of the raw scores. The two standardizers are linked by the correlation $r$ between the paired measurements: Assuming equal variances across measurements, $s_{diff} = s\sqrt{2(1 - r)}$.
Even if the hypothesis testing is computed using the difference-score standardizer (yielding the effect size commonly denoted $d_z$, with critical value $d_{z_c} = t_c/\sqrt{n}$), researchers who wish to report the effect size on the raw-score scale need the correlation between measurements to convert the critical value accordingly.
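A base-R sketch; n = 25 and r = .5 are illustrative assumptions:

# Critical effect size for a paired t test, on the d_z scale,
# then converted to the raw-score d scale using the correlation r.
critical_dz <- function(n, alpha = .05) {
  qt(1 - alpha / 2, df = n - 1) / sqrt(n)
}
dz_c <- critical_dz(n = 25)    # about 0.41
dz_c * sqrt(2 * (1 - 0.5))     # identical on the d scale when r = .5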
Hedges’s correction
The effect size calculated as in the previous step is known to be inflated, especially for small samples. For this reason, there is a corrected version of the effect size, called “Hedges’s $g$,” obtained by multiplying $d$ by a correction factor $J$ that depends on the degrees of freedom.
The correction is applied as $g = d \times J(\nu)$, where a common approximation is $J(\nu) \approx 1 - \frac{3}{4\nu - 1}$. The same correction can be applied directly to the critical effect size, $g_c = d_c \times J(\nu)$.
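A base-R sketch using the approximate correction factor (group sizes of 20 are illustrative):

# Small-sample (Hedges) correction applied to a critical d.
hedges_J <- function(df) 1 - 3 / (4 * df - 1)    # common approximation
dc <- qt(.975, df = 38) * sqrt(1 / 20 + 1 / 20)  # critical d, n = 20 + 20
dc * hedges_J(38)                                # corrected critical value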
Correlation test
Hypothesis testing for the Pearson’s correlation coefficient is usually done using the $t$ distribution, with $t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$ and $\nu = n - 2$ degrees of freedom. Solving for $r$ at the critical $t$ value yields the critical correlation directly, without computing a standard error: $r_c = \frac{t_c}{\sqrt{t_c^2 + n - 2}}$.
Another approach for hypothesis testing of the Pearson’s correlation coefficient is using the Fisher’s $z$ transformation, $z_r = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$, which is approximately normally distributed with standard error $1/\sqrt{n - 3}$.
One can still use Equation 1, substituting the $t$ distribution with the standard normal distribution: The critical value on the transformed scale is $z_c/\sqrt{n - 3}$, which can be back-transformed to the correlation scale with the inverse transformation (the hyperbolic tangent).
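Both approaches in a base-R sketch (function names are ours):

# Critical correlation via the t distribution...
critical_r_t <- function(n, alpha = .05) {
  tc <- qt(1 - alpha / 2, df = n - 2)
  tc / sqrt(tc^2 + n - 2)
}
# ...and via Fisher's z (normal approximation, back-transformed).
critical_r_z <- function(n, alpha = .05) {
  zc <- qnorm(1 - alpha / 2)
  tanh(zc / sqrt(n - 3))
}
critical_r_t(50)   # about .28
critical_r_z(50)   # very close for moderate n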
Linear regression
Hypothesis testing on regression coefficients (e.g., using the Wald $t$ test reported by standard regression output) compares $t = \hat{\beta}/SE_{\hat{\beta}}$ with the critical value. The critical coefficient is therefore $\beta_c = t_c \times SE_{\hat{\beta}}$: the smallest absolute value of the estimate that can be statistically significant given its standard error.
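A base-R sketch using a model fitted to the built-in mtcars data (chosen only for illustration):

# Critical regression coefficients from a fitted model: the smallest
# absolute estimates that could be significant given their SEs.
fit <- lm(mpg ~ wt + hp, data = mtcars)
se  <- summary(fit)$coefficients[, "Std. Error"]
tc  <- qt(.975, df = df.residual(fit))
tc * se   # critical value for each coefficient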
Meta-analysis
Meta-analysis allows one to pool information from multiple studies related to a specific research question; its main advantage is that combining studies yields a more precise and powerful estimate of the effect. From a statistical point of view, a meta-analysis can be considered a weighted linear regression with heterogeneous variances. Similar to standard linear regression, hypothesis testing is performed using Wald $z$ tests, so the critical pooled effect is $\theta_c = z_c \times SE_{\hat{\theta}}$, where $z_c$ is the critical value of the standard normal distribution and $SE_{\hat{\theta}}$ is the standard error of the pooled estimate.
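A sketch using the metafor package and its built-in BCG data set (our choice of example, not necessarily the one used in the article):

# Critical pooled effect for a random-effects meta-analysis.
library(metafor)
dat <- escalc(measure = "RR", ai = tpos, bi = tneg,
              ci = cpos, di = cneg, data = dat.bcg)
fit <- rma(yi, vi, data = dat)
qnorm(.975) * fit$se   # critical (log) pooled effect given its SE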
Other models
Despite the fact that we discussed only linear models, the same approach can be applied to other types of models, such as generalized linear models. In fact, one simply needs to multiply the critical value of the chosen distribution (e.g., the standard normal distribution for Wald $z$ tests) by the standard error of the parameter to obtain the critical value of the coefficient, for instance a critical log odds ratio in logistic regression.
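For instance, a base-R sketch of a critical log odds ratio in a logistic regression (mtcars is used only for illustration):

# Same logic for a generalized linear model.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
se  <- summary(fit)$coefficients["wt", "Std. Error"]
qnorm(.975) * se        # critical |log OR| for the wt coefficient
exp(qnorm(.975) * se)   # on the odds-ratio scale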
Examples in R
In this section, we introduce the aforementioned user-friendly implementation of the mathematical computations as functions of our R package.
First, the package should be downloaded and opened with the library function:
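For example, for a package distributed on CRAN (the name below is a placeholder, not the actual package name):

# "criticalpkg" is a placeholder name; substitute the actual package.
install.packages("criticalpkg")
library(criticalpkg)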
For our examples on real data, we used a data set available through an existing R package; below, we illustrate the same computation with a built-in R data set. We want to know the critical value for a two-sample comparison, given the observed group sizes, at a conventional two-sided alpha level of .05.
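Below is a base-R sketch of this computation, using the built-in mtcars data as a stand-in for the package’s own function:

# Observed and critical effect sizes for a two-sample comparison
# of mpg by transmission type (mtcars, illustrative data).
g1 <- mtcars$mpg[mtcars$am == 0]
g2 <- mtcars$mpg[mtcars$am == 1]
n1 <- length(g1); n2 <- length(g2); df <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / df)
d  <- (mean(g2) - mean(g1)) / sp            # observed Cohen's d
J  <- 1 - 3 / (4 * df - 1)                  # Hedges correction
dc <- qt(.975, df) * sqrt(1 / n1 + 1 / n2)  # critical d
c(d = d, g = d * J, d_critical = dc)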
The output gives a wide range of values: the Cohen’s $d$ observed in the data, its Hedges-corrected version, and the critical effect size, that is, the smallest standardized mean difference that could have reached significance with this design.
In the next example, we show the use of the package’s function for correlations. The direction of the hypothesis and the test to apply, either the $t$ distribution or the Fisher’s $z$ transformation, can be specified as arguments.
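A base-R sketch of how the direction of the hypothesis affects the critical correlation (function and defaults are ours):

# A one-sided test lowers the critical correlation relative to a
# two-sided test (t-based version).
critical_r <- function(n, alpha = .05, two_sided = TRUE) {
  a  <- if (two_sided) alpha / 2 else alpha
  tc <- qt(1 - a, df = n - 2)
  tc / sqrt(tc^2 + n - 2)
}
critical_r(40)                      # two-sided: about .31
critical_r(40, two_sided = FALSE)   # one-sided: about .26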
Discussion
With the present article, we propose that researchers compute and report the critical-effect-size value(s) in their empirical articles. This is not intended to replace other strategies aimed at enhancing the NHST approach to inference; strategies such as the emphasis on estimating effect sizes with confidence intervals (APA, 2020) or the a priori planning for statistical power are valuable in their own right. Instead, our proposal serves as a complementary tool, especially beneficial for facilitating the interpretation of results when statistical power deviates from an optimal level (typically falling below but occasionally exceeding it). Critical-effect-size values can be applied retrospectively, even to already published studies, which makes it possible to reconsider, and if necessary reframe, the original interpretations of previously reported findings.
An advantage of reporting critical-effect-size values is that they can be precisely computed in any scenario without requiring assumptions about the expected effect size, as is the case with power calculations. The critical-effect-size value represents a directly interpretable benchmark that is especially useful in situations in which statistical power is below the desired level and researchers are left otherwise uncertain about how to proceed with the interpretation of a study’s findings. For example, say that one reads a published article reporting some effects as statistically significant and others as not: The reader suspects that the study may be underpowered but is largely uncertain about the magnitude of possible true effects. To what extent can the reported results be interpreted, precisely? Knowing the critical-effect-size value provides a clear benchmark. Conversely, say that an effect achieves significance in a very large sample: Researchers tend to draw substantive conclusions based on this. But is it of real theoretical relevance? If, in comparing two groups, such as controls versus treatment, any Cohen’s $d$ larger than a trivially small value would reach significance, the critical effect size makes this explicit and shifts the question from statistical significance to practical relevance.
Reporting the critical-effect-size value(s) can also be an efficient way to allow researchers to evaluate which findings are statistically significant. For example, in a correlation table, researchers customarily add an asterisk to all statistically significant correlations. But as long as all correlations are based on the same sample size, researchers can simply report the critical correlation below the table: Every coefficient exceeding it in absolute value is statistically significant, and every smaller coefficient is not.
Beyond enhancing study design and statistical inferences based on hypothesis tests, reporting critical-effect-size values can also serve an educational purpose. It underscores how the distinction between a significant and nonsignificant result is not determined solely by the presence or absence of a true effect but also by the sample size and the complexity of the analysis. By highlighting a critical-effect-size value, researchers can become more aware of the possibility of Type II errors when results are nonsignificant. Conversely, in studies with exceedingly large samples and in many meta-analyses, the critical-effect-size value(s) may serve as a reminder that any observed effect larger than a trivially small value will likely achieve significance. This emphasizes that the mere attainment of statistical significance in a test is not particularly surprising, especially in nonexperimental studies.
Real-case scenarios may not always be that simple. Hence, we chose to expand the application of computing critical-effect-size values beyond Cohen’s $d$ for group comparisons to correlations, regression coefficients, and meta-analytic effects, as implemented in the accompanying R package.
We suggest that reporting critical-effect-size values is particularly valuable when sample-size planning was not feasible or did not occur a priori. In cases in which optimal power can be attained with a sufficiently large sample size for an effect of a specific magnitude of interest, and this is truly determined a priori, the interpretation of both significance and nonsignificance becomes straightforward. However, when power analysis did not inform the sample size, or when power is likely low but cannot be determined precisely, reporting critical-effect-size values for the obtained sample can help provide context for interpretation. Critical-effect-size values can be computed and interpreted even retrospectively or for studies that have already been published.
In conclusion, reporting critical-effect-size values in empirical articles serves as a valuable addition to researchers’ tool kit, aimed at augmenting transparency and facilitating the interpretability of their findings. Although not designed to supplant existing practices, it provides a useful aid in interpreting newly presented and previously published results, thus advancing the understanding of research outcomes.