Abstract
Conducting research with human subjects can be difficult because of limited sample sizes and small empirical effects. We demonstrate that this problem can yield patterns of results that are practically indistinguishable from flipping a coin to determine the direction of treatment effects. We use this idea of random conclusions to establish a baseline for interpreting effect-size estimates, in turn producing more stringent thresholds for hypothesis testing and for statistical-power calculations. An examination of recent meta-analyses in psychology, neuroscience, and medicine confirms that, even if all considered effects are real, results involving small effects are indeed indistinguishable from random conclusions.
Introduction
Human-subjects research often involves noisy measures and limited sample sizes. Accordingly, small effects and low statistical power are typical in many areas of behavioral and medical science (Marek et al., 2022; Szucs & Ioannidis, 2017). Some argue that this situation is tenable because the ongoing identification of small effects amounts to a steady accumulation of knowledge (Götz et al., 2022). We argue to the contrary. Specifically, we show that the study of small effects frequently produces results that are indistinguishable from flipping a coin to determine the direction of an experimental treatment’s effect. We use this idea to develop a benchmark based on minimum acceptable estimation accuracy. This benchmark yields an intuitive interpretation of effect-size estimates—one based in accurate estimation. We show that calibrating existing tests to our benchmark yields far stricter thresholds for hypothesis testing and for statistical-power calculations. Our work is intended to spark a larger discussion within the scientific community on acceptable estimation accuracy, the interpretation of effects, and statistical standards.
Although there are many exceptions, behavioral scientists almost universally test null hypotheses, which are often formulated as two or more means being exactly equal to one another. Much ink has been spilled noting the shortcomings of this approach (e.g., Krantz, 1999; Nickerson, 2000; van de Schoot et al., 2011). Cohen (1994) famously criticized the null hypothesis through his “nil” hypothesis critique, describing it as a conceptual tool that is ill-suited for answering substantive research questions. He noted that for continuous dependent variables, it is simply impossible for two population means to be truly equal to one another. This means that the null hypothesis acts as a straw man that will inevitably be knocked down given a sufficiently large sample size. By his critique, all effects exist in a trivial sense; it just may be that some are so small that they do not warrant attention. A more meaningful line of investigation is determining whether effects are accurately estimated and characterized.
What constitutes acceptable estimation accuracy? This question is challenging to answer and fraught with subjectivity. A confidence interval (CI) deemed acceptably narrow by one scientist may be unacceptably wide to another. We seek to answer this question by using a reference—a foil—with undeniable negative qualities. To better understand the accuracy of standard methods, we will compare them against a foil estimation process that is, by construction, incapable of accurately estimating effects. Such a foil is useful for handling questions of subjectivity. If a community of scientists agrees that this foil is unacceptably inaccurate, then any estimation process that cannot be distinguished from it is also unacceptably inaccurate.
Our foil must be tailored to the types of questions that behavioral scientists ask and to how they make decisions about data. Behavioral scientists often formulate directional hypotheses about treatment effects. Is the population mean of Group A larger than that of Group B? A strong foil would offer zero information about the correct direction of effects. A foil could randomize the direction of any observed effect; for example, which group mean is reported as larger would be decided via a coin flip. Such a foil creates a worst-case scenario for evaluating any directional hypothesis. In addition, behavioral scientists typically use the outcome of a statistical test to conclude whether a treatment effect is detected. In keeping with our estimation focus, an ideal foil would remove effect detection from the comparison. One way to handle this is for the foil to correctly detect whether an effect exists at rates similar, or identical, to those of standard methods. A scientist using this foil would correctly reject a relevant null hypothesis just as often as someone using standard estimation methods. This would make the foil especially useful for evaluating published findings in the literature.
Scientists using such a foil would arrive at random conclusions regarding their data. All else being equal, they would detect effects as often as scientists using standard methods, but would be incapable of accurately estimating and characterizing them. The logic is straightforward: If one accepts that arriving at random conclusions is unscientific and inaccurate, then it becomes incumbent on the scientific community to use statistical procedures that would be distinguishable from such a foil. In the present work, we focus on the canonical case of using sample means to estimate population means for two independent groups. Our proposed foil consists of an estimation process that randomizes the direction of treatment effects while still correctly rejecting a null hypothesis as often as standard methods.
Our analyses reveal that distinguishing sample means from such a foil requires far larger sample sizes than typically employed in the behavioral sciences, especially when studying the kinds of small effects that are commonplace in the psychological literature. We also show that our foil comparison naturally relates to many existing tests and methods, including those based on traditional null hypotheses. We leverage these connections to provide new calibrations for existing techniques. For power analyses, we show that typical power thresholds of .80 are not sufficient to rule out unacceptable estimation accuracy. Linking our argument to hypothesis testing, we show that far stricter thresholds (on the order of α = .0005 rather than the conventional .05) are needed before results can be distinguished from random conclusions.
Ultimately, all scientific decisions regarding data are made by human beings. A key aim of any statistical methodology is to provide characterizations of data that researchers can understand. What we provide in the current work is simply a perspective, one grounded in a common experimental design with linkages to many other familiar statistical quantities and methods. It is through this framing that we aim to push forward the conversation on estimation accuracy and replication efforts. To further understand our approach and provide precise definitions, consider the following scenario.
A Tale of Two Labs
Consider two hypothetical laboratories, Lab 1 and Lab 2, studying an effect—for instance, the efficacy of a drug. Both labs use a treatment condition (Group A) and a control condition (Group B) and compare the sample means from each group as estimates of the corresponding population means.
Unfortunately, Lab 2 has a glitch in their data-analysis software—it randomly assigns, with equal likelihood, the labels of “treatment” and “control” to those means. That is, whatever the actual sample means for the two conditions turn out to be, there is a 50% chance that Lab 2’s software will swap their labels before any results are reported.
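To make the glitch concrete, the relabeling can be sketched in a few lines of code. This is purely illustrative; the function name and the example values are ours and do not correspond to any actual analysis software.

```python
import random

def lab2_report(mean_treatment, mean_control, rng=random):
    """Return (reported_treatment, reported_control): with probability .5,
    the glitch swaps the two labels before anything else happens."""
    if rng.random() < 0.5:
        return mean_control, mean_treatment
    return mean_treatment, mean_control

# Example with made-up sample means: the labels survive only half the time.
print(lab2_report(0.48, 0.21))
```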
If Lab 2’s error came to light, retraction of any study that relied on this software would be demanded, and a drug approved on the basis of such results would (rightfully) be recalled. But Lab 2 provides an interesting comparison with Lab 1, especially when considering issues of replication and reliability. Lab 2 will correctly reject the null hypothesis of equal population means exactly as often as Lab 1, because swapping the group labels leaves a nondirectional test of that hypothesis unchanged.
Lab 1 and Lab 2 are identical with the exception that Lab 2 is using a random-conclusions estimator (RCE), which, by any measure, is not science because the direction of effects (including published effects) is determined via a coin flip. Intuitively, we would like to believe that results from the two labs would be readily distinguishable. Unfortunately, in many areas of behavioral science, even if all effects exist, Lab 2’s results will often be strikingly similar to Lab 1’s, and the gain from removing their results from the literature may be marginal at best. This situation is illustrated in Figure 1, which presents scenarios for effect sizes that are conventionally considered large, medium, and small (yet interpretable; Cohen, 1988; Sawilowsky, 2009). For simplicity, these scenarios assume that outcomes in both conditions are normally distributed with unit variance. The left and right columns of Figure 1 illustrate the sampling distributions of mean estimates in each of the labs. Each dot represents a pair of means from a single study. How well these means estimate the population means can be summarized by the mean squared error (MSE) of each lab’s estimates.

Figure 1. Distribution of sample mean estimates in Lab 1 and Lab 2 for large, medium, and small effect sizes.
In the top row of Figure 1, the effect size is large. Lab 2’s bimodal distribution of estimates clearly evidences the software error, and the resulting MSE is 19 times larger than Lab 1’s. We use ψ to denote this ratio of Lab 2’s MSE to Lab 1’s MSE, so here ψ = 19. As the effect size shrinks (middle and bottom rows), the two labs’ distributions become increasingly difficult to tell apart, and ψ falls toward 1.
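The MSE comparison underlying Figure 1 can also be checked by direct simulation. The sketch below assumes the same setup described above (normal outcomes with unit variance); the particular δ and per-group n are illustrative choices of ours, not the values used to generate the figure.

```python
import numpy as np

def simulate_psi(delta=0.5, n=50, n_studies=200_000, seed=1):
    """Monte Carlo estimate of psi: Lab 2's (RCE) MSE divided by Lab 1's MSE
    when estimating the two population means from unit-variance normal data."""
    rng = np.random.default_rng(seed)
    mu_a, mu_b = delta, 0.0                   # unit variance, so delta is the mean difference
    se = 1.0 / np.sqrt(n)                     # sampling SD of each group's sample mean
    xbar_a = rng.normal(mu_a, se, n_studies)  # Lab 1's estimates of mu_a across many studies
    xbar_b = rng.normal(mu_b, se, n_studies)  # Lab 1's estimates of mu_b
    mse_lab1 = np.mean((xbar_a - mu_a) ** 2 + (xbar_b - mu_b) ** 2)

    swap = rng.random(n_studies) < 0.5        # Lab 2's glitch: labels swapped half the time
    rce_a = np.where(swap, xbar_b, xbar_a)
    rce_b = np.where(swap, xbar_a, xbar_b)
    mse_lab2 = np.mean((rce_a - mu_a) ** 2 + (rce_b - mu_b) ** 2)

    return mse_lab2 / mse_lab1

print(round(simulate_psi(), 2))               # approximately 7.25 for these illustrative inputs
```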
Effect size and sample size combinations like those in the bottom row of Figure 1 raise an important question: If Lab 2’s results are subject to retraction, how should we interpret Lab 1’s results? Put differently, if one’s results look unscientific, perhaps they are unscientific. A computer glitch on the scale of Lab 2’s results is, one hopes, an unlikely occurrence, but the comparison is useful in illustrating what a worst-case estimator could look like and why it would be problematic if it were indistinguishable from current practice. Within the behavioral sciences, many of the hypotheses being tested, if not the vast majority, are directional in nature. The RCE completely randomizes the direction of effects, removing any information about direction from the data. Yet the RCE is special in that it still detects effects at the same rate as sample means via a nondirectional t test, which is, once again, ubiquitous practice in the behavioral sciences. In this way, our RCE comparison provides an interesting new perspective on published literature in the field, which often hinges upon the successful reporting of a significant test. We are not seriously suggesting that such a computer glitch exists, but we do think it highly problematic if a large corpus of work within the behavioral sciences is indistinguishable from such an error.
General Formulation
If the goal is to be distinguishable from a veritable Lab 2, as instantiated by the RCE, we can use ψ as an index to set standards for hypothesis testing and sample-size planning. As shown in the Appendix, ψ simplifies to

ψ = nδ²/2 + 1, (1)
where n is the sample size per group and δ is the population standardized mean difference. Equation 1 is straightforward to interpret: For given values of δ and n, sample means are ψ times as accurate (in terms of MSE) as the RCE. Although ψ is distribution-free and interpretable outside of any testing framework, it functionally relates to a two-sample t test and the resulting p values. See the Appendix for connections between ψ and other metrics, including out-of-sample predictive accuracy.
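A direct implementation of Equation 1 makes this interpretation easy to explore. In the sketch below, the function name is ours; the final lines illustrate the functional relation to the two-sample t statistic noted in the Discussion, under the equal-n, pooled-variance case.

```python
import math

def psi(delta, n):
    """Equation 1: accuracy of sample means relative to the RCE (ratio of MSEs),
    for a standardized effect size delta and a per-group sample size n."""
    return n * delta ** 2 / 2 + 1

print(psi(0.2, 100))                 # 3.0: a small effect needs 100 per group to reach psi = 3
print(psi(0.5, 50))                  # 7.25: a medium effect fares much better at the same n

# Plugging an observed d into Equation 1 reproduces 1 + t**2 for the equal-n,
# pooled-variance two-sample t test, because t = d * sqrt(n / 2).
d, n = 0.5, 50
t = d * math.sqrt(n / 2)
print(1 + t ** 2, psi(d, n))         # identical values
```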
Determining a minimum acceptable ψ for a given scientific discipline is perhaps best decided on a case-by-case basis, taking into consideration specific research goals (S. F. Anderson & Maxwell, 2016; Navarro, 2019). Here, we demonstrate the consequences of a threshold of 3 for the interpretation of results and sample-size planning. Although somewhat arbitrary and perhaps modest, this threshold is motivated by the logic illustrated in Figure 1. When ψ falls below 3, results begin to resemble the bottom row of Figure 1, in which Lab 1’s and Lab 2’s estimates are difficult to tell apart; requiring a ψ of at least 3 ensures that the sample means are at least three times as accurate as the RCE.
Table A1 in the Appendix characterizes ψ in terms of the information about the direction of effect that is gained by using sample means versus the RCE. For example, for ψ = 3, sample means identify the correct direction of the effect roughly 92% of the time, whereas the RCE never exceeds the 50% chance level.
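Although Table A1 appears in the Appendix, the flavor of this comparison can be sketched under the same normality assumptions used in Figure 1: the probability that the sample means order the two population means correctly can be written in terms of ψ, whereas the RCE sits at chance. The values below are illustrative and are not taken from Table A1.

```python
import math

def correct_direction_prob(psi_value):
    """P(the sample means order the two population means correctly), assuming
    normal outcomes with equal variance and equal group sizes. This equals
    Phi(sqrt(psi - 1)); the RCE stays at .5 no matter how large psi is."""
    z = math.sqrt(psi_value - 1.0)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for value in (1.5, 3.0, 7.25):
    print(value, round(correct_direction_prob(value), 3))
# psi = 3 corresponds to getting the direction right about 92% of the time.
```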
Applications to CIs and Hypothesis Testing
Applying Equation 1 to the bounds of a 95% CI over δ provides researchers a simple, transparent method to gauge how accurately a range of plausible effects is being estimated. For example, consider a study with a sample size of 50 per group that yields an effect size point estimate d of 0.5 and a 95% CI equal to approximately [0.10, 0.90]. Mapping these endpoints through Equation 1 gives a ψ interval of roughly [1.3, 21]; although the point estimate corresponds to a ψ of about 7, the interval contains values below 3, so the data do not rule out unacceptable estimation accuracy.
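The transformation can be sketched as follows, using a simple large-sample normal approximation for the CI over d (an actual analysis would typically use the noncentral t distribution); the function names are ours.

```python
import math

def psi(delta, n):
    """Equation 1."""
    return n * delta ** 2 / 2 + 1

def psi_interval(d, n, z=1.96):
    """Map an approximate 95% CI for d onto psi, using the large-sample
    standard error sqrt(2 / n). If the d interval crosses zero, the smallest
    plausible psi is 1 (attained at delta = 0)."""
    se = math.sqrt(2.0 / n)
    lo, hi = d - z * se, d + z * se
    lo_psi = 1.0 if lo <= 0 <= hi else min(psi(lo, n), psi(hi, n))
    return lo_psi, max(psi(lo, n), psi(hi, n))

print(psi_interval(d=0.5, n=50))   # roughly (1.3, 20.9): values below 3 are not ruled out
```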
Figure 2 contextualizes ψ within familiar statistical quantities:
Panel (a) - ensuring that ψ is greater than 3 often requires a large n, especially when dealing with smaller effect sizes (e.g., an effect of δ = 0.2 requires at least 100 participants per group; see the sketch following this list).
Panel (b) - the requirements for acceptability can also be framed in terms of statistical power. Regardless of δ, under the standard alpha (.05), statistical power of approximately .92, rather than the conventional .80, is required to rule out unacceptable ψ values.
Panel (c) - it is well known that a larger n results in smaller observed effects becoming statistically significant. However, the ψ associated with said effects can still be unacceptable. For example, effects that only just reach significance at the .05 level do not rule out ψ values below 3, no matter how large n is.
Panel (d) - some researchers consider an effect to be robust or reliable when the 95% CI of δ does not cross zero (Cumming, 2013). But when we transform a strictly positive or negative interval onto a range of plausible ψ values, we see that they will include unacceptable values (for a threshold of 3) unless the width of the interval is small relative to the distance of its nearer bound from zero (under Equation 1, the bound closer to zero must exceed 2/√n in absolute value).
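The sample-size requirement in panel (a) follows directly from Equation 1. The sketch below, with effect sizes chosen by us for illustration, tabulates the per-group n needed to reach a ψ of 3.

```python
import math

def n_for_psi(delta, target=3.0):
    """Per-group sample size needed for psi (Equation 1) to reach the target."""
    return math.ceil(2 * (target - 1) / delta ** 2)

for delta in (0.8, 0.5, 0.2, 0.1):
    print(f"delta = {delta}: at least {n_for_psi(delta)} participants per group")
# delta = 0.8: 7   delta = 0.5: 16   delta = 0.2: 100   delta = 0.1: 400
```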

Examining Prior Meta-Analyses
We examined several recent meta-analyses to get a snapshot of how common poor ψ values are in various subfields (Gaeta & Brydges, 2020; Nuijten et al., 2020; Siegel et al., 2021; Szucs & Ioannidis, 2017). Table 1 shows a remarkable consistency across subfields, with the estimated median power to detect a small effect (δ = 0.2) falling well below conventional standards and the corresponding ψ values falling short of a threshold of 3.
Table 1. Median Power to Detect Small (δ = 0.2) Effects, and the Corresponding ψ Values, Across the Meta-Analyses Examined
Note: The ψ values are also illustrated in Figure 2b. We calculated power for Gaeta & Brydges (2020) and Siegel et al. (2021) on the basis of median sample sizes.
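For the meta-analyses in which only median sample sizes were available, power and ψ for a small effect can be recovered along the following lines. The per-group n shown here is a placeholder rather than a value from any of the cited meta-analyses, and the power calculation uses a normal approximation to the two-sample t test.

```python
from scipy.stats import norm

def power_two_sample(delta, n, alpha=0.05):
    """Approximate power of a nondirectional two-sample comparison of means
    with n per group (normal approximation to the t test)."""
    z_crit = norm.ppf(1 - alpha / 2)
    lam = delta * (n / 2) ** 0.5            # noncentrality of the test statistic
    return norm.cdf(lam - z_crit) + norm.cdf(-lam - z_crit)

def psi(delta, n):
    """Equation 1."""
    return n * delta ** 2 / 2 + 1

n_median = 40                                # placeholder, not a value from the cited meta-analyses
print(round(power_two_sample(0.2, n_median), 2))   # about .15: low power for a small effect
print(round(psi(0.2, n_median), 2))                # 1.8: below a threshold of 3
```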
Extensions
Our Lab 1 and Lab 2 framing provides a concrete way for scientists to grapple with inherently difficult questions about acceptable estimation accuracy and replication within the behavioral sciences. This framing could be extended to other estimators, testing frameworks, and experimental designs. In the current application, we focused on sample means and the usage of the independent two-sample t test. We did so because of the ubiquity of this experimental design and testing framework within the behavioral sciences. Our RCE formulation could be used to calibrate power and hypothesis-testing thresholds for statistical tests other than the standard t test, such as Welch’s test, which allows for differences in group variance (Welch, 1947). Future work could explore how different configurations of group variances impact the RCE sample-mean comparison and what testing and power thresholds provide acceptable estimation accuracy.
The RCE is defined by the randomization of group labels on the estimates of interest, but these are not required to be population means. In keeping with our two-group design, an RCE could be defined as the randomization of group labels to estimates of population medians, which may be an interesting application for heavily skewed distributions. One could then examine alternative power and hypothesis-testing calibrations for tests such as the Wilcoxon-Mann-Whitney U test. It should be noted, however, that the Wilcoxon-Mann-Whitney U test is appropriate only for evaluating whether two population medians are different under relatively strict assumptions—that is, that both populations are identically distributed and differ only by a shift in location (Divine et al., 2018).
The RCE and two-labs perspective could be extended to other experimental designs. In defining a general RCE comparison, we want to preserve two distinct features of our current formulation. First, a generalized RCE should randomize the conclusions of scientific interest. Applications could include a one-way analysis of variance, in which group mean labels are randomized, thus randomizing which means are larger than others while preserving Type I and Type II error rates for the omnibus F test. Generalizations could also include multiple regression: Certain aspects of the estimation process could be randomized, such as whether one standardized regression coefficient is larger than, or has the same sign as, another. Second, a generalized RCE should also yield statistically significant results at rates similar to those of the standard estimation method being evaluated. This gives a generalized Lab 2 comparison additional bite, because the generalized RCE is not just randomizing the direction of results; it is also leading to random decisions regarding data. This second point is not intended to avoid important questions relating to preregistration practices (Nosek et al., 2019; Szollosi et al., 2020) but rather to place a finer point on an RCE comparison.
Given a suitable RCE and a standard method of estimation (e.g., ordinary least squares), we define a generalized ψ as the ratio of the respective mean-squared-error values. Although MSE has several nice properties, other accuracy metrics could also be substituted. Under this definition, ψ retains its simple interpretation: An estimator is ψ times as accurate as a generalized RCE. Future work could develop these comparisons and relate them to existing techniques, such as CIs, statistical power, and hypothesis testing.
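As one concrete instantiation of these ideas (a sketch of our own, not a developed method), a generalized RCE for a one-way design could randomly permute the labels attached to the estimated group means, which leaves the omnibus F statistic unchanged, and a generalized ψ could then be computed as a ratio of Monte Carlo MSEs:

```python
import numpy as np

def generalized_psi(mu, n, n_studies=100_000, seed=7):
    """Monte Carlo generalized psi for a one-way design with unit-variance
    normal outcomes: the MSE of a label-permuting RCE divided by the MSE of
    the ordinary group means, both measured against the true means mu.
    Permuting labels leaves the omnibus F statistic unchanged."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    k = mu.size
    se = 1.0 / np.sqrt(n)                                     # SD of each group's sample mean
    means = rng.normal(mu, se, size=(n_studies, k))           # standard estimates, study by study
    perms = rng.permuted(np.tile(np.arange(k), (n_studies, 1)), axis=1)
    rce = np.take_along_axis(means, perms, axis=1)            # RCE: group labels shuffled
    mse_standard = np.mean(np.sum((means - mu) ** 2, axis=1))
    mse_rce = np.mean(np.sum((rce - mu) ** 2, axis=1))
    return mse_rce / mse_standard

# Three groups with clearly separated means (illustrative values):
print(round(generalized_psi(mu=[0.0, 0.5, 1.0], n=30), 2))    # approximately 11
```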
Recommendations
Report ψ intervals
When reporting CIs over Cohen’s d values, we recommend also reporting the corresponding ψ interval using that study’s sample size. A CI communicates a range of plausible effect sizes, whereas the CI over ψ communicates how well the effect is being estimated relative to an easily understood benchmark. If the ψ CI includes values less than 3, it is worth reporting that the data do not rule out unacceptable levels of estimation accuracy. Although we have illustrated some consequences of using 3 as a threshold for ψ, other values could be used depending upon the context. The key takeaway is that ψ intervals translate effect-size estimates into a comprehensible measure of estimation accuracy. Reporting ψ intervals also provides researchers with a degree of nuance when reporting results, allowing them to claim (or not) the detection of an effect, up to the usual Type I error rate under a specified α level, while also being transparent about estimation accuracy. To be clear, no additional inference is taking place: Transforming a CI over δ values into one over ψ values simply re-expresses the same information from an estimation perspective. Making use of such a perspective can be done regardless of one’s statistical-inferential inclinations (e.g., Bayesian vs. frequentist). It is worth noting once again that ψ is distribution-free, in that its interpretation as the ratio of MSE values between sample means and the RCE does not depend upon any particular distributional form (see the Appendix for details).
Power statistical tests for estimation
When conducting a priori power analyses, we recommend that the sample size be selected according to effective estimation of the effect, rather than simple detection. We demonstrated that power of .92, when using an α level of .05, is needed to rule out unacceptable estimation accuracy; the conventional target of .80 is not sufficient for this purpose.
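A minimal planning sketch consistent with this recommendation is given below; the assumed effect size is an illustrative choice of ours, and the calculation uses a normal approximation rather than exact noncentral t power.

```python
import math
from scipy.stats import norm

def n_for_power(delta, power, alpha=0.05):
    """Per-group n for a nondirectional two-sample comparison (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z / delta) ** 2)

assumed_delta = 0.3                                  # illustrative planning value
print(n_for_power(assumed_delta, power=0.80))        # 175 per group: detection-oriented target
print(n_for_power(assumed_delta, power=0.92))        # 252 per group: estimation-oriented target
```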
Bayesian estimation
One takeaway from our arguments is that there simply is not much information contained in small samples and small effects. Bringing more information to the analysis can take many forms, with Bayesian methodology being an obvious approach. Informative priors can be used to improve estimation accuracy of mean estimates (Gelman et al., 1995), and such priors can be incorporated into the t test itself (see, notably, Rouder et al., 2009; Gronau et al., 2019; and Ly & Wagenmakers, 2021). Bayesian formulations are well suited for integrating informative hypotheses with cognitive models (Lee & Vanpaemel, 2018; Vanpaemel & Lee, 2012), which can help avoid some of the estimation issues we raise here. This approach is especially important for researchers who face limited sample sizes by the very nature of their investigations. Of course, the accuracy of Bayesian approaches under limited sample sizes will be prior dependent (e.g., McNeish, 2016). The Appendix also provides two examples of how prior beliefs can be incorporated into the computation of ψ.
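As a toy illustration of this general point (a conjugate normal-normal sketch of our own, distinct from the prior-based ψ computations described in the Appendix), an informative prior shrinks noisy group-mean estimates and can reduce their MSE in small samples:

```python
import numpy as np

def posterior_mean(xbar, n, prior_mean=0.0, prior_sd=0.5, sigma=1.0):
    """Conjugate normal posterior mean for a group mean, given the sample mean
    xbar of n observations with known SD sigma and a normal prior."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / sigma ** 2
    return (prior_prec * prior_mean + data_prec * xbar) / (prior_prec + data_prec)

rng = np.random.default_rng(0)
mu, n = 0.2, 20                                      # a small true mean, a small sample
xbars = rng.normal(mu, 1 / np.sqrt(n), 100_000)      # sample means across many hypothetical studies
mse_raw = np.mean((xbars - mu) ** 2)
mse_bayes = np.mean((posterior_mean(xbars, n) - mu) ** 2)
print(round(mse_raw, 4), round(mse_bayes, 4))        # shrinkage lowers MSE when the prior is sensible
```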
Computational modeling and formal theory
Throughout, we have treated the accurate estimation of an effect as a primary goal. There is much to say about whether conceptualizing and testing theories in this way is optimal from a meta-science perspective. Indeed, Scheel (2022) argued that many psychological hypotheses are imprecisely specified, leading to questionable attempts at replication and measurement. Improved theory and quantitative modeling can lead to more compelling tests (e.g., model selection; for a recent review, see Myung & Pitt, 2018), avoiding simple effect-based characterizations (van Rooij & Baggio, 2021); see also Guest and Martin (2021) and Proulx and Morey (2021). Lee et al. (2019) and Devezer et al. (2019) provide thoughtful analysis and argumentation for how formalism can be used to improve scientific practices.
A more stringent threshold (α = .0005) for two-group between-subjects hypothesis testing
Using α = .0005 sets a more stringent threshold than recent high-profile recommendations for methods reform (Benjamin et al., 2018). It’s hardly our goal to further contribute to file-drawer problems by arguing that some studies should not be published if ψ is less than 3. Indeed, we believe that all studies should be reported and that p values (likewise, ψ values) should not serve as gatekeepers to the literature. Yet for researchers who want to provide a characterization that goes beyond mere detection (e.g., “the two groups differ”) and ensure that their estimates are distinguishable from random conclusions, a more prohibitive α level is arguably required. Rather than a tool for censorship, ψ can be perceived as a useful way to adjust the strength of one’s claims to the expected accuracy of the estimation process.
The importance of experimental design
The fact that small effects are commonly observed does not mean that they are inevitable—one should always keep in mind the artificial and constructive nature of effects (e.g., Guala & Mittone, 2005; Woodward, 1989). In the behavioral sciences, effects are often small because of the use of minimal experimental manipulations that make the conditions being compared virtually identical, apart from a minor change (for a discussion, see Prentice & Miller, 1992). Researchers can rely on ψ to gauge the ability of a given experimental design to elicit a target phenomenon with sufficient accuracy, which in some cases can lead to the development of alternative experimental approaches. We do emphasize that notions of effect size are just one of many factors that impact experimental outcomes; see Buzbas et al. (2023) for a formal treatment of experimental design and its relation to replication rates.
Discussion
In reaction, one might argue that estimation accuracy should not be much of a concern if we care only about correctly detecting effects. We find this argument untenable for four reasons: First, knowledge about effect sizes plays a crucial role when using basic research findings to develop effective real-world interventions (Schober et al., 2018). Second, developing a theoretical account of the phenomena being studied typically requires more than just nominal or ordinal information (Meehl, 1978). Third, this reaction is at odds with the widespread use of statistical models that are predicated on quantitative comparisons of effects (Kellen et al., 2021) and with the popularity of inferential frameworks that call for quantitative reasoning about effects (Vanpaemel, 2010). Fourth, even in the context of coarse-grained theoretical accounts and ordinal predictions, knowledge about effect sizes is still relevant in the sense that it can inform us about matters of theoretical scope (i.e., how many people conform to a given theory’s predictions; Davis-Stober & Regenwetter, 2019; Heck, 2021). That being said, we are not claiming that a focus on detection is by itself problematic, or that there are no legitimate contexts in which it takes center stage; we are asserting only that a mature scientific characterization calls for more than that, namely, accurate estimates.
Alternatively, one could try to downplay the importance of estimation accuracy by arguing that talk of effects is by itself problematic, in the sense that effects are of secondary importance relative to the explanation of psychological capacities (van Rooij & Baggio, 2021). We take issue with pursuing such a line of reasoning here, as it mistakenly implies that giving psychological theorizing the attention that it is owed somehow eliminates effects from researchers’ discourses. As a counterexample, consider the recent discussion on benchmark effects in short-term and working memory, a research domain that stands out for its highly sophisticated theoretical accounts (Oberauer et al., 2018). By contrast, the empirical exigencies of theory testing and development give estimation accuracy center stage (Meehl, 1978).
One could also argue that there is nothing new to see here, given that ψ is so closely related to already-established quantities. For instance, it is easy to see that ψ is a quadratic function of the t statistic (for details, see the Appendix). Rather than an all-new, all-different quantity to be reconciled with all the other ones in researchers’ toolboxes, what ψ offers is a reframing of an old problem. It is an attractive feature, not a shortcoming, that ψ is closely related to known quantities or tests, or that the pursuit of estimation accuracy ends up recovering similar methodological proposals with distinct motivations (e.g., Benjamin et al., 2018). It is also worth noting once again that although we assumed Gaussian distributions when deriving our ψ value recommendations, the definition of the RCE and the subsequent interpretation of ψ as a ratio of MSE values are distribution-free.
Regardless of one’s scientific view, random conclusions are indefensible. It follows that researchers’ empirical findings should, at a minimum, be distinguishable from a foil whose conclusions are determined by a coin flip. But as we have demonstrated, this is easier said than done: Many published research studies, despite honest efforts, have barely improved upon the estimation accuracy of the infamous Lab 2. As it turns out, one can easily fail to reliably outperform Lab 2, even if effects are real, studies are based in strong theory, and no questionable research practices are at play. The RCE approach and the ψ index that can be derived from it provide a new perspective on methodological reform (Devezer et al., 2019; Munafò et al., 2017; Shrout & Rodgers, 2018). Everything begins with a simple statement: The estimation accuracy of our methods should be distinguishable from a random-conclusions foil. In the pursuit of this modest goal, we find that the default p value threshold of .05 does not rule out unacceptable conditions (see the bottom row of Fig. 1), leading us to more stringent criteria that also address known concerns with measurement error, statistical power, and replicability (Gelman & Carlin, 2014; Loken & Gelman, 2017; Maxwell et al., 2015; but see also Bak-Coleman et al., 2022). Based on these results, we believe that ψ and the RCE approach more generally constitute an important tool in improving psychological science.
