Abstract
When running statistical tests, researchers can commit a Type II error, that is, fail to reject the null hypothesis when it is false. To diminish the probability of committing a Type II error (β), statistical power must be augmented. Typically, this is done by increasing sample size, as more participants provide more power. When the estimated effect size is small, however, the sample size required to achieve sufficient statistical power can be prohibitive. To alleviate this lack of power, a common practice is to measure participants multiple times under the same condition. Here, we show how to estimate statistical power by taking into account the benefit of such replicated measures. To that end, two additional parameters are required: the correlation between the multiple measures within a given condition and the number of times the measure is replicated. An analysis of a sample of 15 studies (total of 298 participants and 38,404 measurements) suggests that in simple cognitive tasks, the correlation between multiple measures is approximately .14. Although multiple measurements increase statistical power, this effect is not linear, but reaches a plateau past 20 to 50 replications (depending on the correlation). Hence, multiple measurements do not replace the added population representativeness provided by additional participants.
A frequent objective of an experiment is to test the effect of one or more independent variables on one or more dependent variables. Statistical tests or confidence intervals can be used for this purpose. In null-hypothesis significance testing, for example, a researcher may calculate the probability (the p value) of obtaining data at least as extreme as those observed if the null hypothesis were true; when this probability falls below a predetermined threshold (e.g., .05), the null hypothesis is rejected.
The probability that a small p value will be obtained when the null hypothesis is in fact false, that is, the probability of correctly rejecting it, is the statistical power of the test (1 − β).
Statistical power is a concept taken from the work of Neyman and Pearson (1933). They proposed selecting a critical value for the test statistic such that the probability of committing a Type I error (rejecting the null hypothesis when it is true) is held at a chosen level, α. Given this critical value, the sample size, and the size of the true effect, one can compute the probability of committing a Type II error (β) and, conversely, the power of the test (1 − β).
In a seminal publication, Cohen (1992) calculated the required sample size for many statistical tests and different levels of effect sizes to obtain a recommended statistical power (1 – β) of .80 (also see Cohen, 1969, 1977). The research community realized that to detect small effect sizes, the required sample size can be quite large (e.g., more than 300 participants for analyses of variance, ANOVAs, with three groups). Because this requirement is immense in some areas of experimental psychology, a large number of studies lack adequate statistical power (Sedlmeier & Gigerenzer, 1989). With the emergence of software to calculate the required sample size given an expected effect size (e.g., G*Power; Faul, Erdfelder, Lang, & Buchner, 2007), statistical power in psychological experiments should have increased. However, the situation remains unchanged: Most studies still lack sufficient statistical power (Abraham & Russell, 2008; Button et al., 2013). Even in fields in which effect sizes are reasonably large, such as neuropsychology, the median power is estimated to be below .50 (Bezeau & Graves, 2001).
Statistical power is calculated under the assumption that the measurement is performed once per participant for every condition of an experiment. However, in some areas of psychology, it is conventional to measure participants in each condition multiple times. This method is sometimes referred to as subsampling: The multiple measurements are aggregated, typically by averaging, into a single score per participant and condition. In questionnaires, for example, several items commonly measure the same construct, and the item scores are combined into a single scale score.
Similarly, in some other areas of psychology (e.g., single-measure paradigms, such as response-time experiments), participants may perform multiple trials under the same condition (Rouder & Haaf, 2018). Whereas multiple items are used in questionnaires mainly to increase the reliability of measures and the validity of constructs, there are several reasons for replication in single-measure paradigms. First, if answers can be right or wrong, erroneous answers are often removed under the assumption that they do not adequately reflect the process under investigation. Second, if on some trials participants produce responses whose magnitude is most probably caused by lapses or distraction, these outlying responses are sometimes removed as well. Without replications, these removals would frequently leave empty cells, which complicate data analysis. Third, if measurements were taken only once, the experiment might last only a few minutes, and it would be difficult to attract participants to a controlled environment. Finally, replicating a measurement yields a more precise estimate of true performance. As a consequence, statistical power is improved. This last reason is the focus of the present article.
How do replications affect statistical power? If one assumes that the measurements do not covary, one could argue that when participants are measured twice, the sample size required to achieve a target level of power is halved. Indeed, increasing the reliability of the measure through subsampling reduces the variance of each participant's observed score. However, this argument ignores the fact that a given participant's measurements within a condition are correlated: The stronger the correlation is, the less one learns from subsequent measurements.
In this article, we demonstrate how to assess the benefit to statistical power when participants in each condition are measured multiple times. The objective is to extend the work of Cohen (1992) by taking subsampling into account and to help researchers attain adequate statistical power with realistic sample sizes. In the first section, we show how to take subsampling into account using two parameters: the number of measurement replications, m, and the correlation across replications, r.
In the second section, we roughly estimate the magnitude of r in simple cognitive tasks by reanalyzing the raw data of 15 published experiments.
Correlation Across Replications and Its Effects on Power
In its simplest form, the correlation across replications is just the average Pearson correlation between all the pairs of measures. However, missing and erroneous data can be obstacles to computing the Pearson correlation, and we therefore review three additional methods to estimate the correlation across replications in Appendix A. To evaluate the impact of correlation across replications on statistical power, we first examine two extreme scenarios. We then show that the general case is a linear interpolation of these limiting scenarios.
First scenario: no correlation across replications
In this scenario, the intuitive argument proposed in the introduction is very close to exact: Doubling the number of measurements can be accompanied by halving the number of participants with no loss of statistical power. Consider, for example, the power of a one-group two-tailed t test. Power increases with the ratio of the true effect to the standard error of the mean,

Δ / √(σ² / (n_1 − 1)), (Equation 1)

where Δ is the difference between the population mean and the value stated by the null hypothesis, σ² is the variance of the scores, and n_1 − 1 is the degrees of freedom when n_1 participants are each measured once. The quantity under the square root, σ² / (n_1 − 1), is the variance of the means in the absence of replications (Equation 2).
To introduce replicated measures in Equation 1, let us use n_m to denote the number of participants when each participant is measured m times per condition, the m measurements being averaged into a single score per participant. In the absence of correlation across replications, every measurement is as informative as an independent observation, so the ratio becomes

Δ / √(σ² / (m (n_m − 1))), (Equation 3)

where m is the number of replications and n_m − 1 is the degrees of freedom of the design with replications.
Given that the numerators in Equations 1 and 3 are the same, statistical power will be the same in these equations when the denominators are equal. Consequently, for a single-measure design and a design with replications to have the same power, (n_1 − 1) must equal m (n_m − 1), that is,

n_m = (n_1 − 1) / m + 1. (Equation 4a)
For example, in the absence of correlation across replications, the statistical power obtained by measuring 100 participants once per condition is the same as the power obtained by measuring 51 participants twice (more precisely, the number is 50.5 participants, which is rounded up). For a more striking example, consider that measuring 250 participants once per condition has the same statistical power as measuring 26 participants 10 times per condition ((250 − 1)/10 + 1 = 25.9, rounded up to 26).
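To make the no-correlation equivalence concrete, here is a minimal sketch in Mathematica (the language used in Box 1 below); the function name NmNoCorrelation is ours, not from the original article:

(* Participants needed when each is measured m times, matching the *)
(* power of n1 participants measured once, with uncorrelated replications *)
NmNoCorrelation[n1_, m_] := (n1 - 1)/m + 1

NmNoCorrelation[100, 2] (* 50.5, rounded up to 51 *)
NmNoCorrelation[250, 10] (* 25.9, rounded up to 26 *)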
The same result can be derived from the following linear model:
X_ij = µ + ε_ij,

where µ is the population grand mean, X_ij is the jth measurement of participant i, and the ε_ij are independent random errors with mean 0 and variance σ². Under this model, all n_m × m measurements are independent observations, which is why participants and replications can be traded for one another.
This first scenario is highly unrealistic because it assumes that each measurement is independent within and between participants. The next scenario is also highly improbable.
Second scenario: perfect correlation across replications
In this scenario, a perfect correlation between measurements is assumed: Regardless of the number of replications, a participant's performance is always the same. Consequently, only one measurement per participant is necessary to obtain data with maximum reliability. Therefore, n_m = n_1 regardless of m (Equation 4b): Replications do not reduce the number of participants required.
In other words, the only source of variation is between participants, and adding replications to the experimental design in this scenario is a waste of time and effort. In this scenario, statistical power after measuring 250 participants once per condition is the same as statistical power after measuring 250 participants any number of times per condition.
The second scenario can also be derived from the following linear model:
X_ij = µ + α_i,

in which α_i is the stable effect of participant i, a random variable with mean 0 and variance σ². Because the model contains no within-participant error term, every measurement of a given participant is identical, and replications add no information.
General case: some correlation across replications
Even though the two extreme scenarios just presented are unlikely to occur, they illustrate boundary solutions. As we show next, the general solution is a linear interpolation between these two scenarios based on r, the correlation across replications.
By equating the variance of the means in the absence and in the presence of replications (Equations 2 and 5, respectively), we can measure the impact of r on the number of participants required to maintain statistical power. In the presence of replications, the variance of the means is

σ² (1/m + ((m − 1)/m) r) / (n_m − 1). (Equation 5)
Solving for n_m, we obtain

n_m = (n_1 − 1) (1/m + ((m − 1)/m) r) + 1, (Equation 4c)
which, when simplified, is clearly a linear interpolation of the equations for calculating the values of n_m in the two extreme scenarios: The multiplier 1/m + ((m − 1)/m) r equals (1 − r)(1/m) + r(1), which reduces to the no-correlation solution when r = 0 and to the perfect-correlation solution when r = 1.
For example, if the correlation across replications is .33, the statistical power obtained from measuring 250 participants once per condition is the same as the statistical power obtained from measuring 100 participants 10 times per condition ((250 − 1)(1/10 + (9/10)(.33)) + 1 ≈ 99.9, rounded up). By using the mean squared error instead of σ², the same reasoning extends to designs analyzed with ANOVAs.
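As a quick numerical check of this interpolation, the following Mathematica sketch (the function name EquivalentNm is ours) reproduces the example above as well as the two boundary scenarios:

(* Participants needed with m replications and correlation r to match *)
(* the power of n1 participants measured once (Box 1 implements the reverse) *)
EquivalentNm[n1_, m_, r_] := (n1 - 1)*(1/m + (m - 1)/m*r) + 1

EquivalentNm[250, 10, 0.33] (* 99.853, rounded up to 100 *)
EquivalentNm[250, 10, 0] (* 25.9: the no-correlation scenario *)
EquivalentNm[250, 10, 1] (* 250: the perfect-correlation scenario *)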
To model this situation, let each measurement be the sum of a stable participant effect and a random error:

X_ij = µ + α_i + ε_ij,

where α_i is the effect of participant i, a random variable with mean 0 and variance σ²_α, and ε_ij is an independent random error with mean 0 and variance σ²_ε. Under this model, the correlation between any two measurements of the same participant (the intraclass correlation) is r = σ²_α / (σ²_α + σ²_ε), and the variance of a participant's mean of m measurements is σ²_α + σ²_ε/m = σ² (1/m + ((m − 1)/m) r), where σ² = σ²_α + σ²_ε is the total variance, recovering the numerator of Equation 5.
Overview of Correlations Across Replications in 15 Studies
To perform an a priori power analysis in a study with replications, an estimate of the effect size and an estimate of the correlation across replications must be obtained. Although effect sizes are often reported in empirical studies, or at least can be calculated using the reported descriptive statistics, correlation across replications is not reported in many single-measure studies.
In other areas of psychology, the correlation across replications is more commonly reported. In a review of educational achievement, Hedges and Hedberg (2007) found a correlation value around .2 when all the surveys were included; when the scores were restricted to more homogeneous subpopulations (e.g., those with low socioeconomic status), the correlation was lower, at around .1. Murray and Blitstein (2003) found a higher typical correlation, near .3, in their review of medical research and studies of health-related outcomes, whereas in a review of national health surveys, Gulliford, Ukoumunne, and Chinn (1999) reported a low correlation (as low as .01) when clusters were states, but a higher correlation (about .3) when clusters were at the level of households.
For studies involving questionnaires, researchers can use the reported reliability measures (e.g., Cronbach’s alpha) to estimate the correlation across replications (e.g., by using Equation A1 from Appendix A). However, for studies using a single measure, reliability must be computed from the raw data. Although making raw data publicly available is now encouraged by the majority of journals, this practice is still marginal in experimental psychology (Alsheikh-Ali, Qureshi, Al-Mallah, & Ioannidis, 2011; Wicherts, Borsboom, Kats, & Molenaar, 2006).
To estimate the correlation across replications for single-measure paradigms, we obtained the raw data from 15 simple cognitive experiments that used subsampling in their methodology. Some of these experiments were run by teams of which we were members (Boutet, Lemieux, Goulet, & Collin, 2017: 2 experiments; Cousineau & Shiffrin, 2004: 1 experiment; Goulet, 2015: 2 experiments). The data from 9 experiments were obtained after we contacted the authors (Brisson & Jolicoeur, 2007; Carrasco, Ponte, Rechea, & Sampedro, 1998; Fifić, Townsend, & Eidels, 2008; Lacroix, Giguère, & Larochelle, 2005; Miller, 2006; Palmeri, 1997; Reder & Ritter, 1992; Rickard, 1997; Strayer & Kramer, 1994). Finally, the data from 1 experiment were obtained from a Web repository (Visual Attention Lab, 2015).
These studies varied in sample size, in the number of replications per condition, and in the nature of the task; for each study, we estimated the correlation across replications from the raw data using the methods described in Appendix A.
Table 1. Summary of the Analysis of Experiments in Cognitive Psychology
Although these experiments had on average 20 participants (19.9, to be precise), they had the statistical power of experiments with an average of 5 times that number of participants if replications are taken into account (an effective sample size of about 100 participants, on average).
To further illustrate the gain in statistical power, we computed for each study the ratio of participants that this gain represented (n_1 : n_m, the effective number of participants divided by the actual number of participants tested).
Because standard errors are often proportional to the square root of the number of participants, increasing the number of effective participants by a factor of, say, 4 divides the effect size that can be detected by 2 (i.e., by √4).
We estimated that without replications, the average Cohen's d detectable by these studies would have been more than twice as large: Following the logic just given, a fivefold increase in effective participants divides the minimum detectable effect size by √5 ≈ 2.2.
In the extreme case, replications improved the effective number of participants by a factor of 14. In that experiment (Rickard, 1997), the correlation across replications was estimated to be low, so that each measurement was only slightly redundant with the others, a most desirable situation.
Although the average correlation across replications, .14, might not seem large, this value is close to what would typically be considered an indication of reliable data, given the number of replications. By inverting Equation A1, we can use this correlation to calculate Cronbach's alpha:

α = m r / (1 + (m − 1) r).
When a variable has a correlation of .14 across replications and the number of replications is 146, Cronbach’s alpha is .96, which is considered highly reliable.
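In code, this calculation is one line; the following Mathematica sketch (the function name CronbachAlpha is ours) reproduces the value just reported:

(* Cronbach's alpha from the correlation across replications (r) *)
(* and the number of replications (m), i.e., Equation A1 inverted *)
CronbachAlpha[r_, m_] := m*r/(1 + (m - 1)*r)

CronbachAlpha[0.14, 146] (* 0.9596, approximately .96 *)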
Recommendations
These results suggest that researchers should estimate the correlation across replications in their studies. In order to plan statistical power, the following steps can be used:
If the researcher has access to the raw data from previous studies, the correlation across replications can be estimated using one of the methods described in Appendix A.
If the researcher does not have access to raw data, pilot studies can be conducted, and the correlation across replications can be computed using the same method.
As a last resort, the researcher can estimate the correlation to be .14. However, we strongly encourage further research on this issue, as the correlation across replications might be affected by many sources of variation, such as experimental conditions, the nature of the task, and whether trials are blocked or in a random order.
When there are multiple conditions in an experiment, we recommend computing the average correlation across all conditions. In the case of unbalanced designs, we recommend using the smallest number of replications (the smallest m across conditions), which yields a conservative estimate of statistical power.
Whereas it is typical for reliability measures to be reported for studies involving questionnaires, reliability is almost never reported for single-measure paradigms. We invite researchers to report the correlation across replications more frequently. This measure not only affects statistical power but also can influence the length of error bars.
A potential problem can emerge when researchers are planning to follow the fairly common practice of eliminating from analysis trials on which the obtained measure is unrealistic (outliers) or the response is erroneous. This can result in a data set with fewer replications than planned for, and thus statistical power that is smaller than desired. To alleviate this problem, researchers can use previous data to estimate the number of trials that will be removed in each condition and add this number to the planned number of measurements. For example, if previous data show that about 1% of trials per condition are removed as outliers and that errors occur on about 8% of trials, a researcher planning on replicating a measure 20 times might instead replicate the measure 22 times (20 plus 9% of 20).
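This adjustment is easily scripted; here is a minimal Mathematica sketch using the exclusion rates of the example above (the function name PlannedReplications is ours):

(* Inflate the planned number of replications to compensate for trials *)
(* that will later be excluded as outliers or as errors *)
PlannedReplications[m_, pOutliers_, pErrors_] :=
 Ceiling[m*(1 + pOutliers + pErrors)]

PlannedReplications[20, 0.01, 0.08] (* Ceiling of 21.8, so plan 22 trials *)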
General Discussion
We have shown precisely how statistical power is increased by measuring participants multiple times under the same condition. Figure 1 summarizes the potential gain in participants by plotting the ratio of the effective number of participants to the actual number of participants (n_1 : n_m) as a function of the number of replications (m), for correlations across replications (r) ranging from .1 to .5.

Figure 1. Gain in effective participants (expressed as the ratio n_1 : n_m) as a function of the number of replications (m), shown for correlations across replications (r) of .1, .2, .3, .4, and .5.
Box 1. Mathematica Code Used to Generate the Plot in Figure 1
(* This function returns the effective number of participants *)
(* given the following parameters: n1, m and r *)
(* This is equation 4c/reversed *)
AdjustedN[n1_, m_, r_] := Ceiling[(n1 - 1)/((1/m) + (m - 1)/m*r) + 1]
(* This specifies the values of the parameters and *)
(* the simulated conditions *)
listofn1 = Range[5, 100];
listofm = Range[1, 100];
listofr = {0.1, 0.2, 0.3, 0.4, 0.5};
(* This simulates all the conditions, returning the ratio n_1 : n_m *)
data = Table[
{m, AdjustedN[n1, m, r]/n1},
{r, listofr}, {n1, listofn1}, {m, listofm}
];
(* This merges the first two dimensions *)
data = Apply[Join, data];
(*This generates the figure.*)
Figure1 = ListPlot[data,
(*The figure is customized with the following options:*)
Joined -> True, Axes -> False,
Frame -> {True, True, False, False},
FrameLabel -> {"Replications (m)", "Ratio (n_1 : n_m)"},
PlotStyle -> Darker[Gray],
FrameStyle -> Directive[Black, "Arial", 16]
]
Measuring a large number of participants is always the best and recommended scenario for any research. Replications cannot replace the added population representativeness that a participant brings. However, the sample size required to reach sufficient statistical power is sometimes unrealistic. For example, consider a group of researchers who initially think they will recruit two groups of 8 participants for a study (16 participants in total). They perform an a priori power analysis assuming a medium effect size and discover that to reach a power of 80%, they need 32 participants in each group (64 participants in total). Thus, 48 additional participants are required; the researchers need to recruit 4 times as many participants as initially planned.
Replications offer an alternative solution. Assuming a conservative cross-replication correlation of .20 (Table 1 suggests .14, which is more liberal), the researchers can change their design to ensure that they measure each of their 16 participants 21 times. In this study, measuring 16 participants 21 times has the same statistical power as measuring 64 participants only once. This approach is more realistic for this group of researchers. It also offers more precise and valid measurements of the participants and an increase in power compared with the initial plan of measuring 16 participants one time each.
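This scenario can be checked with the AdjustedN function defined in Box 1 above (the correlation is entered as the exact fraction 1/5 so that the arithmetic is exact):

AdjustedN[16, 21, 1/5] (* (16 - 1)/((1/21) + (20/21)*(1/5)) + 1 = 64 *)

That is, 16 participants measured 21 times have the effective sample size of 64 participants measured once.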
Should previous reviews' estimates of the statistical power of published psychological research be reconsidered?
The short answer to this question is no. The adjustment to power calculations that we have proposed in this article shows that it is possible to increase statistical power by replicating a measure multiple times in each condition, but we do not mean to say that participants can be traded for trials. Rather, we are saying that the desired sample size can be more obtainable if it is based on a consideration of the number of replications.
Over the years, there have been critiques questioning the lack of statistical power in psychological studies (Bezeau & Graves, 2001; Button et al., 2013; Sedlmeier & Gigerenzer, 1989). These articles reviewed statistical power in fields in which it is very difficult to replicate the measures. For example, in clinical psychology, it is often not convenient to repeat a full IQ test a second time. In the case of a depression inventory, participants might remember their previous responses and repeat them; with a perfect correlation, there is no benefit in replicating the measures. In social psychology research, questionnaires are already composed of repeated measures that are developed to optimize Cronbach's alpha with the fewest replications possible. Because there is little benefit derived from replications when the correlation across replications is high, taking replications into account would not meaningfully change the power estimates reported in these reviews.
In sum, we do not doubt the lack of statistical power reported in the influential review articles that we just cited, and that should remain a great concern. A systematic lack of statistical power can be harmful for cumulative science (Abraham & Russell, 2008; Szucs & Ioannidis, 2017). There are research areas in which replicating the measures is problematic, and in these areas, increasing sample size may be the only possible solution.
Why not analyze the replication effect?
An alternative to averaging measures across replications is to analyze them formally. In an ANOVA, for example, an additional replication variable with m levels could be included in the design, so that the effect of replication and its interactions with the other variables are estimated explicitly.
There are many reasons not to analyze the replication effect. First, replication has little theoretical importance. Second, the replication effect may be significant because it reflects training in the task, and thus it may interact with the effects of other variables, needlessly complicating the presentation of the results. Third, whether or not the replication effect is analyzed has no bearing on the other effects in linear models (as long as there are no missing data). Fourth, a common practice is to remove from analysis trials on which errors occurred. This can result in a considerable amount of missing data and many empty cells, complicating analyses further (if imputation is used) or reducing power (under pairwise deletion).
The present solution to increase statistical power contrasts with other questionable methods, such as running multiple statistical tests, adding participants until a significant result is found, and adding or removing covariates (Simmons, Nelson, & Simonsohn, 2011). These practices, known collectively as p-hacking, inflate the rate of false-positive findings. Planning replicated measures before data collection, by contrast, increases power without distorting the Type I error rate.
Appendix A: Estimating the Correlation Across Replications
The correlation across replications is a measure of the reliability of scores obtained multiple times. It indicates how much the measurement varies across these multiple instances (trials). A high correlation across replications means that individual participants show little variability. Given a data set containing replications, it is possible to estimate the correlation across replications in a variety of ways. Here we describe four approaches.
Appendix B: A Step-by-Step Approach
In this appendix, we provide a concrete example of a power analysis taking into account the number of replications and the correlation across replications. The objective is to compute the effective number of participants required to attain statistical power of 80%. In this hypothetical example, we compare the response times (RTs) for “same” and “different” responses in a same/different task (Bamber, 1969). This task consists of a series of trials in which participants view pairs of stimuli and report whether the members of each pair are the same or different. Responses in the “same” condition are typically faster than responses in the “different” condition, a robust finding known as the fast-same effect.
First, we estimate the effect size on the basis of unpublished results obtained in previous experiments conducted in our laboratory. At this stage, it is possible to compute the desired sample size if the participants are measured only once, and we do so with G*Power (Faul, Erdfelder, Lang, & Buchner, 2007). Second, we use the data from the same experiments to estimate the correlation across replications. Finally, we compute the required number of participants measured with replications using the values obtained in the previous steps. We use a spreadsheet to keep track of our calculations.
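These steps can also be scripted. The following Mathematica sketch uses purely hypothetical values (a required single-measure sample of n1 = 34 from G*Power, a pilot estimate of r = .15, and m = 50 planned replications); only the final formula comes from the present article:

(* Step 1: required sample size for a single-measure design (from G*Power) *)
n1 = 34; (* hypothetical value for a medium effect, alpha = .05, power = .80 *)
(* Step 2: correlation across replications, estimated from previous raw data *)
r = 0.15; (* hypothetical pilot estimate *)
(* Step 3: participants required when each is measured m times (Equation 4c) *)
m = 50;
Ceiling[(n1 - 1)*(1/m + (m - 1)/m*r) + 1] (* 7 participants *)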
Acknowledgements
We would like to thank Marisa Carrasco, Mario Fifić, Pierre Jolicoeur, Guy Lacroix, Thomas Palmeri, Lynne Reder, Timothy Rickard, and David Strayer for kindly agreeing to provide us with raw data from their studies. We would also like to thank Greg Francis, Jean-Christophe Goulet-Pelletier, Jesika Walker, and an anonymous reviewer for comments on an early version of the manuscript.
Action Editor
Frederick L. Oswald served as action editor for this article.
Author Contributions
M.-A. Goulet and D. Cousineau jointly generated the idea for the manuscript. D. Cousineau contacted researchers to access raw data from the studies included in the correlation analysis. M.-A. Goulet created the figures and table, wrote the code in the boxes, and wrote the appendices. Both authors wrote the main article, critically edited the manuscript, and approved the final submitted version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Funding
This research was supported in part by le Conseil pour la recherche en sciences naturelles et en génie du Canada.
Open Practices
Open Data: no
Open Materials: not applicable
Preregistration: no
