Abstract
In his response to our article, Sims suggests that increasing the size of educational randomized controlled trials (RCTs) is a feasible and affordable way to increase their informativeness. We doubt this.
In our article “Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned?” we demonstrated that a large proportion of contemporary randomized controlled trials (RCTs) are unable to detect the small effects typically produced by educational interventions. We also noted that—given how much larger trials would need to be—simply increasing their size seems an impractical solution to this problem. Sims suggests that we should have focused on the absolute weighted average of trials’ effect sizes, rather than their simple weighted average. Based on his reanalysis, Sims concludes that educational trials could affordably be powered to detect effects of 0.08 standard deviations (SDs). He estimates the cost of such a trial to be around $3 million, four to five times the cost of current trials.
We agree with Sims that the absolute weighted average is a more useful figure to consider in this context and are grateful to him for pointing this out. However, we doubt the implications he draws. There are three main reasons: (a) Sims’s figure for the mean absolute effect size is considerably higher than our estimate, (b) trials’ minimum detectable effect sizes (MDESs) should be lower than their mean absolute effect size, and (c) Sims’s cost estimates rest on the contentious assumption that a trial’s MDES and its cost are linearly related. We discuss each point in turn.
Sims’s Mean Absolute Effect Size Is Too High
Computing the weighted average absolute effect size from our data—available here: https://doi.org/10.6084/m9.figshare.c.4421087—produces a considerably lower estimate than Sims’s: 0.06 SDs rather than 0.08 SDs (a nontrivial difference: a properly powered independent-samples t-test [alpha: 0.05; power: 80%] requires 4,908 participants to detect an effect of 0.08 SDs, but 8,724 participants to detect an effect of 0.06 SDs). We suspect that this discrepancy stems from how trials with multiple outcomes (around half the trials in our dataset) are treated. To ensure a representative estimate and avoid violating statistical independence, our analysis was limited to a single randomly chosen outcome per trial, a common procedure (e.g., Lipsey & Wilson, 2001). In contrast, Sims prioritized mathematics-related outcomes when available, on the grounds that they tend to be measured more precisely. We question this decision: Prioritizing precisely measured outcomes biases the effect size estimate upward, since, for the same effect, a less noisy measure will typically yield a larger standardized effect size. Moreover, mathematics outcomes are relatively uncommon (present in 27% of trials in our dataset, compared to a figure of 63% for language-related outcomes), making Sims’s estimate somewhat unrepresentative of effects typically observed in educational trials.
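The sample-size figures above follow from the standard power formula for a two-sided independent-samples t-test: roughly 2(z_{1−α/2} + z_{1−β})²/d² participants per group to detect a standardized effect d. A minimal sketch using the normal approximation (the exact t-based calculation underlying the figures in the text yields totals a few participants larger):

```python
import math

# Critical values for alpha = 0.05 (two-sided) and power = 0.80
Z_ALPHA = 1.959964  # Phi^{-1}(0.975)
Z_BETA = 0.841621   # Phi^{-1}(0.80)

def total_n(d):
    """Total participants (both groups combined) needed to detect a
    standardized mean difference d with an independent-samples t-test,
    using the normal approximation to the power function."""
    n_per_group = 2 * ((Z_ALPHA + Z_BETA) / d) ** 2
    return 2 * math.ceil(n_per_group)

print(total_n(0.08))  # 4,906 (t-based calculation: 4,908)
print(total_n(0.06))  # 8,722 (t-based calculation: 8,724)
```

The near-doubling of the required sample for a 0.02-SD drop in the target effect illustrates why the gap between 0.08 and 0.06 SDs is far from trivial.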
Trials’ MDESs Should Be Lower Than Their Mean Absolute Effect Sizes
Perhaps more importantly, even if Sims’s estimate were accurate, trials’ MDESs should be lower than the mean absolute effect size across all trials. One reason is that educational trials are heterogeneous; they involve different interventions, age groups, and outcome measures (this is the rationale for treating trials as a random factor in both our and Sims’s analyses). Given this, powering a trial to the average effect size would mean appropriately powering only a subset of trials—those whose true effects are equal to or above this average. In other words, unless trials’ MDESs are set below their expected effect sizes, a large number of them will be underpowered (McShane & Böckenholt, 2014).
A further problem is that larger trials are likely to yield smaller effects. Sims suggests that to be properly powered, trials would require 280 schools, considerably more than existing norms (the median number of schools in Education Endowment Foundation [EEF] trials is 50). Should we expect these larger trials to generate effects similar to those currently typical? Larger trials often cannot be implemented with the same scrutiny and fidelity as smaller trials, which increases standard deviations and lowers effect sizes (e.g., Cheung & Slavin, 2016; Weisburd et al., 1993). For this reason, we should expect larger trials to produce smaller effects than those found in smaller trials, again suggesting that Sims’s MDES of 0.08 SDs is somewhat high.
The Relationship Between Trial Cost and MDES Is Nonlinear
Sims used existing trial data to estimate the cost of a trial with an MDES of 0.08 SDs. As he acknowledges, this prediction falls worryingly far outside the data range (see Figure 1 in Sims’s comment). Equally problematic, however, is the assumption that MDES and cost are linearly related. We doubt that this assumption is reasonable. Assuming linearity would mean, for example, that the cost of decreasing a trial’s MDES from 0.25 SDs to 0.20 SDs would be the same as decreasing it from 0.10 SDs to 0.05 SDs. This is surely wrong. The two scenarios require radically different numbers of additional participants (for an independent-samples t-test, 142 and 4,710 extra participants, respectively). The true relationship most likely mirrors that of MDES and sample size, where the cost of reducing a trial’s MDES by a fixed amount becomes increasingly large. If this is correct, then running trials powered to detect an effect size of 0.08 SDs would likely cost more than $3 million.
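The nonlinearity is a direct consequence of required sample size scaling with the inverse square of the MDES (n ∝ 1/d²): equal decrements in MDES demand rapidly growing numbers of additional participants. A sketch under the same normal-approximation power formula as above (alpha = 0.05 two-sided, power = 80%; exact t-based figures, as quoted in the text, differ by a participant or two):

```python
import math

Z = 1.959964 + 0.841621  # z_{0.975} + z_{0.80}

def n_per_group(d):
    """Participants per group for an independent-samples t-test to
    detect effect size d (normal approximation)."""
    return math.ceil(2 * (Z / d) ** 2)

# Cost of the same fixed 0.05-SD reduction in MDES at two points on the curve
print(n_per_group(0.20) - n_per_group(0.25))  # ~141 extra participants per group
print(n_per_group(0.05) - n_per_group(0.10))  # 4,710 extra participants per group
```

A linear cost model fitted to trials with MDESs near 0.20–0.25 SDs will therefore sharply understate the cost of reaching an MDES of 0.08 SDs.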
We are grateful to Sims for his comment and for prompting us to think about these issues further. However, we continue to believe that increasing trials’ sample sizes is unlikely to be an efficient way to reduce the worryingly large proportion of trials that are uninformative. Less costly means of increasing the power of trials, such as using more targeted outcome measures or focusing on specific subgroups, appear more promising.
