Abstract
A priori power analyses allow researchers to estimate the number of participants needed to detect the effects of an intervention. However, power analyses are only as valid as the parameter estimates used. One such parameter, the expected effect size, can vary greatly depending on several study characteristics, including the nature of the intervention, developer of the outcome measure, and age of the participants. Researchers should understand this variation when designing studies. Our meta-analysis examines the relationship between science education intervention effect sizes and a host of study characteristics, allowing primary researchers to access better estimates of effect sizes for a priori power analyses. The results of this meta-analysis also support programmatic decisions by setting realistic expectations about the typical magnitude of impacts for science education interventions.
In the past two decades, there has been extensive focus on how to calculate power for cluster-randomized trials, or CRTs (e.g., Bloom, 2005; Bloom, Bos, & Lee, 1999; Donner & Klar, 2000; Hedges & Rhoads, 2009; Konstantopoulos, 2008; Murray, 1998; Raudenbush, 1997; Raudenbush & Liu, 2000; Raudenbush, Martinez, & Spybrook, 2007; Schochet, 2008). Adequate statistical power for intervention studies helps researchers avoid making Type II errors—errors in which a researcher fails to detect an effect in a sample where that effect indeed exists in the population. Type II errors can occur when there is an insufficient number of participants in the study and/or the effect is smaller than expected. Underpowered studies can lead to inconclusive results that inhibit knowledge accumulation in a field, particularly when the same inconclusive findings are cited repeatedly.
Intervention research in science education is in its infancy in comparison to other fields such as mathematics and reading. Small studies and studies lacking comparison groups abound. In this study, fewer than 2% of the roughly 6,600 reports we screened from 2001 onward reported impacts from a design conducive to generating confident causal inferences and statistical conclusions (e.g., had a comparison group, included at least 60 participants). It follows that replications of rigorous impact studies are also rare (Makel & Plucker, 2014; Taylor et al., 2016).
While it is the case that meta-analyses can detect the existence of a significant overall effect from a set of nonsignificant impact estimates, this is contingent on whether there are sufficient studies of an intervention for this meta-analytic utility to come to bear. A lack of conclusive and unbiased findings from primary studies can severely inhibit knowledge accumulation in science education. With so few studies of intervention impacts, the science education field needs more than replications and meta-analyses. Science education needs more sufficiently powered primary studies of program impacts. Once strong causal impact studies with promising effects emerge, replications and meta-analyses will be able to advance the science education field even further.
In this paper, we present the results of a meta-analysis designed to support a priori power analyses in science education research as well as policy or programmatic decisions about intervention effects more broadly. Primary researchers can use the effect size estimates generated by our work in combination with recently published estimates for the intraclass correlation (ICC) and covariate correlation (e.g., Spybrook, Westine, & Taylor, 2016; Westine, Spybrook, & Taylor, 2013) to inform the designs of their own causal impact studies in science education. Policymakers and other decision makers can use our estimates to develop realistic expectations about the types of effects to expect from interventions with specified characteristics. We note here, however, that the most reliable sources of information about the effectiveness of a given intervention are impact estimates from prior studies of that intervention and/or meta-analyses of effects from that or similar interventions. The parameter estimates from this study are meant to refine effect size expectations beyond these primary sources of evidence or provide general guidance in the absence of any extant effect size information.
Although we focus here solely on the effects of science education interventions, this meta-analysis has characteristics that make its findings informative to researchers and decision makers in other education disciplines. This study examines the relationship between effect size magnitude and key study characteristics, including students’ grade level, design of the study, outcome measure type, and intervention focus. Researchers synthesizing intervention studies outside of science education have found noteworthy variation associated with these very same study characteristics (Cheung & Slavin, 2016; Hill, Bloom, Black, & Lipsey, 2008). As such, we assert that our findings are unique yet complementary to those of syntheses outside of science education.
Researchers often use a priori power analyses to estimate the number of participants needed in a study. In cluster randomized trials, accurate a priori power analyses rely on having accurate estimates of the three key design parameters mentioned previously: the ICC (a measure of the between-cluster variance as a fraction of the total variance), the extent to which covariates can account for variation in the outcome, and the estimated effect size. Any a priori power analysis is only as accurate as the design parameter estimates. If any of the design parameter estimates are inaccurate, then too many or too few subjects may be recruited, resulting in higher than necessary costs or an underpowered trial.
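To make the role of these parameters concrete, one standard formulation of the minimum detectable effect size (MDES) for a two-level cluster-randomized trial (e.g., Bloom, 2005) is shown below; the exact expression appropriate for a given study depends on the number of levels, where randomization occurs, and where covariates enter.

$$\mathrm{MDES} = M_{J-g^{*}-2}\,\sqrt{\frac{\rho\,(1-R_{2}^{2})}{P(1-P)\,J}\;+\;\frac{(1-\rho)\,(1-R_{1}^{2})}{P(1-P)\,J\,n}}$$

Here J is the number of clusters, n the number of students per cluster, P the proportion of clusters assigned to treatment, ρ the ICC, R²₂ and R²₁ the proportions of cluster- and student-level outcome variance explained by covariates, g* the number of cluster-level covariates, and M ≈ t_{α/2} + t_{1−β} the multiplier for the chosen significance level and power. Overestimating the expected effect size shrinks the number of clusters apparently required to reach a target MDES, which is how inaccurate effect size expectations produce underpowered trials.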
Recognizing the importance of accurate power analysis parameter estimates and finding none in the science education literature, BSCS Science Learning and Western Michigan University began a joint project to provide empirical estimates of these design parameters. Spybrook et al. (2016), Westine et al. (2013), and Westine (2016) are products of this collaboration and examined the ICC and the variance explained by covariates. This manuscript complements that prior work by examining effect sizes from intervention research in science education and does so in a way that extends the scope and approach used in prior synthesis efforts, most notably, the two syntheses of inquiry-based science instruction (Furtak, Seidel, Iverson, & Briggs, 2012; Minner, Levy, & Century, 2010) and the more broadly focused syntheses of elementary science interventions (Slavin, Lake, Hanley, & Thurston, 2014) and secondary science interventions (Cheung, Slavin, Kim, & Lake, 2016). Our study extends the scope of the two syntheses of inquiry instruction by examining a much broader array of science education interventions. In the latter two syntheses, researchers extracted one effect size per study and excluded studies with researcher-developed outcome measures out of a concern about overalignment of outcome to treatment. Our screening and analytic approach contrasts with that of Cheung et al. (2016) and Slavin et al. (2014): we modeled the effect of test developer (possible overalignment) rather than using it to exclude studies, and we extracted multiple effect sizes per study while accounting for the dependency among those effects.
The primary research question of this meta-analysis is: What is the relationship between the magnitude of the intervention effects and key study characteristics? The study characteristics of interest included the design (randomized studies compared to matched quasi-experimental studies), whether the outcome measure was developed by the study authors, who receives the intervention (e.g., students only, teachers only, both students and teachers), the science discipline targeted by the intervention, the treatment provider’s role (e.g., researcher or teacher), and the grade level of the students. Together, the suite of papers provides essential information for study designers in science education to conduct a priori power analyses.
Method
Eligibility Criteria
Eligible interventions were those implemented in either formal education settings or education lab research settings with students in primary and secondary schools. We included only studies published in English because we lacked capacity for translation. We included an array of science achievement outcome measures, including content knowledge, use of scientific practices, and outcomes related to understanding of the nature of science. Eligible interventions were school-based or lab-based interventions of any duration whose efficacy was reported between 2001 and 2014. We selected 2001 as the start of our collection of studies because it was the year of the passage of the No Child Left Behind Act—an act that called attention to the need for experimental intervention research in education and established funding mechanisms for conducting such studies.
We define lab-based interventions as those delivered to students at a university or other research site (e.g., nonprofit site). Interventions in museums only met eligibility requirements if the instruction was formalized. For example, a study of formal lecture demonstrations to classes of students at a museum was included, but studies of free-choice learning at a museum were not included. We included no studies from homeschool settings.
Eligible interventions included curriculum programs (including computer software activities for students to use), professional development programs with student-level outcomes, and the use of specific instructional approaches or teaching strategies. Eligible comparison conditions included actual control groups (no intervention), “business-as-usual” (BaU) comparison groups (extant programs or practices), or a sham treatment (e.g., watching a movie that was not expected to be particularly beneficial). We did not include alternative interventions as eligible comparison groups. We excluded these treatment-treatment studies because their effect sizes would likely be smaller than, and thus not comparable to, effect sizes from treatment-control, treatment-BaU, or treatment–sham treatment studies. To be eligible, the study design had to include at least two groups whose outcomes could be compared. We included studies in which group assignment was determined randomly (person-randomized or cluster randomized) or nonrandomly (e.g., quasi-experiment). Quasi-experiments were only included when they had clear matching on pretests prior to assignment to treatment or comparison conditions. To limit risk of bias, we required studies to have at least 30 students in each of the treatment and comparison groups. This decision was informed by the work of Turner, Bird, and Higgins (2013), who concluded that in meta-analyses with at least two well-powered studies (as this one has), underpowered studies contribute little additional insight. This finding is further supported by more recent work associating small studies with increased heterogeneity (IntHout, Ioannidis, Borm, & Goeman, 2015) and risk of bias (Afshari & Wetterslev, 2015).
In summary, each study in the ultimate set of eligible studies met all the following criteria:
based either in a school or an educational research lab setting
included either primary or secondary students
published in English
included at least one student achievement outcome
published between 2001 and 2014
studied a specific science education intervention
included at least two groups that could be compared on the outcomes (i.e., treatment-control, treatment–business as usual, treatment–sham treatment)
included a pretest or other measure to estimate baseline equivalence
included random assignment or matching on pretest for quasi-experiments
included at least 30 students in each treatment or comparison group
focused on a general population of students (e.g., did not have an explicit focus on students with learning disabilities).
Information Sources and Search Strategies
The study used several information sources to locate qualifying studies: Ulrich’s Web (ulrichsweb.serialssolutions.com), the Web of Science database (www.isiknowledge.com), and additional databases used for our grey literature search (described below).
Search strategies for published studies
We used Ulrich’s Web (the online version of Ulrich’s Periodicals Directory) to identify the set of peer-reviewed journals most likely to publish experimental and quasi-experimental studies in science education.
To search within these journals, we created a search string for use in Web of Science intended to capture a wide range of experimental and quasi-experimental studies in science education. The search string was: [TS= (intervention OR control OR treat* OR Experiment* OR Quasi* OR Effect* OR Compar* OR Trial OR Efficacy OR Random OR Assign*) AND PY = (2001 OR … 2014) AND SO= (“Journal of Research in Science Teaching” OR…)], with the additional 22 journals not shown here. We chose to do a journal-specific search because our search terms were so broad that our original open searches were returning hundreds of thousands of abstracts. We followed our initial journal-specific search for published studies by examining the references of included studies. This reference harvest led to studies from a total of 71 journals (see journal list in online Supplemental Appendix S2).
Search strategy for unpublished studies (grey literature)
Our grey literature search included a search string employed within databases of unpublished reports and dissertations.
Data Collection and Coding
Researchers coding the full text of manuscripts used a FileMaker Pro database. For each study coded, we obtained a portable document format (PDF) file and embedded a link to the file directly into the database for coding and archival purposes. Coders could highlight the PDFs and make comments. Training involved collaborative coding tasks to establish coding norms and independent coding tasks to estimate intercoder reliability. Across all codes, the average percentage agreement for independent coding was 83%. The coding team held weekly meetings to ask questions and resolve issues that had arisen during the previous week. When discrepancies arose, the PIs made final coding decisions. The database was hosted on a server so that all team members could access it simultaneously.
Key information about the characteristics of a study or the statistical information needed to extract effect sizes was often missing from published reports. In these situations, coders contacted study authors directly to inquire about the missing information. Specifically, we queried 59 study authors for various types of information, including intervention dosage, demographic information, details about the instruments used to measure study outcomes, timing of the posttest, the type of assignment to groups (e.g., random assignment), and the means and standard deviations used in analyses. Most often, requests were for more specific information about the frequency or duration of the intervention (dosage), requested in 24% of the queries, and demographic information disaggregated by treatment group, requested in 20% of the queries. Nineteen percent of queried authors responded, sharing important information relevant to coding the studies. In the remaining instances, authors no longer had access to the data or did not respond to our query.
Variables coded
In addition to the bibliography and eligibility tables, there were four tables in the database: header (data related to the study as a whole), groups (data about each treatment and comparison group in the study), dependent variables (data about each outcome variable in the study), and effect sizes (data about group means, standard deviations, observed sample sizes, and/or other statistical information used to estimate effect sizes).
Choice and coding of moderators
Our choices of effect size moderators to test were influenced by our desire to provide evidence that either corroborates or challenges findings in the extant literature. For example, Hill et al. (2008) examined the extent to which the nature of the outcome measure (broadly focused standardized test vs. specialized topical test) and the grade level of the students (elementary school vs. middle school vs. high school) influenced effect size magnitudes from multiple disciplines, finding larger effect sizes for specialized topical tests and middle school interventions. Extending these analyses using a multiple regression approach (meta-regression), Cheung and Slavin (2016) also tested the effects of outcome measure type and grade level, finding larger effect sizes in studies with researcher-made measures and studies of elementary school students. Additionally, Cheung and Slavin (2016) tested the effects of study design (RCT vs. QED). The approach in the current study sought to build on these prior efforts.
We primarily used binary codes but in one case used a set of related indicator (dummy) codes.
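As an illustration of this coding scheme (not the authors' actual variables), the snippet below shows how a three-category moderator can be represented as a set of related indicator (dummy) codes; which moderator received the dummy set is not specified above, so the "who receives the intervention" variable is used purely as a hypothetical example.

```r
# Hypothetical illustration of indicator (dummy) coding for a
# three-category moderator; variable and level names are not the authors'.
recipients <- factor(c("students", "teachers", "both", "students"),
                     levels = c("students", "teachers", "both"))
model.matrix(~ recipients)  # intercept plus two 0/1 indicator columns
```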
Inclusion of multiple effect sizes
In traditional meta-analyses, each study contributes one effect size and the effect size estimates are independent across studies. However, researchers frequently report multiple effects. In some instances, researchers might use multiple outcome measures with the same individuals (e.g., one assessment might measure students’ understanding of science content, and another might measure students’ understanding of scientific practices such as the development of explanations). In other instances, multiple outcome measures arise from testing the same students at multiple timepoints (posttest and delayed posttest models fall in this category). Some studies have two or three treatment groups and each treatment group is compared to the same comparison group.
Effect size calculation
The 636 effect sizes (Hedges’ g) were standardized mean differences computed from the group means, standard deviations, and sample sizes (or other statistical information) extracted from each study.
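As a point of reference, the following sketch shows the conventional computation of Hedges’ g and its sampling variance from group summary statistics; the authors’ exact formulas (including any additional adjustments) are not reproduced in the text above, so this should be read as the generic calculation rather than the study’s precise procedure.

```r
# Generic Hedges' g (small-sample-corrected standardized mean difference)
# from group means, standard deviations, and sample sizes.
hedges_g <- function(m_t, m_c, sd_t, sd_c, n_t, n_c) {
  sd_pooled <- sqrt(((n_t - 1) * sd_t^2 + (n_c - 1) * sd_c^2) /
                      (n_t + n_c - 2))
  d <- (m_t - m_c) / sd_pooled                # Cohen's d
  J <- 1 - 3 / (4 * (n_t + n_c - 2) - 1)      # small-sample correction factor
  g <- J * d
  v <- J^2 * ((n_t + n_c) / (n_t * n_c) +     # sampling variance of g
                d^2 / (2 * (n_t + n_c)))
  c(g = g, v = v)
}

hedges_g(m_t = 78, m_c = 72, sd_t = 12, sd_c = 13, n_t = 45, n_c = 48)
```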
Frequently, studies using cluster-randomized or cluster quasi-experimental designs failed to use appropriate analyses. That is, the study used cluster assignment (e.g., entire classes or schools were assigned to treatment or comparison conditions), but the analyses did not account for clustering (analyses were conducted as if individual students had been assigned to conditions). In such circumstances, the reported sample size overstates the effective sample size, and the standard errors are underestimated. Higgins and Green (2011) describe how meta-analysts can adjust for such mismatched analyses by calculating a reduced, effective sample size for the study. Equation 1 performs this adjustment:

N_effective = N / [1 + (m − 1) × ICC]   (1)

where N is the reported number of participants in a group, m is the average cluster size, ICC is the intraclass correlation coefficient, and the denominator 1 + (m − 1) × ICC is the design effect.
If the number of clusters for a given group was 1 (e.g., cluster assignment with one treatment class and one comparison class), no adjustment is made. This design feature occurred in 15 studies, and we report results of sensitivity analyses in online Table S3 that compare results with these studies included versus omitted. In addition, we did not adjust the number of participants for studies in which individual students were assigned to the treatment or comparison condition.
After we adjusted the number of participants based on cluster size, we Winsorized the effective sample sizes (N) computed in Equation 1. We used the Winsorized values of N in all subsequent calculations.
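The following sketch illustrates both steps: computing effective sample sizes from Equation 1 and then Winsorizing them. The ICC used for the adjustment and the percentile at which the authors Winsorized are not given in the text above, so the value 0.20 and the 95th percentile are hypothetical placeholders, and only upper-tail capping is shown.

```r
# Effective sample size via the design effect (Equation 1), then
# Winsorizing; the icc value and the Winsorizing percentile are hypothetical.
effective_n <- function(n, avg_cluster_size, icc) {
  n / (1 + (avg_cluster_size - 1) * icc)      # Higgins & Green (2011) adjustment
}

n_eff <- effective_n(n = c(120, 300, 2500), avg_cluster_size = 25, icc = 0.20)

cap   <- quantile(n_eff, 0.95)                # hypothetical upper cutoff
n_win <- pmin(n_eff, cap)                     # cap unusually large values
n_win
```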
Statistical model
Our data included 636 effect sizes across 96 studies (an average of 6.6 effect sizes per study). Effect sizes within studies are correlated, and this dependence needs to be accounted for in the analysis. As information on the correlation between these effect sizes was not reported in primary studies, we used a method that was robust to misspecification of the correlation structure. For this reason, we used weighted least squares to estimate the meta-regression model and adjusted the standard errors for dependence within studies through use of robust variance estimation (RVE; Hedges, Tipton, & Johnson, 2010). Unlike standard model-based methods such as multivariate meta-analysis (Jackson, Riley, & White, 2011), RVE does not require the correlation structure to be correctly specified when calculating standard errors and hypothesis tests; instead, it estimates the standard errors empirically using a sandwich-estimator (for a tutorial, see Tanner-Smith, Tipton, & Polanin, 2016).
Additionally, we used small-sample corrections to RVE (Tipton, 2015; Tipton & Pustejovsky, 2015). These small-sample corrections involve the specification of a “working model” for the correlation structure (here we used a “correlated effects” model) and the use of adjusted standard errors with Satterthwaite-type degrees of freedom.
Working model and weighting
In most studies in this meta-analysis, the effect sizes were dependent because they were measured on the same individuals. For this reason, we assumed the “correlated effects” model as our working model in RVE. We also used this model to define approximately inverse-variance weights, which are a function of the number of effect sizes contributed by each study, the average within-study sampling variance, and the estimated between-study variance.
Meta-regression
Our meta-regression model for the primary analysis was estimated using an R package implementing RVE with the correlated-effects working model and the small-sample corrections described previously.
We grand mean centered all variables in the regression. As a result, the intercept is an estimate of the average effect size at the grand mean of all predictors. We assessed the statistical significance of each estimated regression coefficient (β) by dividing the estimate by its robust standard error. The resulting test statistic is compared with critical values from Student’s t distribution, with degrees of freedom estimated via the Satterthwaite approximation used in the small-sample corrections (Tipton, 2015).
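A minimal sketch of such a model fit is shown below. The manuscript names an R package, but the name is not preserved in the text above, so the robumeta package is assumed here purely for illustration; the data frame, moderator names, and column names are likewise hypothetical. In robumeta, modelweights = "CORR" requests the correlated-effects working model, rho is the assumed within-study correlation, and small = TRUE applies the Tipton (2015) small-sample corrections.

```r
# Sketch of an RVE meta-regression with correlated-effects weights.
# Package choice, data frame, and variable names are assumptions.
library(robumeta)

# Moderators would be grand mean centered before fitting, per the text.
fit <- robu(
  g ~ rct + researcher_developed + students_and_teachers +
      math_intensive + secondary + teacher_delivered,
  data         = es_data,      # one row per effect size
  studynum     = study_id,     # clusters effect sizes within studies
  var.eff.size = v_g,          # sampling variance of each Hedges' g
  modelweights = "CORR",       # correlated-effects working model
  rho          = 0.8,          # assumed within-study correlation (package default)
  small        = TRUE          # small-sample corrections (Tipton, 2015)
)

print(fit)  # coefficients, robust SEs, Satterthwaite-type dfs, p-values
```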
Missing data handling
We were unable to determine the treatment provider’s role (teacher or other) for 16 effect sizes across four studies. Rather than lose the data completely, we conducted a single imputation for these values and ran the meta-regression on the resulting complete data set.
Results
Study Selection
The abstract screen culled the list to 1,174 unique studies that underwent a full-text screen. The full-text screen reduced the number of eligible studies to 96, and these studies came from a total of 21 different countries: United States (43), Turkey (15), Israel (7), Taiwan (6), Canada (3), Germany (3), Nigeria (3), Jordan (2), Kenya (2), Brazil (1), China (1), England (1), Finland (1), Greece (1), India (1), Korea (1), Netherlands (1), New Zealand (1), Singapore (1), Slovenia (1), and Switzerland (1). Figure 1 identifies the number of studies at each stage of our process. In this figure, a “record” refers to a single citation from one database. Because we used multiple databases, sometimes different databases located the same record. We identified 6,622 records through database searching (raw number, including duplicates), along with 30 additional records through our grey literature search. After excluding 15 duplicates, the combined total was 6,637 unique records. We use the term “article” to refer to a unique report and the term “study” to refer to a unique research investigation, which may be described in more than one article.

Figure 1. Number of records, articles, and studies identified and retained at each stage of the selection process.
It was essential that coders not include effect sizes for the same outcomes from the same participants more than once. This problem can arise when authors publish more than one article from the same research study. We searched author names across all 1,286 articles to look for articles that appeared to report findings from the same participants. We combined multiple articles into a single “study” and coded at the study level. The process of linking related articles into unique studies brought our number of studies down to 1,174. Of these, a full-text eligibility screen (using information from all linked articles for each study) yielded 96 studies that met our full inclusion criteria.
We excluded studies during the full-text screen for a variety of reasons. Many abstracts were sufficiently vague that we included the associated articles in the full-text screen for no reason other than there was insufficient information in the abstract to make an eligibility determination. For example, an abstract might refer to “impacts of an intervention” but not mention the existence or absence of a comparison group. Other abstracts might refer to “students” without clarifying (in the title, abstract, or keywords) whether they were primary, secondary, or undergraduate students.
Occasionally, study coders continued coding beyond identifying a reason for excluding a study. This occurred particularly when coders were unsure of a disqualifying coding decision and sought consultation with the larger team before excluding a study. Thus, some studies had multiple reasons listed for their exclusion while others had just one. Table 1 provides a summary of the reasons studies were excluded. However, without further coding of all disqualifying features of every study, the percentages should not be interpreted as complete; in fact, they likely underestimate the number of studies lacking a characteristic.
As the table shows, the two most common reasons for excluding a study related to study design. Twenty-two percent of the excluded studies did not include an eligible comparison group (this includes studies that had no comparison group as well as those that had an alternative treatment comparison group). We excluded 21% of studies because the student sample size was too small (fewer than 30 students in either the treatment or comparison group). A substantial portion (14.7%) was excluded because no measures of baseline equivalence were reported. We included studies provided that baseline measures were reported, but we did not use the magnitude of the baseline difference as an exclusion criterion.
Table 1. Reasons Studies Were Excluded From Meta-Analysis
Overall Statistics
Table 2 shows the mean posttest effect size, standard deviation, and study sample size for each category of study design or intervention characteristic. The weighted average pretest effect size was −0.01 standard deviations, suggesting that although we did not disqualify studies based on the magnitude of baseline differences, this decision does not appear to have introduced substantial bias into the impact estimates.
Table 2. Mean Effect Size by Study Design or Intervention Characteristic
In addition, we report several other statistics of importance. The intercept from the meta-regression (0.489) represents the estimated average effect size at the grand mean of all predictors.
The overall average effect, however, is not particularly useful when there is heterogeneity in effect sizes across studies. To understand the true variation in study-average effect sizes, a prediction interval is useful. In this study, the 95% prediction interval for study-average effect sizes is [−0.393, 1.371], indicating that while the average study produced a positive effect, in some studies the true effect was negative (and in others, the effect was positive and much larger). This interval is based on the fact that the variation in effect sizes across studies was high.
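For readers who wish to reproduce this kind of interval, one common construction combines the estimated between-study variance with the uncertainty in the average effect, as shown below; whether the authors used exactly this variant (including this choice of degrees of freedom) is not specified in the text above.

$$\hat{\mu} \;\pm\; t_{df}\,\sqrt{\hat{\tau}^{2} + \widehat{\mathrm{SE}}(\hat{\mu})^{2}}$$

where μ̂ is the estimated average effect size, τ̂² is the estimated between-study variance in true effects, SE(μ̂) is the (robust) standard error of the average, and t_df is the appropriate critical value.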
Study and Intervention Characteristics as Moderators
Moderator effects from the meta-regression
Table 3 shows the parameter estimates from our meta-regression using RVE with correlated effects weights, conducted using the meta-regression model described previously. Note that sensitivity analyses for the effect of Winsorizing (vs. not Winsorizing) the effect sizes indicated no difference in any parameter estimate to three decimal places. Sensitivity analyses were also conducted with regard to including studies with a single cluster per treatment condition. Two noteworthy differences emerged, and these results are reported and discussed in Table S3 online.
Table 3. Results of Robust Variance Estimation Meta-Regression Using 96 Studies With 292 Effect Sizes
The interpretation of the meta-regression coefficients (slopes) is conceptually similar to any multiple regression using binary indicators: each regression coefficient is a covariate-adjusted difference in mean effect size between groups of effect sizes that differ on the target characteristic. For example, the regression coefficient of −0.004 for RCT represents the model-based estimate of the difference in average effect size for RCTs compared to QEDs. That is, controlling for all covariates, RCT effect sizes are estimated to be 0.004 standard deviations smaller than those of matched QEDs, on average. Similarly, the effects from researcher-developed assessments were estimated to be 0.258 standard deviations larger, on average, than effects from assessments that were not developed by the researchers.
Also notable (though not statistically significant) is that interventions with components for both students and teachers produced higher effect sizes than those targeting students only, with an adjusted difference of 0.149 standard deviations. When an intervention was conducted in a science discipline that tends to be more mathematical, the adjusted effect sizes were slightly smaller (by 0.044 standard deviations) than those in science disciplines that are less mathematical. Interventions for secondary students (Grades 9–12) showed slightly higher effects than those for students in primary and lower secondary (K–8) grades, with an adjusted difference of 0.053 standard deviations. Finally, when teachers delivered an intervention, the effects were nearly identical (adjusted difference = 0.005 standard deviations) to those observed when the intervention was delivered by someone other than the classroom teacher (e.g., a researcher).
Using the Meta-Regression Parameter Estimates in A Priori Power Analyses
When conducting a priori power analyses, a study designer should first consider whether there are existing meta-analyses or effect sizes from isolated primary studies of the same or similar interventions. In the absence of such information, we propose use of our meta-regression parameter estimates to arrive at an empirically based effect size estimate. The least precise approach would be to use the intercept estimate, which provides the weighted overall average of all 292 effect sizes in the meta-analysis. We do not anticipate that this would be appropriate in most cases, as most study designers will have information on at least a subset of the study and/or outcome characteristics that can be used to adjust the overall mean effect size estimate. Optimally, a study designer would have information on all variables and make the eight corresponding adjustments to the grand mean, based on the magnitude of the meta-regression coefficients in Table 3. Yet another variation that leverages existing impact information would be to use an extant impact estimate or summary effect from a meta-analysis in place of the intercept from our model and then make the corresponding adjustments using our meta-regression coefficients.
To facilitate accurate and convenient computation of predicted effect sizes based on selected study/outcome characteristics, we developed a Web-based application that uses the results generated by the R analyses described previously.
Example: Using meta-regression estimates to estimate an effect size
Here we present an example of how science education researchers may use our meta-regression results in a power analysis. Assume a researcher has developed a high school (Grades 9–12) science intervention and wants an empirically based estimate of the effect size to expect in an efficacy study. Following the procedure described previously, the researcher would adjust the grand mean effect size using the meta-regression coefficients in Table 3 that correspond to the planned study’s characteristics (see the sketch below).
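A simple arithmetic sketch of this adjustment is shown below, using the point estimates reported in this article (the intercept and the coefficients from Table 3 quoted above). The study profile chosen here (an RCT, secondary grades, teacher delivered, intervention components for both students and teachers, outcome not developed by the researchers) is hypothetical, and because the predictors were grand mean centered, the Web application’s prediction will differ somewhat from this simple additive approximation.

```r
# Illustrative adjustment of the grand mean effect size (0.489) using
# the meta-regression coefficients reported in the text; the study
# profile below is hypothetical.
coefs <- c(intercept          =  0.489,
           rct                = -0.004,
           researcher_measure =  0.258,
           students_teachers  =  0.149,
           math_intensive     = -0.044,
           secondary          =  0.053,
           teacher_delivered  =  0.005)

profile <- c("rct", "secondary", "teacher_delivered", "students_teachers")

predicted_es <- unname(coefs["intercept"] + sum(coefs[profile]))
predicted_es  # about 0.69 for this hypothetical profile
```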
In addition to reporting an estimated effect size, the online application also indicates the precision of this estimate. A 95% confidence interval is computed for each expected effect size based on the set of observed covariate values. These intervals are calculated as the predicted effect size plus or minus the appropriate critical value from Student’s t distribution multiplied by the standard error of the prediction, where both the degrees of freedom (η) and the standard error are estimated using the small-sample adjustments to RVE described previously (Tipton & Pustejovsky, 2015).
Using findings from Westine et al. (2013), study planners can obtain a reasonable estimate of the school-level intraclass correlation for 10th-grade outcomes (ICC = 0.196) and of the variance explained by a school-level pretest covariate. Together with the expected effect size, these parameters can then be entered into an a priori power analysis for a cluster-randomized design (see the sketch below).
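Continuing the example, the sketch below combines a predicted effect size with the ICC reported above to find the number of schools needed in a balanced two-level cluster-randomized trial, using the standard MDES formula (e.g., Bloom, 2005). The target effect size, the school-level covariate R², and the number of students per school are hypothetical placeholders (the R² value from Westine et al. is not preserved in the text above).

```r
# MDES for a two-level CRT (schools randomized), then a search for the
# smallest number of schools J whose MDES falls below a target effect size.
mdes_crt2 <- function(J, n, icc, R2_2 = 0, R2_1 = 0, P = 0.5,
                      alpha = 0.05, power = 0.80, K = 1) {
  df <- J - K - 2                                  # school-level df
  M  <- qt(1 - alpha / 2, df) + qt(power, df)      # multiplier
  M * sqrt(icc * (1 - R2_2) / (P * (1 - P) * J) +
           (1 - icc) * (1 - R2_1) / (P * (1 - P) * J * n))
}

target_es <- 0.69    # hypothetical predicted effect size (see example above)
icc       <- 0.196   # school-level ICC for 10th-grade outcomes (Westine et al., 2013)
R2_2      <- 0.50    # hypothetical variance explained by a school-level pretest

J <- 4                                             # keep J even for balanced assignment
while (mdes_crt2(J, n = 60, icc = icc, R2_2 = R2_2) > target_es) J <- J + 2
J  # 10 schools under these hypothetical values
```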
Discussion
Comparisons to Extant Research
The average effect size (the intercept)
The grand mean effect size we estimated for this study, 0.489, is larger than any of the sample size weighted mean effect sizes reported in recent synthesis studies of science education interventions. Two recent studies are particularly relevant to the present study. The first, Slavin et al. (2014), synthesized a total of 23 effect sizes for elementary school science interventions, finding the following summary effects for three intervention categories: 0.02
The effect of study design
The effect of study design was very small but consistent with recent work in this area. Although we observed a smaller effect of design than that observed by Cheung and Slavin (2016), the direction of the effects is the same, with both studies estimating smaller effects for randomized designs.
The effect of bundled interventions
The results of this analysis suggest a positive effect of developing bundled interventions that provide products and/or services for both teachers and students. Unfortunately, this tentative result cannot be corroborated by the two recent syntheses by Slavin and colleagues because they categorized interventions differently than we did in the present study.
The effect of science discipline type
The effect of discipline type was quite small, and the possibility that this effect is spurious is too high to support a confident claim. Further study is needed around whether a noteworthy effect of discipline exists in the effect size population.
The effect of who develops the outcome measure
The finding that stands out dramatically is the positive relationship between the use of researcher-developed outcome assessments and the magnitude of the treatment effect. This relationship can result from overalignment of outcome measures to treatments, from insensitivity of standardized measures to treatment effects, or from both. We cautiously assert that the primary source of this relationship in our data is likely the tendency of broadly focused standardized assessments to be insensitive to treatment effects. In our coding of each study’s methodological approach, we found only a few instances of treatment-outcome overalignment, but we acknowledge that assessments of overalignment can be subjective and that no clear definition exists.
The effect of students’ grade level
Slavin et al. (2014) did not report an overall summary effect for the 23 effect sizes in their synthesis of elementary school science interventions, nor did Cheung et al. (2016) report the like for the 21 effects of secondary school science interventions in their synthesis. However, both studies reported the individual study effects and sample sizes necessary to compute overall summary effects by grade span (elementary vs. secondary). Using this information, we conducted a random effects meta-analysis with student sample size weighting, finding for elementary school interventions a weighted summary effect of 0.33
It would appear that the findings of the present study, where the weighted average of effect sizes for secondary school interventions is higher than for elementary school interventions, diverge from that of the prior syntheses. However, this appears to be an artifact of how the grade intervals were coded in the present study (K–8, 9–12) as opposed to the other syntheses. For example, the grade intervals in the Slavin et al. (2014) and Cheung et al. (2016) studies were K–5 and 6–12, respectively. When we disaggregate our effect sizes into these new grade intervals, we see a similar effect of grade level, with the weighted average effect size for interventions in the K–5 grade interval estimated at 0.09 standard deviations greater than the weighted average effect size for interventions in the Grades 6–12 interval. This is largely consistent with the mean effect size difference across these grade intervals from the Slavin et al. and Cheung et al. work.
The effect of who delivers the intervention
The effect of who delivers the intervention was also quite small and inconclusive. Although further study is needed to assess whether an effect of treatment provider exists in the effect size population, the lack of a clear effect challenges conventional wisdom in education that larger effects will be observed (all else equal) when the intervention developer also delivers the intervention (e.g., proof of concept or efficacy studies) as opposed to when a nondeveloper implements the intervention (e.g., scale-up studies).
Limitations
This study had several notable limitations. First, across the set of eligible studies, there was substantial imbalance across the categories of several nominal variables. This required us to dichotomize some planned moderators to achieve better balance and statistical power or, in some cases, to eliminate a moderator altogether.
Omitted moderators included participant characteristics such as percentage minority, a contrast for lab- versus school-based intervention, and intervention duration. Percentage minority could not be used due to insufficient reporting in general and by treatment condition in particular. The lab- versus school-based intervention contrast could not be used due to extreme imbalance (4 lab-based vs. 92 school-based interventions). Outcome type could not be used as a variable because too few effect sizes for affective outcomes existed. The moderator for comparison group type suffered the same fate, as only 5 studies used a no-intervention comparison group, while 91 used a business-as-usual counterfactual. Finally, intervention duration could not be used because the data were extremely skewed and the resulting meta-regression degrees of freedom were less than 4, the minimum cutoff for trustworthy results suggested in the small-sample RVE literature (Tipton, 2015).
The planned moderator that required a collapse into binary categories was
Some unbalanced moderators remained (particularly
Implications
For Power Analyses
The field of intervention research in science education is in its infancy. After searching over 6,600 abstracts, we found just 96 that met our full inclusion criteria. The lack of a wide research base means that few intervention researchers have data from a comparison group study on which to base their own power analyses for future work. Even when they do, those results may be based on design characteristics that differ in important ways from what a researcher might propose in a larger efficacy trial. Specifically, as Lipsey and Wilson (2001) have suggested, pre-post (one group) effect sizes are incongruent with comparative (two group) effect sizes, although a conversion can be made between the two if the pre-post correlation is known (Borenstein, 2009). As we have shown, overestimating an effect size can lead to drastic underestimation of required sample sizes. When designing studies, a researcher can (and should) use relevant pilot effect sizes, provided that they map to the design of their planned study. Alternatively, a researcher would be wise to use meta-analytic findings (should they exist) for interventions that are a close match to their own. Barring that, the use of data from our set of studies would allow researchers who might otherwise have little information on which to base effect sizes for a priori power analyses to make empirically based decisions in this important stage of study design. At a minimum, researchers will have some idea of whether their proposed effect size might be conservative or optimistic based on what we have seen in the science education literature. Using results from our meta-analysis, intervention researchers in science education will be better able to design studies of causal impacts. Better-designed studies are more likely to be funded and published. We propose that an important implication of our work will be an improvement in the quality and quantity of impact studies in science education.
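For completeness, one common form of the conversion referenced above (following Borenstein, 2009) re-expresses a pre-post (change score) effect on the within-group metric used by two-group designs, provided the pre-post correlation r is known:

$$S_{\text{within}} = \frac{S_{\text{diff}}}{\sqrt{2(1-r)}}, \qquad d = \frac{\bar{X}_{\text{post}} - \bar{X}_{\text{pre}}}{S_{\text{within}}}$$

where S_diff is the standard deviation of the change scores; the accompanying variance formulas are given in Borenstein (2009).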
Further, the use of adequately powered intervention studies in science education has importance for the field (beyond what it means for individual researchers). The tendency of the field to discount studies with nonsignificant findings makes adequate power all the more important: without sufficient power, a nonsignificant result cannot distinguish a true absence of effect from a Type II error.
For Programmatic Decisions
Over and above its utility for study designers (i.e., power analyses), the results of this study help establish an effect size “landscape” for science education. The prediction interval described previously establishes a range of plausible values for individual effect sizes, indicating that for interventions like those in this study, 95% of the effect sizes are likely to fall between −0.393 and 1.371 standard deviations. In terms of central tendency, merely interpreting the intercept estimate (0.489) as the typical effect of a science education intervention would obscure the substantial differences in expected effects associated with study and intervention characteristics, such as who developed the outcome measure.
For Synthesis Methods
To date, most of the oft-cited work on benchmarks for effect size estimates has used a univariate approach to aggregating and summarizing average effect sizes (e.g., Cheung et al., 2016; Hill et al., 2008; Slavin et al., 2014). Considering our findings, we join recent calls to decrease the use of this approach (Polanin & Pigott, 2015). Although the adjusted mean differences from our meta-regression coefficients generally mirror the raw differences in mean effect sizes from the descriptive statistics table (Table 2), both in magnitude and direction, important differences remain.
For Future Research
In this age of evolving technology, science education interventions are becoming more transportable to other formats, contexts, and learning environments. As such, parallel work is desperately needed for online interventions and those implemented in informal settings (e.g., museums, science centers). A second key charge for the future is to conduct similar research in other disciplines using the techniques of this study. Specifically, we advocate for the use of meta-regression to estimate mean effect size differences across categories while controlling for other influential moderators. Until this work is conducted by the field using comparable techniques, accumulating knowledge about effect size moderation will be challenging.
Until then, our current challenge is to encourage study designers to use the information provided here to more precisely estimate required sample sizes for science education intervention studies. Doing so will decrease the likelihood that study designers will over- or under-recruit schools, teachers, and students, conserving the precious human and financial resources the field needs to continue an agenda of rigorous intervention research.
Supplemental Material
Supplemental material for this article, “Investigating Science Education Effect Sizes: Implications for Power Analyses and Programmatic Decisions,” is available online.
Acknowledgements
We gratefully acknowledge the assistance of Karen M. Askinas, S. Josh Carson, Lila Goldstein, Jeff Hoover, Brendan Martin, Alexandra Monzon, Jennifer Nichols, Ran Shi, and Jay Wade. We also acknowledge the assistance of Zhipeng Hou for his work on the power calculator. This research was supported by a grant from the National Science Foundation (1118555). The opinions are those of the authors and do not reflect the views of the National Science Foundation.
Authors’ Note
Taylor and Kowalski are co-equal first authors.
Authors
JOSEPH A. TAYLOR is principal scientist at BSCS Science Learning. His research focuses on the effectiveness of science education interventions, optimal design of and reporting from intervention studies, and issues around knowledge accumulation from STEM education research.
SUSAN M. KOWALSKI is a senior research scientist at BSCS Science Learning. Her research encompasses two major strands: meta-analyses of science education research and research on curriculum and professional development programs for middle and high school teachers and students.
JOSHUA R. POLANIN is a principal researcher at the American Institutes for Research. Dr. Polanin’s research focuses on improving the methods of meta-analysis through pragmatic approaches to data management and analyses.
KAREN ASKINAS is a research associate at BSCS Science Learning. She has developed and maintained large data sets for meta-analyses and for teacher and student assessments in science education research projects.
MOLLY A. M. STUHLSATZ is a research scientist at BSCS Science Learning. Her recent research focuses on the impact of leadership development on district-level student outcomes in science, investigating the effectiveness of videocase-based teacher professional development, and examining the feasibility of using computer scoring models to score open-ended assessment items.
CHRISTOPHER D. WILSON is a senior research scientist and director of research at BSCS Science Learning. His research focuses on the assessment of teacher and student learning in science education, the impact of lesson analysis based professional development, and the application of automated scoring techniques to the measurement of teacher PCK and student argumentation.
ELIZABETH TIPTON is an assistant professor of applied statistics in the Human Development Department at Teachers College, Columbia University. Her research focuses on the development of methods for improving generalizations from cluster randomized and multi-site experiments—including improved site selection and recruitment strategies—and on methods for estimation of treatment impacts in these studies and through meta-analysis.
SANDRA JO WILSON is a principal associate at Abt Associates, Inc. Her recent research focuses on risk factors associated with school failure, the prevention of high school dropout, parenting and family-based interventions, and the effectiveness of educational interventions and practices for the What Works Clearinghouse.
References