Abstract
Despite being among the fastest growing segments of the student population, English Language Learners (ELLs) have yet to attain the same academic success as their English-proficient peers, particularly in science. In an effort to support the pedagogical needs of this group, educators have been urged to adopt inquiry approaches to science instruction. Whereas inquiry instruction has been shown to improve science outcomes for non-ELLs, systematic evidence in support of its effectiveness with ELLs has yet to be established. The current meta-analysis summarizes the effect of inquiry instruction on the science achievement of ELLs in elementary school. Although an analysis of 26 articles confirmed that inquiry instruction produced significantly greater impacts on measures of science achievement for ELLs compared to direct instruction, there was still a differential learning effect suggesting greater efficacy for non-ELLs compared to ELLs. Contextual factors that moderate these effects are identified and discussed.
The Next Generation Science Standards (NGSS; Achieve, 2012) reflect the importance of introducing children to scientific and engineering practices early to prepare them for STEM careers. The NGSS framework emphasizes the use of rich content and practices that refine and deepen science inquiry in ways that go beyond the use of hands-on, constructivist approaches to science instruction (Achieve, 2012). However, implementing these standards in elementary school presents unique challenges to educators who must increasingly teach complex concepts and reasoning to English Language Learners (ELLs), or students who have yet to fully develop proficiency in English (Saunders & Marcelletti, 2013). Furthermore, engaging in science and engineering practices is language intensive for all students and for ELL students in particular (Lee, Quinn, & Valdés, 2013). Although the population of ELL students has increased substantially in recent years, their achievement in science has not (Maerten-Rivera, Myers, Lee, & Penfield, 2010; National Center for Education Statistics [NCES], 2014).
To better support the pedagogical needs of this growing population, educators have been encouraged to adopt inquiry-based approaches based on the premise that hands-on instruction makes science learning more engaging, concrete, and meaningful (Janzen, 2008; National Research Council [NRC], 2012; Roseberry & Warren, 2008). Whereas inquiry-based instruction has been shown to improve the science achievement of English-proficient (or non-ELL) students (Furtak, Seidel, Iverson, & Briggs, 2012), consensus regarding both the effectiveness and appropriateness of this approach for ELL students has yet to be established. Substantive differences in the linguistic backgrounds, academic experiences, and pedagogical needs of ELL and non-ELL students have led to disagreement regarding the benefits of inquiry-based approaches for linguistically diverse students. Thus, we seek to conduct a meta-analysis examining the effectiveness of inquiry-based instruction for ELL students. We begin with a brief overview of ELL students’ performance in science, the rationale behind teaching ELL students with inquiry-based instruction, and the promises and challenges associated with its application.
The Need for Effective Science Instruction: Underachievement of ELL Students in Science
ELL students’ educational attainment has received growing attention due to persistently low achievement in general and in STEM in particular (Bravo & Cervetti, 2014; Diamond, Maerten-Rivera, Rohrer, & Lee, 2014; Lara-Alecio et al., 2012). Despite increased resources to enhance STEM education, ELL students have yet to attain the same level of academic success as their English-proficient peers (Lee & Buxton, 2013; Maerten-Rivera et al., 2010; Tong, Irby, Lara-Alecio, & Koch, 2014). For example, ELL students consistently score lower on the science portion of the National Assessment of Educational Progress (NAEP) at all grade levels and are more likely to score below basic (NCES, 2014). These findings indicate that ELL students are in need of greater support in STEM education as compared to their non-ELL peers (Genesee, Lindholm-Leary, Saunders, & Christian, 2005; Goldenberg, 2013).
ELL students’ low achievement in science may be attributed in part to their limited proficiency in English and weak mastery of academic language (Kieffer, Lesaux, Rivera, & Francis, 2009). Scientific texts are linguistically complex, informationally dense, and highly technical (Echevarria, Richards-Tutor, Canges, & Francis, 2011; Fang, 2006). The linguistic complexity of scientific texts can impede meaningful learning for ELL students by interrupting information processing and conceptual understanding (Fang, 2006; Janzen, 2008). Thus, ELL students’ science learning may be constrained by their proficiency in English (Lee, 2005).
Potential Benefits of Learning Science With Inquiry Instruction
One view is that ELL students learn best when instruction is situated within meaningful, interactive activities that leverage the language and cultural backgrounds of students (Bravo & Cervetti, 2014; Echevarria et al., 2011). Inquiry instruction is grounded in the constructivist principle that meaningful learning occurs when students engage in authentic activities that promote active knowledge construction through self-guided exploration (Bruner, 1996; Lee, 2005). Students are encouraged to construct knowledge by posing questions about the natural world, test theories through carefully planned investigations, and draw conclusions based on empirical results (Bruner, 1996). Thus, teachers facilitate meaningful dialogue, experimentation, and engagement (Minner, Levy, & Century, 2010). Inquiry instruction is often contrasted with traditional approaches, such as direct instruction, which aim to build factual knowledge through explicit exposition and highly structured teacher guidance (Kirschner, Sweller, & Clark, 2006). Although direct instruction is commonly used, inquiry learning has been found to improve students’ attitudes toward science (Jiang & McComas, 2015), enhance problem-solving skills (Lazonder & Harmsen, 2016), and increase learning outcomes (Alfieri, Brooks, Aldrich, & Tenenbaum, 2011).
Although most examinations of inquiry instruction to date involve non-ELL students, its benefits are assumed to generalize to ELL students in a number of ways. First, inquiry instruction’s use of engaging, multisensory activities is assumed to increase ELL students’ access to scientific content by reducing the demands of scientific language (Janzen, 2008). Second, its multimodal nature encourages physical and cognitive engagement to support deeper levels of learning (Huerta & Jackson, 2010). Third, inquiry instruction encourages ELL students to communicate their understanding of scientific concepts and procedures, which may promote their oral and written language skills (August, Branum-Martin, Cardenas-Hagan, & Francis, 2009). Finally, the collaborative nature of inquiry instruction is thought to promote rich learning experiences for ELL students that foster both conceptual knowledge and scientific communication (Lee & Buxton, 2013).
Concerns Regarding the Effectiveness of Inquiry Instruction for ELL Students
Despite its potential benefits, there remain concerns and contradictory findings regarding the effectiveness of inquiry instruction with ELL students. First, ELL students may lack sufficient English proficiency to benefit fully from inquiry instruction (August et al., 2009; Bresser & Fargason, 2013; Huerta, Tong, Irby, & Lara-Alecio, 2016). Despite using multimodal approaches to pedagogy, inquiry instruction still has heavy linguistic demands, requiring students to generate predictions, communicate their findings, and engage in meaningful scientific discourse. However, many ELL students are still developing the very language skills critical for active participation and building understanding of the content. Thus, the provision of more hands-on, active learning opportunities may not sufficiently address the linguistic challenges faced by ELL students in the science classroom (August et al., 2009; Bravo & Cervetti, 2014; Lee, Deaktor, Enders, & Lambert, 2008).
Second, the assumption that inquiry instruction is more effective than traditional methods has also been challenged (e.g., Kirschner et al., 2006; Tobias & Duffy, 2009). The hands-on, self-guided exploration characteristic of inquiry may not provide sufficient instructional guidance and structure to facilitate meaningful learning and transfer (Mayer, 2004). Although hands-on instruction may provide students with salient, highly contextualized learning experiences, inquiry instruction may not provide enough of a framework to enable students to represent scientific principles and understanding more abstractly and generalize what they have learned to new contexts.
Finally, the benefits of inquiry instruction may be limited to students who already have sufficient prior knowledge to support exploratory learning (Kirschner et al., 2006; Klahr & Nigam, 2004). Because ELL students’ access to quality instruction is often limited by English-only instruction, tracking into remedial classes, and attending English support services at the exclusion of content-area instruction (Robinson-Cimpian, Thompson, & Umansky, 2016), they may lack the academic preparation to fully benefit from inquiry instruction. Thus, the effects of inquiry instruction for ELL students require greater examination.
Factors That May Influence the Effectiveness of Inquiry Instruction
First, from a developmental perspective, there are compelling reasons to expect that the effectiveness of inquiry-based instruction may differ on the basis of student grade level (Meyer, 2000). As ELL students progress from first grade and beyond, they build their knowledge base in science, proficiency in English, and metacognitive abilities, all of which contribute to higher learning and achievement. Consequently, inquiry-based instruction may be more advantageous for older ELL students who, compared to their younger counterparts, are more likely to have the requisite skills and knowledge to meet the demands of learning science with inquiry-based instruction. On the other hand, the increasingly rigorous academic and linguistic demands associated with science inquiry in higher grade levels might overburden older ELL students and result in diminished learning (Tolbert, Stoddart, Lyon, & Solis, 2014).
Second, the effectiveness of inquiry instruction may be influenced by factors such as teacher preparation and instructional time. Many elementary school teachers report they have been inadequately prepared to teach ELL students science (Cervetti, Kulikowich, & Bravo, 2015; Zwiep & Straits, 2013). However, teachers’ instructional skills and pedagogical knowledge have been shown to have a significant impact on students’ science achievement (Heller, Daehler, Wong, Shinohara, & Miratrix, 2012). Professional development has been found to improve the delivery of inquiry instruction by raising teachers’ pedagogical knowledge and understanding of ELL students’ learning needs (Yoon, Duncan, Lee, Scarloss, & Shapley, 2007).
Third, inquiry instruction requires heavy investments in instructional time. There is considerable variation in the amount of class time devoted to inquiry instruction, which may also influence its effectiveness for ELL students (Baker, Fabrega, Galindo, & Mishook, 2004; Dorph, Shields, Tiffany-Morales, Hartry, & McCaffrey, 2011). Thus, our meta-analysis considers professional development and instructional time in a moderation analysis.
Prior Reviews of Inquiry-Based Instruction for ELL and Non-ELL Students
Several narrative reviews summarizing the prevailing state of knowledge on effective teaching approaches with ELL students provide initial support for the use and effectiveness of inquiry-based instruction with ELL students. Lee (2005) performed a systematic review of research on the science education (K–12) of ELL students and found that hands-on, inquiry-based instruction was generally associated with positive achievement outcomes among all students, including those with lower levels of English proficiency and prior science experience. More recently, Janzen’s (2008) narrative review on content-area instruction in science with ELL students found similar evidence suggesting that inquiry-based instruction led to improvements in both ELL students’ language development and science achievement. Although these reviews offer a useful summary of research on the effectiveness of inquiry-based instruction with ELL students, they use qualitative rather than quantitative methods, do not provide effect size estimates, and are no longer current.
Three more recent meta-analyses comparing the effectiveness of inquiry-based instruction with direct instruction support the advantage of inquiry-based instruction. First, Alfieri et al.’s (2011) meta-analysis contrasted the effectiveness of direct instruction to both guided and unguided forms of inquiry-based instruction. They found that inquiry-based instruction produced greater achievement outcomes in science than direct instruction (d = .11). Similarly, Furtak et al.’s (2012) meta-analysis found that inquiry-based instruction resulted in significantly greater learning outcomes (d = .50). Finally, Lazonder and Harmsen (2016) showed that guided forms of inquiry instruction produced a positive effect on students’ science content knowledge (d = .50) and ability to perform inquiry (d = .66). Although these meta-analyses provide evidence suggesting that inquiry-based instruction can be an effective method of learning for students as compared with traditional instruction, they were based on studies conducted primarily with mainstream English-proficient students, and thus, their results may not generalize to ELL students.
Present Study
Previous syntheses of research have concluded that inquiry-based instruction is a particularly effective approach for improving students’ science achievement outcomes. However, to our knowledge, no comprehensive quantitative synthesis to date has explicitly evaluated changes in ELL students’ science achievement as a result of receiving inquiry instruction. To this end, we conducted a meta-analysis to determine the extent to which inquiry instruction serves ELL students’ science achievement, addressing the following questions:
Research Question 1: Is inquiry-based science instruction an effective method of teaching for ELL students relative to direct instruction?
Research Question 2: Does inquiry science instruction provide comparable learning benefits to ELL students relative to their English-proficient peers?
Research Question 3: What types of factors, if any, moderate the impact of inquiry instruction on science achievement outcomes for ELL students?
Method
Selection of Studies and Data Collection
Inclusion criteria
We developed selection criteria that would capture empirical studies designed to evaluate the impact of inquiry instruction on science achievement for ELL students. Both published and unpublished studies were eligible to be included as long as they (a) used an experimental or quasi-experimental research design, (b) provided data for ELL students between kindergarten and sixth grade, (c) included a treatment group that received inquiry instruction and either a business-as-usual control group receiving direct instruction or a non-ELL student comparison group, (d) assessed the effects of inquiry instruction on students’ science learning outcomes and reported these effects quantitatively, (e) provided sufficient data to calculate effect sizes, and (f) were either published in or translated into English. To limit sample bias, studies that focused exclusively on students who were reclassified as fluent English proficient (i.e., former ELLs) were excluded from this meta-analysis. Furthermore, studies that combined results for ELL and non-ELL students or elementary and non-elementary school students such that effect sizes could not be extracted independently for each subsample were also excluded.
Search procedure
A comprehensive and systematic search was conducted using ERIC, PsycINFO, and Google Scholar, with the search terms science, instruction, education, teaching, K–6, methods, English as a second language, English language learner, limited English proficient, inquiry, discovery, hands-on, and projects strategies. The search was restricted to studies published from 2000 to 2016. To identify unpublished studies in ERIC and Google Scholar, we modified the search parameters to include dissertations, theses, and conference proceedings. We also submitted our selected articles to both forward and backward searches. Forward searches were carried out by searching for articles that cited studies meeting our search criteria, while backward searches were conducted by manually reviewing the reference sections of each paper for additional studies that matched our search criteria. Studies identified in literature reviews and prior syntheses were also reviewed for inclusion. Finally, we contacted authors of the included studies to solicit other published or unpublished studies that may be relevant to this meta-analysis.
Study selection
This search procedure returned over 5,000 potentially relevant articles. Using the selection criteria established previously, we examined the title, abstract, and keywords of each article. Studies that met the most fundamental aspects of the selection criteria—that is, whether or not a study investigated the effect of (a) inquiry-based instruction on the (b) science achievement of (c) ELL students—were flagged for potential inclusion and saved for a second review. When abstracts did not provide adequate information for eligibility judgments, the full text of the article was obtained and screened for potential inclusion using the aforementioned search criteria. If multiple reports of the same study were identified (e.g., dissertation/thesis, journal article), they were grouped together and cross-referenced for complete information, and the most comprehensive study was retained. Based on this first round of the literature search, 32 articles were flagged as potentially relevant.
In the second round of reviews, we evaluated each article in greater detail. Six studies were excluded because they lacked an eligible treatment/comparison group or science achievement measure or did not provide sufficient information to calculate effect sizes. Studies with missing effect size information were excluded only if we could not obtain the data to estimate effect sizes after requesting them from the corresponding authors. Disagreements regarding whether to include a study were discussed by the research team until consensus was reached. Overall, this selection procedure yielded a total of 26 studies for inclusion in the meta-analysis. Figure 1 summarizes the study search procedure and selection criteria.

Figure 1. Flow diagram of study selection procedure and selection criteria.
Coding of studies
First, students were classified based on their English proficiency (Saunders & Marcelletti, 2013). Students who were non-native speakers of English with limited English proficiency were coded as ELL, and native English speakers and language-minority students proficient in English were coded as non-ELL. Instruction involving hands-on, self-guided learning tasks requiring students to construct science knowledge using questions and investigations was coded as inquiry instruction (Bruner, 1996; Furtak et al., 2012). Explicit instruction using highly structured lectures, demonstrations, textbooks, or other teacher-centered methods was coded as direct instruction (Alfieri et al., 2011; Mayer, 2004). Finally, science achievement outcomes were coded if they quantitatively assessed changes in students’ performance on measures of conceptual, factual, or procedural knowledge (Minner et al., 2010).
Implementation variables were coded for moderation analysis. We coded the length of the intervention as the number of months of intervention. Student grade level was coded from kindergarten through sixth grade. Professional development training was coded as 1 when it was provided and 0 when no training was provided. When professional development was provided, the duration in hours was coded, and the dosage was categorized as small if fewer than 15 hours were provided or large if 15 or more hours were provided. The focus of the training regime was considered ELL focused when training addressed the needs or instruction of ELL students. It was coded as non-ELL focused if training was not specific to the needs of ELL students, such as addressing science pedagogy in general.
The methodological features of each study were coded based on its publication status (published journal article vs. unpublished dissertation/technical report), research design (randomized experiment vs. quasi-experiment), measurement design (pretest and posttest vs. posttest only), assessment format (multiple choice vs. constructed response), and assessment type (researcher-developed test vs. standardized test).
Coding was conducted by the first author using a standardized coding protocol developed in advance by the research team (available on request from the authors). To ensure reliability and accuracy, all studies were double-coded by the second author. Interrater reliability was established by calculating the percentage of coding decisions on which the two coders agreed, which yielded a high percent agreement of 93.6%. Coding discrepancies were discussed as a group until consensus was reached.
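The percent agreement statistic can be illustrated with a short sketch. The function name and the example codes below are hypothetical and are not drawn from the coding protocol or the coded data; the sketch only assumes that agreement is the share of coding decisions on which the two coders match.

```python
def percent_agreement(codes_a, codes_b):
    """Share of coding decisions on which two coders agree, as a percentage."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must rate the same set of items")
    if not codes_a:
        raise ValueError("at least one coded item is required")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


# Hypothetical codes for five study features (illustrative only).
coder_1 = ["published", "quasi", "pre-post", "multiple-choice", "standardized"]
coder_2 = ["published", "quasi", "post-only", "multiple-choice", "standardized"]
agreement = percent_agreement(coder_1, coder_2)
print(f"Percent agreement: {agreement:.1f}%")
```

Percent agreement does not correct for chance agreement, which is one reason some syntheses also report a chance-corrected index.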
Meta-Analytic Procedures
Evaluating the effects of inquiry instruction for ELL students
We derived three separate meta-analytic effect size (ES) estimates based on standardized mean differences. First, we evaluated the effectiveness of inquiry instruction, or treatment ES, using the standardized mean difference in science achievement outcomes between ELL students who learned with inquiry instruction (treatment condition) and ELL students who learned with traditional instruction (control condition). Positive values for treatment ES indicate that ELL students in the treatment condition outperformed those in the control condition.
Examining the effects of inquiry instruction between ELL and non-ELL students
The second analysis examined whether inquiry instruction had similar benefits for ELL and non-ELL students. We calculated the inquiry ES using the standardized mean difference in learning outcomes between ELL and non-ELL students who received inquiry instruction within studies that reported data for both groups. Positive values for inquiry ES indicate that ELL students showed greater gains in inquiry instruction than non-ELL students.
To contextualize the inquiry ES findings, we estimated the effect size for traditional science instruction (traditional ES) using the standardized mean difference in learning outcomes between ELL and non-ELL students who received traditional instruction within studies that reported data for both groups. Positive values for traditional ES indicate that ELL students showed greater gains with traditional instruction than non-ELL students. Studies that reported information to estimate one effect size but not another were included in all analyses for which sufficient information was provided.
Computation of effect sizes
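The standardized mean difference (Cohen’s d) was calculated according to the data each study reported. First, when only posttest data were available, the standardized mean difference was calculated as:
$$d = \frac{\bar{X}_T - \bar{X}_C}{S_{\text{pooled}}},$$
where $\bar{X}_T$ and $\bar{X}_C$ are the posttest means of the two groups being compared (shown here as treatment, $T$, and control, $C$) and
$$S_{\text{pooled}} = \sqrt{\frac{(n_T - 1)S_T^2 + (n_C - 1)S_C^2}{n_T + n_C - 2}},$$
where $n_T$ and $n_C$ are the group sample sizes and $S_T$ and $S_C$ are the group standard deviations.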
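When pretest and posttest data were reported for both groups, pretest-adjusted estimates of the standardized mean difference were calculated as:
$$d = \frac{(\bar{X}_{T,\text{post}} - \bar{X}_{T,\text{pre}}) - (\bar{X}_{C,\text{post}} - \bar{X}_{C,\text{pre}})}{S_{\text{pooled}}},$$
where $S_{\text{pooled}}$ is the pooled standard deviation of the two groups.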
When dichotomous data were reported (e.g., proportion of students who attained proficiency on a standardized test), we converted the log odds ratio of successes between groups into the standardized mean difference using the transformation procedure outlined by Borenstein, Hedges, Higgins, and Rothstein (2009).
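As a minimal sketch of this conversion, the snippet below assumes the logistic-distribution transformation given by Borenstein et al. (2009), in which the log odds ratio is multiplied by √3/π. The function names and the example proportions are hypothetical and are not taken from any included study.

```python
import math


def log_odds_ratio(p_group1, p_group2):
    """Log odds ratio of success for group 1 relative to group 2."""
    for p in (p_group1, p_group2):
        if not 0 < p < 1:
            raise ValueError("proportions must lie strictly between 0 and 1")
    odds_1 = p_group1 / (1 - p_group1)
    odds_2 = p_group2 / (1 - p_group2)
    return math.log(odds_1 / odds_2)


def d_from_log_odds_ratio(log_or):
    """Convert a log odds ratio to a standardized mean difference."""
    return log_or * math.sqrt(3) / math.pi


# Illustrative values: 60% of one group and 50% of the other reach proficiency.
example_d = d_from_log_odds_ratio(log_odds_ratio(0.60, 0.50))
print(round(example_d, 3))
```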
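If means and standard deviations were missing but regression coefficients were reported, effect sizes were approximated using the t statistic corresponding to the null hypothesis of independent group differences between the treatment and control condition (Borenstein et al., 2009; Lipsey & Wilson, 2001):
$$d = t\sqrt{\frac{n_T + n_C}{n_T n_C}},$$
where $t$ is the test statistic for the group difference and $n_T$ and $n_C$ are the sample sizes of the two groups.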
Finally, to adjust for upward bias in Cohen’s d associated with small samples (n < 20), all effect sizes were transformed into Hedges’ g using the small-sample correction factor proposed by Hedges (1982):
$$g = J \times d,$$
where
$$J = 1 - \frac{3}{4(n_T + n_C - 2) - 1}.$$
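The effect size computations described in this section can be sketched in a few lines. The sketch assumes the conventional textbook forms (pooled standard deviation, the t-to-d conversion, and the small-sample correction 1 − 3/(4df − 1)); the function names and example values are hypothetical, and the pretest-adjusted variant is omitted because its exact form depends on which standard deviation is used as the standardizer.

```python
import math


def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(
        ((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2)
    )


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Posttest-only standardized mean difference (group 1 minus group 2)."""
    return (mean_1 - mean_2) / pooled_sd(sd_1, n_1, sd_2, n_2)


def d_from_t(t_stat, n_1, n_2):
    """Approximate d from an independent-groups t statistic."""
    return t_stat * math.sqrt((n_1 + n_2) / (n_1 * n_2))


def hedges_g(d, n_1, n_2):
    """Apply the small-sample correction to Cohen's d."""
    df = n_1 + n_2 - 2
    correction = 1 - 3 / (4 * df - 1)
    return correction * d


# Illustrative values only: two groups of 15 students each.
d_example = cohens_d(55.0, 10.0, 15, 50.0, 10.0, 15)
g_example = hedges_g(d_example, 15, 15)
print(round(d_example, 3), round(g_example, 3))
```

With group 1 as the treatment (or ELL) group, a positive value indicates that group 1 outperformed group 2, matching the sign convention used for the three effect sizes.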
Dependent effect sizes
To resolve statistical dependence among studies reporting multiple outcomes for the same group of students, we report the mean effect size for all outcomes to yield a single effect size per study. Similarly, for longitudinal studies involving the same cohort of students, effect sizes were collapsed together to yield a single average effect size per study. We report the mean effect size for multiple treatment groups when they were compared to a single control group. However, effect sizes generated from two or more different subgroups (i.e., grade levels, cohorts of students, or treatments) within a study such that each subgroup was accompanied with its own distinct comparison group were treated as independent (Borenstein et al., 2009). This made it possible for multiple effect sizes to be extracted from a single study. These procedures ensured that each effect size was estimated based on an independent set of data and that each analysis was conducted with an independent set of effect sizes.
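The aggregation rule can be sketched as follows. The study labels and effect sizes are hypothetical, and the sketch assumes a simple unweighted mean within each independent unit; subgroups that have their own comparison group would be given their own key so that they remain separate.

```python
from collections import defaultdict
from statistics import mean


def average_within_study(effect_sizes):
    """Collapse (unit_id, g) pairs to one mean effect size per independent unit."""
    grouped = defaultdict(list)
    for unit_id, g in effect_sizes:
        grouped[unit_id].append(g)
    return {unit_id: mean(values) for unit_id, values in grouped.items()}


# Hypothetical input: Study A reports two outcomes for the same students,
# while Study B reports two grade levels, each with its own comparison group.
outcomes = [
    ("Study A", 0.20),
    ("Study A", 0.40),
    ("Study B (Grade 3)", 0.10),
    ("Study B (Grade 5)", 0.50),
]
per_unit = average_within_study(outcomes)
print(per_unit)
```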
Data synthesis
To estimate the overall effect size, studies were assigned weights based on their level of precision (i.e., standard error). Because the effects of inquiry instruction on science achievement outcomes were assumed to vary among studies as a function of population, intervention, and methodological differences, we used random effects models to calculate the overall weighted mean effect size ($\bar{g}$):
$$\bar{g} = \frac{\sum_{i} w_i g_i}{\sum_{i} w_i},$$
where $g_i$ is the observed effect size for study $i$ and $w_i$ is the inverse-variance weight assigned to study $i$.
Heterogeneity of effect sizes
We used the Q test of heterogeneity to examine the variation in effect size estimates between studies (Lipsey & Wilson, 2001). In addition, the I² statistic quantifies the percentage of variation attributable to true heterogeneity relative to sampling error (Higgins, Thompson, Deeks, & Altman, 2003). I² values range from 0% to 100%, with increasing values reflecting greater levels of heterogeneity.
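A minimal sketch of the synthesis and heterogeneity statistics is given below. It assumes the DerSimonian-Laird moment estimator for the between-study variance (τ²), which is one common choice; the estimator actually used is not stated here, and the function name and example inputs are hypothetical.

```python
import math


def random_effects_summary(effects, variances):
    """Random-effects mean with Q, I-squared, and a DerSimonian-Laird tau-squared."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")

    # Fixed-effect (inverse-variance) weights are used to compute Q.
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * gi for wi, gi in zip(w, effects)) / sum_w
    q = sum(wi * (gi - fixed_mean) ** 2 for wi, gi in zip(w, effects))
    df = k - 1

    # Between-study variance, truncated at zero.
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Percentage of total variation attributed to true heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau-squared to each sampling variance.
    w_star = [1 / (v + tau2) for v in variances]
    mean_re = sum(wi * gi for wi, gi in zip(w_star, effects)) / sum(w_star)
    se_re = math.sqrt(1 / sum(w_star))
    return {"mean": mean_re, "se": se_re, "q": q, "df": df, "tau2": tau2, "i2": i2}


# Illustrative inputs only: two studies with very different effects.
summary = random_effects_summary([0.0, 1.0], [0.01, 0.01])
print(summary)
```

Q is referred to a chi-square distribution with k − 1 degrees of freedom, so a value well above its degrees of freedom signals more variation than sampling error alone would produce.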
Moderation analysis
When there was significant heterogeneity across studies, we conducted moderation analyses to examine whether variation among effect sizes could be explained by factors that differ between studies. For categorical variables, we performed a Q test of between-group differences (Q_B) using the one-way ANOVA function in Comprehensive Meta-Analysis (CMA) software. For continuous variables, we tested the relation between a moderator and magnitude of effect size using CMA’s unrestricted maximum likelihood meta-regression function. All moderation analyses were conducted using random effects models weighted by the inverse variance of effect sizes.
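The between-group test can be illustrated with a short sketch. It assumes that each subgroup has already been summarized by a random-effects mean and standard error, and it uses the ANOVA-analog form in which the statistic is the weighted sum of squared deviations of subgroup means from their weighted grand mean. The function name and example values are hypothetical, and the sketch does not reproduce CMA's internal implementation.

```python
def q_between(subgroup_means, subgroup_ses):
    """Between-group Q statistic and its degrees of freedom."""
    if len(subgroup_means) != len(subgroup_ses) or len(subgroup_means) < 2:
        raise ValueError("need at least two subgroups with matching inputs")
    # Each subgroup mean is weighted by the inverse of its variance.
    weights = [1 / se**2 for se in subgroup_ses]
    grand_mean = sum(w * m for w, m in zip(weights, subgroup_means)) / sum(weights)
    q_b = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, subgroup_means))
    return q_b, len(subgroup_means) - 1


# Illustrative values only: two levels of a categorical moderator.
q_b, df_b = q_between([0.20, 0.40], [0.10, 0.10])
print(round(q_b, 2), df_b)
```

The statistic is compared against a chi-square distribution with degrees of freedom equal to the number of subgroups minus one; with one degree of freedom, the critical value at α = .05 is 3.84.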
Sensitivity Analyses and Robustness Checks
We conducted four sensitivity analyses to assess whether our conclusions were robust to the choice of statistical methods and data inclusion decisions.
Robust variance estimation
As noted previously, we resolved statistical dependence among our sample of effect sizes using standard meta-analytic methods, namely, collapsing multiple effect sizes within each study to create a single synthetic effect. To utilize all effect sizes from each study, we reanalyzed the data set of effect sizes using robust variance estimation (RVE; Hedges, Tipton, & Johnson, 2010) with a correction for small sample size bias (Tipton, 2015). This approach permits the synthesis of multiple dependent effect sizes by adjusting the standard errors to account for an assumed correlation (ρ) between effect sizes within studies, thereby minimizing the loss of information that occurs through aggregation. One important limitation to this approach, however, is that a minimum of 40 independent studies with an average of five effect sizes per study are needed to estimate a meta-regression coefficient (Tanner-Smith & Tipton, 2014). This issue is particularly problematic in meta-analyses involving categorical variables with multiple levels. As a result, RVE methods were employed only in the synthesis of overall weighted mean effect sizes. All analyses using RVE were conducted in the R statistical environment (version 3.4.2) using the robumeta package (Fisher & Tipton, 2014; Tanner-Smith & Tipton, 2014).
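The logic of RVE for an overall mean can be sketched for the intercept-only case. The sketch assumes the approximate correlated-effects weights described by Hedges et al. (2010), takes the between-study variance as an input rather than estimating it from ρ, and omits the small-sample correction of Tipton (2015). It is an illustration of the idea and not the robumeta implementation; the function name and example data are hypothetical.

```python
import math


def rve_mean(studies, tau2=0.0):
    """Weighted mean of dependent effect sizes with a cluster-robust standard error.

    studies: one inner list per study, each holding (g, variance) tuples.
    tau2: assumed between-study variance (supplied, not estimated here).
    """
    weighted = []
    for effects in studies:
        k_j = len(effects)
        mean_var = sum(v for _, v in effects) / k_j
        # Every effect size in a study shares the same weight.
        w = 1 / (k_j * (mean_var + tau2))
        weighted.append((w, [g for g, _ in effects]))

    total_w = sum(w * len(gs) for w, gs in weighted)
    estimate = sum(w * g for w, gs in weighted for g in gs) / total_w

    # Sandwich variance: residuals are summed within study before squaring,
    # so within-study dependence does not need to be modeled exactly.
    meat = sum((w * sum(g - estimate for g in gs)) ** 2 for w, gs in weighted)
    se = math.sqrt(meat) / total_w
    return estimate, se


# Illustrative data only: one study with two effect sizes, one with a single effect.
example_studies = [[(0.2, 0.04), (0.4, 0.04)], [(0.5, 0.04)]]
rve_estimate, rve_se = rve_mean(example_studies)
print(round(rve_estimate, 3), round(rve_se, 3))
```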
Outliers
Boxplots were used to identify potential outliers, defined as effect sizes more than 1.5 interquartile ranges above the 75th percentile or below the 25th percentile of the distribution. Two effect sizes were identified as outliers in the treatment ES analyses. Because these outliers could not be attributed to methodological or theoretical differences between studies, and because the sample of studies was relatively small, we elected not to eliminate these estimates. Rather, we adjusted the outliers downward to more conservative values using the 90% Winsorization procedure described by Lipsey and Wilson (2001). All analyses were subsequently carried out using the adjusted data set. Results for the original sample are reported in a sensitivity analysis.
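The outlier rule and the Winsorization step can be sketched as follows. The sketch assumes that percentiles are obtained by linear interpolation and that 90% Winsorization recodes values beyond the 5th and 95th percentiles to those percentiles; other percentile conventions give slightly different cutoffs. The function names and effect sizes are hypothetical.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile for p in [0, 100] on a sorted list."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def iqr_outliers(values):
    """Values more than 1.5 IQRs below the 25th or above the 75th percentile."""
    ordered = sorted(values)
    q1, q3 = percentile(ordered, 25), percentile(ordered, 75)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]


def winsorize_90(values):
    """Recode values outside the 5th-95th percentile range to those bounds."""
    ordered = sorted(values)
    low, high = percentile(ordered, 5), percentile(ordered, 95)
    return [min(max(v, low), high) for v in values]


# Illustrative effect sizes only; the last value is an extreme estimate.
es = [0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 2.0]
flagged = iqr_outliers(es)
adjusted = winsorize_90(es)
print(flagged, max(adjusted))
```

Winsorizing keeps every study in the sample while limiting the influence of extreme estimates on the weighted mean.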
Study quality
We assessed the methodological quality of included studies using a version of the Quality Assessment Tool for Quantitative Studies (National Collaborating Centre for Methods and Tools, 2008), which was adapted for use with educational research. This quality appraisal tool uses judgments about the extent to which bias may be present in six methodological domains to produce an overall quality rating of weak, moderate, or strong. Although methodologically rigorous studies are more likely to produce valid results (Higgins, Altman, & Sterne, 2017), we decided not to exclude studies on the basis of methodological quality. Including these studies allowed us to maintain our sample of effect sizes and provided a more complete picture of the current research landscape. However, to examine whether our findings were sensitive to differences in study quality, we conducted a sensitivity analysis that excluded studies with an overall quality rating of weak.
Publication bias
We assessed the potential for publication bias among the sample of effect sizes included in the treatment ES analysis because studies that report null findings or relatively small effects are less likely to be published (Rosenthal, 1979; Song, Hooper, & Loke, 2013). We tried to mitigate publication bias a priori by seeking to include unpublished work (k = 6). The extent and impact of publication bias were assessed graphically using funnel plots and statistically using Egger’s linear regression test (Egger, Smith, Schneider, & Minder, 1997) and a trim and fill analysis of the corresponding funnel plots (Duval & Tweedie, 2000).
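Egger's regression test can be sketched with ordinary least squares. The sketch assumes the usual formulation in which each standardized effect (g divided by its standard error) is regressed on precision (the reciprocal of the standard error), and an intercept far from zero suggests funnel plot asymmetry. The function name and example values are hypothetical, and the p value, which requires a t distribution with k − 2 degrees of freedom, is left out.

```python
import math


def egger_intercept(effects, standard_errors):
    """Intercept of Egger's regression and its standard error."""
    k = len(effects)
    if k < 3 or k != len(standard_errors):
        raise ValueError("need at least three effects with matching standard errors")

    x = [1 / se for se in standard_errors]                    # precision
    y = [g / se for g, se in zip(effects, standard_errors)]   # standardized effect
    x_bar, y_bar = sum(x) / k, sum(y) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")

    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance with k - 2 degrees of freedom.
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = residual_ss / (k - 2)
    se_intercept = math.sqrt(sigma2 * (1 / k + x_bar**2 / sxx))
    return intercept, se_intercept


# Illustrative values only: smaller (less precise) studies report larger effects.
b0, b0_se = egger_intercept([0.8, 0.5, 0.3, 0.2], [0.4, 0.2, 0.1, 0.05])
print(round(b0, 3), round(b0_se, 3))
```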
Results
Contrasting the Effects of Inquiry and Traditional Instruction for ELL Students
Our first objective was to evaluate whether inquiry-based instruction is more effective than traditional instruction for ELL students. To address this question, we calculated the standardized mean difference (k = 23) in science learning outcomes between ELL students who received inquiry instruction (n = 4,204) and ELL students who received traditional instruction (n = 4,087). Figure 2 shows that, overall, ELL students receiving inquiry instruction obtained science scores that were over one-quarter of a standard deviation higher than those receiving traditional instruction, treatment ES = +0.28 (SE = 0.07, p < .001). The 95% confidence interval ranges from 0.15 to 0.41, suggesting that, overall, inquiry instruction produces a small positive impact on ELL students’ learning outcomes.

Figure 2. Estimated mean treatment effect size (difference in science achievement between English language learners in treatment and control conditions) for each study with overall mean weighted effect size. Forest plot showing treatment effect sizes with 95% confidence interval and 95% prediction interval. Studies with alphabetic superscripts refer to multiple independent effect sizes generated from the same study.
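As an illustration of how a standardized mean difference of this kind can be computed from group summary statistics, the following is a minimal sketch of Hedges' g using the standard small-sample correction. The function name and the input values are hypothetical and are not taken from any included study; the exact computations used in this meta-analysis may differ.

```python
import math

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, se_g) for two independent groups (hypothetical helper)."""
    df = n_t + n_c - 2
    # Pooled within-group standard deviation
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / pooled_sd
    # Small-sample correction factor J
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Large-sample variance of d, scaled by J^2 to give the variance of g
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    se_g = j * math.sqrt(var_d)
    return g, se_g

# Hypothetical summary statistics for an inquiry group and a traditional group
g, se_g = hedges_g(mean_t=52.0, sd_t=10.0, n_t=50, mean_c=49.0, sd_c=10.0, n_c=50)
print(f"g = {g:.2f}, SE = {se_g:.2f}")
```

A positive g here indicates higher scores in the first (inquiry) group, matching the sign convention of the treatment ES.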
Contrasting the Effects of Inquiry and Traditional Instruction Between Language Groups
Our second question examined whether inquiry instruction leads to comparable learning benefits for ELL and non-ELL students; the results are presented in Figure 3. To this end, we estimated the standardized mean difference (k = 30) in science learning outcomes between ELL (n = 5,459) and non-ELL (n = 42,700) students receiving inquiry instruction. The significant inquiry ES of −0.31 (SE = 0.08, p < .001) suggests that non-ELL students obtained science achievement scores that were about one-third of a standard deviation higher than those of ELL students.

Figure 3. Estimated mean inquiry effect size (difference in science achievement between English language learners and non-English language learners in treatment condition) for each study with overall weighted effect size. Forest plot showing inquiry effect sizes with 95% confidence interval and 95% prediction interval. Studies with alphabetic superscripts refer to multiple independent effect sizes generated from the same study. The data used to calculate an overall effect size for Lee et al. 2004–2007 are based on information reported in Lee, Maerten-Rivera, Penfield, Leroy, and Secada (2008); Lee, Mahotiere, Salinas, Penfield, and Maerten-Rivera (2009); and Lee, Penfield, and Maerten-Rivera (2009).
Next, we investigated how the achievement gap between ELL and non-ELL students receiving inquiry science instruction compared with the relative performance of ELL and non-ELL students receiving traditional science instruction. To do so, we calculated the standardized mean difference (k = 15) in science learning outcomes between ELL (n = 3,085) and non-ELL (n = 9,364) students who received traditional instruction in the control condition (see Figure 4). Overall, non-ELL students obtained science scores that were almost half a standard deviation higher than those of ELL students in traditional classrooms, traditional ES = −0.46 (SE = 0.12, p < .001). The achievement gap between ELL and non-ELL students was greater in science classrooms using traditional instruction (

Figure 4. Estimated mean traditional effect sizes (difference in science achievement between English language learners and non-English language learners in control condition) for each study with overall weighted effect size. Forest plot showing traditional effect sizes with 95% confidence interval and 95% prediction interval.
Heterogeneity of Effect Sizes
Heterogeneity analyses were conducted to assess the presence and degree of between-study variation using the Q-test and I² statistic. First, we tested the treatment ES, or the degree to which ELL students obtained higher outcomes with inquiry instruction, for heterogeneity. We found a high degree of heterogeneity among the studies, Q = 126.84, df = 22, p < .001, with the I² statistic revealing that 83% of the total observed variance could be attributed to between-study differences rather than within-study sampling error. Next, we examined the heterogeneity of the inquiry ES, or the achievement gap between ELL and non-ELL students receiving inquiry instruction. Once again, there was a high degree of heterogeneity among the studies, Q = 377.03, df = 29, p < .001, with the I² statistic indicating that 92% of the variance could be attributed to between-study differences. Finally, the traditional ES, or the achievement gap between ELL and non-ELL students receiving traditional science instruction, also showed a high degree of heterogeneity, Q = 246.77, df = 11, p < .001, with the I² statistic revealing that 92% of the variance is attributable to true heterogeneity. Given the significant heterogeneity in each sample of effect sizes, we conducted a set of moderator analyses to identify the sources of between-study variation.
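For readers who want to see how the I² statistic relates to Q and its degrees of freedom, the sketch below applies the conventional formula I² = (Q − df) / Q, truncated at zero, to the Q statistics reported above for the treatment ES and inquiry ES. The function name is illustrative, and the software used for the original analyses may apply a slightly different estimator.

```python
def i_squared(q, df):
    """Percentage of observed variance attributed to between-study differences.

    Conventional formula: I^2 = max(0, (Q - df) / Q) * 100.
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

# Q statistics and degrees of freedom as reported in the text
print(f"Treatment ES: I^2 = {i_squared(126.84, 22):.0f}%")
print(f"Inquiry ES:   I^2 = {i_squared(377.03, 29):.0f}%")
```

Values of Q at or below df yield an I² of zero, which is read as no detectable heterogeneity beyond sampling error.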
Moderation Analyses
To identify moderating factors that may influence the effect of inquiry instruction on ELL students’ science achievement, we conducted two sets of analyses to examine the potential influence of categorical and continuous moderators on each effect size. Table 1 presents the results for categorical moderators obtained from the subgroup analyses for treatment ES, while the subgroup moderation results corresponding to traditional ES and inquiry ES are displayed in Table 2 and Table 3, respectively. Table 4 presents the results for continuous moderators obtained from the weighted random effects meta-regression analyses. To reduce the risk of confounding in the meta-regression analyses, each predictor is included in the regression analyses as a covariate, along with the following indicators of methodological quality: publication status, research design, and measurement design.
Overall Weighted Mean Treatment Effect Size (ES) for Subgroup Analyses of Categorical Moderators
†p < .10. *p < .05. **p < .01. ***p < .001.
Overall Weighted Mean Inquiry Effect Size (ES) for Subgroup Analyses of Categorical Moderators
†p < .10. *p < .05. **p < .01. ***p < .001.
Overall Weighted Mean Traditional Effect Size (ES) for Subgroup Analyses of Categorical Moderators
†p < .10. *p < .05. **p < .01. ***p < .001.
Meta-Regression of Continuous Variables on Overall Weighted Mean Effect Sizes (ES)
Note. Random effects models were used in all meta-regression analyses. Random effects variance components were estimated using maximum likelihood. Effect sizes computed as Hedges’ g. Reference group for controls = unpublished study, quasi-experiment, posttest-only design.
†p < .10. *p < .05. ***p < .001.
Publication status
Publication status moderated the findings for treatment ES (QB = 8.19, df = 1, p = .004). Studies published in peer-reviewed journals had average treatment ESs that were significantly larger than those in nonpublished studies (
Research design
We found significant moderation effects based on the research design for the treatment ES (QB = 5.20, df = 1, p = .02) but not for the inquiry ES or traditional ES. Studies using quasi-experimental designs showed significantly larger treatment ESs (
Measurement design
The type of measurement design moderated findings for both the inquiry ES (QB = 14.52, df = 1, p < .001) and traditional ES (QB = 16.18, df = 1, p < .001) but not for the treatment ES. Studies that used posttest-only designs revealed science achievement gaps between ELL and non-ELL students that were on average three to four times larger than those using pretest-posttest designs for both the inquiry ES (
Assessment format and assessment type
Whereas assessment format did not moderate the findings for any of the three main effect sizes, assessment type was a moderator for the treatment ES (QB = 5.08, df = 1, p = .024), inquiry ES (QB = 5.03, df = 1, p = .025), and traditional ES (QB = 2.92, df = 1, p = .087). For the treatment ES, studies using researcher-developed assessments (
Student grade level
We treated grade level as a continuous variable. When controlling for methodological quality, the meta-regression revealed a significant negative association between average student grade level and magnitude of effect for the traditional ES (b = −0.20, SE = 0.06, p < .001). Because the traditional ES is negative, this effect suggests that the science achievement gap between ELL and non-ELL students widens under traditional instruction at higher grade levels. No other moderation effects involving grade level were significant.
Professional development
Whereas the dosage of professional development was not a significant moderator for any of the effects of interest, the focus of professional development training moderated the findings for treatment ES (QB = 8.74, df = 2, p = .013). Studies in which professional development focused on supporting ELL students yielded larger treatment ESs (
Length of treatment
Although the length of treatment moderated the findings for the treatment ES and traditional ES, the moderation effects were very small. Specifically, we found a significant negative association between the length of treatment (in weeks) and magnitude of effect for the treatment ES (b = −0.01, SE = 0.01, p = .03) and traditional ES (b = −0.02, SE = 0.004, p < .001).
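The p values attached to the between-group QB statistics in the subgroup analyses above can be checked against a chi-square distribution with the stated degrees of freedom. The sketch below uses closed-form upper-tail probabilities that hold only for 1 and 2 degrees of freedom, which covers every QB test reported in this section; the function name is illustrative, and a general implementation would rely on a statistics library instead.

```python
import math

def chi_square_p(q, df):
    """Upper-tail chi-square probability for df = 1 or df = 2 only."""
    if q < 0:
        raise ValueError("q must be non-negative")
    if df == 1:
        # P(X > q) = erfc(sqrt(q / 2)) for one degree of freedom
        return math.erfc(math.sqrt(q / 2.0))
    if df == 2:
        # P(X > q) = exp(-q / 2) for two degrees of freedom
        return math.exp(-q / 2.0)
    raise ValueError("closed form implemented only for df = 1 or df = 2")

# QB statistics and degrees of freedom as reported in the text
for label, qb, df in [
    ("publication status, treatment ES", 8.19, 1),
    ("research design, treatment ES", 5.20, 1),
    ("professional development focus, treatment ES", 8.74, 2),
]:
    print(f"{label}: p = {chi_square_p(qb, df):.3f}")
```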
Sensitivity Analysis and Robustness Checks
In an effort to examine the robustness of the overall mean effect size estimates, a series of sensitivity analyses were conducted.
Robust variance estimation
To assess the impact of using alternative statistical methods for handling dependence on our findings, we reanalyzed the data set using robust variance estimation (RVE). Results from the RVE meta-analysis were virtually identical to those produced in the standard meta-analysis across all effect size estimates: treatment ES = +0.31, SE = 0.08, df = 20.9, p < .01; inquiry ES = −0.31, SE = 0.07, df = 28.2, p < .001; traditional ES = −0.46, SE = 0.12, df = 14, p < .01. Taken together, these analyses suggest that the effect size estimates reported in the meta-analysis are robust to the statistical approach used to model correlated effect sizes (see Appendix B for results).
Outliers
Two effect sizes were identified as outliers in the treatment ES analyses and replaced with Winsorized values. Sensitivity analysis revealed that, when the original unadjusted values are used, the overall treatment ES increases to +0.36 (SE = 0.09, p < .001). Excluding these outliers from the analysis reduces the overall treatment ES to +0.22 (SE = 0.05, p < .001), which remains both positive and statistically significant. Thus, our interpretations of the findings remain the same with or without adjustment of the two outliers.
Study quality
The quality of evidence used to estimate the effectiveness of inquiry instruction for ELL students was encouraging as the majority of included studies were rated as strong or moderate. However, sensitivity analysis revealed that low-quality studies were associated with larger effect sizes and therefore may be overestimating the benefits of inquiry instruction for ELL students (see Appendix C for details). To assess whether our findings were driven by low-quality studies, we restricted our sample to studies with overall methodological quality ratings of either strong or moderate. Results from this analysis produced an overall treatment ES of +0.22 (SE = 0.07, p = .003). Thus, the pattern of results and our substantive interpretations of them remained the same with or without the inclusion of methodologically weak studies.
Assessment of publication bias
Two statistical analyses provided no evidence of publication bias. Egger’s linear regression test was performed to evaluate the extent to which a study’s effect is related to its sample size by regressing the effect size estimate of a study against the precision of the study, indexed by its standard error. We found no association between these factors (bias coefficient = −0.83, SE = 1.40, p = .56). A trim and fill analysis of the funnel plot shown in Figure 5 did not identify any effect sizes that needed to be trimmed or filled, suggesting that publication bias is unlikely to affect our findings.

Figure 5. Funnel plot used for evaluating publication bias. Average weighted treatment effect size estimated for each study (horizontal axis) plotted against corresponding standard error (vertical axis).
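To make the logic of Egger's test concrete, the following minimal sketch implements its classic form: each study's standardized effect (effect size divided by its standard error) is regressed on its precision (the reciprocal of the standard error), and the intercept serves as the bias coefficient. The function name and the data are hypothetical, and this formulation is an assumption that may differ in detail from the software used for the analysis reported above.

```python
import math

def egger_intercept(effects, std_errors):
    """Return (intercept, se_intercept) from an ordinary least squares fit
    of standardized effect on precision. Needs at least three studies
    with standard errors that are not all equal."""
    y = [es / se for es, se in zip(effects, std_errors)]  # standardized effects
    x = [1.0 / se for se in std_errors]                   # precision
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom
    resid_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = resid_ss / (n - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, se_intercept

# Hypothetical effect sizes and standard errors, for illustration only
effects = [0.10, 0.25, 0.30, 0.45, 0.20, 0.35]
std_errors = [0.05, 0.10, 0.15, 0.20, 0.08, 0.12]
bias, bias_se = egger_intercept(effects, std_errors)
print(f"bias coefficient = {bias:.2f} (SE = {bias_se:.2f})")
```

An intercept close to zero relative to its standard error is read as no evidence of small-study effects, which is how the nonsignificant bias coefficient above is interpreted.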
Discussion
Question 1: Is Inquiry Instruction Effective for Teaching Science to ELL Students?
The primary purpose of this meta-analytic review was to examine whether inquiry instruction is an effective approach for teaching science to elementary-grade ELL students relative to traditional approaches of science instruction. Overall, we found that ELL students who received inquiry instruction demonstrated gains in science scores that were approximately one-quarter of a standard deviation higher than their ELL peers receiving traditional, direct instruction. Our findings extend past research on science teaching and learning with mainstream K–12 students, including low-performing and at-risk non-ELL students (Hill, Bloom, Black, & Lipsey, 2008; Lipsey et al., 2012) to young, elementary school aged ELL students.
Despite the pedagogical and theoretical concerns about inquiry instruction for ELL students (e.g., Kirschner et al., 2006; Secker, 2002; Tobias & Duffy, 2009), most studies found that inquiry instruction was as effective as or more effective than traditional science instruction for ELL students. Only one of the studies we identified reported that inquiry instruction was significantly worse for ELL students than traditional instruction. Further, the 95% prediction interval ranged from −0.30 to 0.86. Assuming these true effect sizes are normally distributed about the mean (Borenstein et al., 2009), we can predict that about 71% of future studies would yield a meaningful positive effect (between 0.10 and 0.86) favoring inquiry instruction, and 7% of future studies would yield a meaningful negative effect (between −0.10 and −0.30) favoring traditional instruction. Thus, our findings mitigate concerns that inquiry instruction may hinder science outcomes for ELL students.
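The proportions in the preceding paragraph can be approximated from the reported mean and prediction interval. The sketch below assumes a normal distribution centered on the overall treatment ES and backs out its standard deviation from the width of the 95% prediction interval. This normal approximation is an assumption made here for illustration; the reported figures may rest on a t distribution, so the results agree only to within about a percentage point.

```python
from statistics import NormalDist

mean_es = 0.28                      # overall treatment ES reported above
pi_lower, pi_upper = -0.30, 0.86    # 95% prediction interval reported above

# Back out the standard deviation implied by the interval width
z975 = NormalDist().inv_cdf(0.975)
sd_pred = (pi_upper - pi_lower) / (2 * z975)
dist = NormalDist(mu=mean_es, sigma=sd_pred)

# Share of predicted effects in each "meaningful" range
share_positive = dist.cdf(0.86) - dist.cdf(0.10)
share_negative = dist.cdf(-0.10) - dist.cdf(-0.30)

print(f"meaningful positive effects: {share_positive:.0%}")
print(f"meaningful negative effects: {share_negative:.0%}")
```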
While these findings show that ELL students have stronger science outcomes from inquiry instruction compared to traditional approaches to science instruction, further research is needed to better understand why. Although several compelling reasons have been suggested in the literature, such as rich peer-to-peer collaboration and hands-on learning, perhaps the most widely cited explanation is that inquiry instruction reduces the reliance on language for engaging with and understanding scientific content through hands-on learning activities that emphasize nonverbal forms of processing and participation (Huerta & Jackson, 2010; Lee & Buxton, 2013). The active learning involved in inquiry instruction is thought to maximize meaningful learning opportunities for ELL students by diminishing the heavy linguistic demands associated with traditional forms of textbook and lecture-based learning (Fang, 2006; Lewis, Lee, Santau, & Cone, 2010). These practices align with instructional strategies considered effective for ELL students (Echevarria et al., 2011; Goldenberg, 2013). Because the advantages were greater when teachers received professional development focused on supporting ELL students, it is possible that inquiry instruction’s benefits may stem from its alignment with best practices for ELL students. However, further research is needed to explore the mechanism by which inquiry instruction provides stronger learning outcomes for ELL students.
Question 2: Does Inquiry Instruction Provide Comparable Effects for ELL and Non-ELLs?
In addition to examining the effectiveness of inquiry instruction for ELL students, we explored whether ELL students experienced learning benefits that are comparable to those enjoyed by their non-ELL peers. Although non-ELL students showed stronger science outcomes under both instructional models, the achievement gap between ELL and non-ELL students was smaller with inquiry instruction. Although the achievement gap may be diminished further by restricting our findings to studies that used pretest-posttest designs, the more limited gains shown by ELL students remain statistically significant and practically important. ELL students’ learning outcomes may have been adversely affected by their limited proficiency in English and weak mastery of academic language in ways that non-ELL students are not affected (Lee, 2005). For example, common linguistic features in science texts and discourse that may interfere with ELL students’ comprehension and learning include the frequent use of ordinary words with nonvernacular meanings, complex sentence structures, and even passive voice (Fang, 2006). In contrast, non-ELL students are relatively unaffected by these factors and therefore may be better able to construct scientific knowledge (Bresser & Fargason, 2013; Fang, 2006; Janzen, 2008; Mayer, 2004). Thus, while inquiry science instruction may better support learning for ELL students than traditional instruction by mitigating linguistic demands through the use of hands-on activities, it may not be enough to fully remedy the comprehension difficulties ELL students experience in science class (August et al., 2009; Bravo & Cervetti, 2014; Kieffer et al., 2009).
Alternatively, the differentially smaller effects of inquiry instruction on ELL students’ science achievement may result from their more limited prior experience with science. There have been concerns that although inquiry instruction may provide rich benefits to high-performing students by providing them with opportunities to apply, test, and deepen their understanding of science (Kirschner et al., 2006; Klahr & Nigam, 2004), students with less exposure to science are less likely to have the necessary schemata in place to construct deep meaning through self-guided exploration. Consequently, these students are less likely to develop the same depth of understanding of the scientific principles they are exploring (Kendeou & van den Broek, 2007; Norman & Schmidt, 1992). Thus, despite showing greater achievement through inquiry instruction than traditional science instruction, ELL students may not have sufficient background knowledge to reap the same benefits from inquiry instruction as their non-ELL peers.
In sum, although the evidence suggests that inquiry instruction may differentially benefit ELL and non-ELL students, it nonetheless promoted substantive gains in science for both groups of students. Although preliminary, findings from this review also suggest that inquiry instruction may help diminish the achievement gap between ELL and non-ELL students in science. Based on this evidence, we argue that inquiry instruction has the potential to effectively educate a diverse body of students and recommend that further efforts be taken to understand how inquiry instruction can be adapted to better serve ELL students’ instructional needs and further reduce the achievement gap between ELL and non-ELL students in science.
Question 3: What Factors Moderate the Effect of Inquiry Instruction for ELLs?
The final aim of this study was to investigate factors that may moderate the effectiveness of inquiry instruction for ELL students. One such factor was professional development. We found equivocal evidence concerning the effects of the duration of professional development. Although the moderation analysis found that the duration of professional development was unrelated to the impact of inquiry instruction on ELL students’ science achievement, partitioning our sample into subgroups with at least 15 hours or less than 15 hours of professional development provided suggestive results consistent with Yoon et al. (2007). More specifically, we found that studies with at least 15 hours of professional development yielded positive and statistically significant treatment effects whereas those that offered less professional development time did not. In contrast, we found that the focus, or content, of professional development yielded a far more robust effect. Studies in which professional development focused on building teachers’ skills and knowledge to address ELL students’ academic and linguistic needs yielded greater positive effect sizes than those that did not. Taken together, our findings add to a growing body of work suggesting that a minimal threshold of professional development training geared toward the needs of the target student population may better enable teachers to develop the pedagogical skills and knowledge to implement inquiry instruction in a way that maximizes ELL student learning (Darling-Hammond, Chung-Wei, Andree, Richardson, & Orphanos, 2009; Garet, Porter, Desimone, Birman, & Yoon, 2001).
Methodological features, such as the study’s publication status, research design, and assessment type, also moderated the reported effects of inquiry instruction on ELL students’ science achievement. Consistent with what is commonly found in meta-analytic research (Lazonder & Harmsen, 2016; Schroeder, Scott, Tolson, Huang, & Lee, 2007; Seidel & Shavelson, 2007), published studies and quasi-experiments produced significantly larger effect sizes than unpublished studies and randomized experiments. These findings suggest that differences in methodological quality are correlated with effect size estimates. Furthermore, researcher-developed assessments yielded significantly greater effect sizes than standardized assessments, which may be attributed to the proximal versus distal nature of the assessment used (Ruiz-Primo, Shavelson, Hamilton, & Klein, 2002). That is, researcher-developed assessments tend to be more closely aligned with the content of the intervention than standardized assessments. Thus, researcher-developed assessments tend to be more sensitive to the impact of instructional intervention on student achievement, resulting in larger effect sizes.
Alternatively, this effect may reflect the degree to which assessments present ELL students with the opportunity to demonstrate their science knowledge. Standardized tests are restricted almost exclusively to multiple-choice items, whereas researcher-developed assessments tend to be more amenable to open-ended and constructed response items. Constructed response items may offer a more meaningful measure of achievement insofar as they enable students, particularly ELL students, to express their scientific understanding using their own words, without being constrained by the linguistically demanding and potentially confusing options provided in multiple-choice formats (Turkan & Liu, 2012). Our meta-analytic findings provide some evidence for this assumption, as effect sizes tended to be higher for constructed response items than multiple-choice items. However, the limited number of studies that reported outcomes separately for constructed response and multiple-choice item formats tempers our confidence in these results. Indeed, further research is needed to disentangle whether one method of assessment offers more valid inferences than another.
Research Limitations and Future Directions
Although our findings contribute to the literature, they should be interpreted in light of the study’s limitations. First, the stringent criteria applied in the study selection phase led to the exclusion of a large number of qualitative studies that did not provide the statistical information needed to calculate effect sizes. Although these studies may provide valuable descriptive insight into the experiences of ELL students during inquiry instruction, they could not be included in a quantitative synthesis. We recommend that future research consult qualitative studies for evidence of additional contextual factors that may explain variability in the effectiveness of inquiry-based instruction.
Second, rather than include the full range of English proficiency of language minority students in this meta-analysis, we focused on current ELL students and excluded former ELL students who were either reclassified as fluent English proficient, exited from English language development programs, or no longer received English language support services. This decision may restrict the generalizability of the findings to a narrow subpopulation of ELL students. However, including the scores of former ELL students, who typically resemble non-ELL students in terms of English proficiency and academic achievement, could obscure the effects of inquiry instruction for current ELL students. Indeed, including former ELL students may risk results that overstate the benefits of inquiry learning for ELL students and understate the science achievement gap between ELL and non-ELL students (cf. Saunders & Marcelletti, 2013). Nonetheless, we recognize this approach as a limitation and therefore recommend that future research focus on investigating the effect of inquiry learning across various levels of English language proficiency.
Third, while the sample size included in this meta-analysis was large enough to compute reliable main effect sizes, we often lacked sufficient data to compute potentially interesting moderating effects, mainly because primary studies failed to report such information. For example, few studies reported salient features of the instructional approach (e.g., time on task) or demographic information pertaining to students (e.g., race/ethnicity, gender, primary language spoken at home) and teachers (e.g., years of teaching experience, educational attainment, pedagogical training), which have important theoretical and pedagogical implications. As such, we recommend that future research in science teaching include clear descriptions of not only the instructional approach but also the student and teacher samples.
Conclusion
This synthesis contributes to the field and advances our current understanding of evidence-based science pedagogy in several ways. For example, despite a growing body of individual studies suggesting that inquiry instruction is a particularly effective approach for teaching science to ELL students, conflicting results and a lack of empirical consensus have rendered the precise nature and magnitude of its effects unclear. In an effort to address this gap, our study systematically surveyed the literature on inquiry instruction and synthesized empirical findings from over a decade’s worth of research. Consequently, results from this study can be used to inform national dialogue concerning effective, appropriate, and equitable instructional practices for linguistically diverse students, which is essential given the growing number of ELL students in U.S.-based public schools. These findings have never been more important as educational practitioners, researchers, and policymakers deliberate solutions to pressing issues facing K–12 schooling and make critical decisions that will undoubtedly shape the future of science education for all students. Finally, this study serves as a comprehensive synthesis of what the current state of literature reveals about the effectiveness of inquiry instruction for ELL students, providing valuable information and insight for those interested in advancing the frontier in opportune ways. Such guidance is particularly important given the extraordinary amount of resources involved in the planning and execution of truly effective and impactful research.
Overall, the findings from our work suggest that inquiry instruction has the potential to improve science learning and performance for not only English-proficient students but also ELL students—albeit to a lesser extent. Although the learning benefits associated with inquiry instruction are compelling, our data suggest that ELL students might require additional academic and linguistic support if they are to attain a level of science achievement that is on par with their English-proficient peers. Therefore, we urge researchers to not only continue to explore the nature of inquiry instruction through applied experimental investigation but also to investigate other avenues for improving ELL students’ science learning and performance, in particular, research programs aimed at investigating the mechanisms by which inquiry-based instruction leads to improved achievement, how technology can be used to support such outcomes, and whether the NGSS provides other opportunities for ELL students to access quality science education.
Appendix
Examining the Effectiveness of Inquiry Instruction for English Language Learner Students by Overall Study Quality Ratings
| Quality Rating | k | Treatment ES | SE | 95% CI Lower | 95% CI Upper | p | I² (%) | Test of Difference: QB | df | p |
|---|---|---|---|---|---|---|---|---|---|---|
| Individual studies | | | | | | | | 3.11 | 2 | .21 |
| Strong | 6 | 0.20 | 0.11 | −0.02 | 0.42 | 0.074 | 21 | | | |
| Moderate | 11 | 0.23 | 0.10 | 0.04 | 0.42 | 0.020 | 76 | | | |
| Weak | 6 | 0.46 | 0.12 | 0.23 | 0.69 | < 0.001 | 91 | | | |
| Combined studies | | | | | | | | | | |
| High quality | 17 | 0.22 | 0.07 | 0.07 | 0.35 | 0.003 | 66 | 3.27 | 1 | .07 |
| Low quality | 6 | 0.46 | 0.12 | 0.23 | 0.69 | < 0.001 | 91 | | | |
Note. Appraisal of study quality measured using the Quality Assessment Tool for Quantitative Studies. High-quality group composed of studies with overall quality ratings of strong and moderate. Low-quality group composed of studies with overall quality ratings of weak.
Authors’ Note
GE and JA are each supported by the National Science Foundation (Grant No. DGE-1321846), and SMJ is supported by the National Institute on Aging (Grant No. 1K02AG054665-01). SMJ has an indirect financial interest in the MIND Research Institute, whose interests are related to this work.
Authors
GABRIEL ESTRELLA is a PhD candidate in the School of Education at the University of California, Irvine. His research focuses on evidence-based learning strategies and teaching methods, the academic achievement and motivational outcomes of underrepresented students in science, and meta-analytic research methods.
JACKY AU is a PhD candidate in the Department of Cognitive Science at the University of California, Irvine. He is a meta-analyst, specializing in the field of cognitive training and plasticity.
SUSANNE M. JAEGGI is a faculty member at the School of Education and the Department of Cognitive Sciences at the University of California, Irvine. She studies learning and individual differences across the life span, focusing on the development of cognitive interventions and the investigation of whether and how those interventions generalize to nontrained cognitive domains.
PENELOPE COLLINS is an associate professor in the School of Education at the University of California, Irvine. Her research examines the development of language, literacy, and academic skills for children from linguistically diverse backgrounds and effective educational practices to support English language learners.
