Abstract
In this preregistered within-teacher randomized controlled trial (n = 84), we tested the effects of grouping English learners (ELs) in homogeneous groups (all ELs) versus heterogeneous groups (ELs and non-ELs) on language, reading comprehension, and argumentative writing. Findings indicated no significant main effects of grouping. However, preregistered moderation analyses indicated that heterogeneous groups benefited students with higher English language skills (Hedges’ g = 0.27–0.59 or 0.75–1.93 grade equivalents), whereas homogeneous groups benefited students with lower English skills (g = 0.31–0.58 or 1.00–1.55 grade equivalents). Instructional observations indicated that teachers provided more specialized strategies for ELs in homogeneous groups and more authentic questions for students in heterogeneous groups. Findings question the default use of homogeneous grouping and support considering English proficiency when making instructional and policy decisions for EL instruction.
Introduction
Every day, teachers face decisions about how to group students for literacy instruction. For teachers in linguistically diverse classrooms, the potential challenges and benefits of grouping decisions can be magnified, because teachers consider students’ language proficiencies alongside their reading levels and social dynamics. Pervasive dilemmas arise in such contexts: Should teachers group all their multilingual students classified as English learners (ELs) together to target instruction as a function of language level (i.e., homogeneous grouping), or should they group ELs with non-ELs to leverage peer modeling and interaction in English (i.e., heterogeneous grouping)?
Complicating the challenge is that both homogeneous and heterogeneous grouping approaches have some support from theory and research. For instance, teachers have reported that homogeneous groups may be easier to teach and allow for more targeted, direct instruction (e.g., Estrada, 2014; Kieffer et al., 2023). Likewise, policies that highlight ELs as a qualitatively different subgroup of students encourage homogeneous grouping such that ELs receive separate and differentiated instruction (Estrada, 2014; Gándara & Orfield, 2012). In contrast, researchers have documented the negative effects of tracking and homogeneous grouping by ability in general (e.g., Oakes, 2005; Murphy, Greene, et al., 2017) and by EL status in particular (for a review, see Gamoran, 2017). Negative effects of homogeneous grouping of ELs are consistent with foundational sociolinguistic and second-language acquisition theory that contends that access to—and interaction with—English-proficient peers is key to learning a second language (e.g., van Lier, 2002; Wong Fillmore, 1991).
Given the wide diversity of ELs along multiple dimensions (e.g., National Academies of Sciences, Engineering, and Medicine [NASEM], 2017), the effects of heterogeneous versus homogeneous grouping have the potential to be highly variable. In particular, students’ responses to the affordances of the different types of grouping may depend on their English language skills, especially in the ubiquitous contexts of English-only instruction in the United States. For instance, it may be the case that a threshold of English skills is necessary to benefit from interaction with non-EL peers or that students with more emerging English skills receive less attention from their teachers in heterogeneous groups. In both cases, English language skills would positively moderate the effects of heterogeneous grouping such that ELs with more advanced language skills would benefit more from heterogeneous groups. Heterogeneous groups also may be characterized by a more interactive or equitably distributed discursive environment, potentially springing from teachers having higher linguistic expectations. Alternatively, ELs with emerging English language skills may have the most to gain from such discourse or from interactions with their non-EL peers, which would yield negative moderation effects of heterogeneous grouping.
In this preregistered randomized controlled trial, we explored the effects of heterogeneous grouping for instruction (i.e., grouping ELs with non-ELs) versus homogeneous grouping (i.e., grouping only ELs together) on ELs’ language and literacy outcomes. We did so in the context of a small-group, language-based literacy curriculum designed with and for multilingual learners called Cultivating Linguistic Awareness for Voice and Equity in Schools (CLAVES; Proctor et al., 2021). We focus on grades 4 and 5, a time when the increasing language demands of text can be particularly challenging for ELs (e.g., Uccelli et al., 2015). All participating ELs received ~12 weeks (18 total hours) of CLAVES, with half randomly assigned to homogeneous groups (consisting of four ELs) and half assigned to heterogeneous groups (consisting of two ELs and two non-ELs). Students (n = 84) were individually randomly assigned to within-teacher groups, with teachers providing instruction to both the homogeneous and heterogeneous groups. We explored the effects of grouping on proximal language outcomes, linguistic awareness, reading comprehension, and argumentative writing. In addition to evaluating the main effects of grouping, we explored moderation by initial English language skills to examine whether and how grouping mattered for student outcomes. Finally, we observed and documented teachers delivering the same lesson to each group across three occasions and analyzed these data to explore the potential mechanisms of grouping effects on outcomes.
Grouping of ELs
Although there is clear consensus on the negative effects of ability tracking for non-ELs (i.e., grouping students in classes by academic achievement; e.g., Gamoran, 2017), there is much less evidence and consensus on how to group ELs for instruction. There are multiple ways ELs can be grouped (including grouping ELs together, but separated by English proficiency level), but much of the theory and research has focused on whether to group ELs with non-ELs, which also was our focus in this study.
There is a strong theoretical basis for heterogeneous grouping of ELs with non-ELs. Ecological theories of language development (Kramsch, 2002; van Lier, 2002) emphasize the social and multiply situated nature of learning language. Rather than viewing language as simply a set of skills taught by a teacher to a student, an ecological view acknowledges the layers of affordances (including peers) for language learning in a given setting (van Lier, 2002). These affordances are central to the process of learning language for authentic social goals. Relatedly, scholars have argued that opportunities to interact with more English-proficient models are essential for ELs’ language learning (e.g., Wong Fillmore, 1991). Indeed, research in dual language immersion programs suggests that homogeneous grouping is less efficacious than heterogeneous grouping given the linguistic and social expectations of such models (e.g., Hamayan et al., 2013).
Attitudes Toward Grouping
At the same time that theories support heterogeneous grouping, practice has consistently favored homogeneous grouping and separate tracking for ELs (e.g., Callahan, 2005; Estrada, 2014; Umansky, 2016). For instance, in a recent survey of 154 middle and high school teachers, 58% agreed or strongly agreed with the statement, “Courses comprised of mostly/only ELs provide better instructional support for ELs than general education courses” (Kieffer et al., 2023). This attitude is arguably promoted by research and professional literature highlighting the unique linguistic needs of ELs (e.g., Baker et al., 2014). Educators may respond by grouping students separately to better focus on those needs. As such, homogeneous groupings may ensure that ELs receive sufficient instructional attention, whereas heterogeneous groupings may result in less teacher attention and fewer opportunities to engage with and access the curriculum.
Grouping Policy
Theoretical orientations toward grouping for ELs are also at odds with common policies, including the Every Student Succeeds Act, its predecessors, and related state policies. Such policies emphasize the status of ELs as a particular subgroup of students who require specialized linguistic support and instruction. Even when such policies do not explicitly call for homogeneous grouping, they may implicitly support it by virtue of their calls for specialized supports (e.g., Gándara & Orfield, 2012). On the one hand, research has documented substantial tracking of ELs and associations with negative effects (Black et al., in press; Callahan, 2005; Umansky, 2016). On the other hand, grouping of ELs varies considerably across schools and districts, suggesting that concentrating ELs together is more popular among some educators than others (Estrada, 2014).
Grouping for Intervention
Reading intervention research also presents conflicting information about heterogeneous versus homogeneous grouping for ELs. A review by Richards-Tutor et al. (2016) of reading interventions for ELs with, or at risk for, learning disabilities found that all the interventions were provided in relatively homogeneous small groups or one on one. By contrast, classwide interventions targeting oral language development have been conducted successfully in heterogeneous groups (e.g., Carlo et al., 2004; Lesaux et al., 2014). A review of 14 peer-mediated intervention studies conducted with ELs (Pyle et al., 2017) found that interventions with structured, heterogeneous grouping or peer pairing were associated with positive academic outcomes. However, this review could not determine whether homogeneous grouping would have yielded similar effects due to a lack of studies with homogeneous grouping. Baker et al. (2014) recommend homogeneous grouping for teaching foundational reading skills but heterogeneous grouping for teaching oral language, writing, and comprehension. However, this recommendation was based on expert opinion rather than on direct empirical evidence.
Perhaps the most relevant findings for this study come from evaluation of the eighth grade Promoting Adolescents’ Comprehension of Text (PACT) intervention by Vaughn et al. (2017). They found that the effects of their discussion-based intervention on students’ acquisition of social studies content knowledge were moderated by the proportion of ELs in the class. Specifically, students made larger gains from the intervention in classes with fewer ELs, and this pattern was stronger for ELs than for non-ELs. This correlational finding supports the hypothesis that heterogeneous groups that include ELs and non-ELs are better contexts for language-based interventions than homogeneous groups with only ELs. However, as the authors acknowledged, there are many confounding factors at the class and school levels that may have influenced this finding, raising the need for experimental research.
Controlling for Curriculum
Questions about grouping can be particularly challenging to study absent a common instructional framework, because grouping approaches will vary with different curricula. Further, if instruction is generally ineffective, grouping decisions are likely to matter less. Thus, in this study, we controlled for instruction by using CLAVES, a small-group literacy curriculum, and provided it to all students in the heterogeneous and homogeneous groups. CLAVES was designed with and for multilingual students and their teachers and centers on four principles (Proctor et al., 2021): metalinguistic awareness, dialogic teaching, multimodality, and multilingualism. Using these principles, CLAVES takes a language- and discussion-based instructional approach that teaches discrete language skills (language-based) by centering student talk (discussion-based) to influence reading comprehension. Prior evidence suggests that CLAVES is efficacious at promoting ELs’ language and reading comprehension (Proctor et al., 2020) as well as writing (Silverman et al., 2021).
In this study, we did not seek to establish further evidence of the efficacy of the curriculum. Instead, we chose CLAVES as a promising baseline instructional model for testing the effects of grouping, because (a) it has preliminary evidence of efficacy (Proctor et al., 2020; Silverman et al., 2021), (b) it is designed for use with small groups, (c) it attends to language beyond vocabulary, which allows for a broader variety of outcomes, (d) it is designed for use with upper elementary school–aged ELs who face increasing language demands of texts, and (e) teachers are able to implement it with fidelity (Proctor et al., 2020). Moreover, this curriculum was grounded in findings from an existing research base that derived instructional recommendations from studies conducted with heterogeneous classes of ELs and non-ELs (e.g., Baker et al., 2014), so it is appropriate for the heterogeneous groups.
Theoretical Mechanisms for Grouping: Overall and by English Language Skills
Even with a common curricular framework, teachers may adapt their instruction in meaningful ways to the students in front of them (Troyer, 2019). In linguistically diverse classes, these adaptations may affect all students equally or may be particularly important for students with particular levels of English proficiency. We hypothesize three main mechanisms for the effects of grouping of ELs, along with subhypotheses for how they may affect students with more emergent or developed English language skills.
First, homogeneous groups may allow teachers to use techniques that are targeted to ELs’ needs. For instance, they may be more likely to pay more attention to basic vocabulary, to prompt students to visualize concepts, and to leverage use of the home language (see Goldenberg et al., 2013). Targeting ELs’ language needs is commonly cited by teachers as an advantage of grouping ELs together (Kieffer et al., 2023). In contrast, teachers working with groups with both ELs and non-ELs may neglect these supports. We hypothesize that these adaptations (if enacted) would yield better language outcomes for ELs, particularly for ELs with more emerging English language skills, who may need such supports the most.
Second, heterogeneous groups may be characterized by more sophisticated teacher talk and questioning. The literature on tracking has long documented differences in teacher talk across high- and low-tracked classrooms and groups (e.g., Applebee et al., 2003; Gamoran & Nystrand, 1994; Murphy, Greene, et al., 2017), and we expect similar or greater differences favoring heterogeneous small groups over homogeneous ones. We hypothesized that these differences would yield better outcomes in language and reading comprehension for heterogeneous groups, perhaps more so for ELs with better developed English language skills, who may be better positioned to respond to sophisticated talk in English.
Third, related to the second hypothesis, heterogeneous groups may be characterized by more sophisticated student talk and opportunities for discussion. Experimental evidence on ability grouping of non-ELs has shown these positive peer effects for heterogeneous groups (Murphy, Greene, et al., 2017). We hypothesized that these differences would yield better outcomes for heterogeneous groups and again particularly better outcomes for ELs with more developed English language skills. Given these possibilities, we explored how instruction differs between homogeneous and heterogeneous groups using classroom observation data.
This Study
In this study, we aimed to experimentally test the effects of heterogeneous grouping (ELs with non-ELs) versus homogeneous grouping (all ELs) in the context of small-group language and literacy instruction. We randomly assigned individual students classified by their schools as ELs (n = 84) to either homogeneous groups (four ELs) or heterogeneous groups (two ELs and two non-ELs). 1 Note that we are defining homogeneous and heterogeneous here based on students’ EL classification rather than their home language, academic skills, or other characteristics. This is consistent with prior correlational literature on concentration of ELs (e.g., Estrada et al., 2020; Kieffer & Weaver, 2024; Vaughn et al., 2017) and with common practice in English-medium settings (as opposed to bilingual settings), where students are grouped by their EL classification rather than by their home language.
We evaluated the main effects of grouping on proximal language skills (i.e., vocabulary and morphology taught in CLAVES), an intermediate measure of core analytic language skills (Uccelli et al., 2015), and more distal measures of reading comprehension and argumentative writing. 2 In addition, we explored a priori, preregistered questions about how the effects of grouping differed by students’ pretest English language skills to determine for whom grouping matters more. Finally, to shed light on the possible mechanisms for the effects of grouping, we explored differences in instruction between groups using classroom observation data. The study was preregistered at the Registry of Efficacy and Effectiveness Studies (https://sreereg.icpsr.umich.edu/sreereg/) as #8020.1v2. Note that the data we collected cannot be made available due to constraints imposed by the participating district’s institutional review board.
Three research questions guided this study, with the first question being confirmatory and the latter two questions being exploratory:
1. What are the effects of heterogeneous versus homogeneous grouping in grades 4 and 5 for ELs on reading comprehension, argumentative writing, and language skills?
2. To what extent do English language skills moderate the effects of grouping on reading comprehension, argumentative writing, and language skills?
3. How does observed instruction differ between homogeneous and heterogeneous groups?
Method
Participating Schools
Six elementary schools in a large urban school district in the northeast of the United States participated in this research. Schools were selected on the basis of serving large numbers of ELs (ranging from 23 to 45%). Majorities of the school populations also were considered economically disadvantaged (ranging from 76 to 89%), defined as participating in one or more economic assistance programs. Four of the schools served majority Asian student populations (ranging from 53 to 81%), predominantly from Chinese American backgrounds, and smaller Latine student populations (ranging from 17 to 36%). The other two schools served majority Latine student populations (54% and 67%) with smaller Asian student populations (22% and 34%). As noted for the participating students below, the most common languages were Mandarin, Cantonese, and Spanish. Given differences among the schools in their populations, and likely other characteristics, we included school fixed effects in our models (see below). Although some schools offered bilingual education programs, none of the participating classrooms was in a bilingual education program. That said, it is worth noting that the district generally supported home-language use for ELs in English-medium instruction.
Participating Teachers
Thirteen teachers participated (ranging from one to three per school) who taught 28 groups with students drawn from 14 classes (see Figure 1). Ten teachers taught one homogeneous and one heterogeneous group each, consistent with the original design. One teacher, an English-as-a-second-language pull-out teacher, taught groups drawn from three different classrooms for a total of three homogeneous and three heterogeneous groups (one group of each type from each of the three classrooms). These students also received English language arts instruction from their classroom teacher. In addition, two teachers each taught one group (heterogeneous or homogeneous) at a time, with those two groups drawn from the same class; these teachers switched which group they taught halfway through the implementation to minimize the confound of teacher. The latter two deviations from the cleanest within-teacher design may introduce noise into the results, so we conducted robustness checks to ensure that these deviations did not change our results, which they did not (see below and Supplementary Tables S9–S11 in the online version of the journal). The teachers completed a survey on their background and implementation of the curriculum (see implementation results below); 12 of the 13 provided information about their background. In general, teachers were experienced and credentialed educators, with some having extensive preparation in teaching multilingual learners and most having substantial experience working with multilingual learners (Table 1).

Figure 1. Overview of design.
Table 1. Participating teacher characteristics (n = 12 respondents)
Note. Subcategories for languages other than English do not add up to the total, because some individual teachers spoke multiple languages other than English.
Participating Students
Within each of the 14 participating classrooms, six ELs and two non-ELs were selected. When there were more than six ELs and/or more than two non-ELs who consented to participate in a given classroom, we followed a multiple-step process to select and assign students (see Figures 2 and 3). For EL selection, within each classroom, we first excluded students as ineligible due to entering or emerging levels of English proficiency (based on state English proficiency test scores from the prior year), because the curriculum was not deemed appropriate for them (as represented by negative numbers in the third row of Figure 2). Second, we took teacher recommendations for which eligible students would benefit from the instruction. Although this step may reduce generalizability slightly, it was important to maintain the partnership with the teachers and ensure compliance with the design and implementation of the curriculum. When fewer than six were recommended, we continued to include nonrecommended students. Third, if the recommendations did not total six ELs, we randomly selected enough ELs to complete the final sample of six ELs to be randomized within each class. Aside from this blocking by classroom, ELs were not blocked on other characteristics or scores prior to randomization due in part to the small numbers involved. For non-EL selection, again within each classroom, we first took teacher recommendations for which consented students would benefit from the instruction. If more than two non-ELs were recommended, we randomly selected from among the recommended students (or if no students were recommended, we randomly selected from among the consented students) to yield the final sample of two non-ELs per classroom.

Figure 2. Selection of English learners (ELs) into the sample for randomization.

Figure 3. Selection of non-English learners (non-ELs) into the final sample.
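The three-step EL selection process described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure's actual implementation: the function name `select_els`, the student identifiers, and the seed are hypothetical, and the handling of classrooms with more than six recommended ELs (random sampling among them) is an assumption, because the text does not describe that case.

```python
import random

def select_els(consented, ineligible, recommended, target=6, seed=0):
    """Pick up to `target` ELs from one classroom, following the three steps in the text."""
    rng = random.Random(seed)
    # Step 1: exclude students at entering or emerging English proficiency levels.
    eligible = [s for s in consented if s not in ineligible]
    # Step 2: start from eligible students recommended by the teacher.
    chosen = [s for s in eligible if s in recommended]
    if len(chosen) > target:
        # Assumption: surplus recommendations are resolved by random sampling.
        chosen = rng.sample(chosen, target)
    # Step 3: randomly fill any remaining slots from the other eligible students.
    pool = [s for s in eligible if s not in chosen]
    need = min(target - len(chosen), len(pool))
    chosen += rng.sample(pool, need)
    return chosen

# Hypothetical classroom with nine consented ELs, one of them ineligible.
consented = [f"EL{i}" for i in range(1, 10)]
picked = select_els(
    consented,
    ineligible={"EL1"},
    recommended={"EL1", "EL2", "EL3", "EL4"},
)
```

In this example the ineligible student is dropped even though the teacher recommended them, the three remaining recommended students are kept, and three more are drawn at random to reach six.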
Following selection, student participants included 84 fourth and fifth graders classified as ELs (41 girls; see Table 2). On family surveys, the most common home language was Mandarin or Cantonese (61%), followed by Spanish (21%), with additional home languages of Korean (two participants), Arabic (one participant), and Punjabi (one participant), whereas 7% (6) did not report their home language. Four students reported speaking English as their home language despite being EL classified. These students continued to be treated as ELs in the school system, because their home-language survey at school entry indicated another language at home, and they had not yet been reclassified as English proficient, so we retained them in the sample. On average, the sample performed in the nationally average range at pretest on the standardized measures of reading comprehension, word recognition, vocabulary, and morphology (see below) but with considerable variability across students on each measure, as shown in overall sample mean and standard deviations in percentile ranks in the third column of Table 2.
Table 2. Demographics and pretest descriptive statistics, overall and by group (n = 84)
NAEP = National Assessment of Educational Progress.
Note. Means, differences, and t-statistics are estimated from a multiply-imputed dataset, whereas standard deviations are estimated from original datasets. English language skills composite = average of z scores, calculated with means and standard deviations from the ReadBasix norming group and combining vocabulary and morphology. T-statistics and p-values are derived from regression models with group type predicting each pretest variable, using clustered standard errors for nesting of students within small groups.
The heterogeneous groups also included 28 non-ELs. Given our research questions and the fact that these students participated in only one condition (heterogeneous groups), we did not analyze their outcome data. Rather, per sociological theories of language acquisition (Wong Fillmore, 1991), their presence can be considered part of the treatment for the ELs in the heterogeneous groups. On average, they had higher pretest English language and reading comprehension skills than the ELs (Table 3), as we would expect based on prior research and on the criteria used to identify ELs (e.g., August et al., 2005; U.S. Department of Education, 2024).
Table 3. Selected pretest descriptive statistics for non-English learners (non-ELs) and English learners (ELs)
Design
After selection, participants were individually randomly assigned within self-contained fourth and fifth grade classrooms to homogeneous groups (n groups = 14; n ELs = 56) or heterogeneous groups (n groups = 14; n ELs = 28), as shown in Figure 1. As noted earlier, 28 non-ELs were also assigned to the heterogeneous groups, but their outcome data are not the focus here. With the exception of the two groups noted earlier, each pair of heterogeneous and homogeneous groups was taught by the same teacher; this within-teacher design controls for teacher differences in implementation when estimating the effect of grouping type. Consistent with our definition of homogeneous as defined by EL classification, the homogeneous groups were composed of all students classified as ELs but not necessarily all speakers of the same language; five of the 14 homogeneous groups included speakers of more than one language. As indicated in the preregistration, power analyses indicated a minimum detectable effect size for the main effect of grouping of 0.33. 3
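To make the assignment arithmetic concrete, the following minimal sketch randomizes six ELs within each of 14 classrooms into one homogeneous group (four ELs) and one heterogeneous group (two ELs plus the two selected non-ELs). The function name `assign_classroom`, the student labels, and the seeds are hypothetical; the sketch illustrates simple within-classroom randomization under those assumptions and is not the study's actual randomization code.

```python
import random

def assign_classroom(els, non_els, seed=0):
    """Randomly split six ELs into a homogeneous and a heterogeneous group."""
    if len(els) != 6 or len(non_els) != 2:
        raise ValueError("expected six ELs and two non-ELs per classroom")
    rng = random.Random(seed)
    shuffled = list(els)
    rng.shuffle(shuffled)
    return {
        "homogeneous": shuffled[:4],                     # four ELs
        "heterogeneous": shuffled[4:] + list(non_els),   # two ELs + two non-ELs
    }

n_classrooms = 14
groups = [
    assign_classroom(
        [f"c{c}_EL{i}" for i in range(6)],
        [f"c{c}_N{i}" for i in range(2)],
        seed=c,
    )
    for c in range(n_classrooms)
]

# Totals implied by the design: 56 ELs in homogeneous groups, 28 in heterogeneous groups.
n_els_homog = sum(len(g["homogeneous"]) for g in groups)
n_els_heter = sum(
    sum(1 for s in g["heterogeneous"] if "_EL" in s) for g in groups
)
```

The totals reproduce the group sizes reported above (14 × 4 = 56 and 14 × 2 = 28, for 84 ELs overall).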
To control for curricular content, all students in both group types received CLAVES. Teachers were asked to implement CLAVES for 30 minutes a day three times a week for 12 weeks, or a total of 18 hours of instruction. On an exit survey at the completion of implementation, teachers reported some variability in their frequency of implementation (as is common in teacher-delivered curricula), but they reported roughly the same total dosage. Specifically, 11 of 13 teachers reported teaching CLAVES three times a week, whereas two teachers reported teaching CLAVES five times a week for fewer weeks. All teachers reported teaching CLAVES for a minimum of 25 minutes a day, with a median of 30–40 minutes and a range of 25–50 minutes. Importantly, they reported the same dosage for their heterogeneous and homogeneous groups; the median time per day and the pattern of days per week were reported to be the same for the two groups. Teachers also reported finishing CLAVES on the same lesson for both groups.
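The planned dosage can be checked with simple arithmetic. The short sketch below reproduces the 18-hour target and, purely as an illustration, computes how many weeks a five-sessions-per-week schedule would need to match it at 30 minutes per session. That second figure is a hypothetical calculation; the text does not report how many weeks those two teachers actually taught.

```python
def total_hours(minutes_per_session, sessions_per_week, weeks):
    """Total instructional time in hours."""
    return minutes_per_session * sessions_per_week * weeks / 60

# Planned schedule from the text: 30 minutes, three times a week, for 12 weeks.
planned = total_hours(30, 3, 12)  # 18.0 hours

# Hypothetical: weeks needed at five 30-minute sessions per week for the same total.
weeks_at_five = planned * 60 / (30 * 5)  # 7.2 weeks
```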
Teachers were told that their groups differed in composition, because we were interested in how they modified their instruction based on grouping and because teachers’ differential perceptions and responses were part of our theory for how grouping effects might emerge. We also shared with the teachers that we did not have a strong hypothesis about which grouping would be most beneficial and that there were good reasons for either grouping approach to outperform the other.
Curriculum
CLAVES was designed over a 2-year period in collaboration with multilingual students and their teachers, with the goal of creating a language-based literacy curriculum that started with the varied strengths and needs that accompany linguistically diverse classrooms and schools (see Proctor et al., 2021 for details). In this study, teachers were asked to implement two full curriculum units. Each unit was thematic and consisted of two six-lesson text cycles and one three-lesson writing cycle (see Supplementary Table S1 in the online version of the journal). Each six-lesson text cycle was anchored by a single text and an end-of-cycle discussion question, supplemented with a set of knowledge-building texts (video or print based) that served to provide additional perspectives on the theme in service of thoughtful discussion. Five of the 12 teachers who completed the exit survey reported completing the two units; seven ended at some point during the second unit; and none of these teachers completed the writing cycle for the second unit. They reported teaching a median of 31 lessons with some variability (SD = 5.20 lessons; minimum = 20; maximum = 35).
Supplementary Table S1 in the online version of the journal shows the breakdown of the text and writing cycles for a given unit. For every six-lesson text cycle, lessons 1 and 2 (lesson type A) consisted of engagement with four to five key vocabulary words derived from the anchor text and guided reading of the anchor text itself. Lessons 3–5 (lesson type B) were focused on reading, listening, watching, and discussing the supplemental texts and engaging with targeted dimensions of morphology and/or syntax in relation to the supplemental texts. Finally, lesson 6 (lesson type C) was devoted entirely to discussing the guiding question for the cycle. The three-lesson writing cycle served as a summative activity to write about the central questions discussed at the end of each text cycle. Lesson 1 of the writing cycle discussed the “text type” that students were being asked to produce (i.e., an argumentative essay or an op-ed in units 1 and 2, respectively), whereas lessons 2 and 3 were dedicated to drafting and finalizing the writing products.
Teachers were provided with all lesson plans, instructional materials, and student workbooks. The lesson plans provided teachers with soft scripting, specifically through the provision of activity directions and recommended guiding questions relative to the texts and videos that were part of the instructional cycles. See the Supplemental Material in the online version of the journal for more details on the curriculum content.
Measures
Language and Literacy Skills
English language and literacy skills were measured with instruments targeting parallel constructs at pretest and posttest using a variety of assessments. Given the randomized design, it was not essential to include the same measures at pretest and posttest, because including the pretest scores served to improve precision rather than control for confounders (e.g., Murnane & Willett, 2010). Thus, we prioritized standardized measures at pretest that also would allow us to describe the sample relative to national norms and prioritized researcher-developed measures at posttest that were more aligned (to varying degrees) with CLAVES. At both pretest and posttest, measures were administered in a whole-group setting, combining students from the homogeneous and heterogeneous groups.
Reading Comprehension (Pretest and Posttest)
English reading comprehension was measured using the ReadBasix assessment program (previously known as the Reading Inventory and Scholastic Evaluation or RISE) developed by researchers at the Educational Testing Service (Sabatini et al., 2019). ReadBasix is a web-administered reading skills componential battery designed with vertical scales for children in grades 3–12. The ReadBasix reading comprehension subtest presents an informational text with literal and inferential multiple-choice questions. There are four passages with seven questions for each passage, and students have 7 minutes and 30 seconds for each passage. The technical manual for ReadBasix reports an acceptable (IRT) marginal reliability of .753 for grade 4 and .674 for grade 5 of the norming sample (Sabatini et al., 2019). We considered this to be a distal outcome of the curriculum.
Vocabulary and Morphology (Pretest Only)
The English vocabulary subtest of ReadBasix presents students with a target word and asks them to choose the most closely related word from three options. The task has 30 items and must be completed in 6 minutes. The technical manual for ReadBasix reports an IRT marginal reliability of .832 to .867 for grades 4 and 5 of the norming sample (Sabatini et al., 2019). The English morphology subtest of ReadBasix presents students with an incomplete sentence, which students must complete by selecting the correct morphological form of a word from three options. The task has 30 items and must be completed in 7 minutes. Students are presented with practice questions and examples before completing test items. The technical manual for ReadBasix reports IRT marginal reliability of .868 to .871 for grades 4 and 5 of the norming sample (Sabatini et al., 2019). These two subtests correlated highly with one another at pretest (r = .85) and were combined to form a single z score composite for English language skills. This composite was used as a pretest covariate and moderator.
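As an illustration of how a composite of this kind can be formed, the sketch below converts each subtest score to a z score and averages the two. It is not the study's scoring code: the norm means and standard deviations are invented placeholders, and the choice to average the two z scores (rather than, say, re-standardizing their sum) is an assumption.

```python
# Sketch only: a two-subtest z-score composite.
# HYPOTHETICAL_NORMS holds invented (mean, SD) pairs, not ReadBasix norms.
HYPOTHETICAL_NORMS = {
    "vocabulary": (250.0, 20.0),
    "morphology": (245.0, 18.0),
}


def z_score(raw, norm_mean, norm_sd):
    """Express a raw score as standard deviations from the norm mean."""
    return (raw - norm_mean) / norm_sd


def language_composite(vocab_raw, morph_raw, norms=HYPOTHETICAL_NORMS):
    """Average the vocabulary and morphology z scores (an assumed rule)."""
    z_vocab = z_score(vocab_raw, *norms["vocabulary"])
    z_morph = z_score(morph_raw, *norms["morphology"])
    return (z_vocab + z_morph) / 2.0


# Example: a hypothetical student one SD below the norm mean on both subtests.
print(language_composite(230.0, 227.0))
```

Under this rule, a student scoring at the norm mean on both subtests receives a composite of 0.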
Word Recognition/Decoding (Pretest Only)
The English word recognition/decoding subtest of ReadBasix asks students to determine if a target word is a real word, a fake word, or a fake word that sounds like a real word, with equal weighting for each question type. There are 30 items, which must be completed in 4 minutes and 50 seconds. The technical manual for ReadBasix reports an IRT marginal reliability of .896 to .917 for grades 4 and 5 of the norming sample (Sabatini et al., 2019). This measure was used only to describe the sample.
CLAVES Language Skills (Posttest Only)
English language skills taught in the curriculum were measured with three proximal tasks targeting taught words and word parts. First, a researcher-developed semantic associations task (Lesaux et al., 2010; Schoonen & Verhallen, 1998, cited in Carlo et al., 2004) was used to measure the intervention’s effect on students’ learning of words targeted in the program. The task consists of 15 words selected from the four lesson cycles included in the program. The goal for the student is to identify the three most proximally related words to the target vocabulary word from six surrounding words; each item was scored 0 to 3 points based on the number of correct related words identified. Reliability was adequate (α = .75). Two additional tasks were used to measure morphologic knowledge of taught suffixes and prefixes. The suffix task is a nonword morphologic decomposition task drawn from Kieffer and Lesaux (2012) based on previous research (e.g., Tyler & Nagy, 1989). In this 30-item multiple-choice task, testers complete a sentence (e.g., “The man is a great ____.”) by choosing a nonsense word with the appropriate derivational suffix (e.g., “tranter”) from among four choices. All derivational suffixes assessed were taught in the curriculum. Sample reliability was adequate (α = .79). The second paradigm targeting prefixes is the Rehit task from Apel et al. (2013) and Apel and Diehm (2014). This 16-item multiple-choice task requires testers to pick the correct definition for a novel word (e.g., “rehit”) from four answer options (e.g., “to hit something again”) based on their knowledge of the prefix. The first eight items feature a real root word (e.g., “-hit”), whereas the final eight items feature a nonsense root word (e.g., “-faw”). Again, the prefixes were selected from those taught in the curriculum. Sample reliability was good (α = .87).
Given the conceptual overlap in these three tasks, the high correlations among them (r = .58, r = .58, and r = .67), and the benefits of reducing the number of outcomes and related tests, we combined these three scores into a single factor score using confirmatory factor analysis (CFA). CFA is valuable for creating factor scores that represent a single construct tapped by multiple observed indicators (e.g., Brown, 2015). CFA models with a unidimensional CLAVES language skills factor had excellent fit (root mean square error of approximation = .00; comparative fit index = 1.00; Tucker–Lewis index = 1.02; standardized root mean square residual = .02).
Core Analytic Language Skills (Posttest Only)
To measure broader English linguistic knowledge and awareness beyond the discrete skills specifically targeted by CLAVES, we used the Core Analytic Language Skills Instrument (CALS-I; previously known as Core Academic Language Skills Instrument; Uccelli, 2023; Uccelli et al., 2015). The CALS-I is a 45-minute criterion-referenced, research-based measure designed to evaluate a specific set of students’ language skills for grades 4–8. This group-administered assessment includes eight tasks: (a) connecting ideas, (b) tracking themes, (c) organizing texts, (d) breaking words, (e) comprehending sentences, (f) identifying definitions, (g) epistemic stance, and (h) metalanguage. These tasks involve multiple-choice, sorting, matching, and short written responses. Reliability evidence is strong (Cronbach’s α = .93 from Uccelli et al., 2015; see also Barr et al., 2019). We considered this an intermediate outcome of the curriculum; although it was conceptually aligned with the focus of CLAVES, it was designed by external researchers and did not target the specific language features taught in the curriculum.
Argumentative Writing (Pretest and Posttest)
Argumentative writing in English was assessed by providing students with a brief dilemma, accompanied by a prompt for students to create an argument that reflected their opinion. In both pre- and posttests, students were given 15 minutes to draft, revise, and finalize their written product. Writing samples were scored using the holistic rubric of the Persuasive Writing Framework in the 2017 National Assessment of Educational Progress (National Assessment Governing Board, 2017). See the Supplemental Materials in the online version of the journal for more details on prompts, rubric, and reliability.
Observational Protocol
To shed light on the effects of possible mechanisms on language and literacy outcomes, we observed each group on three occasions during implementation, sampling across lesson types (i.e., with one lesson each from lesson type A, B, and C, described earlier), and observing the same lesson taught by a given teacher to their two groups. We developed an observation protocol to evaluate interaction quality practices during implementation of the curriculum. This protocol draws from an existing observational scale for EL instruction (Goldenberg et al., 2013) and a classroom observation framework (Murphy, Firetto, et al., 2017; Murphy et al., 2018) with the aim of capturing teacher and student moves that are rooted in the principles of the curriculum as well as past research on EL instruction. Specifically, the observation protocol was designed to be sensitive to hypothesized differences in teacher and student talk between the homogeneous and heterogeneous groups. Eight research assistants were trained in the protocol and had to achieve 80% interrater reliability with video examples from a prior study using the curriculum before conducting live observations. Interrater reliability was also >80% on a midterm check with videos. COVID-related institutional review board restrictions from our district partner prevented us from double-coding observations. Confirmatory factor analyses supported the use of six domains: (a) strategies to adapt instruction for ELs, (b) opportunities for interaction and language production in English, (c) primary language support, (d) high-level thinking questions, (e) connection questions, and (f) student responses. The Supplemental Materials in the online version of the journal provide more information on the development and validity of the observation protocol.
Data Analytic Approach
To address the first research question—concerning the main effects of grouping—we fitted multiple regression models with a dummy variable for heterogeneous versus homogeneous grouping as the question predictor, pretest covariates (e.g., English language skills composite and reading comprehension or argumentative writing depending on outcome), a fixed effect of grade, and fixed effects of schools. Given random assignment, pretest covariates are not necessary to control for preexisting differences but nonetheless improve precision in the estimate of the effects of grouping (see e.g., Murnane & Willett, 2010). We used clustered standard errors to account for nesting of students within small groups, an approach recommended as an alternative to multilevel modeling when the random effects are not of interest (McNeish et al., 2017). To address the second research question—concerning moderation by English language skills—we incorporated an interaction between pretest English language skills and grouping type. This interaction captures the extent to which the effect of grouping depends on students’ pretest English language skills. To make the results of the first and second research questions more interpretable, we converted standardized regression coefficients into years of growth (see below). To address the third research question—concerning differences in observed and teacher-reported instructional characteristics—we conducted straightforward descriptive analyses interpreting observed and teacher-reported differences in instructional characteristics between homogeneous and heterogeneous groups using Hedges’ g effect sizes. Hedges’ g is a type of standardized effect size that provides a basis for interpreting the magnitude of differences while adjusting for small-sample-size bias. In the case of the observation data, we used Cohen’s (1992) guidelines, where 0.2 is small, 0.5 is moderate, and 0.8 is large, focusing on differences >0.2. 
Although we generally prefer using more empirical benchmarks for interpreting effect sizes (such as the years-of-growth metric we used earlier), these are not available for differences in observational data. Given the modest number of groups (n = 28 groups), we favored a descriptive approach for analyzing the observation data over more sophisticated mediation analyses.
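The regression specification described above can be illustrated with a minimal sketch. The code below is not the analysis code used in this study; it is a pure-Python illustration, on invented data, of ordinary least squares with a grouping dummy, a pretest covariate, their interaction, and cluster-robust (CR0) standard errors clustered on small group. The function names, the data, and the omission of the grade and school fixed effects are assumptions made for brevity, and statistical packages typically also apply finite-sample corrections that this sketch leaves out.

```python
# Sketch only: OLS with cluster-robust (CR0) standard errors, no dependencies.
# All data and names are hypothetical; grade and school fixed effects omitted.

def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_cluster(X, y, clusters):
    """Return (coefficients, cluster-robust standard errors)."""
    Xt = transpose(X)
    bread = inverse(matmul(Xt, X))
    beta = [row[0] for row in matmul(bread, matmul(Xt, [[v] for v in y]))]
    resid = [yi - sum(b * x for b, x in zip(beta, xi)) for xi, yi in zip(X, y)]
    k = len(beta)
    scores = {}  # per-cluster sums of x_i * residual_i
    for xi, ui, g in zip(X, resid, clusters):
        s = scores.setdefault(g, [0.0] * k)
        for j in range(k):
            s[j] += xi[j] * ui
    meat = [[sum(s[a] * s[b] for s in scores.values()) for b in range(k)]
            for a in range(k)]
    cov = matmul(matmul(bread, meat), bread)
    return beta, [max(cov[j][j], 0.0) ** 0.5 for j in range(k)]


def design_row(hetero, z_pre):
    # intercept, grouping dummy, pretest composite, grouping x pretest
    return [1.0, float(hetero), z_pre, hetero * z_pre]


# Hypothetical students: (heterogeneous dummy, pretest z, outcome, group id)
students = [
    (0, -1.2, 0.1, "g1"), (0, -0.3, 0.0, "g1"), (0, 0.8, -0.2, "g2"),
    (0, 1.1, -0.1, "g2"), (1, -1.0, -0.6, "g3"), (1, -0.2, 0.1, "g3"),
    (1, 0.7, 0.5, "g4"), (1, 1.3, 0.9, "g4"),
]
X = [design_row(h, z) for h, z, _, _ in students]
y = [out for _, _, out, _ in students]
groups = [g for _, _, _, g in students]
beta, se = ols_cluster(X, y, groups)
print([round(b, 3) for b in beta], [round(s, 3) for s in se])
```

In this layout the second coefficient is the effect of heterogeneous grouping for a student at the pretest mean, and the fourth is the change in that effect per one-unit difference in the pretest composite.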
There were modest amounts of missing data for the student pretest and outcome data as well as the observation data. For the student data, four students attrited after randomization (5% overall attrition), with two students each from the homogeneous and heterogeneous groups (0% differential attrition), which can be characterized as low attrition (What Works Clearinghouse, 2022). Note that per the What Works Clearinghouse (2022), this low attrition alongside our randomized controlled design means that controlling for pretest covariates is not strictly required, but we nonetheless included them to improve the precision of our estimates of grouping effects. Pretest covariates were missing for three cases (4% of the sample). These missing pretest and posttest data were multiply imputed using all the pretest and posttest variables, gender, and grade. Multiple imputation leverages the information from other cases and variables to generate plausible values for the missing data while accounting for the uncertainty about the missing data by generating multiple datasets (20 in our case), yielding less biased estimates than casewise deletion or single imputation (Enders, 2022). Nonetheless, we conducted a robustness check to confirm that results were similar without using multiple imputation (see below). For the observation data, four of 120 observations (3% overall) were missing; these were for two teachers teaching each of their two groups receiving lesson type C (i.e., 0% differential missing data). These data also were multiply imputed using all the available observation data, so their imputed values are based, in part, on relevant information from these two teachers’ four other observations (two groups each for lesson types A and B) as well as the other teachers’ observation data for lesson type C.
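For readers unfamiliar with how results from multiply imputed datasets are combined, the sketch below shows the standard pooling step known as Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and its variance adds the average within-dataset variance to an inflated between-dataset variance. The numbers are invented, only five datasets are shown rather than 20, and nothing here is taken from the software actually used for the analyses.

```python
# Sketch only: pooling one coefficient across imputed datasets (Rubin's rules).
# The estimates and standard errors below are hypothetical.
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Return (pooled estimate, pooled standard error)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across datasets
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total ** 0.5


estimates = [0.24, 0.27, 0.22, 0.26, 0.25]    # hypothetical per-dataset effects
std_errors = [0.15, 0.16, 0.15, 0.14, 0.15]   # hypothetical per-dataset SEs
print(pool_rubin(estimates, std_errors))
```

The between-dataset term is what carries the uncertainty about the missing values into the final standard error.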
Results
Pretest Differences
As shown in Table 2, there were no statistically significant differences between the heterogeneous and homogeneous groups at pretest on any variables, including gender (p = .609), home-language background (all p values > .10), reading comprehension (p = .274), word reading (p = .694), or English language skills (p = .600) according to regression models with clustered standard errors accounting for nesting within small groups.
Main Effects of Grouping
As shown in Table 4, there were no statistically significant main effects of grouping on any of the four outcomes (all p values > .10) in models that accounted for nesting in small groups with clustered standard errors and controlled for pretest reading comprehension, pretest English language skills (composite of ReadBasix vocabulary and morphology), grade, and fixed effects of schools. This indicates that the effects of grouping for the average EL were not statistically different from zero. The effect sizes for heterogeneous grouping were trivial in magnitude for CLAVES language skills (Hedges’ g = −0.004), CALS-I (g = −0.01), and argumentative writing (g = −0.04). The main effect size for reading comprehension was statistically nonsignificant (p = .113) but meaningfully sized (g = 0.25), favoring the heterogeneous group. Effect sizes were similar, and significance results were identical in models without covariates and with various combinations of covariates (results available from authors).
Table 4. Models for main effects of heterogeneous versus homogeneous grouping (n = 84)
CLAVES = Cultivating Linguistic Awareness for Voice and Equity in Schools; CALS-I = Core Analytic Language Skills Instrument.
Note. Models included pretest English language skills composite, fixed effects of grade, and fixed effects of schools as covariates. Argument writing model also included pretest argumentative writing as a covariate, whereas the other three models also included pretest reading comprehension as a covariate.
Moderation Effects
Three of the four outcomes demonstrated statistically significant interactions between grouping and pretest English language skills: reading comprehension (p = .038), CLAVES language skills (p = .015), and CALS-I (p = .023). The interaction was not significant for argumentative writing (p = .405). This indicates that the effects of grouping on three of the outcomes depended on students’ prior English language skills. As shown in Figures 4–6, these were all crossover interactions such that students with lower English language skills benefited more from homogeneous grouping, whereas students with higher English language skills benefited more from heterogeneous grouping. The effect sizes in the bottom row of Table 5 are standardized such that they represent the difference in the effect of grouping associated with a one standard deviation difference in pretest English language skills. These effect sizes can be considered medium in magnitude (Cohen, 1992).

Figure 4. Effects of homogeneous versus heterogeneous grouping on reading comprehension as a function of pretest English language skills.

Figure 5. Effects of homogeneous versus heterogeneous grouping on CLAVES language skills as a function of pretest English language skills.

Figure 6. Effects of homogeneous versus heterogeneous grouping on core analytic language skills as a function of pretest English language skills.
Table 5. Models for heterogeneous versus homogeneous grouping interaction with pretest language skills (n = 84)
CLAVES = Cultivating Linguistic Awareness for Voice and Equity in Schools; CALS-I = Core Analytic Language Skills Instrument.
Note. Models included pretest language skills composite, fixed effects of grade, and fixed effects of schools as covariates. Argument writing model also included pretest argumentative writing as a covariate, while the other three models also included pretest reading comprehension as a covariate.
To further interpret the magnitude of these interactions, we fitted the effect of heterogeneous versus homogeneous grouping at three prototypical points in the distribution of pretest English language skills: below average (z score of −1.0, or one standard deviation below the mean on the ReadBasix norms), average (z score of 0.0, or at the mean on the ReadBasix norms), and above average (z score of 1.0, or one standard deviation above the mean on the ReadBasix norms). As shown in Table 6, the fitted effect sizes range from moderately negative (favoring homogeneous groups) for students with below-average pretest English language skills to moderately positive for students with above-average pretest English language skills. To provide more context for interpreting the magnitudes of these effects, we converted them to grade equivalents. For two outcomes, we divided by the normed one-grade difference between grades 4 and 5 for reading comprehension (using Sabatini et al., 2019) and core analytic language skills (using Barr et al., 2019). For CLAVES language skills, norms are not available, so we created grade-equivalent benchmarks using the sample difference at posttest between students in grades 4 and 5 to represent one grade-level difference; because this is a less generalizable estimate, these grade equivalents should be interpreted with more caution. As shown in Table 6, heterogeneous grouping had negative effects for below-average students (−1.0 to −1.55 grade equivalents) and positive effects for above-average students (0.75–1.93 grade equivalents) that can be considered practically meaningful in terms of grade equivalents.
Table 6. Fitted effect sizes for effects of heterogeneous versus homogeneous grouping at selected values of pretest language skills, expressed in Hedges’ g and grade equivalent (GE; n = 84)
CLAVES = Cultivating Linguistic Awareness for Voice and Equity in Schools; CALS-I = Core Analytic Language Skills Instrument.
Note. Below average = z score of −1, or one standard deviation below the norming mean on pretest English language skills composite; average = z score of 0, or at the norming mean; above average = z score of 1, or one standard deviation above the norming mean.
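The arithmetic behind fitted effects and grade equivalents of this kind can be sketched as follows. In a model with a grouping-by-pretest interaction, the effect of grouping at a given pretest z score is the grouping coefficient plus the interaction coefficient multiplied by z; dividing that effect by a one-grade difference expressed in the same metric gives a grade equivalent. The coefficients and the one-grade gap below are invented placeholders, not estimates from this study, and the function names are hypothetical.

```python
# Sketch only: fitted grouping effects at prototypical pretest values and
# their conversion to grade equivalents. All numbers are hypothetical.

def fitted_effect(b_group, b_interaction, z):
    """Effect of heterogeneous vs. homogeneous grouping at pretest z score z."""
    return b_group + b_interaction * z


def to_grade_equivalents(effect, one_grade_gap):
    """Divide an effect by the same-metric gap between adjacent grades."""
    return effect / one_grade_gap


b_group = 0.0          # hypothetical effect at the pretest mean (z = 0)
b_interaction = 0.40   # hypothetical change in effect per 1 SD of pretest
one_grade_gap = 0.32   # hypothetical grade 4-to-5 difference, in SD units

for label, z in [("below average", -1.0), ("average", 0.0), ("above average", 1.0)]:
    g = fitted_effect(b_group, b_interaction, z)
    ge = to_grade_equivalents(g, one_grade_gap)
    print(f"{label}: g = {g:+.2f}, GE = {ge:+.2f}")
```

Because the conversion is a simple division, the grade equivalents are only as trustworthy as the one-grade gap used as the denominator, which is why sample-based gaps warrant more caution than normed ones.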
Given the variation in home language in the sample, we conducted additional exploratory moderation analyses by this student characteristic. Home language did not moderate the effect of grouping for any of the four outcomes. Specifically, interactions between grouping and home-language background (i.e., comparing Mandarin- and Cantonese-English bilinguals with all other bilinguals) were not statistically significant for reading comprehension (p = .271), CLAVES language skills (p = .574), language awareness (p = .379), or argumentative writing (p = .940).
Robustness Checks
Our findings are robust to a number of additional specifications. Our final design deviated in two ways from our original design: One pull-out teacher taught six groups (three heterogeneous and three homogeneous), and two teachers each taught a single group (heterogeneous or homogeneous) drawn from the same class. Robustness checks indicated that the primary results were the same excluding these deviating cases. Specifically, the effect sizes for the interaction terms were all within 0.05 standard deviation of the primary results when estimated without these cases. Results also were similar when using only cases with complete data rather than multiple imputation. See Supplementary Tables S9–S11 in the online version of the journal for details. In addition, we checked whether there was variation in the non-ELs’ pretest language and reading skills that could have explained variation in outcomes; although there were wide individual differences in non-ELs’ skills (see Table 3), there were no significant differences in average skills of the non-ELs across the 14 heterogeneous groups (see Supplementary Table S12 in the online version of the journal).
Differences in Observed Instruction
To shed light on possible mechanisms for these effects, we observed each group on three occasions during implementation, sampling across lesson types and observing the same lesson taught by a given teacher to their two groups. Groups were rated on several domains drawn from prior research (see above). Preliminary analyses indicated that the observation ratings did not correlate strongly across the lesson types (r values ranged from −0.27 to 0.56, with a majority <0.30), making it inappropriate to combine ratings across lesson types. Thus, we analyzed each domain for each lesson type separately.
Table 7 provides Hedges’ g effect sizes for the differences between homogeneous and heterogeneous groups by domain and lesson type, with positive values indicating higher ratings for homogeneous groups. As shown in the table, several of the domains were observed to be similar, as indicated by ten comparisons with Hedges’ g effect sizes with absolute values <0.20 (i.e., below Cohen’s [1992] rule of thumb for small effects). However, eight comparisons yielded arguably meaningful effect sizes (i.e., with absolute values >0.20).
Table 7. Hedges’ g effect-size differences in observation data between homogeneous and heterogeneous groups for each lesson type (positive favors homogeneous; n = 28 groups)
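For reference, a Hedges' g of the kind reported in Table 7 can be computed as a pooled-standard-deviation mean difference multiplied by a small-sample correction factor. The sketch below uses the common approximation 1 − 3/(4df − 1) for that factor; the ratings are invented, and whether the study's values were computed from raw ratings or factor scores in exactly this way is an assumption.

```python
# Sketch only: Hedges' g for two independent sets of group ratings.
# The ratings are hypothetical, not observation data from the study.
from statistics import mean, variance


def hedges_g(first, second):
    """Standardized mean difference (first minus second), bias-corrected."""
    n1, n2 = len(first), len(second)
    df = n1 + n2 - 2
    pooled_sd = (((n1 - 1) * variance(first) + (n2 - 1) * variance(second)) / df) ** 0.5
    d = (mean(first) - mean(second)) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)  # approximate small-sample factor
    return d * correction


homogeneous = [3.1, 2.8, 3.5, 3.0, 2.6, 3.3]     # hypothetical ratings
heterogeneous = [2.7, 2.9, 3.0, 2.5, 2.4, 3.1]   # hypothetical ratings
print(round(hedges_g(homogeneous, heterogeneous), 2))  # positive favors homogeneous
```

With few groups per condition the correction factor is noticeably below 1, so the reported g is somewhat smaller than the uncorrected standardized difference.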
These eight comparisons indicated differences in instructional approaches between the groups. First, teachers’ instruction in homogeneous groups demonstrated higher ratings on strategies to adapt instruction for ELs for lesson type A (text reading; Hedges’ g = 0.65) and for lesson type C (small-group discussions; g = 0.22) compared with the heterogeneous groups (see Table 7). Teachers’ instruction in the homogeneous groups also demonstrated higher factor scores on opportunities for interaction and language production in English in lesson type C (Hedges’ g = 0.38) than in the heterogeneous groups (see Table 7). These findings are consistent with our hypothesis that homogeneous grouping may facilitate more targeted language instruction and use of specialized language supports for ELs compared with heterogeneous grouping.
Second, by contrast, the heterogeneous groups demonstrated higher ratings on two types of authentic questions: high-level thinking questions in lesson type B (extended language work; Hedges’ g = −0.30) and lesson type C (Hedges’ g = −0.41) and connection questions in lesson type B (Hedges’ g = −0.39). These results were consistent with our hypothesis that heterogeneous grouping may promote higher teacher expectations, as manifested by more demanding questions and more extended discourse.
Third, in a more surprising result, the factor scores indicated that more home-language support was used in the heterogeneous groups versus in the homogeneous groups in lesson type B (Hedges’ g = −0.28) and lesson type C (Hedges’ g = −0.44). However, when one examines the observed means, it is clear that home-language support was very rare in both grouping types, occurring only 2–4% of the time for homogeneous groups and 7–9% of the time for heterogeneous groups. Given this very low occurrence rate, we refrain from interpreting the factor-score difference.
Finally, contrary to our hypothesis that heterogeneous groups would be characterized by richer student discourse (e.g., discourse characterized by students describing how they arrived at a conclusion, providing evidence for claims, or taking multiple turns at talking without teacher interruption; Murphy et al., 2018), there were relatively trivial differences in student responses between groups across lesson types (Hedges’ g = −0.18 to 0.05).
Discussion
In this study, we aimed to explore the effects of heterogeneous grouping (ELs with non-ELs) versus homogeneous grouping (all ELs) in the context of small-group language and literacy instruction. Using a preregistered randomized, controlled design, we estimated the effects of grouping type on proximal outcomes (CLAVES language skills), intermediate outcomes (core analytic language skills), and distal outcomes (standardized reading comprehension and argumentative writing). There was no evidence of main effects of grouping on any outcome, indicating that grouping type did not affect outcomes for the average EL student.
Despite the lack of evidence of grouping effects for the average student, we found consistent evidence for interactions between initial English language skills (a composite of vocabulary and morphology scores) and grouping type. These interactions were significant for reading comprehension, CLAVES language skills, and CALS-I but not for argumentative writing. Students with lower initial English skills benefited substantially more from homogeneous groups (Hedges’ g = 0.31–0.58 for students one standard deviation below the national mean), whereas students with higher initial English skills benefited substantially more from heterogeneous groups (Hedges’ g = 0.27–0.59 for students one standard deviation above the national mean). When converted to grade equivalents, the effects can be considered practically meaningful, with a 1- to 1.5-grade-equivalent benefit from homogeneous grouping for the students with lower initial English skills and a 0.75- to 1.9-grade-equivalent benefit from heterogeneous grouping for the students with higher initial English skills.
In exploratory analyses, our observation data indicated that when teaching the homogeneous groups, teachers used more specialized strategies to adapt instruction for ELs (e.g., explaining vocabulary beyond that taught in CLAVES, modeling use of grammatical forms through recasts; Goldenberg et al., 2013) and provided more opportunities for interaction and language production in English than when teaching the heterogeneous groups. By contrast, when teaching the heterogeneous groups, teachers asked more authentic questions, including higher-order and connection questions (e.g., questions that required students to break down ideas or concepts or questions that elicited connections to other literary or nonliterary works; Murphy et al., 2018). Both of these findings aligned with our hypotheses about the differential affordances of different types of grouping. Contrary to our hypothesis about student discourse being promoted by heterogeneous grouping, we did not find evidence that student responses were different across the groups. Although we found some differences in home-language support, we refrain from interpreting these results, given the very low prevalence of such support across both group types (<10% irrespective of grouping). In addition, five of the 14 homogeneous groups had more than one home language represented, which may have influenced teachers’ decisions to provide home-language support. Overall, our within-teacher design provided good confidence that these differences in teacher practices actually were caused by the differences in grouping. Each of these findings has important implications for research, policy, and classroom practice.
No Average Effects for Grouping of ELs
None of the average effects of grouping were statistically significant. Three of the outcomes demonstrated trivially sized main effects (Hedges’ g = −0.01 to −0.04), whereas the fourth, reading comprehension, had a potentially meaningful main effect (Hedges’ g = 0.25)—favoring heterogeneous groups—that was not statistically significant (p = .113), likely due to our modest sample size. The finding for reading comprehension suggests that we can be relatively confident that homogeneous groups did not have an advantage for this outcome but less confident that heterogeneous groups did not offer benefits.
Our null findings for the average effects of grouping (at least for language and writing outcomes) are somewhat surprising based on prior negative associations on the related (but distinct) question of cross-class tracking of ELs (Black et al., in press; Estrada et al., 2020; Vaughn et al., 2017; but see also Kieffer & Weaver, 2024). Of course, these studies differ from this study in their designs and populations. However, this divergence also may suggest that within-class grouping of ELs may function differently than cross-class tracking, with the broader sociological implications of creating segregated classrooms and tracking systems being different from those of small-group organization.
The lack of grouping effects is also surprising in light of language socialization theories. According to these theories (e.g., Kramsch, 2002; van Lier, 2002; Wong Fillmore, 1991), non-EL peers are valuable resources in language learning ecologies. One implication of our findings is that EL peers in the homogeneous groups are also valuable resources in these ecologies (e.g., Carhill-Poza, 2015). Elevating non-ELs to esteemed language model status may be reinforcing an artificial binary about who has and who lacks valuable input to offer to language development. Contrary to this binary, EL peers may be better situated to calibrate their interactions and broker for other ELs. The particular context of the CLAVES curriculum may also be relevant here, due to its discussion-based and dialogic approaches, which start from the premise that instruction should be designed to center the strengths and needs of bilingual students. As such, an undergirding notion of the instructional design also was that ELs should be able to benefit from interactions with one another regardless of the English proficiencies of their peers. Future qualitative research would be valuable to unpack these implications.
More practically, the lack of positive benefits for homogeneous grouping (even in the absence of negative effects) calls into question this widespread practice. Kieffer and Weaver (2024) argued that their similar null associations for EL classroom composition on elementary reading growth problematize the common approach of grouping ELs together. Many teachers endorse homogeneous grouping (see e.g., Chorzempa & Graham, 2006; Kieffer et al., 2023), and data indicate that the practice is widespread (Kieffer & Weaver, 2024). The logic of this prevalent practice is that grouping ELs together by their EL status will facilitate more targeted instruction that will produce better outcomes. Our findings do not support this logic, at least for the average EL. Even if homogeneous grouping is not academically harmful, it may have negative social effects, such as limiting ELs’ access to cultural and social capital offered by their non-EL peers (see e.g., Estrada et al., 2020; Gándara & Orfield, 2012).
A Nuanced Perspective on Grouping of ELs
Our moderation results indicate that the effects of heterogeneous and homogeneous grouping depend on ELs’ initial English language skills. There are multiple explanations for these results, some of which are illuminated by our observational data, and each of which raises questions for future research.
One possibility is that teachers hold and enact higher expectations for ELs in heterogeneous groups and that ELs with higher English language skills are most susceptible to the effects of these expectations. Although we cannot directly observe teachers’ expectations (nor did we interview teachers about their expectations), our observation data provide preliminary support for this claim of higher expectations by demonstrating that teachers, when teaching heterogeneous groups, asked more authentic questions (i.e., higher-order thinking and connection questions; Murphy et al., 2018) compared with when they were teaching their homogeneous groups. If ELs with higher English skills were better prepared to benefit from such questions by virtue of stronger English skills, this may explain their greater success in heterogeneous groups. This is not to say that the teachers who asked fewer authentic questions necessarily held low expectations for what their homogeneously grouped students ultimately could achieve; perhaps they simply had different perspectives about how to get them there. Nonetheless, from the data we have, teachers were observed enacting different demands on their students based on their grouping.
A complementary explanation is that teachers provide more targeted and scaffolded instruction in homogeneous groups in ways that particularly benefit ELs with more emerging English skills. Our observation data support this explanation by demonstrating that teachers, when teaching homogeneous groups, used more strategies and language supports specifically designed for ELs as well as providing more opportunities for interaction and language production in English. Educators cite this opportunity to provide more targeted language instruction and support as an advantage of homogeneous grouping (e.g., Chorzempa & Graham, 2006; Estrada, 2014; Kieffer et al., 2023).
An explanation with less support in our observation data concerns peer effects. ELs with higher levels of English language skills may benefit from non-ELs serving as language models in heterogeneous groups in ways that ELs with more emerging levels of English do not. Interactions with non-ELs may be more in the zone of proximal development, as it is constructed in social interaction (Vygotsky, 1978), for ELs with higher English language skills than for ELs with lower English language skills. This interpretation is consistent with our outcomes analyses but not our observation data. If this were the case, we would have expected to see students in the heterogeneous groups demonstrating higher levels of student responses, but we found no evidence for this. It is possible that positive peer effects of ELs and those of non-ELs cancel each other out. Having more EL peers may have made students more comfortable participating more frequently and in more diverse ways, bringing the homogeneous groups up to a level more similar to the heterogeneous groups. As noted earlier, the affordances of EL peers may be more valuable in discussion settings than we assume, and these affordances may be especially valuable for ELs at earlier stages of English language development. Although we cannot discount the possibility of non-EL peer effects completely, our observation data tend to support the conclusion that teaching differences are instead driving the effects we found.
Limitations
This study has several limitations. It was conducted amid the COVID-19 pandemic, which constrained recruitment and led to a more modest sample size than we intended. Given this, we lacked power to detect small effects and so must interpret our null effects with caution. Although three of the main effects were trivial in magnitude, the main effect of grouping on reading comprehension (favoring the heterogeneous groups) was arguably meaningful (Hedges’ g = 0.25) but not statistically significant. Our modest number of groups also prevented us from conducting formal mediation analyses using the observation data, so we instead took a descriptive approach, focused on interpreting effect-size differences. Our modest sample size may also have prevented us from detecting moderation by home language. In addition, our sample had a larger proportion of Mandarin- and Cantonese-English bilinguals than is found in many settings. Future research with larger samples from varying backgrounds is called for.
In addition, our partner district’s institutional review board did not allow us to collect video data, which would have facilitated more detailed analyses of our observations. Although our live-coded observations provided useful data for our questions, future research using more detailed quantitative and qualitative discourse analyses of small-group interactions would be valuable. Our standardized measure of reading comprehension and pretest measures of vocabulary and morphology were normed on largely monolingual populations due to difficulties in finding appropriate and easy-to-use computerized assessments normed with multilingual learners. More attention to the reliability and validity of measures with multilingual learners is therefore warranted.
We anchored our study with a specific curriculum, which allowed us to control for curriculum across groups and increase internal validity but lessened external validity to the extent that findings may be less generalizable to instruction with substantially different curricula. Finally, the study was limited to classrooms where English was the medium of instruction, raising the need for studies in other contexts.
Implications
Overall, our findings highlight the need to take into account not only students’ official classification but also their actual English language skills when making grouping decisions. Homogeneous grouping may be most advantageous for English learners at earlier levels of English language development to the extent that it prompts greater teacher scaffolding and targeted language instruction that may be especially beneficial. Teachers should be purposeful in providing such supports. Although we did not instruct our participating teachers to provide more scaffolding and targeted instruction to their homogeneous groups, they did so spontaneously. More purposeful approaches may yield even better results.
At the same time, when pursuing heterogeneous grouping, given its social benefits (see e.g., Gamoran, 2017), our findings suggest that teachers need additional support to meet the needs of their ELs with less developed English language skills. For instance, professional development on how to integrate scaffolding and language support when teaching heterogeneous groups would be warranted. Conversely, when schools do choose to employ homogeneous grouping of ELs, professional development on how to raise the expectations and rigor of the instruction (e.g., promoting higher-order questioning) would be warranted.
Our findings also have important implications for research. Although the observation data confirmed some of our hypotheses—about how teachers adapt their questions and targeted language instruction to their students—they also complicated our hypotheses about peer effects. Specifically, the affordances of EL and non-EL peers in supporting students’ discourse and learning remain somewhat undetermined. The affordances of EL peers may relate more to brokering with and making sense of a text, whereas the affordances of non-EL peers may stem from exposure to new linguistic repertoires. Both quantitative and qualitative studies are needed to unpack the processes by which ELs do and do not learn from their EL and non-EL peers. In addition, although this study was, in part, motivated by the mixed evidence for classroom-level grouping of ELs, our findings are ultimately limited to small-group instruction. Future research, including experimental or quasi-experimental studies with rich qualitative components, could shed further light on the issue of how to group ELs within schools. Such work is also needed beyond the elementary years given that most of the prior work has been conducted with data from elementary schools.
Conclusion
Grouping ELs for literacy instruction is an everyday decision that teachers make within the constraints of broader policies. Because we found no average effect of grouping, our results suggest that educators should not default to the common practice of concentrating ELs of all English proficiency levels together. Rather, our findings indicate that grouping effects depend on students’ English language skills, with benefits of homogeneous grouping for students at more emergent levels of English proficiency and benefits of heterogeneous grouping for students with more advanced English levels. Further, our findings suggest that teachers’ instruction differs based on grouping, even when teaching the same curriculum. This has implications for supporting teachers, including encouraging more rigor and extended discourse when teaching homogeneous groups and more targeted language support when teaching heterogeneous groups. Instead of a simple answer for policy and practice, our work demonstrates some of the nuances that educators and policymakers should take into account when deciding how ELs should be grouped.
Supplemental Material
Supplemental material, sj-pdf-1-aer-10.3102_00028312251355989 for Effects of Heterogeneous Versus Homogeneous Grouping of English Learners’ Language and Literacy Development: Evidence from a Randomized Controlled Trial by Michael J. Kieffer, C. Patrick Proctor, Andrew W. Weaver, Sasha Karbachinskiy, Qihan Chen, Qun Yu, Gabriella Solano, Aaron Coleman, Shaelyn M. Cavanaugh, Xiaoying Wu, Elise Cappella and Rebecca D. Silverman in American Educational Research Journal
Footnotes
Acknowledgements
We thank Tyara Dabrio, Rachel Hodes, Audrey McMaster, Dee Perry, Aimee Salgado, Summer Wu, and Mayee Yeh for their essential contributions to data collection. Thanks also to Helyn Kim for her invaluable insight and support, to Hsiaolin Hsieh for her help with assessment and study implementation, and to Maria Carlo for comments on an early version of this work. Finally, whole-hearted thanks go to the district’s leadership, teachers, and students who made this work possible.
Funding
This research was funded by the Institute of Education Sciences, U.S. Department of Education Award No. R305A200069. The opinions expressed are those of the authors and do not represent views of the institute or the U.S. Department of Education.