Abstract
U.S. public schools engaged in an unprecedented effort to expand tutoring in the wake of the COVID-19 pandemic. Broad-based support for scaling tutoring emerged, in part, because of the large effects on student achievement found in prior meta-analyses. We conduct an expanded meta-analysis of 263 randomized controlled trials and explore how estimates change when we better align our sample with a policy-relevant target of inference: large-scale tutoring programs aiming to improve performance on independent, standardized tests. Pooled effect sizes from studies with stronger target-equivalence are .16 to .22 standard deviations, relative to .40 standard deviations in our full sample. This result is driven by stark declines in pooled effect sizes as program scale increases. We explore four hypotheses for this pattern and document how a bundled package of recommended design features serves to partially inoculate programs from this attenuation at scale.
Efforts to take tutoring to scale across the U.S. public education system represent a rare collective undertaking to reform modern schooling. Historically, education—both formal and informal—was primarily an individualized endeavor with tutors and pupils or master craftsmen and apprentices working together one-on-one. The rise of large-scale public education systems over the last two centuries evolved around a different organizing principle—one in which teachers became charged with the task of educating entire classrooms of students (Tyack, 1974). While teaching students in groups allowed these systems to expand access rapidly, it also created substantial challenges for educators to meet the full spectrum of students’ individual needs.
The COVID-19 pandemic toppled the precarious balance teachers have long tried to achieve between whole-class instruction and differentiated instruction. The public health crisis caused widespread school closures as well as acute hardships for many families. In the United States (U.S.), researchers estimate that median student achievement initially fell .24 standard deviations (SDs) in math and .13 SD in reading, with even larger declines among low-achieving students (Callen et al., 2024). The pandemic both exacerbated longstanding inequalities in student achievement and created a shared priority to accelerate learning.
In the months following the onset of the pandemic, a rare consensus emerged among policymakers, researchers, and practitioners that tutoring had a critical role to play in addressing the educational harms caused by COVID-19. Integrating tutoring into the public education system at scale became a primary policy response to pandemic-related learning disruptions. Unlike previous unfunded attempts to scale tutoring, such as President Clinton’s America Reads initiative, the federal government and individual states catalyzed these efforts with substantial financial investments (National Student Support Accelerator [NSSA], 2023). The federal Elementary and Secondary School Emergency Relief Fund (ESSER) provided $190 billion to public schools and required districts to spend a sizable fraction of this on student learning acceleration (including 20% of the third wave of ESSER funding) (Goldhaber & Falken, 2025). Of the $117 billion spent as of June 2023, reports suggest that at least $5.4 billion was spent on tutoring and other learning acceleration efforts (CCSSO, 2023; U.S. Department of Education, 2024). While only a small share of the overall grants, this is on par with estimates of the annual cost of administering tutoring to students who need it nationwide (Kraft & Falken, 2021).
Efforts to scale tutoring after COVID-19’s onset appear to have substantially expanded access to individualized instruction in U.S. public schools. The nationally representative School Pulse Survey found that by December 2022, 37% of schools reported offering high-dosage tutoring, defined as “tutoring that takes place for at least 30 minutes per session, one on one or in small-group instruction, offered three or more times per week, is provided by educators or well-trained tutors, [and] aligns with an evidence-based core curriculum or program.” This statistic increases to 59% when schools were asked if they offer more standard tutoring, defined as a less intensive and structured approach to individualized instruction. At the same time, districts have yet to implement these programs at the scale or dosage many believe is required to support a full academic recovery (Goldhaber et al., 2022). Estimates based on the nationally representative Understanding America Study and the School Pulse Survey in December 2022 place the number of students receiving more intensive tutoring between 2% and 10%.
Efforts to expand access to tutoring were, in many ways, evidence-based policy. Meta-analyses conducted by several independent research teams that reviewed randomized controlled trials (RCTs) of tutoring programs have all found large pooled effects of tutoring on test-based measures of achievement in the range of .30 to .40 SD (Dietrichson et al., 2017; Fryer, 2017; Inns et al., 2019; Nickow et al., 2020, 2024; Pellegrini et al., 2021). These effects are roughly equal to an 11 to 15 percentage-point increase (von Hippel, 2025), or to the amount of learning in reading that upper elementary students in the U.S. typically make in an entire school year (Hill et al., 2008). These large pooled effects played a central role in motivating calls by policymakers and researchers—including ourselves (Kraft & Falken, 2021; Robinson et al., 2021)—to advocate for scaling tutoring. Influential technologists such as Mark Zuckerberg and Sal Khan have extolled the moonshot-like potential of tutoring, evoking the eye-popping 2 SD effects found in small-scale studies conducted by University of Chicago doctoral students under the supervision of Benjamin Bloom in the 1980s. However, scholars have recently raised new critiques about Bloom’s 2-sigma studies (Barnum, 2018; von Hippel, 2024) and the generalizability of pooled effect sizes generated from meta-analytic reviews (Dahabreh et al., 2020; Littell, 2024; Slough & Tyson, 2023).
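The translation from SD units to percentile points can be sketched as follows. This is a minimal illustration, assuming normally distributed test scores and a student who starts at the median; the function name `percentile_gain` is hypothetical, and the calculation is not taken from von Hippel (2025).

```python
from statistics import NormalDist


def percentile_gain(effect_sd: float, start_percentile: float = 50.0) -> float:
    """Percentile-point gain for a student who improves by `effect_sd` SDs.

    Assumes scores are normally distributed (an illustrative assumption).
    """
    std_normal = NormalDist()  # mean 0, SD 1
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    end_percentile = 100.0 * std_normal.cdf(start_z + effect_sd)
    return end_percentile - start_percentile


low = percentile_gain(0.30)
high = percentile_gain(0.40)
print(f"{low:.1f} to {high:.1f} percentile points")
```

Under these assumptions, effects of .30 and .40 SD correspond to gains of just under 12 and just under 16 percentile points for a median student, close to the range cited above. The same effect in SD units yields a smaller percentile gain for students who start further from the median.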
In this paper, we conduct an expanded and updated meta-analysis of RCTs evaluating tutoring programs to explore the external validity of pooled effect size estimates. While the common empirical focus on RCTs bolsters the internal validity of meta-analytic estimates, meta-analytic reviews of experimental studies with small-to-medium nonprobability samples do not necessarily produce estimates that generalize to broader efforts to scale tutoring (Littell, 2024). As many scholars have highlighted, strong internal validity does not beget broad external validity (Banerjee & Duflo, 2009; Esterling et al., 2024; Pritchett & Sandefur, 2015). Questions about external validity are particularly relevant in the tutoring context because schools and districts are often motivated to expand tutoring while operating within budget constraints. This can create tension between maintaining fidelity to best practices and supporting more students. We aim to inform efforts to implement tutoring at scale by answering two primary research questions:
1) What expectations should we have for the magnitude of tutoring effects on independent, standardized tests for large-scale programs implemented in high-income countries?
2) How do the aim, format, and intended dosage of tutoring programs moderate their effects?
We address these questions by generating pooled effect sizes from a sample of 263 RCTs published between 1967 and 2023 and examining the sensitivity of our results to sample restrictions that better align our estimates with a specific, policy-relevant target of inference: large-scale tutoring programs aiming to improve achievement on independent, standardized tests in high-income countries. We then leverage meta-regression analyses to explore how tutoring program effects vary across a range of program characteristic measures as well as brief syntheses of RCTs that isolate the effects of specific program features. Prior policy briefs have hypothesized that a combination of program features commonly used to define “high-dosage” or “high-impact” tutoring is central to program effectiveness (e.g., Robinson et al., 2021). These features include in-person programming, tutoring during the school day, sustained student-tutor relationships, student-tutor ratio of no more than 3:1, meeting at least three times per week for a semester or more, and using high-quality instructional materials informed by diagnostic assessments to individually target instruction. We conclude by examining how common approaches to addressing scaling challenges, such as moving tutoring online, increasing student-tutor ratios, using peer tutors, and decreasing dosage, might affect program efficacy at scale.
Our analyses reveal a stark pattern of declining effects of tutoring programs when taken to scale.¹ Consistent with prior meta-analyses, we find a large, pooled effect size of .40 SD on student achievement across our full sample, driven by the large effects of literacy tutoring programs in elementary grades. When we restrict our sample to larger-scale tutoring programs evaluated based on independent, standardized assessments, our estimates shrink by 45% to 60%. In our preferred analytic samples, we estimate a pooled effect size of .22 SD for programs serving 400 to 999 students and .16 SD for programs serving 1,000 students or more. We view these average effects as having considerable policy importance given both their meaningful magnitude and their stronger external validity. Exploratory analyses point to several likely explanations for these declining effects at scale, including: 1) systematic program differences with increasing student-tutor ratios and decreasing intended dosage as programs scale, 2) larger programs being less able to target students who may most benefit from tutoring, and 3) declining implementation quality, such as lower delivery of intended dosage. Encouragingly, we do find that a combination of recommended program design features somewhat buffers against the large decline in effects we find at scale.
Our study makes several contributions to the literature. We extend prior tutoring meta-analyses (e.g., Kulik & Fletcher, 2016; Nickow et al., 2024; Ritter et al., 2009; Slavin & Lake, 2008) by compiling a sample of 263 RCTs, roughly three times the number of studies as the largest prior reviews. This larger sample allows us to explore how our overall effect size estimates compare to those for subsamples of studies that are more aligned with the target of inference used by many researchers and policymakers. Second, our study serves as an applied example of why it is critical to attend to external validity when conducting meta-analytic reviews and engaging in evidence-based policymaking. Finally, our analyses generate important insights to inform ongoing efforts to scale and sustain tutoring within the U.S. public school system. Our findings provide stronger, more externally valid evidence to support investments in tutoring, while also recalibrating expectations toward more plausible gains for students.
Methods
Literature Search Procedures
We began by searching for articles in seven electronic databases: Academic Search Premier, APA PsycInfo, AEA EconLit, ERIC, Google Scholar, ScienceDirect, and Web of Science. We also searched two working paper series, from the Brown University Annenberg Institute and the National Bureau of Economic Research, to ensure we captured studies not yet published in peer-reviewed outlets (Alexander, 2020; Pigott & Polanin, 2020). This served to minimize the extent to which we were missing key research, especially work produced by scholars from historically marginalized groups (Boveda et al., 2023). Our search terms included keywords related to (a) tutoring (e.g., “tutor”), (b) educational contexts (e.g., “school”), and (c) impact evaluation research methods (e.g., “RCT”). We used Boolean operators between all terms, specifically “OR” between terms within each of these three keyword categories, many of which were synonyms, and “AND” between each of the three categories to maximize the relevance of search results without overlooking key studies. The full list of search parameters is provided in the online Appendix Table B1. We identified 45 preexisting reviews and meta-analyses of tutoring-related interventions and scanned the reference lists of these for new studies. We continued our literature search through the end of 2023, the cutoff date for studies we formally coded. Though we stopped coding new studies, we continued to track newly released studies and incorporated several into our narrative synthesis and discussion. Our search generated over 14,000 studies. After removing duplicates, we followed Pigott and Polanin (2020) and had two team members conduct an initial screening for relevance using titles and abstracts. This left 1,347 studies that we subjected to an in-depth inclusion review of the full texts, ultimately resulting in a final analytic sample of 263 studies.
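The Boolean structure of the search, with “OR” within each keyword category and “AND” across categories, can be sketched as follows. This is an illustration only: the helper `build_query` is hypothetical, and apart from “tutor”, “school”, and “RCT” the terms shown are placeholders rather than the actual parameters listed in Appendix Table B1.

```python
def build_query(categories):
    """Join terms with OR inside each category and AND between categories."""
    groups = []
    for terms in categories:
        quoted = [f'"{term}"' for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# "tutor", "school", and "RCT" appear in the text; the other terms are
# hypothetical synonyms added for illustration.
tutoring_terms = ["tutor", "tutoring"]
context_terms = ["school"]
method_terms = ["RCT", "randomized controlled trial"]

query = build_query([tutoring_terms, context_terms, method_terms])
print(query)
```

Grouping synonyms with “OR” widens each category, while “AND” across categories keeps only records that match all three concepts at once.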
Inclusion Criteria
To identify our analytic sample, we assessed studies against eight inclusion criteria: 1) human tutoring, 2) 1:1 or in small groups, 3) focused on academics, 4) measured effects on standardized tests in math or reading, 5) K–12 students, 6) in an OECD country, 7) RCT design, and 8) randomized more than 20 students or four classrooms. First, programs under study needed to meet our broad definition of tutoring: “One non-parental person providing supplemental academic support to a single student or small group of students.” We excluded studies of individualized instruction provided by a book, computer program, or other curricular tool without the direct support of a human tutor. While we included studies of programs where the tutor was a teacher, paraprofessional, college student, volunteer, or peer, we excluded studies of parent tutoring programs because all relevant studies we identified evaluated models of parent training or professional development rather than direct parent-child instruction. For example, we excluded Goudey (2009), which had been included in prior meta-analyses (e.g., Nickow et al., 2024), because of its focus on parental tutoring. Second, the tutoring intervention must have been implemented with either a 1:1 student-tutor ratio or in groups of eight or fewer students.² Third, the tutoring content had to focus on academic subjects. This excluded, for example, studies of mentoring or socioemotional interventions without an academic component. Fourth, our focus on academic interventions also meant that studies needed to report effects on academic outcomes, specifically standardized tests designed by a third party and used for accountability purposes or formative assessments (“independent tests”), or assessments designed by the research team to capture intervention impacts (“intervention tests”), measuring performance in either reading or math.
We excluded studies where the only outcome was a nontest academic measure (e.g., GPA, attendance) because the sample of these studies was too small to facilitate broad comparisons. Fifth, the tutees had to be K–12 students. This excluded studies of tutoring in early childhood settings, of college or graduate students, and of adults. Sixth, the intervention had to take place in a member country of the Organization for Economic Co-operation and Development (OECD), given our focus on high-income country settings. For example, we excluded a study of phone-based tutoring in Kenya (Schueler & Rodriguez-Segura, 2022). Seventh, we limited our sample to RCT designs to parallel prior reviews and given RCTs’ relative advantage at isolating causal impacts. That said, we supplement our meta-analysis with a synthetic review of recent non-experimental studies, which helps us consider tutoring impacts at a scale not captured by most RCTs. Finally, the studies had to have a sample size of more than 20 students when randomization occurred at the student level, or more than four classrooms or schools when randomization was at the classroom or school level.
We also applied inclusion criteria to the effects reported and coded all qualifying estimates from each study. First, the effect estimates had to examine the same outcome as the subject of the tutoring (i.e., we dropped estimates of the impact of math tutoring on reading achievement). Second, we focused on treatment-control contrasts that isolated tutoring whenever possible, dropping estimates where the control condition involved tutoring-like programs and comparisons between treatment arms without a pure no-tutoring control group. However, we included studies in which we judged tutoring to be a key element of a larger set of interventions and reforms that together were evaluated against a business-as-usual control group.³ Finally, we prioritized estimates from reduced-form models that define treatment as the offer of tutoring. We view these intent-to-treat estimates as the relevant impact for the types of inferences policymakers often make about what the effect of a program will be as implemented at scale.
Coding Procedures
Our research team of 20 coders double-coded each study in our sample. We trained coders on a common set of studies until they achieved a consistently high agreement rate with master codes created by our most experienced coders. After coding each study independently, coders then met to reconcile any differences and arrive at a final set of codes. When a pair of coders felt that the reconciliation was not straightforward, they brought questions to the principal investigators for a final determination. The team kept a record of decision rules that resulted from these meetings to ensure consistency across coders and over time.⁴
Our codebook included 128 codes that we grouped into five categories. Some codes varied at the study level, while others varied at the intervention or estimate level. The first group of codes cataloged study information such as publication type (e.g., journal article, working paper) and publication year. The second group tracked information about the context in which the study occurred, such as the country, school level, and participant demographics. The third set covered information about the intervention itself and the treatment/control contrast (e.g., student-tutor ratio, the intended dosage, tutor type). The fourth category was information on the methods used by the study’s authors, such as the level of assignment to treatment, whether standard errors were clustered at the appropriate level, and whether we had concerns about attrition or contamination of the randomization. The fifth set included information about the effects, including estimated effect sizes, standard errors, sample sizes, and outcome instruments.
We highlight one key code that we use throughout our analyses: the number of treated students. Prior meta-analytic reviews often explore how effect sizes vary by the total sample size of an evaluation. We take a somewhat different approach given our focus on identifying studies that are more closely aligned to a specific target of inference. We code the number of students randomly assigned to receive treatment as an estimate of the number of treated students.⁵
Calculating Effect Sizes
Study authors reported treatment effects in a variety of ways. Whenever they were available, we defaulted to relying on standardized effect sizes generated from linear regressions estimating standardized mean differences between the treatment and control group, often controlling for baseline covariates. One advantage of model-based estimates is that the associated standard errors typically account for the ways data may be clustered, as recommended by Hedges (2007). When these estimates (and/or their associated standard errors) were unstandardized, we standardized them using unadjusted pretreatment control group SD whenever possible (if unavailable, we used pooled SD). In other cases, we estimated a standardized effect size using the pre-post treatment means, SD, and sample sizes for the treatment and control groups. For each estimate, we then calculated a Hedges’ g effect size, correcting for upward bias present for small-sample studies (Borenstein et al., 2009) as follows:
Here,
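For reference, the small-sample correction described in Borenstein et al. (2009) can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hedges_g` is hypothetical, and it standardizes by the pooled SD (the fallback mentioned above) rather than the pretreatment control-group SD.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Return (g, se_g) for a two-group standardized mean difference.

    Uses the pooled within-group SD and the usual correction factor
    J = 1 - 3 / (4 * df - 1), with df = n_t + n_c - 2.
    """
    df = n_t + n_c - 2
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / pooled_sd           # uncorrected Cohen's d
    j = 1 - 3 / (4 * df - 1)                    # small-sample correction
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return j * d, j * math.sqrt(var_d)          # Var(g) = J^2 * Var(d)


# Hypothetical inputs: a 0.5 SD raw difference with 30 students per arm.
g, se_g = hedges_g(0.5, 0.0, 1.0, 1.0, 30, 30)
print(round(g, 3), round(se_g, 3))
```

The correction factor is always below 1, so g is slightly smaller in magnitude than d; the adjustment is noticeable for samples near the 20-student inclusion threshold and negligible for large trials.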
Meta-Analytic Estimates
We generated our pooled standardized effect size estimates using a correlated and hierarchical effects (CHE) model as described by Pustejovsky and Tipton (2022). Like robust variance estimation (RVE) meta-analytic techniques (Hedges et al., 2010; Tanner-Smith & Tipton, 2014), the CHE approach upweights effects estimated with greater precision and allows for the nesting of estimates within clusters. This is important in our case, given that we often observe multiple estimates for a given study (for example, when there are multiple outcomes or interventions examined in a single study). However, the typical RVE approach requires researchers to choose between either a “hierarchical effects” (HE) or a “correlated effects” (CE) approach. HE models account for both between-study and within-study variation in effect sizes but assume that effect size estimates within the same study are independent. CE models account for the correlation between effect size estimates within studies but assume that there is no within-study variation in true effect size parameters. The CHE approach has the benefit of allowing for both between-study and within-study variation in true effect sizes while also accounting for correlated effect estimates within studies. We, therefore, fit the following CHE model:
where
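For reference, an intercept-only CHE working model of the kind described by Pustejovsky and Tipton (2022) is commonly written as below. The notation is an assumption made for illustration and may differ from the symbols used in the original display.

```latex
T_{ij} = \mu + u_j + v_{ij} + e_{ij},
\qquad
\operatorname{Var}(u_j) = \tau^2, \quad
\operatorname{Var}(v_{ij}) = \omega^2, \quad
\operatorname{Var}(e_{ij}) = \sigma_{ij}^2, \quad
\operatorname{Cov}(e_{hj}, e_{ij}) = \rho\,\sigma_{hj}\,\sigma_{ij}
```

In this sketch, T_{ij} is effect size estimate i from study j, \mu is the pooled average effect, u_j is a between-study random effect, v_{ij} is a within-study random effect, and e_{ij} is sampling error with known variance \sigma_{ij}^2. The term \rho is an assumed common correlation among sampling errors from the same study. Setting \rho = 0 gives the HE structure, while setting \omega^2 = 0 gives the CE structure.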
Meta-Regression
Researchers have typically explored the relative importance of moderators by comparing the pooled effect sizes of tutoring programs with different program features. This approach is limited, however, because program features are often bundled and could be correlated with unobserved aspects of program quality (Tipton et al., 2023). We attempt to reduce these potential biases using meta-regressions to examine which moderators predict larger impact estimates, conditional on other study and program design features. We estimate the following CHE model:
Here, we include a vector of study and intervention features (
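For reference, a CHE meta-regression that adds moderators to the random-effects structure described by Pustejovsky and Tipton (2022) can be sketched as below. The symbols are assumptions made for illustration and are not necessarily those of the original display.

```latex
T_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j + v_{ij} + e_{ij}
```

In this sketch, T_{ij} is effect size estimate i from study j, \mathbf{x}_{ij} is a vector of study and intervention features, and \boldsymbol{\beta} collects the conditional associations between those features and effect sizes. The terms u_j, v_{ij}, and e_{ij} are the between-study random effect, the within-study random effect, and the sampling error. Because program features are not randomly assigned across studies, the coefficients describe conditional associations rather than causal effects of each feature.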
Target of Inference
Our aim is to draw inferences about tutoring programs that are most relevant to the target, context, outcomes, and scale of tutoring programs envisioned by policymakers in the U.S. and other similar high-income countries. Specifically, we hope to inform the expectations of leaders who are seeking to address overall declines and growing gaps in academic outcomes post-COVID by integrating tutoring into the K–12 public school system. We imagine that because leaders are being held accountable for results on statewide standardized exams that assess a broad set of skills covered by state content standards, policymakers will be more interested in tutoring impacts on statewide exams that measure general skills as opposed to assessments that measure narrower sets of skills or that are designed by researchers to align tightly with the focal content of the tutoring intervention.
A primary goal of our work is to inform efforts to expand access to tutoring programs. We therefore aim to draw inferences about reasonable expectations for the impacts of tutoring programs implemented at scale, as opposed to small-scale pilot programs. Throughout the paper, we present estimates of pooled effect sizes across four bins of program size: 0–99, 100–399, 400–999, and 1,000 or more students. Although there are likely many more small districts that might aim to serve fewer than 400 students, larger tutoring programs will serve a disproportionately greater number of students, making such programs a policy-relevant focus of our analysis.
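The four program-size bins can be expressed as a simple classification rule on the number of treated students. The sketch below is illustrative; the function name `size_bin` and the label strings are hypothetical.

```python
def size_bin(n_treated: int) -> str:
    """Assign a study to one of the four program-size bins used in the paper."""
    if n_treated < 0:
        raise ValueError("number of treated students cannot be negative")
    if n_treated < 100:
        return "0-99"
    if n_treated < 400:
        return "100-399"
    if n_treated < 1000:
        return "400-999"
    return "1,000+"


# Hypothetical treated-student counts for four studies.
bins = [size_bin(n) for n in (45, 250, 400, 3200)]
print(bins)
```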
Results
Characteristics of Included Studies
Our final analytic sample includes 263 RCTs that evaluate 338 distinct tutoring interventions.⁷ We present characteristics of these studies at the study/RCT level in Table 1. Our sample skews toward recent research, with almost two-thirds of included reports published in the years since 2009 and almost 86% since 1999. Only five studies in our sample assess interventions implemented since the beginning of the pandemic, almost all of which provided remote tutoring, giving us limited power to disentangle virtual delivery from the pandemic context. Three-fourths of the studies in our sample are academic journal articles. The modal study examined a tutoring program in an urban, public school setting.
Table 1. Study characteristics
Note. Setting grade-level categories are not mutually exclusive.
Our sample reflects a substantial imbalance in the subject, grade-level, and size of tutoring programs evaluated in the literature, as illustrated by the evidence gap maps shown in Figures 1 and 2 (Polanin et al., 2023). Most of the studies assess literacy tutoring among early elementary school students (37%) and programs serving fewer than 100 students (62%). This concentration on small elementary reading programs is worth noting because if impacts differ across grade levels, subjects, or with program scale, pooled results based on our full sample may not be immediately generalizable to other program types. The imbalance of studies with some specific characteristics also limits the degrees of freedom available to estimate pooled effects in these subsamples and for these characteristics in our moderation analyses.

Figure 1. Evidence gap map by school level and subject area.

Figure 2. Evidence gap map by tutored student sample size and subject area.
We provide further details on the characteristics of the programs evaluated in each of these studies in Table 2. Most interventions were delivered in-person (97%), at school (86%), during school hours (76%), using a 2:1 student-tutor ratio or less (62%), and with a provided curriculum (89%). Although individual tutoring was the modal approach (46%), student-tutor ratios varied widely. We observe greater variation in design choices across the features of tutor type, intended dosage, and whether students were pulled out of class for tutoring.
Table 2. Intervention characteristics
Notes. Intended dosage metrics are not binary variables and are not mutually exclusive. Standard deviations are reported in parentheses, where applicable. All other sets of variables are percentages.
Full Sample Estimates of Tutoring Impacts
Similar to prior tutoring meta-analyses, we find large, pooled effect sizes across our full sample of studies. As shown in Table 3, we estimate that the average effect on student achievement of a broad variety of tutoring interventions subjected to rigorous evaluation via RCTs is .40 SD when stacking math and reading achievement impacts. The prediction interval ranges from −.16 SD to .96 SD, illustrating the considerable heterogeneity of impacts we might expect across individual tutoring programs. This large average effect is driven, in part, by the pooled effect of literacy tutoring in lower elementary grades of .47 SD, which makes up a large portion of our sample (60%). Still, the pooled effects of tutoring on math achievement are also large (.39 SD). We find inconsistent patterns in pooled effects across grade levels by subject. Impacts of reading tutoring for elementary school students are substantially larger than for middle and high school students (.30 SD). In math, we find the largest effects at the upper elementary level (.46 SD), followed by middle and high school (.36 SD).
Table 3. Estimates pooled by grade level and tested subject
Notes. Prediction intervals are included for each estimate in brackets; robust standard errors are reported in parentheses. Estimates may be included in more than one group if they treat students in multiple grade levels. Lower elementary indicates treatment in grades K–2; upper elementary indicates treatment in grades 3–5; middle school and high school indicate grades 6–12; high school indicates grades 9–12. Pooled achievement includes impact estimates for both math and reading subject tests.
* p < .10; ** p < .05; *** p < .01.
Sensitivity Analyses
We next explore whether our pooled estimates are robust to a variety of sensitivity checks, as shown in Table 4. First, we examine whether results differ for studies that may have lower internal validity due to quality concerns with the randomization design or empirical analyses. For example, some authors described their methods as an RCT but indicated or intimated that students, teachers, parents, or administrators had some influence over whether a student ended up in the treatment or control group. Another example is when a sizable number of students were excluded from the analytic sample because of noncompliance, attrition, or a move. When we examine results separately for studies for which we did not have quality concerns, they remain essentially unchanged. When we omit the top and bottom 2.5% of effect size observations, the pooled effect size estimate drops only slightly to .37 SD, a decline that is largely driven by a reduction in pooled reading effects.⁸
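The trimming check, which drops the top and bottom 2.5% of effect size observations, can be sketched as follows. This is one simple way to implement such a rule, assuming observations are ranked by effect size and an equal count is dropped from each tail; the authors' exact procedure is not described here, and `trim_extremes` is a hypothetical name.

```python
import math


def trim_extremes(effect_sizes, share_per_tail=0.025):
    """Drop the smallest and largest `share_per_tail` of observations."""
    ordered = sorted(effect_sizes)
    # The small epsilon guards against floating-point results such as 4.999...
    k = math.floor(len(ordered) * share_per_tail + 1e-9)
    return ordered[k:len(ordered) - k]


# Hypothetical data: 200 effect sizes, so 5 are dropped from each tail.
sample = [i / 100 for i in range(200)]
trimmed = trim_extremes(sample)
print(len(trimmed), trimmed[0], trimmed[-1])
```

The trimmed observations would then be passed to the same pooling model as the full sample, so any change in the pooled estimate reflects only the removal of extreme values.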
Table 4. Estimates by RCT quality concerns, stacked subjects
Notes. Estimates may be included in more than one group if students in multiple grade levels were treated. All cells pool across math and reading. Panels A and B split up the entire sample by whether we identified any concerns with the quality of the RCT. Panel C omits the top and bottom 2.5% of observations by effect size magnitude. DF = degrees of freedom; ES = effect size; SE = standard error.
* p < .10; ** p < .05; *** p < .01.
Finally, we examine whether estimates vary by the decade in which they were published as a rough proxy for study quality. Education research has taken major leaps in terms of methodological rigor and quality standards over the past three decades, particularly in applying causal inference methods (Angrist, 2004).⁹ As shown in Table 4, we find substantial variation in the magnitude of the pooled impacts based on publication decade, with larger estimates prior to 2000 (.40 SD) and between 2000 and 2009 (.52 SD) than for those published between 2010 and 2019 (.36 SD). We observe somewhat smaller impacts for the most recent studies published in 2020 or later (.33 SD). We cannot definitively disentangle whether this variation in impacts is due to methodological changes, policy changes, or other study or program characteristic changes over time, but differences across decades remain even after we control for a host of study and program characteristics, as shown in Table 8.
What Expectations Should We Have for Tutoring Effects at Scale?
Evidence from our meta-analysis of experimental studies
In Table 5, we explore how our pooled effect size estimates change when we restrict our sample to more closely approximate our target of inference. Removing estimates that rely on assessments designed by the research team induces a .07 SD decline in our aggregate estimate.¹⁰ Removing the most extreme 5% of our point estimates only reduces our pooled estimate by .03 SD, while limiting to studies published in 2010 or later reduces our pooled estimate by .05 SD. However, restricting the sample to studies that provided tutoring to incrementally larger groups of students profoundly changes the magnitude of our estimates. Using our full sample, we find that programs offering tutoring to fewer than 100 students have a pooled effect size of .51 SD, whereas programs tutoring between 100 and 399 students have a pooled effect size of .30 SD. As shown in Figure 3, this estimate continues to decline, almost linearly, as we further restrict the sample: programs serving between 400 and 999 students and programs serving 1,000 or more students have pooled effects of .26 SD and .16 SD, respectively.
Pooled effect size estimates overall and by treated student sample size
Notes. Prediction intervals are included for each estimate in brackets; robust standard errors are reported in parentheses. Each cell presents the Hedges’ g estimate, stacking both math and reading. Model (1) offers the pooled average impact of tutoring across the entire subsample indicated in each panel. Models (2) through (5) disaggregate the estimate in Model (1) by the tutored student sample size of each study.
*p < .10; **p < .05; ***p < .01.
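For readers less familiar with the metric, the following is a minimal sketch of how a Hedges' g effect size is commonly computed from group summary statistics. The function name and the input values in the usage note are illustrative assumptions, not quantities drawn from our dataset, and individual studies may require different conversions depending on what they report.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with the usual small-sample correction.

    All argument names are hypothetical; they stand for the treatment and
    control group means, standard deviations, and sample sizes.
    """
    df = n_t + n_c - 2
    # Pooled within-group standard deviation.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    # Cohen's d: raw mean difference in pooled-SD units.
    d = (mean_t - mean_c) / sd_pooled
    # Approximate correction factor J, which shrinks d slightly in small samples.
    j = 1 - 3 / (4 * df - 1)
    return j * d
```

With hypothetical inputs such as `hedges_g(0.5, 0.0, 1.0, 1.0, 50, 50)`, the correction factor is close to 1, so g is only slightly smaller than the uncorrected difference of .50 SD; the correction matters more for the very small samples common among small-scale tutoring studies.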

Pooled estimated impacts of tutoring across program size and study characteristics.
When we focus on results for independent test outcomes only, presented in Table 5 Panel B, impacts range from .22 SD to .16 SD for tutoring programs in high-income countries operating at a scale of 400 to 999 and 1,000 or more students. There are four important points to highlight about this preferred set of estimates, which are closer to our target of inference. First, they are about 45% to 60% smaller than the pooled estimate using our full meta-analytic sample, suggesting that inferences made using the broader sample are not well-calibrated to tutoring programs at scale. Second, effect sizes between .16 SD and .22 SD are of medium-to-large magnitude and still very impressive for large-scale education interventions (Kraft, 2020). Third, our pooled effect size estimate for programs serving 1,000 students or more is very imprecisely estimated, given the limited number of RCTs of tutoring programs at this scale that meet our target-of-inference-aligned inclusion criteria. Fourth, the wide prediction intervals associated with these estimates suggest that we should expect tutoring program effects to vary considerably, with some individual programs producing quite small or even negative effects and others resulting in sizable gains. We further illustrate the robustness of this pattern of results by presenting point estimates for these different subsamples visually in Figure 3, where the overall pattern of declines at scale remains unchanged.
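The "45% to 60% smaller" comparison follows directly from the pooled estimates reported above. As a minimal sketch of the arithmetic (the function name is our own illustrative choice):

```python
def relative_decline(baseline, estimate):
    """Share by which `estimate` falls short of `baseline`."""
    return (baseline - estimate) / baseline


full_sample = 0.40               # pooled effect in the full sample (SD units)
preferred_range = [0.22, 0.16]   # preferred estimates closer to the target of inference

for es in preferred_range:
    pct = round(relative_decline(full_sample, es) * 100)
    print(f"{es:.2f} SD is about {pct}% smaller than {full_sample:.2f} SD")
```

This prints declines of about 45% and 60%, matching the range stated in the text.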
Evidence from large-scale, non-experimental studies
Meta-analytic reviews of the literature on tutoring frequently restrict their focus to studies that employ RCTs in an effort to ensure researchers are identifying the unbiased, causal effect of tutoring. This restriction strengthens the internal validity of the pooled effect sizes, but can sometimes limit researchers’ ability to study more representative samples with greater external validity (Tipton & Olsen, 2018). Large-scale RCTs are expensive and often require the active consent of participants, making them financially and logistically challenging to conduct. Our meta-analytic sample contains only nine studies that evaluate programs serving at least 1,000 students. This sparse data makes it difficult to accurately project plausible effects from tutoring programs taken to scale in larger school districts, given a lack of common support.
We attempt to further inform our understanding of the plausible effects of tutoring by turning to non-experimental studies of large-scale programs (n treated ≥1,000). Much of the literature evaluating large-scale programs focuses on after-school tutoring provided by private tutoring organizations and funded by two federal initiatives, 21st Century Learning Centers and Supplemental Educational Services (SES) under the No Child Left Behind Act. Studies of these initiatives often evaluate programs across large districts and entire states with thousands of treated students and find effects that are notably smaller than those we find with our full meta-analytic sample (Deke et al., 2012; Heinrich et al., 2010, 2014; James-Burdumy et al., 2005; Ross et al., 2008; Springer et al., 2014; Zimmer et al., 2009, 2010). These small-to-medium effects (frequently ≤.10 SD) may be fully explained by poor attendance at these off-site after-school programs and their design features, such as large student-tutor ratios and rotating tutors. However, the scale of the programs may also have contributed to their underwhelming results by influencing program design choices and implementation quality.
Several non-experimental studies of large-scale programs from the post-COVID era provide more relevant assessments of ongoing attempts to integrate tutoring in the U.S. public school system at scale. Carbonari, Dewey, et al. (2024) evaluate the efforts of four mid- to large-sized districts to support students’ academic recovery in math during the 2021–22 academic year by providing tutoring and additional instructional time. Using a value-added framework controlling for prior test scores, they find estimates that are uniformly smaller than .04 SD and often precisely estimated null effects, likely due to challenges related to staffing and student attendance. The same research team finds similar results in an expanded analysis of tutoring and small-group instruction across eight districts during the 2022–23 academic year (Carbonari, DeArmond, et al., 2024). They document statistically insignificant estimates of the average effects of tutoring and small-group instruction of .03 SD in math and .07 SD in reading when pooling across tutoring programs that jointly served over 12,000 students.
Kraft et al. (2024) studied efforts to scale tutoring in Metro-Nashville Public Schools (MNPS) over the course of two and a half years to serve over 4,000 students by the spring of 2023. In contrast to the districts studied by Carbonari and colleagues, MNPS was largely successful at engaging students to attend tutoring frequently and staffing their program at scale by hiring their own teachers as tutors. Using an event study design, they find medium effects of tutoring on independent test scores in reading (.09 SD), but no effects on test scores in math, on average.
Two studies of high-impact tutoring in the District of Columbia explore the Office of the State Superintendent of Education’s (OSSE) efforts to scale tutoring by contracting with a diverse portfolio of tutoring organizations to serve over 5,000 students in 2022–23 and over 7,000 in 2023–24. Across both years of the program, the research team finds evidence that tutoring dosage increased and that tutored students were slightly more likely to attend school on tutoring days. However, comparisons of tutored students’ growth on interim and state achievement tests relative to students who did not receive tutoring suggest tutoring had very limited effects given estimates that are small in magnitude and of both negative and positive sign (Lu et al., 2025; Pollard et al., 2024).
Two recent evaluations of public tutoring programs implemented across the United Kingdom (UK) and in Victoria, Australia, also provide early evidence of post-pandemic tutoring impacts at large scales in high-income contexts. Both analyses used matching methods that included baseline test scores to reweight regression analyses, comparing the test-score gains of tutored students to comparison-group students in the third year of these tutoring programs. Government Social Research, an evaluation agency within the UK Civil Service, found small-to-medium effects of the UK National Tutoring Programme on math (.06 SD) and English achievement (.03 SD) among Key Stage 2 students (years 3–6), but no effects on the achievement of Key Stage 4 students (years 10–11) in either subject (Moore et al., 2024). The Victorian Auditor-General’s Office found no significant effects of the statewide tutor learning initiative on students’ achievement gains in math and reading across students in years 3 through 10 (Victorian Auditor-General’s Office, 2024). Together, these non-experimental studies of large-scale tutoring programs are consistent with a pattern of declining effects at scale.
Why Do Tutoring Effects Decline at Scale?
The phenomenon of declining effects when interventions are scaled is well-documented in education research (Cheung & Slavin, 2016; Kraft, 2020, 2023). Understanding why this pattern also exists for tutoring programs is critical to informing efforts to expand access to tutoring and maintain its effectiveness at scale. We posit and test four primary hypotheses that might explain this pattern, while recognizing that unmeasured confounders, including contextual factors such as district characteristics, may also be at play.
Hypothesis #1: Declining effects do not reflect a true phenomenon but are instead due to selective reporting, standardization techniques, and/or spillover
It is possible that the negative relationship between program effects and program size is a product of the research process rather than a real pattern of differential effects. First, such a pattern could be caused by selective reporting that is more acute among studies with smaller samples. Here, we define selective reporting as the phenomenon where studies that produce statistically insignificant results are less likely to result in academic publications. This could occur through multiple mechanisms, including researchers being less likely to write papers when they find null results, researchers making subjective modeling decisions that push preferred estimates over traditional significance thresholds (i.e., p-hacking), and journals being less likely to publish studies that find null results (i.e., publication bias). Of course, researchers could also be systematically choosing smaller sample sizes when studying programs that are likely to have larger effects, given that less statistical power is necessary to detect larger effects.
We explore potential bias in three ways given that no single test can definitively rule out publication bias (McShane et al., 2016). First, we produce funnel plots and conduct a trim and fill analysis (Duval & Tweedie, 2000) to assess the degree of symmetry of our point estimates around the meta-analytic mean. An imbalance in publications falling on either side of the vertical line at the center of the full plot would suggest potential bias and lead studies to be imputed to make the data more symmetric. We do this at both the individual effect-size level and at the study level by collapsing multiple effect sizes to account for the nested nature of the data. As shown in online Appendix Figure B1 and Table B2, we find no evidence of publication bias in our full sample of studies using this method. We then repeat these analyses after subsetting our data into studies with fewer than 100 treated students versus at least 100 treated students and find no evidence of differential publication bias among small-sample studies.
Second, we test for evidence of p-hacking bias by plotting the p-values from our sample of effect sizes and examining whether there is an excess mass of p-values just below conventional significance thresholds in these distributions, following the intuition of Brodeur et al. (2020). 11 A visual inspection of online Appendix Figure B2 reveals that the distribution of p-values is smooth across critical values for traditional significance thresholds in the full sample and in subsamples of smaller and larger sample studies. We then formally test for differential bunching below each conventional statistical threshold using a randomization test to examine whether p-values are binomial-distributed with equal probability around a given cutpoint. In Table 6, we show that we find little evidence of differential bunching of estimates with p-values just below the .05 and .10 significance thresholds. Only one of the six tests we run in our full sample, using three different bandwidths for each threshold, is marginally significant. We similarly find no compelling evidence of p-hacking among subsamples of small-scale studies or large-scale studies.
Tests for significant differences in estimate mass across p-value thresholds
Notes: Here we present the likelihood of observing the number of significant p-values in our data at the 5% and 10% significance levels within the bandwidths .02, .01, and .005 around those thresholds. For each of these combinations, we isolate the subsample within the indicated bandwidth around the indicated threshold (“N estimates within bandwidth”), present the share of significant estimate p-values in that range (“Share significant estimates”), and calculate the likelihood of having at least that many significant estimates assuming a binomial distribution (“One-sided p-value”). We repeat this exercise for our full sample of estimates in Panel A and disaggregate according to treated student sample size in Panels B and C. All estimates pool across both math and reading subject areas.
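The logic of the bunching test described in the notes above can be sketched as follows. This is a minimal illustration under the assumption that, absent p-hacking, a p-value inside a narrow window is equally likely to fall on either side of the threshold; the function names are hypothetical, and the exact implementation behind Table 6 may differ.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bunching_test(p_values, threshold=0.05, bandwidth=0.01):
    """Return (N within bandwidth, share significant, one-sided p-value)."""
    # Keep only estimates whose p-value lies within the bandwidth of the threshold.
    window = [p for p in p_values if abs(p - threshold) <= bandwidth]
    n = len(window)
    if n == 0:
        return 0, float("nan"), float("nan")
    # Count estimates that fall just below (i.e., are significant at) the threshold.
    k = sum(p < threshold for p in window)
    # Probability of at least k significant estimates under equal odds.
    return n, k / n, binomial_upper_tail(k, n)
```

A small one-sided p-value would indicate more estimates just below the threshold than equal odds would predict. Repeating the call with bandwidths of .02, .01, and .005 at the .05 and .10 thresholds mirrors the six combinations reported for each panel.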
Our final test of selective reporting is to compare pooled effect sizes between academic journal publications and other types of studies, such as working papers or reports. 12 If selective reporting were occurring because journals have been less likely to publish nonsignificant findings, we would expect to see larger average estimates from academic journal articles than from studies not published in academic journals. In Table B3, we show that this is indeed the pattern we find. Specifically, we observe an average pooled effect from studies in academic journals of .44 SD versus .24 SD for studies not published in academic journals. We also check for differences by the number of citations a study has received per year since its publication, regardless of publication type. 13 These results are consistent with greater interest within the academy in larger-magnitude impacts, with a .34 SD pooled effect size for the bottom 50% of studies by citations contrasted with a .44 SD pooled effect size for the top 50%.
We interpret these results with caution, especially given that academic journal status is correlated with other factors such as publication date. Our sample of non-academic-journal studies skews more recent, and we know that more recent studies have demonstrated smaller pooled effects. These results are therefore not proof positive of selective reporting, but are consistent with that possibility. In sum, we find mixed evidence on whether selective reporting could explain the pattern of declining pooled effects for programs implemented at a greater scale.
A second possible statistical explanation for the differential pattern of effect sizes across smaller and larger tutoring programs is the standardization process. Tutoring programs typically target students in a specific range of the performance distribution. We find that 94% of the studies we coded describe some type of effort to target students, with 89% of studies evaluating programs that specifically targeted low-performing students. As Fitzgerald and Tipton (2025) document, this targeting results in samples recruited to participate in RCTs being more homogeneous than the population as a whole. 14 Targeted sampling reduces the variation in achievement among the study sample, artificially inflating the magnitude of the effect sizes when researchers standardize their outcome measure using sample-based estimates of its standard deviation. It is possible, if not likely, that the overall effect sizes from meta-analyses of tutoring are somewhat inflated because of this practice. This may also help to explain the pattern of attenuated effects we find if smaller-scale tutoring programs are able to more precisely target students, resulting in even more homogeneous participant populations compared to larger-scale programs. Said another way, the pattern of declining effects by program size might be less pronounced if all studies had used an estimate of the SD of their test score outcome derived from nationally representative populations.
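A stylized example may help make the standardization concern concrete. The sketch below uses purely hypothetical numbers (a 5-point raw gain, a sample SD of 10 points, and a population SD of 15 points) to show how the same raw gain yields a larger standardized effect when the denominator comes from a more homogeneous sample; the function names are our own.

```python
def standardized_effect(raw_gain, sd):
    """Raw test-score gain expressed in standard deviation units."""
    return raw_gain / sd


def rescale_to_population(es_sample, sd_sample, sd_population):
    """Convert a sample-standardized effect into population-SD units."""
    return es_sample * sd_sample / sd_population


# Hypothetical values for illustration only.
raw_gain = 5.0          # points gained from tutoring
sd_sample = 10.0        # SD among a targeted, homogeneous study sample
sd_population = 15.0    # SD in a nationally representative population

es_sample = standardized_effect(raw_gain, sd_sample)
es_population = rescale_to_population(es_sample, sd_sample, sd_population)
print(f"sample-standardized: {es_sample:.2f} SD")
print(f"population-standardized: {es_population:.2f} SD")
```

In this hypothetical case, an effect of .50 SD in sample units corresponds to roughly .33 SD in population units. If smaller programs draw on more homogeneous samples than larger programs, the size of this inflation would differ by program scale.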
Finally, the presence of peer spillover effects could contribute to a differential pattern of tutoring effects by program size. A large body of evidence documents peer effects in K–12 education settings (Barrios-Fernandez, 2023). If being in the same class or school as a student receiving tutoring has positive spillover effects on nontutored students and the magnitude of these effects increases with the concentration of treated students in a class or school, then larger-scale tutoring programs could differentially attenuate the treatment-control contrast and contribute to the pattern of declining effects we find. However, it is not obvious that this would happen in practice, given that the concentration of treated students per class or school could be similar across smaller and larger programs if larger programs simply serve more schools.
Hypothesis #2: Scaling causes programs to systematically alter key design features
A second potential explanation for declining effects with scale is that leaders systematically change the design of tutoring programs for larger- versus smaller-scale interventions. To assess the evidence for this hypothesis, we first explore how key program features change as programs are taken to scale. Table 7 reveals two systematic differences in program design features between smaller and larger programs. First, larger programs are substantially less likely to tutor students individually. Programs serving over 400 students are roughly 10 percentage points less likely to rely on 1:1 student-tutor ratios than small programs that serve fewer than 100 students. Second, larger programs aim to deliver less dosage, primarily by shortening the number of weeks tutoring programs run. Here, the relationship is not entirely monotonic, with the smallest tutoring programs offering moderate dosage, middle-sized tutoring programs offering the highest total dosage, and larger tutoring programs offering the least. For example, on average, programs that serve 100 to 399 students scheduled 39 total tutoring hours while those serving 1,000 or more scheduled 27 total hours. Unexpectedly, we see that larger programs are even slightly more likely to use teachers and paraprofessionals as tutors and to provide a high degree of supervision and support to tutors, characteristics hypothesized to promote larger effects.
Intervention characteristics across treated student sample size
Notes. Except for intended dosage variables, all measures are percent-scaled 0 to 100. Intended dosage variable units are indicated, with standard deviations presented in parentheses.
As another test, we examine whether the negative relationship between program effects and size is attenuated when we control for the full range of observable program characteristics in a meta-regression framework. We do this by comparing the results of two meta-regressions. The first model shown in Table 8 reports coefficients from binned sample size indicators, which capture the clear negative relationship relative to the omitted category of studies evaluating small programs serving fewer than 100 students. Adding controls for study design features in Column 2 leaves this pattern unchanged. Further adding our full set of controls for program characteristics in Column 3 makes little difference, suggesting that program features are not a primary driver of declining effects of tutoring at scale.
Meta-regression controlling for study and intervention features
Notes. Standard errors are presented in parentheses. For each model, we present the moderator coefficient (“
*p < .10; **p < .05; ***p < .01.
Hypothesis #3: Heterogeneous tutoring effects cause the marginal student to benefit less as tutoring programs expand
The attenuation of tutoring effects as program sizes increase may also be a product of the heterogeneous effects of tutoring across students. Prior research has found that tutoring may be more effective for students who are lower-performing (Kraft, 2015; Robinson et al., 2024), Black (Fryer & Howard-Noveck, 2020), and from low-income families (Carlana & La Ferrara, 2024). It is plausible that smaller-scale tutoring programs appear more effective because they better target students who stand to benefit the most. As tutoring programs scale, they may be expanding to serve students who will benefit less, on average.
We explore this by comparing weighted averages of student characteristics in our sample of RCTs, disaggregating by the size of the tutoring program, in Table 9. This comparison reveals a clear pattern where studies of smaller tutoring programs serve larger percentages of historically marginalized students. Students in smaller tutoring programs were 13 percentage points more likely to be English learners, 11 percentage points more likely to receive special education services, 8 percentage points more likely to be from low-income backgrounds, and 5 percentage points more likely to be Hispanic. These sizable differences in the characteristics of students served by smaller and larger tutoring programs are likely to attenuate the estimated effects of tutoring as program scale increases.
Student sample characteristics by size of tutored student sample
Notes. Means are taken at the intervention level. All variables presented range from 0 to 100. ELLs = English language learners.
A related possibility that we cannot directly test with our data is that smaller programs treat student populations that are more homogeneous. Homogeneity may make implementation easier because there is less of a need to tailor interventions to a variety of student achievement levels or other unique needs. Expanding tutoring programs might mean programming is provided to a more diverse group of students with a wider set of challenges, making it more difficult to produce large impacts among increasingly heterogeneous groups.
Hypothesis #4: Implementation quality declines as tutoring programs scale
A final hypothesis for why we observe smaller impacts for larger programs is that the quality of program implementation declines as tutoring programs are brought to scale. Imagine, for example, two tutoring programs with the exact same intended program design features, but one serves a small number of students at a single school, and the other is brought to scale district-wide. Intended dosage may be identical across programs, but the actual delivered dosage may decline at scale if student attendance suffers or time-on-task declines. Administrative needs are likely higher for the large-scale program. Small programs may be more likely to represent pilot efforts led by uniquely trail-blazing, motivated, and talented leaders, whereas administrators recruited to run large programs may not be as effective, on average. Implementation quality could also suffer if the effectiveness of the average tutor is lower for larger programs than for smaller programs. However, if tutoring screening tools are only weakly related to tutor performance, then tutor quality may not decline with scale (Davis et al., 2017). It may be more challenging to coordinate communication between tutors and teachers for large-scale programs. There may simply be less oversight with a greater number of tutoring sites, making it more difficult to ensure fidelity of implementation to program models for larger interventions.
Unfortunately, most tutoring RCTs do not directly measure implementation quality, limiting what we can say about this hypothesis using our meta-analytic dataset. For example, our coding reflects intended measures of dosage rather than the actual number of total hours of tutoring that treated students received. However, survey data and several recent studies on post-COVID tutoring efforts do point to significant implementation challenges. To start, on the nationally representative SPP survey, the majority of K–12 public school principals report experiencing barriers (e.g., funding, timing, or staffing challenges) that limited their ability to effectively provide tutoring (National Center for Education Statistics, 2024). The aforementioned “Road to Recovery” (R2R) evaluation of large-scale academic recovery efforts by Carbonari, Dewey, et al. (2024) documents how districts fell well short of leaders’ expectations with regard to both the number of students served and the actual dosage of the interventions. This is consistent with SPP survey results showing that, among schools that provided tutoring, larger schools had somewhat lower student participation rates (National Center for Education Statistics, 2024). Tutoring programs cannot be effective at scale when little actual tutoring happens.
Staff buy-in is another challenge that has been identified. Programs that appeared to successfully scale high-quality tutoring after the pandemic emphasized the importance of district-level leadership, goal setting, buy-in from school leaders and teachers, a willingness to rethink scheduling, the pursuit of multiple funding sources, and the ability to make difficult choices about spending trade-offs (Cohen, 2024). Leaders in the R2R districts highlighted staffing challenges related to pandemic surges, a tight labor market, and limited district capacity for recruitment and human resources management. These issues of staffing challenges and organizational capacity are echoed by findings from a qualitative study on programs in two urban districts (Makori et al., 2024). Implementation challenges do not appear to be solely a function of acute post-pandemic conditions, as the R2R team’s follow-up report from 2022–23 revealed similar difficulties (Carbonari, DeArmond, et al., 2024). Leaders interviewed for the R2R report pointed to their need to adapt tutoring program designs, sometimes departing from best practices, to align with federal, state, and local policies. This is likely to remain a challenge as schools and districts look to a range of federal, state, and local funding sources to support tutoring programs after the COVID-relief funding runs dry (Accelerate, 2023; Cohen, 2024).
How Do the Aim, Format, and Intended Dosage of Tutoring Programs Moderate Their Effects?
Understanding how tutoring program effects vary based on their aim, format, and intended dosage is paramount for improving specific program designs and avoiding broad generalizations when program effects vary considerably. We explore how the pooled effect sizes described previously vary across a range of moderators. We group these moderators into three broad buckets:
Results from our meta-analytic regression in Table 8 reveal that only a few study features and program characteristics appear systematically related to effect sizes when included in our fully controlled meta-analytic model. Intervention tests produce meaningfully larger effect sizes relative to independent tests (.17 SD). Tutoring outside of the school day has a strong negative association with effect sizes relative to tutoring during the school day (−.16 SD). We similarly find a negative association with using a specified tutor type not in our major categories, relative to a teacher (−.18 SD), although this group mostly consists of community members and/or volunteers. Results for total intended dosage do not show a monotonic relationship with impacts, suggesting intended dosage may only be weakly related to actual dosage.
Like so many complex interventions, the efficacy of tutoring programs may lie in the combination of program design features rather than any single characteristic. Prior literature has focused on a bundle of program features that research suggests are associated with larger effects, aligned with what is sometimes described as “high-quality,” “high-dosage,” or “high-impact” programs (e.g., Robinson et al., 2021). This bundle of features includes in-person programming, delivered at school during school hours, with a student-tutor ratio of no more than 3:1, meeting at least three times per week, ensuring a high overall dosage of intended tutoring (which we proxy for with at least 15 hours of total tutoring), and using a provided curriculum. 15
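To make the bundle definition concrete, the sketch below expresses the criteria listed above as a simple check on a coded program record. The class and field names are hypothetical stand-ins rather than the variable names used in our coding scheme; only the thresholds (a ratio of no more than 3:1, at least three sessions per week, and at least 15 total intended hours) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class TutoringProgram:
    """Hypothetical record of coded design features for one program."""
    in_person: bool
    during_school_day: bool      # delivered at school during school hours
    students_per_tutor: int
    sessions_per_week: int
    total_intended_hours: float
    provided_curriculum: bool


def has_recommended_bundle(p: TutoringProgram) -> bool:
    """True only when every feature in the bundle is present."""
    return (
        p.in_person
        and p.during_school_day
        and p.students_per_tutor <= 3
        and p.sessions_per_week >= 3
        and p.total_intended_hours >= 15
        and p.provided_curriculum
    )
```

Because the check requires all features jointly, a program that meets every criterion except one (for example, an otherwise qualifying after-school program) falls outside the bundled sample.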
When we test whether the combination of these features is greater than the sum of their parts, we find encouraging results, reported in Table 10. Specifically, the overall pattern of declining effect sizes persists among tutoring programs that utilize a bundled package of recommended design features, but the attenuation at scale is much less pronounced. When we isolate only individual features of this bundle, effects continue to erode to varying degrees as programs scale (Appendix Table B4, online). As shown in Figure 4, while the pooled effect among studies of programs serving between 100–399 students declines by 40% relative to programs serving 99 students or fewer in the full sample, it only declines by 9% in the restricted sample of studies with the bundled package of design features. The decline among programs serving 400–999 students is also less pronounced, dropping 49% in the full sample and 29% in the bundled package sample. When we restrict our analysis to subsamples of studies using independent test measures, excluding outlier observations, or limiting to those published after 2009, we continue to see limited attenuation of program effects across sample size, at least for programs serving fewer than 1,000 students. 16
Estimates for programs that combine best practices, stacked subjects
Notes. All cells stack estimates for math and reading. Column (1) presents the pooled meta-analytic estimated effect for the subsample of studies described in each panel. Columns (2) through (4) disaggregate the estimate in Column (1) according to the tutored student sample size. Panel A is limited to the described subsample of programs sharing a set of best practices in their designs. Panel B is restricted to studies with independent test outcome measures. Panel C drops the top and bottom 2.5% of effect sizes from the whole sample (“no outliers”), and Panel D excludes studies published prior to 2010. DF = degrees of freedom; ES = effect size; SE = standard error.
*p < .10; **p < .05; ***p < .01.

Gaps between pooled estimates for all studies compared to those using a bundle of tutoring best practices.
What Does the Research Suggest About Modifying Program Design Features to Reduce Costs and Increase Scalability?
Although the bundled package of program features appears to help sustain program effectiveness at scale, several aspects are costly and can be difficult to implement at scale. Here, we explore the potential implications of modifying specific program features.
Moving tutoring online: Many districts and programs have adopted online tutoring to access a larger potential supply of tutors. How might this affect the efficacy of tutoring? When we limit our sample to the 59 estimates of tutoring delivered virtually (drawn from six unique studies), the pooled estimate reported in Table 11a is .08 SD. 17 This is substantially smaller than the unadjusted pooled estimate of in-person program impacts of .41 SD and even substantially smaller than our preferred pooled estimates of expected impacts for our target of inference, .16 to .22 SD (Table 5). However, our sample of virtual programs is small and offers very limited degrees of freedom (1.9), so we caution against over-interpreting these differences. Additionally, results from our meta-analytic regressions presented in Table 8 suggest that these smaller effects are likely driven by other program features. Conditional on our extensive set of codes for observable program features, we estimate a positive but statistically insignificant coefficient when comparing virtual tutoring programs to those in person.
a. Average effect sizes across different program design features, stacked subjects
Notes. Each column isolates a subsample of effects according to tutoring program characteristics. All estimates stack math and reading. DF = degrees of freedom; ES = effect size; SE = standard error.
*p < .10; **p < .05; ***p < .01.
b. Average effect sizes across different program design features, with stacked subjects
Notes. Each column isolates a subsample of effects according to tutoring program characteristics. All estimates stack math and reading. DF = degrees of freedom; ES = effect size; SE = standard error.
*p < .10; **p < .05; ***p < .01.
Further evidence of the efficacy of online tutoring comes from a novel study design where students were randomized to receive literacy tutoring either in person or online, with tutors fully crossed across conditions. The authors find no statistically significant difference in the achievement growth of students who were tutored online versus in person, although tutors report feeling more connected to the students they tutored in person (Hashim et al., 2025). Several new studies of virtual tutoring released in 2024 and 2025 find positive estimates that are similar to or somewhat larger than the magnitude of our pooled estimate for virtual programs (Carlana & La Ferrara, 2024; Neitzel & Storey, 2024; Ready et al., 2024), with another study finding null or negative results (Huffaker et al., 2025). We read this evidence as suggestive that online tutoring has the potential to be an effective approach to addressing scaling challenges when accompanied by other effective program design characteristics.
Increasing student-tutor ratios: The cost of tutoring is driven largely by tutor compensation. Many districts and tutoring organizations have chosen to increase student-tutor ratios as a means of expanding access while managing costs. In Table 11a, we report pooled effect estimates by student-tutor ratio and find somewhat larger impacts, on average, for programs with lower ratios. Using the full sample, we estimate pooled effects of .39, .40, .34, and .30 SD for 1:1, 2:1, 3:1, and 4:1 programs, respectively. The effects for programs with five or more students per tutor are substantially larger (.72 SD), but this result is not robust to excluding studies that use intervention assessments. The overall pattern of declining effects persists when we focus on tutoring programs evaluated using independent tests. When we examine student-tutor ratios in a meta-regression framework (Table 8), we again find a pattern of larger effects for smaller student-tutor ratios, although individual estimates are imprecise.
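The cost trade-off described above can be made concrete with a small worked example. The sketch below takes the full-sample pooled estimates reported in the text and divides each by a relative per-student cost. The cost model is an illustrative assumption and is not estimated in this paper: tutor compensation is treated as the only cost, and one tutor's pay is split evenly across the students in a group. The names `POOLED_EFFECTS`, `relative_cost_per_student`, `effect_per_unit_cost`, and `share_of_one_on_one` are hypothetical.

```python
# Full-sample pooled effects (SD) by students per tutor, as reported in the text.
POOLED_EFFECTS = {1: 0.39, 2: 0.40, 3: 0.34, 4: 0.30}


def relative_cost_per_student(students_per_tutor: int) -> float:
    """Assumed cost model: one tutor's pay is split evenly across the group."""
    return 1.0 / students_per_tutor


def effect_per_unit_cost(effects: dict[int, float]) -> dict[int, float]:
    """Pooled effect divided by relative per-student cost (1:1 cost = 1.0)."""
    return {
        ratio: round(effect / relative_cost_per_student(ratio), 2)
        for ratio, effect in effects.items()
    }


def share_of_one_on_one(effects: dict[int, float]) -> dict[int, float]:
    """Each pooled effect expressed as a fraction of the 1:1 pooled effect."""
    baseline = effects[1]
    return {ratio: round(effect / baseline, 2) for ratio, effect in effects.items()}


if __name__ == "__main__":
    for ratio, value in effect_per_unit_cost(POOLED_EFFECTS).items():
        print(f"{ratio}:1  effect per unit of 1:1 cost = {value:.2f} SD")
    print(share_of_one_on_one(POOLED_EFFECTS))
```

Under this assumed cost model, the 4:1 pooled estimate retains roughly three quarters of the 1:1 estimate at one quarter of the per-student cost. The pooled estimates compare different programs rather than the same program at different ratios, so the calculation is descriptive and should not be read as the causal return to changing group size.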
Evidence from the 10 studies that experimentally vary student-tutor ratios, summarized in Table 12a, provides a range of contrasts from 1:1 versus 2:1 ratios (Carlana & La Ferrara, 2024; Loeb et al., 2023; Vadasy & Sanders, 2008) to 4:1 to 13:1 ratios (Vaughn et al., 2010). Most examine interventions with elementary students (Clarke et al., 2017, 2020, 2023; Doabler et al., 2019; Loeb et al., 2023; Schwartz et al., 2012; Vadasy & Sanders, 2008) except for three with middle schoolers (Carlana & La Ferrara, 2024; Kraft & Lovison, 2024; Vaughn et al., 2010). The effect size differences most often favor smaller ratios but are not always large in magnitude and do not typically achieve statistical significance. However, many of these studies are underpowered to detect small differences in effects between treatment arms. In short, the existing research suggests that lower ratios produce larger effects, but it is possible to deliver tutoring in pairs or small groups and maintain meaningful effects.
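The power concern can be illustrated with a standard back-of-the-envelope calculation. The sketch below computes the smallest standardized difference between two equally sized treatment arms that a study could detect, using the normal approximation. The simplifying assumptions are not drawn from the studies above: individual-level randomization, no covariates, no clustering, and a two-sided test at alpha = .05 with 80% power. The function name and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_difference(
    n_per_arm: int, alpha: float = 0.05, power: float = 0.80
) -> float:
    """Smallest standardized difference between two arms detectable at the given power.

    Normal approximation: (z_{1 - alpha/2} + z_{power}) * sqrt(2 / n_per_arm).
    """
    if n_per_arm <= 0:
        raise ValueError("n_per_arm must be positive")
    z = NormalDist()  # standard normal distribution
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(2 / n_per_arm)


if __name__ == "__main__":
    for n in (50, 100, 400, 1600):  # hypothetical numbers of students per arm
        mdd = minimum_detectable_difference(n)
        print(f"n per arm = {n:5d}  minimum detectable difference = {mdd:.2f} SD")
```

Under these assumptions, a study with 100 students per arm could reliably detect only differences of about .40 SD, far larger than the .09 SD gap between the 1:1 and 4:1 full-sample pooled estimates reported above. Clustering of students within tutors or schools would raise the detectable difference further, while strong baseline covariates would lower it.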
Table 12a. Multi-arm studies experimentally comparing different student-tutor ratios
Notes. All studies examine elementary programs except for three that study middle school programs: Carlana and La Ferrara (2024), Kraft and Lovison (2024), and Vaughn et al. (2010). Shaded areas identify the number or range of students per tutor reported in each study.
Table 12b. Multi-arm studies experimentally comparing different intended dosages of tutoring
Note. There is more than one effect in each study because the authors report effects on multiple reading outcomes or assessments. All studies examine tutoring in reading subjects except for Carlana and La Ferrara (2021), which focused on multiple subjects. Shaded areas identify the number or range of students per tutor reported in each study.
Using peer tutors: An alternative approach to scaling tutoring on a fixed budget is to enlist K–12 students as peer tutors. We find that the pooled effect size for peer tutoring in our full sample is an impressive .32 SD, as shown in Table 11b. Our meta-analytic regression (Table 8) suggests peer tutoring is not statistically distinguishable from tutoring by teachers, conditional on other program and study characteristics: the estimated difference of .11 SD in favor of teachers relative to peer tutors is nonsignificant. We know of only one study that randomizes students to different tutor types. Mathes et al. (2003) use a partially matched and partially randomized design to compare teachers who implemented small-group (4–5:1) instruction with teachers who oversaw pairs of students using Peer Assisted Learning Strategies (PALS). They find effect sizes of .70 SD for teacher-directed, small-group instruction and .55 SD for peer-assisted instruction. Further evidence documents that the peer-tutoring program PALS for kindergarteners does scale and maintain its efficacy (Stein et al., 2008), suggesting that peer tutoring may provide a viable path for reducing program costs while sustaining effects.
Decreasing dosage: A fourth approach to scaling tutoring while controlling costs is to reduce overall dosage. Pooled effect estimates presented in Table 11b do not reveal a clear monotonic trend between intended dosage hours and program impacts in our more restricted sample. Across the full sample as well as the more restricted samples, programs offering over 60 hours of tutoring consistently have the smallest impacts. In our preferred subsample assessed with independent tests, the greatest magnitude of effect is for programs providing 15–29 hours of tutoring (.40 SD). However, these pooled estimates may be confounded with other study characteristics correlated with intended dosage. When employing meta-regression to control for a variety of program features and study characteristics, we do not find consistent statistically significant differences based on the total hours of intended dosage, as shown in Table 8.
Evidence from four studies that randomly assigned students to different intended doses of tutoring, which isolates the causal impact of intended dosage, suggests some benefits of higher intended dosage. As shown in Table 12b, three of these studies evaluate elementary school programs (Al Otaiba et al., 2005; Begeny, 2011; Wanzek & Vaughn, 2008) and one evaluates a middle school program (Carlana & La Ferrara, 2021). These studies provide a range of contrasts, from four versus nine total hours of tutoring (Begeny, 2011) to 36 versus 72 hours (Al Otaiba et al., 2005). More often than not, these studies show larger effect sizes for programs designed to provide higher rather than lower intended dosages. In short, studies that experimentally vary intended dosage suggest that reducing it may attenuate effects. However, the actual delivered dosage is likely a more salient program feature.
Discussion
Evidence-based policymaking has increasingly become the standard in education, particularly as practitioners look to implement proven approaches to accelerate students’ academic growth after the substantial disruptions caused by the COVID-19 pandemic. While this trend is encouraging, it places increased importance on the external validity of research. Even well-designed and -implemented RCTs offer incomplete information to policymakers and practitioners if the evidence they produce is at arm’s length from the realities of implementing education policies and practice at scale. Meta-analyses that pool evidence across multiple studies seemingly offer stronger external validity, but aggregating across multiple studies with limited generalizability does not make the results valid for a very different target of inference.
Our study illustrates the importance of carefully considering the alignment between the research evidence and the policy target of inference. We find that attempts to better harmonize our meta-analytic sample of 263 studies to the target of inference used by most policymakers, namely large-scale tutoring programs aiming to increase student performance on independent tests, substantially reduce the pooled effect sizes. This attenuation is driven by the declining impacts of tutoring programs as they scale and is, in part, explained by decreasing intended dosage and increasing student-tutor ratios, expanding tutoring to students who may benefit less, and declining success at delivering on the intended tutoring dosage.
This pattern of declining effects at scale often leads to a circular argument that “the program works when implemented with fidelity, it just wasn’t implemented correctly when taken to scale.” Alternatively, one might ask, “If implementation becomes systematically more difficult at scale, then does a program really work?” We see four possible responses to this challenge: 1) start small, learn, iterate, and engage in the hard but critical work to scale vertically (i.e., expanding program size) over time while maintaining program fidelity, 2) redesign the program to be easier to implement at scale, 3) adopt a more flexible approach to scaling that allows for localized adaptation, and/or 4) decide that a program is best delivered in a small-scale format and focus on horizontal scaling (replicating small programs).
To be clear, we view our target-equivalent estimates of the effects of tutoring as still meaningful and policy-relevant (Kraft, 2020). And we see tutoring as one of the most promising evidence-based approaches to accelerating student achievement. If districts could leverage tutoring at scale for those students whose learning was most negatively affected by the pandemic and produce effects similar to our policy-relevant estimates, it would be a huge success. In fact, several recent experimental studies of tutoring programs implemented post-COVID at a medium scale (Carlana & La Ferrara, 2024; Cortes et al., 2024; Gortazar et al., 2024) and at a large scale (Robinson et al., 2024) find effects on par with those from our target-aligned pooled effect sizes.
That said, we also think it is equally important for policymakers and practitioners alike to have more grounded expectations about what tutoring can accomplish. Several other recent studies using both experimental and non-experimental methods suggest early attempts to scale tutoring have produced quite small effects (Carbonari, DeArmond, et al., 2024; Carbonari, Dewey, et al., 2024; Kraft et al., 2024). Outsized expectations can lead policymakers and practitioners to become disillusioned when they fail to realize the eye-popping effect sizes of small-scale, boutique tutoring programs implemented under favorable circumstances among students who often opt into participating, particularly when meta-analytic estimates mask those contextual factors. Unrealistic expectations can also lead policymakers to mistakenly rely on a single or limited set of interventions when multiple interrelated programs may be needed to achieve their goals. Contextualizing tutoring program effects relative to their costs will also be critical for identifying sustainable models (Kohlmoos & Steinberg, 2024).
New technology may also present opportunities to scale tutoring with greater fidelity while maintaining program effects and reducing per-pupil costs. Recent studies suggest that computer-assisted learning programs paired with tutoring (Bhatt et al., 2024) or integrated into core academic classes (Oreopoulos et al., 2024) can support effective instruction, potentially reducing common obstacles to scaling tutoring. There is growing interest in the potential of generative artificial intelligence to offer effective tutoring at scale, although early programs appear to fall well short of this goal (Barnum, 2024). We remain optimistic about the potential of these new technologies but emphasize that the benefits of human tutoring likely extend far beyond student performance on standardized tests, to say nothing about the value of tutoring for the tutor. Human tutoring offers the opportunity for authentic personal connections and social interactions that can contribute to student development. It also creates volunteer and employment opportunities and valuable experiences for those interested in pursuing a career in education.
Conclusion
Efforts to integrate tutoring at scale into the U.S. K–12 public education system are at a critical juncture. New evidence documenting the mixed results of early efforts to expand access in the wake of the COVID-19 pandemic is emerging just as large-scale federal funding to support tutoring ends. With this paper, we aim to inform ongoing efforts to refine tutoring programs when implemented at scale and better calibrate expectations for what these programs are capable of accomplishing. Our findings highlight the importance of conducting research that considers both internal and external validity to best inform policy and practice.
Our analyses suggest that a bundled package of program features hypothesized to promote effective tutoring does guard against some of the attenuation that occurs as programs expand. It remains an open question whether adapting individual features of this bundle—such as moving tutoring online, increasing student-tutor ratios, using peer tutors, or decreasing dosage—can be done without compromising effectiveness. Such changes may attenuate effects but still be an equally, if not more, cost-effective way to deliver tutoring at scale. Our hope is that as policymakers experiment with new tutoring models, they will partner with researchers to learn about the impacts of these adaptations. Continued efforts to integrate individualized instruction into the U.S. K–12 education system would benefit from a decades-long approach that focuses first on establishing effectiveness and then on scaling, rather than the other way around.
Supplemental Material
Supplemental material, sj-docx-1-rer-10.3102_00346543261446660 for What Impacts Should We Expect From Tutoring at Scale? Exploring Meta-Analytic Generalizability by Matthew A. Kraft, Beth E. Schueler and Grace T. Falken in Review of Educational Research
Authors
MATTHEW A. KRAFT is a professor of education and economics at Brown University.
BETH E. SCHUELER is an associate professor of education at Stanford University.
GRACE T. FALKEN is a project director at the Annenberg Institute at Brown University.
