What Impacts Should We Expect From Tutoring at Scale? Exploring Meta-Analytic Generalizability

Abstract

U.S. public schools engaged in an unprecedented effort to expand tutoring in the wake of the COVID-19 pandemic. Broad-based support for scaling tutoring emerged, in part, because of the large effects on student achievement found in prior meta-analyses. We conduct an expanded meta-analysis of 263 randomized controlled trials and explore how estimates change when we better align our sample with a policy-relevant target of inference: large-scale tutoring programs aiming to improve performance on independent, standardized tests. Pooled effect sizes from studies with stronger target-equivalence are .16 to .22 standard deviations, relative to .40 standard deviations in our full sample. This result is driven by stark declines in pooled effect sizes as program scale increases. We explore four hypotheses for this pattern and document how a bundled package of recommended design features serves to partially inoculate programs from this attenuation at scale.

Keywords

tutoring, program evaluation, policy, equity, achievement, meta-analysis

Efforts to take tutoring to scale across the U.S. public education system represent a rare collective undertaking to reform modern schooling. Historically, education—both formal and informal—was primarily an individualized endeavor with tutors and pupils or master craftsmen and apprentices working together one-on-one. The rise of large-scale public education systems over the last two centuries evolved around a different organizing principle—one in which teachers became charged with the task of educating entire classrooms of students (Tyack, 1974). While teaching students in groups allowed these systems to expand access rapidly, it also created substantial challenges for educators to meet the full spectrum of students’ individual needs.

The COVID-19 pandemic toppled the precarious balance teachers have long tried to achieve between whole-class instruction and differentiated instruction. The public health crisis caused widespread school closures as well as acute hardships for many families. In the United States (U.S.), researchers estimate that median student achievement initially fell .24 standard deviations (SDs) in math and .13 SD in reading, with even larger declines among low-achieving students (Callen et al., 2024). The pandemic both exacerbated longstanding inequalities in student achievement and created a shared priority to accelerate learning.

In the months following the pandemic, a rare consensus emerged among policymakers, researchers, and practitioners that tutoring had a critical role to play in addressing the educational harms caused by COVID-19. Integrating tutoring into the public education system at scale became a primary policy response to pandemic-related learning disruptions. Unlike previous unfunded attempts to scale tutoring, such as President Clinton’s America Reads initiative, the federal government and individual states catalyzed these efforts with substantial financial investments (National Student Support Accelerator [NSSA], 2023). The federal Elementary and Secondary School Emergency Relief Fund (ESSER) provided $190 billion to public schools and required districts to spend a sizable fraction of this on student learning acceleration (including 20% of the third wave of ESSER funding) (Goldhaber & Falken, 2025). Of the $117 billion spent as of June 2023, reports suggest that at least $5.4 billion was spent on tutoring and other learning acceleration efforts (CCSSO, 2023; U.S. Department of Education, 2024). While only a small share of the overall grants, this is on par with estimates of the annual cost of administering tutoring to students who need it nationwide (Kraft & Falken, 2021).

Efforts to scale tutoring after COVID-19’s onset appear to have substantially expanded access to individualized instruction in U.S. public schools. The nationally representative School Pulse Survey found that by December 2022, 37% of schools reported offering high-dosage tutoring, defined as “tutoring that takes place for at least 30 minutes per session, one on one or in small-group instruction, offered three or more times per week, is provided by educators or well-trained tutors, [and] aligns with an evidence-based core curriculum or program.” This share increased to 59% when schools were asked whether they offered more standard tutoring, defined as a less intensive and structured approach to individualized instruction. At the same time, districts have yet to implement these programs at the scale or dosage many believe is required to support a full academic recovery (Goldhaber et al., 2022). Estimates based on the nationally representative Understanding America Study and the School Pulse Survey in December 2022 place the share of students receiving more intensive tutoring between 2% and 10%.

Efforts to expand access to tutoring were, in many ways, evidence-based policy. Meta-analyses conducted by several independent research teams that reviewed randomized controlled trials (RCTs) of tutoring programs have all found large pooled effects of tutoring on test-based measures of achievement in the range of .30 to .40 SD (Dietrichson et al., 2017; Fryer, 2017; Inns et al., 2019; Nickow et al., 2020, 2024; Pellegrini et al., 2021). These effects are roughly equal to an 11 to 15 percentile-point increase (von Hippel, 2025), or to the gains in reading that upper elementary students in the U.S. typically make in an entire school year (Hill et al., 2008). These large pooled effects played a central role in motivating policymakers and researchers, including ourselves (Kraft & Falken, 2021; Robinson et al., 2021), to advocate for scaling tutoring. Influential technologists such as Mark Zuckerberg and Sal Khan have extolled the moonshot-like potential of tutoring, evoking the eye-popping 2 SD effects found in small-scale studies conducted by University of Chicago doctoral students under the supervision of Benjamin Bloom in the 1980s. However, scholars have recently raised new critiques of Bloom’s 2-sigma studies (Barnum, 2018; von Hippel, 2024) and of the generalizability of pooled effect sizes generated from meta-analytic reviews (Dahabreh et al., 2020; Littell, 2024; Slough & Tyson, 2023).

In this paper, we conduct an expanded and updated meta-analysis of RCTs evaluating tutoring programs to explore the external validity of pooled effect size estimates. While the common empirical focus on RCTs bolsters the internal validity of meta-analytic estimates, meta-analytic reviews of experimental studies with small-to-medium nonprobability samples do not necessarily produce estimates that generalize to broader efforts to scale tutoring (Littell, 2024). As many scholars have highlighted, strong internal validity does not beget broad external validity (Banerjee & Duflo, 2009; Esterling et al., 2024; Pritchett & Sandefur, 2015). Questions about external validity are particularly relevant in the tutoring context because schools and districts are often motivated to expand tutoring while operating within budget constraints. This can create tension between maintaining fidelity to best practices and supporting more students. We aim to inform efforts to implement tutoring at scale by answering two primary research questions:

1) What expectations should we have for the magnitude of tutoring effects on independent, standardized tests for large-scale programs implemented in high-income countries?

2) How do the aim, format, and intended dosage of tutoring programs moderate their effects?

We address these questions by generating pooled effect sizes from a sample of 263 RCTs published between 1967 and 2023 and examining the sensitivity of our results to sample restrictions that better align our estimates with a specific, policy-relevant target of inference: large-scale tutoring programs aiming to improve achievement on independent, standardized tests in high-income countries. We then leverage meta-regression analyses to explore how tutoring program effects vary across a range of program characteristic measures as well as brief syntheses of RCTs that isolate the effects of specific program features. Prior policy briefs have hypothesized that a combination of program features commonly used to define “high-dosage” or “high-impact” tutoring is central to program effectiveness (e.g., Robinson et al., 2021). These features include in-person programming, tutoring during the school day, sustained student-tutor relationships, student-tutor ratio of no more than 3:1, meeting at least three times per week for a semester or more, and using high-quality instructional materials informed by diagnostic assessments to individually target instruction. We conclude by examining how common approaches to addressing scaling challenges, such as moving tutoring online, increasing student-tutor ratios, using peer tutors, and decreasing dosage, might affect program efficacy at scale.

Our analyses reveal a stark pattern of declining effects of tutoring programs when taken to scale.¹ Consistent with prior meta-analyses, we find a large, pooled effect size of .40 SD on student achievement across our full sample, driven by the large effects of literacy tutoring programs in elementary grades. When we restrict our sample to larger-scale tutoring programs evaluated based on independent, standardized assessments, our estimates shrink by 45% to 60%. In our preferred analytic samples, we estimate a pooled effect size of .22 SD for programs serving 400 to 999 students and .16 SD for programs serving 1,000 students or more. We view these average effects as having considerable policy importance given both their meaningful magnitude and their stronger external validity. Exploratory analyses point to several likely explanations for these declining effects at scale, including: 1) systematic program differences, with student-tutor ratios increasing and intended dosage decreasing as programs scale, 2) larger programs being less able to target students who may most benefit from tutoring, and 3) declining implementation quality, such as lower delivery of intended dosage. Encouragingly, we find that a combination of recommended program design features somewhat buffers against the large decline in effects we find at scale.

Our study makes several contributions to the literature. First, we extend prior tutoring meta-analyses (e.g., Kulik & Fletcher, 2016; Nickow et al., 2024; Ritter et al., 2009; Slavin & Lake, 2008) by compiling a sample of 263 RCTs, roughly three times as many studies as the largest prior reviews. This larger sample allows us to explore how our overall effect size estimates compare to those for subsamples of studies that are more aligned with the target of inference used by many researchers and policymakers. Second, our study serves as an applied example of why it is critical to attend to external validity when conducting meta-analytic reviews and engaging in evidence-based policymaking. Finally, our analyses generate important insights to inform ongoing efforts to scale and sustain tutoring within the U.S. public school system. Our findings provide stronger, more externally valid evidence to support investments in tutoring, while also recalibrating expectations toward more plausible gains for students.

Methods

Literature Search Procedures

We began by searching for articles in seven electronic databases: Academic Search Premier, APA PsycInfo, AEA EconLit, ERIC, Google Scholar, ScienceDirect, and Web of Science. We also searched two working paper series, from the Brown University Annenberg Institute and the National Bureau of Economic Research, to ensure we captured studies not yet published in peer-reviewed outlets (Alexander, 2020; Pigott & Polanin, 2020). This served to minimize the extent to which we were missing key research, especially work produced by scholars from historically marginalized groups (Boveda et al., 2023). Our search terms included keywords related to (a) tutoring (e.g., “tutor”), (b) educational contexts (e.g., “school”), and (c) impact evaluation research methods (e.g., “RCT”). We used Boolean operators between all terms, specifically “OR” between terms within each of these three keyword categories, many of which were synonyms, and “AND” between each of the three categories to maximize the relevance of search results without overlooking key studies. The full list of search parameters is provided in online Appendix Table B1. We also identified 45 preexisting reviews and meta-analyses of tutoring-related interventions and scanned their reference lists for new studies. We continued our literature search through the end of 2023, the cutoff date for studies we formally coded. Though we stopped coding new studies, we continued to track newly released studies and incorporated several into our narrative synthesis and discussion. Our search generated over 14,000 studies. After removing duplicates, we followed Pigott and Polanin (2020) and had two team members conduct an initial screening for relevance using titles and abstracts. This left 1,347 studies that we subjected to an in-depth inclusion review of the full texts, ultimately resulting in a final analytic sample of 263 studies.
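The Boolean structure of the search can be sketched as follows. This is a minimal illustration rather than the actual search string: only "tutor", "school", and "RCT" come from the text, the synonyms are placeholders, and the function name `build_query` is hypothetical. The full parameters are in online Appendix Table B1.

```python
def build_query(categories):
    """Join terms with OR inside each category and AND across categories."""
    groups = []
    for terms in categories:
        # Quote each term so multi-word phrases stay together.
        quoted = [f'"{t}"' for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Placeholder terms: only "tutor", "school", and "RCT" appear in the text;
# the other synonyms are assumptions added for illustration.
categories = [
    ["tutor", "tutoring"],
    ["school", "classroom"],
    ["RCT", "randomized controlled trial"],
]
query = build_query(categories)
print(query)
```

Grouping synonyms with OR widens each category, while AND across the three categories keeps only records that touch tutoring, an educational setting, and an impact evaluation design at once.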

Inclusion Criteria

To identify our analytic sample, we assessed studies against eight inclusion criteria: 1) human tutoring, 2) 1:1 or in small groups, 3) focused on academics, 4) measured effects on standardized tests in math or reading, 5) K–12 students, 6) in an OECD country, 7) RCT design, and 8) randomized more than 20 students or four classrooms. First, programs under study needed to meet our broad definition of tutoring: “One non-parental person providing supplemental academic support to a single student or small group of students.” We excluded studies of individualized instruction provided by a book, computer program, or other curricular tool without the direct support of a human tutor. While we included studies of programs where the tutor was a teacher, paraprofessional, college student, volunteer, or peer, we excluded studies of parent tutoring programs because all relevant studies we identified evaluated models of parent training or professional development rather than direct parent-child instruction. For example, we excluded Goudey (2009), which had been included in prior meta-analyses (e.g., Nickow et al., 2024), because of its focus on parental tutoring. Second, the tutoring intervention must have been implemented with either a 1:1 student-tutor ratio or in groups of eight or fewer students.² Third, the tutoring content had to focus on academic subjects. This excluded, for example, studies of mentoring or socioemotional interventions without an academic component. Fourth, our focus on academic interventions also meant that studies needed to report effects on academic outcomes, specifically standardized tests designed by a third party and used for accountability purposes or formative assessments (“independent tests”), or assessments designed by the research team to capture intervention impacts (“intervention tests”), measuring performance in either reading or math. 
We excluded studies where the only outcome was a nontest academic measure (e.g., GPA, attendance) because the sample of these studies was too small to facilitate broad comparisons. Fifth, the tutees had to be K–12 students. This excluded studies of tutoring in early childhood settings, of college or graduate students, and of adults. Sixth, the intervention had to take place in a member country of the Organization for Economic Co-operation and Development (OECD), given our focus on high-income country settings. For example, we excluded a study of phone-based tutoring in Kenya (Schueler & Rodriguez-Segura, 2022). Seventh, we limited our sample to RCT designs to parallel prior reviews and given RCTs’ relative advantage at isolating causal impacts. That said, we supplement our meta-analysis with a synthetic review of recent non-experimental studies, which helps us consider tutoring impacts at a scale not captured by most RCTs. Finally, the studies had to have a sample size of more than 20 students when randomization occurred at the student level, or more than four classrooms or schools when randomization was at the classroom or school level.
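The eight study-level screens can be read as a single conjunction. The sketch below is one hedged way to express that logic; every field name is hypothetical, and judgment calls made by coders (for example, whether a program counts as tutoring) are reduced here to pre-coded booleans.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # All field names are hypothetical; they mirror the eight criteria in the text.
    non_parental_human_tutor: bool
    max_group_size: int            # 1 means one-on-one tutoring
    academic_focus: bool
    has_math_or_reading_test: bool
    k12_students: bool
    in_oecd_country: bool
    is_rct: bool
    randomization_level: str       # "student", "classroom", or "school"
    n_randomized_units: int


def meets_inclusion_criteria(s: Study) -> bool:
    """Return True only if a study passes all eight screens."""
    # Criterion 8: more than 20 students, or more than four classrooms/schools.
    min_units = 20 if s.randomization_level == "student" else 4
    return (
        s.non_parental_human_tutor
        and s.max_group_size <= 8
        and s.academic_focus
        and s.has_math_or_reading_test
        and s.k12_students
        and s.in_oecd_country
        and s.is_rct
        and s.n_randomized_units > min_units
    )


# Hypothetical example: a 3:1 reading program randomized across 60 students.
example = Study(True, 3, True, True, True, True, True, "student", 60)
print(meets_inclusion_criteria(example))
```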

We also applied inclusion criteria to the effects reported and coded all qualifying estimates from each study. First, the effect estimates had to examine the same outcome as the subject of the tutoring (i.e., we dropped estimates of the impact of math tutoring on reading achievement). Second, we focused on treatment-control contrasts that isolated tutoring whenever possible, dropping estimates where the control condition involved tutoring-like programs and comparisons between treatment arms without a pure no-tutoring control group. However, we included studies in which we judged tutoring to be a key element of a larger set of interventions and reforms that together were evaluated against a business-as-usual control group.³ Finally, we prioritized estimates from reduced-form models that define treatment as the offer of tutoring. We view these intent-to-treat estimates as the relevant impact for the types of inferences policymakers often make about what the effect of a program will be as implemented at scale.

Coding Procedures

Our research team of 20 coders double-coded each study in our sample. We trained coders on a common set of studies until they achieved a consistently high agreement rate with master codes created by our most experienced coders. After coding each study independently, coders then met to reconcile any differences and arrive at a final set of codes. When a pair of coders felt that the reconciliation was not straightforward, they brought questions to the principal investigators for a final determination. The team kept a record of decision rules that resulted from these meetings to ensure consistency across coders and over time.⁴

Our codebook included 128 codes that we grouped into five categories. Some codes varied at the study level, while others varied at the intervention or estimate level. The first group of codes cataloged study information such as publication type (e.g., journal article, working paper) and publication year. The second group tracked information about the context in which the study occurred, such as the country, school level, and participant demographics. The third set covered information about the intervention itself and the treatment/control contrast (e.g., student-tutor ratio, the intended dosage, tutor type). The fourth category was information on the methods used by the study’s authors, such as the level of assignment to treatment, whether standard errors were clustered at the appropriate level, and whether we had concerns about attrition or contamination of the randomization. The fifth set included information about the effects, including estimated effect sizes, standard errors, sample sizes, and outcome instruments.

We highlight one key code that we use throughout our analyses: the number of treated students. Prior meta-analytic reviews often explore how effect sizes vary by the total sample size of an evaluation. We take a somewhat different approach given our focus on identifying studies that are more closely aligned to a specific target of inference. We code the number of students randomly assigned to receive treatment as an estimate of the number of treated students.⁵

Calculating Effect Sizes

Study authors reported treatment effects in a variety of ways. Whenever they were available, we defaulted to relying on standardized effect sizes generated from linear regressions estimating standardized mean differences between the treatment and control group, often controlling for baseline covariates. One advantage of model-based estimates is that the associated standard errors typically account for the ways data may be clustered, as recommended by Hedges (2007). When these estimates (and/or their associated standard errors) were unstandardized, we standardized them using unadjusted pretreatment control group SD whenever possible (if unavailable, we used pooled SD). In other cases, we estimated a standardized effect size using the pre-post treatment means, SD, and sample sizes for the treatment and control groups. For each estimate, we then calculated a Hedges’ g effect size, correcting for upward bias present for small-sample studies (Borenstein et al., 2009) as follows:

g^{*} = (1 - \frac{3}{4 (n_{T} + n_{C}) - 9}) g

Here, $g^{*}$ is the corrected effect size estimate, $n_{T}$ is the number of treated units (i.e., students or classrooms), $n_{C}$ is the number of comparison units, and $g$ is the uncorrected effect size estimate.
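As a worked illustration of these two steps, the sketch below standardizes a raw mean difference and then applies the small-sample correction. The study values are hypothetical and the function names are ours, not part of any library.

```python
def standardize(raw_effect: float, sd: float) -> float:
    """Convert a raw mean difference to SD units."""
    return raw_effect / sd


def hedges_correction(g: float, n_t: int, n_c: int) -> float:
    """Apply the small-sample correction g* = (1 - 3 / (4(n_T + n_C) - 9)) * g."""
    return (1 - 3 / (4 * (n_t + n_c) - 9)) * g


# Hypothetical study: a 5-point raw gain, a control-group SD of 12.5,
# and 15 students in each arm.
g = standardize(5.0, 12.5)
g_star = hedges_correction(g, 15, 15)
print(round(g, 3), round(g_star, 3))
```

With 30 students in total, the correction factor is 1 − 3/111, which shrinks an uncorrected effect of .400 to roughly .389. With 1,000 students per arm the factor is 1 − 3/7,991, so the adjustment is negligible for the larger studies in the sample.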

Meta-Analytic Estimates

We generated our pooled standardized effect size estimates using a correlated and hierarchical effects (CHE) model as described by Pustejovsky and Tipton (2022). Like robust variance estimation (RVE) meta-analytic techniques (Hedges et al., 2010; Tanner-Smith & Tipton, 2014), the CHE approach upweights effects estimated with greater precision and allows for the nesting of estimates within clusters. This is important in our case, given that we often observe multiple estimates for a given study (for example, when there are multiple outcomes or interventions examined in a single study). However, the typical RVE approach requires researchers to choose between either a “hierarchical effects” (HE) or a “correlated effects” (CE) approach. HE models account for both between-study and within-study variation in effect sizes but assume that effect size estimates within the same study are independent. CE models account for the correlation between effect size estimates within studies but assume that there is no within-study variation in true effect size parameters. The CHE approach has the benefit of allowing for both between-study and within-study variation in true effect sizes while also accounting for correlated effect estimates within studies. We, therefore, fit the following CHE model:

T_{i j}^{k} = β_{0}^{k} + u_{j}^{k} + v_{i j}^{k} + e_{i j}^{k}

where $T_{i j}^{k}$ represents an impact estimate i on outcome k (either math, reading, or stacking subjects together in a single analysis) from study j. $β_{0}^{k}$ is the overall weighted average impact of tutoring on outcome k, $u_{j}^{k}$ is a study-level random effect, $v_{i j}^{k}$ is an effect size random effect, and $e_{i j}^{k}$ is the effect size estimate error.⁶ Variance components are estimated using restricted maximum likelihood (REML) estimation. Here, $Var(u_{j}) = τ^{2}$, $Var(v_{i j}) = ω^{2}$, $Var(e_{i j}) = s_{j}^{2}$, and $Cov(e_{h j}, e_{i j}) = ρ s_{j}^{2}$. Therefore, $τ^{2}$ is the between-study variation in study-average true effect sizes, $ω^{2}$ is the within-study variation in true effect sizes, and $s_{j}^{2}$ is the average (known) sampling variance in study j. Like CE, the CHE approach makes the simplifying assumption that there is a single, known correlation $ρ$ between pairs of effect size estimates from the same study that is constant across all studies, which we set to $ρ$ = .60 following Pustejovsky and Tipton (2022). In addition to pooled effect estimates and associated standard errors, we also report prediction intervals for select estimates to describe the degree of heterogeneity in our sample and to illustrate the range of plausible effects policymakers might expect (Borenstein et al., 2017).
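To make this variance structure concrete, the sketch below builds the marginal covariance matrix implied for the estimates from a single study, assuming the three error terms are mutually independent. Under that assumption each estimate has variance τ² + ω² + s² and any two estimates from the same study have covariance τ² + ρs². The function name and input values are hypothetical; in practice the model would be fit with dedicated meta-analysis software rather than assembled by hand.

```python
def che_covariance(k, tau2, omega2, s2, rho=0.60):
    """Marginal covariance of k effect estimates from one study (CHE working model)."""
    cov = []
    for h in range(k):
        row = []
        for i in range(k):
            if h == i:
                # Diagonal: between-study + within-study + sampling variance.
                row.append(tau2 + omega2 + s2)
            else:
                # Off-diagonal: shared study effect + correlated sampling error.
                row.append(tau2 + rho * s2)
        cov.append(row)
    return cov


# Hypothetical inputs for a study contributing three estimates.
V = che_covariance(k=3, tau2=0.10, omega2=0.30, s2=0.02)
for row in V:
    print([round(x, 3) for x in row])
```

Setting ω² to zero recovers the correlated effects structure, and setting ρ to zero recovers the hierarchical effects structure, which is why the CHE model nests both.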

Meta-Regression

Researchers have typically explored the relative importance of moderators by comparing the pooled effect sizes of tutoring programs with different program features. This approach is limited, however, because program features are often bundled and could be correlated with unobserved aspects of program quality (Tipton et al., 2023). We attempt to reduce these potential biases using meta-regressions to examine which moderators predict larger impact estimates, conditional on other study and program design features. We estimate the following CHE model:

T_{i j}^{k} = β_{0}^{k} + Γ X_{i / j} + u_{j}^{k} + v_{i j}^{k} + e_{i j}^{k}

Here, we include a vector of study and intervention features ( $Γ X_{i / j}$ ). While this model does not allow us to isolate the causal impact of a particular intervention feature on student outcomes, it does allow us to tease apart which of the observable study and intervention characteristics are driving the largest differences in effect size estimates. When possible, we complement these analyses with results from multi-arm RCTs that randomly assign students to tutoring programs that differ only by a single design feature. These studies provide credible causal estimates of the effects of specific program design features but are often underpowered to detect small-to-medium differences in effects produced by modifying only one aspect of a tutoring program.
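The idea of a moderator coefficient can be illustrated with a deliberately simplified stand-in: a precision-weighted least-squares regression of effect sizes on one binary moderator. This sketch omits the random effects $u_{j}$ and $v_{i j}$, the within-study correlation, and the additional covariates of the CHE meta-regression, so it is not the estimator used in the paper. All data values are hypothetical.

```python
def weighted_regression(y, x, w):
    """Weighted least squares of y on a single moderator x with weights w."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: effect sizes, a 0/1 moderator (1 = one-on-one tutoring),
# and sampling variances used to form inverse-variance weights.
effects = [0.45, 0.50, 0.20, 0.30]
one_on_one = [1, 1, 0, 0]
variances = [0.02, 0.04, 0.01, 0.03]
weights = [1 / v for v in variances]

b0, b1 = weighted_regression(effects, one_on_one, weights)
print(round(b0, 3), round(b1, 3))
```

With a single binary moderator, the intercept is the precision-weighted mean effect for the reference group and the slope is the difference between the two weighted group means. Adding further covariates is what lets the full model compare programs conditional on other study and design features.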

Target of Inference

Our aim is to draw inferences about tutoring programs that are most relevant to the target, context, outcomes, and scale of tutoring programs envisioned by policymakers in the U.S. and other similar high-income countries. Specifically, we hope to inform the expectations of leaders who are seeking to address overall declines and growing gaps in academic outcomes post-COVID by integrating tutoring into the K–12 public school system. We imagine that because leaders are being held accountable for results on statewide standardized exams that assess a broad set of skills covered by state content standards, policymakers will be more interested in tutoring impacts on statewide exams that measure general skills as opposed to assessments that measure narrower sets of skills or that are designed by researchers to align tightly with the focal content of the tutoring intervention.

A primary goal of our work is to inform efforts to expand access to tutoring programs. We therefore aim to draw inferences about reasonable expectations for the impacts of tutoring programs implemented at scale, as opposed to small-scale pilot programs. Throughout the paper, we present estimates of pooled effect sizes across four bins of program size: 0–99, 100–399, 400–999, and 1,000 or more students. Although there are likely many more small districts that might aim to serve fewer than 400 students, larger tutoring programs will serve a disproportionately greater number of students, making such programs a policy-relevant focus of our analysis.
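The four program-size bins can be assigned mechanically from the count of treated students. The sketch below is one minimal way to do so; the names `BIN_EDGES`, `BIN_LABELS`, and `size_bin` are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the second, third, and fourth bins named in the text.
BIN_EDGES = [100, 400, 1000]
BIN_LABELS = ["0-99", "100-399", "400-999", "1,000+"]


def size_bin(n_treated: int) -> str:
    """Map a count of treated students to one of the four program-size bins."""
    return BIN_LABELS[bisect_right(BIN_EDGES, n_treated)]


for n in (42, 100, 399, 400, 1000):
    print(n, size_bin(n))
```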

Results

Characteristics of Included Studies

Our final analytic sample includes 263 RCTs that evaluate 338 distinct tutoring interventions.⁷ We present characteristics of these studies at the study (RCT) level in Table 1. Our sample skews toward recent research, with almost two-thirds of included reports published in the years since 2009 and about 86% since 1999. Only five studies in our sample assess interventions implemented since the beginning of the pandemic, almost all of which provided remote tutoring, giving us limited power to disentangle virtual delivery from the pandemic context. More than three-fourths of the studies in our sample are academic journal articles. The modal study examined a tutoring program in an urban, public school setting.

Table 1

Study characteristics

	Sample mean (%)	n
Publication date
Published before 1980	3.80	10
Published in the 1980s	1.90	5
Published in the 1990s	7.98	21
Published in the 2000s	24.33	64
Published in the 2010s	50.19	132
Published in the 2020s	11.79	31
Publication type
Academic journal article	77.57	204
Research firm report	8.37	22
University-based research center report	1.90	5
Working paper	1.52	4
Dissertation	7.98	21
Other publication type	2.66	7
*Setting grade level
Lower elementary (K–2)	62.36	164
Upper elementary (3–5)	41.83	110
Middle and high (6–12)	12.17	32
Setting urbanicity
Urban setting	39.54	104
Suburban setting	4.94	13
Rural setting	4.94	13
Multiple urbanicities studied	18.63	49
Urbanicity unknown	31.94	84
Setting country
USA	80.99	213
International/OECD country	19.01	50
Treated student sample
0–99 treated sample	59.32	156
100–399 treated sample	29.66	78
400–999 treated sample	7.60	20
≥1,000 treated sample	3.42	9
Tutoring subject
English as a second language	1.90	5
Math	27.76	73
Reading	64.64	170
Multiple subjects	5.70	15
N studies	263

*Setting grade-level categories are not mutually exclusive.

Our sample reflects a substantial imbalance in the subject, grade-level, and size of tutoring programs evaluated in the literature, as illustrated by the evidence gap maps shown in Figures 1 and 2 (Polanin et al., 2023). Most of the studies assess literacy tutoring among early elementary school students (37%) and programs serving fewer than 100 students (62%). This concentration on small elementary reading programs is worth noting because if impacts differ across grade levels, subjects, or with program scale, pooled results based on our full sample may not be immediately generalizable to other program types. The imbalance of studies with some specific characteristics also limits the degrees of freedom available to estimate pooled effects in these subsamples and for these characteristics in our moderation analyses.

Figure 1.

Evidence gap map by school level and subject area.

Figure 2.

Evidence gap map by tutored student sample size and subject area.

We provide further details on the characteristics of the programs evaluated in each of these studies in Table 2. Most interventions were delivered in-person (97%), at school (86%), during school hours (76%), using a 2:1 student-tutor ratio or less (62%), and with a provided curriculum (89%). Although individual tutoring was the modal approach (46%), student-tutor ratios varied widely. We observe greater variation in design choices across the features of tutor type, intended dosage, and whether students were pulled out of class for tutoring.

Table 2

Intervention characteristics

	Sample mean (SD)	n
Virtual/in-person delivery
Tutoring online	3.25	11
Tutoring in-person	96.75	327
Where tutoring happens
Tutoring at school	85.80	290
Tutoring at home	1.18	4
Tutoring in multiple locations/other	2.66	9
Tutoring location unknown	10.36	35
When tutoring happens
Tutoring during school	76.33	258
Tutoring after school	6.51	22
Tutoring during vacation	.30	1
Multiple time windows/other	4.44	15
Timing unknown	12.43	42
Student-tutor ratio
1:1 student-tutor ratio	45.86	155
2:1 student-tutor ratio	16.27	55
3:1 student-tutor ratio	13.02	44
4:1 student-tutor ratio	14.20	48
≥5:1 student-tutor ratio	7.99	27
Ratio unknown	2.66	9
Tutor type
Tutored by teacher	17.46	59
Tutored by paraprofessional	17.16	58
Tutored by peer	9.47	32
Tutored by college/graduate student	15.98	54
Other tutor type	12.13	41
Tutor type unknown	27.81	94
*Intended Dosage (units specified)
Sessions per week	3.39 (1.30)
Hours per session	.61 (.38)
Hours per week	2.01 (1.54)
Weeks per year	16.29 (9.15)
Hours total	33.35 (31.66)
Curriculum provided
Yes	89.05	301
No	10.65	36
Unknown	.30	1
N interventions	338

*Intended dosage metrics are not binary variables and are not mutually exclusive. Standard deviations are reported in parentheses, where applicable. All other sets of variables are percentages.

Full Sample Estimates of Tutoring Impacts

Similar to prior tutoring meta-analyses, we find large, pooled effect sizes across our full sample of studies. As shown in Table 3, we estimate that the average effect on student achievement of a broad variety of tutoring interventions subjected to rigorous evaluation via RCTs is .40 SD when stacking math and reading achievement impacts. The prediction interval ranges from −.16 SD to .96 SD, illustrating the considerable heterogeneity of impacts we might expect across individual tutoring programs. This large average effect is driven, in part, by the pooled effect of literacy tutoring in lower elementary grades of .47 SD, which makes up a large portion of our sample (60%). Still, the pooled effects of tutoring on math achievement are also large (.39 SD). We find inconsistent patterns in pooled effects across grade levels by subject. Impacts of reading tutoring for elementary school students are substantially larger than for middle and high school students (.30 SD). In math, we find the largest effects at the upper elementary level (.46 SD), followed by middle and high school (.36 SD).
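As a rough guide to reading these intervals, the bracketed ranges reported in Table 3 appear to be reproduced, up to rounding, by the pooled effect plus or minus 1.96 times the square root of $τ^{2}$. That formula is an inference from the reported numbers rather than something stated in the text, and it ignores uncertainty in the pooled estimate itself, so the sketch below should be treated as an approximation.

```python
from math import sqrt
from statistics import NormalDist

Z_975 = NormalDist().inv_cdf(0.975)  # about 1.96


def prediction_interval(effect, tau2):
    """Approximate 95% prediction interval: effect +/- z * sqrt(tau^2)."""
    half_width = Z_975 * sqrt(tau2)
    return effect - half_width, effect + half_width


# Pooled achievement across pooled grades (Table 3, Panel C, column 4).
low, high = prediction_interval(0.404, 0.081)
print(round(low, 3), round(high, 3))
```

Because the inputs are themselves rounded to three decimals, the result lands within a few thousandths of the reported interval rather than matching it exactly. The same calculation explains the degenerate interval for lower elementary math, where the estimated $τ^{2}$ is zero.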

Table 3

Estimates pooled by grade level and tested subject

	Lower elementary	Upper elementary	Middle and high school	Pooled grades
	(1)	(2)	(3)	(4)
Panel A. Math achievement
Effect size	.327***	.462***	.364**	.390***
Standard error	(.038)	(.061)	(.101)	(.035)
$τ^{2}$	.000	.043	.127	.019
$ω^{2}$	.166	.265	.046	.215
Prediction interval	[.327, .327]	[.056, .868]	[−.333, 1.061]	[.118, .661]
Degrees of freedom	21.7	36.1	16	61.8
n	228	267	52	506
Panel B. Reading achievement
Effect size	.466***	.374***	.303***	.406***
Standard error	(.042)	(.068)	(.078)	(.034)
$τ^{2}$	.131	.188	.060	.105
$ω^{2}$	.308	.547	.215	.314
Prediction interval	[−.243, 1.174]	[−.475, 1.223]	[−.176, .782]	[−.228, 1.040]
Degrees of freedom	122.3	69.4	25.1	170.6
n	1,259	606	145	1,712
Panel C. Pooled achievement
Effect size	.435***	.410***	.334***	.404***
Standard error	(.035)	(.051)	(.064)	(.026)
$τ^{2}$	.104	.144	.074	.081
$ω^{2}$	.283	.460	.163	.289
Prediction interval	[−.196, 1.066]	[−.333, 1.154]	[−.200, .867]	[−.155, .963]
Degrees of freedom	149.4	103.8	36.9	232.7
n	1,487	873	197	2,218

Notes. Prediction intervals are included for each estimate in brackets; robust standard errors are reported in parentheses. Estimates may be included in more than one group if they treat students in multiple grade levels. Lower elementary indicates treatment in grades K–2; upper elementary indicates treatment in grades 3–5; middle school and high school indicate grades 6–12; high school indicates grades 9–12. Pooled achievement includes impact estimates for both math and reading subject tests; $τ^{2}$ is the between-study variation in study-average true effect sizes, and $ω^{2}$ is the within-study variation in true effect sizes.

*p < .10; **p < .05; ***p < .01.
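The prediction intervals in Table 3 can be approximately recovered from the reported point estimates and $τ^{2}$ values. The sketch below assumes a normal-approximation interval of the form ES ± 1.96·τ. This form is inferred from the reported numbers (for example, the interval collapses to a single point where $τ^{2}$ = .000) and is not a formula stated in the text, so small discrepancies due to rounding should be expected. The function name `prediction_interval` is ours.

```python
import math

def prediction_interval(effect_size, tau_sq, z=1.96):
    """Approximate 95% prediction interval as ES +/- z * sqrt(tau^2).

    Assumption: only between-study variation (tau^2) enters the interval.
    """
    half_width = z * math.sqrt(tau_sq)
    return effect_size - half_width, effect_size + half_width

# Reported values from Table 3, column (4): (effect size, tau^2)
table3_pooled_grades = {
    "math": (0.390, 0.019),
    "reading": (0.406, 0.105),
    "pooled": (0.404, 0.081),
}

for label, (es, tau_sq) in table3_pooled_grades.items():
    low, high = prediction_interval(es, tau_sq)
    print(f"{label}: [{low:.3f}, {high:.3f}]")
```

Under this assumption, the computed bounds fall within about .01 SD of the bracketed intervals in Table 3.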

Sensitivity Analyses

We next explore whether our pooled estimates are robust to a variety of sensitivity checks, as shown in Table 4. First, we examine whether results differ for studies that may have lower internal validity due to quality concerns with the randomization design or empirical analyses. For example, some authors described their methods as an RCT but indicated or intimated that students, teachers, parents, or administrators had some influence over whether a student ended up in the treatment or control group. Another example is when a sizable number of students were excluded from the analytic sample because of noncompliance, attrition, or a move. When we examine results separately for studies for which we did not have quality concerns, they remain essentially unchanged. When we omit the top and bottom 2.5% of effect size observations, the pooled effect size estimate drops only slightly to .37 SD, a decline that is largely driven by a reduction in pooled reading effects.⁸

Table 4

Estimates by RCT quality concerns, stacked subjects

	Lower elementary	Upper elementary	Middle & high school	Pooled grades
	(1)	(2)	(3)	(4)
Panel A. Studies with no RCT quality concerns
ES	.439***	.398***	.225***	.397***
SE	(.041)	(.055)	(.039)	(.029)
$τ^{2}$	.119	.154	.001	.083
DF	124.3	91.4	23.0	194.1
n	1,220	804	168	1,870
Panel B. Studies with an RCT quality concern
ES	.402***	.500***	.845**	.435***
SE	(.058)	(.119)	(.338)	(.062)
$τ^{2}$	.036	.084	.463	.088
DF	23.3	11.9	5.7	39.2
n	267	69	29	348
Panel C. Omitting top and bottom 2.5% of effect size observations
ES	.385***	.363***	.299***	.372***
SE	(.021)	(.032)	(.052)	(.019)
$τ^{2}$	.025	.048	.058	.037
DF	133.4	99.7	35.8	222.2
n	1,425	812	187	2,108
Panel D. Studies published prior to 2000
ES	.399***	.301*	.635*	.400***
SE	(.084)	(.146)	(.268)	(.080)
$τ^{2}$	.052	.262	.368	.118
DF	16.5	17.6	5.8	32.8
n	191	67	25	259
Panel E. Studies published between 2000 and 2009
ES	.526***	.569***	.702	.515***
SE	(.087)	(.152)	(.409)	(.075)
$τ^{2}$	.240	.401	.526	.226
DF	47.1	25.2	3.7	60.2
n	503	252	37	654
Panel F. Studies published between 2010 and 2019
ES	.382***	.361***	.220***	.360***
SE	(.030)	(.045)	(.048)	(.026)
$τ^{2}$	.014	.023	.018	.013
DF	60.4	46.2	15.8	101.0
n	676	486	112	1,113
Panel G. Studies published in 2020 and following
ES	.332***	.373***	.158**	.330***
SE	(.074)	(.095)	(.049)	(.053)
$τ^{2}$	.042	.027	.009	.035
DF	12.0	9.7	6.6	25.3
n	117	68	23	192

Notes. Estimates may be included in more than one group if students in multiple grade levels were treated. All cells pool across math and reading. Panels A and B split up the entire sample by whether we identified any concerns with the quality of the RCT. Panel C omits the top and bottom 2.5% of observations by effect size magnitude. DF = degrees of freedom; ES = effect size; SE = standard error. $τ^{2}$ is the between-study variation in study-average true effect sizes.

*p < .10; **p < .05; ***p < .01.
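The trimming step behind Panel C can be illustrated with a short sketch. It assumes that trimming is done by rank, dropping the lowest and highest 2.5% of effect size observations and rounding the count per tail down. The text does not spell out this rule, but it is consistent with the reported drop from 2,218 to 2,108 observations. The simulated effect sizes are purely illustrative and are not data from the studies.

```python
import random

def trim_extremes(effect_sizes, share_per_tail=0.025):
    """Drop the lowest and highest `share_per_tail` of observations by rank."""
    ordered = sorted(effect_sizes)
    k = int(len(ordered) * share_per_tail)  # observations removed from each tail
    return ordered[k:len(ordered) - k]

# Simulated (hypothetical) effect sizes, same count as the full sample
rng = random.Random(0)
simulated = [rng.gauss(0.4, 0.5) for _ in range(2218)]

trimmed = trim_extremes(simulated)
print(len(simulated), len(trimmed))
```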

Finally, we examine whether estimates vary by the decade in which they were published as a rough proxy for study quality. Education research has taken major leaps in terms of methodological rigor and quality standards over the past three decades, particularly in applying causal inference methods (Angrist, 2004).⁹ As shown in Table 4, we find substantial variation in the magnitude of the pooled impacts based on publication decade, with larger estimates prior to 2000 (.40 SD) and between 2000 and 2009 (.52 SD) than for those published between 2010 and 2019 (.36 SD). We observe somewhat smaller impacts for the most recent studies published in 2020 or later (.33 SD). We cannot definitively disentangle whether this variation in impacts is due to methodological changes, policy changes, or other study or program characteristic changes over time, but differences across decades remain even after we control for a host of study and program characteristics, as shown in Table 8.

What Expectations Should We Have for Tutoring Effects at Scale?

Evidence from our meta-analysis of experimental studies

In Table 5, we explore how our pooled effect size estimates change when we restrict our sample to more closely approximate our target of inference. Removing estimates that rely on assessments designed by the research team induces a .07 SD decline in our aggregate estimate.¹⁰ Removing the most extreme 5% of our point estimates only reduces our pooled estimate by .03 SD, while limiting to studies published in 2010 or later reduces our pooled estimate by .05 SD. However, restricting the sample to studies that provided tutoring to incrementally larger groups of students profoundly changes the magnitude of our estimates. Using our full sample, we find that programs offering tutoring to fewer than 100 students have a pooled effect size of .51 SD, whereas programs tutoring between 100 and 399 students have a pooled effect size of .30 SD. As shown in Figure 3, this estimate continues to decline, almost linearly, as we move to larger programs: those serving between 400 and 999 students and those serving 1,000 or more students have pooled effect sizes of .26 SD and .16 SD, respectively.

Table 5

Pooled effect size estimates overall and by treated student sample size

	No sample size restriction	0–99 treated students	100–399 treated students	400–999 treated students	≥1,000 treated students
	(1)	(2)	(3)	(4)	(5)
Panel A. Full analytic sample with no restrictions
ES	.404***	.506***	.302***	.258***	.155*
SE	(.026)	(.042)	(.029)	(.041)	(.072)
$τ^{2}$	.081	.136	.016	.011	.038
PI	[−.155, .963]	[−.217, 1.230]	[.053, .551]	[.054, .463]	[−.225, .535]
DF	232.7	141.9	64.3	15.8	7.6
n	2,218	1,400	638	112	68
Panel B. Independent tests only
ES	.328***	.427***	.256***	.216***	.155*
SE	(.020)	(.033)	(.026)	(.038)	(.072)
$τ^{2}$	.019	.039	.008	.009	.038
PI	[.055, .600]	[.042, .813]	[.083, .430]	[.033, .400]	[−.225, .535]
DF	184.6	112.0	59.9	15.8	7.6
n	1,805	1,083	558	97	67
Panel C. Omitting the top and bottom 2.5% of effect sizes
ES	.372***	.475***	.290***	.258***	.155*
SE	(.019)	(.026)	(.030)	(.041)	(.072)
$τ^{2}$	.037	.031	.032	.011	.038
PI	[−.004, .747]	[.131, .819]	[−.060, .640]	[.054, .463]	[−.225, .535]
DF	222.2	130.8	69.7	15.8	7.6
n	2,108	1,303	625	112	68
Panel D. Studies published in or since 2010
ES	.354***	.435***	.319***	.289***	.155*
SE	(.024)	(.038)	(.037)	(.045)	(.072)
$τ^{2}$	.018	.020	.019	.008	.038
PI	[.087, .620]	[.158, .712]	[.052, .586]	[.114, .463]	[−.225, .535]
DF	126.1	63.9	46.1	12.4	7.6
n	1,305	722	428	87	68

Notes. Prediction intervals are included for each estimate in brackets; robust standard errors are reported in parentheses. Each cell presents the Hedges’ g estimate, stacking both math and reading. Model (1) offers the pooled average impact of tutoring across the entire subsample indicated in each panel. Models (2) through (5) disaggregate the estimate in Model (1) by the tutored student sample size of each study. $τ^{2}$ is the between-study variation in study-average true effect sizes.

***

p < .10; **p < .05, *p < .01.

Figure 3.

Pooled estimated impacts of tutoring across program size and study characteristics.

When we focus on results for independent test outcomes only, presented in Table 5 Panel B, impacts range from .22 SD to .16 SD for tutoring programs in high-income countries operating at a scale of 400 to 999 and 1,000 or more students. There are four important points to highlight about this preferred set of estimates, which are closer to our target of inference. First, they are about 45% to 60% smaller than the pooled estimate using our full meta-analytic sample, suggesting that inferences made using the broader sample are not well-calibrated to tutoring programs at scale. Second, effect sizes between .16 SD and .22 SD are of medium-to-large magnitude and still very impressive for large-scale education interventions (Kraft, 2020). Third, our pooled effect size estimate for programs serving 1,000 students or more is very imprecisely estimated, given the limited number of RCTs of tutoring programs at this scale that meet our target-of-inference-aligned inclusion criteria. Fourth, the wide prediction intervals associated with these estimates suggest that we should expect tutoring program effects to vary considerably, with some individual programs producing quite small or even negative effects and others resulting in sizable gains. We further illustrate the robustness of this pattern of results by presenting point estimates for these different subsamples visually in Figure 3, where the overall pattern of declines at scale remains unchanged.
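The "45% to 60% smaller" comparison above is simple arithmetic on the reported estimates. A minimal sketch, using the rounded values quoted in the text (.40 SD for the full sample; .22 SD and .16 SD for the two largest program-size bins on independent tests); the function name `relative_decline` is ours:

```python
full_sample_es = 0.40  # pooled estimate, full meta-analytic sample

# Independent-test estimates for the two largest program-size bins (Table 5, Panel B)
at_scale = {
    "400-999 treated students": 0.22,
    "1,000+ treated students": 0.16,
}

def relative_decline(baseline, estimate):
    """Share by which `estimate` falls short of `baseline`."""
    return (baseline - estimate) / baseline

for label, es in at_scale.items():
    print(f"{label}: {relative_decline(full_sample_es, es):.0%} smaller")
```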

Evidence from large-scale, non-experimental studies

Meta-analytic reviews of the literature on tutoring frequently restrict their focus to studies that employ RCTs in an effort to ensure researchers are identifying the unbiased, causal effect of tutoring. This restriction strengthens the internal validity of the pooled effect sizes, but can sometimes limit researchers’ ability to study more representative samples with greater external validity (Tipton & Olsen, 2018). Large-scale RCTs are expensive and often require the active consent of participants, making them financially and logistically challenging to conduct. Our meta-analytic sample contains only nine studies that evaluate programs serving at least 1,000 students. These sparse data make it difficult to accurately project plausible effects from tutoring programs taken to scale in larger school districts, given a lack of common support.

We attempt to further inform our understanding of the plausible effects of tutoring by turning to non-experimental studies of large-scale programs (n treated ≥1,000). Much of the literature evaluating large-scale programs focuses on after-school tutoring provided by private tutoring organizations and funded by two federal initiatives, 21st Century Learning Centers and Supplemental Educational Services (SES) under the No Child Left Behind Act. Studies of these initiatives often evaluate programs across large districts and entire states with thousands of treated students and find effects that are notably smaller than those we find with our full meta-analytic sample (Deke et al., 2012; Heinrich et al., 2010, 2014; James-Burdumy et al., 2005; Ross et al., 2008; Springer et al., 2014; Zimmer et al., 2009, 2010). These small-to-medium effects (frequently ≤.10 SD) may be fully explained by poor attendance at these off-site after-school programs and their design features, such as large student-tutor ratios and rotating tutors. However, the scale of the programs may also have contributed to their underwhelming results by influencing program design choices and implementation quality.

Several non-experimental studies of large-scale programs from the post-COVID era provide more relevant assessments of ongoing attempts to integrate tutoring in the U.S. public school system at scale. Carbonari, Dewey, et al. (2024) evaluate the efforts of four mid- to large-sized districts to support students’ academic recovery in math during the 2021–22 academic year by providing tutoring and additional instructional time. Using a value-added framework controlling for prior test scores, they find estimates that are uniformly smaller than .04 SD and often precisely estimated null effects, likely due to challenges related to staffing and student attendance. The same research team finds similar results in an expanded analysis of tutoring and small-group instruction across eight districts during the 2022–23 academic year (Carbonari, DeArmond, et al., 2024). They document statistically insignificant estimates of the average effects of tutoring and small-group instruction of .03 SD in math and .07 SD in reading when pooling across tutoring programs that jointly served over 12,000 students.

Kraft et al. (2024) studied efforts to scale tutoring in Metro-Nashville Public Schools (MNPS) over the course of two and a half years to serve over 4,000 students by the spring of 2023. In contrast to the districts studied by Carbonari and colleagues, MNPS was largely successful at engaging students to attend tutoring frequently and staffing their program at scale by hiring their own teachers as tutors. Using an event study design, they find medium effects of tutoring on independent test scores in reading (.09 SD), but no effects on test scores in math, on average.

Two studies of high-impact tutoring in the District of Columbia explore the Office of the State Superintendent of Education’s (OSSE) efforts to scale tutoring by contracting with a diverse portfolio of tutoring organizations to serve over 5,000 students in 2022–23 and over 7,000 in 2023–24. Across both years of the program, the research team finds evidence that tutoring dosage increased and that tutored students were slightly more likely to attend school on tutoring days. However, comparisons of tutored students’ growth on interim and state achievement tests relative to students who did not receive tutoring suggest tutoring had very limited effects given estimates that are small in magnitude and of both negative and positive sign (Lu et al., 2025; Pollard et al., 2024).

Two recent evaluations of public tutoring programs implemented across the United Kingdom (UK) and in Victoria, Australia, also provide early evidence of post-pandemic tutoring impacts at large scales in high-income contexts. Both analyses used matching methods that included baseline test scores to reweight regression analyses, comparing the test-score gains of tutored students to comparison-group students in the third year of these tutoring programs. Government Social Research, an evaluation agency within the UK Civil Service, found small-to-medium effects of the UK National Tutoring Programme on math (.06 SD) and English achievement (.03 SD) among Key Stage 2 students (years 3–6), but no effects on the achievement of Key Stage 4 students (years 10–11) in either subject (Moore et al., 2024). The Victorian Auditor-General’s Office found no significant effects of the statewide tutor learning initiative on students’ achievement gains in math and reading across students in years 3 through 10 (Victorian Auditor-General’s Office, 2024). Together, these non-experimental studies of large-scale tutoring programs are consistent with a pattern of declining effects at scale.

Why Do Tutoring Effects Decline at Scale?

The phenomenon of declining effects when interventions are scaled is well-documented in education research (Cheung & Slavin, 2016; Kraft, 2020, 2023). Understanding why this pattern also exists for tutoring programs is critical to informing efforts to expand access to tutoring and maintain its effectiveness at scale. We posit and test four primary hypotheses that might explain this pattern, while recognizing that unmeasured confounders, including contextual factors such as district characteristics, may also be at play.

Hypothesis #1: Declining effects do not reflect a true phenomenon but are instead due to selective reporting, standardization techniques, and/or spillover

It is possible that the negative relationship between program effects and program size is a product of the research process rather than a real pattern of differential effects. First, such a pattern could be caused by selective reporting that is more acute among studies with smaller samples. Here, we define selective reporting as the phenomenon where studies that produce statistically insignificant results are less likely to result in academic publications. This could occur through multiple mechanisms, including researchers being less likely to write papers when they find null results, researchers making subjective modeling decisions that push preferred estimates over traditional significance thresholds (i.e., p-hacking), and journals being less likely to publish studies that find null results (i.e., publication bias). Of course, researchers could also be systematically designing studies of programs that are likely to have larger effects to also have smaller sample sizes, given that less statistical power is necessary to detect larger effects.

We explore potential bias in three ways given that no single test can definitively rule out publication bias (McShane et al., 2016). First, we produce funnel plots and conduct a trim and fill analysis (Duval & Tweedie, 2000) to assess the degree of symmetry of our point estimates around the meta-analytic mean. An imbalance in publications falling on either side of the vertical line at the center of the funnel plot would suggest potential bias and lead to additional studies being imputed to make the data more symmetric. We do this at both the individual effect-size level and at the study level by collapsing multiple effect sizes to account for the nested nature of the data. As shown in online Appendix Figure B1 and Table B2, we find no evidence of publication bias in our full sample of studies using this method. We then repeat these analyses after subsetting our data into studies with fewer than 100 treated students versus at least 100 treated students and find no evidence of differential publication bias among small-sample studies.

Second, we test for evidence of p-hacking bias by plotting the p-values from our sample of effect sizes and examining whether there is an excess mass of p-values just below conventional significance thresholds in these distributions, following the intuition of Brodeur et al. (2020).¹¹ A visual inspection of online Appendix Figure B2 reveals that the distribution of p-values is smooth across critical values for traditional significance thresholds in the full sample and in subsamples of smaller and larger sample studies. We then formally test for differential bunching below each conventional statistical threshold using a randomization test to examine whether p-values are binomial-distributed with equal probability around a given cutpoint. In Table 6, we show that we find little evidence of differential bunching of estimates with p-values just below the .05 and .10 significance thresholds. Only one of the six tests we run in our full sample, using three different bandwidths for each threshold, is marginally significant. We similarly find no compelling evidence of p-hacking among subsamples of small-scale studies or large-scale studies.

Table 6

Tests for significant differences in estimate mass across p-value thresholds

	p-Value threshold = .10			p-Value threshold = .05
Bandwidth	± .02	± .01	± .005	± .02	± .01	± .005
Panel A. Full sample
n estimates within bandwidth	116	54	32	179	91	47
Share significant estimates	.53	.50	.41	.56	.48	.45
One-sided p-value	.32	.55	.89	.05	.66	.81
Panel B. Studies with 0–99 treated students
n estimates within bandwidth	84	42	25	122	64	35
Share significant estimates	.52	.50	.44	.53	.42	.43
One-sided p-value	.37	.56	.79	.26	.92	.84
Panel C. Studies with 100+ treated students
n estimates within bandwidth	32	12	7	57	27	12
Share significant estimates	.53	.50	.29	.63	.63	.50
One-sided p-value	.43	.61	.94	.03	.12	.61

Notes. Here we present the likelihood of observing the number of significant p-values in our data at the 5% and 10% significance levels within the bandwidths .02, .01, and .005 around those thresholds. For each of these combinations, we isolate the subsample within the indicated bandwidth around the indicated threshold (“n estimates within bandwidth”), present the share of significant estimate p-values in that range (“Share significant estimates”), and calculate the likelihood of having at least that many significant estimates assuming a binomial distribution (“One-sided p-value”). We repeat this exercise for our full sample of estimates in Panel A and disaggregate according to treated student sample size in Panels B and C. All estimates pool across both math and reading subject areas.
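The logic of the one-sided p-values in Table 6 can be sketched with an exact binomial tail probability. This substitutes an exact binomial calculation for the randomization test described in the text, and it assumes that, absent bunching, an estimate within the bandwidth is equally likely to fall on either side of the threshold. The counts of significant estimates are back-calculated from the rounded shares in the table, so they may be off by one and the output only approximates the reported values. The function name `binomial_upper_tail` is ours.

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Panel A, threshold = .05, bandwidth = +/- .02: n = 179, share significant = .56.
# A rounded share of .56 is compatible with either 100 or 101 significant estimates.
n_within_bandwidth = 179
for n_significant in (100, 101):
    p_value = binomial_upper_tail(n_within_bandwidth, n_significant)
    print(f"k = {n_significant}: one-sided p = {p_value:.3f}")
```

A small one-sided p-value would indicate more estimates just below the threshold than an even split would predict.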

Our final test of selective reporting is to compare pooled effect sizes between academic journal publications and other types of studies, such as working papers or reports.¹² If selective reporting were occurring because journals have been less likely to publish nonsignificant findings, we would expect to see larger average estimates from academic journal articles than from studies not published in academic journals. In Table B3, we show that this is indeed the pattern we find. Specifically, we observe an average pooled effect from studies in academic journals of .44 SD versus .24 SD for studies not published in academic journals. We also check for differences by the number of citations a study has received per year since its publication, regardless of publication type.¹³ These results are consistent with greater interest within the academy in larger-magnitude impacts, with a .34 SD pooled effect size for the bottom 50% of studies by citations compared with a .44 SD pooled effect size for the top 50%.

We interpret these results with caution, especially given that academic journal status is correlated with other factors such as publication date. Our sample of non-academic-journal studies skews more recent, and we know that more recent studies have demonstrated smaller pooled effects. These results are therefore not proof positive of selective reporting, but are consistent with that possibility. In sum, we find mixed evidence on whether selective reporting could explain the pattern of declining pooled effects for programs implemented at a greater scale.

A second possible statistical explanation for the differential pattern of effect sizes across smaller and larger tutoring programs is the standardization process. Tutoring programs typically target students in a specific range of the performance distribution. We find that 94% of the studies we coded describe some type of effort to target students, with 89% of studies evaluating programs that specifically targeted low-performing students. As Fitzgerald and Tipton (2025) document, this targeting results in samples recruited to participate in RCTs being more homogeneous than the population as a whole.¹⁴ Targeted sampling reduces the variation in achievement among the study sample, artificially inflating the magnitude of the effect sizes when researchers standardize their outcome measure using sample-based estimates of its standard deviation. It is possible, if not likely, that the overall effect sizes from meta-analyses of tutoring are somewhat inflated because of this practice. This may also help to explain the pattern of attenuated effects we find if smaller-scale tutoring programs are able to more precisely target students, resulting in even more homogeneous participant populations compared to larger-scale programs. Said another way, the pattern of declining effects by program size might be less pronounced if all studies had used an estimate of the SD of their test score outcome derived from nationally representative populations.
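A small numerical sketch may help make the standardization argument concrete. All numbers below are hypothetical and chosen only for illustration: the same raw test-score gain is divided first by a population SD and then by the smaller SD of a targeted, more homogeneous sample.

```python
def standardized_effect(raw_gain, sd):
    """Effect size in SD units: raw gain divided by the chosen standard deviation."""
    return raw_gain / sd

raw_gain = 4.0             # hypothetical gain in test-score points
population_sd = 20.0       # hypothetical SD in a nationally representative population
targeted_sample_sd = 14.0  # hypothetical SD in a targeted, more homogeneous sample

es_population = standardized_effect(raw_gain, population_sd)
es_sample = standardized_effect(raw_gain, targeted_sample_sd)
inflation = es_sample / es_population  # equals population_sd / targeted_sample_sd

print(f"population-based ES: {es_population:.2f} SD")
print(f"sample-based ES:     {es_sample:.2f} SD")
print(f"inflation factor:    {inflation:.2f}")
```

The inflation factor depends only on the ratio of the two SDs, so programs that target more narrowly would show larger sample-standardized effects for the same raw gain.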

Finally, the presence of peer spillover effects could contribute to a differential pattern of tutoring effects by program size. A large body of evidence documents peer effects in K–12 education settings (Barrios-Fernandez, 2023). If being in the same class or school as a student receiving tutoring has positive spillover effects on nontutored students and the magnitude of these effects increases with the concentration of treated students in a class or school, then larger-scale tutoring programs could differentially attenuate the treatment-control contrast and contribute to the pattern of declining effects we find. However, it is not obvious that this would happen in practice, given that the concentration of treated students per class or school could be similar across smaller and larger programs if larger programs simply serve more schools.

Hypothesis #2: Scaling causes programs to systematically alter key design features

A second potential explanation for declining effects with scale is that leaders systematically change the design of tutoring programs for larger- versus smaller-scale interventions. To assess the evidence for this hypothesis, we first explore how key program features change as programs are taken to scale. Table 7 reveals two systematic differences in program design features between smaller and larger programs. First, larger programs are substantially less likely to tutor students individually. Programs serving 400 or more students are roughly 10 percentage points less likely to rely on 1:1 student-tutor ratios than small programs that serve fewer than 100 students. Second, larger programs aim to deliver less dosage, primarily by reducing the number of weeks tutoring programs run. Here, the relationship is not entirely monotonic, with the smallest tutoring programs offering moderate dosage, middle-sized tutoring programs offering the highest total dosage, and the largest tutoring programs offering the least. For example, on average, programs that serve 100 to 399 students scheduled 39 total tutoring hours, while those serving 1,000 or more students scheduled 27 total hours. Unexpectedly, we see that larger programs are even slightly more likely to use teachers and paraprofessionals as tutors and to provide a high degree of supervision and support to tutors, characteristics hypothesized to promote larger effects.

Table 7

Intervention characteristics across treated student sample size

	Treated student sample size
	0 to 99	100 to 399	400 to 999	≥1,000
Virtual/in-person delivery
Tutoring online	.52	6.36	.00	23.08
Tutoring in-person	99.48	93.64	100.00	76.92
Where tutoring happens
Tutoring at school	86.46	80.91	95.65	100.00
Tutoring at home	.52	2.73	.00	.00
Tutoring in multiple locations/other	2.60	3.64	.00	.00
Tutoring location unknown	10.42	12.73	4.35	.00
When tutoring happens
Tutoring during school	77.08	71.82	86.96	84.62
Tutoring after school	6.25	9.09	.00	.00
Tutoring during vacation	.52	.00	.00	.00
Multiple time windows/other	3.12	6.36	8.70	.00
Timing unknown	13.02	12.73	4.35	15.38
Student-tutor ratio
1:1 student-tutor ratio	49.48	41.82	39.13	38.46
2:1 student-tutor ratio	11.98	21.82	13.04	38.46
3:1 student-tutor ratio	11.98	15.45	17.39	.00
4:1 student-tutor ratio	14.06	16.36	13.04	.00
≥5:1 student-tutor ratio	9.90	2.73	8.70	23.08
Ratio unknown	2.60	1.82	8.70	.00
Tutor type
Tutored by teacher	15.10	19.09	21.74	30.77
Tutored by paraprofessional	15.10	18.18	26.09	23.08
Tutored by peer	14.06	1.82	4.35	15.38
Tutored by college/grad student	18.75	14.55	4.35	7.69
Other tutor type	7.29	16.36	26.09	23.08
Tutor type unknown	29.69	30.00	17.39	.00
Intended dosage
Sessions per week	3.31(1.35)	3.46(1.09)	3.55(1.58)	3.58(1.68)
Hours per session	.61(.44)	.60(.29)	.53(.23)	.72(.39)
Hours per week	1.98(1.82)	2.04(1.06)	1.90(1.15)	2.30(1.33)
Weeks per year	14.96(9.74)	18.61(7.90)	18.48(8.26)	13.71(3.30)
Hours total	30.20(31.74)	38.92(31.09)	36.85(35.10)	27.00(11.84)
n interventions	192	110	23	13

Notes. Except for intended dosage variables, all measures are percent-scaled 0 to 100. Intended dosage variable units are indicated, with standard deviations presented in parentheses.

As another test, we examine whether the negative relationship between program effects and size is attenuated when we control for the full range of observable program characteristics in a meta-regression framework. We do this by comparing the results of a sequence of meta-regressions. The first model shown in Table 8 reports coefficients from binned sample size indicators, which capture the clear negative relationship relative to the omitted category of studies evaluating small programs serving fewer than 100 students. Adding controls for study design features in Column 2 leaves this pattern unchanged. Further adding our full set of controls for program characteristics in Column 3 makes little difference, suggesting that program features are not a primary driver of declining effects of tutoring at scale.

Table 8

Meta-regression controlling for study and intervention features

	(1)			(2)			(3)
	$β$	SE	DFs	$β$	SE	DFs	$β$	SE	DFs
100–399 treated sample (ref. 0–99)	−.199***	(.051)	148.2	−.169***	(.048)	124.8	−.151***	(.050)	100.9
400–999 treated sample (ref. 0–99)	−.251***	(.057)	22.0	−.216***	(.054)	23.6	−.167***	(.056)	27.4
≥1,000 treated sample (ref. 0–99)	−.323**	(.094)	7.3	−.307**	(.104)	8.6	−.350**	(.139)	12.2
Published in 2000s (ref. pre-2000)				.141	(.105)	57.2	.109	(.117)	53.8
Published in 2010s (ref. pre-2000)				.051	(.079)	47.2	.019	(.091)	45.6
Published in 2020s (ref. pre-2000)				.061	(.087)	60.9	.044	(.106)	60.5
Flag for poor RCT quality				.051	(.064)	54.5	.036	(.062)	56.1
Lower elementary math (ref. LE reading)				−.083	(.072)	40.3	−.113	(.079)	45.5
Upper elementary reading (ref. LE reading)				−.115*	(.062)	56.6	−.115*	(.061)	51.5
Upper elementary math (ref. LE reading)				−.030	(.071)	61.6	−.068	(.082)	65.1
Middle and high school reading (ref. LE reading)				−.124	(.078)	19.8	−.106	(.078)	19.4
Middle and high school math (ref. LE reading)				−.043	(.086)	20.2	.009	(.086)	21.9
Intervention assessment (ref. independent test)				.197***	(.059)	48.5	.172***	(.052)	48.0
OECD country (ref. USA)				.116	(.124)	58.0	.080	(.108)	57.2
Tutoring delivered online (ref. in-person)							−.050	(.120)	8.5
Tutoring at multiple locations/other (ref. at school)							−.022	(.100)	15.4
Tutoring location missing (ref. at school)							.109	(.147)	33.5
Curriculum not provided							−.044	(.089)	38.0
Tutoring outside of school hours (ref. during school)							−.155***	(.056)	32.6
Tutoring timing missing (ref. during school)							.071	(.091)	40.8
2:1 student-tutor ratio (ref. 1:1)							.059	(.048)	22.5
3:1 student-tutor ratio (ref. 1:1)							−.062	(.073)	42.9
4:1 student-tutor ratio (ref. 1:1)							−.078	(.064)	54.5
≥5:1 student-tutor ratio (ref. 1:1)							.205	(.177)	27.7
Ratio missing (ref. 1:1)							−.147	(.089)	8.2
Tutored by paraprofessional (ref. teacher)							−.057	(.062)	43.7
Tutored by K–12 peer (ref. teacher)							−.111	(.088)	33.9
Tutored by college/graduate student (ref. teacher)							−.082	(.089)	78.7
Other tutor type (ref. teacher)							−.185***	(.069)	61.5
Tutor type missing (ref. teacher)							.005	(.072)	78.9
Total dosage 0–14 hours (ref. ≥60 hours)							.139	(.108)	90.5
Total dosage 15–29 hours (ref. ≥60 hours)							.181**	(.074)	71.9
Total dosage 30–44 hours (ref. ≥60 hours)							.147*	(.076)	43.6
Total dosage 45–59 hours (ref. ≥60 hours)							.128	(.076)	32.2
Total dosage missing (ref. ≥60 hours)							.134*	(.076)	58.8
Constant	.502***	(.042)	137.7	.409***	(.073)	37.0	.366***	(.126)	55.0
$τ^{2}$	.073			.065			.056
n	2,218			2,218			2,218

Notes. Standard errors are presented in parentheses. For each model, we present the moderator coefficient (“$β$”), the standard error of that estimate (“SE”), and the moderator’s small-sample corrected t-distribution degrees of freedom (“DF”). $τ^{2}$ is the between-study variation in study-average true effect sizes. LE = Lower Elementary.

*p < .10; **p < .05; ***p < .01.

Hypothesis #3: Heterogeneous tutoring effects cause the marginal student to benefit less as tutoring programs expand

The attenuation of tutoring effects as program sizes increase may also be a product of the heterogeneous effects of tutoring across students. Prior research has found that tutoring may be more effective for students who are lower-performing (Kraft, 2015; Robinson et al., 2024), Black (Fryer & Howard-Noveck, 2020), and from low-income families (Carlana & La Ferrara, 2024). It is plausible that smaller-scale tutoring programs appear more effective because they better target students who stand to benefit the most. As tutoring programs scale, they may be expanding to serve students who will benefit less, on average.

We explore this by comparing weighted averages of student characteristics in our sample of RCTs, disaggregating by the size of the tutoring program, in Table 9. This comparison reveals a clear pattern where studies of smaller tutoring programs serve larger percentages of historically marginalized students. Students in smaller tutoring programs were 13 percentage points more likely to be English learners, 11 percentage points more likely to receive special education services, 8 percentage points more likely to be from low-income backgrounds, and 5 percentage points more likely to be Hispanic. These sizable differences in the characteristics of students served by smaller and larger tutoring programs are likely to attenuate the estimated effects of tutoring as program scale increases.

Table 9

Student sample characteristics by size of tutored student sample

	Treated student sample size
	0 to 99	≥100
Student Demographics
% Asian	2.50	2.42
% Black	31.30	33.14
% Hispanic/Latinx	29.65	25.02
% Native American	1.77	.52
% Multiracial	.60	2.28
% White	28.80	35.93
% Other	6.25	7.28
% Free/reduced-price lunch	72.83	64.76
% Special education	28.80	17.91
% English language learners	30.70	17.40
Program Targeted to Certain Students
Any targeting	92.71	95.89
Targets low performers	88.54	85.62
Targets ELLs	8.33	.68
Targets underserved students	18.75	21.92
Targets socioemotional problems	3.12	3.42
n interventions	192	146

Notes. Means are taken at the intervention level. All variables presented range from 0 to 100. ELLs = English language learners.
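The percentage-point gaps cited in the text can be reproduced from the group means in Table 9. The following is a minimal sketch in Python; the dictionary and function names are illustrative assumptions rather than part of the authors' analysis, and the inputs are the rounded table entries.

```python
# Intervention-level means (percent of students) copied from Table 9.
# "small" = 0 to 99 treated students; "large" = 100 or more treated students.
small = {
    "English language learners": 30.70,
    "Special education": 28.80,
    "Free/reduced-price lunch": 72.83,
    "Hispanic/Latinx": 29.65,
}
large = {
    "English language learners": 17.40,
    "Special education": 17.91,
    "Free/reduced-price lunch": 64.76,
    "Hispanic/Latinx": 25.02,
}


def percentage_point_gaps(small_means, large_means):
    """Return small-minus-large differences in percentage points."""
    return {
        name: round(small_means[name] - large_means[name], 2)
        for name in small_means
    }


table9_gaps = percentage_point_gaps(small, large)
for name, gap in table9_gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

Rounded to whole numbers, these differences correspond to the 13, 11, 8, and 5 percentage-point gaps described in the text.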

A related possibility that we cannot directly test with our data is that smaller programs treat student populations that are more homogeneous. Homogeneity may make implementation easier because there is less of a need to tailor interventions to a variety of student achievement levels or other unique needs. Expanding tutoring programs might mean programming is provided to a more diverse group of students with a wider set of challenges, making it more difficult to produce large impacts among increasingly heterogeneous groups.

Hypothesis #4: Implementation quality declines as tutoring programs scale

A final hypothesis for why we observe smaller impacts for larger programs is that the quality of program implementation declines as tutoring programs are brought to scale. Imagine, for example, two tutoring programs with the exact same intended program design features, but one serves a small number of students at a single school, and the other is brought to scale district-wide. Intended dosage may be identical across programs, but the actual delivered dosage may decline at scale if student attendance suffers or time-on-task declines. Administrative needs are likely higher for the large-scale program. Small programs may be more likely to represent pilot efforts led by uniquely trail-blazing, motivated, and talented leaders, whereas administrators recruited to run large programs may not be as effective, on average. Implementation quality could also suffer if the effectiveness of the average tutor is lower for larger programs than for smaller programs. However, if tutor screening tools are only weakly related to tutor performance, then tutor quality may not decline with scale (Davis et al., 2017). It may be more challenging to coordinate communication between tutors and teachers for large-scale programs. There may simply be less oversight with a greater number of tutoring sites, making it more difficult to ensure fidelity of implementation to program models for larger interventions.

Unfortunately, most tutoring RCTs do not directly measure implementation quality, limiting what we can say about this hypothesis using our meta-analytic dataset. For example, our coding reflects intended measures of dosage rather than the actual number of total hours of tutoring that treated students received. However, survey data and several recent studies on post-COVID tutoring efforts do point to significant implementation challenges. To start, on the nationally representative SPP survey, the majority of K–12 public school principals report experiencing barriers (e.g., funding, timing, or staffing challenges) that limited their ability to effectively provide tutoring (National Center for Education Statistics, 2024). The aforementioned “Road to Recovery” (R2R) evaluation of large-scale academic recovery efforts by Carbonari, Dewey, et al. (2024) documents how districts fell well short of leaders’ expectations with regard to both the number of students served and the actual dosage of the interventions. This is consistent with SPP survey results showing that, among schools that provided tutoring, larger schools had somewhat lower student participation rates (National Center for Education Statistics, 2024). Tutoring programs cannot be effective at scale when little actual tutoring happens.

Staff buy-in has also been identified as a problem. Programs that appeared to successfully scale high-quality tutoring after the pandemic emphasized the importance of district-level leadership, goal setting, buy-in from school leaders and teachers, a willingness to rethink scheduling, the pursuit of multiple funding sources, and the ability to make difficult choices about spending trade-offs (Cohen, 2024). Leaders in the R2R districts highlighted staffing challenges related to pandemic surges, a tight labor market, and limited district capacity for recruitment and human resources management. These issues of staffing challenges and organizational capacity are echoed by findings from a qualitative study on programs in two urban districts (Makori et al., 2024). Implementation challenges do not appear to be solely a function of acute post-pandemic conditions, as the R2R team’s follow-up report from 2022–23 revealed similar difficulties (Carbonari, DeArmond, et al., 2024). Leaders interviewed for the R2R report pointed to their need to adapt tutoring program designs, sometimes departing from best practices, to align with federal, state, and local policies. This is likely to remain a challenge as schools and districts look to a range of federal, state, and local funding sources to support tutoring programs after the COVID-relief funding runs dry (Accelerate, 2023; Cohen, 2024).

How Do the Aim, Format, and Intended Dosage of Tutoring Programs Moderate Their Effects?

Understanding how tutoring program effects vary based on their aim, format, and intended dosage is paramount for improving specific program designs and avoiding broad generalizations when program effects vary considerably. We explore how the pooled effect sizes described previously vary across a range of moderators. We group these moderators into three broad buckets:

Aim	Format	Intended Dosage
• School level	• Student-tutor ratio	• Frequency
• Subject taught	• Delivery in-person or virtual	• Duration
• Scale of program	• Where tutoring occurs
• Curriculum focus	• When tutoring occurs
		• Tutor type

Results from our meta-analytic regression in Table 8 reveal that only a few study features and program characteristics appear systematically related to effect sizes when included in our fully controlled meta-analytic model. Intervention tests produce meaningfully larger effect sizes relative to independent tests (.17 SD). Tutoring outside of the school day has a strong negative association with effect sizes relative to tutoring during the school day (−.16 SD). We similarly find a negative association with using a specified tutor type not in our major categories, relative to a teacher (−.18 SD), although this group mostly consists of community members and/or volunteers. Results for total intended dosage do not show a monotonic relationship with impacts, suggesting intended dosage may only be weakly related to actual dosage.

Like so many complex interventions, the efficacy of tutoring programs may lie in the combination of program design features rather than any single characteristic. Prior literature has focused on a bundle of program features that research suggests are associated with larger effects, aligned with what is sometimes described as “high-quality,” “high-dosage,” or “high-impact” programs (e.g., Robinson et al., 2021). This bundle of features includes in-person programming, delivered at school during school hours, with a student-tutor ratio of no more than 3:1, meeting at least three times per week, ensuring a high overall dosage of intended tutoring (which we proxy for with at least 15 hours of total tutoring), and using a provided curriculum.¹⁵

When we test whether the combination of these features is greater than the sum of their parts, we find encouraging results, reported in Table 10. Specifically, the overall pattern of declining effect sizes persists among tutoring programs that utilize a bundled package of recommended design features, but the attenuation at scale is much less pronounced. When we isolate only individual features of this bundle, effects continue to erode to varying degrees as programs scale (Appendix Table B4, online). As shown in Figure 4, while the pooled effect among studies of programs serving between 100 and 399 students declines by 40% relative to programs serving 99 students or fewer in the full sample, it declines by only 9% in the restricted sample of studies with the bundled package of design features. The decline among programs serving 400–999 students is also less pronounced, dropping 49% in the full sample and 29% in the bundled package sample. When we restrict our analysis to subsamples of studies using independent test measures, excluding outlier observations, or limiting to those published after 2009, we continue to see limited attenuation of program effects across sample size, at least for programs serving fewer than 1,000 students.¹⁶
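The relative declines for the bundled-package sample can be approximated from the rounded Panel A estimates in Table 10. The sketch below is illustrative only: the variable and function names are assumptions, and because the inputs are rounded to three decimals, the results may differ by about a percentage point from figures computed on unrounded estimates.

```python
# Pooled effect sizes (in SD units) from Table 10, Panel A,
# keyed by the number of treated students.
panel_a = {"0-99": 0.456, "100-399": 0.414, "400-999": 0.320}


def pct_decline(baseline, value):
    """Percent decline of `value` relative to `baseline`."""
    return 100 * (baseline - value) / baseline


# Express each larger size band as a decline from the smallest programs.
declines = {
    band: pct_decline(panel_a["0-99"], es)
    for band, es in panel_a.items()
    if band != "0-99"
}

for band, decline in declines.items():
    print(f"{band} treated students: {decline:.1f}% below the 0-99 estimate")
```

The same function could be applied to the full-sample estimates to obtain the comparison figures reported in the text.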

Table 10

Estimates for programs that combine best practices, stacked subjects

	No sample size restriction	0–99 treated students	100–399 treated students	400–999 treated students
	(1)	(2)	(3)	(4)
Panel A. Subsample of programs in-person, at school, during school, with a ratio no more than 3:1, provided curricula, meeting ≥3 times per week, ≥15 hours of intended dosage
ES	.420***	.456***	.414***	.320***
SE	(.035)	(.058)	(.052)	(.064)
$τ^{2}$	.035	.064	.014	.011
DF	61.4	35.7	20.2	5.4
n	730	474	214	42
Panel B. Subsample of programs in-person, at school, during school, with a ratio no more than 3:1, provided curricula, meeting ≥3 times per week, ≥15 hours of intended dosage, independent test outcomes
ES	.387***	.434***	.363***	.313***
SE	(.034)	(.059)	(.046)	(.069)
$τ^{2}$	.027	.062	.011	.013
DF	56.4	33.7	18.7	5.2
n	610	393	181	36
Panel C. Subsample of programs in-person, at school, during school, with a ratio no more than 3:1, provided curricula, meeting ≥3 times per week, ≥15 hours of intended dosage, omitting top and bottom 2.5% of effect sizes
ES	.418***	.457***	.406***	.320***
SE	(.032)	(.049)	(.052)	(.064)
$τ^{2}$	.029	.034	.029	.011
DF	60.0	33.5	21.2	5.4
n	707	458	207	42
Panel D. Subsample of programs in-person, at school, during school, with a ratio no more than 3:1, provided curricula, meeting ≥3 times per week, ≥15 hours of intended dosage, published in or since 2010
ES	.412***	.399***	.423***	.402***
SE	(.041)	(.079)	(.059)	(.047)
$τ^{2}$	.024	.059	.014	.000
DF	36.4	15.7	16.3	3.2
n	437	248	168	21

Notes. All cells stack estimates for math and reading. Column (1) presents the pooled meta-analytic estimated effect for the subsample of studies described in each panel. Columns (2) through (4) disaggregate the estimate in Column (1) according to the tutored student sample size. Panel A is limited to the described subsample of programs sharing a set of best practices in their designs. Panel B is restricted to studies with independent test outcome measures. Panel C drops the top and bottom 2.5% of effect sizes from the whole sample (“no outliers”), and Panel D excludes studies published prior to 2010. DF = degrees of freedom; ES = effect size; SE = standard error.

*p < .10; **p < .05; ***p < .01.

Figure 4.

Gaps between pooled estimates for all studies compared to those using a bundle of tutoring best practices.

What Does the Research Suggest About Modifying Program Design Features to Reduce Costs and Increase Scalability?

Although the bundled package of program features appears to help sustain program effectiveness at scale, several aspects are costly and can be difficult to implement at scale. Here, we explore the potential implications of modifying specific program features.

Moving tutoring online: Many districts and programs have adopted online tutoring to access a larger potential supply of tutors. How might this affect the efficacy of tutoring? When we limit our sample to the 59 estimates of tutoring delivered virtually (drawn from six unique studies), the pooled estimate reported in Table 11a is .08 SD.¹⁷ This is substantially smaller than the unadjusted pooled estimate of in-person program impacts of .41 SD and even substantially smaller than our preferred pooled estimates of expected impacts for our target of inference, .16 to .22 SD (Table 5). However, our sample of virtual programs is small and offers very limited degrees of freedom (1.9), so we caution against over-interpreting these differences. Additionally, results from our meta-analytic regressions presented in Table 8 suggest that these smaller effects are likely driven by other program features. Conditional on our extensive set of codes for observable program features, we estimate a small negative and statistically insignificant coefficient (−.05 SD) when comparing virtual tutoring programs to those in person.

Table 11

a. Average effect sizes across different program design features, stacked subjects

	Delivery mode		Student-tutor ratio
	In-person	Virtual	1:1 ratio	2:1 ratio	3:1 ratio	4:1 ratio	≥5:1 ratio
	(1)	(2)	(3)	(4)	(5)	(6)	(7)
Panel A. Full analytic sample with no restrictions
ES	.410***	.078*	.386***	.395***	.338***	.304***	.723***
SE	(.027)	(.023)	(.028)	(.048)	(.058)	(.046)	(.209)
$τ^{2}$	.081	.000	.025	.018	.063	.004	.578
DF	227.8	1.9	104.6	30.1	30.6	24.7	20.2
n	2,159	59	1,059	290	311	367	154
Panel B. Independent test outcomes only
ES	.336***	.061**	.378***	.331***	.265***	.274***	.216***
SE	(.020)	(.010)	(.031)	(.046)	(.041)	(.047)	(.062)
$τ^{2}$	.016	.000	.031	.015	.016	.000	.000
DF	180.9	1.6	97.6	26.0	24.2	21.5	10.6
n	1,754	51	850	241	278	326	78

Notes. Each column isolates a subsample of effects according to tutoring program characteristics. All estimates stack math and reading. DF = degrees of freedom; ES = effect size; SE = standard error.

*p < .10; **p < .05; ***p < .01.

b. Average effect sizes across different program design features, stacked subjects

	Tutor type				Intended dosage of tutoring hours
	Teacher	Paraeducator	College/graduate student	K–12 peer	0–14 hours	15–29 hours	30–44 hours	45–59 hours	≥60 hours
	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)
Panel A. Full analytic sample with no restrictions
ES	.409***	.402***	.400***	.315***	.494***	.445***	.397***	.317***	.224***
SE	(.057)	(.040)	(.058)	(.076)	(.084)	(.044)	(.050)	(.063)	(.038)
$τ^{2}$	.106	.014	.039	.059	.222	.013	.043	.042	.016
DF	46.6	33.9	36.3	23.7	52.6	49.3	41.2	17.7	26.4
n	494	363	325	122	481	520	399	155	340
Panel B. Independent test outcomes only
ES	.322***	.389***	.325***	.321***	.315***	.403***	.353***	.300***	.205***
SE	(.045)	(.039)	(.058)	(.086)	(.045)	(.047)	(.043)	(.065)	(.032)
$τ^{2}$	.045	.007	.035	.068	.010	.005	.005	.043	.005
DF	38.6	32.9	32.3	19.4	32.3	43.6	31.2	16.8	22.3
n	435	325	234	102	323	432	339	126	309

*p < .10; **p < .05; ***p < .01.

Further evidence of the efficacy of online tutoring comes from a novel study design where students were randomized to receive literacy tutoring either in-person or online, with tutors fully crossed across conditions. The authors find no statistically significant difference in the achievement growth of students who were tutored online versus in person, although tutors report feeling more connected to the students they tutored in-person (Hashim et al., 2025). Several new studies of virtual tutoring released in 2024 and 2025 find positive estimates that are similar to or somewhat larger than the magnitude of our pooled estimate for virtual programs (Carlana & La Ferrara, 2024; Neitzel & Storey, 2024; Ready et al., 2024), with another study finding null or negative results (Huffaker et al., 2025). We read this evidence as suggestive that online tutoring has the potential to be an effective approach to addressing scaling challenges when accompanied by other effective program design characteristics.

Increasing student-tutor ratios: The cost of tutoring is driven largely by tutor compensation. Many districts and tutoring organizations have chosen to increase student-tutor ratios as a means of expanding access while managing costs. In Table 11a, we report pooled effect estimates by student-tutor ratio and find somewhat larger impacts, on average, for programs with lower ratios. Using the full sample, we estimate pooled effects of .39, .40, .34, and .30 SD for 1:1, 2:1, 3:1, and 4:1 programs, respectively. The effects for programs with five or more students per tutor are substantially larger (.72 SD), but this result is not robust to excluding studies that use intervention assessments. The overall pattern of declining effects persists when we focus on tutoring programs evaluated using independent tests. When we examine student-tutor ratios in a meta-regression framework (Table 8), we again find a pattern of larger effects for smaller student-tutor ratios, although individual estimates are imprecise.

Evidence from the 10 studies that experimentally vary student-tutor ratios, summarized in Table 12a, provides a range of contrasts from 1:1 versus 2:1 ratios (Carlana & La Ferrara, 2024; Loeb et al., 2023; Vadasy & Sanders, 2008) to 4:1 versus 13:1 ratios (Vaughn et al., 2010). Most examine interventions with elementary students (Clarke et al., 2017, 2020, 2023; Doabler et al., 2019; Loeb et al., 2023; Schwartz et al., 2012; Vadasy & Sanders, 2008) except for three with middle schoolers (Carlana & La Ferrara, 2024; Kraft & Lovison, 2024; Vaughn et al., 2010). The effect size differences most often favor smaller ratios but are not always large in magnitude and do not typically achieve statistical significance. However, many of these studies are underpowered to detect small differences in effects between treatment arms. In short, the existing research suggests that lower ratios produce larger effects, but it is possible to deliver tutoring in pairs or small groups and maintain meaningful effects.

Table 12

a. Multi-arm studies experimentally comparing different student-tutor ratios

Citation	N of treated students	Subject	Small ratio	Large ratio	Students per tutor													Diff. (Small–Big)
Citation	N of treated students	Subject	Small ratio	Large ratio	1	2	3	4	5	6	7	8	9	10	11	12	13	Diff. (Small–Big)
Carlana and La Ferrara (2024)	607	Multiple	1:1	2:1														.09
Clarke et al. (2017)	415	Math	2:1	5:1														.09
																		.52
																		.14
																		.25
																		.76
Clarke et al. (2020)	880	Math	2:1	5:1														−.02
																		−.03
																		−.03
																		.01
																		−.03
																		.12
Clarke et al. (2023)	322	Math	2:1	5:1														.07
																		.20
																		.77
Doabler et al. (2019)	465	Math	2:1	5:1														−.01
																		.04
																		.00
																		.10
																		−.01
Kraft and Lovison (2024)	180	Math	1:1	3:1														.14
Loeb et al. (2023)	1,080	Reading	1:1	2:1														.06
Loeb et al. (2023)	1,080	Reading	1:1	2:1														.03
Schwartz et al. (2012)	170	Reading	1:1	3:1														.63
																		.35
																		.41
																		.23
																		.41
																		.30
																		.36
Vadasy and Sanders (2008)	54	Reading	1:1	2:1														−.09
																		−.22
																		−.37
																		−.12
																		−.06
																		−.22
																		−.08
Vaughn et al. (2010)	514	Reading	4:1	13:1														−.10
																		−.01
																		.07
																		.11
																		−.08
																		.26
																		.08
																		.22
																		.13
																		.11
																		.06
																		.23

Notes. All studies examine elementary programs except for three that study middle school programs: Carlana and La Ferrara (2024), Kraft and Lovison (2024), and Vaughn et al. (2010). Shaded areas identify the number or range of students per tutor reported in each study.

b. Multi-arm studies experimentally comparing different intended dosages of tutoring

Citation	N of treated students	Amount of low dosage	Amount of high dosage	Total hours of tutoring												Effect Size Diff. (High–Low)
Citation	N of treated students	Amount of low dosage	Amount of high dosage	2	8	14	20	26	32	38	44	50	56	62	68	Effect Size Diff. (High–Low)
Al Otaiba et al. (2005)	49	Two 20 min. sessions/week	Four 20 min. sessions/week													.01
																.01
																−.12
																.18
																.13
																.15
																.18
Begeny (2011)	58	1.5 9-min. sessions/week	Three 9-min. sessions/week													.10
Begeny (2011)	58	1.5 9-min. sessions/week	Three 9-min. sessions/week													.29
Carlana and La Ferrara (2021)	530	Three hours/week	Six hours/week													.22



Wanzek and Vaughn (2008)	35	Five 30 min. sessions/week	Ten 30 min. sessions/week													−.4
																.54
																.39
																−.76

Note. There is more than one effect in each study because the authors report effects on multiple reading outcomes or assessments. All studies examine tutoring in reading subjects except for Carlana and La Ferrara (2021), which focused on multiple subjects. Shaded areas identify the number or range of total hours of tutoring reported in each study.

Using peer tutors: An alternative approach to scaling tutoring on a fixed budget is to enlist K–12 students as peer tutors. We find that pooled effect sizes for peer tutoring in our full sample are an impressive .32 SD, as shown in Table 11b. Our meta-analytic regression (Table 8) suggests peer tutoring is not statistically distinguishable from tutoring by teachers, conditional on other program and study characteristics, with a nonsignificant difference of .11 SD in favor of teachers relative to peer tutors. We know of only one study that randomizes students to different tutor types. Mathes et al. (2003) use a partially matched and partially randomized design to compare teachers who implemented small-group (4–5:1) instruction versus overseeing pairs of students who used Peer Assisted Learning Strategies (PALS). They find effect sizes of .70 SD for teacher-directed, small-group instruction and .55 SD for peer-assisted instruction. Further evidence documents that the peer-tutoring program PALS for kindergarteners does scale and maintain its efficacy (Stein et al., 2008), suggesting that peer tutoring may provide a viable path for reducing program costs while sustaining effects.

Decreasing dosage: A fourth approach to scaling tutoring while controlling costs is to reduce overall dosage. Pooled effect estimates presented in Table 11b do not reveal a clear monotonic trend between intended dosage hours and program impacts in our more restricted sample. Across the full sample as well as the more restricted samples, programs offering 60 or more hours of tutoring consistently have the smallest impacts. In our preferred subsample assessed with independent tests, the greatest magnitude of effect is for programs providing 15–29 hours of tutoring (.40 SD). However, these pooled estimates may be confounded with other study characteristics correlated with intended dosage. When employing meta-regression to control for a variety of program features and study characteristics, we do not find consistent statistically significant differences based on the total hours of intended dosage, as shown in Table 8.
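The dosage pattern described here can be checked against the pooled estimates in Table 11b. The sketch below is a simple illustration, with assumed variable and function names, that tests whether effect sizes move in a single direction across the intended-dosage bins and identifies which bins have the largest and smallest pooled effects. It does not account for the standard errors of the estimates.

```python
# Pooled effect sizes (SD units) by intended dosage, from Table 11b.
dosage_bins = ["0-14", "15-29", "30-44", "45-59", ">=60"]
full_sample = [0.494, 0.445, 0.397, 0.317, 0.224]        # Panel A
independent_tests = [0.315, 0.403, 0.353, 0.300, 0.205]  # Panel B


def is_monotonic(values):
    """True if values never increase or never decrease across bins."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def extreme_bins(values):
    """Return the dosage bins with the largest and smallest estimates."""
    largest = dosage_bins[values.index(max(values))]
    smallest = dosage_bins[values.index(min(values))]
    return largest, smallest


for label, values in [("Full sample", full_sample),
                      ("Independent tests", independent_tests)]:
    largest, smallest = extreme_bins(values)
    print(f"{label}: monotonic={is_monotonic(values)}, "
          f"largest={largest} hours, smallest={smallest} hours")
```

In both panels the 60-or-more-hours bin has the smallest point estimate, while only the independent-test subsample departs from a single-direction trend.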

Evidence from four studies that randomly assigned students to different intended doses of tutoring to isolate the causal impact of intended dosage suggests some benefits of higher intended dosage. As shown in Table 12b, three of these studies evaluate elementary school programs (Al Otaiba et al., 2005; Begeny, 2011; Wanzek & Vaughn, 2008) and one middle school program (Carlana & La Ferrara, 2021). These studies provide a range of contrasts, for example, comparing four versus nine total hours of tutoring (Begeny, 2011) to comparing 36 hours versus 72 hours (Al Otaiba et al., 2005). More often than not, these studies show greater effect sizes for programs designed to provide higher than lower intended dosages. In short, studies that experimentally vary intended dosage suggest that reducing it may attenuate effects. However, the actual delivered dosage is likely a more salient program feature.

Discussion

Evidence-based policymaking has increasingly become the standard in education, particularly as practitioners look to implement proven approaches to accelerate students’ academic growth after the substantial disruptions caused by the COVID-19 pandemic. While this trend is encouraging, it places increased importance on the external validity of research. Even well-designed and -implemented RCTs offer incomplete information to policymakers and practitioners if the evidence they produce is at arm’s length from the realities of implementing education policies and practice at scale. Meta-analyses that pool evidence across multiple studies seemingly offer stronger external validity, but aggregating across multiple studies with limited generalizability does not make the results valid for a very different target of inference.

Our study illustrates the importance of carefully considering the alignment between the research evidence and the policy target of inference. We find that attempts to better harmonize our meta-analytic sample of 263 studies to the target of inference used by most policymakers (large-scale tutoring programs aiming to increase student performance on independent tests) substantially reduce the pooled effect sizes. This attenuation is driven by the declining impacts of tutoring programs as they scale and, in part, explained by decreasing intended dosage and increasing student-tutor ratios, expanding tutoring to students who may benefit less, and declining success at delivering on the intended tutoring dosage.

This pattern of declining effects at scale often leads to a circular argument that “the program works when implemented with fidelity, it just wasn’t implemented correctly when taken to scale.” Alternatively, one might ask, “If implementation becomes systematically more difficult at scale, then does a program really work?” We see four possible responses to this challenge: 1) start small, learn, iterate, and engage in the hard but critical work to scale vertically (i.e., expanding program size) over time while maintaining program fidelity, 2) redesign the program to be easier to implement at scale, 3) adopt a more flexible approach to scaling that allows for localized adaptation, and/or 4) decide that a program is best delivered in a small-scale format and focus on horizontal scaling (replicating small programs).

To be clear, we view our target-equivalent estimates of the effects of tutoring as still meaningful and policy-relevant (Kraft, 2020). And we see tutoring as one of the most promising evidence-based approaches to accelerating student achievement. If districts could leverage tutoring at scale for those students whose learning was most negatively affected by the pandemic and produce effects similar to our policy-relevant estimates, it would be a huge success. In fact, several recent experimental studies of tutoring programs implemented post-COVID at a medium scale (Carlana & La Ferrara, 2024; Cortes et al., 2024; Gortazar et al., 2024) and at a large scale (Robinson et al., 2024) find effects on par with those from our target-aligned pooled effect sizes.

That said, we also think it is equally important for policymakers and practitioners alike to have more grounded expectations about what tutoring can accomplish. Several other recent studies using both experimental and non-experimental methods suggest early attempts to scale tutoring have produced quite small effects (Carbonari, DeArmond, et al., 2024; Carbonari, Dewey, et al., 2024; Kraft et al., 2024). Outsized expectations can lead policymakers and practitioners to become disillusioned when they fail to realize the eye-popping effect sizes of small-scale, boutique tutoring programs implemented under favorable circumstances among students who often opt into participating, particularly when meta-analytic estimates mask those contextual factors. Unrealistic expectations can also lead policymakers to mistakenly rely on a single or limited set of interventions when multiple interrelated programs may be needed to achieve their goals. Contextualizing tutoring program effects relative to their costs will also be critical for identifying sustainable models (Kohlmoos & Steinberg, 2024).

New technology may also present opportunities to scale tutoring with greater fidelity while maintaining program effects and reducing per-pupil costs. Recent studies suggest that computer-assisted learning programs paired with tutoring (Bhatt et al., 2024) or integrated into core academic classes (Oreopoulos et al., 2024) can support effective instruction, potentially reducing common obstacles to scaling tutoring. There is growing interest in the potential of generative artificial intelligence to offer effective tutoring at scale, although early programs appear to fall well short of this goal (Barnum, 2024). We remain optimistic about the potential of these new technologies but emphasize that the benefits of human tutoring likely extend far beyond student performance on standardized tests, to say nothing about the value of tutoring for the tutor. Human tutoring offers the opportunity for authentic personal connections and social interactions that can contribute to student development. It also creates volunteer and employment opportunities and valuable experiences for those interested in pursuing a career in education.

Conclusion

Efforts to integrate tutoring at scale into the U.S. K–12 public education system are at a critical juncture. New evidence documenting the mixed results of early efforts to expand access in the wake of the COVID-19 pandemic is emerging just as large-scale federal funding to support tutoring ends. With this paper, we aim to inform ongoing efforts to refine tutoring programs when implemented at scale and better calibrate expectations for what these programs are capable of accomplishing. Our findings highlight the importance of conducting research that considers both internal and external validity to best inform policy and practice.

Our analyses suggest that a bundled package of program features hypothesized to promote effective tutoring does guard against some of the attenuation that occurs as programs expand. It remains an open question whether adapting individual features of this bundle (such as moving tutoring online, increasing student-tutor ratios, using peer tutors, or decreasing dosage) can be done without compromising effectiveness. Such changes may attenuate effects but still be an equally cost-effective, if not more cost-effective, way to deliver tutoring at scale. Our hope is that as policymakers experiment with new tutoring models, they will partner with researchers to learn about the impacts of these adaptations. Continued efforts to integrate individualized instruction into the U.S. K–12 education system would benefit from a decades-long approach that focuses first on establishing effectiveness and then on scaling, rather than the other way around.

Supplemental Material

Supplemental material, sj-docx-1-rer-10.3102_00346543261446660 for What Impacts Should We Expect From Tutoring at Scale? Exploring Meta-Analytic Generalizability by Matthew A. Kraft, Beth E. Schueler and Grace T. Falken in Review of Educational Research

Authors

MATTHEW A. KRAFT is a professor of education and economics at Brown University; mkraft@brown.edu. His research focuses on: (1) improving the effectiveness of K–12 educators and schools and (2) examining how education systems can adapt to and mitigate climate change.

BETH E. SCHUELER is an associate professor of education at Stanford University; bschu@stanford.edu. She studies education policy, politics, and governance with a particular focus on K–12 school and district improvement efforts.

GRACE T. FALKEN is a project director at the Annenberg Institute at Brown University; grace_falken@brown.edu. She studies educator labor markets and the intersection of education policy and climate change.

References

Accelerate. (2023). Beyond recovery: Funding high-impact tutoring for the long term. Accelerate. https://accelerate.us/beyond-recovery/

Al Otaiba, Schatschneider, & Silverman. (2005). Tutor-assisted intensive learning strategies in kindergarten: How much is enough? Exceptionality, 13(4), 195–208. https://doi.org/10.1207/s15327035ex1304_2

Alexander, P. A. (2020). Methodological guidance paper: The art and science of quality systematic reviews. Review of Educational Research, 90(1), 6–23. https://doi.org/10.3102/0034654319854352

Angrist, J. D. (2004). American education research changes tack. Oxford Review of Economic Policy, 20(2), 198–212. https://doi.org/10.1093/oxrep/grh011

Banerjee, A. V., & Duflo. (2009). The experimental approach to development economics. Annual Review of Economics, 1, 151–178. https://doi.org/10.1146/annurev.economics.050708.143235

Barnum. (2018, January 29). Why “personalized learning” advocates like Mark Zuckerberg keep citing a 1984 study—And why it might not say much about schools today. Chalkbeat. https://www.chalkbeat.org/2018/1/29/21104250/why-personalized-learning-advocates-like-mark-zuckerberg-keep-citing-a-1984-study-and-why-it-might-n/

Barnum. (2024, February 16). We tested an AI tutor for kids. It struggled with basic math. The Wall Street Journal. https://www.wsj.com/tech/ai/ai-is-tutoring-students-but-still-struggles-with-basic-math-694e76d3

Barrios-Fernandez. (2023). Peer effects in education. In Banerjee (Ed.), Oxford research encyclopedia of economics and finance. Oxford Academic. https://doi.org/10.1093/acrefore/9780190625979.013.894

Begeny, J. C. (2011). Effects of the Helping Early Literacy with Practice Strategies (HELPS) reading fluency program when implemented at different frequencies. School Psychology Review, 40(1), 149–157. https://doi.org/10.1080/02796015.2011.12087734

Bhatt, Guryan, Khan, LaForest, & Mishra. (2024). Can technology facilitate scale? Evidence from a randomized evaluation of high dosage tutoring (NBER Working Paper No. 32510). https://doi.org/10.17605/OSF.IO/UW8EH

Borenstein, Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis (1st ed.). Wiley. https://doi.org/10.1002/9780470743386

Borenstein, Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I² is not an absolute measure of heterogeneity. Research Synthesis Methods, 8(1), 5–18. https://doi.org/10.1002/jrsm.1230

Boveda, Ford, K. S., Frankenberg, & López. (2023). Editorial vision 2022–2025. Review of Educational Research, 93(5), 635–640. https://doi.org/10.3102/00346543231170179

Brodeur, Cook, & Heyes. (2020). Methods matter: P-hacking and publication bias in causal analysis in economics. The American Economic Review, 110(11), 3634–3660.

Callen, Goldhaber, Kane, T. J., McDonald, McEachin, & Morton. (2024). Pandemic learning loss by student baseline achievement: Extent and sources of heterogeneity (CALDER Working Paper Nos. 292–0224). Center for Analysis of Longitudinal Data in Education Research at the American Institutes for Research. https://caldercenter.org/publications/pandemic-learning-loss-student-baseline-achievement-extent-and-sources-heterogeneity

Carbonari, M. V., DeArmond, Dewey, Dizon-Ross, & Goldhaber. (2024). Impacts of academic recovery interventions on student achievement in 2022-23 (CALDER Working Paper Nos. 303–0724). Center for Analysis of Longitudinal Data in Education Research at the American Institutes for Research. https://caldercenter.org/publications/impacts-academic-recovery-interventions-student-achievement-2022-23

Carbonari, M. V., Dewey, Kane, T. J., Muroga, DeArmond, Dizon-Ross, Goldhaber, Morton, Davison, & Hashim, A. K. (2024). The impact and implementation of academic interventions during Covid: Evidence from the road to recovery project (CALDER Working Paper Nos. 275-0624–2). Center for Analysis of Longitudinal Data in Education Research at the American Institutes for Research. https://caldercenter.org/sites/default/files/CALDER%20WP%20275-0624-2.pdf

Carlana & La Ferrara. (2021). Apart but connected: Online tutoring and student outcomes during the COVID-19 pandemic (EdWorkingPaper Nos. 21–350). Annenberg Institute for School Reform at Brown University. https://eric.ed.gov/?id=ED613650

Carlana & La Ferrara. (2024). Apart but connected: Online tutoring, cognitive outcomes, and soft skills (Working Paper No. 32272). National Bureau of Economic Research. https://doi.org/10.3386/w32272

CCSSO. (2023). How states are using federal relief funding to scale high-impact tutoring. Council of Chief State School Officers. https://753a0706.flowpaper.com/CCSSOESSERTutoring/#page=1

Cheung, A. C. K., & Slavin, R. E. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292. https://doi.org/10.3102/0013189X16656615

Clarke, Doabler, C. T., Kosty, Kurtz Nelson, Smolkowski, Fien, & Turtura. (2017). Testing the efficacy of a kindergarten mathematics intervention by small group size. AERA Open, 3(2), 233285841770689. https://doi.org/10.1177/2332858417706899

Clarke, Doabler, C. T., Sutherland, Kosty, Turtura, & Smolkowski. (2023). Examining the impact of a first grade whole number intervention by group size. Journal of Research on Educational Effectiveness, 16(2), 326–349. https://doi.org/10.1080/19345747.2022.2093299

Clarke, Doabler, C. T., Turtura, Smolkowski, Kosty, D. B., Sutherland, Kurtz-Nelson, Fien, & Baker, S. K. (2020). Examining the efficacy of a kindergarten mathematics intervention by group size and initial skill: Implications for practice and policy. The Elementary School Journal, 121(1), 125–153. https://doi.org/10.1086/710041

Cohen. (2024). Learning curve: Lessons from the tutoring revolution in public education. FutureEd & the National Student Support Accelerator. https://www.future-ed.org/learning-curve-lessons-from-the-tutoring-revolution-in-public-education/

Cortes, Kortecamp, Loeb, & Robinson. (2024). A scalable approach to high-impact tutoring for young readers: Results of a randomized controlled trial (Working Paper No. 32039). National Bureau of Economic Research. https://doi.org/10.3386/w32039

Dahabreh, I. J., Petito, L. C., Robertson, S. E., Hernán, M. A., & Steingrimsson, J. A. (2020). Toward causally interpretable meta-analysis: Transporting inferences from multiple randomized trials to a new target population. Epidemiology, 31(3), 334. https://doi.org/10.1097/EDE.0000000000001177

Davis, J. M. V., Guryan, Hallberg, & Ludwig. (2017). The economics of scale-up (No. w23925). National Bureau of Economic Research. https://doi.org/10.3386/w23925

Deke, Dragoset, Bogen, & Gill. (2012). Impacts of Title I Supplemental Educational Services on student achievement (Nos. 2012–4053). National Center for Education Evaluation and Regional Assistance. https://eric.ed.gov/?id=ED532016

Dietrichson, Bøg, Filges, & Klint Jørgensen, A.-M. (2017). Academic interventions for elementary and middle school students with low socioeconomic status: A systematic review and meta-analysis. Review of Educational Research, 87(2), 243–282. https://doi.org/10.3102/0034654316687036

Doabler, C. T., Clarke, Kosty, Kurtz-Nelson, Fien, Smolkowski, & Baker, S. K. (2019). Examining the impact of group size on the treatment intensity of a tier 2 mathematics intervention within a systematic framework of replication. Journal of Learning Disabilities, 52(2), 168–180. https://doi.org/10.1177/0022219418789376

Duval & Tweedie. (2000). A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis. Journal of the American Statistical Association, 95(449), 89–98. https://doi.org/10.1080/01621459.2000.10473905

Esterling, Brady, & Schwitzgebel. (2024). The necessity of construct and external validity for deductive causal inference. https://doi.org/10.31219/osf.io/2s8w5

Fesler & Chojnacki. (2023). Air tutors’ online tutoring: Math knowledge impacts and participant math perceptions. Middle years math grantee report series. Mathematica. https://eric.ed.gov/?id=ED628638

Fitzgerald, K. G., & Tipton. (2025). Using extant data to improve estimation of the standardized mean difference. Journal of Educational and Behavioral Statistics, 50(1), 128–148. https://doi.org/10.3102/10769986241238478

Fryer, R. G. (2017). The production of human capital in developed countries: Evidence from 196 randomized field experiments. In Banerjee, A. V., & Duflo (Eds.), Handbook of Economic Field Experiments (Vol. 2, pp. 95–322). North-Holland. https://doi.org/10.1016/bs.hefe.2016.08.006

Fryer, R. G., & Howard-Noveck. (2020). High-dosage tutoring and reading achievement: Evidence from New York City. Journal of Labor Economics, 38(2), 421–452. https://doi.org/10.1086/705882

Goldhaber, & Falken, G. T. (2025). ESSER and student achievement: Assessing the impacts of the largest one-time Federal investment in K-12 schools (CALDER Working Paper No. 301-0325-2). Center for Analysis of Longitudinal Data in Education Research at the American Institutes for Research. https://caldercenter.org/publications/esser-and-student-achievement-assessing-impacts-largest-one-time-federal-investment

Goldhaber, Kane, T. J., McEachin, & Morton. (2022, November 16). Opinion | To help students shoot for the moon, we must think bigger and bolder. Washington Post. https://www.washingtonpost.com/opinions/2022/11/16/pandemic-learning-loss-education-moonshot/

Gortazar, Hupkau, & Roldan. (2023). Online tutoring works: Experimental evidence from a program with vulnerable children (EdWorkingPaper Nos. 23–743). Annenberg Institute for School Reform at Brown University. https://edworkingpapers.com/ai23-743

Gortazar, Hupkau, & Roldán-Monés. (2024). Online tutoring works: Experimental evidence from a program with vulnerable children. Journal of Public Economics, 232, 105082. https://doi.org/10.1016/j.jpubeco.2024.105082

Goudey. (2009). A parent involvement intervention with elementary school students [Unpublished graduate thesis]. University of Alberta. https://doi.org/10.7939/R3Q12G

Hashim, Pace Miles, & Croke. (2025). Experimental Evidence on the Impact of Tutoring Format and Tutors: Findings from an Early Literacy Tutoring Program (edworkingpapers.com). https://doi.org/10.26300/ZGY5-SR80

Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational and Behavioral Statistics, 32(4), 341–370. https://doi.org/10.3102/1076998606298043

Hedges, L. V., Tipton, & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39–65. https://doi.org/10.1002/jrsm.5

Heinrich, C. J., Burch, Good, Acosta, Cheng, Dillender, Kirshbaum, Nisar, & Stewart. (2014). Improving the implementation and effectiveness of out-of-school-time tutoring: Special symposium on qualitative and mixed-methods for policy analysis. Journal of Policy Analysis and Management, 33(2), 471–494. https://doi.org/10.1002/pam.21745

Heinrich, C. J., Meyer, R. H., & Whitten. (2010). Supplemental education services under No Child Left Behind: Who signs up, and what do they gain? Educational Evaluation and Policy Analysis, 32(2), 273–298. https://doi.org/10.3102/0162373710361640

Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2(3), 172–177. https://doi.org/10.1111/j.1750-8606.2008.00061.x

Huffaker, Robinson, C. D., Bardelli, White, & Loeb. (2025). When interventions don’t move the needle: Insights from null results in education research (edworkingpapers.com). Annenberg Institute at Brown University. https://doi.org/10.26300/58DD-6D02

Inns, A. J., Lake, Pellegrini, & Slavin. (2019). A quantitative synthesis of research on programs for struggling readers in elementary schools (Best Evidence Encyclopedia). Center for Research and Reform in Education.

James-Burdumy, Dynarski, Moore, Deke, Mansfield, Pistorino, & Warner. (2005). When schools stay open late: The national evaluation of the 21st Century Community Learning Centers program [Final report]. US Department of Education. https://eric.ed.gov/?id=ED485162

Kohlmoos, & Steinberg, M. P. (2024). Contextualizing the impact of tutoring on student learning: Efficiency, cost effectiveness, and the known unknowns. Accelerate. https://accelerate.us/efficiency-and-cost-effectiveness/

Kraft, M. A. (2015). How to make additional time matter: Integrating individualized tutorials into an extended day. Education Finance and Policy, 10(1), 81–116. https://doi.org/10.1162/EDFP_a_00152

Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241–253. https://doi.org/10.3102/0013189X20912798

Kraft, M. A. (2023). The effect-size benchmark that matters most: Education interventions often fail. Educational Researcher, 52(3), 183–187. https://doi.org/10.3102/0013189X231155154

Kraft, M. A., Edwards, D. S., & Cannata. (2024). The scaling dynamics and causal effects of a district-operated tutoring program (EdWorkingPaper Nos. 24–1030). Annenberg Institute at Brown University. https://edworkingpapers.com/ai24-1030

Kraft, M. A., & Falken, G. T. (2021). A blueprint for scaling tutoring and mentoring across public schools. AERA Open, 7, 233285842110428. https://doi.org/10.1177/23328584211042858

Kraft, M. A., List, J. A., Livingston, J. A., & Sadoff. (2022). Online tutoring by college volunteers: Experimental evidence from a pilot program. AEA Papers and Proceedings, 112, 614–618. https://doi.org/10.1257/pandp.20221038

Kraft, M. A., & Lovison, V. S. (2024). The effect of student-tutor ratios: Experimental evidence from a pilot online math tutoring program (EdWorkingPapers.com). Annenberg Institute at Brown University. https://edworkingpapers.com/ai24-976

Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of intelligent tutoring systems: A meta-analytic review. Review of Educational Research, 86(1), 42–78. https://doi.org/10.3102/0034654315581420

Littell, J. H. (2024). The logic of generalization from systematic reviews and meta-analyses of impact evaluations. Evaluation Review, 48(3), 427–460. https://doi.org/10.1177/0193841X241227481

Loeb, Novicoff, Pollard, Robinson, & White. (2023). The effects of virtual tutoring on young readers: Results from a randomized controlled trial. National Student Support Accelerator.

Waymack, Kalogrides, Robinson, C. D., Lee, M. G., & Loeb. (2025). Implementation of the OSSE high impact tutoring initiative—School year 2023—2024 second year report. National Student Support Accelerator. https://nssa.stanford.edu/sites/default/files/Implementation%20of%20the%20OSSE%20High%20Impact%20Tutoring%20Initiative%20-%20School%20Year%202023%20%E2%80%93%202024%20%20Second%20Year%20Report.pdf

Makori, Burch, & Loeb. (2024). Scaling high-impact tutoring: School level perspectives on implementation challenges and strategies (EdWorkingPaper Nos. 24–923). Annenberg Institute at Brown University. https://edworkingpapers.com/ai24-923

Mathes, P. G., Torgesen, J. K., Clancy-Menchetti, Santi, Nicholas, Robinson, & Grek. (2003). A comparison of teacher-directed versus peer-assisted instruction to struggling first-grade readers. The Elementary School Journal, 103(5), 459–479. https://doi.org/10.1086/499735

McShane, B. B., Böckenholt, & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11(5), 730–749. https://doi.org/10.1177/1745691616662243

Moore, Morton, Schwendel, & Welbourne. (2024). National Tutoring Programme year 3: Impact report. Department for Education, Government Social Research. https://www.gov.uk/government/publications/national-tutoring-programme-year-3-impact-evaluation

National Center for Education Statistics. (2024). School Pulse Panel: Responses to the pandemic and efforts toward recovery. U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics. https://nces.ed.gov/surveys/spp/results.asp

Neitzel & Storey. (2024). Air Reading: A randomized evaluation of a virtual tutoring model. Center for Research and Reform in Education, Johns Hopkins University. https://jscholarship.library.jhu.edu/server/api/core/bitstreams/e5e4e033-2416-436c-a1db-6c7ec02f633a/content

Nickow, Oreopoulos, & Quan. (2020). The impressive effects of tutoring on PreK-12 learning: A systematic review and meta-analysis of the experimental evidence (Working Paper No. 27476). National Bureau of Economic Research. https://doi.org/10.3386/w27476

Nickow, Oreopoulos, & Quan. (2024). The promise of tutoring for pre-k–12 learning: A systematic review and meta-analysis of the experimental evidence. American Educational Research Journal, 61(1), 74–107. https://doi.org/10.3102/00028312231208687

National Student Support Accelerator (NSSA). (2023). A snapshot of state tutoring policies. Author. https://studentsupportaccelerator.org/briefs/snapshot-state-tutoring-policies

Oreopoulos, Gibbs, Jensen, & Price. (2024). Teaching teachers to use computer assisted learning effectively: Experimental and quasi-experimental evidence (NBER Working Paper No. 32388).

Pellegrini, Lake, Neitzel, & Slavin, R. E. (2021). Effective programs in elementary mathematics: A meta-analysis. AERA Open, 7, 2332858420986211. https://doi.org/10.1177/2332858420986211

Pigott, T. D., & Polanin, J. R. (2020). Methodological guidance paper: High-quality meta-analysis in a systematic review. Review of Educational Research, 90(1), 24–46. https://doi.org/10.3102/0034654319877153

Polanin, J. R., Zhang, Taylor, J. A., Williams, R. T., Joshi, & Burr. (2023). Evidence gap maps in education research. Journal of Research on Educational Effectiveness, 16(3), 532–552. https://doi.org/10.1080/19345747.2022.2139312

Pollard, Zandieh, Kalogrides, Robinson, C. D., Loeb, & Waymack. (2024). Implementation of the OSSE high impact tutoring initiative: First year report school year 2022–2023. National Student Support Accelerator. https://nssa.stanford.edu/sites/default/files/Implementation%20of%20the%20OSSE%20High%20Impact%20Tutoring%20Initiative.pdf

Pritchett & Sandefur. (2015). Learning from experiments when context matters. American Economic Review, 105(5), 471–475. https://doi.org/10.1257/aer.p20151016

Pustejovsky, J. E., & Tipton. (2022). Meta-analysis with Robust Variance Estimation: Expanding the range of working models. Prevention Science: The Official Journal of the Society for Prevention Research, 23(3), 425–438. https://doi.org/10.1007/s11121-021-01246-3

Ready, D. D., McCormick, S. G., & Shmoys, R. J. (2024). The effects of in-school virtual tutoring on student reading development: Evidence from a short-cycle randomized controlled trial (EdWorkingPaper Nos. 24–942). Annenberg Institute at Brown University. https://edworkingpapers.com/ai24-942

Ritter, G. W., Barnett, J. H., Denny, G. S., & Albin, G. R. (2009). The effectiveness of volunteer tutoring programs for elementary and middle school students: A meta-analysis. Review of Educational Research, 79(1), 3–38. https://doi.org/10.3102/0034654308325690

Robinson, C. D., Kraft, M. A., Loeb, & Schueler, B. E. (2021). Accelerating student learning with high-dosage tutoring (EdResearch for Recovery Design Principles). EdResearch for Recovery Project. https://eric.ed.gov/?id=ED613847

Robinson, C. D., Pollard, Novicoff, White, & Loeb. (2024). The effects of virtual tutoring on young readers: Results from a randomized controlled trial (EdWorkingPapers.com). Annenberg Institute at Brown University. https://edworkingpapers.com/ai24-955

Roschelle, Cheng, B. H., Hodkowski, Neisler, & Haldar. (2020). Evaluation of an online tutoring program in elementary mathematics (Online Submission). Digital Promise. https://eric.ed.gov/?id=ED604743

Ross, S. M., Potter, Paek, McKay, Sanders, & Ashton. (2008). Implementation and outcomes of Supplemental Educational Services: The Tennessee state-wide evaluation study. Journal of Education for Students Placed at Risk (JESPAR), 13(1), 26–58. https://doi.org/10.1080/10824660701860391

Schueler & Rodriguez-Segura. (2022). A cautionary tale of tutoring hard-to-reach students in Kenya. Journal of Research on Educational Effectiveness, 16(3), 442–472. https://doi.org/10.1080/19345747.2022.2131661

Schwartz, R. M., Schmitt, M. C., & Lose, M. K. (2012). Effects of teacher-student ratio in response to intervention approaches. Elementary School Journal, 112(4), 547–567. https://doi.org/10.1086/664490

Slavin, R. E., & Lake. (2008). Effective Programs in Elementary Mathematics: A Best-Evidence Synthesis. Review of Educational Research, 78(3), 427–515. https://doi.org/10.3102/0034654308317473

Slough, & Tyson, S. A. (2023). External validity and meta-analysis. American Journal of Political Science, 67(2), 440–455. https://doi.org/10.1111/ajps.12742

Springer, M. G., Pepper, M. J., & Ghosh-Dastidar. (2014). Supplemental Educational Services and student test score gains: Evidence from a large, urban school district. Journal of Education Finance, 39(4), 370–403. https://doi.org/10.1353/jef.2014.a546720

Stein, M. L., Berends, Fuchs, McMaster, Sáenz, Yen, Fuchs, L. S., & Compton, D. L. (2008). Scaling up an early reading program: Relationships among teacher support, fidelity of implementation, and student performance across different sites and years. Educational Evaluation and Policy Analysis, 30(4), 368–388. https://doi.org/10.3102/0162373708322738

Tanner-Smith, E. E., & Tipton. (2014). Robust variance estimation with dependent effect sizes: Practical considerations including a software tutorial in Stata and SPSS. Research Synthesis Methods, 5(1), 13–30. https://doi.org/10.1002/jrsm.1091

Tipton, Bryan, Murray, McDaniel, M. A., Schneider, & Yeager, D. S. (2023). Why meta-analyses of growth mindset and other interventions should follow best practices for examining heterogeneity: Commentary on Macnamara and Burgoyne (2023) and Burnette et al. (2023). Psychological Bulletin, 149(3–4), 229–241. https://doi.org/10.1037/bul0000384

Tipton, & Olsen, R. B. (2018). A review of statistical methods for generalizing from evaluations of educational interventions. Educational Researcher, 47(8), 516–524. https://doi.org/10.3102/0013189X18781522

Torgerson, Ainsworth, Buckley, Hampden-Thompson, Hewitt, Humphry, Jefferson, Mitchell, & Torgerson. (2016). Affordable online maths tuition: Evaluation report and executive summary. Education Endowment Foundation. https://eric.ed.gov/?id=ED581116

Tyack, D. B. (1974). The one best system: A history of American urban education. Harvard University Press.

U.S. Department of Education. (2024). Elementary and secondary school emergency relief fund fiscal year 2023 annual performance report. U.S. Department of Education.

Vadasy, P. F., & Sanders, E. A. (2008). Code-oriented instruction for kindergarten students at risk for reading difficulties: A replication and comparison of instructional groupings. Reading and Writing, 21(9), 929–963. https://doi.org/10.1007/s11145-008-9119-9

Vaughn, Wanzek, Wexler, Barth, Cirino, P. T., Fletcher, Romain, Denton, C. A., Roberts, & Francis. (2010). The relative effects of group size on reading progress of older students with reading difficulties. Reading and Writing, 23(8), 931–956. https://doi.org/10.1007/s11145-009-9183-9

Victorian Auditor-General’s Office. (2024). Effectiveness of the tutor learning initiative [Independent assurance report to Parliament]. https://www.audit.vic.gov.au/report/effectiveness-tutor-learning-initiative?section=

von Hippel, P. T. (2024, March 7). Two-sigma tutoring: Separating science fiction from science fact. Education Next. https://www.educationnext.org/two-sigma-tutoring-separating-science-fiction-from-science-fact/

von Hippel, P. T. (2025). Multiply by 37 (or Divide by 0.027): A surprisingly accurate rule of thumb for converting effect sizes from standard deviations to percentile points. Educational Evaluation and Policy Analysis, 47(3), 960–969. https://doi.org/10.3102/01623737241239677

Wanzek & Vaughn. (2008). Response to varying amounts of time in reading intervention for students with low response to intervention. Journal of Learning Disabilities, 41(2), 126–142. https://doi.org/10.1177/0022219407313426

Zimmer, Gill, Razquin, Booker, & Lockwood, J. R. (2009). State and local implementation of the No Child Left Behind Act, Volume VII–Title I school choice and Supplemental Educational Services: Final report [Evaluative Reports]. U.S. Department of Education Office of Planning, Evaluation and Policy Development Policy and Program Studies Service. https://www2.ed.gov/rschstat/eval/choice/nclb-choice-ses-final/index.html

Zimmer, Hamilton, & Christina. (2010). After-school tutoring in the context of No Child Left Behind: Effectiveness of two programs in the Pittsburgh Public Schools. Economics of Education Review, 29(1), 18–28. https://doi.org/10.1016/j.econedurev.2009.02.005
