Abstract

This article reviews research on the achievement outcomes of elementary mathematics programs; 87 rigorous experimental studies evaluated 66 programs in grades K–5. Programs were organized in six categories. Particularly positive outcomes were found for tutoring programs (effect size [ES] = +0.20, k = 22). Positive outcomes were also seen in studies focused on professional development for classroom organization and management (e.g., cooperative learning; ES = +0.19, k = 7). Professional development approaches focused on helping teachers gain in understanding of mathematics content and pedagogy had little impact on student achievement. Professional development intended to help in the adoption of new curricula had a small but significant impact for traditional (nondigital) curricula (ES = +0.12, k = 7), but not for digital curricula. Traditional and digital curricula with limited professional development, as well as benchmark assessment programs, showed few positive effects.

Keywords

evidence of effectiveness

In recent years, there has been an increasing emphasis on the identification and dissemination of programs proven in rigorous experiments. This emphasis has been clear in federal funding for education research, especially at the Institute of Education Sciences (IES), Education Innovation and Research (EIR), and the National Science Foundation (NSF). The establishment of the What Works Clearinghouse (WWC) has helped establish standards of evidence and has disseminated information on the evidence base for educational programs. In England, the Education Endowment Foundation has similarly supported rigorous research in education. In 2015, the Every Student Succeeds Act (ESSA) defined, for the first time, criteria for the effectiveness of educational programs. ESSA places particular emphasis on three top levels of evidence: strong (statistically significant positive effects in at least one randomized experiment), moderate (statistically significant positive effects in at least one quasi-experiment), and promising (statistically significant positive effects in at least one correlational study). ESSA encourages use of programs meeting these criteria, and requires schools seeking school improvement funding to adopt programs meeting one of these criteria.

One of the subjects most affected by the evidence movement in education is mathematics, because there is more rigorous research in mathematics than in any other subject except reading. The rapid expansion in numbers and quality of studies of educational programs has provided a far stronger basis for evidence-informed practice in mathematics than once existed.

The advances in research have been noted in reviews, cited later in this article. However, the great majority of reviews have focused only on particular approaches or subpopulations, using diverse review methods. This makes it difficult to compare alternative approaches on a consistent basis, to understand the relative impacts of different programs. The most recent meta-analyses to systematically review research on all types of approaches to mathematics instruction were a review of elementary mathematics programs by Slavin and Lake (2008) and one by Jacobse and Harskamp (2011). A meta-analysis of all secondary mathematics programs was published by Slavin et al. (2009).

The present article updates the Slavin and Lake (2008) review of elementary mathematics, incorporating all rigorous evaluations of programs intended to improve mathematics achievement in grades K–5. The review uses more rigorous selection criteria than would have been possible in 2008, and uses current methods for meta-analysis and meta-regression, to compare individual programs and categories of programs, as well as key mediators, on a consistent basis.

Need for This Review

Two reviews considering all elementary mathematics programs have been published since 2008. Slavin and Lake (2008) identified 87 qualifying studies of outcomes of elementary mathematics programs and concluded that mathematics programs that incorporate cooperative learning, classroom management, and tutoring had the most positive effects on mathematics achievement. Another review of experimental studies by Jacobse and Harskamp (2011) examined the impact of mathematics interventions in grades K–6 and identified 40 studies. The authors reported that small group or individual interventions had greater effects on mathematics achievement than did whole-class programs.

An important contribution of the present review is its focus on coherent categories of mathematics interventions. Most previous reviews of mathematics interventions have focused on variables rather than programs or categories of similar programs (e.g., Gersten et al., 2014; Lynch et al., 2019). Yet to inform practice in elementary mathematics, it is important to identify specific effective programs and categories of programs, because this is how educators and policymakers interested in evidence-based reform make choices (Morrison et al., 2019). For example, the 2015 ESSA defines program effectiveness, and the WWC (2020) is similarly focused on evaluating evidence for programs, not variables.

The importance of program categories stems from the importance of programs. A daunting problem in evidence-based reform in education is that few programs are supported by large numbers of rigorous studies. The vast majority of practical programs with any rigorous evidence of effectiveness at all have just one or two studies that would meet modern standards. If there are several similar programs that also find positive impacts in rigorous experiments, this may buttress the claims of effectiveness for all of them. Conversely, if a given program shows positive impacts in a single rigorous experiment, but other equally rigorous studies of similar programs do not, this should cause educators and researchers to place less confidence in the one study’s findings.

In the present meta-analysis, we included all studies that met a stringent set of inclusion criteria, regardless of the type of program used. We then grouped the programs into six mutually exclusive categories. These are described in detail later in this article, but in brief, the categories are as follows:

1. Tutoring (e.g., one-to-one or one-to-small group instruction in mathematics)

2. Professional development (PD) focused on mathematics content and pedagogy (at least 2 days or 15 hours)

3. PD (at least 2 days or 15 hours) focused on classroom organization and management (e.g., cooperative learning in mathematics)

4. PD focused on implementation of traditional (nondigital) and digital curricula (at least 2 days or 15 hours)

5. Traditional and digital curricula with limited PD (less than 2 days or 15 hours)

6. Benchmark assessments

A major feature of the present review is its use of modern approaches to meta-analysis and meta-regression that enable researchers to control effects of programs, categories and variables for substantive and methodological factors, and to obtain meaningful estimates for key moderators (see Borenstein et al., 2009; Borenstein et al., 2017; Lipsey, 2019; Pigott & Polanin, 2020; Valentine et al., 2019).

Another important contribution of the present meta-analysis is its use of stringent inclusion standards, similar to those of the WWC (2020). For example, the review of research on elementary mathematics programs by Slavin and Lake (2008), mentioned earlier, required that studies use random assignment or quasi-experimental designs, excluded measures overaligned with the treatment, and required a minimum duration of 12 weeks and a minimum sample size of 30 students in each treatment group. This review found positive effects for PD approaches, such as cooperative learning, mastery learning, and classroom organization and management, which had a mean effect size (ES) of +0.33 (k = 36). Technology-focused programs had a mean ES of +0.19 (k = 38), and curriculum approaches (mostly textbooks) had a mean ES of +0.10 (k = 13). These ESs are in a range similar to those reported by WWC (2013) studies of K–12 mathematics. The Lynch et al. (2019) review used similar inclusion standards, and reported an overall impact on mathematics learning of +0.27. Yet other reviews of mathematics interventions find much larger overall impacts. This is due to their inclusion of studies with design features known to significantly inflate ESs. For example, the third meta-analysis to include all studies of elementary mathematics, Jacobse and Harskamp (2011), reported an average ES of +0.58, about twice the size of the Slavin and Lake (2008) and Lynch et al. (2019) mean ESs. The authors noted that studies in their review using non-standardized measures obtained significantly larger ESs than those using standardized measures, yet they did not control for this difference, known from other research (e.g., Cheung & Slavin, 2016) to be a powerful methodological factor in achievement ESs.

In recent years, research has established the substantial inflationary bias in ES estimates introduced by certain research design elements. Particularly important sources of bias include small sample size, very brief duration, use of researchers rather than school staff to deliver experimental programs, and use of measures made by developers and researchers (Cheung & Slavin, 2016; de Boer et al., 2014; Wolf et al., 2020).

The problem is that despite convincing demonstrations of the biasing impact of these factors, most reviews of research do not exclude or control for studies that contain factors known to substantially and spuriously inflate ESs. As a result, meta-analyses often report ESs that are implausibly large. As a point of reference, a study by Torgerson et al. (2013) found an ES of +0.33, the highest for one-to-one tutoring in mathematics by certified teachers in the current review. How could studies of far less intensive treatments produce much larger effects than one-to-one tutoring?

As one example, a review of research on intelligent tutoring systems by Kulik and Fletcher (2016), mostly in mathematics, reported an implausible ES of +0.66. The review had a minimum duration requirement of only 30 minutes. The review reported substantial impacts of “local” (presumably researcher-made) vs. standardized measures, with means of +0.73 and +0.13, respectively. It reported ESs of +0.78 for sample sizes less than 80, and +0.30 for sample sizes over 250. Individual included studies with very low sample sizes reported remarkable (and implausible) ESs. A 50-minute study involving 48 students had an ES on local measures of +0.95. Another, with 30 students and a duration of one hour, found an ES of +0.78. A third, with 30 students and a duration of 80 minutes, reported an ES of +1.17. Yet in its overall conclusions, Kulik and Fletcher (2016) did not exclude or control for inclusion of very small or very brief studies or inclusion of “locally developed” measures and did not weight for sample size. In a separate analysis, the review reported on 15 mostly large, long-term studies of a secondary technology program called Cognitive Tutor, showing ESs of +0.86 on “locally developed” measures and +0.16 on standardized measures, but simply averaged these to report an ES of +0.45, an implausibly large impact. As a point of comparison, the WWC, which uses inclusion criteria similar to those used by Slavin and Lake (2008) and Lynch et al. (2019), accepted five studies of Cognitive Tutor Algebra I, which had a median ES of +0.08, and one of Cognitive Tutor Geometry with an ES of −0.19.

As another example, Lein et al. (2020), in a review of research on word problem solving interventions, reported mean ESs of +0.68 for researcher-made measures, compared with +0.09 for norm-referenced measures. They also reported a mean of +0.71 for interventions delivered by researchers, compared with +0.28 for those delivered by school staff. Yet the review did not control for these or other likely biasing factors and reported an implausible mean ES of +0.56.

In the present meta-analysis, we used inclusion criteria more stringent than those used by the WWC or by Slavin and Lake (2008) or Lynch et al. (2019), and substantially more stringent than those of the great majority of reviews of studies of mathematics programs. We excluded all measures made by developers or researchers, post hoc quasi-experiments, very small and very brief studies, and those in which researchers, rather than staff unaffiliated with the research, taught the experimental program. We also weighted studies by their sample sizes (using inverse variance) in computing mean ESs. Then we statistically controlled for relevant methodological and substantive moderators. These methods are described later in this article.

The importance of these procedures should be clear. Whatever outcomes are reported for studies included in the present meta-analysis, readers should be able to be confident that these outcomes are due to the actual likely effectiveness of the interventions, not to methodological or substantive factors that are known from extensive prior research to bias ES estimates. Failing to exclude or control for these factors not only spuriously inflates reported ESs but also confounds comparisons of ESs within reviews, as a program’s large ES could be due to use of study features known to inflate ESs in the studies evaluating it, rather than to any actual greater benefit for students.

The inclusion of studies with certain study features not only risks substantial inflation of mean ESs, but also may undermine the relevance of the study for practice. A study of 30 minutes’ duration, one that has a sample size of 14, one that uses researchers rather than school staff to deliver the intervention, or one that uses outcome measures created by developers or researchers, is of little value to teachers or students, because educators need information on what works over significant time periods, is implemented by school staff, and is evaluated using universally accepted assessments, not ones that developers or researchers made up themselves.

Method

Inclusion Criteria

The review used rigorous inclusion criteria designed to minimize bias and provide educators and researchers with reliable information on programs’ effectiveness. The inclusion criteria are similar to those of the WWC (2020), with a few exceptions noted below. A PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow chart (Figure 1) shows the numbers of studies initially found and the numbers winnowed out at each stage of the review. Inclusion criteria were as follows:

Studies had to evaluate student mathematics outcomes of programs intended to improve mathematics achievement in elementary schools, Grades K–5. Sixth graders were also included if they were in elementary schools. Students who qualified for special education services but attended mainstream mathematics classes were included.

Studies had to use experimental methods with random assignment to treatment and control conditions, or quasi-experimental (matched) methods in which treatment assignments were specified in advance. Studies that matched a control group to the treatment group after posttest outcomes were known (post hoc quasi-experiments or ex post facto designs) were not included.

Studies had to compare experimental groups using a given program to control groups using an alternative program already in place, or “business-as-usual.”

Evaluated programs had to be delivered by school staff unaffiliated with the research, not by the program developers, researchers, or their graduate students. This is particularly important for relevance to practice.

Studies had to provide pretest data. If the pretest differences between experimental and control groups were greater than 25% of a standard deviation, the study was excluded. Pretest equivalence had to be acceptable both initially and based on pretests for the final sample, after attrition. Studies with differential attrition between experimental and control groups of more than 15% were excluded.

Studies’ dependent measures had to be quantitative measures of mathematics performance.

Assessments made by program developers or researchers were excluded. The WWC (2020) excludes “overaligned” measures, but not measures made by developers or researchers. The rationale for this exclusion in the current review is that studies have shown that developer/researcher-made measures overstate program outcomes, with about twice the ESs of independent measures on average, even within the same studies (Cheung & Slavin, 2016; de Boer et al., 2014; Gersten et al., 2009; Kulik & Fletcher, 2016; Lein et al. 2020; Lynch et al., 2019; Nelson & McMaster, 2019). Results from developer- or researcher-made measures may be valuable to researchers or theorists, and there are situations in which independent measures do not exist. However, such findings should only be supplemental information, not reported as outcomes of the practical impact of treatments.

Studies had to have a minimum duration of 12 weeks, to establish that effective programs could be replicated over extended periods. Also, very brief studies have been found to inflate ESs (e.g., Gersten et al., 2014; Kulik & Fletcher, 2016; Nelson & McMaster, 2019).

Studies could have taken place in the United States or in similar countries: Europe, Israel, Australia, or New Zealand. However, the report had to be available in English. In practice, all qualifying studies took place in the United States, the United Kingdom, Canada, the Netherlands, and Germany.

Studies had to have been carried out from 1990 through 2020, but for technology a start date of 2000 was used, due to the significant advances in technology since that date.
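The quantitative screens among the criteria above can be sketched as a simple filter. This is a minimal illustration, not the screening procedure actually used in the review: the `Study` record and its field names are hypothetical, and "differential attrition" is assumed here to mean the absolute gap between the two groups' attrition rates. The thresholds themselves come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Study:
    """Hypothetical record of the coded features needed for screening."""
    pretest_diff_sd: float          # treatment-control pretest gap, in SD units
    attrition_treatment: float      # proportion lost from the treatment group
    attrition_control: float        # proportion lost from the control group
    duration_weeks: float
    year: int
    is_technology: bool
    independent_measure: bool       # outcome not made by developers/researchers
    delivered_by_school_staff: bool


def exclusion_reasons(s: Study) -> List[str]:
    """Return the reasons a study fails the screens (empty list = passes)."""
    reasons = []
    if abs(s.pretest_diff_sd) > 0.25:
        reasons.append("pretest difference greater than 0.25 SD")
    # Assumption: differential attrition = absolute gap in attrition rates.
    if abs(s.attrition_treatment - s.attrition_control) > 0.15:
        reasons.append("differential attrition greater than 15%")
    if s.duration_weeks < 12:
        reasons.append("duration under 12 weeks")
    first_year = 2000 if s.is_technology else 1990
    if not (first_year <= s.year <= 2020):
        reasons.append("outside the eligible date range")
    if not s.independent_measure:
        reasons.append("developer- or researcher-made measure")
    if not s.delivered_by_school_staff:
        reasons.append("not delivered by school staff")
    return reasons


# Two invented studies: one that passes, one that fails four screens.
eligible = Study(0.10, 0.08, 0.05, 30, 2015, False, True, True)
brief_tech = Study(0.10, 0.08, 0.05, 1, 1998, True, False, False)

print(exclusion_reasons(eligible))
print(exclusion_reasons(brief_tech))
```

Returning the list of reasons, rather than a single pass/fail flag, mirrors the way a PRISMA flow chart tallies exclusions by cause.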

Figure 1.

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram of study search and review process.

Literature Search and Selection Procedures

A broad literature search was carried out in an attempt to locate every study that might meet the inclusion requirements. Then studies were screened to determine whether they were eligible for review using a multistep process that included (a) an electronic database search, (b) a hand search of key peer-reviewed journals, (c) an ancestral search of recent meta-analyses, (d) a Web-based search of education research sites and educational publishers’ sites, and (e) a final review of citations found in relevant documents retrieved from the first search wave.

First, electronic searches were conducted in educational databases (JSTOR, ERIC, EBSCO, PsycINFO, ProQuest Dissertations & Theses Global) using different combinations of key words (e.g., “elementary students,” “mathematics,” “achievement,” “effectiveness,” “RCT,” “QED”). We also reviewed studies accepted by the WWC, and searched in recent tables of contents of eight key mathematics and general educational journals from 2013 to 2020: American Educational Research Journal, Educational Research Review, Elementary School Journal, Journal of Educational Psychology, Journal of Research on Educational Effectiveness, Journal for Research in Mathematics Education, Learning and Instruction, and Review of Educational Research. We investigated citations from previous reviews of elementary mathematics programs (e.g., Dietrichson et al., 2017; Gersten et al., 2014; Jacobse & Harskamp, 2011; Kulik & Fletcher, 2016; Li & Ma, 2010; Lynch et al., 2019; Nelson & McMaster, 2019; Savelsbergh et al., 2016).

We were particularly careful to be sure we found unpublished as well as published studies, because of the known effects of publication bias in research reviews (Cheung & Slavin, 2016; Chow & Ekholm, 2018; Polanin et al., 2016). Finally, we reviewed citations of documents retrieved from the first wave to search for any other studies of interest.

A first screen of each study was carried out by examining the title and abstract using inclusion criteria. Studies that could not be eliminated in the screening phase were located and the full text was read by one of the authors of the current study. We further examined the studies that were believed to meet the inclusion criteria and those where inclusion was possible but not clear. All of these studies were examined by a second author to determine whether they met the inclusion criteria. When the two authors were in disagreement, the inclusion or exclusion of the study was discussed with a third author until consensus was reached.

Initial searching identified 18,646 potential studies. After removing 4,157 duplicate records, these search strategies yielded 14,489 studies for screening. The screening phase eliminated 13,366 studies, leaving 1,123 full-text articles to be assessed for eligibility. Of these full-text articles that were reviewed, 1,039 studies did not meet the inclusion criteria, leaving 84 contributions included in this review, with two studies including multiple interventions, for a total number of 87 studies (see Figure 1).
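The stage-by-stage counts in the preceding paragraph can be checked with a few subtractions. The snippet below uses only the figures reported in the text; the variable names are illustrative.

```python
# Counts reported in the text for each stage of the search.
identified = 18_646
duplicates = 4_157
excluded_at_screening = 13_366
excluded_at_full_text = 1_039

screened = identified - duplicates                       # records screened
full_text = screened - excluded_at_screening             # full texts assessed
included_reports = full_text - excluded_at_full_text     # contributions included

print(screened, full_text, included_reports)  # 14489 1123 84
```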

Coding

Studies that met the inclusion criteria were coded by one of the authors of the review. Then codes were verified by another author. As with study inclusion, disagreements were discussed with a third author until consensus was reached.

Data coded included program components, publication status, year of publication, study design, study duration, sample size, grade level, participant characteristics, outcome measures, and ESs.

We also identified variables that could possibly moderate the effects in the review, distinguishing between substantive factors and methodological factors. Substantive factors are related to the intervention and the population characteristics. The factors coded were grade level (K–2 vs. 3–6), student achievement levels (low achievers vs. average/high achievers), socioeconomic status (low SES vs. moderate/high SES), and study locations in the United States versus other countries. Methodological factors included research design (quasi-experiments vs. randomized studies). For tutoring programs we also coded the group size (one-to-one vs. one-to-small group) and the type of provider (teacher, teaching assistant, paid volunteer, or unpaid volunteer). The coded data are available on GitHub (Pellegrini et al., 2021).

Effect Size Calculations and Statistical Procedures

ESs were computed as the mean difference between the posttest scores for individual students in the experimental and control groups after adjustment for pretests and other covariates, divided by the unadjusted standard deviation of the control group’s posttest scores. Procedures described by Lipsey and Wilson (2001) were used to estimate ESs when unadjusted standard deviations were not available.
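In symbols, the ES described above can be written as follows. The notation is illustrative and is not drawn from the sources cited: the numerator terms denote the covariate-adjusted posttest means of the treatment (T) and control (C) groups, and the denominator is the unadjusted standard deviation of the control group's posttest scores.

```latex
\[
ES \;=\; \frac{\bar{Y}^{\,\mathrm{adj}}_{T} - \bar{Y}^{\,\mathrm{adj}}_{C}}{SD_{C}}
\]
```

For instance, with invented numbers, an adjusted difference of 3 scale points and a control-group standard deviation of 15 points would give ES = 3 / 15 = +0.20.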

Statistical significance is reported for each study using procedures from the WWC (2020). If assignment to the treatment and control groups was at the individual student level, statistical significance was determined by using analysis of covariance, controlling for pretests and other factors. If assignment to the treatment and control groups was at the cluster level (e.g., classes or schools), statistical significance was determined by using multilevel modeling such as hierarchical linear modeling (Raudenbush & Bryk, 2002). Studies with cluster assignments that did not use hierarchical linear modeling or other multilevel modeling but used student-level analysis were re-analyzed to estimate significance with a formula provided by the WWC (2020) to account for clusters.

Mean ESs across studies were calculated after assigning each study a weight based on inverse variance (Lipsey & Wilson, 2001), with adjustments for clustered designs suggested by Hedges (2007). In combining across studies and in moderator analysis, we used random-effects models, as recommended by Borenstein et al. (2009).
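The logic of inverse-variance weighting under a random-effects model can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the review: it uses the DerSimonian-Laird estimator of between-study variance as a common stand-in, ignores the clustering adjustments and the dependence among ESs within studies, and runs on invented data.

```python
import math
from typing import List, Tuple


def fixed_effect_mean(es: List[float], var: List[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean ES and its standard error."""
    w = [1.0 / v for v in var]
    mean = sum(wi * e for wi, e in zip(w, es)) / sum(w)
    return mean, math.sqrt(1.0 / sum(w))


def dl_tau2(es: List[float], var: List[float]) -> float:
    """DerSimonian-Laird between-study variance, floored at zero."""
    w = [1.0 / v for v in var]
    mean, _ = fixed_effect_mean(es, var)
    q = sum(wi * (e - mean) ** 2 for wi, e in zip(w, es))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(es) - 1)) / c)


def random_effects_mean(es: List[float], var: List[float]) -> Tuple[float, float]:
    """Weighted mean after adding tau^2 to every study's sampling variance."""
    tau2 = dl_tau2(es, var)
    return fixed_effect_mean(es, [v + tau2 for v in var])


# Invented data: one small study with a large ES, two larger studies near zero.
es = [0.60, 0.10, 0.00]
var = [0.04, 0.01, 0.0025]

fe_mean, fe_se = fixed_effect_mean(es, var)
re_mean, re_se = random_effects_mean(es, var)
print(round(fe_mean, 3), round(re_mean, 3))
```

In this invented example the small study pulls the random-effects mean upward, because adding the between-study variance to each study's sampling variance makes the weights more equal. Under either model, precise studies still count for more than imprecise ones.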

Meta-Regression

We used a multivariate meta-regression model with robust variance estimation (RVE) to conduct the meta-analysis (Hedges et al., 2010). This approach has several advantages. First, our data included multiple ESs per study, and RVE accounts for this dependence without requiring knowledge of the covariance structure (Hedges et al., 2010). Second, this approach allows for moderators to be added to the meta-regression model and calculates the statistical significance of each moderator in explaining variation in the ESs (Hedges et al., 2010). Tipton (2015) expanded this approach by adding a small-sample correction that prevents inflated Type I errors when the number of studies included in the meta-analysis is small or when the covariates are imbalanced. We estimated three meta-regression models. First, we estimated a null model to produce the average ES without adjusting for any covariates. Second, we estimated a meta-regression model with the identified moderators of interest and covariates. Third, we estimated an exploratory meta-regression model which added tutoring provider as a moderator. Due to the small sample size, this model is considered exploratory and results of statistical tests such as p values are not reported. All moderators and covariates were grand-mean centered to facilitate interpretation of the intercept. All reported mean ESs come from this meta-regression model, which adjusts for potential moderators and covariates. The packages metafor (Viechtbauer, 2010) and clubSandwich (Pustejovsky, 2020) were used to estimate all random-effects models with RVE in the R statistical software (R Core Team, 2020).
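The effect of grand-mean centering on the intercept can be seen in a toy example. The sketch below uses an unweighted least-squares line on invented data, not the RVE meta-regression described above, so it illustrates only the centering step; the function and variable names are hypothetical.

```python
from statistics import mean


def ols_line(x, y):
    """Slope and intercept of an unweighted least-squares line."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Invented data: one ES per study and a 0/1 moderator (1 = quasi-experiment).
es = [0.05, 0.10, 0.02, 0.25, 0.30]
quasi = [0, 0, 0, 1, 1]

# Uncentered: the intercept is the mean ES of the reference group (coded 0).
slope_raw, intercept_raw = ols_line(quasi, es)

# Grand-mean centered: the intercept is the mean ES across all studies.
centered = [q - mean(quasi) for q in quasi]
slope_c, intercept_c = ols_line(centered, es)

print(round(intercept_raw, 3), round(intercept_c, 3))
```

Centering leaves the slope unchanged and moves only the intercept, which is why a centered model's intercept can be read as an overall average rather than as the mean for one reference group.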

Categories of Mathematics Programs

Studies that met the inclusion criteria were divided into categories according to the main and most distinctive components of the programs. Category assignments were based on independent readings of articles and websites by the authors. All authors read all accepted studies, and if there were disagreements about categorizations they were debated and determined by consensus among all authors. The categories and their theoretical rationales were as follows.

1. Tutoring. Tutoring refers to one-to-one or one-to-small group instruction intended to help students struggling in mathematics. The theoretical base for tutoring draws on research in reading, which has long made extensive use of one-to-one and small group tutoring (see, e.g., Elbaum et al., 2000; Gersten et al., 2020; Slavin et al., 2011; Wanzek et al., 2016) as well as in mathematics (e.g., Fuchs, Schumacher, et al., 2013; Fuchs, Schumacher, et al., 2016; Jacobse & Harskamp, 2011; Nelson & McMaster, 2019). Tutoring may involve one teacher or one teaching assistant (paraprofessional) with one student, or one teacher or teaching assistant with a very small group of students, usually from two to six at a time.

There are several ways in which tutoring is likely to improve student mathematics outcomes. First, tutoring (especially one-to-one) permits tutors to substantially adapt their instruction to the needs of the student(s). Tutoring programs in mathematics generally provide well-structured, sequential materials for students, but tutors are trained to explain and demonstrate concepts for students who are struggling with them. Tutors are trained to start with struggling students where they are and move them forward rapidly. They are able to explain and model mathematical concepts and processes, observe how students are working, and give them personalized feedback and encouragement. Tutors can enable students to work in small steps, experiencing success at each step. Furthermore, tutors are likely to be able to build close personal relationships with tutored student(s), giving them attention and praise that many students crave, and enhancing their motivation as students seek to please a valued adult. Previous reviews of research on elementary mathematics approaches have found that tutoring is among the most effective of all interventions for students struggling in mathematics (e.g., Jacobse & Harskamp, 2011; Slavin & Lake, 2008).

2. PD Focused on Mathematics Content and Pedagogy. Interventions in this category provide intensive content-focused PD intended to advance teachers’ understanding of current standards-based content and effective pedagogy (teaching methods). To be included in this category, PD had to be provided for at least 2 days or 15 hours. This category of strategies emphasizes giving teachers knowledge about mathematics content and about ways of explaining it (Desimone, 2009; Desimone & Garet, 2015; Kennedy, 2014; Penuel et al., 2011). Ideally, such approaches emphasize mathematics content, active learning, coherence, sustained duration, and collective participation to help teachers learn and apply to their teaching new understandings of mathematics content and mathematics-specific pedagogy (Desimone, 2009; Desimone & Garet, 2015; Kennedy, 2014; Penuel et al., 2011). Almost all of these PD programs (as well as those in Categories 3, 4, and 5) provided some degree of on-site coaching to follow up after initial training. Coaching has been found to be an effective component of PD in mathematics (Kraft et al., 2018).

3. PD Focused on Classroom Organization and Management. This mathematics-specific category includes programs that provide teachers with PD and materials to help them implement innovations in classroom organization and management, such as cooperative learning (e.g., Slavin, 2017) and classwide behavior approaches (e.g., Weis et al., 2015). This category had the highest ES (ES = +0.33, k = 36) of any category in the Slavin and Lake (2008) meta-analysis. Previous research on cooperative learning has shown positive effects on mathematics and other subjects (e.g., Rohrbeck et al., 2003; Webb, 2008).

4. PD Focused on Implementation of Traditional and Digital Curricula. Interventions in this category provide teachers with moderate to extensive PD (at least 2 days or 15 hours, combining training and follow-up coaching) to support informed, thoughtful implementation of innovative traditional (i.e., non-digital) or digital curricula for students. There were two subcategories: (a) PD Focused on Implementation of Traditional Curricula, with minimal use of technology and (b) PD Focused on Implementation of Digital Curricula, such as computer-assisted instruction.

5. Traditional and Digital Curricula With Limited PD. This category includes two subcategories: (a) Traditional (i.e., nondigital) curricula (textbooks with associated teaching materials) and (b) Digital curricula for students. Limited PD (less than 2 days or 15 hours) was provided in such strategies.

6. Benchmark Assessments. Benchmark assessments consist of tests given periodically (three to five times a year) to find out how students are proceeding toward success on state standards. The rationale is to give teachers and school leaders early information on student performance so they can make changes well before state testing (e.g., Konstantopoulos et al., 2016).

Results

A total of 87 studies evaluating 66 programs met the inclusion standards of this review. The studies included were of high methodological quality: 74 (85%) of the studies were randomized trials and 13 (15%) were quasi-experimental studies. Also, 75 (86%) of the studies were reported in 2010 or later, indicating the extraordinary pace at which rigorous studies of elementary mathematics are appearing. Only four of the studies included in the current review overlapped those cited by Slavin and Lake (2008). Studies cited in 2008 but not in the current article were released before 1990, or did not meet the much more stringent inclusion requirements of the current synthesis.

Table 1 shows the meta-regression outcomes. The full model controlled for program category and subcategory, research design, grade level, student achievement level, SES, the United States versus other countries, and tutoring group size. Table 2 shows adjusted means for each category and subcategory. Tables 3 to 8 summarize the main characteristics and outcomes of the individual studies, grouping them by category, and Table 9 shows effects of moderators. Across all included studies of programs on elementary mathematics, we found an average weighted ES of +0.09, p < .01 (k = 87), with outcomes that vary substantially among different categories.

Table 1

Meta-Regression Results

Coefficient	Reference group	β	SE	t	df	p
Null model
Intercept		0.11	0.02	6.42	72.92	.000
Meta-regression
Intercept	Tutoring	0.10	0.01	7.93	41.95	.000
PD focused on classroom organization and management		0.04	0.08	0.48	8.89	.641
PD focused on mathematics content and pedagogy		−0.12	0.07	−1.75	23.55	.094
PD focused on implementation of traditional and digital curricula		−0.15	0.07	−2.26	10.26	.047
Traditional and digital curricula with limited professional development		−0.11	0.07	−1.63	17.85	.120
Benchmark assessments		−0.15	0.10	−1.56	7.10	.163
PD focused on implementation of traditional curricula	PD focused on implementation of digital curricula	0.12	0.04	2.78	7.33	.026
Digital curricula	Traditional curricula	0.04	0.04	1.01	24.56	.324
Quasi-experiments	Randomized studies	0.12	0.04	3.30	12.21	.006
K–2	Mixed	−0.04	0.03	−1.15	15.79	.267
3–6	Mixed	0.00	0.02	−0.09	11.51	.930
Low achievers	Mixed achievers	0.05	0.03	1.87	12.07	.086
Moderate/high achievers	Mixed achievers	−0.02	0.02	−0.84	12.10	.419
Low SES	Mixed SES	−0.02	0.02	−0.65	20.74	.524
Moderate/high SES	Mixed SES	0.01	0.02	0.31	22.36	.759
International studies	U.S. Studies	−0.02	0.03	−0.47	30.80	.643
One-to-small group tutoring	One-to-one tutoring	0.12	0.08	1.52	15.19	.149

Note. Meta-regression model also controlled for cross-age and online tutoring. PD = professional development; SES = socioeconomic status.
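As a reading aid for Table 1, each t value should be approximately the coefficient divided by its standard error. Because β and SE are reported to two decimals, the ratio can only be checked within rounding bounds. The sketch below copies a few rows from Table 1 and tests whether each reported t falls inside the range implied by rounding. The helper names are hypothetical, and this is a plausibility check, not a reproduction of the authors' estimation procedure.

```python
from typing import Tuple

# Rows copied from Table 1: (label, beta, se, reported_t).
ROWS = [
    ("Null model intercept", 0.11, 0.02, 6.42),
    ("Intercept (tutoring)", 0.10, 0.01, 7.93),
    ("PD focused on implementation of curricula", -0.15, 0.07, -2.26),
    ("Traditional vs. digital curricula PD", 0.12, 0.04, 2.78),
    ("Quasi-experiments vs. randomized studies", 0.12, 0.04, 3.30),
]

HALF_UNIT = 0.005  # largest rounding error in a value reported to 2 decimals


def t_bounds(beta: float, se: float) -> Tuple[float, float]:
    """Range of |beta / se| compatible with two-decimal rounding.

    Assumes the reported se is larger than HALF_UNIT.
    """
    b = abs(beta)
    low = max(b - HALF_UNIT, 0.0) / (se + HALF_UNIT)
    high = (b + HALF_UNIT) / (se - HALF_UNIT)
    return low, high


def is_consistent(beta: float, se: float, t: float) -> bool:
    """True if t has the sign of beta and |t| lies within the rounding range."""
    low, high = t_bounds(beta, se)
    return beta * t >= 0 and low <= abs(t) <= high


for label, beta, se, t in ROWS:
    print(label, is_consistent(beta, se, t))
```

The bounds are wide when SE is small relative to the rounding unit, so this check can flag gross inconsistencies but cannot confirm exact values.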

Table 2

Mean Effect Sizes of Program Categories and Subcategories

Table	Category	k	n	ES	SE	t	df	p
3	Tutoring programs	22	39	+0.20	0.05	4.21	7.86	.003
	One-to-one tutoring	8	13	+0.19	0.06	3.36	7.50	.011
	One-to-small group tutoring	14	26	+0.30	0.05	5.88	13.38	.000
4	Professional development focused on mathematics content and pedagogy	10	23	+0.03	0.03	0.86	9.01	.411
5	Professional development focused on classroom organization and management	7	11	+0.19	0.06	3.30	4.16	.028
6	Professional development focused on implementation of traditional and digital curricula	12	35	+0.01	0.03	0.42	3.13	.705
	Professional development focused on implementation of traditional curricula	7	18	+0.12	0.03	4.88	5.51	.003
	Professional development focused on implementation of digital curricula	5	17	0.00	0.03	−0.03	3.15	.977
7	Traditional and digital curricula with limited professional development	30	67	+0.05	0.03	1.52	12.52	.153
	Traditional curricula	15	34	+0.04	0.04	1.06	12.33	.309
	Digital curricula	15	33	+0.08	0.02	4.02	11.88	.002
8	Benchmark assessments	4	5	0.00	0.08	−0.03	3.12	.975

Note. k = number of studies; n = number of outcomes; ES = effect size.

Table 3

Tutoring Programs

Study	Design	Duration	Sample size	Grade	Sample characteristics	Posttest	Effect size	Study effect size
Category mean: +0.20*
One-to-one tutoring
Subcategory mean: +0.19*
One-to-one tutoring by teachers
Math Recovery
Smith et al. (2013)	QE	1 Year	775 Students (259E, 516C)	1	48% minority, 15% ELL, 65% FRL	WJ-Math Fluency	+0.15*	+0.24*
						WJ-App. Problems	+0.28*
						WJ-Quant Concepts	+0.24*
						WJ-Math Reasoning	+0.30*
Numbers Count
Torgerson et al. (2013)	SR	12 Weeks	418 Students (144E, 274C)	Year 2 (Grade 1)	England. 75% FRL	Progress in Math (PIM 6)		+0.33*
One-to-one tutoring by teaching assistants
Catch Up® Numeracy						Program mean: +0.05
Hodgen et al. (2019)	CR	1 Year	142 Schools, 1,481 students (737E, 744C)	Year 4, 5 (Grade 3, 4)	Urban and rural schools in England. 22% FRL	Progress Test in Mathematics		−0.04
Rutt et al. (2014)	SR	30 Weeks	216 Students (108E, 108C)	Year 2–6 (Grade 1–5)	England 35% FRL	Progress Test in Mathematics		+0.21*
Galaxy Math
Fuchs, Geary, et al. (2013)	SR	16 Weeks	591 Students (385E, 206C)	1	Southeast school district. 69% AA, 7% H, 83% FRL	Word Problems		+0.25*
Maths Counts
See et al. (2018)	SR	3 Months	291 Students (147E, 144C)	Year 3–6 (Grade 2–5)	Low-performing students in England. 37% FRL, 54% SEN	Key Stage 2		+0.11
Pirate Math
Fuchs et al. (2010)	SR	16 Weeks	150 Students (100E, 50C)	3	Nashville and Houston; 35% SPED, 19% ELL, 75% FRL, 56% AA, 29% H			+0.37*
One-to-one tutoring by paid volunteers
MathCorps
Parker et al. (2019)	SR	6 Months	284 Students (183E, 101C)	4–6	Minnesota. 35%W, 27%AA, 20% A, 61%FRL	STAR Math		+0.20*
One-to-small group tutoring
Subcategory mean: +0.30*
One-to-small group tutoring by teachers
Number Rockets
Gersten et al. (2015)	CR	6 Months	76 Schools, 994 students (615E, 379C)	1	44% AA, 46% H, 34% FRL	TEMA–3		+0.34*
One-to-small group tutoring by teaching assistants
1stClass@Number
Nunes et al. (2018)	CR	3 Months	122 Schools, 503 students (251E, 252C)	Year 2 (Grade 1)	Schools in England. 40% FRL	Key Stage 1		+0.01
Affordable Primary Tuition
Torgerson et al. (2018)	CR	12 Weeks	102 Schools, 1,201 students (567E, 634C)	Year 6 (Grade 5)	England. 48% FRL, 72%W	Key Stage 2		+0.19
FocusMATH
Styers and Baird-Wilkerson (2011)	SR	1 Year	341 Students (166E, 175C)	3, 5	23% AA, 33% H, 24% ELL, 12% SPED, 71% FRL	KeyMath 3		+0.24*
Fraction Face-Off!					Program mean: +0.57*
Fuchs et al. (2013b)	SR	12 Weeks	259 Students (129E, 130C)	4	82% FRL, 11% ELL, 53% AA, 25% W, 19% H	NAEP Items		+0.88*
Fuchs, Schumacher, et al. (2016)	SR	12 Weeks	213 Students (143E, 70C)	4	17% ELL, 88% FRL, 15% SPED, 58% AA, 16% W, 17% H	NAEP Items		+0.39*
Fuchs, Malone, et al. (2016)	SR	12 Weeks	212 Students (142E, 70C)	4	49% AA, 27% H, 18% ELL, 90% FRL	NAEP Items		+0.64*
Malone et al. (2019)	SR	12 Weeks	225 Students (149E, 76C)	4	16% W, 43% AA, 25% H, 20% ELL, 88% FRL	NAEP Items		+0.29*
Fusion Math
Clarke et al. (2014)	SR	19 Weeks	78 Students (38E, 40C)	1	Pacific Northwest. 20% H, 18% ELL, 70% FRL, 12% SPED	SAT–10		+0.11
Onebillion maths apps
Nunes et al. (2019)	CR	12 Weeks	112 Schools, 1,089 students (543E, 546C)	Year 1 (K)	England. 25% FRL	PTM		+0.24*
ROOTS					Program mean: +0.19*
Clarke et al. (2016)	SR	4 Months	290 Students (203E, 87C)	K	Oregon. 5% AA, 58% W, 33% H, 32% LEP, 11% SPED	TEMA–3	+0.32*	+0.16
						NSB	+0.16
						SESAT	+0.001
Doabler et al. (2016)	SR	5 Months	292 Students (208E, 82C)	K	Boston. 7% AA, 89% W, 50% H, 26% ELL.	TEMA–3	+0.31*	+0.32*
						NSB	+0.40*
						SESAT	+0.24
Clarke et al. (2017)	SR	4 Months	689 Students (527E, 162C)	K	Oregon. 55% W, 26% H, 26% ELL, 87% FRL.	TEMA–3	+0.25*	+0.15
						NSB	+0.09
						SESAT	+0.12
Working Memory Intervention
Wright et al. (2019)	CR	5 Months	171 Schools, 1,822 students (882E, 940C)	Year 3 (Grade 2)	England; 37% FRL, 80% W	GL Assessment British Ability		+0.22
Online one-to-one tutoring
Affordable Online Maths Tuition
Torgerson et al. (2016)	CR	27 Weeks	64 Schools, 578 students (289E, 289C)	Year 6 (Grade 5)	England; 92% FRL, 43% minority	Key Stage 2		−0.03
Cross-age peer tutoring
Shared Maths
Lloyd et al. (2015)	CR	2 Years	79 Schools; Year 3 (tutees): 2,786 students; Year 5 (tutors): 2,683 students	Year 3, 5 (Grades 2, 4)	England; 22% FRL, 86% W, 4% AA, 5% A	ICAS-Year 3	+0.01	+0.02
Lloyd et al. (2015)	CR	2 Years		Year 3, 5 (Grades 2, 4)	England; 22% FRL, 86% W, 4% AA, 5% A	ICAS-Year 5	+0.02	+0.02

Note. Design/treatment: SR = student randomized; CR = cluster randomized; QE = quasi-experiment; CQE = cluster quasi-experiment. Measures: BAM = Balanced Assessment in Mathematics; CAT = California Achievement Test; CMT-Math = Connecticut Mastery Test; CST = California Standards Test; CSAP = Colorado Student Assessment Program; ECLS-K = Early Childhood Longitudinal Program; FCAT = Florida Comprehensive Assessment Test; GMADE = Group Mathematics Assessment and Diagnostic Evaluation; HCPS II = Hawaii Content and Performance Standards; ICAS = Interactive Computerised Assessment System; ISAT = Illinois Student Achievement Test; ISTEP+ = Indiana State Test of Educational Proficiency; ITBS = Iowa Test of Basic Skills; MAP = Measure of Academic Progress; MAT = Metropolitan Achievement Test; MEAP = Michigan Educational Assessment Program; NAEP = National Assessment of Educational Progress; NJASK = New Jersey State Test; NSB = Brief Number Sense Screener; Nevada CRT = Nevada Criterion Referenced Test; NWEA = Northwest Evaluation Association; PTM = Progress Test in Maths; SAT 10 = Stanford Achievement Test 10; SESAT = Stanford Early School Achievement Test; SOL = Virginia Standards of Learning; STAR Math = Standardized Testing and Reporting; TAKS = Texas Assessment of Knowledge and Skills; TEMA–3 = Test of Early Mathematics Ability 3; WJ III = Woodcock-Johnson III. Demographics: A = Asian; AA = African American; H = Hispanic; W = White; FRL = free/reduced-price lunch; ELL = English language learner; LD = learning disabilities; SPED = special education.
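For studies in Table 3 that report several outcome measures, the study effect size appears to match the unweighted mean of the outcome-level effect sizes. This is an inference from the reported numbers, not a rule stated in this section. The sketch below checks it for the four multi-outcome studies whose outcome values are listed, allowing for two-decimal rounding. The function name and tolerance are illustrative choices.

```python
from statistics import mean

# Outcome-level effect sizes and reported study-level effect sizes,
# copied from Table 3.
STUDIES = {
    "Smith et al. (2013)": ([0.15, 0.28, 0.24, 0.30], 0.24),
    "Clarke et al. (2016)": ([0.32, 0.16, 0.001], 0.16),
    "Doabler et al. (2016)": ([0.31, 0.40, 0.24], 0.32),
    "Clarke et al. (2017)": ([0.25, 0.09, 0.12], 0.15),
}


def matches_reported(outcomes, reported, tol=0.005):
    """True if the unweighted mean agrees with the reported value
    to within half a unit in the second decimal place."""
    return abs(mean(outcomes) - reported) <= tol


for study, (outcomes, reported) in STUDIES.items():
    print(study, round(mean(outcomes), 3), matches_reported(outcomes, reported))
```

All four studies pass this check, which supports reading the "Study effect size" column as a simple average across a study's outcomes. The category and program means are weighted and cannot be reproduced this way from the table alone.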