Abstract
Computer-based scaffolding provides temporary support that enables students to participate in and become more proficient at complex skills like problem solving, argumentation, and evaluation. While meta-analyses have addressed between-subject differences on cognitive outcomes resulting from scaffolding, none has addressed within-subject gains. This leaves much quantitative scaffolding literature not covered by existing meta-analyses. To address this gap, this study used Bayesian network meta-analysis to synthesize within-subjects (pre–post) differences resulting from scaffolding in 56 studies. We generated the posterior distribution using 20,000 Markov Chain Monte Carlo samples. Scaffolding has a consistently strong effect across student populations, STEM (science, technology, engineering, and mathematics) disciplines, and assessment levels, and a strong effect when used with most problem-centered instructional models (exception: inquiry-based learning and modeling visualization) and educational levels (exception: secondary education). Results also indicate some promising areas for future scaffolding research, including scaffolding among students with learning disabilities, for whom the effect size was particularly large (ḡ = 3.13).
Keywords
Having originated as a naturalistic description of how adults help toddlers learn to solve problems (Wood, Bruner, & Ross, 1976), scaffolding has expanded into a technique that is used among diverse learners and in the context of many problem-centered instructional approaches (Hawkins & Pea, 1987; Hmelo-Silver, Duncan, & Chinn, 2007; Stone, 1998). Along with this expansion, many scaffolding approaches, forms, and empirical studies have emerged. For example, scaffolding now encompasses one-to-one interactions with classroom teachers (van de Pol, Volman, & Beishuizen, 2010), interaction with similarly abled peers (Pifarre & Cobos, 2010), and computer-based tools (Devolder, Van Braak, & Tondeur, 2012; Reiser, 2004). Scaffolding is used among students of diverse educational levels and demographic backgrounds (Cuevas, Fiore, & Oser, 2002; Hadwin, Wozney, & Pontin, 2005). Furthermore, scaffolding is often designed to affect knowledge and skills beyond problem-solving ability, including argumentation ability (Jeong & Joung, 2007) and deep content knowledge (Davis & Linn, 2000). Synthesizing work on this expanded conceptualization of scaffolding is important to help researchers and designers determine what works best in scaffolding among particular populations and contexts. Scaffolding synthesis work has been done, but all of it focuses on between-subjects differences (Belland, Walker, Kim, & Lefler, 2017; Belland, Walker, Olsen, & Leary, 2015; Ma, Adesope, Nesbit, & Liu, 2014; Steenbergen-Hu & Cooper, 2013, 2014; Swanson & Deshler, 2003; Swanson & Lussier, 2001; VanLehn, 2011), leaving important questions of how much within-subject growth one might expect among average students unaddressed.
In this article, we address this gap by using Bayesian network meta-analysis to synthesize pre–post growth among networks of student populations, STEM (science, technology, engineering, and mathematics) disciplines, educational levels, and assessment levels (Berger, 2013; Lumley, 2002; Mills, Thorlund, & Ioannidis, 2013).
Literature Review
Scaffolding Definition
Scaffolding can be defined as contingent support that structures and highlights the complexity inherent in problem solving, thereby supporting current performance and promoting skill gain (Reiser, 2004; Wood et al., 1976). Three key attributes characterize scaffolding: contingency, intersubjectivity, and transfer of responsibility (Wood et al., 1976). First, scaffolding is contingent on dynamic assessment, which indicates students’ current abilities and where they need support. Scaffolding can be provided initially, and as dynamic assessment indicates that students are gaining skill or facing additional challenges, scaffolding can be faded or added, respectively (Collins, Brown, & Newman, 1989; Murray, 1999; Wood et al., 1976). Next, students need to recognize successful performance on the scaffolded task (Wood et al., 1976). Finally, scaffolding needs to engender independent task completion.
The concept of instructional scaffolding originated in describing one-to-one interactions with an ever-present tutor (Wood et al., 1976). Soon, researchers began to think about how the technique could be leveraged in other settings. One such way was one-to-one interactions from a classroom teacher who provided individualized help as students engaged with problems (van de Pol et al., 2010). Scaffolding is now used in the context of many instructional approaches, including project-based learning, problem-based learning, inquiry-based learning, and design-based learning (Belland, 2017). At the center of each is an ill-structured problem, defined as a problem that does not have just one correct solution, and which has multiple solution paths (Jonassen, 2011). To address such a problem, it is necessary to represent the problem qualitatively so as to recognize the critical factors and how they interact (Jonassen, 2003). Still, each problem-centered approach involves a different set of expectations, both in terms of process and product. For example, in design-based learning, students iterate designs that address the central problem (e.g., levee to prevent beach erosion of barrier islands; Kolodner et al., 2003); meanwhile, in inquiry-based learning, students pose and address their own questions (Keys & Bryan, 2001).
As computing power increased, researchers began to think about how computer tools could provide scaffolding (Hawkins & Pea, 1987). Computer-based scaffolding is often designed to (a) help students with what to consider when addressing a problem (conceptual scaffolding), (b) bootstrap a strategy for addressing a problem (strategic scaffolding), (c) invite students to question their own understanding (metacognitive scaffolding), and (d) enhance interest, autonomy, self-efficacy, and other motivational variables (motivation scaffolding; Belland, Kim, & Hannafin, 2013; Hannafin, Land, & Oliver, 1999; Rienties et al., 2012). Specific strategies embedded in scaffolding include cognitive support such as highlighting critical problem features, modeling expert processes and demonstration, and motivational support such as recruitment, direction maintenance, and controlling frustration (van de Pol et al., 2010; Wood et al., 1976).
Existing Scaffolding Meta-Analyses
Some work has been done to synthesize existing empirical work, but most such
synthesis work focuses on between-subjects differences—how students who used
scaffolding performed when compared with the performance of students who did not
use scaffolding. This is undeniably a crucial way to gauge the impact of an
intervention, and it indicates that scaffolding is a highly effective
intervention. Meta-analyses of between-group differences indicated that students
using a variety of scaffolding types performed 0.53 (Belland et al., 2015) and 0.46 (Belland et al., 2017)
Contextual Issues Related to Scaffolding
It makes little sense to try to find a universal design for scaffolding that is most effective because scaffolding (a) employs a wide range of strategies that are grounded in different theories (Koedinger & Corbett, 2006; Puntambekar & Kolodner, 2005; Quintana et al., 2004; van de Pol et al., 2010) and (b) is used in the context of many different problem-centered instructional approaches and subject matters, and by learners diverse in grade level and demographics (Hmelo-Silver et al., 2007; Lin et al., 2012; Puntambekar & Hübscher, 2005; Stone, 1998).
Learners
The age level with which scaffolding is used has expanded from preschool to K–12, college, graduate, and adult. Scaffolding can be seen as potentially a good fit for such a wide range of different age groups in that all need to learn to address ill-structured problems (Jonassen, 2011). The need to be able to address ill-structured problems is reflected in the needs of employers (Carnevale & Desrochers, 2003) and is at the center of the Common Core (McLaughlin & Overturf, 2012; National Governors Association Center for Best Practices & Council of Chief State School Officers, 2010) and Next Generation Science Standards (Achieve, 2013; Krajcik, Codere, Dahsah, Bayer, & Mun, 2014). At the same time, it is likely that different combinations of scaffolding strategies need to be used across these different age groups. A comprehensive traditional meta-analysis indicated that effect sizes for computer-based scaffolding were higher among adult learners than among college, secondary, middle-level, or primary students (Belland et al., 2017). Still, it is natural to question whether the strength of pre–post gains of computer-based scaffolding varies based on education level.
The original education population in which the definition of instructional scaffolding was grounded was middle class and average-achieving (Wood et al., 1976). But with the expansion of the metaphor, scaffolding began to be used among students with a much wider range of demographic characteristics. Early efforts found success using scaffolding among lower achieving students (Dimino, Gersten, Carnine, & Blake, 1990; Palincsar & Brown, 1984) and students with learning disabilities (Englert, Raphael, Anthony, & Stevens, 1991; Stone, 1998). As computer-based scaffolding became more widespread, it likewise came to be used among a wide range of learners, including students from traditional, low socioeconomic status (SES), and underrepresented backgrounds, as well as those who are lower achieving and higher achieving (Belland, 2017). Traditional meta-analysis efforts indicated that scaffolding leads to stronger between-subject effects among traditional students than among underperforming students (Belland et al., 2017). But it is also worthwhile to consider whether within-subject (pre–post) differences vary based on education population. This can be done through network meta-analysis.
Context of Use
The context in which scaffolding is used can vary widely, and this variation is associated with real differences in scaffolding strategy (Belland, 2017). Differences in context of use can be considered from two perspectives: the problem-centered instructional model with which scaffolding is used, and the subject matter in which the instruction is situated. Problem-centered instructional models with which scaffolding is used include project-based learning, problem-based learning, inquiry-based learning, design-based learning, case-based learning, and problem solving (Belland, 2017). These models all involve addressing an ill-structured problem, but the nature of the problem and what should be produced, as well as the inherent structure for student learning, vary among the models. For example, problem-based learning is the most open-ended in that students are expected to produce and argue for a conceptual solution to the problem (Hmelo-Silver, 2004), while design-based learning and project-based learning constrain the solution type (e.g., video or designed product) students need to produce. The stages through which students need to progress vary according to model as well. With such variation in process and product, it is natural to question whether corresponding within-subject effect sizes vary. This can be addressed through network meta-analysis.
Problem-centered models and the nature of central problems tend to cluster differently by subject matter. For example, design-based learning (Chandrasekaran, Stojcevski, Littlefair, & Joordens, 2013; Silk, Schunn, & Cary, 2009) and problem-based learning (Galand, Frenay, & Raucent, 2012; Yadav, Subedi, Lundeberg, & Bunting, 2011) are often used in engineering education. Inquiry-based learning (Edelson, Gordin, & Pea, 1999; Marx et al., 2004) and project-based learning (Barron et al., 1998; Krajcik et al., 1998) tend to cluster in science education.
Assessment Level
Crucial to examining scaffolding outcomes is determining whether the magnitude of pre–post gains of scaffolding depends on assessment level, defined as the nature of learning outcome targeted by assessment. Assessment levels include concept (ability to state definitions of basic knowledge), principles (ability to describe or use relationships between facts), and application (ability to use concept- and principles-level knowledge to address a new problem; Sugrue, 1995). Traditional meta-analysis indicated that scaffolding’s effect was greater when measured at the principles level than at the concept level (Belland et al., 2017).
Bayesian Network Meta-Analysis as a Potential Solution
Two techniques that can help researchers establish equivalence on response variables before the treatment is introduced are random selection and random assignment (Higgins et al., 2011). But little education research incorporates true random selection and assignment of participants, leading to high risk of bias in randomization (Higgins et al., 2011). Another method is to use students as their own controls through the use of a pretest that is equivalent to the posttest. Network meta-analysis allows one to synthesize pre–post differences across studies in order to make indirect comparisons between treatments that may not have been compared directly in any single study (Lumley, 2002; Mills et al., 2013). When taking a frequentist approach to network meta-analysis, all included studies need to contain a treatment and a control condition (Puhan et al., 2014). Thus, studies with multiple versions of a scaffolding treatment but no lecture control treatment cannot be included in a frequentist network meta-analysis. Taking a Bayesian approach to network meta-analysis allows researchers to include multiple treatment studies as long as each study has a treatment in common with another study (Bhatnagar, Lakshmi, & Jeyashree, 2014; Goring et al., 2016). Furthermore, taking a Bayesian approach sets up a decision-making framework that scaffolding researchers and funders can use to indicate which contexts hold the greatest promise for scaffolding (Jansen et al., 2011).
Research Questions
1. To what extent do learner characteristics moderate cognitive pre–post gains resulting from scaffolding?
(a) To what extent does education level among which scaffolding was used moderate cognitive pre–post gains?
(b) To what extent does education population among which scaffolding was used moderate cognitive pre–post gains?
2. To what extent does the context in which scaffolding is used moderate cognitive pre–post gains?
(a) To what extent does context of use of scaffolding moderate cognitive pre–post gains?
(b) To what extent does STEM discipline within which scaffolding was used moderate cognitive pre–post gains?
3. To what extent does assessment level moderate cognitive pre–post gains resulting from scaffolding?
Method
Design
For this synthesis effort, we followed a network meta-analysis approach from a Bayesian perspective. Network meta-analyses allow researchers to make direct and indirect comparisons of pre–post gains of different interventions that have a common comparator (Mills et al., 2013). Two principal advantages of network meta-analysis are its capacity to allow researchers to (a) make indirect comparisons among treatments that were never compared in a single study and (b) rank treatments according to effectiveness (Mills et al., 2013). However, the reliability of the indirect comparisons and rankings depends on the number of direct comparisons that are included in the network (Lumley, 2002; Mills et al., 2013). Furthermore, when the number of studies that represent a certain level of a moderator is low, the results for those moderator levels can be overweighted or biased. When (a) the number of direct comparisons among moderator levels is low and (b) there is no common comparator between moderator levels, one may opt to take a Bayesian approach to analysis. At a high level, in Bayesian approaches, rather than simply calculating the distribution of a collected sample without reference to what is already known (as one would do with a frequentist approach), one (a) determines possible prior distributions (considers what is already known about the distribution of the construct in the population of interest), (b) collects data from a sample, and (c) empirically approximates the posterior distribution (through, e.g., Markov Chain Monte Carlo [MCMC] sampling; see Figure 1; Carlin & Chib, 1995; Little, 2006; Lunn, Thomas, Best, & Spiegelhalter, 2000). For a relatively comprehensive and user-friendly introduction to Bayesian data analysis approaches, readers are directed to Gelman et al. (2013).
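The idea of an indirect comparison through a common comparator can be illustrated with a small sketch. The code below uses the simple frequentist adjusted indirect comparison (difference of two estimates, with variances added), which is not the Bayesian procedure used in this study; it is shown only to make the logic of a shared comparator concrete. The function names and all numbers are hypothetical.

```python
import math


def indirect_comparison(d_ab, var_ab, d_cb, var_cb):
    """Indirect estimate of A versus C through a shared comparator B.

    d_ab and d_cb are effect estimates of A and C, each relative to B.
    The variances add because the two estimates are assumed to come
    from independent sets of studies.
    """
    d_ac = d_ab - d_cb
    var_ac = var_ab + var_cb
    return d_ac, var_ac


def interval_95(estimate, variance):
    """Normal-approximation 95% interval for an estimate."""
    half_width = 1.96 * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical effect estimates, for illustration only.
d_ac, var_ac = indirect_comparison(d_ab=0.80, var_ab=0.04, d_cb=0.50, var_cb=0.09)
low, high = interval_95(d_ac, var_ac)
print(f"indirect A vs. C: {d_ac:.2f} (95% interval {low:.2f} to {high:.2f})")
```

Because the variances add, an indirect estimate is always less precise than either direct estimate it is built from, which is one reason the reliability of indirect comparisons depends on how many direct comparisons the network contains.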

Figure 1. Basic Bayesian approach.
Following a Bayesian approach requires that one establish a prior distribution, defined as the distribution of the parameters in question according to prior research. All relevant prior meta-analyses about computer-based scaffolding focused on between-subject, rather than within-subject, differences. Therefore, existing meta-analysis results are ill-equipped to form an informative prior distribution in this study. Furthermore, we wanted the current coding, rather than a prior distribution informed by between-subjects effects, to primarily drive the approximation of the posterior distribution (Jansen, Crawford, Bergman, & Stam, 2008). Therefore, this article employs a noninformative prior distribution model, which can be used when there is insufficient information about a treatment’s effectiveness or there is no consensus about the effectiveness among scholars. Among several possible noninformative prior distribution models, which have different assumptions about the variance between studies (e.g., maximum and minimum tau values), a uniform prior distribution on tau (0, 5) was selected on the basis of deviance information criterion statistics (see Supplementary Table S1 in the online version of the journal), which evaluate and compare generated Bayesian models (Spiegelhalter, Best, Carlin, & van der Linde, 2002).
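For readers who prefer notation, a generic random-effects formulation with this kind of prior can be written as follows. This is an illustrative sketch, not the model specified in the supplementary WinBUGS code: the symbols, the normal likelihood, and the vague prior on the mean are assumptions, and only the Uniform(0, 5) prior on tau comes from the text. The last line gives the standard definition of the deviance information criterion, where a smaller value indicates a better trade-off between fit and complexity.

```latex
\begin{align*}
y_i \mid \theta_i &\sim \mathcal{N}(\theta_i,\, s_i^2)
  && \text{observed effect size } y_i \text{ with known sampling variance } s_i^2 \\
\theta_i \mid \mu, \tau &\sim \mathcal{N}(\mu,\, \tau^2)
  && \text{true effects vary between studies} \\
\mu &\sim \mathcal{N}(0,\, 100^2)
  && \text{vague prior on the mean effect (assumed)} \\
\tau &\sim \mathrm{Uniform}(0,\, 5)
  && \text{noninformative prior on the between-study SD} \\[4pt]
\mathrm{DIC} &= \bar{D} + p_D,
  \qquad p_D = \bar{D} - D(\bar{\theta})
\end{align*}
```

Here $\bar{D}$ is the posterior mean of the deviance and $D(\bar{\theta})$ is the deviance evaluated at the posterior means of the parameters, so $p_D$ acts as an estimate of the effective number of parameters.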
Next, one collects current data; in this study, this is our coding of articles collected through literature search. Then, one runs MCMC simulations informed by the prior distribution and the current data to empirically approximate the posterior distribution, defined as the distribution of true parameters. We did this using WinBUGS (Lunn et al., 2000; see Supplementary Table S1 in the online version of the journal for our WinBUGS code). Readers who are interested in learning more about how to run the calculations for a Bayesian network meta-analysis with the combination of STATA and WinBUGS are directed to the screencast available in Supplementary Video S2 in the online version of the journal. Readers interested in learning more about the foundations and application of coordinating Bayesian analysis between STATA and WinBUGS are directed to Thompson (2014). Many of the principles behind the commands and processes would be similar if combining WinBUGS with other statistical packages like R or SAS.
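To give a sense of what approximating a posterior through MCMC involves, the following is a minimal sketch in plain Python rather than WinBUGS. It assumes a simple random-effects model (normal likelihood, flat prior on the mean, uniform prior on tau between 0 and 5) and is not the network model used in this study. The function name, the burn-in length, the proposal step size, and the six effect sizes are all hypothetical.

```python
import math
import random


def sample_posterior(effects, variances, n_samples=20000, burn_in=2000,
                     tau_max=5.0, step=0.3, seed=1):
    """Metropolis-within-Gibbs sampler for a random-effects meta-analysis.

    Assumed model (illustrative only):
        y_i ~ Normal(theta_i, v_i), theta_i ~ Normal(mu, tau^2),
        flat prior on mu, tau ~ Uniform(0, tau_max).
    Returns the retained draws of mu and tau as two lists.
    """
    rng = random.Random(seed)
    k = len(effects)
    thetas = list(effects)
    mu = sum(effects) / k
    tau = 0.5
    tau_min = 1e-6  # tiny lower bound that guards against division by zero
    mu_draws, tau_draws = [], []

    def log_density_tau(t):
        # Log density of the true effects given mu and tau, up to a constant;
        # the uniform prior contributes nothing inside its support.
        squared = sum((theta - mu) ** 2 for theta in thetas)
        return -k * math.log(t) - squared / (2.0 * t * t)

    for iteration in range(burn_in + n_samples):
        # Gibbs step: each true effect is a precision-weighted compromise
        # between the observed effect and the overall mean.
        for i in range(k):
            precision = 1.0 / variances[i] + 1.0 / tau ** 2
            mean = (effects[i] / variances[i] + mu / tau ** 2) / precision
            thetas[i] = rng.gauss(mean, math.sqrt(1.0 / precision))
        # Gibbs step: overall mean given the true effects.
        mu = rng.gauss(sum(thetas) / k, tau / math.sqrt(k))
        # Metropolis step: random-walk proposal for tau, rejected outright
        # when it falls outside the support of the prior.
        proposal = tau + rng.gauss(0.0, step)
        if tau_min < proposal < tau_max:
            log_ratio = log_density_tau(proposal) - log_density_tau(tau)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                tau = proposal
        if iteration >= burn_in:
            mu_draws.append(mu)
            tau_draws.append(tau)
    return mu_draws, tau_draws


# Hypothetical pre-post effect sizes and sampling variances.
effects = [0.9, 1.4, 0.6, 1.1, 2.0, 0.8]
variances = [0.05, 0.08, 0.04, 0.10, 0.12, 0.06]
mu_draws, tau_draws = sample_posterior(effects, variances)
posterior_mean = sum(mu_draws) / len(mu_draws)
print(f"posterior mean of mu: {posterior_mean:.2f}")
```

The retained draws stand in for the posterior distribution: summaries such as the posterior mean or a credible interval are computed directly from them. In practice one would also inspect convergence (e.g., by running several chains), which this sketch omits.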
Literature Search
We used a three-pronged literature search to identify 7,589 potential studies,
which were published between January 1, 1993, and December 31, 2015 (see Figure 2). The databases
searched were ProQuest, Education Source, psycINFO, CiteSeer, ERIC, Digital
Dissertations, PubMed, Academic Search Premier, IEEE, and Google Scholar, and
search terms used were various combinations of the following terms:

Figure 2. Number of studies added at each stage of literature search and dropped at each stage of the exclusion process.
Application of Inclusion Criteria
Inclusion criteria were that (a) participants addressed an ill-structured problem in one of the STEM fields (science, technology, engineering, and mathematics); (b) participants used a computer-based scaffolding intervention; (c) participants took a similar pretest and posttest covering a cognitive variable; (d) sufficient statistics were reported to calculate effect size; and (e) there were at least two treatments. We defined ill-structured problems as those for which qualitative representation of the problem was necessary, and not all information necessary to do so was presented to students (Jonassen, 2011). All included studies had to have a treatment in common with at least one other study (Mills et al., 2013). Thus, if a study compared two scaffolding types that were not examined in any other study, then it would be excluded. When more than one study reported the same data, the one with the most information (e.g., dissertation) was retained.
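The requirement that every study share a treatment with at least one other study can be checked mechanically. The sketch below is one hypothetical way to do so; the function name, study labels, and treatment labels are invented for illustration and do not correspond to the coded data.

```python
def studies_without_shared_treatment(studies):
    """Return labels of studies that share no treatment with any other study.

    `studies` maps a study label to the set of treatment labels it compared.
    """
    isolated = []
    for name, treatments in studies.items():
        shares_treatment = any(
            treatments & other_treatments
            for other_name, other_treatments in studies.items()
            if other_name != name
        )
        if not shares_treatment:
            isolated.append(name)
    return isolated


# Hypothetical studies and treatment labels, for illustration only.
studies = {
    "Study A": {"conceptual scaffolding", "lecture control"},
    "Study B": {"strategic scaffolding", "lecture control"},
    "Study C": {"metacognitive scaffolding", "strategic scaffolding"},
    "Study D": {"novel scaffold X", "novel scaffold Y"},
}
print(studies_without_shared_treatment(studies))
```

In this example, Study D would be flagged for exclusion because neither of its treatments appears in any other study, whereas Study C stays in the network through the treatment it shares with Study B even though it has no lecture control.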
Application of inclusion criteria proceeded in a two-stage manner. In Stage 1,
the inclusion criteria were applied in a pre–pass manner to winnow the list of
studies that resulted from the literature search (see Figure 2 for the number of studies
dropped according to element of the exclusion process). Specifically, one
researcher applied the inclusion criteria and only removed a study from
consideration if it clearly did not meet the inclusion criteria. Stage 1
resulted in dropping
In Stage 2, alternating pairs of researchers read each article resulting from
Stage 1 and applied inclusion criteria. Based on our inclusion criteria, 1,062
studies were excluded. The final number of included studies was
Coding Scheme
Articles were coded for the following characteristics: education population, education level, STEM discipline, and assessment level. Our coding process, along with examples from the coded studies, is shared in the following paragraphs.
Effect Size Calculations
All included studies had at least two treatments—usually one scaffolding
treatment and a lecture control condition, but sometimes two different
scaffolding treatments. For each treatment group, the sample size, pretest
mean, pretest standard deviation, posttest mean, and posttest standard
deviation were inputted into a free online tool (http://esfree.usu.edu/) to calculate effect size. All
reported effect sizes used the Hedges’s
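As an illustration of how a small-sample-corrected pre–post effect size can be computed from the five statistics listed above, the following sketch uses one common formulation. It is an assumption, not necessarily the formula implemented in the online tool: the mean gain is divided by the root mean square of the pretest and posttest standard deviations, and the result is multiplied by the usual approximation to Hedges's correction factor. The function name and the example numbers are hypothetical.

```python
import math


def prepost_hedges_g(n, pre_mean, pre_sd, post_mean, post_sd):
    """Small-sample-corrected standardized pre-post gain for one group.

    Assumed formulation (the exact formula used by the authors' tool is
    not stated here): gain divided by the root mean square of the two
    SDs, multiplied by J = 1 - 3 / (4 * df - 1) with df = n - 1.
    """
    if n < 2:
        raise ValueError("at least two participants are required")
    sd_pooled = math.sqrt((pre_sd ** 2 + post_sd ** 2) / 2.0)
    d = (post_mean - pre_mean) / sd_pooled
    df = n - 1
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return correction * d


# Hypothetical summary statistics for one treatment group.
g = prepost_hedges_g(n=30, pre_mean=10.0, pre_sd=4.0, post_mean=14.0, post_sd=4.0)
print(f"g = {g:.3f}")
```

The correction factor shrinks the estimate slightly, and the shrinkage matters most in small samples. Other formulations, for example ones that use the pretest SD alone or that account for the pretest–posttest correlation, would give somewhat different values.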
Education Level (Primary/K–5, Middle Level/6–8, Secondary/9–12, College/Vocational/Technical, Graduate/Professional, Adult)
Education level was coded as (a) primary when the majority of participants were enrolled in Grades K–5, (b) middle level when the majority of participants were enrolled in Grades 6 to 8, (c) secondary when the majority of participants were enrolled in Grades 9 to 12, (d) college/vocational/technical when the majority of participants were enrolled in a 4-year bachelor’s program or 2-year associate’s program, (e) graduate when the majority of participants were enrolled in a graduate degree program (e.g., master’s or doctorate), or (f) adult when the majority of participants were over the age of 18 years but not enrolled in a college or graduate-level program.
Education Population (Traditional, High-Performing, Underperforming, English Language Learners (ELL), Underrepresented, and Persons With Learning Disabilities)
Education population refers to participant characteristics that may be associated with differences in educational outcomes in STEM (Heinrich, Knight, Collins, & Spriggs, 2016; Hernandez, Schultz, Estrada, Woodcock, & Chance, 2013; Molina, Borror, & Desir, 2016; Williams, Thomas, Ernst, & Kaui, 2015). Participants were coded as traditional when no argument was made that the majority of participants had a demographic characteristic or preexisting performance levels that makes them substantially different from students representing majority characteristics and typical performance for the country of study. For example, Chen, Kao, and Sheu (2003) noted having chosen their three participating schools because they were “located near 3 of the 10 best bird-watching sites in Taiwan” (p. 355). This was important because the scaffolding was aimed at helping students solve problems related to bird identification in the field. However, it does not have any bearing on education population characteristics, and thus participants were labeled as traditional. Sometimes, authors labeled participants as high-performing or low-performing based on preexisting measures of performance. For example, Liu (2004) reported pretest and posttest means separately for students who were identified as talented and gifted, those in the regular track, and those with learning disabilities or who were ELL. The author made the case that such groups represented high performers, traditional students, and underperformers, respectively. Sometimes, an argument was made that the entire school was high-performing or low-performing, and the education population was coded accordingly. For example, students in one study were coded as high-achieving because the study authors identified the participating school as having consistently ranked in the top 10 in its country according to an academic measure (Tan, Loong, & So, 2005). 
Education population was coded as ELL when the majority of participants spoke English as a second language but were instructed in English. For example, test scores were broken down according to participating school in Songer, Lee, and McDonald (2003). At one such school, only 38% of students spoke English as a primary language, and thus the corresponding scores were coded as ELL. Education population was coded as underrepresented when most participants are not typically represented in the target discipline. For example, participants in Bulu and Pedersen (2010) were 50% Hispanic and 35% African American, and the domain was science, where individuals from these groups are underrepresented. Education population was coded as persons with learning disabilities when the majority of participants had a documented disability for which an individualized education program would be prepared and which would interfere with learning the target content. For example, of the nine elementary school students who used scaffolding in the context of mathematics instruction in Xin et al. (2017), three had learning disabilities, one had attention deficit hyperactivity disorder, and one had a mild intellectual disability.
Instructional Approach (Problem-Based Learning With Scaffolding, Project-Based Learning With Scaffolding, Inquiry-Based Learning With Scaffolding, Case-Based Learning With Scaffolding, Design-Based Learning With Scaffolding, Modeling/Visualization With Scaffolding, and Problem Solving With Scaffolding)
Problem-based learning with scaffolding was identified when (a) the problem was presented first, and was the driver of all subsequent learning; (b) teachers served as facilitators rather than information providers; and (c) computer-based scaffolding was provided (Barrows & Tamblyn, 1980; Hmelo-Silver, 2004). For example, in Liu (2004), middle school students were presented with an ill-structured problem in which aliens are stranded, and needed to find a new home within our solar system. Student learning about characteristics of planets was driven by this problem, and teachers served as facilitators, rather than information providers. Project-based learning with scaffolding needed to involve learning focused toward the production of a real-world project/deliverable related to the central problem, and computer-based scaffolding needed to be provided (Helle, Tynjälä, & Olkinuora, 2006; Krajcik et al., 1998). For example, in Barak and Dori (2005), students addressed a sequence of chemistry problems and needed to construct a chemical model of the chemical that would address the problem. In inquiry-based learning with scaffolding, students needed to pose one or more question(s) related to the problem, devise and carry out a method to address the question(s), and be provided scaffolding (Crippen & Archambault, 2012; Edelson et al., 1999). For example, in Ardac and Sezen (2002), students used simulation software in which they could ask questions that they could then address by manipulating different variables related to a chemical reaction. In case-based learning with scaffolding, all necessary information is given to students often via lecture, then a case is provided, and students need to solve the case using the provided information and with the aid of scaffolding (Srinivasan, Wilkes, Stevenson, Nguyen, & Slavin, 2007; Thistlethwaite et al., 2012). For example, in Feyzi-Behnagh et al. (2014), participants needed to solve unique cases related to dermatology. 
Design-based learning with scaffolding was coded when students were invited to design and/or produce a product that would address an ill-structured problem (Kolodner et al., 2003; Silk et al., 2009). For example, Puntambekar, Stylianou, and Hübscher (2003) invited students to address authentic problems related to force by designing artifacts like roller coasters. Problem solving with scaffolding was identified when students needed to address an ill-structured problem, but the problem-centered instructional model could not be classified as problem-based learning, project-based learning, inquiry-based learning, case-based learning, design-based learning, or modeling/visualization.
STEM Discipline (Science, Technology, Engineering, Mathematics)
We coded this category according to the problem students were addressing, rather than the discipline of the class in which participants were enrolled. We always coded according to a broad category (e.g., engineering), and a narrower category (e.g., electrical engineering). This decision was made for two reasons: (a) the subject matter of the class did not always align with the nature of the problem being addressed, and the nature of the problem being addressed was deemed to be more important to an examination of scaffolding; (b) participants were not always drawn from a formal class. As an example of the first point, participants in Magana (2014) were in an introductory educational computing course, but were addressing a problem related to scale (nanoscale, microscale, and macroscale). Because the goal was that students be able to order, classify, and sort shapes according to scale, the STEM discipline was coded as mathematics. As an example of the second point, participants in Chen, Kao, and Sheu (2005) engaged in a mobile butterfly watching activity. Within the study, participants needed to compare photographs they took with database photos of butterflies; for this reason, it was coded as science–ecology. There was a focus on engineering implications of electrical current in another study (de Jong, Härtel, Swaak, & van Joolingen, 1996). So while the participants were high school students enrolled in physics and engineering courses, the study was coded as electrical engineering.
Assessment Level (Concept, Principles, and Application)
Assessments were labeled on the basis of what students were asked to know and do with the target knowledge (Sugrue, 1995). Concept-level assessments measured whether participants knew basic knowledge. For example, a pretest and a posttest in one study asked declarative knowledge questions about scientific instruments, the solar system, and planet characteristics (Bulu & Pedersen, 2010). Principles-level assessment was coded when participants were asked to identify relationships/connections between facts, either in terms of directionality or scale. For example, an assessment invited students to read a scenario in which scientists were investigating a phenomenon, and students needed to indicate the hypothesis that was being tested (Tan et al., 2005). Application-level assessment was coded when participants needed to apply concept-level knowledge and principles-level knowledge to a new holistic/authentic problem. For example, high school students needed to use physics knowledge and principles to describe how a shuffle stone moves across a shuffleboard (Gijlers, 2005).
Coding Process
Alternating pairs of coders, drawn from a pool of four researchers with expertise in
scaffolding, meta-analysis, or both, coded the studies. Two researchers
independently coded each study and then met to discuss coding discrepancies and
come to consensus. We used Krippendorff’s alpha to assess interrater reliability
on initial coding because it can handle the range of scales (nominal, ordinal,
and ratio) present in our coding data, and it adjusts for chance agreement
(Krippendorff, 2004). Because Krippendorff’s alpha adjusts for chance agreement, is
appropriate to use with multiple scales, and can account for unused scale
points, its values are typically lower than other popular indices of agreement
such as percentage agreement and Cohen’s kappa, and thus should not be
interpreted in light of such statistics. The interrater reliability analysis
used 218 data points. All
alphas were greater than .667 (see Supplementary Table S1 in the online version of the journal),
which represents the minimum standard for acceptable reliability (Krippendorff, 2004).
The lowest Krippendorff’s alpha values: .731 for assessment level, and .761 for
context of use, were further analyzed using the
Consensus codes were used in all analyses. An earlier version of the coding scheme was developed in two ways—through synthesis of the scaffolding literature and development of in vivo codes; this was then used for a pilot scaffolding meta-analysis project (Belland et al., 2015). We presented the coding scheme and our suggested additions to encompass a broader swath of literature to our advisory board. They then either confirmed that the coding categories and their associated levels were reasonable or suggested revisions. The revised coding scheme was then used in a comprehensive, traditional meta-analysis (Belland et al., 2017), and, with the exception of the calculation of ESs, the coding categories used in this article were the same.
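To make the interrater reliability statistic concrete, the following is a minimal sketch of Krippendorff's alpha for the simplest case only: two coders, nominal codes, and no missing values. It is not the computation used in this study, which also involved ordinal and ratio scales; the function name and the example codes are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(coder_a: Sequence[Hashable],
                               coder_b: Sequence[Hashable]) -> float:
    """Krippendorff's alpha for two coders, nominal data, no missing values."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("Both coders must rate the same non-empty set of units.")

    # Coincidence matrix: each unit contributes both ordered pairs (a, b) and (b, a).
    coincidences: Counter = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    n = 2 * len(coder_a)  # total number of pairable values
    totals: Counter = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count

    # Observed disagreement: share of coincidences that fall off the diagonal.
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    # Expected disagreement if the same values were paired by chance.
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n * (n - 1))
    if expected == 0:
        raise ValueError("Alpha is undefined when only one category is used.")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical assessment-level codes from two coders (illustration only).
    first = ["concept", "concept", "principles", "application", "principles"]
    second = ["concept", "principles", "principles", "application", "principles"]
    print(round(krippendorff_alpha_nominal(first, second), 3))
```

Because alpha divides observed disagreement by the disagreement expected by chance, it can be much lower than raw percentage agreement for the same codes, which is why the two should not be read on the same scale.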
Meta-Analytic Procedures/Statistical Analyses
The wide range of participants, context of use, study measures, and educational
levels makes it unlikely that each outcome represents an approximation of a
single true ES. This led us to use a random effects model (Borenstein, Hedges, Higgins, & Rothstein,
2009). Analyses were conducted using the metan package of Stata 14
and WinBUGS 1.4.3. Specifically, WinBUGS 1.4.3 was used to run MCMC simulations
using Gibbs sampling. We used 20,000 MCMC samples for each analysis. This study
used the 2-level model
Using a Bayesian approach helps address small study effects (Kay, Nelson, & Hekler,
2016; Mengersen,
Drovandi, Robert, Pyne, & Gore, 2016). A related threat is the file drawer
problem, a form of publication bias in which studies with negative or null
effects often go unpublished. To guard against
this threat, we examined the underlying coding. We found only two positive
outliers (
The presence of similar pretests and posttests within the same study can present a risk of testing bias. Within the overall Bayesian network meta-analysis of scaffolding in STEM education project, we also wrote an article covering scaffolding characteristics and risk of bias—a lens with which to code research quality that does not make assumptions when data are not present (Higgins et al., 2011). Results showed that there was no substantial risk of bias due to testing effect (Walker, Belland, Kim, & Piland, 2017).
MCMC simulations generate the posterior distribution, which represents the range of true ESs for each moderator. Using Bayesian probability, one can calculate the probability that each moderator level is the best (Jansen et al., 2011). We report this as “probability of the best.” One can also calculate the probability that each moderator level is second best, third best, and so on. Averaging all such probability levels together for each moderator level allows one to arrive at a rank order for the levels of the moderator. We report this as “ranking.”
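The ranking logic described above can be sketched in a few lines. This is an illustrative sketch, not the WinBUGS code used in this study: it assumes one already has posterior draws of the effect size for each moderator level, ranks the levels within each draw, and summarizes the rank probabilities as a mean rank. The level names and simulated draws are hypothetical.

```python
import random
from typing import Dict, List


def rank_probabilities(draws: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """For each level, return [P(rank 1), P(rank 2), ...] across posterior draws."""
    levels = list(draws)
    n_draws = len(draws[levels[0]])
    counts = {level: [0] * len(levels) for level in levels}
    for i in range(n_draws):
        # Within this draw, order the levels from largest to smallest effect.
        ordered = sorted(levels, key=lambda level: draws[level][i], reverse=True)
        for rank, level in enumerate(ordered):
            counts[level][rank] += 1
    return {level: [c / n_draws for c in counts[level]] for level in levels}


def probability_of_best(rank_probs: List[float]) -> float:
    """Probability that the level has the largest effect (rank 1)."""
    return rank_probs[0]


def mean_rank(rank_probs: List[float]) -> float:
    """Expected rank: each rank weighted by its posterior probability."""
    return sum((k + 1) * p for k, p in enumerate(rank_probs))


if __name__ == "__main__":
    # Hypothetical posterior draws for three moderator levels (illustration only).
    rng = random.Random(0)
    simulated = {
        "level_a": [rng.gauss(1.0, 0.5) for _ in range(20000)],
        "level_b": [rng.gauss(0.9, 0.1) for _ in range(20000)],
        "level_c": [rng.gauss(0.6, 0.2) for _ in range(20000)],
    }
    probs = rank_probabilities(simulated)
    for level in sorted(probs, key=lambda name: mean_rank(probs[name])):
        print(level, round(probability_of_best(probs[level]), 3),
              round(mean_rank(probs[level]), 2))
```

A level with a wide posterior can have a high probability of being the best and yet a middling mean rank, so the two summaries need not agree.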
The goal of Bayesian network meta-analysis is to model a network of evidence pertaining to scaffolding treatments and common treatments—sometimes lecture-based controls and sometimes other scaffolding treatments. Because not all scaffolding treatments will have been compared directly with control, it does not make sense to calculate a two-node network computing one effect size estimate for all scaffolding treatments versus control (Lumley, 2002).
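For readers unfamiliar with Gibbs sampling, the following sketch shows how a posterior distribution and a 95% credible interval can be approximated for a simple two-level random-effects model. It is a deliberately simplified stand-in, not the network model fitted in WinBUGS for this study: it pools a single set of effect sizes, and the model form, the priors (flat on the mean, inverse-gamma(0.001, 0.001) on the between-study variance), and all data values are assumptions made for illustration.

```python
import random
from statistics import mean
from typing import List, Tuple


def gibbs_random_effects(effects: List[float], variances: List[float],
                         n_samples: int = 20000, burn_in: int = 2000,
                         seed: int = 0) -> List[float]:
    """Return posterior draws of the overall mean effect mu.

    Assumed model (illustration only):
        y_i ~ Normal(theta_i, v_i),  theta_i ~ Normal(mu, tau2)
    with a flat prior on mu and an inverse-gamma(0.001, 0.001) prior on tau2.
    """
    rng = random.Random(seed)
    k = len(effects)
    mu, tau2 = mean(effects), 1.0
    draws: List[float] = []
    for step in range(n_samples + burn_in):
        # 1. Study-level true effects given mu and tau2 (precision-weighted).
        thetas = []
        for y, v in zip(effects, variances):
            precision = 1.0 / v + 1.0 / tau2
            center = (y / v + mu / tau2) / precision
            thetas.append(rng.gauss(center, precision ** -0.5))
        # 2. Overall mean given the study-level effects.
        mu = rng.gauss(mean(thetas), (tau2 / k) ** 0.5)
        # 3. Between-study variance given mu and the study-level effects.
        shape = 0.001 + k / 2.0
        rate = 0.001 + 0.5 * sum((t - mu) ** 2 for t in thetas)
        tau2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        if step >= burn_in:
            draws.append(mu)
    return draws


def credible_interval(draws: List[float], level: float = 0.95) -> Tuple[float, float]:
    """Equal-tailed credible interval taken from the sorted posterior draws."""
    ordered = sorted(draws)
    lower = ordered[int((1 - level) / 2 * len(ordered))]
    upper = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical pre-post effect sizes and sampling variances (illustration only).
    y = [0.8, 1.0, 1.2, 0.9, 1.1, 1.0, 0.95, 1.05]
    v = [0.04] * len(y)
    posterior = gibbs_random_effects(y, v)
    print(round(mean(posterior), 2), credible_interval(posterior))
```

With few outcomes at a moderator level, the draws of the between-study variance spread out and the credible interval for the mean widens accordingly.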
Results
Research Question 1: To What Extent Do Learner Characteristics Moderate Cognitive Pre–Post Gains Resulting From Scaffolding?
Education Level
When interpreting the network plot (see Supplementary Figure S3 in the online version of the
journal), one can see the number of unique outcomes for each level (e.g.,
middle level) of the target characteristic (e.g., education level). Each
solid line between two circles represents the number of direct comparisons
between the two levels of the target characteristic. For example, the solid
line between middle level and control shows that there were eight direct
comparisons of middle-level students using scaffolding with students in a
control condition. Of note, for education level, there are no studies
that compared students at different educational levels, which is to be
expected. Dotted lines indicate indirect comparison information that can be
ascertained among treatment characteristics that were never directly
compared in a single study. The number of outcomes was (a) greatest at the
college/vocational/technical level (
Pre–post effect size estimates are highest among college- and graduate-level
learners, at

Effect size (ES) estimates and 95% credible intervals (CrI) of scaffolding according to education level.
Using a Bayesian network meta-analysis approach allows estimation of the true effect size and enables rank ordering treatments and calculating the probability that each treatment is the best. Scaffolding led to the highest pre–post gains at the college and graduate levels (see Table 1), ranked first and second with a 35% and a 47% chance of being the best, respectively.
Ranking and probability of the best of scaffolding used at different education levels
Education Population
The evidence is strongest for the comparison of traditional students using scaffolding versus control (25 outcomes; see Supplementary Figure S3 in the online version of the journal). There are some studies that contained multiple educational populations. For example, traditional students using scaffolding co-occurred with underrepresented students, high-performing students, underperforming students, and ELL in at least one study for each combination.
The pre–post gains are consistently positive and substantial across
educational populations (see Figure 4). The number of outcomes for
scaffolding used by traditional students was the greatest, leading the group
to have the tightest credible interval. Note that

Effect size (ES) estimates and 95% credible intervals (CrI) of scaffolding according to education population.
When examining ranking and probability of the best, one finds scaffolding to have a high probability of having the best ranking when used among students with learning disabilities (see Table 2). Indeed, the probability of the best is virtually nil for all other education populations.
Ranking and probability of the best of scaffolding used among members of different education populations
Research Question 2: To What Extent Does the Context in Which Scaffolding Is Used Moderate Cognitive Pre–Post Gains?
Context of Use
With the exception of problem solving, the number of coded outcomes for each problem-centered instructional model was very small (see Supplementary Figure S3 in the online version of the journal). This resulted in a very large range of true effects as calculated through Bayesian simulations.
The highest pre–post effect size was for project-based learning
(

Effect size (ES) estimates and 95% credible intervals (CrI) of scaffolding according to problem-centered instructional model with which scaffolding was used.
Project-based learning has the highest probability of the best (see Table 3). The ranking of problem solving is close behind that of project-based learning, but problem solving has a much lower likelihood of being the best.
Ranking and probability of the best of scaffolding used in the context of different problem-centered instructional models
STEM Discipline
Science and mathematics had the most coded outcomes, resulting in tighter credible intervals than in engineering and technology (see Supplementary Figure S3 in the online version of the journal).
Mathematics and technology had the highest pre–post effect sizes:

Effect size (ES) estimates and 95% credible intervals (CrI) of scaffolding according to STEM discipline.
Ranking and probability of the best of scaffolding used in the context of different STEM disciplines
Research Question 3: To What Extent Does Assessment Level Moderate Cognitive Pre–Post Gains Resulting From Scaffolding?
The network of evidence included a substantial number of direct comparisons among the assessment levels and between each assessment level and control, with the exception of between application and control (see Supplementary Figure S3 in the online version of the journal).
Scaffolding led to strong pre–post gains across assessment levels, with the
lowest effect size estimate at the application level (

Effect size (ES) estimates and 95% credible intervals (CrI) of scaffolding according to assessment level.
The magnitude of difference among the assessment levels is minor. Comparing assessment levels through ranking similarly shows that there is little evidence to say that scaffolding is more effective at a particular assessment level than another (see Table 5).
Ranking and probability of the best of scaffolding when measured at different assessment levels
Discussion
Implications for Instruction
It is often thought that when selecting an instructional approach, one needs to
determine which level of educational outcome (e.g., concept level, problem
solving) is most important, and select the approach that best aligns with the
outcome (Kuhn, 2007;
Wiggins & McTighe,
2005). For example, many posit that using direct instruction is best
at promoting strong conceptual knowledge (Kirschner, Sweller, & Clark, 2006).
Still others argue that it is best to use problem-based learning to enhance
problem-solving skill (Hmelo-Silver et al., 2007; Kuhn, 2007). In this way, teachers are
often left in a quandary. Specifically, they often hear through professional
learning and standards (e.g., Common Core and Next Generation Science Standards)
that it is important to engage students in authentic problem solving (Drew, 2012; McLaughlin & Overturf,
2012). But they also know that if their students do not perform well
on state standardized tests that emphasize declarative knowledge, their schools
may be labeled low-performing, and other undesirable outcomes may ensue (Harman, Boden, Karpenski,
& Muchowicz, 2016; Price, 2016). Thus, it is often
difficult to convince K-12 teachers to integrate problem-centered learning
(Keys & Bryan,
2001; Kim,
Hannafin, & Bryan, 2007; Nariman & Chrispeels, 2015).
Previous meta-analysis work implied that computer-based scaffolding leads to
between-subjects differences that were statistically greater than zero and above
Turning to effect size estimates, it is important to remember that Bayesian
network meta-analyses deal with a fundamentally different effect
(within-subjects) than traditional meta-analyses (between-subjects). Thus, there
is a need for extreme caution when comparing the two. But there are metrics against
which one can compare within-subjects effect sizes. One example is the average
annual gain on standardized math scores, which ranges from
Variation in Scaffolding’s Effect Among Learner Populations and Education Levels
Of note is that the within-subjects effect size was highest (
One possible reason scaffolding was so effective among students with special needs
is that one-to-one scaffolding has a long history in teaching students with learning
disabilities (Palincsar,
1998; Stone,
1998). One way is through one-to-one support provided to mainstreamed
students with special needs by teaching assistants (Radford, Bosanquet, Webster, &
Blatchford, 2015). For example, teaching assistants may model and
prompt the use of effective strategies (Radford et al., 2015). A key reason it
has been advocated is its incorporation of dynamic assessment, which is
considered of utmost importance in special education given the wide range of
challenges and abilities that one can find among students with special needs
(Tiekstra, Minnaert,
& Hessels, 2016). While much computer-based scaffolding does not
incorporate dynamic assessment, scaffolding embedded in intelligent tutoring
systems does. The single coded study on scaffolding among special education
students (Xin et al.,
2017) was of an intelligent tutoring system in mathematics, which was
used by elementary learners with a range of special needs, including learning
disabilities, attention deficit hyperactivity disorder, and mild intellectual
disabilities. In this way, vis-a-vis students with learning disabilities, the
posterior distribution empirically approximated through MCMC sampling represents
scaffolding embedded in intelligent tutoring systems. A meta-analysis showed
that the effect size for dynamic assessment among students with special needs
was highest (
Problem-centered instruction has been implemented among students with varying cognitive and learning disabilities with success (Belland, Ertmer, & Simons, 2006; Belland, Glazewski, & Ertmer, 2009; Bottge, 2001; Bottge, Heinrichs, & Mehta, 2002). However, such efforts are not widespread, in part because direct instruction has long been considered to be highly efficacious among special education students (Datchuk, 2016; Gersten, 1985; White, 1988). Still, what is meant by direct instruction within special education differs from the model of an hour-long lecture. Rather, it refers to short, bite-sized instruction delivered in a rapid manner, with a goal of achieving mastery for all (Gersten, 1985; White, 1988). In this way, it is grounded in an idea of needing to maintain high expectations for students with special needs, which is also the rationale for using a scaffolding approach (Lutz, Guthrie, & Davis, 2006). A fundamental assumption in direct instruction is that it is best to minimize struggle/unsuccessful practice such that students learn as rapidly as possible. This is an assumption shared by developers of intelligent tutoring systems, many of which are based in the Adaptive Control of Thought–Rational (ACT-R) learning theory (Anderson, Matessa, & Lebiere, 1997). Thus, it is understandable that intelligent tutoring systems would be highly effective among students with special needs. At the same time, intelligent tutoring systems and direct instruction are not one and the same. Still, even for the staunchest of direct instruction advocates, a pre–post effect size of over 3 is hard to ignore. The magnitude of the effect can be determined with more clarity with the coding of more primary research, which would allow for a more accurate approximation of the population parameter through MCMC sampling and more robust indirect comparisons with scaffolding used among other education populations (Salanti, Higgins, Ades, & Ioannidis, 2008).
Interplay Between Highest Rankings Among College/Graduate-Level Learners and Elementary Learners With Special Needs
That the highest effect size estimates were among college- and graduate-level
learners is similar to the finding of our traditional meta-analysis of
computer-based scaffolding (Belland et al., 2017). It is no surprise that scaffolding is used in
college and graduate-level populations, in that the promotion of skills like
problem solving is critical at those levels (Jonassen, 2011). It is intriguing that
an even greater effect can be found among third- and fourth-grade learners with
special needs (
Scaffolding and STEM Discipline
Effect size estimates were highest in mathematics and technology. This contrasts with our traditional meta-analysis of scaffolding, which found no significant difference in effect size estimate based on STEM discipline (Belland et al., 2017). That the effect size is highest in mathematics is not surprising, since much work on intelligent tutoring systems is done in mathematics (Steenbergen-Hu & Cooper, 2013; VanLehn, 2011) and it has long benefitted from more synthesis of research results and systematic refinement (Murray, 1999; Steenbergen-Hu & Cooper, 2013, 2014; VanLehn, 2011) than other scaffolding types. Of note, many intelligent tutoring systems are grounded in ACT-R, according to which the goal of instruction is to present knowledge to students and give them practice applying such knowledge to problems, such that production rules for applying the knowledge are generated (Anderson et al., 1997). Such an approach fits well with the traditional approach to mathematics curricula in the United States, where even textbooks supposedly aligned with the Common Core focus largely on procedures and declarative knowledge (Polikoff, 2015). For example, in traditional algebra curricula, the focus is on helping students solve for variables, rather than framing variables as tools to characterize relationships (Nie, Cai, & Moyer, 2009). While traditional approaches to mathematics instruction are not the same as those of ACT-R-informed intelligent tutoring systems, production rules are similar to procedures, and so the foundations of the two approaches are in alignment.
Most studies in the technology category were from computer science. Scaffolding’s strength in producing within-subjects gains is likely to be of great interest to those involved in the computer science for all initiative (K-12 Computer Science Framework Steering Committee, 2016; Obama, 2016). At the same time, it is not clear why pre–post effect sizes of scaffolding would be higher in computer science than in engineering and science. Further research is needed.
Implications for Meta-Analysis
Traditional meta-analysis has a long history in education research (Glass, 1976), where it has allowed researchers to take a step back from the body of research studies on a topic and read a (relatively) unbiased account of what the literature says. But there is bias in any literature review, meta-analysis or otherwise, which stems from factors like publication bias, how researchers frame the literature, inclusion criteria, choice of moderators, and unequal sample sizes. Following a Bayesian network meta-analysis approach does not mean that one avoids bias. Rather, it mitigates some biases but also introduces new biases. For example, its inclusion criteria allow one to synthesize results of early stage research for which the samples are too small to warrant a control group or the constructs are not narrowed down enough to allow for fine-tuned control of variables (Courgeau, 2012; Sutton & Abrams, 2001). Much scaffolding research is done in real-world settings, and does not benefit from a finely controlled study design. Such research is useful, but would be missed in a traditional meta-analysis. Most studies (70%) included in this Bayesian network meta-analysis were not covered in our traditional meta-analysis of computer-based scaffolding in STEM education (Belland et al., 2017).
Changing the prior distribution can lead to big changes in the posterior distribution, which is the source of much contention between frequentists and Bayesians (Efron, 2013; Little, 2006). Fit statistics (e.g., deviance information criterion) can provide evidence that the most suitable prior distribution was selected. But this does not sweep away the contention that arises from prior distributions. Still, using a noninformative prior distribution for which fit statistics are best may reduce bias in that the coding sample drives the approximation of the posterior distribution more than does the prior distribution (Jansen et al., 2008).
Bayesian network meta-analysis gained traction in pharmaceutical research in
large part because it enhances decision-making by ranking all available
treatments and determining the probability that each is the best (Jansen et al., 2011;
Salanti, 2012).
In this way, one could see with relative confidence which medication to treat
condition
Having more symmetrical treatment networks filled with more multiple treatment studies may also help address the issue of nesting of participants within classrooms, within schools, and within school districts, which often arises in education research. If the needed data are in the included research reports, one can use a hierarchical approach to meta-analysis, which accounts for nesting. This was done in a between-subjects meta-analysis of problem-based learning in medical education (Kalaian, Mullan, & Kasim, 1999), allowing the authors to find that students in medical schools with more experience with problem-based learning had higher medical content knowledge than students in medical schools that have less experience with problem-based learning. Hierarchical approaches to Bayesian network meta-analysis have been used in medical research (Stettler et al., 2007). While using a hierarchical Bayesian network meta-analysis could be advantageous in education research, few studies that we coded contained the needed classroom-level, school-level, and district-level data. Furthermore, one needs to have sufficient degrees of freedom across analyses to meaningfully detect intraclass correlation. To use hierarchical Bayesian network meta-analysis in education research, educational researchers need to include classroom-, school-, and district-level data on variables such as teacher experience, SES, and state standardized test scores. In this way, data would be available for future researchers to conduct hierarchical Bayesian network meta-analysis, which in turn may lead to new and more valid conclusions.
That more studies adopt within-subject designs is important not only for Bayesian network meta-analysis but also for social justice: compared with between-subject designs, within-subject designs may indicate better the extent to which members of marginalized populations (e.g., students from minority and low-SES backgrounds) benefit from scaffolding (McNeish & Dumas, 2017). Such students often score low on a single time point assessment, but this does not illustrate their full capacity for learning (McNeish & Dumas, 2017). Scores taken at multiple time points can highlight areas of strength and growth of marginalized students, especially when one compares the trajectory and magnitude of growth among different student populations (McNeish & Dumas, 2017). Systematic synthesis of within-subject effects allows one to see which interventions hold promise for which populations, which in turn has the potential to enhance social justice. For example, despite the inclusion of 144 studies in our traditional meta-analysis (Belland et al., 2017), effect size estimates among education populations (high-performing, low-income, traditional, underperforming, and underrepresented) were indistinguishable statistically, except that the effect of scaffolding was greater among traditional students than among underperforming students. In contrast, the current study indicated that scaffolding has the greatest promise among special education and ELL students. Scaffolding also produced strong pre–post gains among underperforming and underrepresented students. Knowing that scaffolding is helpful across a wide range of educational populations is important, but it is equally important to understand the within-subjects growth that one might expect among members of different populations. Thus, we urge scaffolding researchers to adopt within-subject designs, especially when studying marginalized populations.
This article introduces an approach (Bayesian network meta-analysis) with which educational researchers can synthesize within-subject effects. When the goal is to model growth due to an intervention, it will accomplish synthesis goals more effectively than traditional meta-analysis. When the goal is to model a comparison between an intervention and control, traditional meta-analysis or Bayesian traditional meta-analysis would fit best.
Limitations and Suggestions for Future Research
A Bayesian approach to network meta-analysis was adopted because in traditional network meta-analysis, each included study needs to incorporate a control condition, which would have required us to exclude many multiple treatment studies. Through MCMC simulations informed by the prior distribution and current coding, we could strengthen predictions for the effect size for each context of scaffolding use. However, by taking this approach, the effect size estimates are empirical approximations of population parameters, which depend on use of the best possible prior distribution. Any change in prior distribution could produce different results. We verified the appropriateness of our prior distribution through deviance information criterion statistics. Furthermore, the strictness of the inclusion criteria and the fact that the majority of included studies were not included in our traditional meta-analysis (Belland et al., 2017) could mean that the nature of included scaffolding interventions was strikingly different. This may not be the case since the same operational definition of scaffolding was applied in both meta-analyses. In a follow-up to this study, we plan to (a) use the results of this article as an informative prior distribution and (b) code new studies not included in this meta-analysis. This may result in tighter credible intervals and more accurate effect size estimates.
No meta-analysis covers qualitative results, and all meta-analyses exclude some quantitative research. Conclusions of any meta-analysis are limited in these ways. For example, based on this study and our previous meta-analysis (Belland et al., 2017), college- and graduate-level education appear to be the most promising contexts for scaffolding. It is possible that when synthesizing all empirical research (including quantitative research that did not meet the inclusion criteria and qualitative research) on the topic, one would reach a different conclusion.
Authors of the included studies chose the content covered in and when to administer the pre- and posttest. Choosing alternative test content or test administration time points could have led to different pre–post effect sizes. As such, the pre–post effect sizes reported in this article are an imperfect measure of the cognitive growth resulting from computer-based scaffolding.
The number of outcomes at some levels of coding categories was small. This could have led to large fluctuation of simulated effects, which in turn could have led to wide credible intervals. But when we checked the trace plot, there was no large fluctuation. Another possible reason is real inconsistency in computer-based scaffolding findings. Further research is needed.
Using the Sugrue (1995) framework for coding of assessment level may have led us to not fully capture the range of outcomes that are targeted by scaffolding, including conceptual change, particularly when helping students overcome misconceptions. Modifying the Sugrue (1995) framework may help more fully reflect outcome types targeted with scaffolding.
Finally, it is possible that our search terms did not uncover all relevant studies because some interventions may share essential characteristics with scaffolding but their name does not contain any of our search terms. We asked advisory board members (representing biology, chemistry, physics, engineering, technology, mathematics, cognitive science, learning sciences, and meta-analysis) for input on search terms. Authors of future Bayesian network meta-analyses of scaffolding research would be wise to carefully consider search terms.
Conclusion
Computer-based scaffolding is highly effective at improving cognitive learning
from pretest to posttest; this strength is largely consistent across measurement
levels, education populations, and STEM disciplines. Scaffolding led to a
pre–post gain of at least 1 SD among university-level students, graduate-level
students, and students with learning disabilities, and when used in the context
of (a) project-based learning, (b) technology, and (c) mathematics. These are
quite large effect sizes, which indicates that scaffolding’s effect is strong in
these contexts and warrants further exploration. Furthermore, effect size
estimates were at least 0.74 across concept-, principles-, and application-level
assessment. Scaffolding’s consistent effect informs teachers that using
problem-centered approaches does not preclude strong concept learning, which is
often the focus of state standardized tests and, by consequence, teacher
evaluation (Harman et al.,
2016; Price,
2016). The within-subjects effect size at the concept level was
The notably large effect size (
Also intriguing was that scaffolding’s effect was strongest among college, graduate, and early elementary special education learners—the farthest and closest to the population for which scaffolding was originally proposed (Wood et al., 1976). Possible explanations include that dynamic assessment is a known strong intervention for special education students (Swanson & Deshler, 2003; Swanson & Lussier, 2001) and college and graduate students potentially exhibit greater intersubjectivity when they address problems related to their major.
Scaffolding showed its largest within-subject effects in contexts (i.e., college and graduate) far removed from its origins in early childhood education (Wood et al., 1976), which is consistent with our earlier traditional meta-analysis (Belland et al., 2017). Also, scaffolding has strong effects among special education students, ELLs, and students who are otherwise underrepresented, and when used with diverse problem-centered instructional models. This implies that scaffolding is a robust and versatile model.
This article also introduces Bayesian network meta-analysis to education research (Bhatnagar et al., 2014; Jansen et al., 2011; Salanti, 2012). But for Bayesian network meta-analysis to be of maximum utility in education research, there is a need for more multiple treatment studies to enhance researchers’ ability to (a) strengthen comparisons (Salanti et al., 2014) and (b) use a hierarchical approach to Bayesian network meta-analysis so as to address nesting (Stettler et al., 2007). This may also help researchers get a better sense of the extent to which a treatment helped members of marginalized populations (McNeish & Dumas, 2017).
Authors
BRIAN R. BELLAND is an associate professor in the Department of Instructional
Technology and Learning Sciences, Utah State University, 2830 Old Main Hill, Logan,
UT 84322, USA; email:
ANDREW E. WALKER is an associate professor in the Department of Instructional
Technology and Learning Sciences, Utah State University, 2830 Old Main Hill, Logan,
UT 84322, USA; email:
NAM JU KIM earned his PhD in the Department of Instructional Technology and Learning
Sciences, Utah State University, 2830 Old Main Hill, Logan, UT 84322, USA and is
beginning as an assistant professor of applied learning sciences at the University
of Miami in Fall 2017; email:
