Abstract
Computational thinking (CT), an essential 21st century skill, incorporates key computer science concepts such as abstraction, algorithms, and debugging. Debugging is particularly underrepresented in the CT training literature. This multi-level meta-analysis focused on debugging as a core CT skill and investigated the effects of various debugging interventions. Moderator analyses revealed which interventions were effective, in which situations, and for what kind of learner. A significant overall mean effect of debugging interventions (g = 0.64) was found.
Computational thinking (CT) is an analytical approach that incorporates the key concepts of computer science; it is considered by some to be an essential 21st century skill (Wing, 2006). Shute et al. (2017) defined CT as the collective knowledge and skills needed to solve problems effectively and efficiently by designing algorithms that are applicable to different contexts. Computational thinking consists of various skills, such as the ability to construct algorithms, abstract, and debug (Shute et al., 2017). Among these, debugging is an often overlooked and underrepresented skill in CT research and training (Gao & Hew, 2023; Liu et al., 2017; Wong & Jiang, 2018).
Debugging - the process of locating and fixing errors in code - is a core competency of computational thinking and an essential skill for computer programmers to master (e.g., Brennan & Resnick, 2012; Kong & Wang, 2023; Shute et al., 2017). Researchers have emphasized training and improving debugging skills as a key CT practice (Kong & Wang, 2023; Kutay & Oner, 2022; Zeng et al., 2023). Programs rarely work the first time they are run, and even experts claim to spend more than half of their time debugging code (Contreras-Rojas et al., 2019; Zeller, 2009). Moreover, debugging is a multifaceted skill, requiring sophisticated domain, systems, procedural, strategic, and experiential knowledge (Jonassen & Hung, 2006). To be successful, a programmer must balance dual demands: maintaining a high-level view of code structure and abstractions while also focusing flexibly to locate and fix individual bugs. Given its complexity, debugging is especially challenging for novices (Lee & Wu, 1999; McCauley et al., 2008). Although findings suggest that beginners do employ rudimentary debugging strategies, such as trial and error, these strategies are often limited and prove ineffective and inefficient (Lee & Wu, 1999; Li et al., 2019). For example, rather than systematically exploring the problem space as experts do, novices jump to fixing bugs without first taking time to understand the overall program (Fitzgerald et al., 2008). Moreover, beginners find it challenging to execute the procedural aspects of debugging, such as conducting thorough code tests, using debuggers, and interpreting error messages. Because these skills are central to bug localization, novices struggle to identify the causes of errors, resulting in a process marked by frustration and stress (Whalley et al., 2021).
The complexity of debugging makes it challenging not only to learn, but also to teach. We still lack guidelines for teaching debugging skills (Whalley et al., 2021). Because debugging involves a diverse set of nuanced subskills, such as program comprehension and hypothesis testing, significant instructional time is required to cultivate mastery in each of its component skills. Bugs also vary in context and form, so instructors must personalize support for the specific bugs used in instruction while also teaching general debugging principles. In practice, computer-science courses mostly focus on programming content knowledge and syntax, with little time allocated to debugging practice (Lee & Wu, 1999).
A first step towards providing guidelines for instructors and helping researchers to better facilitate debugging is to synthesize the existing literature on debugging interventions. In this meta-analysis we analyze studies and report on their findings regarding the efficacy of such interventions. We also conduct moderator analyses of eight factors (intervention type, programming medium, control-group activities, type of outcome, school level, use of randomization, and the publication year and type of the source documents) in order to identify factors that influence the effect sizes of the interventions. The goal of this work is to synthesize existing research on debugging interventions to feature effective approaches and highlight opportunities for future research.
Our research questions are: (1) Are interventions for increasing debugging skills effective on average at raising scores on measures of debugging? (2) How variable are the effect sizes for interventions for increasing debugging skills, and what portion of effects, if any, shows detrimental effects? (3) Do characteristics of the interventions and learners, or features of the studies themselves, relate to the effectiveness of debugging interventions?
Literature Review
Interventions for Fostering Computational Thinking
Due to the interconnectedness between CT and computer science, much of the literature has focused on improving CT via teaching programming or coding knowledge (e.g., Angeli & Valanides, 2020; Berland & Wilensky, 2015; Papert, 1980). Papert (1980) proposed to use constructionism as a framework for learning CT. He built LOGO, which provided a microworld in which students construct programs and improve their own understandings of numerical and computational concepts. This effort inspired the creation of other CT learning environments, such as Scratch (MIT, https://scratch.mit.edu). Scratch is a visual programming tool where students can create their own artifacts (e.g., interactive stories, animations, and games) by dragging and dropping blocks of code to write programs. Such visual programming tools are frequently used to improve CT (Hsu et al., 2018). Studies have shown consistent evidence that designing artifacts using Scratch can foster CT skills among preschoolers (e.g., Bers, 2018; Falloon, 2016; Sung et al., 2017), elementary students (e.g., Jun et al., 2017; Sáez-López et al., 2016; Zhang & Nouri, 2019), middle-school students (e.g., Grover et al., 2015; Kutay & Oner, 2022), and even college students (e.g., Cetin, 2016). Grover et al. (2015) provided evidence that students can transfer their programming skills from Scratch to text-based programming environments. Moreover, educators can use Scratch to teach subject-matter content by creating simulations and games. For example, Garneli and Chorianopoulos (2018) taught students science knowledge and CT skills via Scratch.
Other approaches to teaching CT include incorporating robotics and digital games. Robotics is generally used to help young students learn CT. Studies have investigated how pre-elementary students interact with educational robotics, and provide evidence that preschoolers can acquire essential CT skills such as debugging and sequencing (Angeli & Valanides, 2020; Bers et al., 2014). Robotics can also support middle-school and high-school students in learning CT skills including abstraction, generalization, and algorithm development (Atmatzidou & Demetriadis, 2016; Berland & Wilensky, 2015). Fun digital games can also maintain student attention and interest in CT development, and provide support when needed. Various games have been developed to foster student CT skills, including RoboBuilder (Weintrop & Wilensky, 2012), Penguin Go (Zhao & Shute, 2019), and aMazeD (Tikva & Tambouris, 2023). Research has found that using digital games can help students develop targeted CT skills, such as problem decomposition, algorithmic thinking, abstraction, and debugging.
Assessment of Computational-Thinking Skills
A key issue in the CT literature relates to the assessment of CT across studies. The assessments used have largely been based on researcher-developed rubrics or tests of programming knowledge and skills. Few of these scales have been validated as measures of CT. Often their reliabilities are not reported.
Home-grown tests generally target the domain knowledge associated with the intervention used to encourage CT (e.g., Chen et al., 2017; Grover et al., 2017). However, few CT tests have undergone psychometric evaluation. One exception is the scale of Korkmaz et al. (2017). The authors characterized CT as using five associated skills: creativity, algorithmic thinking, cooperation, critical thinking, and problem solving. They reviewed the literature on scales assessing these five skills and selected 29 items to be included in their CT scale. They initially administered 29 five-point Likert-scale items (with responses from “never” to “always”) to a sample of undergraduates. The scale was further tested among secondary students (Korucu et al., 2017).
Another exception is the CT assessment developed by Román-González et al. (2017). The scale includes 28 multiple-choice items targeting knowledge of sequences, loops-repeat times, loops-repeat until, if-simple conditional, if-else-complex conditional, while conditional, and simple functions. These researchers have conducted various empirical studies to validate their test, especially among middle-school and high-school students. The downsides of their test pertain to its emphasis on programming concepts and the exclusion of an important CT skill – debugging. Debugging is an integral CT skill, invoked when a solution plan does not work as expected; however, both CT interventions and assessments have often failed to include debugging as a component of CT.
Interventions for Fostering Debugging Skills
Debugging interventions can generally be categorized into two types: those that teach debugging and those that support debugging. Ardimento et al. (2019) have argued that these are “inherently different tasks.” Debugging instruction explicitly introduces students to the skills and strategies of debugging. This commonly takes place in the form of online or offline curriculum-based approaches, using a series of scaffolded steps or by examining pre-defined bugs. On the other hand, some tools support students as they encounter bugs in their own programs. These tools are typically embedded within an integrated development environment (IDE) for coding. Traditional debuggers, visualization tools, hint-generation systems, enhanced error messages, and intelligent tutoring systems most often fall into this category. These two approaches are not always mutually exclusive, as some tools that support debugging have been shown to improve general debugging skills as well (Ardimento et al., 2019; Price et al., 2020).
Based on the observation that debugging skills are rarely explicitly taught in computer science (CS) classes, some researchers have explored the effectiveness of improving students’ skills through direct instruction. Such interventions focus on systematically teaching the debugging process (e.g., Chou, 2020; Wong & Jiang, 2018). For example, Michaeli and Romeike (2019) developed and taught a procedure for debugging that emphasized repeated hypothesis formation and experimentation for different error types: compile-time, runtime, and logical. They presented this procedure on a poster to tenth-grade students and discussed each step using an example. Learning this procedure improved both self-efficacy expectations and debugging performance in the experimental group as compared to the control group.
While the previously mentioned intervention was conducted during in-person training sessions with paper materials (a poster), online learning tools support other promising approaches to systematically teaching debugging. Ladebug (https://github.com/EmandM/ladebug), an online tool that allows students to step through scaffolded pre-defined bugs, garnered initial positive feedback from introductory programming students (Simon et al., 2019). Similarly, online games designed to foster debugging skills have also shown preliminary positive effects (Deitz & Buy, 2016; Liu et al., 2017), though these authors did not conduct comparative experiments as part of their investigations.
A separate yet related category of interventions aims to support students when they encounter bugs while writing their own programs. A major obstacle for beginning programmers is deciphering messages about errors that prevent code compilation. Denny et al. (2014) note that compiler messages are notoriously difficult for novices to understand and can hamper their progress while increasing frustration. Based on this observation, an active area of research is the enhancement of error messages to make them easier to read, more encouraging, and more helpful. In a similar vein, researchers have also explored the integration of hints that guide students in locating bugs and implementing solutions in their code. For example, Price et al. (2020) explored the efficacy of embedding data-driven hints in student code. These hints annotated which lines should be deleted, replaced, or inserted and provided further explanations and examples. Thus far, enhanced error messages and hint-generation approaches have shown mixed effects on improving student debugging ability (e.g., Becker et al., 2019; Price et al., 2020), suggesting that the design of these messages heavily influences their utility.
Although syntax bugs are a major obstacle for students, they are typically easier to fix compared to semantic bugs, which compile but produce incorrect output (Fitzgerald et al., 2008). Traditional debuggers, which allow students to set breakpoints and systematically check variable values at each step, assist in locating semantic bugs. However, these tools have a steep learning curve (Becker, 2019) and often do not address the high-level struggles novices encounter while debugging, such as hypothesis formation and problem decomposition. Attempts at enhancing traditional debuggers to cater to novices assume a variety of forms. Ko and Myers (2004) developed the Whyline (https://www.cs.cmu.edu/~NatProg/whyline.html), an interrogative debugging tool that visualizes users’ “why” questions about their code. Similarly, many intelligent-tutoring approaches to debugging gradually scaffold the debugging process by encouraging students to ask questions and develop hypotheses (Lane & VanLehn, 2005). General features observed in these enhanced debuggers are the use of visualizations to reduce the cognitive load of code processing and provisions for user input to guide the debugging process.
Assessment of Debugging Skills
In addition to the diverse interventions designed to address debugging, the assessment methods that measure the impact of these interventions vary greatly. One common evaluation approach is based on the correctness of student code. This includes measures such as qualitative or quantitative ratings of code quality and number of errors generated (e.g., Chung & Hsiao, 2020; Denny et al., 2014). Additionally, researchers have used various measures of debugging skills, such as number of attempts to create correct code, number of lines of code edited (e.g., Liu et al., 2017; Price et al., 2020), and time taken to solve the bug (e.g., Ko & Myers, 2004; Lin et al., 2017; White, 2009). These evaluation metrics are most commonly assessed on pre-defined debugging tasks within controlled environments (e.g., Ko & Myers, 2004; Lane & VanLehn, 2005), or on student-generated code in an ecologically valid environment, such as an introductory CS course (Becker et al., 2016; Price et al., 2020). There is little consensus, though, on validated measures of debugging ability, making generalizations across the field potentially difficult. The wide variety of assessment approaches further highlights the need for a meta-analysis of findings in the literature.
Previous Meta-Analyses on Computational Thinking and Debugging
Meta-analysis facilitates synthesis of findings from previous studies and assessment of their generalizability. CT researchers have examined a variety of interventions using diverse assessment methods, often with small samples. Meta-analysis allows for increased power via the accumulation of such smaller diverse studies (Cohn & Becker, 2003), and provides a means for assessing whether differences in effectiveness appear across the varied approaches and contexts that have been studied. Recent meta-analyses on CT have focused mainly on the impact of specific intervention types (e.g., collaboration, educational games, etc.) on general CT skills. For example, collaboration was associated with an overall positive effect on CT programming knowledge.
Meta-analyses have also examined component CT skills as outcomes. Sun et al. (2021) examined the impact of programming activities on K-12 students' CT skills and found an effect of 0.60 standard-deviation units (SE = 0.05). Li et al. (2022) also examined general CT skills following one of two types of intervention.
While a handful of in-depth reviews have examined specific interventions related to debugging such as enhanced error messages (Becker et al., 2019) or code visualizations (Egyed et al., 2003), a comparative overview of the many diverse approaches is lacking. McCauley et al. (2008) briefly addressed interventions in their review of the literature on debugging, but their analysis was limited to explicit debugging instruction. Additionally, Li et al. (2019) and Luxton-Reilly et al. (2018) presented cursory reviews of debugging tools, but their reviews do not include quantitative analyses of study results.
Our synthesis aims to investigate debugging as an essential but often neglected CT skill. Our quantitative meta-analysis is designed to identify which types of debugging interventions are effective, and for whom. We investigate our three research questions via overall analyses of effectiveness and variation in effects, and by examining eight moderators of intervention effects. Our results have implications for guidelines for debugging instruction, and for future research.
Methods
Study Selection
The databases used for our data collection included ProQuest (including its dissertation compendium), EBSCOhost, Web of Science, and PsycInfo. In each database, we used the same keywords and criteria for searching for sources. We first identified empirical studies appearing between January 1, 2002 and December 31, 2021 (the most recent two decades) with abstracts including the keywords: “computational thinking”, “computer science”, debug*, intervention, effect*, and experiment*. We restricted our search to sources in English. Our search included journal articles, dissertations, theses, conference papers, and proceedings. The initial search generated 234 papers to screen. We removed duplicate papers. In addition, we conducted snowball searches for additional papers by examining the references of previous meta-analyses on CT and the references of screened papers for other relevant studies. In response to reviewer comments, we updated our searches through additional snowballing and thus added three more studies to our original collection of fifteen.
We screened papers using the following inclusion and exclusion criteria:
1. The research outcome should be debugging skills.
2. The source should specify the instruments/evaluation tools used to test students’ debugging skills.
3. The source should report on a comparative study to test the effects of an intervention.
4. Research that described the development of an intervention without providing quantitative data on its efficacy was excluded.
5. The research report specifically described the participants and methods.
6. The report provided sufficient data for computing the effect size for the treatment versus control comparison.
After careful review, 18 papers (see Figure 1) were included in the final meta-analysis: two dissertations, eleven conference papers/proceedings, and five journal articles.
Figure 1. PRISMA chart of the study-selection process.
Coding Procedures
After the set of study features (i.e., sample size, test used, use of randomization, etc.) was identified, two coders independently coded the study features and effect sizes for each sample. The two coders went through three rounds of coding to establish inter-rater reliability. Two randomly selected studies were coded in the first round, and one study was randomly selected for coding training in each of the next two rounds. After each round of coding, the coders met to compare extracted data and resolve any differences until reaching 100% agreement on all 20 variables. In the first round, the two coders had complete agreement for one paper, and some disagreement for the other paper on three variables related to statistics, such as the values of debugging measures and what other types of statistics should be included; across all variables the percentage agreement was thus 92% (37 of 40 coded variables) for the first round. Inter-rater reliability, in the form of percentage agreement, for the two studies examined in the next two rounds was 95% (19 of 20 variables) and 100%, respectively. Therefore, inter-rater reliability averaged about 96% across the three rounds of training, high enough to proceed to independent coding. Afterwards, the two coders coded the remaining studies and obtained the data set for analysis. The final inter-rater agreement was 100% across all 18 papers included in the analysis.
Effect Sizes
For each source, we calculated standardized-mean-difference effect sizes for all possible group comparisons. None of the studies presented results for multiple sub-samples. When debugging was measured by several tests or tasks, an effect size was calculated for each test or task. The effect sizes were corrected for bias and within-study variances were computed, as described below.
Hedges (1981) derived the unbiased standardized mean difference, denoted as g, which adjusts the standardized mean difference d using the correction factor J(m) – a function of the sample sizes, with m = nT + nC − 2. In larger samples, the factor J is close to 1, and for smaller sample sizes it is lower, reflecting and correcting for small-sample bias (Hedges & Olkin, 1985). The formula to calculate Hedges’ g is

g = J(m) × d, where d = (X̄T − X̄C) / Spooled and J(m) ≈ 1 − 3/(4m − 1),

with X̄T and X̄C the treatment- and control-group means and Spooled the pooled within-group standard deviation.
Hedges’ g is referred to as the corrected or unbiased effect size compared to the value of Cohen’s d. The difference between Hedges’ g and Cohen’s d is very small for studies with samples of 19 or more, as J(m) ≥ .96 in that range (Hedges, 1981).
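For readers who wish to compute g themselves, the metafor package’s escalc() function implements this bias-corrected estimator via measure = "SMD". A minimal sketch with invented summary statistics (not data from any included study):

```r
library(metafor)

# Invented summary statistics for one treatment-control comparison
dat <- data.frame(m1i = 8.4, sd1i = 2.1, n1i = 25,   # treatment mean, SD, n
                  m2i = 7.1, sd2i = 2.3, n2i = 24)   # control mean, SD, n

# measure = "SMD" returns the bias-corrected (Hedges' g) effect size
dat <- escalc(measure = "SMD", m1i = m1i, sd1i = sd1i, n1i = n1i,
              m2i = m2i, sd2i = sd2i, n2i = n2i, data = dat)
dat$yi   # Hedges' g
dat$vi   # its within-study sampling variance
```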
Features Coded for Moderator Analysis
In addition to examining the overall effect of debugging interventions, we investigate how eight factors relate to the effects of interventions on student debugging skills. Four of the factors concern the nature of the intervention (i.e., intervention types, programming medium, control-group activities, and type of outcome), and four focus on participants, study design, and features of the source documents. We examine all eight study features for confounding or collinearity with other factors.
Intervention Types
As the literature review shows, debugging interventions have assumed different forms, as no consensus exists on how best to enhance debugging skill. Studies have explored a wide range of interventions, such as instructional/curriculum designs (e.g., Chou, 2020; Michaeli & Romeike, 2019), intelligent tutoring systems (e.g., Ko & Myers, 2004; Simon et al., 2019), and digital games (e.g., Deitz & Buy, 2016; Liu et al., 2017). Providing hints and enhanced error messages has shown inconsistent effects on debugging learning (e.g., Becker et al., 2019; Price et al., 2020). Given the large variation in intervention designs, we compared effect sizes across different types of interventions. By identifying effective interventions, we hope to inform future intervention designs.
We categorized the debugging interventions into five types. Enhanced debuggers are interactive systems developed to help programmers visualize their code as part of the debugging process, especially when searching for semantic bugs. Enhanced error messages and hints are messages carefully designed to help programmers comprehend error messages and locate and resolve bugs; unlike enhanced debuggers, these are not interactive. Systematic-instruction interventions include curricula developed to teach debugging skills in courses; these may or may not have digital components. Digital games and virtual-reality systems are the fourth type; these can be leveraged to improve students’ debugging skills, help them visualize code, and increase their motivation and interest. Lastly, collaborative programming may play a role in enhancing debugging abilities collectively through team interactions. However, data based on collaborative activities have inherent dependencies and do not cleanly parse out individual students’ competence levels.
Intervention duration was also coded, but only seven studies (38.9%) reported it in a clear and usable form.
Programming Medium
The programming medium is the channel through which students interact with coding activities. We examined three types of medium. Text-based programming involves programming languages, such as Java and Python, in which students must write syntactically correct code to create programs. A typical example of block-based programming is Scratch, where students drag and drop code blocks to build programs such as games and simulations. The last medium uses physical objects to program, as in studies of robotics (Bers et al., 2014; García-Valcárcel-Muñoz-Repiso & Caballero-Gonzales, 2019). The programming medium also influences the types of bugs students encounter; notably, syntax errors are minimized, or even absent by design, in block-based languages. If different programming media are associated with more or less effective interventions, we may gain insight into the most useful type of medium for training debugging skills.
Control-Group Activities
In some studies, the control group consisted of simple and regular programming instruction without any intervention or support for debugging. On the other hand, some control groups received a modified/lesser version of the treatment-group activities (e.g., paper version of computerized feedback), or a different but weaker approach aimed at training debugging. If one type of control-group activity leads to more learning of debugging than the other, effect sizes will be reduced for studies with that sort of control.
Type of Outcome
The literature contains a great diversity of assessments of debugging skill, and of the specific dimensions that constitute the construct. Based on the literature, we coded four types of outcome: time spent to debug or complete tasks, correctness of program code, number of bugs solved, and efficiency (i.e., number of compiles or edits taken to arrive at the correct solution). Two common measures of debugging skill are accuracy of code (e.g., Chung & Hsiao, 2020; Denny et al., 2014) and time spent on problem solving (e.g., Ko & Myers, 2004; Lin et al., 2017). Students are expected to spend less time debugging and completing tasks after training, and to make fewer edits or attempted corrections, compared with students who received less training or none; effects from studies using these types of measures were reverse coded so that positive effects represented better performance (e.g., less time spent) by treated samples. In the same vein, debugging interventions should help students to construct programs with higher levels of correctness and to resolve more bugs. Investigating effect-size differences between assessment methods can thus reveal the efficacy of interventions across measures of debugging success.
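In practice, this reverse coding is a simple sign flip applied before modeling. A minimal sketch in R, assuming a data frame dat with hypothetical columns yi (Hedges’ g) and outcome (the outcome-type label); the authors’ actual code is not available:

```r
# Flip signs for outcome types where lower raw values indicate better
# performance, so positive g always favors the treated group.
flip <- dat$outcome %in% c("time", "efficiency")
dat$yi[flip] <- -dat$yi[flip]
```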
School Level
Students who participated in studies of debugging interventions ranged from kindergarteners to graduate students. Debugging interventions are tailored to students based on their age or grade level, often by modifying the type of programming language students encounter. K-8 students are commonly introduced to programming through block-based languages as opposed to text-based languages, while the same is not necessarily true for college-aged students. This suggests potential confounding of programming medium and student age/school level. Additionally, Rich et al. (2019) suggest that younger learners rely more on basic debugging strategies and may have difficulty employing cognitively complex strategies such as strategically choosing between debugging techniques.
Prior knowledge and experience with programming may also matter, and this often comes with age. That said, some older students may still be programming novices whereas some younger students may be able to create advanced programs. Given the anticipated large range of grades and ages, we differentiate effect sizes among different grade or school levels.
Unfortunately, no studies explicitly measured participants’ level of coding experience, aside from a few that characterized the overall experience level of the sample (e.g., students in an introductory CS class). Thus, level of coding experience was not investigated.
Use of Randomization
In true experiments, participants are randomly assigned to control and experimental groups, for the purpose of minimizing biases and enhancing causal inferences (Creswell, 2009). However, as Lipsey (2003) noted, this design feature may be confounded with other study features, such as whether the intervention is a research or demonstration program versus one that has been put “into the field”; the latter may be more difficult to study using randomization. We include both true experiments and studies using nonrandom assignment. Thus, it is worth checking the effect-size difference between the two different assignment methods.
Publication Year
Year of publication is often examined as a moderator in meta-analyses. It can be used to detect trends over time, which reflect advances and improvements in the interventions used as the literature builds upon prior research, and as those developing interventions enhance their approaches.
Publication Type
Much of the literature on debugging has been shared via conference proceedings, such as those of the Association for Computing Machinery (ACM). Thus many of the papers in our review were from conference proceedings. Our meta-analysis also includes studies published in peer-reviewed journals and dissertations. The quality of research can vary greatly across peer-reviewed journals, conference proceedings, and dissertations, though we note these outlets are a poor proxy for true study quality. Also, prior meta-analyses have found differences between published and unpublished sources consistent with publication bias, or the censoring of weaker or nonsignificant findings (e.g., Mathur & VanderWeele, 2021). Thus, we will compare effect sizes of debugging interventions across these publication types.
Analyses
In total, 62 effect sizes from eighteen studies of interventions aiming to improve students’ debugging skills were included in the final meta-analysis. We used the R package “metafor” (Viechtbauer, 2010) to conduct the meta-analysis, and adopted random-effects models with restricted maximum likelihood (REML) estimation. All but four studies provided more than one effect size. The number of effect sizes ranged from 1 to 16 across the 18 studies, with a mean of about 3.4. As such, many subsets of effect sizes are dependent on each other. We therefore employed multi-level modeling techniques to handle the nested data structure resulting from the use of multiple outcome measures, using each study’s first author as the nesting variable. We examined the overall mean and variance of the effects and meta-regressions for each study feature. We also examined the data for the presence of publication bias using Egger’s test and the funnel plot with the trim-and-fill technique.
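A minimal sketch of the described model in metafor, assuming one row per effect size and hypothetical column names (yi, vi, study, es_id); this mirrors the analysis described above rather than reproducing the authors’ actual script:

```r
library(metafor)

# Three-level random-effects model with REML estimation:
# effect sizes (es_id) nested within studies (study = first author)
res <- rma.mv(yi, vi,
              random = ~ 1 | study/es_id,
              method = "REML", data = dat)
summary(res)   # overall mean effect and heterogeneity statistics
res$sigma2     # between-study and within-study variance components
```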
Results
Study Description
Table 1. Characteristics of Studies in the Meta-analysis.
Note. Counts of types of outcome are based on the total of 62 effect sizes.
Overall Analyses: Research Questions 1 and 2
Our first goal was to examine the overall effectiveness of the collected interventions, regardless of their features or the characteristics of participants. First, we examined the effect sizes to determine whether a random-effects model was appropriate, and if so to estimate the variance of the effects. Random-effects models assume a population of effect sizes, wherein each sample effect size may represent a unique participant population. Thus the collected effect sizes are expected to show a degree of true variability around the mean effect size. The I2 value, an index of between-studies heterogeneity as a proportion of total variability, is 81.54%, indicating that a high percentage of the variance comes from between-studies differences. The test of homogeneity is highly significant, with Q(61) = 203.93 (p < .0001). Together, these indices show that a random-effects model is appropriate for the analysis.
The forest plot of confidence intervals (Figure 2) for all effects confirms that the effect sizes are heterogeneous. Intervals are ordered by study author(s) and plotted using within-study standard errors. Some studies (e.g., Lin et al., 2017; Ko & Myers, 2004; White, 2009, outcome 3) reported very large effect sizes with large standard errors resulting from their small samples (ns of 16, 9, and 7, respectively). The plot also reveals that most studies provided more than one effect size. These represent effects for measures tapping different skill outcomes (e.g., Zhong & Li, 2020, with effects for time, correctness, and number of debugging tasks completed) or multiple measures of the same skill dimension, such as Denny et al.’s (2014) counts of different kinds of errors. Effect sizes appear very similar across the measures collected within most studies.
Figure 2. Forest plot of the effects of interventions on debugging skills.
Overall Mean Effect: Research Question 1
The overall mean effect size across all interventions from the multilevel analysis is shown as the center of the diamond at the bottom of the forest plot in Figure 2. Across all debugging-skills studies the mean is 0.64 standard-deviation units, which differs significantly from zero.
The forest plot reveals that most large effects are from smaller studies, as reflected in their wide confidence intervals. Thus any one large effect will not have an overly large influence on the mean, but the sheer number of large effects has contributed to the large mean effect.
Overall Variation: Research Question 2
The forest plot also shows that effect sizes vary widely around the mean effect, with eleven g values larger than one standard deviation in size, and three above two standard deviations. The between-studies variance for the full set of effects has two components. The between-studies (i.e., between-authors) variance is 0.361, and the variance among effects within studies is 0.002, which is negligible. Summing these, the variance among all effects is 0.363, leading to a plausible values interval for 95% of the population of effects ranging from −0.54 to 1.82. This interval is shown as a wider arrow just below the diamond for the CI in Figure 2.
As the lower end of the plausible values interval is negative, it is apparent that some effects come from populations in which the interventions had deleterious effects. If the population effects arose from a normal distribution, roughly 86% of effect sizes would be expected to be positive, and the remaining 14% negative. However, in our data 17 effects, or 27% of the 62 effect sizes, fell below zero, and a histogram showed the distribution of effects to be positively skewed (skewness coefficient = 1.47) with a median of 0.17. The mean effect appears to have been strongly influenced by the upper tail of the distribution.
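These quantities follow directly from the reported estimates; a quick arithmetic check in R using the values above:

```r
mu   <- 0.64              # overall mean effect (g)
tau2 <- 0.361 + 0.002     # between-study + within-study variance
mu + c(-1, 1) * 1.96 * sqrt(tau2)   # plausible values interval: about -0.54 to 1.82
pnorm(mu / sqrt(tau2))              # expected share of positive effects: about .86
```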
Publication Bias
Among the 18 included papers, 11 (61.1%) were peer-reviewed, high-quality conference papers (e.g., from the ACM), 5 (27.8%) were from academic journals, and 2 (11.1%) were dissertations. All of the papers were accessible. However, it is still important to check for potential publication bias in the studies of debugging interventions. Egger’s test for the funnel plot (Figure 3) shows notable asymmetry in the effects from debugging interventions (z = 6.20, p < .0001). The trim-and-fill technique, using the L0 estimator with the REML-estimated variances, added 14 potentially missing studies to the left side. The addition of these potentially missing studies leads to a greatly reduced estimate of the mean effect, 0.08 (SE = 0.08), which does not differ significantly from zero. However, this analysis is not hierarchical and thus does not account for dependence among effects.
Figure 3. Funnel plot for the 62 effect sizes from debugging interventions, with 14 added studies as suggested by trim and fill.
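These diagnostics can be reproduced with metafor’s univariate tools. As the text notes, the checks are not hierarchical, so a univariate model is refit first; a sketch under the same hypothetical column names:

```r
# Univariate random-effects fit used only for the bias diagnostics
res_uni <- rma(yi, vi, data = dat, method = "REML")
regtest(res_uni)                            # Egger's regression test for asymmetry
tf <- trimfill(res_uni, estimator = "L0")   # impute potentially missing studies
funnel(tf)                                  # funnel plot with filled-in studies
```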
Moderator Analyses: Research Question 3
CT researchers have designed different types of interventions and employed various measures to assess student debugging skills. Given the heterogeneous nature of the studies, with many intervention types, assessment tools, and research designs, we conducted moderator analyses to examine possible sources of the variation in the effectiveness of debugging training. Before proceeding we checked for evidence of confounding by examining cross-tabulations among the eight moderators. Several interesting patterns emerged.
Potential Confounding
Meta-analyses are especially subject to confounding of study features because in most cases the set of studies is not planned (Lipsey, 2003). Also, unlike participants in primary-study experiments, studies are not “assigned” to have specific features. Study features may be confounded by chance, or due to decisions common to a field of study. If several predictors that explain significant amounts of effect-size variation are confounded, we cannot assign a unique interpretation to any one predictor.
To examine the study features for confounding/collinearity, we computed correlations among quantitative variables and dichotomies; otherwise, we crosstabulated pairs of features. The eight features have 28 pairwise relationships, so to reduce the chance of detecting spurious associations we used a reduced alpha level of .0018. None of the six correlations among the two dichotomies (randomization and control type) plus year and school level reached significance at alpha = .0018. The largest was r = .39 (p = .114), between school level and control-group type. Also, the two dichotomies, control-group type and randomization, correlated at r = 0.11 (p = .653).
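A sketch of one such screen, with hypothetical column names; the reduced alpha comes from dividing .05 across the 28 pairwise tests:

```r
alpha <- .05 / 28                                   # = .0018
ct <- cor.test(dat$school_level, dat$control_type)  # one quantitative/dichotomous pair
ct$p.value < alpha                                  # retain only if it survives correction
```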
The other 22 relationships were examined using crosstabulations and chi-square tests. However, all tables had many empty cells because of the relatively small number of studies, and because most variables had the same value for all effects within a study. The exception was type of outcome, which varied within some studies. Only four of the 22 relationships were significant; we discuss these below where relevant to a specific moderator analysis.
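Each moderator analysis reported below can be cast as a meta-regression within the same multilevel model; a minimal sketch, again with a hypothetical moderator column:

```r
# Multilevel meta-regression on a categorical moderator; removing the
# intercept (- 1) yields one estimated mean effect per category.
res_mod <- rma.mv(yi, vi, mods = ~ factor(intervention_type) - 1,
                  random = ~ 1 | study/es_id,
                  method = "REML", data = dat)
res_mod   # one mean effect per intervention type
```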
Intervention Types
Table 2. Moderator Analysis of Different Debugging Interventions.

Figure 4. Forest plot of group means by type of intervention.
In contrast to the other intervention types, enhanced debuggers and systematic instruction had significant and large effects on students’ debugging skills; the mean for enhanced debuggers was close to one standard deviation in size.
Programming Medium
Table 3. Moderator Analysis of Programming Medium.
This single study of physical-object programming examined the only kindergarten sample, and our analyses of confounding confirm a significant association between programming medium and school level (χ2(8) = 26.37, p = .0009). This p value reaches significance by our stringent rules. Essentially, younger students are very unlikely to be able to use text-based programming. Some kindergartners cannot yet read, so researchers have quite limited options for the programming medium that can be used with the youngest coders. None of our studies of kindergarten or elementary samples used text-based programming. Because of this, we cannot fully disentangle the impact of coding with physical objects from the role of age without further studies that use physical-object coding with older participants.
Figure 5. Forest plot of group means by type of programming medium.
Control-Group Type
Table 4. Moderator Analysis of Control-Group Type.

Figure 6. Forest plot of group means by type of control group.
Outcome Type
Table 5. Moderator Analysis of Outcome Types.

Figure 7. Forest plot of group means by type of outcome.
Our analysis of confounding also revealed that type of outcome was significantly associated with randomization (χ2(3) = 17.9, p = .0005), year (χ2(27) = 82.08, p < .0001), and school level (χ2(12) = 34.93, p = .00048). Though the reason is unclear, efficiency of coding was measured only in randomized studies, and nonrandomized studies were more likely than randomized ones to track the number of bugs solved. The pattern of association between outcome type and year is not interpretable. Correctness was measured at all school levels except graduate study, but number of bugs solved was tapped only at the high-school and college levels; perhaps these choices relate to the cognitive levels possible for young children, or to the constraints of what can be measured given the programming media used with younger children. Last, certain outcomes are measured only or primarily for specific interventions. Efficiency was measured mainly in studies of enhanced error messages. In contrast, time spent was measured largely in studies of enhanced debuggers and systematic instruction, and correctness and number of bugs solved were used in at least four of the five types of interventions.
School Level
Table 6. Moderator Analysis of School Levels.

Figure 8. Forest plot of group means by school level.
Use of Randomization
Table 7. Moderator Analysis of Randomization.

Figure 9. Forest plot of group means by randomization.
Year of Publication
Studies in our sample were published between 2004 and 2021. Because only a few years saw the publication of more than one study, publication year is highly confounded with study authorship. The scatterplot of year versus effect size in Figure 10 shows no discernible pattern of effects. A linear model with the predictor year did not reach significance (b = −0.02, SE = 0.021, p = .321).
Figure 10. Scatterplot of year versus effect size.
Publication Type
Table 8. Moderator Analysis of Publication Types.

Figure 11. Forest plot of group means by type of publication.
Effect sizes did not differ on average among the three publication types (Q = 0.86, df = 2, p = .652), though dissertation studies showed a small, nonsignificant mean. The latter may reflect some degree of publication bias, wherein significant results and large effect sizes are more likely than weaker effects to be published in peer-reviewed journals (and here, conference proceedings).
Sensitivity Analysis
Three effect sizes in the dataset are larger than 2, which is unusual in studies of educational interventions. Such large effect sizes can unduly influence meta-analysis results. We therefore removed the three largest effect sizes and performed a sensitivity analysis of the overall results. With the remaining 59 effect sizes, the estimated overall mean effect was reduced.
Discussion
This meta-analysis investigated intervention effects on students’ debugging-skill development. We found a nascent field, but one where nearly all studies provided information on multiple measures of debugging skill. Our results provide compelling evidence that interventions can have a positive impact on students’ debugging skills, with a significant overall mean effect of 0.64 standard-deviation units.
Our analysis adopted multilevel-modeling techniques to account for the fact that multiple effects came from the same research groups, and included several moderator analyses to explore the impacts of intervention and participant characteristics on the effectiveness of interventions for debugging. First, we investigated which types of interventions were most effective in training debugging skills. Enhanced debuggers and systematic instruction showed the largest significant effects, with the mean for enhanced debuggers close to one standard deviation in size.
Second, we analyzed the intervention effects for various measures of debugging skills. We found significant effects of interventions for all four debugging measures. Thus, it is safe to conclude that on average, training for debugging improves students’ debugging skills in terms of all four types of debugging measures.
Third, debugging skills were significantly improved regardless of programming medium. A majority of studies implemented interventions in text-based languages, and the mean effect for text-based programming was significant.
Fourth, we categorized the control groups based on the strength of the activities they received. That is, some control groups received a weaker form of the focal intervention, whereas others received regular learning or instruction. The results confirmed the advantage of debugging interventions over both kinds of control; on average, weak interventions and regular instruction were equally inferior. This again showcases the promise of debugging interventions: treated students resolved bugs better than those who received either a weaker version of an intervention or none at all.
We further investigated whether intervention effectiveness varied by participant population. Debugging interventions were particularly effective for kindergarten and graduate students, but not for high-school populations. However, the kindergarten result rests on a single effect size, marking an area for more empirical investigation. Our findings suggest that interventions may be most effective for older students. We also note that different sorts of outcomes and different programming media were used across the school levels in our collection. This makes sense, as younger children are less likely to be able to code using text, or even to process text as part of the interventions and outcome assessments.
Next, we checked whether randomizing participants influenced the effects reported. Studies with and without randomization both had significant mean effects, though the mean effect size from randomized studies was more moderate. The difference due to this aspect of method was not significant; thus our findings only hint at the potential bias in non-experimental designs, which can in turn produce differences in apparent effectiveness. One further finding was that when we examined the between-studies variation separately for randomized and nonrandomized studies, we found almost 30% more variation among the nonrandomized studies.
We also checked the possible influence of publication bias on effect sizes. Our overall publication-bias analyses – the funnel plot, Egger’s test, and trim and fill – suggested asymmetry that was manifest in underreporting of small and/or negative effects. Also, while both journal articles and conference papers reported significant and large mean effect sizes (0.73 and 0.64, respectively), dissertations generally reported smaller and more dispersed results. However, caution is needed in interpreting these results because the bulk of our sources were conference papers appearing in proceedings outlets. Our analysis also indicates a potential publication bias, in that researchers have predominantly published their debugging research in conference proceedings related to computer science, rather than in traditional journals. This is likely due to the rapidly changing environment of computing and computer-science research.
Conclusions
This meta-analysis focused on debugging interventions in the context of fostering students’ computational thinking skills. The review included 18 sources, many with multiple effect sizes. Most of the various interventions supported debugging-skill development, but to different degrees. Some studies reported remarkably large effects, of over two standard-deviation units, whereas a notable portion of the effects were negative. We aim to direct more researchers’ attention to debugging skill, an often-neglected but integral area of computational-thinking skills, particularly for researchers in non-computer-science fields. The increased interest in computing-intensive fields such as artificial intelligence and data science prompts a dire need to assess and facilitate students’ debugging skills.
Future studies need to investigate more fully which practices best improve debugging abilities, for whom, and under what circumstances. Future research should examine how to design more effective enhanced error messages and hints, and more empirical evidence is needed on the effects of enhanced debuggers and digital games. Researchers should also investigate the effects of block-based programming in cultivating debugging skills, with the caution that text-based programming usually imposes high demands on syntactic accuracy, which may be problematic for non-CS students.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
