Abstract
In this meta-analysis, we systematically reviewed research on digital games and learning for K–16 students. We synthesized comparisons of game versus nongame conditions (i.e., media comparisons) and comparisons of augmented games versus standard game designs (i.e., value-added comparisons). We used random-effects meta-regression models with robust variance estimates to summarize overall effects and explore potential moderator effects. Results from media comparisons indicated that digital games significantly enhanced student learning relative to nongame conditions (
In 2006, the Federation of American Scientists (FAS) issued a widely publicized report stating that games as a medium offer powerful affordances for education. The report encouraged private and governmental support for expanded research into complex gaming environments for learning. A special issue of
In the current meta-analysis, we systematically reviewed research on digital games and learning for K–16 students in light of the recent NRC report on education for life and work in the 21st century (Pellegrino & Hilton, 2012). We synthesized comparisons of game conditions versus nongame conditions (i.e., media comparisons) as well as comparisons of augmented game designs versus equivalent standard game designs (i.e., value-added comparisons). Meta-regression models were used to assess the possible moderating effects of participant characteristics, game condition characteristics, and research quality characteristics.
Alignment With Recent Related Meta-Analyses
The current meta-analysis extends and refines the findings of three recent meta-analyses relevant to the impact of games on learning.
We first provide an overview of these three relevant meta-analyses to frame the relationships, contributions, and research questions of the current meta-analysis. The first meta-analysis, by Vogel et al. (2006), synthesized results from 32 studies from 1986 to 2003, focusing on pretest–posttest comparisons of cognitive and attitudinal outcomes in games and simulations for age-groups spanning preschool through adult. Vogel et al. described computer games and simulations as follows: A computer game is defined as such by the author, or inferred by the reader because the activity has goals, is interactive, and is rewarding (gives feedback). Interactive simulation activities must interact with the user by offering the options to choose or define parameters of the simulation then observe the newly created sequence rather than simply selecting a prerecorded simulation. (p. 231)
The synthesized studies compared games and simulations to traditional classroom teaching. Moderator variables included gender, learner control, type of activity (game vs. simulation), age, visual realism, and player grouping (individual vs. group).
Overall, Vogel et al. (2006) found that games and simulations led to higher cognitive outcomes (
The second meta-analysis, by Sitzmann (2011), synthesized results from 65 studies from 1976 to 2009, focusing on pretest–posttest comparisons of self-efficacy, declarative knowledge, procedural knowledge, and retention in simulation games for adult workforce trainees. Sitzmann defined simulation games as “instruction delivered via personal computer that immerses trainees in a decision-making exercise in an artificial environment in order to learn the consequences of their decisions” (p. 492). Comparison conditions in the synthesized studies ranged from no-training control conditions to alternative instructional method conditions. Theoretical moderator variables included entertainment value, whether the simulation game instruction was active or passive, whether or not trainees had unlimited access to the simulation game, whether the simulation game was the sole instructional method, and whether the instructional methods in the comparison group were active or passive. Methodological moderator variables included random assignment to experimental condition, rigor of the study design, publication status, and year of the publication/presentation.
Sitzmann (2011) found that self-efficacy was significantly higher (
The third meta-analysis, by Wouters, van Nimwegen, van Oostendorp, and van der Spek (2013), analyzed 39 studies from 1990 to 2012 focusing on pretest–posttest and posttest-only comparisons of knowledge, skills, retention, and motivation outcomes in serious games for a wide range of age-groups. Wouters et al. defined serious games as follows: We describe computer games in terms of being interactive (Prensky, 2001; Vogel et al., 2006), based on a set of agreed rules and constraints (Garris et al., 2002), and directed toward a clear goal that is often set by a challenge (Malone, 1981). In addition, games constantly provide feedback, either as a score or as changes in the game world, to enable players to monitor their progress toward the goal (Prensky, 2001). . . . In speaking of a serious (computer) game, we mean that the objective of the computer game is not to entertain the player, which would be an added value, but to use the entertaining quality for training, education, health, public policy, and strategic communication objectives (Zyda, 2005). (p. 250)
Comparison conditions in the synthesized studies included conventional instruction methods such as lectures, reading, drill and practice, or hypertext learning environments. Theoretical moderator variables included active versus passive instruction in comparison groups, presence of additional nongame instruction in game conditions, level of visual realism in game conditions, level of narrative in game conditions, number of training sessions, group size, instructional domain, and age. Methodological moderator variables included publication source, random assignment, and pretest–posttest versus posttest-only assessment.
Wouters et al. (2013) found that serious games were more effective than conventional instruction in terms of learning (
Core Hypotheses
Drawing on results from these prior meta-analyses, the present meta-analysis sought to extend and refine our understanding of the effects of digital games on learning outcomes for K–16 students. Methodologically, the current meta-analysis expanded on prior work by broadening the scope of the literature surveyed. Research on games for learning spans many fields. We thus selected databases spanning Engineering, Computer Science, Medicine, Natural Sciences, and Social Sciences in an effort to capture this breadth while focusing on research published between 2000 and 2012 in light of the dramatic evolution of digital games for learning over the past decade. Furthermore, the current meta-analysis provides a specific and distinct focus on (a) digital games, (b) K–16 students, and (c) cognitive, intrapersonal, and interpersonal learning outcomes. The current study therefore builds on the prior meta-analyses by expanding the scope of constituent studies while focusing on an overlapping but distinct cross section of the research literature with a tighter focus on games and learning by K–16 students (Table 1). Overall, based on the prior meta-analyses, we predicted that game conditions would be associated with better learning outcomes than nongame conditions in media comparisons (Core Hypothesis 1).
Characteristics of recent meta-analyses on games for learning: Overlapping but distinct lenses
Whereas prior meta-analyses have focused exclusively on comparisons of game conditions versus nongame control conditions—which Mayer (2011) calls media comparisons—the present study also focused on value-added comparisons. Value-added comparisons measure the efficacy of an enhanced version of a game, augmented to test a theoretical design proposition, relative to a standard version of that game (Mayer, 2011). Wouters et al. (2013) expressed the need for analyses of value-added studies in their discussion. The present study thus moved beyond a sole focus on media comparisons to also assess the contribution of design to learning. Although it might seem a matter of common sense that versions of a game that have been augmented to support learning should outperform standard versions of those games, the role of design has often been de-emphasized in debates over whether digital games are
Beyond these two core hypotheses, the present study analyzed the potential moderating effects of (a) general study characteristics, (b) game mechanics characteristics, (c) visual and narrative characteristics, and (d) research quality characteristics. These moderator analyses explored the relationships between design features and learning outcomes. The number of media comparisons that met the eligibility criteria (outlined in the Method section) was sufficient to support moderator analyses of general study characteristics, game mechanics characteristics, and visual and narrative characteristics. The number of value-added and media comparisons that met eligibility criteria was sufficient to support moderator analyses in terms of research quality characteristics. We elaborate on the moderator analyses and hypotheses in the following sections.
Moderator Analyses of General Study Characteristics
The present meta-analysis examined three general study characteristics as potential moderators of the effects of digital games on learning. Specifically, we examined game duration, presence of nongame instruction in game conditions, and player grouping. All these moderators were identified from prior meta-analyses on this topic.
With regard to duration of game play, Sitzmann (2011) found that media comparisons in which trainees had unlimited access to the game demonstrated significantly better learning outcomes than media comparisons in which the trainee had limited access to the game. Similarly, Wouters et al. (2013) found that (a) game conditions where participants interacted with the game for more than one session demonstrated significantly better outcomes relative to the nongame control conditions, but (b) game conditions where participants engaged with the game for only one session did not demonstrate significantly better outcomes relative to the nongame control conditions.
Whereas the Sitzmann (2011) comparisons emphasized additional time on task and increased learner control relative to the comparison groups, Wouters et al. (2013) focused on a combination of spaced versus massed learning (cf. McDaniel, Fadler, & Pashler, 2013) and the potential for greater incremental value of additional time in games compared to the incremental value of additional time in associated control conditions. As Wouters et al. (2013) explained, “It is plausible that, in comparison to that of conventional instruction methods, the effectiveness of serious games in terms of learning pays off only after multiple training sessions in which the players get used to the game” (p. 251). The studies synthesized in the current analysis involve primarily equivalent amounts of total time in experimental and control conditions, and thus our analyses align more closely with the relationship between experimental and control conditions in the Wouters et al. (2013) analyses. Based on these findings, we predicted that game conditions involving increased duration and number of game play sessions would be associated with better learning outcomes in media comparisons (Moderator Hypothesis 1a).
In terms of supplemental nongame instruction, two prior meta-analyses (Sitzmann, 2011; Wouters et al., 2013) found that comparisons where game conditions included supplemental nongame instruction demonstrated better learning outcomes (relative to nongame conditions) than comparisons where the game conditions did not include nongame instruction. Given the importance of verbalization for learning (Wouters, Paas, & van Merriënboer, 2008), and the effects of supplemental instruction on learning observed in prior meta-analyses, we predicted that game conditions that include nongame instruction would be associated with better learning outcomes than game conditions that do not include nongame instruction in media comparisons (Moderator Hypothesis 1b).
In terms of player group structures in game conditions, Vogel et al. (2006) found significant learning outcomes for single-player as well as for collaborative conditions relative to nongame conditions and reported a trend toward larger effect sizes with solitary players but did not report analyses comparing effect size magnitudes between the two player grouping structures. Based on this trend, and given the ambiguity in prior research on the benefits of collaborative play (e.g., Schwartz, 1995; van der Meij, Albers, & Leemkuil, 2011), Wouters et al. (2013) hypothesized that single-user play would outperform group play but found that learners who played serious games in a group learned more than learners who played alone. In the current meta-analysis, we therefore predicted that collaborative game conditions would be associated with better learning outcomes than single-player game conditions in media comparisons (Moderator Hypothesis 1c).
Moderator Analyses of Game Mechanics Characteristics
In addition to exploring general study characteristics, we explored game design mechanics as potential moderators of game effects on learning outcomes. Specifically, we explored broad sophistication of game mechanics (simple gamification of academic tasks vs. more elaborate game mechanics), variety of player actions (focused games like
Moderator Analyses of Visual and Narrative Game Characteristics
Results from prior meta-analyses examining the effects of digital games on learning have yielded inconsistent findings regarding the moderating effect of visual realism. We coded three distinct visual characteristics: visual realism, camera perspective, and anthropomorphism. Camera perspective was included because numerous reports have shown that individuals who play first-person perspective “shooter” games, but not other games, demonstrate improvement on certain visual cognitive tasks (e.g., Feng, Spence, & Pratt, 2007; Green & Bavelier, 2006, 2007). Anthropomorphism was included because of numerous findings suggesting that anthropomorphic attributes affect a range of perceptual, cognitive, and social tasks (e.g., Heider & Simmel, 1944; Killingsworth, Levin, & Saylor, 2011; Mahajan & Woodward, 2009).
In addition to including these visual characteristics, we examined the narrative characteristics of each game condition. Overarching research on learning has supported the inclusion of narrative context in the sense of situating and anchoring learning in context (e.g., Bransford, Brown, & Cocking, 2000; Bransford, Sherwood, Hasselbring, Kinzer, & Williams, 1990; Brown, Collins, & Duguid, 1989). Furthermore, the role of narrative in games for learning remains a central focus of the field (e.g., Dickey, 2006; Echeverria, Barrios, Nussbaum, Amestica, & Leclerc, 2012; Lim, 2008; Malone & Lepper, 1987).
Based on the findings of Wouters et al. (2013), however, we predicted that game conditions involving increased visual realism, anthropomorphism, camera perspective, story relevance, and story depth would be associated with smaller learning effects in media comparisons (Moderator Hypotheses 3a–3e). In addition, we predicted that game conditions involving increased overall contextualization would be associated with smaller learning effects in media comparisons (Moderator Hypothesis 3f).
Research Characteristics in Value-Added and Media Comparisons
Beyond study and game characteristics, we also explored whether research quality was associated with larger or smaller effects in the media comparisons and value-added comparisons. Prior meta-analyses have noted issues with the methodological quality of the primary studies in the games literature (Vogel et al., 2006) and have noted that the beneficial effects of serious games may be attenuated in studies with random assignment versus quasi-experimental designs (Wouters et al., 2013). Based on these prior findings, we predicted that comparison condition quality, sufficient condition reporting, sufficient reporting of methods and analyses, overalignment of assessment with game, assessment type, and study design would be associated with learning outcomes in value-added and media comparisons (Moderator Hypotheses 4a–4f).
In-Depth Exploration of Variability in the Effects of Games on Learning
To test the hypotheses outlined above, we explored variability in the effects of digital games on learning outcomes by employing a recently developed statistical technique for robust variance estimation (RVE) in meta-regression (Hedges, Tipton, & Johnson, 2010; Tipton, 2013). This technique permits the inclusion of multiple effect sizes from the same study sample within any given meta-analysis—a common occurrence in meta-analyses in the educational and social sciences (e.g., Tanner-Smith & Tipton, 2014; Tanner-Smith, Wilson, & Lipsey, 2013; Wilson, Tanner-Smith, Lipsey, Steinka-Fry, & Morrison, 2011). This approach avoids loss of information associated with dropping effect sizes (to ensure their statistical independence) and does not require information about the covariance structure of effect size estimates that would be necessary for the use of multivariate meta-analysis techniques (see Tanner-Smith & Tipton, 2014, for a discussion).
Method
Inclusion and Exclusion Criteria
Digital Game
Eligible studies were required to include at least one comparison of a digital game versus nongame condition or at least one comparison of an augmented game design versus equivalent standard game design (but these two types of comparisons were always analyzed separately). Studies were required to explicitly designate the environment as a game.
Hybrid augmented reality games that used digital platforms to create games in physical space were eligible, but physical games with no digital platform (e.g., board games) were excluded. Interventions that focused primarily on teaching youth to create or program games were not included in the present analyses because these approaches were considered distinct (and potentially more powerful) in light of their closer alignment with design-based learning (e.g., Kafai, 2006). In terms of recreational value, we do not imply joviality—games, like books and movies, can be serious or sad, communicating a powerful experience and message while still drawing people in willingly to play for the sake of play (cf. Young et al., 2012). In terms of simulations, although most games contain some form of simulation, the current meta-analysis includes only studies where (a) the digital environment in the study meets the eligibility criteria definition of a game outlined above and (b) the digital environment is explicitly referred to by the authors of that study as a game in the title or abstract. Thus, simulations that do not meet the game eligibility criteria outlined above are not included in the current meta-analysis.
Participants
Eligible participant samples included students in K–16, ages 6 to 25. Participants had to be students in a K–12 institution or enrolled in postsecondary school. Studies of participants beyond the K–16 grade range were not eligible. Studies focusing on samples from specific clinical populations of students (e.g., autism spectrum) were also excluded.
Research Designs
Because the focus of the meta-analysis was on making causal inferences regarding the effects of digital games on learning, only those studies using randomized controlled trial and controlled quasi-experimental research designs were eligible for inclusion.
Learning Outcomes
Eligible studies were required to measure information on at least one outcome related to learning aligned with the recent NRC report on Education for Life and Work (Pellegrino & Hilton, 2012). This report categorized learning into three broad domains: cognitive, intrapersonal, and interpersonal. The cognitive domain includes cognitive processes and strategies, knowledge, and creativity. The intrapersonal domain includes intellectual openness, work ethic and conscientiousness, and positive core self-evaluation. The interpersonal domain includes teamwork, collaboration, and leadership.
Publication Type
To reflect the current state of digital game design, eligible studies were required to have been published between January 2000 and September 2012 in a peer-reviewed journal article. Restricting eligibility to publications in peer-reviewed journals was selected to provide consistent sampling across the diverse fields and databases covered in the literature search as outlined in the Search Strategies section below. Nonetheless, to be sensitive to any biases this may have created in our study set, we conducted extensive sensitivity analyses to assess for the possibility of publication bias, as outlined below in the Data Analysis section.
Study Site and Language
Eligible studies were those published in English (but not necessarily conducted in English or in an English-speaking country).
Effect Sizes
Eligible studies were required to report sufficient information to calculate both pretest and posttest effect sizes on at least one measure of learning, and the variables involved in the effect sizes had to have a known direction of scoring. We use the term
Search Strategies
We wanted to maximize sensitivity in our search, that is, to locate all studies that might potentially meet the eligibility criteria. Our search criteria therefore simply specified that the term
Coding Procedures
Eligibility coding first occurred at the title level, where two research assistants independently screened all titles identified in the literature search to eliminate clearly ineligible reports (e.g., reports in non-English languages) or publications that reported on games that were clearly irrelevant for the current study (e.g., discussion of the Olympic Games or sports injuries). Eligibility coding next occurred at the abstract level. All research assistants were first trained on a randomly selected subset of 100 abstracts, which were discussed until 100% consensus was reached with the entire group. The remaining abstracts were screened independently by two research assistants, and any disagreements were resolved by one of the authors. If there was any ambiguity about potential eligibility based on the abstract, we erred on the side of inclusivity at this stage. The final stage of eligibility coding occurred at the full-text level, in which all reports previously identified as potentially eligible at the abstract level were screened for final eligibility. At least two research assistants conducted independent full-text screening of each article, and any questions about eligibility were resolved by consensus with one of the study authors. The reason for ineligibility was recorded for each study, using the criteria outlined above.
Studies that were deemed ineligible at the full-text level were not coded further. Studies identified as eligible at the full-text level progressed to full-study coding, in which two of the study authors coded all game and nongame condition characteristics while two research assistants independently extracted information about the studies, participants, research conditions, and effect sizes. Any discrepancies in the coding were discussed in person and resolved via consensus between coders and at least one of the study authors.
Variables and Effect Size Moderators
Data were extracted for the following study characteristics and used for descriptive purposes and/or examined as potential effect size moderators.
Study Characteristics
We coded publication year, attrition between pretest and posttest measurement points, whether the study used an experimental or controlled quasi-experimental research design, location of study, whether the study had poor reporting of statistical or game-related information, and the timing at which the posttest measurement occurred.
Participant Characteristics
We coded percentage of White/non-White participants, percentage of male participants, and average age of sample.
Condition Characteristics
We measured several general characteristics related to the focal game condition in each study: duration of game, total number of game sessions, number of days elapsed between first and last game session, number of URLs provided for the game, number of screenshots provided for the game, word count of the game description, and whether the game included additional nongame instruction. In terms of game design characteristics, we measured presence of additional nongame instruction to supplement the game, sophistication of game mechanics, variety of actions in which the player engaged, social structuring of players within the game, intrinsic/extrinsic nature of the integration of learning and game mechanics, nature of scaffolding, primary learning mechanic, visual realism, anthropomorphism, camera perspective, story relevance, and story depth. Nongame conditions were coded for comparison condition quality. Value-added comparisons were coded for the focal compared feature.
Outcome Characteristics
We coded whether the outcome was measured using an existing normed instrument, a modification of an existing instrument, or an author-developed instrument. We also coded assessments in terms of broad learning outcome domain, learning outcome discipline, and possible overalignment with the game condition.
Statistical Methods
Effect Size Metric
The outcomes of interest in the meta-analysis were measured with pretest–adjusted posttest standardized mean difference effect sizes. They were coded so that positive effect sizes represent better learning outcomes for the focal game condition of interest at the posttest follow-up time point. Pretest-adjusted posttest standardized mean difference effect sizes (
where the first term is the posttest standardized mean difference effect size and the second term is the pretest standardized mean difference effect size. For each term, the numerator is the difference in means for the focal game and comparison group (using posttest means in the first term and pretest means in the second term), and the denominator is the pooled standard deviation for the scores in those groups (using the pooled posttest standard deviation in the first term and the pooled pretest standard deviation in the second term). We used this effect size metric in an attempt to provide conservative estimates of digital game effects on learning, net of pretest differences between groups on learning measures. Using a simple unadjusted posttest effect size metric would have been inappropriate given the inclusion of studies using quasi-experimental research designs where participants were not randomized to conditions.
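As an illustration, the calculation described above can be sketched in code. The summary statistics below are hypothetical values chosen for demonstration only; they are not drawn from any constituent study.

```python
import math

def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def smd(m_game: float, m_comp: float, sd_pooled: float) -> float:
    """Standardized mean difference (game minus comparison)."""
    return (m_game - m_comp) / sd_pooled

# Hypothetical summary statistics (30 participants per group)
sd_post = pooled_sd(12.0, 30, 12.0, 30)  # pooled posttest SD -> 12.0
sd_pre = pooled_sd(10.0, 30, 10.0, 30)   # pooled pretest SD  -> 10.0

# Pretest-adjusted posttest SMD: posttest SMD minus pretest SMD
d = smd(78.0, 72.0, sd_post) - smd(50.0, 49.0, sd_pre)
print(round(d, 3))  # 0.5 - 0.1 = 0.4
```

The subtraction of the pretest term nets out baseline nonequivalence between groups, which is why the authors describe the metric as a conservative choice for quasi-experimental studies.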
All effect sizes were then adjusted with the small-sample correction factor to provide unbiased estimates of effect size (
where
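The exact correction formula is not reproduced above; the most common small-sample correction for standardized mean differences (Hedges' g, assumed here) multiplies the effect size by a factor based on the degrees of freedom:

```python
def hedges_correction(d: float, n1: int, n2: int) -> float:
    """Apply the standard small-sample correction factor
    J = 1 - 3 / (4 * df - 1), with df = n1 + n2 - 2,
    yielding an approximately unbiased effect size (Hedges' g).
    Assumes the conventional formula; the article's exact
    expression is not reproduced in the text above."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)
    return j * d

# With 20 participants per group, the correction shrinks d slightly
g = hedges_correction(0.40, 20, 20)
print(round(g, 4))
```

The correction matters most for small samples; as group sizes grow, the factor J approaches 1 and g converges to the uncorrected d.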
Some studies in the meta-analysis required cluster adjustments (Hedges, 2007) because assignment to game/comparison conditions was performed at the school or classroom level, but results were reported at the individual level and this clustering was not accounted for in the statistical analysis. Because none of the studies provided the intraclass correlations (ICCs) needed to make cluster adjustments, we made the following cluster adjustments to the standard errors of the effect sizes using a conservative estimate (based on Hedges & Hedberg, 2007) of the ICC at .20, such that
where
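One common way to implement such an adjustment (a sketch under the assumption that the standard design-effect formula was used; the article's exact expression is not reproduced above) inflates the standard error by the square root of the design effect 1 + (m − 1) × ICC, where m is the average cluster size:

```python
import math

def cluster_adjusted_se(se: float, avg_cluster_size: float, icc: float = 0.20) -> float:
    """Inflate an effect size's standard error for cluster-level assignment
    using the design effect 1 + (m - 1) * ICC (cf. Hedges, 2007).
    The ICC defaults to the conservative .20 used in the text."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return se * math.sqrt(design_effect)

# Hypothetical example: classrooms of 25 students assigned as clusters
se_adj = cluster_adjusted_se(se=0.15, avg_cluster_size=25)
print(round(se_adj, 4))
```

With an ICC of .20 and clusters of 25, the design effect is 5.8, so the adjusted standard error is roughly 2.4 times the naive one, substantially widening the confidence interval for cluster-assigned studies.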
Several studies provided multiple effect sizes on the same learning outcome construct of interest (e.g., two different measures of mathematics learning) for the same game and comparison group combination, or, in some cases, the same study included several variants of a game condition that were all compared to a single comparison condition. This meant that effect sizes were not statistically independent. Until recently, the most statistically defensible way to handle dependent effect sizes had been to model the dependencies among effect size estimates drawn from the same study using multivariate meta-analysis techniques (Gleser & Olkin, 2009), but these methods are often difficult to implement in practice because they require information about the intercorrelations between the effect sizes, which are seldom reported in primary studies.
Therefore, we used a technique to synthesize results that allows inclusion of statistically dependent effect size estimates in a single meta-analysis and does not require information about the intercorrelation between effect sizes within studies (Hedges et al., 2010; Tanner-Smith & Tipton, 2014). With this technique, robust standard errors are used to handle the lack of statistical independence in a set of correlated effect size estimates, and no information from the source studies about outcomes need be lost to the analysis. This technique therefore permits in-depth examination of variability in the effects of digital games on learning, specifically as that variability relates to study quality, game variants, and other study characteristics.
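A minimal intercept-only version of this approach can be sketched as follows. This illustrates the correlated-effects weighting scheme from Hedges et al. (2010) with hypothetical effect sizes; it is not the authors' analysis code, and it fixes τ² rather than estimating it.

```python
def rve_mean(studies, tau2):
    """Intercept-only robust variance estimation (RVE) sketch.
    `studies` is a list of studies; each study is a list of
    (effect_size, variance) pairs. Correlated-effects weights
    w_j = 1 / (k_j * (v_bar_j + tau2)) are shared by all effect
    sizes within study j (Hedges et al., 2010)."""
    weights, points = [], []
    for study in studies:
        k = len(study)
        v_bar = sum(v for _, v in study) / k
        w = 1.0 / (k * (v_bar + tau2))
        for t, _ in study:
            weights.append(w)
            points.append(t)
    w_sum = sum(weights)
    b = sum(w * t for w, t in zip(weights, points)) / w_sum

    # Robust (sandwich) standard error: residuals are aggregated
    # within studies, so dependence among a study's effect sizes
    # is absorbed without knowing their intercorrelations.
    ss, idx = 0.0, 0
    for study in studies:
        resid = 0.0
        for t, _ in study:
            resid += weights[idx] * (t - b)
            idx += 1
        ss += resid**2
    robust_se = ss**0.5 / w_sum
    return b, robust_se

# Hypothetical data: three studies, two contributing dependent effect sizes
studies = [
    [(0.30, 0.04), (0.50, 0.05)],   # study 1: two effect sizes
    [(0.20, 0.06)],                 # study 2: one effect size
    [(0.45, 0.03), (0.35, 0.04)],   # study 3: two effect sizes
]
b, se = rve_mean(studies, tau2=0.10)
print(round(b, 3), round(se, 3))
```

Note that no effect size is dropped: both effect sizes from studies 1 and 3 enter the weighted mean, and the within-study aggregation of residuals is what keeps the standard error valid despite their dependence.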
Missing Data
To be eligible for inclusion, studies were required to provide enough information to estimate a pretest and posttest effect size on at least one learning outcome. Therefore, there were no missing data in the effect size outcomes of interest. Data were missing, however, on some of the coded study characteristics. Attrition data were missing for 7% of studies, race/ethnicity for 67% of studies, gender composition for 28% of studies, game duration information for 8% of studies, and outcome measurement characteristics for 3% of studies. Because we did not have a large enough sample size to conduct any defensible imputation of missing data, we used listwise deletion for any moderator analyses that included these variables.
Analytic Strategies
Given the presumed heterogeneity in game conditions and participant samples, random effects statistical models were used for all analyses (Raudenbush, 2009). Mean effect sizes and meta-regression models using RVE were estimated using a weighted least squares approach (see Hedges et al., 2010; Tanner-Smith & Tipton, 2014, for more information). In the RVE framework, a simple model that relates the effect sizes
where
where the effect size
where
where for study
where
Recent simulation studies suggest that the statistical test originally proposed by Hedges et al. (2010) has low statistical power unless there are large numbers of studies included in the meta-analysis (López-López, Van den Noortgate, Tanner-Smith, Wilson, & Lipsey, 2015; López-López, Viechtbauer, Sánchez-Meca, & Marín-Martínez, 2010). Therefore, we used the
Results
All literature searches were conducted in September 2012. Figure 1 outlines the eligibility coding for the 61,887 reports identified in the literature search. A majority of reports were initially screened out at the title level (

Study identification flow diagram.
Demographic and Publication Characteristics
Table 2 shows descriptive statistics for the study, participant, and outcome characteristics. As shown in Table 2, the average publication years were 2009 and 2010, with publication dates ranging from 2000 to 2012. Attrition was relatively low in most studies, with an average of only .05, which was due in large part to the immediate posttest measurement employed by many studies. Few of the studies reported the race/ethnicity of participants sufficiently to code such characteristics. For those reporting information on the gender of participant samples, roughly half of the participants were male. The average ages of participants were 12 and 13, with most participants in the seventh grade. Learning outcomes focused primarily on cognitive competencies.
Descriptive statistics for study, participant, and learning outcome characteristics
Variables measured at study level.
Core Media Comparison and Value-Added Findings
All meta-analyses were estimated using robust variance estimates and could include multiple effect sizes from each study sample. Because all effect sizes were standardized mean differences, confidence intervals (CIs) around mean effect sizes that include zero indicate no significant difference between groups (regardless of the magnitude of the point estimate).
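The interpretation above can be made concrete by computing a standardized mean difference and its large-sample CI from group summary statistics. The group means, standard deviations, and sample sizes below are hypothetical illustrations, not values from any constituent study.

```python
import math

def smd_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Cohen's d and its large-sample 95% CI from two group summaries."""
    # Pooled standard deviation across the two groups
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp                                    # standardized mean difference
    var = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # approximate sampling variance
    se = math.sqrt(var)
    return d, (d - z * se, d + z * se)

# Hypothetical game vs. nongame posttest summaries
d, ci = smd_ci(m1=10.0, sd1=2.0, n1=50, m2=9.0, sd2=2.0, n2=50)
# Here the CI excludes zero, so the difference favoring the game
# condition would be judged statistically significant.
```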
Media Comparisons (Core Hypothesis 1)
Nongame conditions generally involved either additional “classroom as normal” time or time spent working on materials intended as representative of traditional instruction instead of playing the game. When restricting analyses to those studies comparing learning outcomes in digital game versus nongame conditions, there were 209 pairwise comparisons (i.e., effect sizes) from 57 studies across all learning domains. Among these, there were 173 effect sizes from 55 studies measuring effects on cognitive competencies, 35 effect sizes from 14 studies for intrapersonal competencies, and 1 effect size from 1 study for an interpersonal competency outcome. Table 3 shows that students in digital game conditions demonstrated significantly better outcomes overall relative to students in the nongame comparison conditions (
Results from moderator analyses examining differences in posttest mean effect sizes for digital game versus nongame conditions.
The mean effect size for cognitive competencies was 0.35, indicating a significant beneficial effect of digital games on cognitive learning outcomes, relative to comparison conditions (95% CI [0.20, 0.51], τ2 = 0.29). The mean effect size for intrapersonal competencies was also 0.35, indicating a significant beneficial effect of digital games on intrapersonal learning outcomes, relative to comparison conditions (
Value-Added Comparisons (Core Hypothesis 2)
Value-added comparisons measure the efficacy of a standard version of a game relative to an enhanced version of that game augmented to test a theoretical design proposition (Mayer, 2011). For the purposes of comparison, conditions were identified as including the standard version of a game or an enhanced version of that game. Table 4 shows the results from this analysis. Overall, the comparison of all enhanced versions versus standard versions showed a significant positive effect size for the enhanced designs (
Posttest mean effect sizes for enhanced design variants of digital games versus equivalent standard versions of those digital games
As part of this hypothesis, we had also planned to explore specific categories of value-added comparisons in terms of the focal compared feature represented in the enhancement. The only category with substantial representation turned out to be enhanced scaffolding, which included 20 effect sizes drawn from 9 studies. Enhanced scaffolding was defined broadly to include personalized scaffolding, intelligent agents, adapting game experiences to student needs or interests, and revised game structuring targeted at emphasizing the learning mechanic. Specific comparisons of enhanced scaffolding demonstrated a significant overall effect size of similar magnitude to the overall value-added findings (
Moderator Analyses of General Study Characteristics
Play Duration (Moderator Hypothesis 1a)
As shown in Table 3, game conditions involving multiple game play sessions demonstrated significantly better learning outcomes than nongame control conditions, but there was no evidence that game conditions involving single game play sessions were different from nongame control conditions. Furthermore, effects were significantly smaller when games were played in one game session versus more than one session (
Because the effect of the absolute duration of an intervention might differ widely depending on game characteristics, we reestimated the meta-regression models after controlling for visual realism, anthropomorphism, variety of game actions, viewpoint, story relevance, and story depth. The purpose was to examine whether the differences in observed effects across categories would remain after controlling for those other game characteristics. The relationship between single-session versus multiple-session comparisons remained statistically significant, and the relationship between total duration and effect size magnitude remained nonsignificant.
Additional Instruction (Moderator Hypothesis 1b)
Many studies included game conditions with additional nongame instruction (e.g., students participating in relevant classroom work in addition to game play). As shown in Table 3, there was no evidence that effects were different depending on whether or not the game conditions included additional nongame instruction (
Player Configuration (Moderator Hypothesis 1c)
In terms of player grouping structure, effects of digital games on learning outcomes were largest for those game conditions using single noncollaborative/noncompetitive play. In fact, this was the only group that demonstrated significant learning gains, although this could be due to the small number of conditions and effect sizes in each of the other categories (see Table 3). Moreover, average effects were significantly larger for games with single players (with no formal collaboration or competition) relative to those using single/competitive play (
Because player grouping structure might be correlated with other game characteristics, we reestimated the meta-regression models while controlling for visual realism, anthropomorphism, variety of game actions, viewpoint, story relevance, and story depth. Results from that model indicated that after controlling for those other game characteristics, games with single noncollaborative/noncompetitive players still exhibited significantly larger mean effect sizes than those with single competitive players (
Moderator Analyses of Game Mechanics Characteristics
Sophistication of Mechanics (Moderator Hypothesis 2a)
The first category of broad sophistication of game design focuses on relatively rudimentary games involving the mere addition of points and/or badges to schoollike tasks. As shown in Table 3, these games were associated with a 0.53 standard deviation improvement in learning outcomes. Games in the second category could include those rudimentary aspects, but they also included mechanics, scaffolding, and/or situating context beyond those rudimentary aspects. This second category of games was associated with a 0.25 standard deviation improvement in learning. Although these results suggest that the average effect was largest for rudimentary games, results from a meta-regression model including a dummy indicator for the game type indicated no significant differences in the mean effect size across the two categories (
Variety of Player Actions (Moderator Hypothesis 2b)
The next section of Table 3 presents results in terms of the variety of game actions in which the player engaged during the game (i.e., small, medium, or large). Small variety includes simple games, such as
Overall, effects on learning were strongest for game conditions with medium or large varieties of game actions. Effects were somewhat smaller for those games with a small variety of actions, but there was no evidence that the mean effect sizes across these three categories were different from each other. Results were similar after controlling for visual realism, anthropomorphism, viewpoint, story relevance, and story depth.
Intrinsic Integration (Moderator Hypothesis 2c)
We also classified game conditions based on the integration of the primary learning mechanic and the primary game play mechanic (cf. Habgood & Ainsworth, 2011; Kafai, 1996). The learning mechanics can be defined as the mechanics and interactions intended to support players in learning the target learning outcomes. The game mechanics can be defined as the mechanics and interactions ostensibly designed for engagement and progress in the game.
Interestingly, there was only one game with a fully extrinsic relationship between core learning and game mechanics. Instead, many of the games were
Although the mean effect size was slightly larger for games using simplistically intrinsic designs relative to those using intrinsic or not fully intrinsic designs, there was no evidence that the mean effect sizes were significantly different across these three categories (Table 3). Results were similar even after controlling for the visual realism, anthropomorphism, variety of game actions, viewpoint, story relevance, and story depth game characteristic variables.
Scaffolding (Moderator Hypothesis 2d)
We compared four categories of scaffolding. Game conditions in the lowest category provide scaffolding only in terms of indicating success/failure or number of points earned by the player. The next category includes scaffolding that additionally displays the answer/solution in some manner after an error. The next category provides enhanced scaffolding beyond simply indicating success/failure and displaying the correct answer (e.g., intelligent agents or adapting scaffolding to past performance). The highest category (in terms of adaptiveness of the scaffolding) involves scaffolding provided by the teacher. As shown in Table 3, although results were relatively similar across the scaffolding categories, the effect on learning outcomes was significantly larger for games where the teacher provided scaffolding relative to those games using simple success/failure/points (
Moderator Analyses of Visual and Narrative Game Characteristics
Visual Realism (Moderator Hypothesis 3a)
Visual realism focuses on the graphical realism of the game environment. The schematic category includes schematic, symbolic, and text-based games with overall simplistic graphical elements. The cartoon category includes games with nonrealistic shading or forms (e.g., nonrealistic forms of characters or objects), often in a two-dimensional format. The realistic category includes games with realistic shading/forms or real pictures, often in a three-dimensional format. Effects were significantly larger for schematic than realistic games (
Anthropomorphism (Moderator Hypothesis 3b)
We coded anthropomorphism as the degree to which the player, nonplayable characters, and environmental entities in the game have human features or perform humanlike movements. For an entity to be considered relevant for the purposes of coding, attention to the entity must be important for successful gameplay. The low/none category includes either few or no anthropomorphic entities or features. The medium category includes approximately equal numbers of anthropomorphic and nonanthropomorphic entities. The high category includes a majority of anthropomorphic entities and features and anthropomorphic qualities closer to human. As shown in Table 3, effects were significantly larger for games using low/no anthropomorphizing compared to those using medium levels of anthropomorphizing (
Perspective (Moderator Hypothesis 3c)
Camera perspective is the camera viewpoint through which players interact with the game. If the game included cut-scenes in which the player did not control actions, these cut-scenes were not considered when coding for camera perspective. The third person viewpoint presents noncamera-based views (e.g.,
Story Relevance (Moderator Hypothesis 3d)
Story relevance assesses whether or not the narrative is relevant to the learning mechanic. Story relevance is different from the relationship between the game mechanic and learning mechanic (intrinsic vs. extrinsic) because it deals specifically with the story rather than the game mechanic. A story about analyzing scientific data in a game that requires applying math skills to graphs of experimental data would be relevant (e.g.,
Story Depth (Moderator Hypothesis 3e)
Story depth categorizes the extent of the story. Thin depth involves only setting, scenery, or context. Medium depth involves some evolving story over the course of the game. Thick depth includes a rich evolving story over the course of the game. Results showed that games with no story or thin story depth both had significantly larger effects relative to those with medium story depth (
Contextualization (Moderator Hypothesis 3f)
One issue with individual analyses of visual realism, anthropomorphism, camera viewpoint, story relevance, and story depth is that these characteristics are likely intercorrelated. For example, we found significant correlations at the

Scatter plot of pretest-adjusted posttest effect sizes and overall contextualization aggregate score for digital game versus nongame conditions (media comparisons).
Research Characteristics in Value-Added and Media Comparisons
Comparison Condition Quality (Moderator Hypothesis 4a)
Comparison condition quality tracks the equivalence of the control condition to the game condition in terms of the focal comparison (i.e., the manipulation the authors indicated as the primary focus). We coded comparison conditions as (a) sham or irrelevant activities, (b) weak comparisons, (c) medium comparisons representing rough equivalents of typical classroom approaches but not representing tightly controlled matches, (d) strong comparisons designed and optimized as clearly viable alternatives but still not tightly controlled matches, and (e) excellent comparisons representing direct analogs controlling for all but the focal variables. Results indicated that comparison condition quality had a significant relationship to effect sizes in the media comparison analyses (
Posttest mean effect sizes for digital game versus nongame conditions for all learning outcomes by study quality variables
Condition Reporting (Moderator Hypothesis 4b)
Condition reporting was coded in terms of word count and number of screenshots for game conditions. Many studies provided minimal information about the game conditions. Table 5 shows the results from analyses restricted to studies based on the word count of the game description. Overall, there were minimal differences in effects when we filtered based on word counts of the game descriptions for media comparison or value-added analyses. Number of screenshots, however, was significantly correlated with effect sizes for the game versus nongame conditions in the media comparison analyses (
Methods Reporting (Moderator Hypothesis 4c)
We coded each study subjectively in terms of insufficiency of reporting of methods and analyses. Specifically, we coded whether studies reported clearly inappropriate statistical analyses (e.g., analyzing cluster-randomized trial data at the individual unit of analysis with no adjustment for clustering) and/or demonstrated serious omissions in the reporting of methods or statistical analyses (e.g., omission of standard deviations or sample sizes, confusion between posttest and pretest–posttest change scores). Although sufficiency of reporting methods and analyses was not significantly associated with effect size magnitude in the meta-regression models for the media comparison or value-added analyses, it is noteworthy that the mean effect size was reduced substantially if we restricted the media comparison analyses to only those studies with unflawed reporting of their methods or analyses.
Assessment Overalignment (Moderator Hypothesis 4d)
We coded each study in terms of subjective overalignment of assessment with the game tasks. Specifically, we applied a binary code to indicate whether studies used learning outcome measures that were partially or entirely overaligned with the learning tasks included in the game conditions themselves (e.g., an English proficiency test of vocabulary questions that included the same vocabulary questions appearing in the digital quiz game under investigation). This characteristic was not significantly associated with effect size in the meta-regression models for the media comparison or value-added analyses, and there were minimal differences in effects when we filtered based on overalignment with outcome.
Assessment Type (Moderator Hypothesis 4e)
Assessments were categorized as preexisting normed instruments, modifications of preexisting instruments, or author-developed instruments. Results indicated no significant differences in effect size magnitude across assessment types for media comparison or value-added analyses, although the mean effect sizes varied slightly and nonsignificantly from 0.33 for author-developed instruments to 0.40 for preexisting normed instruments for media comparison analyses.
Study Design (Moderator Hypothesis 4f)
Research design was not associated with effect size magnitude for value-added analyses. For media comparison analyses, the association between research design and effect size magnitude was marginal but did not reach statistical significance (
Restricting Analyses With Multiple Study Quality Characteristics
We consider comparison condition quality, sufficient condition reporting, sufficient reporting of methods and analyses, and overalignment of assessment to be quality variables to which all studies should be held accountable (i.e., study design–independent). We consider assessment type and research design to be study design–dependent quality variables in the sense that they must be weighed against other research choices (which we clarify in the Discussion section).
For media comparison analyses, only four studies met all study design–independent filters, and only two studies met all study quality filters. Synthesizing results for those comparisons yields nonsignificant mean effect sizes (
Publication Bias
Finally, related to research quality, we also explored the possibility of publication bias within our sample. Figure 3 shows the funnel plots with pseudo 95% confidence limits for the media comparison (top) and value-added analyses (bottom). There were no obvious asymmetries in either funnel plot by outcome type, and results from Egger regression tests provided no evidence of small study effects/bias for media comparison (
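The Egger test used above regresses each standardized effect on its precision; the intercept estimates small-study asymmetry, while the slope estimates the underlying effect. A minimal sketch with hypothetical, deliberately symmetric data (this is an illustration of the test's logic, not the authors' analysis):

```python
import numpy as np

def egger_regression(d, se):
    """Egger test: regress z_i = d_i / se_i on precision 1/se_i.

    A nonzero intercept signals small-study effects (funnel plot
    asymmetry); the slope estimates the pooled effect.
    Returns (intercept, slope).
    """
    d, se = np.asarray(d, float), np.asarray(se, float)
    z = d / se                       # standardized effects
    precision = 1.0 / se
    X = np.column_stack([np.ones_like(precision), precision])
    (intercept, slope), *_ = np.linalg.lstsq(X, z, rcond=None)
    return intercept, slope

# Hypothetical unbiased sample: identical true effect, varying precision,
# so no small-study asymmetry is built in
se = np.array([0.1, 0.2, 0.3, 0.4])
d = 0.33 * np.ones(4)
intercept, slope = egger_regression(d, se)
# With no built-in bias, the intercept is (numerically) zero.
```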

Funnel plots with pseudo 95% confidence limits for media comparisons (top) and value-added comparisons (bottom).
Discussion
Overall, results indicated that digital games were associated with a 0.33 standard deviation improvement relative to nongame comparison conditions. Thus, digital game conditions were on average more effective than the nongame instructional conditions included in those comparisons. These results generally confirm the overall findings from prior meta-analyses on the effects of games on learning (Sitzmann, 2011; Vogel et al., 2006; Wouters et al., 2013). Findings from the present meta-analysis do diverge slightly from the finding in Wouters et al. (2013) that game conditions and nongame instructional conditions did not differ in terms of motivation outcomes. In the current study, the intrapersonal learning outcome domain included not only motivation but also intellectual openness, work ethic and conscientiousness, and positive core self-evaluation. Thus our findings do not necessarily conflict with those of Wouters et al. (2013) but rather suggest that game conditions support overall improvements in intrapersonal learning outcomes relative to nongame instructional conditions.
In terms of value-added comparisons, augmented game designs were associated with a 0.34 standard deviation improvement in learning relative to standard versions. This, along with the largely overlapping confidence intervals around the mean effect sizes in the media comparison and value-added analyses, suggests that the effects for the media comparison and value-added comparisons were similar in magnitude. This result highlights that the design of an intervention is associated with as large an effect as the medium of an intervention. Although this conclusion may appear to be common sense, the role of design is often de-emphasized in debates over whether digital games are better or worse than traditional instruction. The value-added findings empirically demonstrate the importance of the role of design beyond medium. Although too few value-added comparisons met eligibility requirements to support moderator analyses of design features, the findings underscore the need to carefully consider moderator analyses of differences in design across game conditions in the media comparisons, in terms of better understanding the role of design as well as in terms of interpreting the nature and import of what is being compared.
Moderator Analyses of General Study Characteristics
Play Duration (Moderator Hypothesis 1a)
Similar to results from prior meta-analyses, we found that (a) game conditions involving multiple game play sessions demonstrated significantly better learning outcomes than nongame control conditions and (b) game conditions involving single game play sessions did not demonstrate different learning outcomes than nongame control conditions. In our analysis that focused on total game play duration (i.e., as a continuous moderator variable), however, we found no evidence of a relationship between total duration and effects on learning outcomes.
It is worth noting that the constituent studies involved largely equivalent amounts of treatment time between experimental and comparison conditions (rather than simply comparing treatment time increases in experimental conditions while holding control conditions constant). The findings may therefore reflect a benefit of spaced learning as compared to massed learning in game contexts (cf. Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006; McDaniel et al., 2013). Longer play durations may enhance learning but only when sessions are adequately spaced. Alternatively, it is possible that the impact of total duration was masked by game conditions that were longer than needed to achieve the observed improvement on the assessments. Games were played for an average of 347 minutes (or almost 6 hours). Deeper assessments of student learning should thus be investigated in future research.
Additional Instruction (Moderator Hypothesis 1b)
Additional nongame instruction was not associated with larger or smaller effects for game conditions in media comparisons. These findings diverge from Sitzmann (2011) and Wouters et al. (2013), who found that supplemental nongame instruction supported learning. Both Sitzmann (2011) and Wouters et al. (2013) may have used a more stringent definition for “additional instruction.” In the present meta-analysis, game conditions were coded as including additional nongame instruction (whether integrated or not) if players were exposed to a learning context that was likely to provide them with additional topic-relevant information (e.g., spending days in typical classroom instruction). Based on Wouters et al.’s (2013) examples, it is possible that only studies that explicitly stated that players received additional domain-relevant instruction were coded as such. This might suggest that additional teaching or activities specifically designed to supplement game content as part of an integrated experience can increase learning but unintegrated supplemental instruction is unlikely to contribute to larger gains. Another important clarification is that Sitzmann (2011), Wouters et al. (2013), and the current meta-analysis do not include interaction of players with informal sites or communities on the Internet as “nongame instruction” (e.g., World of Warcraft community forums or Wiki Game support sites). Research has demonstrated the importance and power of the argumentation and learning that occur on these sites (e.g., Steinkuehler & Duncan, 2008). Thus, findings from the three meta-analyses do not analyze that important context of learning and participation around games.
Player Configuration (Moderator Hypothesis 1c)
When controlling for game characteristics, single-player games without competition and collaborative team competition games outperformed single-player games with competition. These findings partly parallel those of Wouters et al. (2013), who found that collaborative play was generally more effective than individual play. Our findings, however, suggest that collaborative games may not be generally more effective for learning than single-player games but that competitive single-player structures may be least effective. This explanation would align with research on motivation and learning (e.g., Bandura, 1997; Pintrich, 2003; Schunk, 1991). In a single-player competitive structure, motivational support of self-efficacy for some students necessarily comes at the expense of other students (because one student's gain necessitates another student's loss).
This comparison highlights, however, the challenges of aggregating interventions across studies and games in terms of the potential to mask important instructional variables. Although meta-analysis as a method assumes commensurability across studies, confounding variables are inevitably present when synthesizing aggregate findings from multiple studies. In the current analysis, it is possible that the goals for individual versus group game conditions differed in a manner that contributed to the observed overarching pattern. Indeed, single-player games might have induced players to pursue a goal of attaining the highest possible score, for example, whereas collaborative games might have induced players to adopt or test maximizing strategies for team member roles. Interpretation thus requires careful consideration of possible underlying variables and mechanisms of change. We will return to this challenge in the Caveats and Limitations section.
Moderator Analyses of Game Mechanics Characteristics
The comparison of broad design sophistication in media comparisons (Moderator Hypothesis 2a) demonstrated that simple gamification as well as more sophisticated game mechanics can prove effective. To clarify this finding, future research and analyses should explore whether or not the simple gamification studies (e.g., games that simply add contingent points and badges to learning activities) more frequently focus on lower order learning outcomes as compared to studies with more sophisticated game mechanics. Regardless, these results support the proposal that simple gamification can prove effective for improving certain types of learning outcomes (cf. Lee & Hammer, 2011; Sheldon, 2011). These findings parallel those observed for the variety of game actions (Moderator Hypothesis 2b), showing equivalent learning outcomes across all levels of action variety in media comparison studies.
The present meta-analysis is largely silent with regard to intrinsic versus extrinsic design (Moderator Hypothesis 2c) because only one study involved a fully extrinsic condition. Regarding the nature of scaffolding (Moderator Hypothesis 2d), each category of scaffolding demonstrated significant effects on learning relative to nongame control conditions, but higher levels of scaffolding were associated with higher relative learning outcomes than lower levels of scaffolding. Enhanced scaffolding also showed significant effects on learning outcomes in the value-added analyses. These findings provide a productive foundation for ongoing work on enhancing scaffolding in games (e.g., Barzilai & Blau, 2014).
Moderator Analyses of Visual and Narrative Game Characteristics
Several visual and narrative game characteristics (Moderator Hypotheses 3a–3e) were intercorrelated. An aggregate contextualization variable created from these game characteristics (Moderator Hypothesis 3f) demonstrated a small but significant negative relationship with learning gains overall in media comparisons. This result parallels the findings of Wouters et al. (2013), which showed that schematic games were more effective than cartoon-like or realistic serious games and supports the trend those authors observed that games with no narrative might be more effective than games with narratives.
On the surface, these findings contradict research and theory highlighting the value of situating learning in context (e.g., Bransford et al., 2000). One possible interpretation is that rich narratives and visual complexity distract students from the intended learning content or provide alternative goals within the game that do not support improvement on the assessed outcome measures. This interpretation would speak to the need for game designers and education researchers to collaborate on designs to keep game graphics, environments, and narratives optimally aligned with assessed learning objectives.
A second possible interpretation focuses on the nature of the assessments in the constituent studies. Almost all the studies analyzed in this report involved immediate posttests focusing on lower order learning outcomes. The arguments for situating learning in context focus on developing a deep, durable, integrated understanding that students can apply across contexts (essentially the opposite of an immediate focused posttest). This interpretation highlights the importance of including assessments designed to measure deeper understanding in future research. Such a shift in assessment would align with theoretical proposals indicating that the greatest strengths of digital games as a medium involve their affordances for supporting higher order cognitive, intrapersonal, and interpersonal learning objectives (e.g., Gee, 2007; Squire, 2011).
The visual and narrative features of games are also envisioned as potentially creating a “time for telling” about lower level concepts in a meaningful and compelling context. A third possible interpretation of our findings from this perspective is that our own coding rules may not have captured the critical relationships between narratives and learning in terms of time for telling about lower order learning objectives. Specifically, we coded the relevance of narratives in terms of relevance to the learning mechanic rather than assessment content. Thus, relevant narratives may have helped contextualize the learning mechanic in the game play but failed to create a time for telling about lower level concepts in a meaningful manner in terms of the assessed learning objectives.
A fourth possible interpretation also focuses on our coding system. We coded narrative in terms of relevance and thickness, but perhaps the critical features of narratives are whether they are engaging, high-quality, or accessible, regardless of thickness or relevance. Some thin narratives are incredibly engaging, whereas some thick narratives may be dull. Additionally, poorly designed thick narratives might be difficult for students to understand. Similar questions could be framed in terms of the value of visual sophistication versus visual clarity or visual engagement. The amount of information reported about the game contexts was minimal in many of the constituent studies, restricting the ways in which we were able to code visual and narrative characteristics, but clearly much room remains for exploring the relationships between contextualization and learning.
Research Characteristics in Value-Added and Media Comparisons
Few studies met all four study design–independent quality variables for the research quality moderator analyses (Moderator Hypotheses 4a–4d) in value-added or media comparisons, supporting claims that methodological rigor needs to be improved in research on games for learning. That said, results from moderator analyses indicated that few study quality variables (design-independent or design-dependent) influenced the effects of digital games on learning outcomes in the media comparison or value-added analyses (Moderator Hypotheses 4a–4f). This provides additional confidence in our overall effect estimates and suggests that findings were not unduly biased by individual study quality variables. Further discussion (provided below) is merited, however, for one design-independent variable (control condition quality) and both design-dependent variables (assessment type and research design).
Control Condition Quality
Restricting the meta-analysis to only those studies with medium or better comparison condition quality (thus weeding out “straw man” comparisons) reduced the effect size from 0.33 to 0.28 (but remained significant). These findings further underscore the importance of design (and careful reporting of that design) for both game and nongame conditions (cf. Young et al., 2012). Media comparison research often highlights medium while placing less emphasis on the design of the game and control conditions. Many of the media comparison studies in the present report, for example, provided only sparse descriptions of game or control interventions. As research on games begins to focus more on design, researchers will need to provide thicker descriptions of conditions to support informed comparisons across studies.
Assessment Type
There are trade-offs between research questions of interest and the availability of preexisting normed instruments. Although preexisting assessments can clearly enhance confidence in research quality, these instruments exist only for certain outcomes. Furthermore, the present meta-analysis found no evidence of a relationship between assessment type (i.e., preexisting normed instrument, modification of a preexisting instrument, or author-developed instrument) and effect size magnitude. The present meta-analysis also found no evidence of a relationship between effect sizes and potential overalignment of assessments. Given the aforementioned trade-offs and our null result concerning the impact of normed instruments on effect sizes, we propose that requiring research to rely exclusively on preexisting normed instruments would unnecessarily limit digital games research. This issue is particularly relevant for the outcome types most desirable from the perspective of 21st-century skills and preparedness (for which normed assessments are scarce). Researchers should thus be encouraged to choose appropriate assessments based on learning goals but should report reliability and validity information for author-created or -modified instruments.
Research Design
Although there were no significant differences in average effects across randomized and controlled quasi-experimental designs in the present meta-analysis, the observed effects were notably smaller in the studies using randomized designs. Post hoc correlational analyses showed, however, that differences in game characteristics between games in studies using randomized versus quasi-experimental designs might partially account for effect size differences across study designs. Furthermore, randomized designs preclude the study of many research questions and populations. We therefore argue that researchers should carefully weigh the benefits of experimental designs in light of fundamental issues of ecological validity, authenticity, and the specific requirements of the research questions under exploration. In studies where quasi-experimental designs are implemented, researchers must provide more substantial information about group attributes and account for those attributes in analyses.
Caveats and Limitations
This section raises three issues for consideration. The first involves commensurability, which should be considered when interpreting this (or any) meta-analysis. Meta-analyses assume that the included pairwise comparisons represent relatively standardized or homogeneous conditions. In practice, this is not the case even in settings that might appear highly homogeneous, such as medical research. Jüni, Witschi, Bloch, and Egger (1999), for example, described these hazards in great detail in their article in the Journal of the American Medical Association.
Thus, although meta-analyses aggregate findings within categories that sound highly generalizable, the included research conditions do not fill or equally represent the entire domain suggested by the categories. Neither this nor any meta-analysis accounts for all possible design approaches or qualities of implementation. Future research should not be limited, therefore, to the highest performing game characteristics identified in the current meta-analysis. Alternative designs for low-performing game characteristics should be investigated if those characteristics are considered critical to learning goals. We argue that this implication is particularly salient regarding our findings for visual and narrative contextualization, where overarching research highlights the importance of situating learning in context to support deeper understanding, but the findings of this meta-analysis underscore potential design and alignment challenges.
In addition to commensurability of game conditions, there are commensurability issues for the nongame comparison conditions, which generally represented typical instructional approaches rather than optimized learning activities in the constituent studies. The findings of the media comparison analyses should thus not be interpreted as suggesting that game-based instruction is superior to all learning experiences that could be designed within traditional media; rather, the findings suggest that the game-based experiences analyzed in these studies were superior to the traditional nongame approaches implemented in the constituent studies. We therefore urge against simplistic quotations of findings suggesting that games universally outperform nongame learning approaches. The results and comparisons are more complex and must be acknowledged as such.
The second issue concerns inclusion, which is related to commensurability. Meta-analyses include distinct cross sections of studies (as is true for any type of review; cf. Young et al., 2012). As shown in Table 1, Vogel et al.’s (2006) and Sitzmann’s (2011) meta-analyses included simulations, for example, and less than 50% of the studies from Wouters et al. (2013) were eligible in the present meta-analysis (with publication date and research designs accounting for most differences). Furthermore, although many important studies focusing on design have been conducted in the learning sciences, games research, and other fields, not all of these studies met the eligibility criteria for inclusion in this particular meta-analysis (often based on the requirement of experimental or quasi-experimental designs involving pretest–posttest measurements, sufficient reporting for calculation of effect sizes, or eligible comparison conditions). This is important to note because research conducted from some epistemological paradigms, particularly sociocultural paradigms, can be relatively incompatible with current assessment practices and experimental designs. The current meta-analysis therefore includes only a cross section of research on games, and eligibility should not be conflated with contribution or value. We need to draw on the findings across studies, regardless of their eligibility for inclusion in the current analyses, as we move forward in exploring how design can leverage the affordances of games for learning.
The third issue concerns assessments. Higher order cognitive, intrapersonal, and interpersonal processes and skills prove more challenging to measure accurately and reliably than do lower order cognitive skills and rote knowledge. As a result, research on games has generally focused on lower order cognitive skills, rote knowledge, and immediate posttests. The NRC report on education for life and work in the 21st century, however, emphasizes a more distributed focus across outcomes, if not a complete reversal in emphasis. Furthermore, proponents of digital games for learning (e.g., Gee, 2007; Squire, 2011) propose that the greatest strengths of digital games as a medium involve their affordances for supporting higher order cognitive, intrapersonal, and interpersonal learning objectives. Assessments that yield reliable and valid scores of higher order processes and skills would also facilitate further research at sociocultural and situated grain sizes of the overarching activity structure and community, as well as over much longer longitudinal time frames of months or years rather than hours or days, which are the grain sizes and time frames underlying the greatest strengths of games for learning (cf. Young et al., 2012). For all these reasons, ongoing development and research should focus more heavily on accurate and reliable assessment of higher order learning outcomes.
Role of Design and Final Thoughts
To date, much experimental and quasi-experimental research on games and learning has focused on media comparisons. The present meta-analysis suggests that games as a medium can indeed support productive learning. Furthermore, the results of the present meta-analysis parallel the conclusions of the NRC report on laboratory and inquiry activities (Singer, Hilton, & Schweingruber, 2005) in highlighting the key role of design beyond medium. Thus, harkening back to the Clark/Kozma debates of the 1980s and 1990s about the relative importance of studying medium versus design (e.g., Clark, 1994; Kozma, 1994), games as a medium definitely provide new and powerful affordances, but it is the design within the medium to leverage those affordances that will determine the efficacy of a learning environment. We now need to leverage findings on games from across methodological paradigms, regardless of their eligibility for inclusion in the current analyses, to conduct situated empirical analyses that consider design in terms of interactions among player goals, game affordances, pedagogy, teaching objectives, and curricular content. Our findings expand on and reinforce Young et al.’s (2012) findings that we should “stop seeking simple answers to the wrong question” (p. 84). We should thus shift emphasis from proof-of-concept studies (“Can games support learning?”) and media comparison analyses (“Are games better or worse than other media for learning?”) to cognitive-consequences and value-added studies exploring how theoretically driven design decisions influence situated learning outcomes for the broad diversity of learners within and beyond our classrooms.
Authors
DOUGLAS B. CLARK is a professor of the learning sciences and science education at Vanderbilt University, Box 230, 230 Appleton Place, Nashville, TN 37203-5721; e-mail:
EMILY E. TANNER-SMITH is a research assistant professor at the Peabody Research Institute and Department of Human and Organizational Development at Vanderbilt University, Box 0181, 230 Appleton Place, Nashville, TN 37203-5721; e-mail:
STEPHEN S. KILLINGSWORTH is a postdoctoral scholar in the Department of Teaching and Learning at Vanderbilt University, Box 230, 230 Appleton Place, Nashville, TN 37203-5721; e-mail:
