Abstract
The testing effect is a robust empirical finding in the research on learning and instruction, demonstrating that taking tests during the learning phase facilitates later retrieval from long-term memory. Early evidence came mainly from laboratory studies, though in recent years applied educational researchers have become increasingly interested in the effects of retrieval practice. We investigated the extent to which the testing effect can also be observed and effectively used in psychology classes. Inspection of the research literature yielded 19 publications that tested the effect in the context of learning and teaching psychology. A total of 72 effect sizes were extracted from these publications and subjected to a meta-analysis. A significant overall effect size of d = 0.56 demonstrated that testing was beneficial to the learning outcomes. Further analyses focussed on the role of potential moderator variables, a possible publication bias, and the dependency between effect sizes. The results are discussed in the context of applications in learning and teaching psychology.
Introduction
“The term evidence-based teaching refers to […] tools and techniques that have shown through rigorous experimentation to promote learning” (Dunn, Saville, Baker, & Marek, 2013, p. 5). Recently, several authors have published collections of such tools, techniques, and strategies to draw attention to the concept of evidence-based teaching in various contexts (e.g., Cranney, 2013; Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013; Dunn et al., 2013; Graesser, 2009; Graesser, Halpern, & Hakel, 2008; Pashler et al., 2007; Roediger & Pyc, 2012; Schwartz & Gurung, 2012). Testing, or retrieval practice, is one of the most frequently studied of these techniques and is recommended in numerous publications on evidence-based teaching strategies (e.g., Dunlosky et al., 2013; Dunn et al., 2013; Graesser et al., 2008; Pashler et al., 2007; Roediger & Pyc, 2012).
In the classroom, tests are commonly applied to assess students’ learning performance and to evaluate academic achievement. Many experimental studies, however, indicate that taking tests also facilitates students’ learning. Often, the accessibility of previously retrieved learning material is enhanced in a later, final test – a phenomenon called the testing effect (cf. Roediger & Karpicke, 2006a). In this context, the focus of testing is shifted from the assessment of learning outcomes to supporting the learning process – assessment for learning rather than assessment of learning.
The effect was first empirically investigated in the early 1900s (e.g., Abbott, 1909; Gates, 1917). Although interest in the topic spans a century, research on the testing effect has increased enormously during the past 15 years, producing a considerable number of studies investigating the various conditions under which the effect occurs (for recent reviews and meta-analyses, see, for example, Dunlosky et al., 2013; Roediger & Butler, 2011; Rowland, 2014).
Theoretical accounts explaining the beneficial effect of testing differentiate direct and indirect effects. Direct effects refer to the impact of retrieving information from memory. Retrieval practice is assumed to strengthen the memory trace by elaborating the encoded information and by creating different retrieval routes to the information in long-term memory (cf. Dunlosky et al., 2013). Several empirical findings further indicate that the amount of retrieval effort directly influences the extent of the effect. Testing effects are larger following more difficult tests (e.g., Glover, 1989; Kang, McDermott, & Roediger, 2007) and following longer intervals between initial learning and testing (e.g., Karpicke & Roediger, 2007; Pyc & Rawson, 2009). Indirect effects refer to the modulation of learning behaviours after retrieval practice. From a metacognitive perspective, for example, tests can serve as monitoring tools that provide information about the current state of learning (Winne & Hadwin, 1998). Learners can use this diagnostic information for adapting subsequent learning activities. For example, learners can intensify encoding of learning content that they had failed to retrieve or were unsure of during retrieval. Providing feedback about the test result can further support this process by disclosing cognitive and metacognitive errors during retrieval (e.g., Butler, Karpicke, & Roediger, 2008).
Research on the Testing Effect: From the Laboratory to the Classroom
Early research focussing on the memory processes underlying direct testing effects was conducted mainly in the laboratory. Numerous experimental studies have demonstrated the testing effect as a robust phenomenon across a wide variety of samples, learning materials, test formats, criterion tasks, and retention intervals. For example, testing effects were shown across a wide age range, in pre-school (e.g., Fritz, Morris, Nolan, & Singleton, 2007) and school children (e.g., Bouwmeester & Verkoeijen, 2011), in university students (e.g., Karpicke & Roediger, 2007), and in older adults (e.g., Balota, Duchek, Sergent-Marshall, & Roediger, 2006). Studies on the testing effect also encompassed different types of learning materials such as simple word lists (e.g., Karpicke & Roediger, 2007), definitions (e.g., Metcalfe, Kornell, & Son, 2007) and factual knowledge (e.g., Butler et al., 2008), prose text materials (e.g., Roediger & Karpicke, 2006b), and videos and animations (e.g., Butler & Roediger, 2007). Moreover, beneficial effects of testing were found using various criterion tasks in the final tests, including recognition, cued and free recall, and tasks involving learning transfer (e.g., Chan, 2010) or inferences (e.g., Agarwal & Roediger, 2011).
As experimental studies increasingly indicated the generalizability of the testing effect, applying tests as retrieval practice in educational contexts became a new focus of research in this field. The practical relevance of the testing effect was highlighted by two core findings: beneficial effects of retrieval practice with classroom-relevant learning materials (e.g., Butler & Roediger, 2007; Kang et al., 2007; McDaniel, Anderson, Derbish, & Morrisette, 2007) and long-term beneficial effects even after intervals of several weeks (e.g., Butler & Roediger, 2007), months (e.g., McDaniel, Agarwal, Huelser, McDermott, & Roediger, 2011), and years (e.g., Bahrick, Bahrick, Bahrick, & Bahrick, 1993). These findings inspired a growing number of field studies investigating the testing effect under real classroom conditions (e.g., Bjork, Little, & Storm, 2014; Carpenter, Pashler, & Cepeda, 2009; Roediger, Agarwal, McDaniel, & McDermott, 2011). As a result, the implementation of retrieval practice in everyday educational settings has become a focus of discussion, and it was recently highly recommended for classroom use (e.g., Dunlosky et al., 2013; Dunlosky & Rawson, 2015). Roediger and Pyc (2012) further promoted retrieval practice as one of the “inexpensive techniques to improve education” (p. 242).
In conclusion, researchers in cognitive and educational psychology strongly recommend applying evidence-based teaching methods, but do they follow their own instructions, and can they demonstrate in empirical studies how to successfully adapt evidence-based teaching methods such as retrieval practice to their own teaching of psychology? This question is relevant for at least two reasons. First, such empirical demonstrations would underpin the credibility of recommendations given by psychologists to teachers and instructors in the field. Second, such empirical demonstrations could encourage other researchers in psychology to innovate and implement techniques in the teaching of psychology based on their own empirical findings. With regard to the testing effect literature, this question is still open. Studies investigating the testing effect with psychological learning materials are rare and not easily detectable. Many studies were conducted with students of psychology but did not involve psychological learning material (e.g., Pyc & Rawson, 2012). Furthermore, some of the studies examining the testing effect in the context of teaching psychology comprised only small samples and yielded inconclusive results (e.g., Bell, Simone, & Whitfield, 2015). Finally, the only meta-analysis investigating the testing effect in classroom studies (Bangert-Drowns, Kulik, & Kulik, 1991) is more than two decades old, included studies using psychological learning materials alongside studies from other domains, and investigated a specific research question, namely the role of multiple testing. Therefore, the current study investigated the extent to which the testing effect can be observed and effectively used in learning and teaching psychology, using meta-analytic methods to summarize the current state of evidence on this question. Based on the evidence demonstrating the generalizability of the testing effect and a growing number of field studies demonstrating beneficial effects of testing in the classroom, we expected a positive overall effect of retrieval practice on measures of learning performance in psychology classrooms.
Scope of the Meta-Analysis
We searched the literature for studies investigating testing as retrieval practice in the context of learning and teaching psychology and applied two inclusion criteria for identifying relevant studies. First, studies were classified as relevant when retrieval practice was applied in teaching psychology students or in the context of teaching psychological learning content to non-psychology students. Second, studies were only included when the effects of testing could be compared with adequate control conditions such as restudying or no testing. Although the implementation of retrieval practice in these studies varied substantially, the study design typically included three phases: an initial learning phase in which the learning materials are presented to the participants for the first time; an intervening phase in which the content is tested, re-studied, or not presented again; and a final test.
With regard to the heterogeneity of the included studies, three study characteristics were systematically coded and controlled because of their influence on the occurrence and the extent of a testing effect. (a) Studies were classified according to whether they used a between-subjects or a within-subjects design. In between-subjects designs, one group of participants takes an intermediate test and another group does not. A testing effect is indicated by higher performance in the intermediately tested group compared to the participants who took only the final test. In within-subjects designs, all participants are intermediately tested, but not on the complete learning material. The testing effect is indicated by higher accessibility of the previously tested items compared to the non-tested items. Within-subjects designs seem to be the more conservative option, because some studies demonstrated that performance was enhanced not only on tested material but also on non-tested, semantically related material (e.g., Carpenter, 2011; Pyc & Rawson, 2010). Moreover, within-subjects designs are more suitable for controlling effects of individual differences. (b) Studies were classified according to whether they applied a restudy or a no-test control condition. Neither control condition can completely isolate the testing effect from other potential influences (cf. Rowland, 2014). The no-test condition, for example, lacks an additional presentation of the learning material in the intervening phase and thus implies unequal learning times in the test and no-test conditions. This method might result in an overestimation of the testing effect, which can be addressed by a restudy control condition. However, the availability of learning content in the intervening phase still differs between the restudy and test conditions: In the restudy condition, the complete learning content might be presented, whereas the learning content presented in the test condition is restricted to the correctly recalled information. This contrast might result in an underestimation of the testing effect, which can be addressed by providing feedback on the test results so that non-recalled information is presented again after the test. (c) Given the importance of feedback, studies were classified according to whether or not feedback was given in the test condition. Feedback can address the problem of learning content availability in the intervening phase. Moreover, it can also support the process of adapting learning activities subsequent to retrieval practice, as stated earlier. Thus, additional moderator analyses were used to determine the potential influences of the study design, the type of control condition, and the implementation of feedback.
Methods
Search Strategy and Selection Process
All studies contributing data to the current meta-analysis were identified by screening the reference lists of current review articles and book chapters on the testing effect, especially those considering research in applied settings, and by searching the PsycINFO database using the following search algorithm: (1) testing effect AND learn*; (2) publication range: 1941–August 2015; (3) age group range: 13–64 years; and (4) method: empirical study and field study. The database search initially provided 485 results, of which only 13 publications met the inclusion criteria. The small number of relevant hits resulted from the fact that the search algorithm also returned studies that did not address the testing effect but still contained the search terms testing, effect, and learn* (e.g., Schmeck, Mayer, Opfermann, Pfeiffer, & Leutner, 2014). A large number of other studies did examine the testing effect but confronted learners with material such as word pairs (Kang & Pashler, 2014; Tse & Pu, 2012) or prose texts (Einstein, Mullet, & Harrison, 2012; Jonge, Tabbers, & Rikers, 2015) unrelated to the field of psychology.
Another six studies were identified using the backward snowballing technique and by directly contacting authors with a corresponding research focus. Finally, 19 publications dating from July 1984 to February 2016 met the inclusion criteria and provided 72 individual effect sizes, with individual publications contributing between one and 20 effect sizes. All included publications are listed in Appendix A.
Coding of Study Characteristics and Moderator Variables
A coding scheme was developed a priori, taking into account basic information about the individual studies (e.g., author/s, year, and type of publication), data relevant to calculating individual effect sizes (e.g., sample sizes, means, and standard deviations), and information about the respective levels of the moderator variables. All three potential moderators were coded as dichotomous categorical variables: study design (between vs. within); type of control condition (restudy vs. no-test); and implementation of feedback (yes vs. no).
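For illustration, the coded information can be thought of as a flat data set with one row per effect size. The following sketch shows a hypothetical layout in R (the language of all analyses reported below); the column names and values are invented for illustration, not taken from the original data, and are reused in the later sketches:

```r
# Hypothetical layout of the coding scheme as an R data frame;
# all values are invented for illustration.
dat <- data.frame(
  study    = c("A", "A", "B", "C", "C", "D"),   # publication identifier
  yi       = c(0.42, 0.31, 0.70, 0.15, 0.55, -0.05),       # Cohen's d
  vi       = c(0.050, 0.048, 0.090, 0.060, 0.055, 0.070),  # sampling variance
  design   = factor(c("between", "within", "between",
                      "within", "within", "between")),
  control  = factor(c("restudy", "no-test", "restudy",
                      "no-test", "restudy", "no-test")),
  feedback = factor(c("yes", "no", "yes", "yes", "no", "no"))
)
```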
Individual Effect Size Calculation
In the first step, individual effect sizes were calculated as the standardized mean difference (Cohen’s d) of final test performance between the testing condition and the control condition. The standard formula based on the difference between means divided by the pooled standard deviation of the test and control conditions was used for data coming from between-subjects designs (Borenstein, Hedges, Higgins, & Rothstein, 2009, pp. 26–28). Normally, the standard deviation of difference scores is recommended to replace the pooled standard deviation component for data coming from within-subjects designs (Borenstein et al., 2009, pp. 28–30). However, this approach was not feasible, because the necessary data often could not be reconstructed from the original studies. Therefore, the formula for independent data was also used for matched data. Given that this procedure generally results in an underestimation of effect sizes for studies using within-subjects designs (cf. Hays, 1994, p. 339), we deemed the procedure to be a conservative and thus appropriate way of assessing effect sizes under these conditions.
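As an illustration, the standard formula for independent groups can be written as a small R function; the function name and the example numbers are hypothetical and merely restate the textbook formula:

```r
# Standardized mean difference (Cohen's d) for independent groups,
# following Borenstein et al. (2009, pp. 26-28). A sketch; names
# and numbers are illustrative.
cohens_d <- function(m_test, m_ctrl, sd_test, sd_ctrl, n_test, n_ctrl) {
  # pooled standard deviation of the test and control conditions
  sd_pooled <- sqrt(((n_test - 1) * sd_test^2 + (n_ctrl - 1) * sd_ctrl^2) /
                      (n_test + n_ctrl - 2))
  (m_test - m_ctrl) / sd_pooled
}

# Example: a testing condition outperforming its control condition
cohens_d(m_test = 78, m_ctrl = 70, sd_test = 12, sd_ctrl = 14,
         n_test = 30, n_ctrl = 30)  # approximately 0.61
```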
Depending on the type of data reported in the original studies, various modifications of the basic formula for calculating Cohen’s d were used to determine the individual effect sizes, all implemented in a web-based effect size calculator (Lipsey & Wilson, 2001). A correction factor introduced by Hedges (1981) was used for five effect sizes from small sample studies, because Cohen’s d is known to result in a slight overestimation of the effect size in small samples (< 20 subjects).
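Hedges’ correction shrinks d by a factor slightly smaller than 1 that depends on the degrees of freedom. A minimal sketch, assuming the commonly used approximation of the correction factor:

```r
# Hedges' (1981) small-sample correction applied to Cohen's d,
# using the common approximation J = 1 - 3 / (4 * df - 1) with
# df = n_test + n_ctrl - 2. A sketch; names are illustrative.
hedges_g <- function(d, n_test, n_ctrl) {
  df <- n_test + n_ctrl - 2
  j  <- 1 - 3 / (4 * df - 1)
  j * d
}

hedges_g(d = 0.61, n_test = 10, n_ctrl = 10)  # approximately 0.58
```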
Method of Analysis
To calculate a combined overall mean effect size (Cohen’s d) from the individual effect sizes, a random-effects model was run using the metafor package (Viechtbauer, 2010) in R (R Core Team, 2014). A random-effects model assumes that a distribution of true effect sizes, usually a normal distribution, underlies the observed effect sizes. Thus, differences in observed effect sizes are regarded as resulting not from sampling error alone but also from true variation between studies. Accordingly, the overall effect calculated from a random-effects model represents the mean of the normal distribution of all true effect sizes. This statistical model seemed appropriate for the current study, because it included studies with heterogeneous sample characteristics and implementations across different field conditions. Therefore, the assumption was justified that differences in the individual effect sizes did not originate from sampling error alone. Furthermore, it would have been unrealistic to assume that we had considered all possible moderators causing differences in the observed effect sizes beyond sampling error.
Between-study variance was calculated using the restricted maximum-likelihood (REML) estimator implemented in metafor. In a comparative review, this method was shown to provide a good balance between unbiasedness and efficiency, both important criteria for the optimality of effect size estimators (Viechtbauer, 2005). Each individual effect size was weighted using inverse-variance weights, taking into account the sampling error and the estimated between-study variance.
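A minimal sketch of this model fit with metafor, assuming the hypothetical data frame `dat` sketched above (observed effects `yi`, sampling variances `vi`):

```r
# Random-effects model with the REML estimator for the
# between-study variance; a sketch, not the original analysis code.
library(metafor)

res <- rma(yi = yi, vi = vi, data = dat, method = "REML")
summary(res)   # overall d with 95% CI and between-study variance tau^2
weights(res)   # inverse-variance weights assigned to each effect size
```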
A mixed-effects model was used for the categorical moderator analyses. This approach combines two steps: assigning studies to the levels of the respective moderator and calculating a mean effect size for the studies within each level using a random-effects model. The present model included an estimate of the overall mean effect size as the intercept and the three categorical moderators as additional factors.
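Under the same assumptions, the mixed-effects model can be sketched by adding the three coded moderators to the model formula, with the intercept estimating the overall mean effect size:

```r
# Mixed-effects moderator model; the moderator column names follow
# the hypothetical coding sketch above.
res_mod <- rma(yi = yi, vi = vi, data = dat, method = "REML",
               mods = ~ design + control + feedback)
summary(res_mod)
```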
Two additional analyses were conducted to explore potential biases. First, we checked the data for a possible publication bias. Egger's regression test (Egger, Davey Smith, Schneider, & Minder, 1997) was used to check whether a relationship exists between the observed outcome values from the random-effects model and their respective standard errors. Significant results indicate the existence of a publication bias that originates from a lower chance of publication for small sample size studies that yield only small or moderate effects. Second, a robust variance estimation (RVE) in the form of a meta-regression implemented in the robumeta package (Fisher & Tipton, 2015) in R was computed to inspect and correct for the influence of dependent effect sizes in the random-effects and mixed-effects models. This analysis was important, because several effect sizes were based on different dependent variables (in the same sample) or on the same dependent variable compared in overlapping subsamples (same control group or same experimental group). These dependencies could result in correlated estimation errors. The RVE procedure used a correlated-effects model as a weighting method to correct for this error. To optimize the RVE procedure for the current data set (comprising only 19 studies), an adjustment for small samples of fewer than 40 studies was used (Tipton, 2015).
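Both checks might be called along the following lines, again assuming the hypothetical data frame `dat`; `study` is the assumed publication identifier used to cluster dependent effect sizes:

```r
# Egger's regression test on the fitted random-effects model:
# regresses the outcomes on their standard errors; a significant
# slope suggests small-study (publication) bias.
regtest(res, model = "rma", predictor = "sei")

# Robust variance estimation with correlated-effects weights and
# the small-sample adjustment (Tipton, 2015); a sketch.
library(robumeta)
rve <- robu(yi ~ 1, data = dat, studynum = study,
            var.eff.size = vi, modelweights = "CORR", small = TRUE)
print(rve)
```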
Results
To give a first impression of the current state of research in this field, the distribution of the individual effect sizes is visualized in a forest plot (see Figure 1). The majority of the effect sizes (n = 57) were positive, suggesting a higher learning outcome in the testing condition compared to the control condition. Considering the 95% confidence intervals (CIs) of the effect sizes, however, indicates that only 33 of these positive effects differed significantly from zero. One of the remaining 15 effect sizes was exactly d = 0, and 14 effect sizes were negative, suggesting a lower learning outcome in the testing condition compared to the control condition. Inspecting the 95% CIs of these 14 effect sizes, however, revealed that only one CI did not include zero, indicating that only one out of 72 effect sizes represented a significant negative effect of testing.
Figure 1. Forest plot of the individual effect sizes with 95% confidence intervals and overall effect size (total) for the full data set, based on data uncorrected for dependencies.
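Such a display can be produced directly from a fitted model; a sketch using metafor's forest function and the hypothetical objects from the Methods sketches:

```r
# Forest plot of all individual effect sizes plus the model
# estimate; `res` and `dat` follow the earlier sketches.
forest(res, slab = dat$study)
```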
In the uncorrected random-effects model of the full data set, the analysis revealed a mean weighted effect size significantly different from zero, d = 0.56, 95% CI [0.40, 0.71], indicating a beneficial effect of testing. Final test performance in the experimental condition (test condition) was significantly higher than in the control condition (no test or restudy).
Categorical Moderator Analyses
Table 1. Moderator Analyses.
Notes. d = mean weighted effect size (Cohen’s d); CI = confidence interval; LB = lower bound; UB = upper bound; k = number of effect sizes; based on data uncorrected for dependencies.
Exploratory Analyses
We first tested for a potential publication bias. Egger's regression test did not reach the conventional level of significance (z = -0.33, p = 0.74). Thus, there was no indication of a publication bias in the current data set that originates from a systematic correlation between sample size and effect size.
We then explored the potential effect of dependency among effect sizes. Applying the RVE procedure to the random-effects model slightly increased the overall effect size, d = 0.62, 95% CI [0.32, 0.93]. A forest plot depicting the distribution of the individual effect sizes clustered by publication can be found in Appendix B. Applying the RVE procedure to the mixed-effects model produced no significant effects of the three moderator variables: study design (t = 0.23, p = 0.82), control condition (t = -0.70, p = 0.50), and feedback (t = 1.84, p = 0.10). This result indicates that the dependency of several effect sizes had led to a slight underestimation of the overall effect and a slight overestimation of the influence of feedback.
Discussion
Researchers in cognitive and educational psychology argue for the concept of evidence-based teaching, which refers to the idea of using theoretically grounded and empirically supported phenomena from the research on learning and memory to improve teaching and student learning. The testing effect is one of the most often cited phenomena in the context of evidence-based teaching and is repeatedly recommended for adoption in the classroom. Whether researchers can support this recommendation by demonstrating effective applications of this idea in their own teaching of psychology is still an open question. Therefore, we investigated the extent to which testing effects can also be demonstrated in teaching and learning psychology. To this end, we identified 19 studies on the testing effect applying psychological learning materials, and we meta-analysed the effects of intermediate testing on learning outcomes.
The central result of our analysis is that testing between the acquisition phase and a final test enhanced performance in the final test. Feedback on the result of the intermediate test increased this effect, although the moderator effect of feedback was no longer significant after controlling for dependencies among the individual effect sizes. A publication bias was not detected, but this result should be interpreted with caution because of the comparably small number of studies. Furthermore, non-statistical causes of a publication bias, such as individual study quality, cannot be detected by Egger’s test.
These results are comparable to the results of other meta-analyses in testing effect research. For example, an early meta-analysis by Bangert-Drowns et al. (1991) summarized 35 field studies from diverse learning contexts (some of which also included psychological learning content). Frequent testing in the classroom was demonstrated to have a positive overall mean effect on learning outcome variables (0.23 standard deviations). This result indicates a smaller effect of testing compared to the effect found in the current meta-analysis. However, Bangert-Drowns et al. (1991) included studies with different types of control conditions, some of which also included testing, but to a smaller extent than in the experimental conditions. The overall mean effect size (0.54) of their 11 studies that compared testing with control conditions without any testing components resembled the mean effect size in the current meta-analysis (0.56). More recently, Rowland (2014) meta-analysed the data of 61 experimental studies examining the effects of testing compared to restudying. The analysis yielded a positive overall mean effect size of 0.50 standard deviations. Although this result was based on experimental studies, the overall mean effect size was comparable to the mean effect size found in the current analysis. Moreover, Rowland (2014) demonstrated that intermediate testing with feedback yielded larger effects (0.73 standard deviations) than intermediate testing without feedback (0.39), which is also in line with the result of our moderator analysis (see Table 1). In sum, even though the current meta-analysis was limited to field studies of learning and teaching psychology, the overall mean effect size matched the results of both earlier meta-analyses. We interpret this result in terms of the robustness of the testing effect, and we conclude that the effect can successfully be applied in teaching psychology.
Consequently, the implementation of retrieval practice in psychology classrooms can be recommended. First, the current meta-analysis indicates that retrieval practice can be beneficially applied in the psychology classroom to foster students’ learning outcomes. These results, combined with earlier findings, suggest that tests should be implemented not only as assessment and evaluation tools but also as tools to enhance learning. However, teachers are advised to make this distinction transparent to their students: Learners should know beforehand whether a test is used to evaluate performance or to practice retrieval. Second, practice tests can easily be implemented in the context of psychology learning and teaching. Practice tests are not particularly time-consuming, and in contrast to other learning strategies (e.g., concept mapping or note-taking), they do not require particular instruction or training effort. Third, feedback can further enhance the beneficial effects of practice tests. Beyond the direct effects of retrieval practice on the memory representation, feedback can serve as an instrument to stimulate indirect beneficial effects by stating explicitly which learning contents were retrieved correctly from memory and which were not. This information can guide the learners’ subsequent learning activities and strengthen the learning progress. Finally, the risk of impairing the learning process is extremely low: Only one study indicated a significant negative effect of intermediate testing on final test performance. Generally, however, it should be recognized that the harms and benefits of an intervention identified in a meta-analysis are only harmful or beneficial in relation to the specific learning procedures and control conditions implemented in the original studies. Thus, implementing practice tests in a specific context does not necessarily lead to an optimal learning design, as the learning and teaching goals might diverge from the goals addressed in the meta-analysed studies.
Conclusions from the current results should be drawn with the following study limitations in mind. First, the meta-analysis comprised only 19 studies, a relatively low number for summarizing empirical results. Thus, further research on implementing retrieval practice in the psychology classroom is needed to confirm the effect with psychological learning materials. Moreover, there was a high number of dependent effect sizes, which is a potential source of bias. Although inspecting and controlling for the correlation of estimation errors revealed no indication of a bias corresponding to dependency, a higher number of independent effect sizes is desirable. In the current data, we were unable to control for variables beyond study design, type of control condition, and feedback that potentially influence the learning outcome. For example, future studies should investigate how practice tests are implemented in the psychology classroom. Although the current results add to earlier findings indicating that the testing effect is a robust phenomenon that is also effective in real classroom situations, the high number of effect sizes not significantly different from zero is remarkable. Significant effect sizes from small samples (e.g., Howard, 2011; Rawson, Dunlosky, & Sciartelli, 2013) and non-significant effect sizes from larger samples (e.g., Khanna, 2015; Shapiro & Gordon, 2012) indicate that the lack of significance was not solely due to low statistical power. Thus, future research should focus on the circumstances under which the implementation of retrieval practice in the psychology classroom fosters learning. Beyond our results, such an analysis could encourage teachers and instructors of psychology to adopt the findings of their own research discipline and encourage teachers and instructors in other fields of academic learning to rely on psychologists’ recommendations.
Acknowledgements
We thank Jörg-Tobias Kuhn and Paul-Christian Bürkner for valuable discussions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
