Abstract
We introduce a specific type of item for knowledge tests, the confidence-weighted true–false (CTF) item, and review experiences of its application in psychology courses. A CTF item is a statement about the learning content to which students respond by indicating whether the statement is true or false and rating their confidence in this judgement. Previous studies using CTF items were reviewed, and data from 21 additional psychology courses (N = 1879 teacher candidates) revealed that CTF items were successfully applied to measure learning performance at the cognitive and metacognitive level. Furthermore, examples are presented demonstrating that the indices of metacognitive monitoring that can be derived from these items provide useful additional information about the effects of teaching and testing methods.
In Europe, the Bologna process has caused fundamental changes in the structure of study programmes—not only in psychology but also in programmes of other academic disciplines involving psychology as a minor subject and in teacher education programmes. One significant change is the increased frequency of examinations and tests, which requires additional resources from teaching staff. This situation necessitates the search for easy and informative tests to assess academic performance. In the present paper, we introduce confidence-weighted true–false (CTF) items that can be used to create knowledge tests according to the following criteria: (1) applicability to different teaching content and different types of courses; (2) economical test construction and analysis procedures; (3) valid assessment of learning progress and knowledge; and (4) assessment of not only recall performance but also transfer performance.
Confidence-Weighted True–False Items
A true–false item is a one-sentence statement (e.g., “Learned helplessness involves causal attribution processes”) to which students respond by indicating whether the statement is true or false (e.g., Frisbie & Becker, 1991). The statements are chosen so that the answer requires activation of knowledge acquired in the course. However, successful learning is expected to increase not only the availability of knowledge but also the student’s confidence in correctly recalling and applying this knowledge. Therefore, we combined evaluation of truth and confidence on one response scale presenting four response options: (1) I am sure the statement is true; (2) I think the statement is true, but I am unsure; (3) I think the statement is false, but I am unsure; or (4) I am sure the statement is false. The scoring scheme rewards confidence in correct answers but penalises confidence in incorrect answers. A correct answer is scored +2 points when the student is confident, but only +1 point when the student is unconfident. An incorrect answer is scored −2 points when the student indicates confidence, but only −1 point when the student is unconfident. Learning performance is assessed by the sum of points across all items (performance score). In a 30-item test, for example, the maximum of +60 points is reached if all answers are correct and given with confidence, and the minimum of −60 points if all answers are incorrect but nevertheless given with confidence. If preferred, the performance score can be mapped onto a conventional percentage scale, with the minimum number of points designating 0% and the maximum number of points designating 100%. A performance score of 0 points equals 50% on the percentage scale.
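To make the scoring scheme concrete, the following minimal sketch in Python computes the performance score and its mapping onto the percentage scale. The function and variable names are our own illustrations, not part of any published implementation:

```python
def score_item(correct: bool, confident: bool) -> int:
    """Score one CTF response: +2/-2 with confidence, +1/-1 without."""
    weight = 2 if confident else 1
    return weight if correct else -weight

def performance_score(responses) -> int:
    """Sum the points across all (correct, confident) response pairs."""
    return sum(score_item(correct, confident) for correct, confident in responses)

def to_percentage(score: int, n_items: int) -> float:
    """Map the raw score onto a 0-100% scale; a score of 0 maps to 50%."""
    max_score = 2 * n_items  # all answers correct and confident
    return 100 * (score + max_score) / (2 * max_score)
```

For the 30-item example, performance_score ranges from −60 to +60, and to_percentage(0, 30) returns 50.0, matching the correspondence between 0 points and 50% noted above.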
This type of scale is assumed to be advantageous for three reasons. Firstly, deciding whether or not a statement is true might be facilitated when students are allowed to express their possible doubts. Secondly, knowing that accurate confidence judgements are rewarded may direct learners’ attention to monitoring their knowledge and to learning at the metacognitive level, which supports self-regulated learning (e.g., Thiede, Anderson, & Therriault, 2003; Winne & Hadwin, 1998). Thirdly, data collected with this scale can easily be transformed into indices of metacognitive monitoring accuracy, which provide information about the learning outcome beyond the number of correct answers.
Flexible Construction of Knowledge Tests
The quality of a true–false statement primarily depends on its concise formulation and its clear relation to the learning goals. The extent to which the answer requires simple reproduction or complex inferences can be chosen when the content of the course and the prior knowledge of the participants are known. If, for example, learned helplessness has been discussed in the context of dysfunctional patterns of causal attribution, the statement “Learned helplessness involves causal attribution processes” (true) probably requires little more than recognition. In contrast, answering the item “Experiencing high achievement alleviates learned helplessness” (false) requires the inference that high achievement can also be attributed to causes that are not suitable to cure perceived loss of control (e.g., chance). Thus, the test author can choose the item difficulty, and the demands are not restricted to recognition but may include inferences and complex retrieval processes.
Constructing true–false statements is less laborious than authoring multiple-choice items. The quality of multiple-choice items depends on the plausibility of the distractors. The diagnostic value of an item is high when all distractors represent equally plausible alternatives to the true response option. Otherwise, the item can be solved by excluding the implausible response options rather than by recognising the true option. Therefore, one multiple-choice item requires the careful construction of several plausible distractors, whereas a (false) CTF item requires only one statement.
Furthermore, true and false statements on the same topic can easily be developed: “Learned helplessness involves the actors’ cognition of being unable to control the consequence of their actions” (true) versus “Learned helplessness involves the actors’ cognition of being able to control the consequence of their actions” (false). Likewise, creating positively and negatively formulated versions of the same statement is often possible: “In a state of learned helplessness, the actor attributes failure to stable causes” (true) versus “In a state of learned helplessness, the actor does not attribute failure to variable causes” (true). Thus, to avoid biases, tests can easily be balanced with regard to true versus false and positively versus negatively formulated statements.
We organise our items (currently N = 798) in a database that allows storing additional information for each item (e.g., correct solution, author, thematic domain, difficulty). After specifying the thematic domain, items can be randomly or manually chosen. If the database contains information about item difficulty, parallel tests with comparable mean difficulty can be created. The database creates a printable text file and a version showing the correct answers as a basis for programming the data analysis. With this system, the item pool can be developed and any new configuration of items can be economically produced and analysed.
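As an illustration of how such a database can support economical test assembly, the following sketch draws two parallel forms with comparable mean difficulty from one thematic domain. The record fields mirror the information listed above, but all names, the redraw procedure, and the tolerance criterion are our own assumptions rather than the authors' actual system:

```python
import random
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    solution: bool     # True if the statement is true
    author: str
    domain: str
    difficulty: float  # mean points per item, ranging from -2 to +2

def mean_difficulty(items):
    return sum(item.difficulty for item in items) / len(items)

def draw_parallel_tests(pool, domain, n_items, tolerance=0.1, max_tries=1000):
    """Draw two disjoint item sets from one domain with comparable mean difficulty."""
    candidates = [item for item in pool if item.domain == domain]
    for _ in range(max_tries):
        drawn = random.sample(candidates, 2 * n_items)
        form_a, form_b = drawn[:n_items], drawn[n_items:]
        if abs(mean_difficulty(form_a) - mean_difficulty(form_b)) <= tolerance:
            return form_a, form_b
    raise RuntimeError("No comparable forms found within max_tries draws")
```

Redrawing until the mean difficulties match within a tolerance is the simplest way to approximate parallel forms; the authors' actual matching procedure is not specified in the paper.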
Evaluation of Knowledge Tests using Confidence-Weighted True–False Items
Knowledge tests using this type of item have already been applied successfully in a large number of psychology courses on different topics in educational psychology (e.g., motivation, cognition, diagnostics and evaluation) and across different types of courses (e.g., lectures and seminars). We present some results already published, but we also report additional analyses based on recently collected data from an additional 21 psychology courses with N = 1879 teacher candidates. The courses were taught by staff members of the Institute for Psychology in Education at the University of Münster, Germany. Students in these courses (Bachelor’s and Master’s programmes) were preparing for teaching in primary and secondary schools.
Acceptance of the Response Scale and Item Difficulty
A central concern with CTF items is the extent to which the response scale is accepted by examinees. Do students “admit” that they are unconfident about their answers in a test? Dutke and Barenberg (2009) found that only 6.7% of the participants did not select the “unconfident” options but indicated confidence in all test items. On average, 31.1% of the items were answered without confidence. The analyses of the additional 21 courses tested with CTF items yielded similar results. Only 4.4% of the participants never selected the “unconfident” options. On average, 26.5% of the items were answered using the “unconfident” categories (ranging from 0% to 95.0%).
Another concern was related to the plausibility of the false statements. If they are obviously implausible, their difficulty might be unacceptably low, because such items could then be rejected without the knowledge targeted by the learning goals. Dutke and Barenberg (2009) found that, on average, each item received 1.0 points, which designates medium difficulty (on a scale from −2 to +2, a mean of 0 points would designate the guessing probability). False items were more difficult (M = .68 points) than true items (M = 1.31 points). Thus, the apprehension that false items could be rejected without relevant knowledge was unjustified. The 21-course analysis corroborated this finding: false items were more difficult (M = .89 points) than true items (M = 1.20 points).
Sensitivity to the Learning Progress
Knowledge tests should be sensitive to learning progress. Dutke and Barenberg (2009) tested participants of a psychology course in the middle of the semester with 20 CTF items. Half of the items were related to content that had already been discussed in the course, whereas the other half referred to topics scheduled for the second half of the semester. As expected, the performance score was significantly higher for items related to content already discussed than for items related to future content. At the end of the semester, the same set of items was administered again. The performance score increased significantly for both types of items, but the increase was more pronounced for the items targeting the contents of the second half of the semester, such that the difference between the two types of items vanished completely. Thus, CTF items differentially indicated learning progress across the semester. Moreover, the proportion of correct and confident answers significantly increased from the mid-term to the end-term test (from .48 to .75). Thus, CTF items were also sensitive to learning progress at the level of confidence in the correctness of the answers.
Predictive Validity
Dutke and Barenberg (2009) explored the predictive validity of CTF items by comparing students’ performance on CTF items with their performance in answering essay questions on the same contents. The quality of the essays was scored according to correctness and completeness. The performance score based on the CTF items correlated significantly with the essay answer score, with coefficients between r = .33 (p < .05) and r = .51 (p < .001). The 21-course analysis confirmed this finding. The performance scores of CTF items and essay answer quality correlated significantly (average r = .55). However, the knowledge tests in these courses varied in the number of CTF items used (20–68 items) and in the range of points that could be obtained in the open answers (between 0–4 points and 0–40 points). In a regression analysis, we predicted the correlations between the CTF score and the open answer score from two predictors: the number of CTF items and the range of the open answer score. Only the latter predictor had a significant (positive) beta weight (t = 3.29, p = .004). This result suggested that the correlation between the CTF performance score and an external criterion might be limited by the range of values on the criterion variable rather than by the number of CTF items. The more the essay answer score was allowed to vary, the higher it correlated with the CTF performance score (up to r = .78). Thus, performance in answering CTF items was reasonably correlated with performance in tasks requiring the active generation of answers rather than the evaluation of pre-fabricated statements.
In summary, these results demonstrate that (1) the possibility of differentiating confidence in CTF items was widely used and accepted by the majority of students; (2) CTF items were sensitive to learning progress across the semester; and (3) the predictive validity of CTF items was high for test performance in more complex tasks. The CTF items also disclosed performance changes at the level of confidence in recalling and applying knowledge, which would not have been detected using the typical scoring of correct answers. Thus, CTF items are also suitable for assessing metacognitive monitoring.
Assessing Metacognitive Monitoring
The relevance of metacognitive processes is frequently discussed in the context of self-regulated learning, and metacognitive monitoring is seen as a key competence and prerequisite for successful regulation processes (e.g., Boekaerts, 1999; Winne & Hadwin, 1998). Substantial under- or overconfidence can lead to unnecessary or missing regulation processes, which in turn affect learning progress and outcomes (e.g., Thiede et al., 2003). Thus, metacognitive monitoring and its accuracy are crucial for successful and effective learning.
In a conceptual paper, Schraw (2009) described different indices of accuracy of metacognitive monitoring commonly reported in the metacognition literature. Barenberg and Dutke (2013) described how several of these indices can also be derived from CTF items (see the Appendix for formulas). The absolute accuracy (AC) of the confidence judgements is assessed by computing the proportion of correct and confident answers plus the proportion of incorrect and unconfident answers. This score reflects the overall match between confidence and correctness. The bias of the confidence judgements (BS) is assessed by computing the relative number of confident answers minus the relative number of correct answers. Positive values indicate general overconfidence, and negative values indicate general underconfidence. However, the BS score does not indicate whether correct and incorrect answers are biased in the same way. This is reflected by two conditional probabilities: the probability of indicating confidence given the answer is correct (confident-correct probability, CCP) and the probability of indicating confidence given the answer is incorrect (confident-incorrect probability, CIP). The difference between both probabilities (DIS = CCP − CIP) indicates how reliably a participant discriminates (at the level of confidence) between correct and incorrect answers. A high CCP score (indicating confidence when the answer is correct) and a low CIP score (indicating no confidence when the answer is incorrect) would reflect successful metacognitive monitoring, because it would indicate that the learner can adequately discriminate between learning content already mastered and learning content requiring further learning.
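Assuming each response is coded as a (correct, confident) Boolean pair, these indices can be computed directly. The following Python sketch follows the verbal definitions above; the coding and the function name are our own illustration (see the Appendix for the exact formulas):

```python
import math

def monitoring_indices(responses):
    """Compute monitoring indices from (correct, confident) response pairs."""
    n = len(responses)
    n_correct = sum(correct for correct, _ in responses)
    n_incorrect = n - n_correct
    n_confident = sum(confident for _, confident in responses)
    # AC: proportion of confident-correct plus unconfident-incorrect answers.
    ac = sum((c and conf) or (not c and not conf) for c, conf in responses) / n
    # BS: proportion confident minus proportion correct
    # (positive = overconfidence, negative = underconfidence).
    bs = (n_confident - n_correct) / n
    # CCP: P(confident | correct); CIP: P(confident | incorrect).
    ccp = (sum(c and conf for c, conf in responses) / n_correct
           if n_correct else math.nan)
    cip = (sum(not c and conf for c, conf in responses) / n_incorrect
           if n_incorrect else math.nan)
    dis = ccp - cip  # discrimination between correct and incorrect answers
    return {'AC': ac, 'BS': bs, 'CCP': ccp, 'CIP': cip, 'DIS': dis}
```

A learner who is confident exactly on the correctly answered items obtains AC = 1, BS = 0, CCP = 1, CIP = 0, and thus the maximal DIS of 1, which is the signature of successful monitoring described above.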
Beyond Academic Examinations: Using Confidence-Weighted True–False Items to Evaluate Metacognitive Effects of Teaching
Reflective teaching requires monitoring the effects of our teaching activities. Most of them are directed at improving teaching content acquisition, which is commonly evaluated by academic tests. However, the effects of our teaching activities may not be restricted to the cognitive performance level. These activities can also influence metacognitive performance. In the following sections, we provide some examples demonstrating how CTF items can help to illuminate effects of teaching and testing methods on the metacognitive level.
Intended and Unintended Effects of Grading Tests
The expected evaluation of test performance affects students’ test performance, depending on whether the evaluation is pass–fail or graded (e.g., Harackiewicz, Manderlink, & Sansone, 1992). Barenberg and Dutke (2011, 2013) investigated whether grading also affects metacognitive monitoring of the learning process. They tested participants of psychology courses in the middle and at the end of the semester with the same set of CTF items. The authors assumed that students expecting the final test to be graded would prepare more intensively than students expecting a pass–fail test. Accordingly, from the mid-term to the end-term test, students expecting a graded test increased the number of correct answers and the absolute accuracy of confidence judgements more than students expecting a pass–fail test. Moreover, the graded group reduced their initial underconfidence, whereas the underconfidence of the pass–fail group remained the same. Thus, expecting a graded test prompted greater improvements of test performance and accuracy of metacognitive monitoring compared to expecting a pass–fail test. However, Barenberg and Dutke (2013) also found that expecting a graded test not only increased the confidence in correct answers (CCP) but also increased the (unjustified) confidence in incorrect answers (CIP). In contrast, only the confidence in correct answers increased during the semester in the pass–fail group, whereas the confidence in incorrect answers remained the same. This example demonstrates that considering the metacognitive indices derived from CTF items can provide information about intended and unintended effects of the way educators evaluate learning outcomes. See Barenberg and Dutke (2013) for a discussion of the potential reasons for the unintended effect.
Knowing the Test Format Affects Metacognitive Monitoring
Dutke, Barenberg, and Leopold (2010) compared test performance between participants informed about the test format before they started learning and participants informed about the test format immediately before they were tested. All participants were tested with the same set of 15 CTF items. No difference was found in the number of correct answers, but participants who knew the test format before learning discriminated better between correctly and incorrectly answered items on the basis of their confidence ratings (i.e., DIS). Thus, knowing the test format seemed to influence test preparation. Practical consequences for teaching, such as explaining the test format at the beginning of a course or giving students the opportunity to practise test taking, are discussed by Dutke et al. (2010).
A Testing Effect in Psychology Courses
Testing learning content often improves later recall of the content (see, e.g., Roediger & Karpicke, 2006). Barenberg and Dutke (2012) explored whether the testing effect could be demonstrated in psychology courses. They tested participants unexpectedly in the middle of the semester with 20 CTF items. Instructor feedback included only the number of correct answers; no information was provided about which items were answered (in)correctly. At the end of the semester, pre-tested content and non-pre-tested content were tested. As expected, the percentage of correct answers was higher when the content was pre-tested than when it was not pre-tested. Unexpectedly, testing had no effect on metacognitive monitoring. This result raised further questions. A central explanation of the testing effect posits that previously encoded material is more readily available the more often it has been retrieved from long-term memory (e.g., Roediger & Karpicke, 2006; Rowland, 2014). Accordingly, repeated testing should increase not only the probability of correct retrieval but also the confidence in having retrieved the correct content. The reasons for the incongruence between this theoretical expectation and the data remain to be investigated; the findings might have consequences for explanations of the testing effect.
Consolidation of Knowledge by Application
In the Master’s programme for teacher candidates at the University of Münster, students attend an introductory lecture on psychology (including methodological topics) before attending two additional psychology courses. Berse and Dutke (2012) designed particular psychology courses to prepare teacher candidates for conducting experiments in the context of their Master’s theses (preparatory courses). These courses focused on applying previously acquired methodological knowledge (e.g., deducing hypotheses, designing experiments, analysing data), whereas other psychology courses in the same programme (control courses) focused on theoretical issues relevant to teachers (e.g., learning strategies, motivation, social processes). Therefore, the authors expected students in the preparatory courses to consolidate methodological knowledge more than students in the control courses. Methodological knowledge was measured with 28 CTF items at the beginning and at the end of the semester. No differences in the number of correct answers were found between the preparatory and control courses, either at the beginning or at the end of the semester. This result was initially puzzling, but after considering the metacognitive measures, a different picture emerged. The CCP had increased more in the preparatory courses than in the control courses. Thus, although the students in the preparatory courses were unable to give more correct answers than students in the control courses, the former were more confident when they answered correctly. This consolidation effect would not have been detected with item formats that did not include confidence measures. In essence, this study exemplified that the effect of teaching interventions might be underestimated when the assessment does not include metacognitive measures.
Conclusion and Recommendations
The aim of the present paper was to introduce CTF items and to illuminate their application in psychology courses. CTF items provide a means to economically assess learning performance at the cognitive and metacognitive level. Firstly, we have provided evidence that a performance score considering correctness and confidence is a valid measure of academic performance. Secondly, we showed that deriving measures of metacognitive monitoring from CTF items helped to reveal intended and unintended effects of teaching and testing methods beyond the level of cognitive performance.
When applying CTF items, however, some limitations should be kept in mind. The most severe limitation concerns the quality of the items. They need to be formulated concisely and with a clear relation to the learning content. Teachers should attend to the contextualisation of each item. In reference to the sample item discussed earlier, the statement “Learned helplessness involves causal attribution processes” is only true or false in relation to a specific theoretical context and a specific way of confronting students with this topic. This context, which is also related to testing validity, needs to be considered. In contrast to formats requiring the active generation of answers, CTF items, like multiple-choice items, require the evaluation of statements pre-fabricated by the test author. Thus, CTF and multiple-choice items share the disadvantage of relying on input from the test author more than open answer formats do. We have nevertheless provided evidence that properly constructed CTF items requiring inference processes on the basis of newly acquired knowledge are sensitive indicators of learning progress across the semester. The CTF results also correlated with subsequent performance in essay tasks, which can be considered an indicator of predictive validity.
Another concern is that conventional true–false items (requiring only a yes–no response) have often been criticised for their high guessing probability (Ebel, 1970). For example, pure guessing may lead to a 50% correct result, which is twice as high as in a test with multiple-choice items, each with one true option and three distractors. In either case, it is advisable to consider the guessing probability in determining the criterion for passing a test. In the multiple-choice example, the pass–fail criterion should be higher than 25%, whereas in the true–false example, it should be higher than 50%. Adjusting the pass–fail criterion to the guessing probability is essential for all tests in which guessing might be a feasible strategy. Guessing, however, is less of an issue with CTF items (compared to conventional true–false items), because being unconfident (which is probably the case when an examinee is tempted to guess) becomes obvious in the answer. Given that each answer is weighted with the examinee’s (in)confidence in the correctness of this particular answer, the performance score not only informs about the number of correct answers but also reflects the guessing tendency, that is, low confidence ratings may indicate a stronger guessing tendency. Thus, although the theoretical guessing probability in multiple-choice items with several distractors might be lower than in CTF items, the latter provide a better chance to detect guessing.
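A short worked expectation illustrates why guessing does not pay off under the CTF scoring scheme described above: whichever confidence level a pure guesser marks, the symmetric point scheme makes the expected gain per item zero,

E(points per guessed item) = ½ · (+c) + ½ · (−c) = 0, with confidence weight c ∈ {1, 2},

which corresponds to an expected performance score of 0 points, that is, 50% on the percentage scale, and thus falls below any pass criterion set above 50%.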
Finally, one should be aware of a statistical limitation when using the different indices of metacognitive monitoring. All indices are derived from the same data points, with the consequence that they are not independent of each other. For example, the number of correct answers limits the informative value of the CCP. Thus, it is advisable to consider the configuration of these indices rather than to interpret a single index in isolation. Nevertheless, the different indices allow the investigator to take different instructive perspectives on the examinee’s performance.
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
