Abstract
The article compares the trajectories of students' self-reported test-taking effort during a 120-minute low-stakes large-scale assessment of English comprehension between a paper-and-pencil assessment (PPA) and a computer-based assessment (CBA). Test-taking effort was measured four times during the test. Using a within-subject design, each of the N = 2,676 German ninth-grade students completed half of the test in PPA and half in CBA mode, with the sequence of modes balanced between students. Overall, students' test-taking effort decreased considerably during the course of the test. While effort was, on average, lower in CBA than in PPA, the decline during the test did not differ between the two modes. Moreover, students' self-reported effort was higher when the items were easier relative to students' abilities. The consequences of these results for the further development of CBA tests and large-scale assessments in general are discussed.
Introduction
During the last decade, various periodic national and international large-scale assessments (LSAs) in educational research have changed their operational mode from paper-and-pencil assessments (PPA) to computer-based assessments (CBA). For example, the Programme for International Student Assessment (PISA) has administered computer-based tests since 2015 (OECD, 2017), the Trends in International Mathematics and Science Study (TIMSS) since 2019 (Robitzsch et al., 2020), and the Progress in International Reading Literacy Study (PIRLS) since the 2021 cycle (Mullis & Martin, 2019).
In general, LSAs measure students' performance in selected subjects (e.g., Science or Mathematics) along with various socio-demographic, motivational, and other variables. Students' scores are compared between participating countries to provide policymakers with valuable information about the capability of the educational system in their country (see, e.g., Rutkowski et al., 2014 for a comprehensive overview). Most LSAs are low-stakes for students, meaning they are neither rewarded for doing their best nor penalized for working carelessly on the test, which might limit students' motivation. Test-taking motivation or test-taking effort, however, has repeatedly been shown to be associated with test performance (Baumert & Demmrich, 2001; Penk & Richter, 2016; Wise & DeMars, 2005; Wise & Kuhfeld, 2021). For policymakers, results from LSAs are high-stakes in that the test scores are taken to represent what students know and can do if they do their best. If students' effort varies during the test (taking a within-person perspective, see, e.g., Weirich et al., 2017; Wolgast et al., 2020) or differs between participating countries (taking a between-person perspective, see, e.g., Debeer et al., 2014), the validity of test scores may be threatened. Therefore, test administrators are interested in ensuring a constant and preferably high level of effort both within students (i.e., effort should not decrease during the test) and between students. Results of previous studies have partly questioned whether these conditions are fulfilled in PPA tests. For example, Asseburg and Frey (2013) showed that the level of effort in a test situation is related to the difference between the test-taker's ability and the difficulty of the items the test-taker worked on, that is, the "fit" between item difficulty and person ability. They found test-taking effort to depend linearly on the extent to which a student's ability exceeds the difficulty of the items seen, except at the very upper and lower ends of the scale. This is in line with related research in classroom contexts, which shows that the difference between task difficulty and student ability is related to motivational variables. Krannich et al. (2019), for instance, found that overchallenge negatively affects academic self-concept, whereas underchallenge enhances academic self-concept. However, severe over- or underchallenge also increased students' boredom. Thus, motivational variables like test-taking effort vary both between students (see also Baumert & Demmrich, 2001; Eklöf, 2010) and within students during the course of the test (Sachse et al., 2023), depending on item characteristics (e.g., Asseburg & Frey, 2013; Goldhammer et al., 2017) as well as on the sequence of (sub-)tasks presented in the test battery (Wolgast et al., 2020).
Low, fading, or varying test-taking motivation may have detrimental consequences for the validity of test scores. For example, low motivation can lead to rapid guessing (Wise et al., 2009), increased omission rates (Köhler et al., 2015; Pohl et al., 2014), or quitting (Ulitzsch et al., 2020), among other behaviors, so that even high-performing test takers may obtain lower test scores for purely motivational reasons. Hence, establishing a consistently high level of student motivation has become an important aspect of validly interpreting low-stakes test scores (Baumert & Demmrich, 2001). In view of the gradual change from paper-and-pencil to computer-based assessments, the current study compares test-taking motivation between paper-based and computer-based (but otherwise equivalent) tests.
Test-Taking Motivation in Computer-Based Assessments
From their beginnings in the 1960s, large-scale assessments were administered as paper-and-pencil tests. The most prevalent reason for changing the testing mode to a computer-based format is the possibility of adaptive testing, where tasks can be tailored to the level of students' skills. This can increase measurement precision in the subsequent Item Response Theory (IRT) scaling models (Frey et al., 2016; Luecht & Nungester, 1998). Moreover, as students' motivation also depends on item difficulty (Goldhammer et al., 2017), adaptive testing may enhance and sustain student effort during the test. For example, providing less capable students with easier tasks may prevent their motivation from dropping into resignation (Lane & Leventhal, 2015). At the same time, assigning more difficult tasks to high-performing students can prevent them from being underchallenged, completing the tasks superficially, and not performing up to their competence level.
Theoretically, computer-based tests offer possibilities to increase students' motivation, for example, by providing a more engaging testing environment that promotes intrinsic values, such as enjoyment, to a greater extent than (linear) PPA tests do. In addition to external incentives or appeals (Baumert & Demmrich, 2001), Perkins et al. (2021) found, in line with Pekrun's (2006) theory of achievement emotions, that test-taking effort is positively correlated with enjoyment and negatively correlated with emotions like boredom, anger, or anxiety. Similarly, Wigfield and Eccles's (2000) expectancy-value theory (EVT) posits four value components of achievement tests (importance, usefulness, enjoyment, and costs). At first glance, one would expect importance to be low due to the low stakes for students, and usefulness to be low as well due to the lack of immediate feedback in paper-based large-scale testing. Hence, enjoyment is the one factor that should be maximized to foster students' engagement. However, experimental studies by Liu et al. (2015) in computer-based settings and by Baumert and Demmrich (2001) for PPA showed that appeals highlighting the importance of the study to educational administrators have positive effects on students' motivation, even if the results have no direct consequences for the students themselves (for an overview, see Rios, 2021). As previous research has shown that low or declining motivation is likely to be the rule rather than the exception (Debeer et al., 2014; Eklöf, 2010; Perkins et al., 2021; Weirich et al., 2017), it is important to investigate which of these alternatives can be made most useful through the possibilities of computer-based tests.
Mode Effects Between Paper-Based and Computer-Based Tests
Empirical studies have repeatedly shown that computer-based tests on cognitive traits (e.g., reading comprehension) are more difficult for students than paper-and-pencil tests (Fishbein et al., 2018; Robitzsch et al., 2017). This phenomenon is known as the mode effect. Mode effects can result from item properties (e.g., response formats, Bennett et al., 2008; Bodmann & Robinson, 2004; Parshall et al., 2002) and test features (e.g., navigation, layout, Lee et al., 1986) as well as individual characteristics (e.g., ICT familiarity or literacy, Zandvliet & Farragher, 1997). Computer use requires additional cognitive resources from students: solving a reading literacy item not only involves reading competence (as usually presumed by the underlying model) but also technical skills, such as navigating through the text. Mode effects thus make items more difficult and represent another source of construct-irrelevant variance that can affect the validity of test scores (Bürger et al., 2016; Fishbein et al., 2018; Kröhne & Martens, 2011; Robitzsch et al., 2020). Moreover, as item difficulty differs between PPA and CBA due to mode effects, motivation (which has been shown to depend on task difficulty, see, e.g., Goldhammer et al., 2017; Wigfield & Eccles, 2000; Wise & DeMars, 2005) might also differ between the two modes. However, while adaptive tests (if applied) or an engaging computer setting might affect motivation in favor of CBA test scores, mode effects might work in the opposite direction.
It is reasonable to assume that mode effects may also occur in non-cognitive (e.g., attitudinal) measures, raising similar issues as for cognitive assessments. Fortunately, Thelk et al. (2009) have shown that self-reported measures of examinee motivation are invariant across assessment modes (paper-based vs. computer-based assessment) in terms of configural, metric, and scalar invariance.
Both mode effects and a drop in motivation over the course of the test can introduce construct-irrelevant variance into test scores (Debeer et al., 2014; Haladyna & Downing, 2004; Messick, 1984; Wise & DeMars, 2010) and can thus be considered potential threats to the validity of test results (Borsboom, 2004; Messick, 1984, 1989, 1998; Zumbo, 2007). Changing the test mode from paper-based to computer-based tests could lead to interactions between the mode and students' motivation. The estimation of valid test scores then becomes a complex challenge, especially if changes in the mode go along with changes in motivational constructs that correlate with test performance. Whether test-taking effort depends on the test mode is still not clear. It seems possible that computer-based tests provide a more attractive testing environment for students and may stimulate stronger interest than paper-based tests. On the other hand, technical issues may distract students and cause annoyance or loss of interest in test-taking, which may hinder them from performing up to their competence level. Comparing the effects of test-taking effort on test scores in PPA and CBA tests can shed light on whether and to what extent test developers should make reasonable efforts to design tests and administration procedures that keep students' motivation at a constantly high level.
Research Scope
The present study compares the intra-individual progression of test-taking effort during a large-scale test in PPA versus CBA test mode. Moreover, we investigate whether the progression of effort differs between low- and high-achieving students. As students' motivation during test-taking has been shown to depend on whether item difficulty matches student ability (Asseburg & Frey, 2013; Krannich et al., 2019), we control for the difference between the two. Specifically, we examine the following research questions:
1. In a preliminary analysis, we first investigate whether mode effects occur in our English reading and listening comprehension tests. This analysis serves two functions: first, to replicate empirical findings from previous studies and, second, to facilitate interpreting the results of research questions 2 and 3.
2. Does self-reported effort change differentially during the test as a function of PPA versus CBA administration mode?
3. Does self-reported effort depend on the fit between item difficulty and person ability?
Method
Sample and Procedure
The data stem from a study conducted in spring 2019 with N = 2,676 German ninth-grade students (mean age = 15.1 years; 48% female; 35% academic track) from 123 classes. All students completed a 120-minute test on English reading and listening comprehension. The tested students were learning English as a foreign language at school. The test consisted of two parts (PPA and CBA), with a scheduled processing time of 60 minutes each and an intermission of 15 minutes in between. We used a within-subject design; that is, each participating class was randomly divided into two groups. The first group, labeled "CBA first," started in CBA mode for the first half of the test (blocks one to three) and switched to PPA mode for the second half (blocks four to six). The second group, labeled "PPA first," started in PPA mode and then switched to CBA mode. Trained research assistants administered the test under standardized conditions. For the CBA test, students worked on school-owned desktop computers with a conventional keyboard and mouse. In addition to the test, students answered a background questionnaire.
Measures
English Reading and Listening Comprehension
The test comprised 513 dichotomously scored items (267 on reading comprehension and 246 on listening comprehension) administered in a multiple matrix sampling design (Shoemaker, 1973). That is, students only completed a subset of all items. The items were grouped into 23 disjoint blocks (i.e., items were nested within blocks; 12 blocks of reading comprehension, 11 blocks of listening comprehension). The time allocated for each block was 20 minutes. We constructed 53 booklets, each consisting of six blocks, according to a balanced incomplete block design (Frey et al., 2009; Gonzalez & Rutkowski, 2010; van der Linden et al., 2004), which was balanced regarding block position; that is, each block occurred at each of the six positions with approximately the same frequency. The test items were drawn from an item pool of the IQB Trends in Student Achievement study (Stanat et al., 2016). Thus, the items assessed the extent to which students' reading and listening comprehension in English at the end of secondary school meets the proficiency expectations of the educational standards (KMK, 2004).
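To illustrate the idea of position balancing underlying such a booklet design, the following minimal R sketch is purely illustrative (it does not reproduce the study's actual 53-booklet balanced incomplete block design): it rotates six hypothetical blocks cyclically so that each block appears exactly once at every booklet position.

```r
# Illustrative position balancing with six hypothetical blocks: a simple cyclic
# rotation (Latin-square-type) in which every block occupies every position once.
blocks <- paste0("B", 1:6)
booklets <- t(sapply(0:5, function(shift) blocks[(0:5 + shift) %% 6 + 1]))
colnames(booklets) <- paste0("pos", 1:6)
rownames(booklets) <- paste0("booklet", 1:6)
booklets
# Each column (position) contains every block exactly once, so block position
# is balanced across booklets.
```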
Test-Taking Effort
We measured students' test-taking effort (TTE) as a key element of multidimensional test-taking motivation using a validated scale (Eklöf, 2010; Eklöf & Nyroos, 2013; Penk et al., 2014). The items referred to students' effort in the current test situation (e.g., "I gave my best effort on this test.") and were answered on a scale ranging from 1 (strongly disagree) to 4 (strongly agree). All items were adopted in positive formulation and translated into German (Penk et al., 2014).
We measured students' self-reported test-taking effort four times during the test, twice in each test mode (PPA and CBA): (1) at the beginning of the first part of the test, before students started in the first test mode (pre-test measurement of the first test half); (2) at the end of the first part, after 60 minutes, when students had completed the test in the first test mode (post-test measurement of the first test half); (3) at the beginning of the second part, after the intermission and before students started in the second test mode (pre-test measurement of the second test half); and (4) at the end of the second part, after 120 minutes, when students had completed the test in the second test mode (post-test measurement of the second test half). The wording of the TTE items varied slightly depending on the measurement condition. In each pre-test measurement, that is, when motivation was surveyed in relation to tasks that had yet to be completed, we used future tense ("I am motivated to give my best in this test"). In each post-test measurement, that is, when motivation was surveyed in relation to tasks that had just been completed, we used past tense ("I gave my best effort on this test").
Test-taking effort was measured in the same mode as the test students were currently working on; that is, in the PPA mode, test-taking effort was also measured with paper and pencil, and vice versa. The scale reliabilities of the four effort items were satisfactory (PPA mode pre-/post-test: α = .82/.81; CBA mode pre-/post-test: α = .83/.82). We computed the arithmetic mean for each measurement occasion and mode and used these scores in the subsequent models.
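As an illustration of how the manifest effort scores can be obtained, the following R sketch (with hypothetical data frame and item names) computes Cronbach's alpha and the arithmetic mean of the four effort items for one measurement occasion.

```r
# Hypothetical data frame with the four effort items of one measurement occasion
# (e.g., PPA pre-test); psych::alpha() yields the scale reliability, rowMeans()
# the manifest scale score used in the linear mixed models.
library(psych)
items <- tte_ppa_pre[, c("eff1", "eff2", "eff3", "eff4")]
alpha(items)                                     # Cronbach's alpha
effort_ppa_pre <- rowMeans(items, na.rm = TRUE)  # mean effort score per student
```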
Control Variables
We included self-reported gender, academic track (0 = non-academic track, 1 = academic track), students' achievement score (θ), and the fit between item difficulty and person ability (i.e., θ–β) as control variables. Additionally, all two-way interactions between these variables were included in the model in order to control for, and to be able to detect, differential effects with respect to school track and gender.
To construct the ability-difficulty fit measure, we used the weighted likelihood estimate (WLE; Warm, 1989) as students' achievement score θ. The WLE resulted from a unidimensional one-parameter Rasch scaling model of the dichotomous item responses (Adams & Wu, 2007; Hambleton et al., 1991). Reading and listening items as well as PPA and CBA items were modeled together in a unidimensional model, yielding a composite English language comprehension test score. As the latent correlation between reading and listening was r = .88 when we scaled the dimensions separately, modeling a composite measure seemed reasonable. The fit between item difficulty and person ability was set to zero for the pre-test measurement of the first test half (at this point, students had not yet worked on any test items). For the post-test measurement of the first test half and the pre-test measurement of the second test half, the fit equals the difference θ − β1, where β1 is the average difficulty of the items the student had worked on up to this point (i.e., during the first test half). For the post-test measurement of the second test half, the fit equals the difference θ − β2, where β2 is the average difficulty of the items the student worked on in the second half, that is, the second 60 minutes of the test.
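A minimal sketch of how such a fit measure could be derived is given below. It assumes a wide response matrix `resp` (students × items, with NA for items not administered) and a hypothetical list `items_half1` holding, for each student, the indices of the items worked on in the first test half; the TAM package is used here as one possible implementation of Rasch scaling and WLE estimation, not necessarily the one used in the study.

```r
library(TAM)

mod  <- tam.mml(resp)           # unidimensional Rasch (1PL) scaling model
wle  <- tam.wle(mod)$theta      # weighted likelihood ability estimates (theta)
beta <- mod$xsi$xsi             # item difficulty estimates

# Average difficulty of the items each student worked on in the first test half,
# and the resulting fit measure (positive values indicate underchallenge):
beta_th1  <- sapply(items_half1, function(ix) mean(beta[ix]))
fit_half1 <- wle - beta_th1
```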
Statistical Analysis
Model
We used linear mixed modeling (Bates et al., 2015; Dobson & Barnett, 2008; Dunn & Smyth, 2018) to model the change in test-taking effort during the test. Taking the clustered data structure into account (students nested in classes), we applied a multi-level extension of a linear mixed model (LMM) in which person and class effects were assumed to be random. As fixed effects, we modeled three within-subject factors, each with two levels (mode: PPA vs. CBA; test half: first vs. second; measurement occasion: pre-test vs. post-test). Additionally, the continuous variable person-test fit (operationalized as the difference between the person score and the average item difficulty of the corresponding test part) also varies within students. The between-subject fixed effects were gender (male, female), the overall test score (i.e., students' θ), and the school track (academic track vs. non-academic track). Students' self-reported effort was the dependent variable in all analyses. The lmer function of the R (R Core Team, 2020) package lme4 (Bates et al., 2015) was used for estimating the LMMs. For the computation of effect sizes in linear mixed models (Hedges, 2007; Westfall et al., 2014), we used the R package emmeans (Lenth, 2022).
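A minimal sketch of how such an LMM can be specified with lme4 is shown below, assuming a long-format data frame `d` with one row per student and measurement occasion; the variable names and the exact set of interaction terms are illustrative rather than the study's precise parametrization.

```r
library(lme4)

# LMM 1: main effects only, with random intercepts for classes and for students
# nested within classes (four effort measurements per student).
lmm1 <- lmer(effort ~ cba + second_half + posttest + fit + theta + gender + track +
               (1 | class / student),
             data = d)

# LMM 2 additionally includes two-way interactions, for example:
lmm2 <- update(lmm1, . ~ . + posttest:cba + second_half:cba + posttest:fit +
                 posttest:theta + posttest:track + theta:gender)
summary(lmm2)
```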
Missing Data
The percentages of missing values on the four test-taking effort items varied between 1.2% and 1.9% in the PPA mode and between .7% and 3.6% in the CBA mode. In general, the missing percentage in the post-test exceeded the missing percentage in the pre-test. The gender variable had only .1% missing values. Assuming missing at random (MAR; Schafer & Graham, 2002), we applied multiple imputation (Rubin, 1987; van Buuren, 2007, 2018) with the R (R Core Team, 2020) package mice (multivariate imputation by chained equations; van Buuren & Groothuis-Oudshoorn, 2011) to generate 15 complete data sets. All available background variables of the study (including variables not used in the current analysis, e.g., age and language spoken at home) were additionally used as auxiliary variables in the imputation model. The analyses were conducted separately for each imputed data set, and the results were pooled according to Rubin (1987).
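A minimal sketch of this imputation step with mice, assuming a data frame `dat` that contains the effort items, covariates, and auxiliary background variables (the seed is arbitrary):

```r
library(mice)

# Generate 15 multiply imputed data sets under the MAR assumption.
imp <- mice(dat, m = 15, seed = 42, printFlag = FALSE)

# Extract the completed data sets; the LMMs are then fitted on each of them,
# and the 15 sets of estimates are pooled according to Rubin's (1987) rules.
completed <- lapply(seq_len(imp$m), function(i) complete(imp, i))
```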
Results
Regarding the preliminary analysis on mode effects, we found significant mode effects in both reading and listening comprehension (see Online Appendix Tables A1 and A2), indicating that the items were more difficult in CBA than in PPA. In addition, mode effects were heterogeneous across domains and more pronounced for reading than for listening comprehension.
Test-Taking Effort for the Four Measurement Occasions Within the PPA and CBA Test Mode.
Note. PPA = paper and pencil assessment; CBA = computer-based assessment.
Estimates of Fixed and Random Effects for LMM 1 and LMM 2.
Notes. R2 according to Nakagawa and Schielzeth (2013). Test half: the test was split into two parts of 60 min each. Measurement occasion: indicates whether motivation was surveyed at the beginning or at the end of the corresponding test half. es = effect size measure for linear mixed models.
The first LMM shows that, overall, test-taking effort is significantly lower in the second half of the test (−.197) than in the first. Furthermore, effort decreases significantly within each test half between pre-test and post-test (−.158). That is, the retrospectively self-reported (i.e., realized) effort after students had finished one test half in a given mode is, overall, lower than the prospectively reported (i.e., intended) effort students were willing to invest before starting the test in the respective mode. Beyond this, overall effort in CBA is lower (−.057) than in PPA mode, although the corresponding effect size is very small. Considering the marginal effects between individuals, higher-achieving students report higher effort, and males invest less effort than females. Further, there was an overall tendency for effort to be higher the more a student's θ exceeded the average item difficulty of the corresponding block (.063).
The two-way interactions of the second LMM show that the difference between pre- and post-test is less pronounced if a student's θ exceeds the average item difficulty of the corresponding block. However, the difference between pre- and post-test is more pronounced for high-achieving students within both school tracks. In contrast, the difference between pre- and post-test is less pronounced for academic-track students. The gender difference is less pronounced for high-achieving students.
Concerning the second research question, the second LMM shows that the decrease of test-taking effort in PPA mode does not differ from the decrease in CBA mode (non-significant interaction term "post × CBA"). However, there is almost no decrease in effort between the two halves of the test when the mode changes from CBA to PPA, whereas effort decreases visibly between the two test halves when the mode changes from PPA to CBA (significant interaction term "second × CBA").
Concerning the third research question, our results show that self-reported effort is higher when the items are easier relative to students' abilities. This is in line with findings by Asseburg and Frey (2013).
Discussion
The current study compared self-reported effort in a large-scale assessment between paper-and-pencil and computer-based administration modes. Comparability of both administration modes is crucial if trends in students’ competencies are to be assessed over longer periods and if the administration mode changes during that time (Robitzsch et al., 2017). In addition, test scores correlate significantly with students’ test-taking effort, which generally decreases as the test progresses (Penk & Richter, 2016). Both changes in the test administration mode (PPA vs. CBA) and a decline in students’ test-taking motivation over the course of the test can introduce construct-irrelevant variance and, ultimately, compromise the validity of the test results. Against this background, we investigated whether a change in the test mode interacts with differential trajectories of test-taking motivation. Identifying factors influencing students’ test scores beyond their competencies is necessary for valid interpretation of changes in test scores.
Our results show that decreasing effort, which has repeatedly been shown in PPA tests (Penk & Richter, 2016), is also an issue in CBA tests. On average and somewhat unexpectedly, students’ self-reported effort during the CBA test is even lower than in the PPA test.
The CBA and PPA implementations of the test were very similar in the present study: the same items were used with a similar presentation, and no additional features that are only possible in a CBA environment (e.g., adaptive test design or multimedia features such as videos) were implemented. In other words, there was no compelling need for a CBA implementation of the items administered in the present study.
Overall, our results do not support the assumption that computerized test settings are per se more appealing to students, which could lead them to invest a consistently higher level of effort compared to PPA tests. On the contrary, students' effort was even lower in CBA than in PPA. While we cannot give a conclusive explanation for this result, students might have higher expectations in advance of a computer-based test that the current implementation could not fulfill. For example, typing short answers to open-ended items using the keyboard might be more exhausting and unfamiliar than writing with paper and pencil, which students are used to. Considering the wide range of possibilities for designing computer-based tests, further research should investigate which device (e.g., desktop, laptop, or tablet with on-screen keypad) and which presentation of stimuli and related items (e.g., split screen, scrolling, or turning pages) is most user-friendly for which age group of students while ensuring consistently high levels of effort during the test. In addition, computer-based settings offer further promising opportunities to support students' motivation through flexibly presented design elements in the course of testing. According to expectancy-value theory (EVT; Eccles & Wigfield, 2002; Wigfield & Eccles, 2000), immediate feedback (after each half of the test, for example) might enhance the perceived usefulness of the test. If students' test-taking behavior shows signs of careless responding (for example, if they skip several items in a row or answer within very short response times; Lee & Jia, 2014; Pohl et al., 2019; van der Linden, 2006), indicating rapid guessing behavior (Wise & Kuhfeld, 2021), the test system could encourage students through short appeals. Since Liu et al. (2012) showed that students tend to perform best when they see personal relevance of their test scores, further research should investigate whether emphasizing personal relevance through adaptive encouragement or rewards when students complete a part of the test could enhance their effort and engagement. In addition, analyzing non-linear item position effects can help to decide how much testing time is reasonable for students to receive reliable results and avoid overstraining. For primary school students working on an 80-minute test comprising four blocks of 20 minutes each, Weirich et al. (2014; Online Appendix) showed that students' performance declined especially during the last block, indicating that a test of 60 minutes (i.e., three blocks) would be more reasonable for primary school students. With 120 minutes of testing time, the current study placed high demands on students' persistence.
In line with Asseburg and Frey (2013), the current study found that students' effort is higher if the items are relatively easy in relation to students' competencies. Moreover, the decline in effort between pre- and post-test is smaller when the test is less challenging. Adaptive tests intend to optimize the statistical efficiency of the test (i.e., to maximize the test information function) by choosing the next item (or the next stage with multiple items) according to the provisional score of the examinee (Luecht & Nungester, 1998). However, this strategy implies that θ–β ≈ 0 and will therefore probably not evoke optimal effort in examinees. Hence, the design of (adaptive) educational tests should take into account that an optimal solution in terms of efficiency may contradict an optimal solution in terms of motivation; thus, a balanced tradeoff between the two seems beneficial. Asseburg and Frey (2013) recommend that the expected mean probability of a correct answer across all items for one student should be placed around 70%. This is in line with our results: in order to achieve a high level of effort, students should be presented with items whose mean difficulty is clearly exceeded by their ability, resulting in correct-answer probabilities of 70% and above.
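Under the Rasch model used for scaling in this study, the success probability is a logistic function of θ–β, so the recommended 70% success rate corresponds to a fixed ability-difficulty gap. A minimal sketch in base R illustrates the arithmetic:

```r
# Under the Rasch model, P(correct) = plogis(theta - beta).
qlogis(0.70)  # ≈ 0.85 logits: the theta - beta gap implied by a 70% success rate
plogis(0.85)  # ≈ 0.70: items about 0.85 logits below a student's ability yield
              # the recommended success probability of roughly 70%.
```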
Limitations
When interpreting the results, the following limitations should be taken into account. First, we used self-report measures of students' test-taking effort. While several studies support the validity of self-reported measures (Eklöf & Nyroos, 2013; Wise et al., 2006), they have been criticized for ambiguity. Wise and DeMars (2006) state that "examinees who believe they did not do well on the achievement test might underreport their effort" (p. 20) to provide an alternative justification for anticipated poor results. Additionally, LaFave et al. (2022) discussed ambiguity problems arising from the fact that students who feel competent in a specific subject area may have inherently higher levels of motivation. Their higher performance, in turn, is due to their higher abilities, where "motivation [is just] a natural byproduct of having higher abilities" (p. 19). As an alternative, Ulitzsch et al. (2021) suggest a model-based approach that uses response times to simultaneously measure test speed, accuracy, and engagement. Such an approach was not feasible in the current study, as response times are not available at the item level in PPA tests. Additionally, even the model-based approach suffers from ambiguity problems, which may result in lower than expected agreement between different indicators of non-effortful test-taking behavior.
For the same reason, we could not use alternative (approximate) measures of effort, such as response time effort (RTE; Wise & Kong, 2005), a non-intrusive method to evaluate students' test-taking effort. Response times measured at the item level can be used to identify rapid guessing behavior, which in turn serves as an indicator of low motivation. RTE or, in a broader sense, response time behavior has been used, among other things, for motivation filtering (Swerdzewski et al., 2011), latent modeling of disengagement (Goldhammer et al., 2017), and separating two consequences of low motivation: guessing and omissions (Ulitzsch et al., 2019).
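To make the RTE idea concrete, the following minimal R sketch (not part of the study, since item-level response times were unavailable in the PPA mode) computes the index as the proportion of a student's responses whose response time exceeds a rapid-guessing threshold; the matrix name and the single 5-second threshold are illustrative assumptions.

```r
# Response time effort (RTE; Wise & Kong, 2005): share of item responses
# classified as solution behavior (response time at or above a threshold).
# `rt_matrix` is a hypothetical students x items matrix of response times in
# seconds; a single 5-second threshold is assumed here for all items.
rte <- function(rt_matrix, threshold = 5) {
  solution_behavior <- rt_matrix >= threshold   # TRUE = effortful response
  rowMeans(solution_behavior, na.rm = TRUE)     # one RTE value per student, 0 to 1
}
```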
Second, we compared test-taking effort between an otherwise identical PPA and CBA implementation of an English language test. It is, however, plausible that students' effort varies between devices (Rutkowski et al., 2022) or between different implementations of a CBA test. Moreover, the further development of devices (tablets, laptops, etc.) may improve their usability, which should in turn affect students' motivation. Considering the high workload that large-scale assessments demand from students without rewarding them, assessments should be designed to engage students' interest and to ensure a consistently high level of motivation. In order to achieve this, the targeted probability of success could be increased.
Acknowledgments
The authors would like to thank two anonymous reviewers and the editor for their thorough and constructive comments in improving the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
