Abstract
Rater variation has been a persistent concern for rater-mediated writing assessments. Instead of treating rater variation as an undesired source of measurement error, the method of comparative judgment (CJ) uses pairwise comparisons to elicit relative judgments from raters and statistical estimation to construct a measurement scale that ranks the assessed works, offering a viable approach to accommodating rater-associated heterogeneity in judgment making on the one hand and obtaining reliable and valid outcomes on the other. The current study systematically examined the utility and quality of CJ as an assessment tool in the context of second language writing. A group of 16 raters (8 experienced and 8 novice) performed a CJ assessment of 94 English writing texts in the absence of rubric criteria. Despite their varying expertise and rating experience, the raters were able to deliver judgments consistent with the shared consensus, yielding a CJ rank order of the writing texts with moderate reliability. Analyses of raters' justifications for their judgments showed that raters varied substantially in their evaluation criteria, but the collective expertise derived from the iterative CJ process aligned closely with the established scoring rubric. Additionally, inconsistencies were explored for the raters and texts that deviated significantly from the consensus of judgments, and practical implications are discussed. The results provide empirical evidence for the construct validity of CJ and add a novel perspective to the discussion of rater variation in second language writing assessment.
Introduction
Human judgment plays a vital role in rater-mediated performance assessments, such as assessments of second language (L2) writing. Just as L2 writing is itself a highly complex language skill, assessing L2 writing is further complicated by raters' idiosyncratic criteria and subjective judgments (Weigle, 2002), which introduce undesired variation and consequently pose a threat to the reliability and validity of assessment outcomes (Messick, 1995). Research on rater behaviors in writing assessment contexts has demonstrated that raters differ to a large extent in how they interpret rubric criteria to translate their evaluation of text quality into scores, regardless of the analytic or holistic nature of the rubrics (Eckes, 2012; Elliott, 2017; Wolfe, 1997). Although optimized rubrics and rater training are widely implemented to reduce rater variability, the effectiveness of those practices has been shown to be less than expected (Bloxham, 2009; Eckes, 2008; Knoch, 2011; Lumley & McNamara, 1995). Raw scores are therefore often adjusted using statistical modeling techniques such as many-facet Rasch measurement (MFRM) to account for rater biases (Myford & Wolfe, 2003, 2004). Rater variation, as an undesired source of measurement error, remains a persistent concern for language writing assessment.
However, given the multidimensional nature of the language writing construct and the variability of human judgment, any single, uniform way of conceptualizing and evaluating writing skills seems insufficient to fully capture this inherent complexity (Cumming, 1990; Elliott, 2017; Lumley, 2002). Instead of limiting raters to predefined rubric criteria, embracing raters' divergent points of view may be more congruent with the diversity embedded in writing performance. The question that follows, then, is how to apply an assessment method that can incorporate this rater-associated heterogeneity in a systematic way, so that concrete samples of L2 writing performance can be assessed with validity, reliability, and efficiency.
Innovation in educational assessment has drawn attention to the method of comparative judgment (CJ; Thurstone, 1927). The method simply requires raters to select the better of two works through repeated pairwise comparisons and applies statistical estimation to construct a measurement scale that ranks the works. Previous research has shown that raters obtain highly reliable evaluations using CJ, advocating for its wider implementation in assessment domains (Bartholomew & Yoshikawa-Ruesch, 2018; Hartell & Buckley, 2021). However, prior studies have primarily focused on the implementation of CJ and its reliability, leaving construct validity, a crucial aspect of CJ, less scrutinized. What criteria raters use when comparing works, which underpins CJ's validity claim amid divergent rater judgments, has not been well examined; whether and how rater expertise and experience affect judging behaviors during CJ also remains to be addressed. Moreover, compared with other research fields, the investigation of CJ in the context of L2 writing assessment has been rather limited; in particular, how CJ can be used to incorporate rater variation has yet to be explored.
The purpose of this paper is to illustrate the utility of CJ as a promising approach to the rater variation embedded in conventional rubric-based L2 writing assessment, through an examination of the consistency and judgment making of experienced and novice raters during a CJ assessment. The following sections briefly review common approaches used to address rater variation in writing assessment research, introduce the rationale of the CJ method, and then present an empirical application in an English as a foreign language (EFL) writing assessment context. With this application, the reliability and validity of the CJ assessment are examined, and how CJ presents a viable approach to challenges such as rater variation, as well as some practical implications, is discussed.
Approaches to Rater Variation in Language Writing Assessment
Rater variation refers to systematic variation in ratings or scores that is associated not with examinees' performance per se, but with rater characteristics, such as reading styles (Wolfe, 1997), rater types (Eckes, 2008), raters' cognitive and metacognitive processing (Cumming et al., 2002; Vaughan, 1991; Zhang, 2016), rating experience (Crisp, 2013), and fatigue and across-session drift (Palermo et al., 2019). These inconsistencies between and within raters present a challenge to measurement precision and a potential threat to the validity of decisions based on those judgments (Lumley & McNamara, 1995; Myford & Wolfe, 2004). Extensive research has attempted to mitigate the impact of rater variation in language writing assessment through different approaches, but has yielded mixed results.
The development of a clear rubric, which specifies the attributes to be assessed and provides detailed descriptions or exemplar responses to differentiate levels of performance, is an essential first step toward reliable scoring (Eckes, 2008; Wilson, 2006). Although a well-designed rubric can serve as a regulatory means to guide the way in which raters evaluate each writing performance, raters have been found to apply rubrics with varying foci on, and varying breadth of, scoring criteria (Eckes, 2008; Wolfe, 2004), and to score texts on the given criteria more severely or more leniently according to their own internal standards (Elliott, 2017; Lumley, 2002; Rhead et al., 2016). Moreover, individual raters tend to perceive the relevance of criteria differently depending on the quality level of the writing performance (Humphry & Heldsinger, 2019), or to shift their focus over a long rating session and across multiple sessions (Elliott, 2017; Lumley, 2002; Palermo, 2022; Wang et al., 2017). These idiosyncratic interpretations and applications of rubric criteria observed in scoring sessions highlight the importance and necessity of rater training.
Implementing rater training offers another means of minimizing rater variation. Together with a specific rubric, rater training is assumed to facilitate a shared understanding of the standards among raters, leading to greater agreement and more reliable scoring decisions (Elliott, 2017; Lumley, 2002). Improved between-rater agreement and within-rater scoring consistency after training have been reported (e.g., Davis, 2016; Weigle, 1998), and the effects were more prominent for novice or excessively severe/lenient raters (Yan & Chuang, 2022). For established raters, on the other hand, (additional) training or feedback showed little or limited further effect on their rating consistency (Davis, 2016; Knoch, 2011). Researchers have also questioned the validity and effectiveness of rater training, as training may result in raters agreeing on only superficial aspects of writing rather than more substantive features (Charney, 1984).
Alternatively, researchers have turned to statistical models that aim to detect, measure, and correct for potential rater effects on scoring performance, the two most frequently used frameworks being generalizability theory (G-theory; Brennan, 2001) and the many-facet Rasch measurement approach (MFRM; Linacre, 1994). G-theory helps improve assessment reliability by disentangling multiple sources of error variance, including rater variation, from the total error of the observed scores, and by estimating the optimal numbers of tasks and/or raters per task that minimize the impact of error components. The MFRM approach incorporates the influences of additional characteristics (i.e., facets) into the measurement model and estimates the extent to which these facets affect ratings. More recently, innovative statistical approaches from other disciplines have been explored, such as the hierarchical rater model and its extensions (DeCarlo et al., 2011; Patz et al., 2002) and social network analysis (Lamprianou, 2018). Each of these statistical approaches has its benefits and drawbacks, a full discussion of which is beyond the scope of this paper.
Despite divergent results regarding their effectiveness in reducing rater variation, all these practices treat rater variation in the decision-making process as an undesired error component that impairs the precision of performance evaluation. However, as L2 writing entails a multidimensional latent construct and a creative product, any single set of rubric criteria may fail to fully capture the underlying complexity. The emphasis on high reliability achieved by training raters to conform to a "consistent" decision-making pattern outlined by a specific rubric also denies the possibility of "more than one 'correct' reading of a text" (Weigle, 1994, p. 199). As human judgment is an indispensable element of L2 writing assessment, the expertise and experiences raters bring into the assessment should constitute an integral part of their scoring decisions (Huot, 1990). Therefore, instead of treating rater variation as a nuisance to reliable scores, we argue that an alternative approach to rater variation can be adopted, one that allows raters to employ their expertise and experiences on the one hand and systematically models the resulting heterogeneity to produce reliable and valid outcomes on the other.
Comparative Judgment: A Relative Judgment Approach
When awarding scores, raters are essentially required to judge the absolute quality of individual writing performances in isolation; however, humans are rarely good at making absolute judgments (Laming, 2004; Thurstone, 1927). In practice, raters either decide on a score by comparing each text against the predefined standard (Elliott, 2017), or gauge (and sometimes adjust) scores after comparing the current text with ones previously evaluated (e.g., Elliott, 2017; Lumley, 2002), rendering their judgments relative. In light of these decision-making behaviors, researchers have explored a relative judgment approach, comparative judgment, to enable direct comparisons of performances in relation to each other.
The method of comparative judgment (CJ; Thurstone, 1927), instead of evaluating and assigning a score to one individual performance at a time, requires raters to directly compare pairs of performances and judge holistically which one in each pair is of better quality. The binary judgments from repeated pairwise comparisons are used to construct a ranking scale that orders students' performances from lowest to highest quality. Advances in computer technology have greatly promoted applications of the CJ procedure (S. Bartholomew & Yoshikawa-Ruesch, 2018; Hartell & Buckley, 2021), showing its potential for producing valid and highly reliable results in open-ended and performance-based assessments, such as essay questions (Walland, 2022; Whitehouse, 2013; Whitehouse & Pollitt, 2012), academic writing (Bouwer et al., 2018; van Daal et al., 2019), engineering design (S. R. Bartholomew et al., 2018; Strimel et al., 2021), mathematical problem-solving (Jones et al., 2015), and an interdisciplinary university program (Baniya et al., 2022).
The Rationales for Using Comparative Judgment
Previous studies using CJ for student assessment have often reported high reliability, ranging from 0.67 to 0.99 with the majority of values greater than 0.80 (see, e.g., Steedle & Ferrara, 2016; Verhavert et al., 2018, for an overview). Although some very high values (0.95 or above) have been challenged as potentially inflated by adaptive pairing algorithms (Bramley, 2015; Bramley & Vitello, 2019), the average CJ reliability reported is often higher than what can be obtained with conventional rubric scoring. The high reliability and validity of CJ are claimed to be underpinned by two key features: relative judgments and a shared consensus based on collective expertise.
First, relative comparisons help reduce inter-rater variation. Raters may differ considerably in determining the absolute quality of performances, but they are more likely to agree on which one of a pair is better. The relative quality of two performances is then independent of the absolute standards of the individual raters involved, canceling out differences in rater severity and leniency (Kimbell, 2022; Pollitt, 2004). Second, relative judgments help reduce within-rater variation. While raters' evaluation standards and focus tend to drift over time (Lumley, 2002; Pollitt, 2012a; Wang et al., 2017), their judgments of "the better" text in a pair are more likely to remain consistent. By bringing more consistency to raters' judgments, reliability is assumed to improve. In addition, unlike conventional rubric scoring, where each writing performance is usually assessed by only one or two raters before receiving a final score, CJ requires that each piece of writing be compared in several different pairings and evaluated by multiple raters independently. Pooling together all the comparative judgments creates a shared consensus among raters of the perceived quality, resulting in a more reliable rank order of performances (Pollitt, 2012a). Researchers have shown that the reliability of CJ is largely determined by the number of comparisons each text receives (Bramley, 2015; Kimbell, 2022): on average, 10 to 14 comparisons per text are needed to obtain a reliability of 0.70, and 26 to 37 comparisons per text for a reliability of 0.90 (Verhavert et al., 2019).
Apart from excellent reliability, the result of CJ benefits from the collective expertise of all the raters. Whereas the use of rubrics may compel raters to adopt a specific conceptualization of good writing (Weigle, 1994), CJ does not impose predefined criteria on raters and allows them to fully tap into their expertise and base their judgments on whichever aspect(s) of the writing construct they value. This "freedom to exercise discretion" (Humphry & Heldsinger, 2019, p. 11), however, invites questions regarding the validity and objectivity of raters' comparative judgments. Although the validity of CJ has not been adequately investigated, Whitehouse (2013) and van Daal et al. (2019) explored the criteria raters used to make CJ decisions and found that those criteria were highly construct-relevant, not only relating to but extending beyond the criteria specified by the assessment objectives or attainment targets. Thus, by allowing raters to employ their collective expertise, CJ is assumed to produce a more comprehensive conceptualization of writing competence, embracing more dimensions and elements than a single rubric, and is therefore said to improve not only the reliability but also the validity of the assessment (Chambers & Cunningham, 2022; Lesterhuis et al., 2018).
Comparative Judgment in L2 Writing Assessment
Compared with the burgeoning research into utilizing CJ in other subject domains, studies on CJ in the context of L2 writing assessment have been relatively limited. Sims et al. (2020) examined the performance of novice and experienced raters under two rating methods, rubric rating with MFRM and CJ, and found that both groups of raters produced reliable ratings in both settings, and that the rating disparity between novice and experienced raters was less prominent with CJ than with the commonly used rubric-rating-with-MFRM approach. One intriguing line of research extends this idea and explores the use of CJ in combination with crowdsourcing (i.e., soliciting contributions from a large group of dispersed participants) to generate reliable and valid evaluations of writing texts in L2 learner corpora. These studies included either expert linguists from the applied linguistics community (Paquot et al., 2022; Thwaites, Kollias, & Paquot, 2024) or judges of more diverse backgrounds recruited from an online crowdsourcing platform (Thwaites, Vandeweerd, & Paquot, 2024), and demonstrated high levels of reliability and concurrent validity of the crowdsourced CJ evaluations with rubric-based assessment approaches.
One issue that has not been addressed in those studies is the construct validity of CJ used for L2 writing assessment (Thwaites, Vandeweerd, & Paquot, 2024). While the overall CJ reliability was satisfactory, previous studies nevertheless gave indications that systematic differences may exist between groups of judges in their judging performance, possibly associated with judge characteristics such as professional expertise and rating experience. The sources and extent of such rating variation are yet to be investigated. More fundamentally, what criteria judges rely on when comparing texts and to what extent these criteria deliver a valid construct representation of L2 writing should be carefully inspected before CJ results can be used with confidence (Kelly et al., 2022). A small number of studies examining the validity of CJ provided support for construct relevance and full construct representation in judges' comparative decisions in L1 writing assessment (Chambers & Cunningham, 2022; Lesterhuis et al., 2022; Walland, 2022), but these findings cannot, and should not, be assumed to apply to the L2 context without empirical underpinning.
The Present Study
With the high reliability of CJ generally found in large-scale assessments and in various other subject domains, the present study evaluates the utility and quality of the CJ method with a relatively small group of raters in the context of L2 writing, with the aim of incorporating rater variation on the one hand and obtaining reliable and valid outcomes on the other.
More specifically, it is of interest whether the claimed advantage of high reliability in large-scale CJ assessments can also be obtained in L2 writing assessment with a relatively small group of teachers as raters, and whether any raters (or texts) deviate from the shared consensus in comparative judgment making. Second, as CJ employs a holistic approach without predefined rubrics, it is crucial for CJ's validity claim to identify which text features raters attend to when judging which text is "the better," and to examine the similarities and differences in raters' judgment making processes. Furthermore, when raters (or texts) deviate significantly from the shared consensus, referred to as misfitting, identifying where the inconsistency lies conveys valuable diagnostic information for future learning and instruction. The investigation of these issues leads to the following research questions (RQ).
Method
Text Materials
A total of 94 writing texts in English were collected using a writing task that was part of a practice test for the Test for English Majors-Band 8 (TEM8) at a university in eastern China. The TEM8 is a standardized English proficiency test targeted at fourth-year undergraduate students majoring in English in China, which examines whether students have attained the proficiency level specified in the national curriculum (NACFLT, 2000). The writing task used in this study required students to compose an argumentative essay of at least 300 words in response to a prompt (see Appendix 1) within 45 minutes, with a focus on content relevance, content sufficiency, organization, and language quality of the writing.
Raters
Sixteen raters, all users of English as a foreign language, were invited to participate in the CJ assessment. Eight of them were lecturers (two males and six females, aged between 36 and 57) at the university where the study was conducted, teaching undergraduate English courses including English language and literature, academic writing, and second language acquisition. All the lecturers were proficient in English and English writing, with teaching experience ranging from 2 to 20 years. The other eight raters were postgraduate students (one male and seven females, aged between 25 and 30) enrolled in the university's Master's program in teaching English as a foreign language. All the raters volunteered their time.
Except for one lecturer who had attended TEM8 writing scoring sessions at the national test administration center, none of the lecturers had received any explicit training on assessing TEM8 writing. Nevertheless, they had routinely evaluated students' assignment essays and/or exam papers during their years of teaching and were thus reasonably experienced in assessing L2 writing; they are referred to as experienced raters hereafter. None of the Master's students had any prior experience in assessing L2 writing and thus served as novice raters in the study, but they had all taken and passed the TEM8 and were familiar with the writing task requirements.
Data Collection and Procedure
Each writing text was anonymized, scanned, and uploaded to the online platform No More Marking (www.nomoremarking.com), which is a web-based digital system that pairs and presents student works side by side for raters to compare. A webpage link to the online CJ assessment was sent to each rater via email, so that they could access and complete the CJ assessment on their computers within a time window of one week.
Prior to the CJ assessment, the raters were gathered for an information session of about 30 minutes, which included brief introductions to CJ, the writing task, and the procedure of performing CJ with the online system. In particular, neither a rubric nor competence descriptions of L2 writing were provided; instead, the raters were explicitly instructed to evaluate and compare the L2 writing based on their professional expertise and prior rating/writing experiences. With the online system, for each pair the raters needed to select the text judged to be of better writing quality, and they were encouraged to enter brief comments explaining why they considered one text better (or worse) than the other before submitting their decision. After that, a new pair would be generated. The procedure continued until all the raters had completed their comparison quota.
Based on the findings of Verhavert et al. (2019), the number of comparisons per text was set to 14 to target a reliability of at least 0.70, which was deemed sufficient for low-stakes tests. This led to a CJ design of 658 paired comparisons in total (94 texts × 14 comparisons per text, with each comparison involving two texts: 94 × 14 / 2 = 658).
Data Analysis
The binary judgment data of 601 pairwise comparisons were exported from the online system and modeled using the Bradley-Terry-Luce model (BTL model; Bradley & Terry, 1952; Luce, 1959) to determine the locations of the writing texts on a quality scale. More specifically, for each pair, the probability of text $i$ being judged better than text $j$ is modeled as

$$P(i \text{ beats } j) = \frac{\exp(\theta_i - \theta_j)}{1 + \exp(\theta_i - \theta_j)},$$

where $\theta_i$ and $\theta_j$ denote the quality scores (scale locations) of texts $i$ and $j$, estimated from all the comparisons in which the texts were involved.
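As a concrete illustration of this estimation step, the sketch below fits the BTL model with the BradleyTerry2 package that the study reports using (see the software note below). The judgment data are simulated and the winner/loser column names are assumptions made for the example; the study's actual data handling may differ.

```r
library(BradleyTerry2)

# Illustrative data: simulate judgments among 10 hypothetical texts so the
# sketch is self-contained; the study's actual judgment data are not shown here.
set.seed(1)
texts <- paste0("T", sprintf("%02d", 1:10))
true_theta <- rnorm(10)                          # latent writing quality per text
pairs <- t(combn(texts, 2))                      # all possible text pairings
pairs <- rbind(pairs, pairs)                     # compare each pairing twice
p_first <- plogis(true_theta[match(pairs[, 1], texts)] -
                  true_theta[match(pairs[, 2], texts)])
first_wins <- rbinom(nrow(pairs), 1, p_first)    # 1 = first text judged better

# Reshape into the assumed winner/loser format (one row per judgment).
judgments <- data.frame(
  winner = factor(ifelse(first_wins == 1, pairs[, 1], pairs[, 2]), levels = texts),
  loser  = factor(ifelse(first_wins == 1, pairs[, 2], pairs[, 1]), levels = texts)
)

# Fit the Bradley-Terry(-Luce) model; outcome = 1 because the first column
# always holds the judged winner of each comparison.
btl_fit <- BTm(outcome = 1, player1 = winner, player2 = loser, data = judgments)

# Estimated quality scores (scale locations) and standard errors for each text.
abilities <- BTabilities(btl_fit)
head(abilities)
```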
To examine the reliability of the CJ scale (RQ1a), a scale separation reliability (SSR) statistic was calculated in analogy to the separation index in the Rasch model (Rasch, 1960) literature as

$$\mathrm{SSR} = \frac{\widehat{\mathrm{SD}}^2(\theta) - \overline{\mathrm{SE}}^2}{\widehat{\mathrm{SD}}^2(\theta)},$$

where $\widehat{\mathrm{SD}}^2(\theta)$ is the observed variance of the estimated quality scores and $\overline{\mathrm{SE}}^2$ is the mean squared standard error of estimation; the SSR thus expresses the proportion of observed score variance attributable to true differences between texts.
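Continuing the illustrative example above, the SSR can be computed directly from the estimated quality scores and their standard errors; this is a minimal sketch of the formula, not the study's exact computation.

```r
# Scale separation reliability (SSR) from the BTL ability estimates
# (column 1 of BTabilities() output = estimates, column 2 = standard errors).
theta_hat <- abilities[, 1]
se_hat    <- abilities[, 2]

observed_var <- var(theta_hat)     # observed variance of the quality scores
error_var    <- mean(se_hat^2)     # mean squared standard error of estimation
ssr <- (observed_var - error_var) / observed_var
round(ssr, 2)                      # proportion of variance due to true differences
```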
To investigate whether any raters or texts significantly deviated from the BTL model prediction (RQ1b), misfit statistics were calculated. The infit statistic, that is, the information-weighted mean square (Wright & Masters, 1982), was calculated for each text (rater) as the weighted average of the squared Pearson residuals of the comparisons involving that text (rater):

$$\mathrm{Infit} = \frac{\sum_{k} w_k z_k^2}{\sum_{k} w_k},$$

in which $z_k = (x_k - P_k)/\sqrt{P_k(1 - P_k)}$ is the Pearson residual of comparison $k$, $x_k$ is the observed binary outcome, $P_k$ is the probability predicted by the BTL model, and $w_k = P_k(1 - P_k)$ is the information weight. Larger infit values indicate judgments that deviate more from the model expectation, that is, from the shared consensus.
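A minimal sketch of this infit computation for texts, again continuing the illustrative example, is shown below; the per-rater infit follows the same logic with comparisons grouped by a rater identifier instead of by text.

```r
# Infit (information-weighted mean square) per text, computed from the fitted
# BTL model's predicted win probabilities (sketch only, simulated data).
p_hat  <- fitted(btl_fit)             # predicted P(winner beats loser) per comparison
x_obs  <- rep(1, length(p_hat))       # observed outcomes: the winner column won
sq_res <- (x_obs - p_hat)^2           # squared raw residuals, (x - P)^2 = w * z^2
w_info <- p_hat * (1 - p_hat)         # information weights, P(1 - P)

# Each comparison contributes to the infit of both texts involved.
infit_by_text <- sapply(texts, function(t) {
  involved <- judgments$winner == t | judgments$loser == t
  sum(sq_res[involved]) / sum(w_info[involved])
})
round(sort(infit_by_text, decreasing = TRUE), 2)   # larger values = more misfit
```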
Raters' decision making (RQ2) was explored by analyzing their comments following the thematic analysis procedure (Brooks et al., 2006). Comments were first segmented into single arguments, each concerning a specific feature of writing (e.g., "richer in content," "better structure," or "providing specific supports"), leading to a total of 763 arguments. The initial coding scheme, based on the framework by Cumming et al. (2001), was adjusted by merging some categories (e.g., grammar and spelling) or refining some into a generic text feature and a task-specific one (e.g., paragraph development and argument development). The author and another EFL lecturer independently coded a subset of the arguments, yielding an inter-rater agreement of 0.91 and suggesting satisfactory reliability of the coding. The author then completed the coding of the remaining arguments and grouped them into seven broader writing aspects. An overview of the final coding scheme and grouping can be found in Table 1. These arguments were compared with the established rubric used for TEM8 scoring (see Appendix 2). The relative use of each writing aspect in judgment making was calculated as a percentage for the whole group of 16 raters as well as for the experienced and novice rater groups separately. The significance of differences between the two rater groups in their judgment making was tested using the
Table 1. The Coding Scheme, Grouping, and Frequencies of Text Features, and the Percentages of Writing Aspects Used by Raters in Comparative Judgment Making.
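As an illustration of the group comparison described above, the sketch below applies a χ² test, a standard choice for comparing frequency distributions between groups, to a hypothetical aspect-by-group table. The counts are placeholder values rather than the actual Table 1 frequencies, and the test shown is an assumed standard choice, not necessarily the exact test used in the study.

```r
# Sketch: chi-square test on the frequencies of writing aspects used by the
# experienced vs. novice raters. Counts are illustrative placeholders only.
aspect_counts <- matrix(
  c(100, 80, 75, 40, 10, 30,  5,     # hypothetical experienced-rater frequencies
    115, 50, 40, 50, 60, 10, 10),    # hypothetical novice-rater frequencies
  nrow = 2, byrow = TRUE,
  dimnames = list(
    group  = c("experienced", "novice"),
    aspect = c("rhetorical organization", "English language use",
               "task development", "idea contents", "layout",
               "style", "length production")
  )
)

chisq.test(aspect_counts)
```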
The BTL modeling of the binary judgments was performed using the BradleyTerry2 package (Turner & Firth, 2012) in R (R Core Team, 2023), the coding of the judgment making comments using NVivo version 14 (2023), and all other analyses in R. An overview of the study design and analysis framework is presented in the diagram in Appendix 3.
Results
RQ1a: Rank Order and Reliability
Figure 1 presents the scale of all the writing texts ordered by their quality scores derived from the BTL model. Each point denotes a text and the error bar the corresponding 95% confidence interval. The quality scores

Figure 1. Rank order of the texts' writing quality (ascending order) using comparative judgment.
RQ1b: Rater and Text Misfit
The infit statistics were calculated for each rater and plotted in Figure 2A to identify any raters whose judgments were relatively different from the other raters. The mean infit across all the raters was 0.91 (

Figure 2. Infit mean square statistics for raters (A) and texts (B).
The infit statistics were also calculated for each text, and the mean text infit was 0.84 (
RQ2: Text Features and Judgment Making
Raters' judgment making was explored by analyzing the comments that justified their decisions when comparing text pairs. Table 1 gives the frequencies of the text features mentioned by the raters and the percentages of use by grouped aspect. Among the total of 763 arguments from all 16 raters, three writing aspects received the most attention (bolded in Table 1), namely, rhetorical organization (32.5%), English language use (18.6%), and task development (16.8%), together accounting for about two-thirds of all the justification arguments. Together with the aspects of idea contents (12.7%) and length production (3.0%), 83.6% of the raters' arguments can be directly related to the assessment criteria specified in the standard TEM8 scoring rubric (see Appendix 2). However, unlike the TEM8 scoring rubric, which puts more weight on content (ideas and argumentation), the raters in the study focused more on rhetorical organization.
Although no group difference was found in their misfit diagnostics, the experienced and novice raters demonstrated divergent patterns in how they arrived at their judgments, as reflected in their relative uses of writing aspects and text features in Table 1. First, both groups considered rhetorical organization the primary criterion for comparisons (30.1% for the experienced group and 34.3% for the novice group), but the novice raters predominantly focused on the presence of an introduction-body-conclusion structure, whereas the experienced raters also addressed the development within and between paragraphs as well as cohesion and coherence. Second, next to rhetorical organization, the experienced raters informed their decisions by comparing students' writing in terms of English language use (24.7%) and task development (22.9%), while the novice raters more frequently attended to layout (17.6%) and idea contents (15.1%). Third, the style of writing was also an important aspect for the experienced raters (9.6%), but it was the least frequently considered aspect for the novice raters (3.0%). In addition, the experienced raters were barely affected by handwriting and the length of a text during comparisons. The
RQ3: Inconsistency of Misfitting Rater and Texts
Rater 11 was identified as misfitting, making judgments relatively more inconsistent with those of the other raters. An inspection of Rater 11's profile and arguments showed that Rater 11 was a lecturer who completed 35 comparisons and provided 51 arguments for the decisions made. Compared with the non-misfitting raters, close to half of Rater 11's arguments for a favorable judgment were based on English language use (44.2%; more specifically, vocabulary/expression and syntax), followed by rhetorical organization focusing exclusively on the introduction-body-conclusion structure (32.7%). Idea contents (13.5%) and argumentation (13.5%) were the other two frequently considered aspects. These patterns suggest that Rater 11 placed a heavy focus on basic writing aspects such as linguistic elements and text structure, which set his/her judgments apart from those of the other raters. One possible explanation could be that Rater 11 was teaching the language to non-English-major students for functional purposes, whereas the other raters were all teaching English-major students at more advanced and academic levels or were English-major students themselves.
Texts 77 and 82 were identified as misfitting, indicating a lack of agreement among the raters' comparative judgments regarding their quality. A close examination of the CJ trails showed that both texts were compared with 13 different texts and each received five unexpected decisions, that is, being judged better than a stronger opponent or worse than a poorer one. Table 2 lists arguments from raters' justifications for the comparative decisions involving the two misfitting texts. Text 77 seemed comparable to its opponents in organizational structure, handwriting, and idea contents, but its effective use of vocabulary stood out for some raters. Text 82, on the other hand, seemed to have an advantage in clear and logical reasoning, but might be considered mediocre in idea contents and linguistic performance. Given the raters' varying foci reported above, it is possible that when a text presented an imbalanced performance with one outstanding feature, that unique characteristic was highly appreciated by some raters and led them to a favorable judgment of the overall quality of the text, producing a judgment that deviated from the shared consensus.
Table 2. Raters' Justifications for Comparative Judgments Involving Two Misfitting Texts.
Note. When the text was considered inferior, the justifications were made regarding the opponent text.
Discussion
Underpinned by the psychological law of comparative judgment (Thurstone, 1927), the method of CJ uses pairwise comparisons to elicit relative judgments without confining raters to a predefined rubric, and has been emerging as a viable alternative to rubric-based assessment. Given the persistent challenge of rater variation in L2 writing assessment, the present study evaluated CJ as an alternative assessment method for L2 writing with a group of raters with varying professional expertise and scoring experiences. The reliability, raters’ judgment making (validity), and misfit feedback of the CJ procedure were examined to shed light on the quality and applicability of CJ for L2 writing assessment.
How Reliable Are the CJ Results and to What Extent Were Raters (Texts) Consistent with the Shared Consensus?
In this study, neither predefined rubrics nor general competence descriptions of L2 writing were provided to the raters. Yet the CJ assessment yielded a moderate scale separation reliability (SSR) of 0.64, suggesting a reasonable degree of internal consistency in the resulting rank order of the writing texts. Since the SSR statistic can also be interpreted as inter-rater reliability (Verhavert et al., 2018), the result suggests a level of agreement among raters comparable with what might be expected in typical rubric-based writing assessments (Bramley & Vitello, 2019); notably, this was achieved without a rubric or rater training, which speaks to the practical advantage of CJ over rubric scoring. Furthermore, the holistic nature of the CJ method resembles the frequently used holistic marking method, but the rater variation due to differential interpretation and application of rubric criteria in the latter is replaced by relative judgments on pairs of texts in the CJ procedure, leading to increased consistency of the judgments made. The rater misfit diagnostics showed that all the raters (except one) in the study, despite their differences in professional expertise and rating experience, were able to make comparative judgments conforming to the shared consensus. In this sense, CJ provides an alternative approach to the challenge of rater variation, not by training raters to adopt a similar decision-making process with rubrics, but by canceling out differences in the absolute standards between and within raters to create a shared consensus.
On the other hand, with 12 to 17 (median = 13) comparisons per text, the SSR of 0.64 obtained in the study was lower than the targeted level of 0.70 suggested by Verhavert et al. (2019), as well as lower than the values reported in most previous CJ research (Steedle & Ferrara, 2016; Verhavert et al., 2018). This could be due to the varying expertise of the raters in the study. The results showed no differences between the experienced and novice raters in their misfit tendency, which is in line with findings that novices, even students with subject knowledge, can perform CJ as reliably as experts (S. R. Bartholomew et al., 2018; Jones & Alcock, 2014). However, with relatively lower discriminating capability, novice raters need more rounds of comparisons to reach the same reliability level as their experienced counterparts (Kimbell, 2022; Verhavert et al., 2019). Consequently, it is possible that with a similar number of comparisons performed by the novice and experienced raters in this study, the judgment consistency of the former was lower than that of the latter, affecting the overall reliability achieved. Given that the number of comparisons per text is one key factor affecting the SSR (Bramley, 2015; Kimbell, 2022; Verhavert et al., 2019), a higher level of reliability could eventually be achieved by increasing the rounds of comparisons.
Another possible explanation for the moderate SSR could be the homogeneity of the writing texts. Based on the teachers' knowledge of the students' learning and of classroom formative assessment activities, the students involved in the study had very similar levels of English writing proficiency. The narrow "gap" between two paired texts made it more difficult for the raters to consistently distinguish the better one (Kimbell, 2022). It could be beneficial to include a set of well-established threshold benchmarks from rubric-based assessment among the CJ texts to facilitate mapping the CJ rank order onto rubric ratings (see Limitations).
How Did the Raters Arrive at (In)consistent Judgments When Comparing L2 Writing?
The thematic analysis of raters' judgment making arguments showed that, at the individual level, raters varied substantially in which text features they used as criteria to compare single pairs of texts, but collectively the representation derived from the iterative process of pairwise comparisons aligned closely with the established scoring rubric. All the criteria components of the scoring rubric were covered by the arguments justifying raters' decisions, and more than 83% of the raters' arguments related directly to the assessed criteria in the rubric, providing strong evidence for the construct validity of CJ. In addition, raters in the study also attended to aspects that were not addressed by the established rubric but are relevant and important for English writing in general, such as the style and persuasiveness of the writing. This is consistent with previous findings that CJ often accommodates a broader range of evaluation criteria than specific rubrics (S. R. Bartholomew et al., 2018; Lesterhuis et al., 2018; van Daal et al., 2019). Without reference to (or the limitation of) predefined criteria, raters can tap into their expertise, apply their own conceptualizations of writing to the reading of texts, and deliver a professional judgment. Thus, our results provide empirical support for the construct validity of CJ relying on raters' expertise and experiences, which yielded a more comprehensive representation of what good L2 writing looks like than a single rubric specification.
In addition, the results add to the CJ literature by highlighting differences between the experienced and novice raters in how they conformed to the shared consensus. The novice raters tended to compare texts in terms of more rule-applying, lower-order aspects of writing (Cumming et al., 2001; text structure and layout) and idea contents, whereas the experienced raters informed their judgments mostly by higher-order aspects (rhetorical development, argumentation, and style) and were less affected by construct-irrelevant factors (handwriting or text length). These results are similar to findings in the literature on rubric-based L2 writing assessment that raters evaluate texts with various foci (Cumming, 1990; Eckes, 2008) and at different levels (Wolfe, 1997; Zhang, 2016). While variations in raters' criteria were assumed to produce inconsistent judgments in rubric scoring (Eckes, 2008; Huot, 1990), this study illustrates that CJ offers a feasible means to integrate these rater variations by aggregating all the judgments made on one text to produce a global evaluation of its quality. With varied judgment criteria, the novice raters contributed to the shared consensus in terms of
Where Does the Misfit Lie?
Misfit indicates a pattern of raters making (or texts receiving) unexpected judgments consistently. In this study, one rater and two texts were identified as misfitting, demonstrating relatively more deviation from the shared consensus than the others. It should be noted that CJ accommodates the heterogeneity of judgment making processes by pooling together all the raters' judgments, and thus the misfit statistics are better interpreted relatively rather than absolutely (Pollitt, 2012b). The misfitting rater was the only judge whose teaching was not related to English majors, suggesting that the level at which raters currently teach could be an important factor in promoting consistent judgments. This finding echoes the results of Whitehouse and Pollitt (2012), who showed that raters teaching at a similar level were more likely to share a common set of standards on which to base their judgments. A significantly misfitting rater may also have practical implications, as how raters conceptualize L2 writing can influence what and how they teach students to develop L2 writing skills. The study illustrates that raters' misfit statistics in CJ can serve as a useful indication of their divergent conceptualizations of the construct (or of other skills), so that appropriate adjustments in their teaching can be made to ensure that students develop the competence according to criteria that are generally valued rather than a single rater's idiosyncratic standards.
It was also observed in the study that when a salient feature was present in a text, the raters tended to disagree more, leading to more divergent judgments. Text 77 had the prominent feature of good vocabulary and Text 82 that of fine logical reasoning; the appreciation of these particular features may have led some raters to draw a favorable but inconsistent generalizing conclusion about the overall quality of the texts. This judgment making behavior is commonly known as the halo effect (Thorndike, 1920). The finding suggests that raters in a CJ procedure can also be subject to the halo effect, as often reported in rubric-based assessment (e.g., Engelhard, 1994; Myford & Wolfe, 2004). Since advocates of CJ sometimes claim that, in the absence of a rubric, rater training and standardization meetings are unnecessary, the results of this study point to the importance of providing basic guidance to raters on assessing L2 writing prior to the CJ procedure.
Implications
The present study also has implications for instructors and administrators of English major programs. It can be a great challenge to develop an effective scoring rubric for L2 writing that enhances consistency in evaluation and also promotes learning and instruction, and high levels of reliability have primarily been sought with externally developed standardized assessments (Humphry & Heldsinger, 2020). This study has shown that CJ enables raters to utilize their professional expertise and perform valid evaluations of texts at a satisfactory level of reliability, even with a small group of raters and without extensive negotiation. This offers instructors and program administrators a time-efficient and economical means of delivering reliable classroom- and program-based evaluations. Furthermore, the studies of Humphry and Heldsinger (2020) and Marshall et al. (2020) demonstrated that if threshold benchmark texts from standardized tests or large-scale national assessments are anchored on the constructed CJ scale, CJ can help raters improve the quality and criterion validity of their program-based assessments.
Additionally, the process of actively comparing texts and reflecting on the reasons for one's decisions points to CJ's potential pedagogical benefits for learning and instruction. For instance, when students engage in CJ as raters themselves, concrete examples of writing can help them better understand the abstract construct of writing competency and what is meant by "good" and "better" (Kimbell, 2022). Thus, besides the advantage of improved psychometric properties as an assessment method, the potential of CJ as a feedback tool and learning experience in L2 writing is also an interesting topic for future research.
Limitations
The first limitation of this study is the sample of L2 writing used for the CJ assessment, which can be considered a convenience sample. The practice test from which the writing texts were collected had been administered in an evening session one month before the actual TEM8 took place. It is possible that the students were insufficiently motivated or were experiencing fatigue during the practice session, causing the overall quality of the writing to be relatively low. With limited "gaps" between text pairs, it might consequently have been more difficult for the raters to consistently discriminate the winner of a pair.
The second limitation relates to the investigation of the consistency of judgments between the two rater groups. The whole CJ assessment was carried out jointly by the two rater groups, making it less straightforward to calculate the SSR for each group separately. Splitting the data would further reduce the number of comparisons per text and the overlap between text pairs, leading to even lower consistency of the results. A design in which the two rater groups complete the whole procedure independently would enable a direct comparison of their judgment consistency. Additionally, the small sample of raters limits the generalizability of the results.
In the present study, no comparison was made between the CJ results and traditional rubric ratings, owing to the unavailability of reliable rating scores and benchmark exemplars. Since the present study primarily focused on the (in)consistency among raters/texts and on raters' comparative judgment making to establish CJ's construct validity in assessing L2 writing, subsequent studies can include a set of established threshold benchmarks from traditional rubric rating as anchor points on the CJ rank order, comparing CJ results with rubric ratings to examine the criterion validity of CJ.
Lastly, the current study explored CJ for assessing L2 writing by focusing on psychometric properties. Another important aspect, namely how the raters experienced this novel method, was not addressed. As raters' perceptions of and attitudes toward an assessment method can influence their use of the method and their confidence in their judgments, future research should explore this perspective using methods such as surveys and interviews.
Conclusion
Rater variation has been a persistent concern for language writing assessment. Instead of further optimizing scoring rubrics and training practices, the current study evaluated the method of CJ as an alternative approach to rater variation, underpinned by relative judgments and a shared consensus built from raters' collective expertise. With an iterative pairwise comparison process and statistical estimation, CJ holds promise for accommodating rater-associated heterogeneity of evaluation on the one hand and obtaining a reliable and valid rank order of writing texts on the other. Despite the relatively small number of raters and their varying professional expertise, the assessment outcomes showed a satisfactory level of reliability and validity. Meanwhile, the study also highlighted the necessity of providing relevant guidance to raters, if not explicit training, to enable more valid evaluations. The intuitive rationale and simple procedure of CJ reduce the need for time-consuming rater training (S. R. Bartholomew et al., 2018; Pollitt, 2012a; Whitehouse, 2013), but information about the targeted writing skills and common pitfalls such as the halo effect can help raters avoid being affected by construct-irrelevant factors and judgment biases.
To the author's knowledge, this is one of the first studies to systematically investigate which text feature(s) raters consider when delivering a comparative decision in the context of L2 writing assessment with CJ. It contributes to the existing literature by providing empirical evidence for the less examined construct validity of CJ in the context of L2 writing, and by adding a novel perspective to the discussion of rater variation in the rater-mediated assessment literature. With the growing interest in CJ in the educational assessment field, more elaborate investigations of the quality and effectiveness of CJ are needed to foster confidence in its use.
Ethical Considerations
The study was approved by the Research Ethics Committee of Shaoxing University (No.2023WGY06) on February 27, 2023. The participants provided their written informed consent to participate in this study.
Author Contributions
Qian Wu contributed to the Conceptualization, Methodology, Formal analysis, Investigation, Drafting and Revision, and Funding acquisition of the study. The author thanks Qi Luo for assisting in collecting, scanning, and uploading the text materials.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the university research grant (NO. 13011002002/114) and the project grant (No. 2023SK008) from Shaoxing University.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
