Abstract
Conventional cross-country scoring reliability in international large-scale assessments often depends on double scoring, which typically involves relatively small samples of multilingual responses. To extend the reach of reliability estimation, this study introduces the Linguistic-integrated Reliability Audit (LiRA), a novel method that measures scoring reliability using an entire dataset in a large-scale, multilingual context. LiRA automatically generates a second score for each response by analyzing its semantic alignment within a neighborhood of similar responses, then applies weighted majority voting to determine a consensus score. Results demonstrate that LiRA provides a more comprehensive and systematic estimation of scoring reliability at the item, country, and language levels, while preserving the fundamental concepts of traditional reliability.
Introduction
Constructed-response (CR) items are considered more aligned with capturing students’ comprehension, reasoning, and higher-order skills than selected-response formats such as multiple-choice (MC) items. Consequently, a substantial portion of large-scale educational assessments, such as the Progress in International Reading Literacy Study (PIRLS), the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the National Assessment of Educational Progress (NAEP), include CR items. Scoring responses to CR items, whether by human scorers or automated systems, can be resource-intensive and vulnerable to inconsistency, particularly when interpretation is required. Scoring reliability, which refers to the consistency and accuracy of scoring student responses, is a fundamental aspect of obtaining high-quality data. Ensuring that different scorers assign the same score to the same response is essential to maintaining the validity of the assessment results (Shin et al., 2019).
Concerns about the reliability of human scoring in educational contexts have a long history. Early work demonstrated inconsistencies in teachers’ essay scoring, highlighting the subjective nature of grading in the absence of standardized criteria (Starch & Elliott, 1912, 1913a, 1913b). Subsequent studies confirmed that extraneous factors such as grammar, handwriting, or even the perceived neatness of an essay can significantly influence scorers’ evaluations (Huck & Bounds, 1972; Scannell & Marshall, 1966). These findings motivated the development of scoring rubrics and structured scoring frameworks to impose objectivity and alignment across scorers.
Empirical research has demonstrated that structured scoring protocols, especially those incorporating task-specific rubrics, example scripts, and scorer calibration exercises, substantially improve inter-rater agreement (Cooper et al., 1977; Jonsson & Svingby, 2007). However, residual variability in scoring remains a persistent concern. Braun’s (1988) calibration studies showed that partially balanced incomplete block (PBIB) designs can isolate systematic sources of variation, such as reader severity and day-to-day drift, without requiring redundant readings of the same essays. The study argued that even trained raters exhibit systematic differences in scoring severity and that statistical adjustments to account for such differences can enhance reliability more cost-effectively than double-scoring procedures.
Building on these insights, Stemler (2004) proposed a comprehensive framework for inter-rater reliability, distinguishing among three key perspectives: consensus estimates, defined as exact agreement between raters; consistency estimates, assessed through correlational alignment; and measurement estimates, which encompass more complex modeling approaches such as generalizability theory (Cronbach, 1946; Shavelson & Webb, 1981). Each perspective offers distinct methodological tools, underscoring that scoring reliability is not a singular construct but rather a multidimensional concept. Furthermore, no single safeguard, such as double scoring, is sufficient to fully control for rater effects. Instead, robust quality assurance requires a multistep approach that incorporates scorer selection, training, calibration exercises, ongoing statistical monitoring, and procedural controls (Graham et al., 2012).
These challenges of maintaining scoring consistency, ensuring fairness across diverse student populations, and managing the resource demands of human scoring are especially pronounced in international large-scale assessments (ILSAs). In multilingual assessment contexts, human raters must be trained across a variety of languages and cultural frameworks to ensure that scoring is valid and comparable across countries and cycles (Mazzeo & von Davier, 2008, 2014). Traditionally, ILSAs have addressed these challenges by implementing rigorous scoring procedures: centralized rubric (scoring guide) development, international and national scoring training, within-country and cross-country scoring calibration, and ongoing quality monitoring using double scoring and inter-rater reliability statistics. These protocols are resource-intensive, but they have proven effective in producing high levels of scoring reliability in operational practice.
A prime example of this approach is found in PIRLS, where CR items comprise approximately half of the total score points. Given the reliance on human judgment in scoring these items, PIRLS has developed a comprehensive, multilayered scoring and quality assurance system. This includes coordinated international and national scorer training, the use of software-supported scoring platforms, and systematic reliability studies conducted at the national and cross-national levels and across assessment cycles, all aimed at ensuring consistent application of scoring criteria across countries and over time.
Scoring of CR items in PIRLS is guided by detailed scoring guides developed by the TIMSS & PIRLS International Study Center. These guides provide precise criteria and annotated examples for each score category, allowing scorers to apply the guides consistently across diverse response types. Since the transition to the computer-based assessment in PIRLS 2021, all scoring activities have been conducted online through IEA’s CodingExpert Software, which integrates reliability tracking and scoring standardization functions (Johansone, 2024).
The scoring process begins with international training sessions where National Research Coordinators or their appointed representatives are trained directly by the International Study Center. During these sessions, participants practice scoring standardized responses, engage in group discussions to resolve discrepancies, and align their interpretations with the scoring guide. These trained coordinators then return to their countries and lead national-level training using both international materials and additional examples derived from their own country’s student responses. This layered approach is necessary because scoring within each country is carried out in the language(s) in which the test was administered. However, it assumes that international scoring training conducted in English can be converted into localized training in dozens of other languages without loss of fidelity.
PIRLS evaluates scoring reliability through three complementary approaches: within-country, trend, and cross-country scoring reliability (CCSR). Within-country reliability is assessed by having each participating country independently double-score a random sample of approximately 200 responses per CR item. The agreement between the two scores serves as a measure of scorer consistency. Trend reliability examines scoring consistency over time by having scorers in the current cycle re-score responses from the previous administration. CCSR, while more limited in scope, assesses scoring consistency across countries by selecting a scorer in each of the participating countries who is proficient in English to independently score a common set of 200 responses drawn from English-speaking countries. Unlike within-country and trend reliability procedures that are integrated in the operational scoring workflow, CCSR studies are conducted as a separate exercise following the completion of main scoring activities (Johansone, 2024).
While CCSR provides an important index of international scoring consistency, its validity is limited by the need to use English-only responses and is constrained by several structural and logistical challenges. Most notably, language restrictions limit the selection of responses to English, excluding a large portion of the PIRLS population and leading to an incomplete representation of global scoring comparability. Moreover, scorers from participating countries must be proficient in English, which further restricts the scorer selection process and may result in a less representative pool of raters. Sample constraints also limit the generalizability of findings, as CCSR is based on a fixed subset of responses (usually 200 per item) for a limited number of items, which may not capture the full diversity of item types, response patterns, or student populations. In addition, the approach lacks scalability, as double scoring and international coordination are resource-intensive and become increasingly burdensome as digital platforms and participating countries expand. Finally, the method constitutes a static evaluation, offering only a snapshot of agreement without the capacity for continuous, response-level diagnostics or adaptive quality control. These limitations underscore the need for a scoring reliability framework that preserves the foundational role of human judgment while extending the scope, granularity, and inclusivity of reliability estimation.
Recent studies further highlight both the strengths and limitations of current practices. For instance, Shin et al. (2019) reported near-perfect correlations between human and automated scores in the PISA 2015 science assessment, suggesting that well-designed scoring guides and monitoring protocols can effectively control for rater variance in many settings. However, the same study uncovered residual severity biases in a subset of low-performing countries, reinforcing the importance of empirical, data-driven mechanisms for ongoing quality checks. Similarly, Ercikan and Lyons-Thomas (2013) argue that scoring comparability in multilingual assessments cannot rely on training and scoring guides alone; it must be supported by empirical validation across linguistic and cultural contexts. These findings collectively emphasize that while existing scoring procedures have demonstrated strong reliability, they remain logistically challenging and labor-intensive, financially demanding, and inherently limited by their reliance on subsample-based double scoring.
Despite being considered the gold standard for estimating inter-rater reliability, double scoring introduces practical trade-offs. Scoring each response twice by independent human raters, especially in large-scale, multilingual assessments, requires significant investment in human labor, training, and monitoring infrastructure. Moreover, since double scoring is typically limited to a small number of responses per item, it raises important concerns about the representativeness of reliability estimates, particularly for rare, ambiguous, or culturally nuanced responses. These persistent constraints motivate the need for more scalable, inclusive, and adaptive frameworks for scoring reliability. To address these limitations, we propose a novel semantic similarity-based approach that maintains the foundational role of human scoring but redefines how scoring reliability is assessed. Rather than relying exclusively on duplicated scores, our approach leverages the semantic relationships between responses to infer the expected consistency in human scoring.
Although this study focuses on evaluating the reliability of human scoring, the proposed semantic similarity framework is equally applicable to assessing the internal consistency of automated scoring systems. In such a context, one could evaluate whether an artificial intelligence (AI) model assigns the same score to semantically similar responses; this would be an indicator of model stability and fairness. This adaptation would allow for checking the coherence of machine-scoring outputs without requiring ground truth labels for every response. While we do not pursue this direction in this paper, the underlying architecture is designed to support such extensions as well.
Background
Double Scoring for Scoring Reliability
Double scoring by independent raters is a foundational practice in educational measurement to reduce rater bias and improve score consistency (Williamson et al., 2012). This method creates a system of checks and validation for reliable scoring. Yet, implementing double scoring is resource-intensive, demanding significant investments in time, effort, and costs for recruiting, training, and continuously monitoring multiple raters (Gwet, 2014; Wiggins, 1990). These logistical hurdles are amplified in ILSAs, where large response volumes, coupled with diverse languages and cultural contexts, increase operational complexity. Despite these considerable investments, prior studies indicate that double scoring often yields only modest reliability gains, and rater effects and inconsistencies can persist (Ofqual, 2014; Song & Lee, 2022).
To address the resource intensity of full double scoring, partial double scoring has emerged as an alternative, where only a random subset of responses is evaluated by a second rater (Miao & Cao, 2019). While this reduces costs and time, it may allow scoring errors or inconsistencies to remain undetected in the responses that are not double-scored. This is especially problematic in high-stakes assessments, where accurate student classification is critical (Finkelman et al., 2009). A more refined strategy, targeted double scoring (TDS), attempts to maximize scoring reliability by focusing second evaluations on responses near critical score boundaries, such as pass/fail cut scores (Finkelman et al., 2009; Sinharay et al., 2023). However, the effectiveness of TDS hinges on the accurate identification of these critical score boundaries; misidentification can reduce its benefits. Furthermore, recent research suggests that TDS might not consistently outperform simpler partial double scoring in terms of overall psychometric improvements (Xu & Wind, 2025). Given these well-documented limitations of double scoring, our novel approach focuses on estimating the reliability of initial human scoring, rather than depending on a second, additional human score.
Semantic Text Similarity
Semantic text similarity (STS) is a crucial natural language processing (NLP) technique that measures the degree of shared meaning between two texts. Driven by AI advancements, STS has become vital for numerous tasks, such as text summarization, classification, question-answering, machine translation, and information retrieval (Chandrasekaran & Mago, 2021; Han et al., 2021; Harispe et al., 2022). In education, STS is applied to analyze classroom discourse, assess adherence to educational intervention protocols, and evaluate alignment between different content standards (Anglin et al., 2021; Boyle & Crossley, 2024; Butterfuss & Doran, 2025; Khan et al., 2021). Recently, Ayaan and Ng (2025) showed that STS improves the accuracy of automated scoring for open-ended responses by aligning student responses with teacher-defined reference responses.
Early methods for STS primarily relied on surface-level lexical overlaps, such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). These techniques convert text into numeric vectors to calculate similarity (Singh et al., 2024). BoW simply counts word frequency, while TF-IDF assigns weights to words based on their importance within a document relative to their frequency across a larger collection of texts. However, a major limitation of these methods is their inability to capture nuanced meaning, as they ignore context and relationships between words (Chandrasekaran & Mago, 2021; Dai et al., 2024). For example, sentences like “She bought a new car” and “She purchased a new automobile” would receive low similarity scores despite their semantic equivalence, because they use different words.
Recent advances in STS have overcome these limitations by leveraging transformer-based models, like Bidirectional Encoder Representations from Transformers (BERT), which generate contextualized embeddings for more accurate similarity assessments. Sentence-BERT (SBERT) is fine-tuned to produce meaningful sentence embeddings, accelerating STS and clustering tasks without sacrificing accuracy (Prashanth, 2025; Reimers & Gurevych, 2019). MPNet further enhances embedding quality by mapping text to a 768-dimensional vector space; models like all-mpnet-base-v2 show strong performance on various semantic tasks (Galli et al., 2024; Sonavane et al., 2024; Song et al., 2020). Among these cutting-edge models, MiniLM (e.g., all-MiniLM-L6-v2) stands out as a compact solution for processing large volumes of text (Wang et al., 2020). Sajja et al. (2025) highlight its effectiveness in educational applications due to its competitive performance, lightweight design, and low-latency inference. Given this optimal balance of performance and efficiency, we adopted all-MiniLM-L6-v2 for our STS analysis.
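As a quick illustration of this contrast, the following sketch compares lexical and contextual similarity on the sentence pair above. It assumes scikit-learn and sentence-transformers are installed; the variable names are ours, not from the original pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util

pair = ["She bought a new car", "She purchased a new automobile"]

# Lexical similarity: only the overlapping words ("she", "new") contribute,
# so the score is low despite the semantic equivalence of the two sentences.
tfidf = TfidfVectorizer().fit_transform(pair)
print("TF-IDF CoSim:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Contextual similarity: MiniLM embeddings capture the shared meaning,
# yielding a much higher score for the same pair.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(pair)
print("MiniLM CoSim:", util.cos_sim(emb[0], emb[1]).item())
```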
Voting Strategies
Ensemble learning is a powerful machine learning technique that combines the predictions from multiple models or classifiers to boost overall performance and reduce errors inherent in single-model predictions (Dong et al., 2020; Rokaya & Alsufiani, 2024). This approach leverages the collective strengths of different models, with research showing that ensemble methods outperform individual models in terms of accuracy and robustness (Rojarath et al., 2016). Ensemble learning is typically categorized as either hard voting or soft voting (Kumari et al., 2021).
Hard voting, or majority voting, is a straightforward ensemble technique where each model casts a vote for a specific label. The label with the most votes becomes the final prediction (Breiman, 1996; Freund & Schapire, 1997; Qamar et al., 2016; Zhu, 2015). While transparent and simple to implement, hard voting has limitations. It assumes all models are equally reliable, which can allow less accurate models to disproportionately skew the results. There is also a risk that a simple majority of less accurate models could, by chance, override predictions from better-performing models (Shahzad & Lavesson, 2013). In education, hard voting mirrors traditional adjudication processes, where a third rater is brought in to resolve discrepancies among human and/or machine scores (McCaffrey et al., 2022; Williamson et al., 2012). However, if these raters possess varying levels of expertise, treating their scores equally can compromise the reliability of the final score, echoing the limitation seen in hard voting.
Soft voting, in contrast, aggregates predicted probabilities from multiple models. It then selects the label with the highest average probability as the final prediction (Delgado, 2022; Mohammed & Kora, 2023). Weighted soft voting, also referred to as weighted majority voting, advances this approach by assigning weights to each model’s contribution according to its accuracy or confidence level. The final label is determined by the highest sum of these weighted probabilities (Sherazi et al., 2021). This approach is often preferred because it prioritizes stronger, more reliable models, typically resulting in improved predictive performance (Li & Luo, 2020). While our approach shares structural similarities with ensemble voting, particularly weighted soft voting, it does not involve multiple models.
In an education context, weighted soft voting has practical applications. Rupp et al. (2019) illustrated its utility in the Graduate Record Examinations (GRE) analytical writing section, where a final score is a weighted average that weighs human scores when differences between human and machine scores are minimal. Sofjan et al. (2023) also identified a strong correlation between weighted peer scores and instructor evaluations, suggesting the potential of weighted peer ratings to enhance reliability in collaborative settings. Furthermore, Malik and Jothimani (2024) proposed the innovative Feature X model, which leverages a confidence-weighted fusion voting classifier (CWFVC) employing weighted soft voting to improve student performance prediction. This study utilized weighted soft voting, termed “weighted majority voting,” to determine final scores for student responses based on their semantic neighbors.
Method
Dataset
PIRLS assesses fourth-grade reading comprehension, tracking reading achievement across more than 50 countries for over two decades. Its framework centers on two main purposes of reading: for literary experience and to acquire and use information. In reading for literary experience, students engage with narrative fiction to explore events, feelings, and language. Reading to acquire and use information focuses on students’ experiences with informational texts, including scientific, historical, or social materials in authentic contexts. The assessment aligns each item with one of four cognitive processes: focus on and retrieve, make straightforward inferences, interpret and integrate, and evaluate and critique (von Davier & Kennedy, 2024). In 2021, PIRLS introduced a digital format (digitalPIRLS), enhancing student engagement through more interactive assessments (Mullis & Martin, 2019). This study analyzed nine CR items from the digitalPIRLS 2021, using multilingual responses from all 27 participating countries. Details of the CR items analyzed are provided in Table 1.
Table 1. PIRLS 2021 CR Items Used in This Study.
Multilingual Response Translation
We employed a standardized scoring prompt template with GPT-4.1 to translate non-English responses into English and to correct spelling and grammatical errors in English responses. As detailed in Table 2, the prompt template comprised four essential components (Jung et al., 2025). This Zero-Shot Chain-of-Thought (Zero-Shot CoT) approach is generalizable (Kojima et al., 2022), ensuring flexible application across various assessment items for context-appropriate translations. In addition, when GPT-4.1 translated a term in some instances but left it untranslated in others, we replaced those untranslated instances with the translated version.
Table 2. Scoring Prompt Template for PIRLS.
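To make the setup concrete, the following is a minimal sketch of such a translation call, assuming the OpenAI Python client. The system prompt below is a simplified stand-in of our own devising, not the operational four-component template in Table 2.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Simplified stand-in for the Zero-Shot CoT prompt template in Table 2.
SYSTEM_PROMPT = (
    "You translate fourth-grade students' reading-assessment responses into "
    "English, correcting spelling and grammatical errors. If a response "
    "cannot be translated, reply exactly with: untranslatable. Think step "
    "by step, then output only the final translation."
)

def translate_response(raw_response: str) -> str:
    """Translate (or clean up) one student response with GPT-4.1."""
    completion = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,  # favor reproducible output in a scoring pipeline
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_response},
        ],
    )
    return completion.choices[0].message.content.strip()
```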
Response Flagging
Following translation, responses underwent a two-stage quality control process to identify and flag untranslated responses and to filter out semantically meaningless ones.
Untranslated Response Flagging
Untranslated responses were identified and flagged as “missing” based on two criteria, applied only to responses longer than eight characters: (a) the response was explicitly marked as “untranslatable” by GPT-4.1 during translation, or (b) the response contained less than 75% English vocabulary. To quantify the percentage of English vocabulary, linguistic preprocessing (e.g., lower-casing, tokenization, and lemmatization) was performed using spaCy’s en_core_web_lg model in Python (3.13.4). To prevent proper nouns (e.g., “America”) from being misclassified as non-English, we used named entity recognition (NER) to ensure they were counted as valid English vocabulary.
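A minimal sketch of this check, assuming spaCy’s en_core_web_lg model; the helper names are illustrative, and details of the operational pipeline may differ.

```python
import spacy

# en_core_web_lg provides tokenization, lemmatization, word vectors, and NER.
nlp = spacy.load("en_core_web_lg")

def english_vocab_ratio(text: str) -> float:
    """Proportion of alphabetic tokens recognized as English vocabulary."""
    doc = nlp(text)
    entity_tokens = {tok.i for ent in doc.ents for tok in ent}  # e.g., "America"
    tokens = [tok for tok in doc if tok.is_alpha]
    if not tokens:
        return 0.0
    english = sum(
        1
        for tok in tokens
        # Count a token as English if it belongs to a named entity or its
        # lower-cased lemma is in the model's vocabulary (not out-of-vocabulary).
        if tok.i in entity_tokens or not nlp.vocab[tok.lemma_.lower()].is_oov
    )
    return english / len(tokens)

def flag_missing(translated: str, marked_untranslatable: bool) -> bool:
    """Apply the two 'missing' criteria to responses longer than eight characters."""
    if len(translated) <= 8:
        return False
    return marked_untranslatable or english_vocab_ratio(translated) < 0.75
```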
Meaningless Response Flagging
After excluding missing responses, we flagged “meaningless” entries. These were defined as responses that were exceptionally brief and/or semantically incoherent relative to the rest of the dataset. Although these responses were automatically assigned a score of 0, they were retained in the dataset for subsequent analysis. To identify them, a composite meaningfulness score (Mi) was calculated for each response i. A response was flagged as “meaningless” if its Mi fell below a threshold of 0.30 and its Top3-CoSim (the average cosine similarity of the top three most similar responses) was less than 0.70. Cosine similarity (CoSim) measures the similarity between two vectors, ranging from 0 to 1, where a higher value indicates greater similarity (Han et al., 2012). The number of similar responses (k), a key hyperparameter, was set to 3 following a grid search.
The composite score (Mi) combines two key metrics: median-scaled translation length (Li) and coherence score (Ci). For median-scaled length, we scaled the character length of each translated response (li) using a min-median method to minimize the effect of extremely long outliers:
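One scaling consistent with this description, offered here as an assumed form rather than the exact operational formula, caps the scaled length at 1:

\[
L_i = \min\!\left(\frac{l_i - l_{\min}}{l_{\mathrm{med}} - l_{\min}},\; 1\right),
\]

where \(l_{\min}\) and \(l_{\mathrm{med}}\) denote the minimum and median character lengths across valid responses.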
The coherence score (Ci) was determined by averaging the semantic similarity between a given response and all other responses in the dataset, providing a global measure of how closely a response aligns with the overall response set. Under this measure, a very low coherence score indicates that a response is not only dissimilar to its neighbors but also out of context with the rest of the dataset:
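\[
C_i = \frac{1}{N-1} \sum_{\substack{j=1 \\ j \neq i}}^{N} \mathrm{sim}(E_i, E_j)
\]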
where N represents the total number of valid responses, and sim(Ei, Ej) indicates the CoSim between the embeddings of response i and j. We generated these response embeddings using the sentence transformer model (all-MiniLM-L6-v2) in Python.
Finally, Mi was calculated as a weighted sum of these two metrics, prioritizing semantic content by weighing coherence more heavily:
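With coherence weighted more heavily, the composite takes the general form

\[
M_i = w_C\, C_i + w_L\, L_i, \qquad w_C > w_L,\quad w_C + w_L = 1,
\]

where the specific weights are tuning choices of the method (not recoverable from the text above); the 0.30 flagging threshold is then applied to \(M_i\).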
Linguistic-Integrated Reliability Audit (LiRA)
We implemented LiRA to ensure consistent scoring for semantically similar responses (neighbors). Our framework estimates the reliability of the initial human score by evaluating its semantic alignment with a neighborhood of similar, human-scored responses. For each response, the three most semantically similar neighbors are retrieved. A weighted majority vote, an ensemble learning technique, is then applied to their scores to assess agreement with the original score. This method retains the conceptual core of traditional reliability assessment: determining the degree to which a score reflects a consensus judgment. However, it shifts from scorer redundancy to semantic redundancy, leveraging similarity in meaning rather than duplication in labor. It also enables reliability to be assessed at scale, for every response, rather than for a small double-scored sample.
Similarity-Based Weighted Majority Voting
First, we generated response embeddings using the all-MiniLM-L6-v2 model in Python and calculated CoSim for all embedding pairs. For each response i, we identified the three nearest neighbors based on the highest CoSim values. We then determined the majority score s*∈{0, 1, 2} using a weighted summation, where s* is the score that maximizes the sum of CoSim from response i to its top three neighbors sharing the same score:
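\[
s^{*} = \underset{s \in \{0,1,2\}}{\operatorname{arg\,max}} \; \sum_{j \in S_{is}} \mathrm{sim}(E_i, E_j)
\]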
where Sis is the set of the top three neighbors to response i that were assigned a human score of s. We assigned the majority score s* only if its proportion of the total weighted score exceeded a threshold of 0.60. Otherwise, the response was flagged as “inconsistent,” indicating that human scores among similar responses were too varied to assign a single majority score. This condition is defined as:
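\[
\frac{\sum_{j \in S_{is^{*}}} \mathrm{sim}(E_i, E_j)}{\sum_{s \in \{0,1,2\}} \sum_{j \in S_{is}} \mathrm{sim}(E_i, E_j)} > 0.60,
\]

where the denominator sums the CoSim of response i to all three of its nearest neighbors.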
The two key hyperparameters, the number of nearest neighbors (k) and the weight threshold (WT), were determined through a systematic grid search. This explored 28 unique combinations for k∈ {1, 2, 3, 4, 5, 10, 15} and WT∈ {0.60, 0.65, 0.70, 0.75} (Jung et al., 2025).
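A minimal sketch of the voting step under these settings (k = 3, WT = 0.60), assuming the sentence-transformers and scikit-learn libraries; function and variable names are ours. Applied to the mock item in the next subsection, it reproduces the hand-computed vote shown in Tables 3 and 4.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def lira_majority_scores(responses, human_scores, k=3, wt=0.60):
    """Similarity-based weighted majority voting over valid (non-missing) responses.

    Returns one entry per response: the majority score in {0, 1, 2}, or the
    string "inconsistent" when no score category clears the weight threshold.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(responses)
    sims = cosine_similarity(embeddings)
    np.fill_diagonal(sims, -1.0)  # a response is never its own neighbor

    results = []
    for i in range(len(responses)):
        neighbors = np.argsort(sims[i])[-k:]  # indices of the k most similar responses
        weights = {}  # summed CoSim per score category (the S_is sets in the text)
        for j in neighbors:
            weights[human_scores[j]] = weights.get(human_scores[j], 0.0) + sims[i, j]
        best_score = max(weights, key=weights.get)
        # Assign s* only if its share of the total weighted votes exceeds WT.
        if weights[best_score] / sum(weights.values()) > wt:
            results.append(best_score)
        else:
            results.append("inconsistent")
    return results
```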
LiRA Application to a Mock Item
To illustrate our LiRA approach, we consider a Turkish response from a mock item (see Appendix A): “Gölde her gün yüzmeye gitti. Suyun içinde oynamak çok eğlenceliydi,” which translates to “He went swimming in the lake daily. Playing in the water was very enjoyable.” The top three neighbors for this translated response are presented in Table 3.
Table 3. Cosine Similarity for Mock Item Responses.
We then calculated weighted scores for each score category (see Table 4). In this example, two similar responses (Responses 1 and 2 in Table 3) received a human score of 1, and one (Response 3 in Table 3) received a human score of 2. This resulted in weighted scores (sum of CoSim per score category) of 0.96 + 0.95 = 1.91 for human score 1 and 0.94 for human score 2. Since the weighted score proportion for human score 1 was 0.67 (1.91/[1.91 + 0.94]), which exceeded our threshold of 0.60, we assigned a majority score of 1 to this response.
Table 4. Weighted Score Analysis by Human Score Category.
In addition, Table 5 details the flags used in LiRA, including their descriptions and illustrative examples. These response-level diagnostics give human experts concrete responses from which to investigate sources of scoring inconsistency.
Table 5. Flag Description and Mock Examples.
Evaluation Metrics
We evaluated our LiRA approach across three aspects: (a) comparability with traditional CCSR values, (b) distribution of majority scores and flags, and (c) Top3-CoSim statistics.
We first assessed LiRA’s comparability with traditional CCSR by calculating weighted exact agreement (EA) between initial human scores and LiRA’s majority scores. Weighted EA assigns greater importance to matches (where human and majority scores align) with higher CoSim, calculated as the ratio of the summed average CoSim of matching responses to that of all valid responses. This analysis focused on valid responses only, excluding those flagged as “missing.” We computed weighted EA both across countries and by country to identify scoring patterns deviating from overall trends.
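In our notation, writing \(\bar{c}_i\) for the Top3-CoSim of response \(i\), and letting \(\mathcal{M} \subseteq \mathcal{V}\) denote the matching subset of the valid responses \(\mathcal{V}\), the weighted EA can be expressed as

\[
\text{Weighted EA} = 100 \times \frac{\sum_{i \in \mathcal{M}} \bar{c}_i}{\sum_{i \in \mathcal{V}} \bar{c}_i}.
\]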
Next, we examined the distribution of LiRA’s majority scores (0, 1, and 2) and proportions of “missing” and “inconsistent” flagged responses. We also calculated the percentage of responses flagged as “meaningless” within the human score 0 category to evaluate our meaningless detection algorithm’s accuracy.
Finally, we examined CoSim statistics to assess how effectively our STS measurement captured similar neighbors. Our key metric was Top3-CoSim (the average cosine similarity of the top three most similar responses) for each response. To summarize these values, we considered their central tendency (mean and median Top3-CoSim across responses), their between-response variability (the overall spread of Top3-CoSim across responses), and within-response consistency (the average dispersion of CoSim within each response, showing how tightly its nearest neighbors cluster).
Results
Comparability With Traditional CCSR
Table 6 illustrates the robust performance of LiRA and its close alignment with traditional CCSR metrics. For five of the six items with available CCSR values (Items 1, 3, 4, 6, and 9), LiRA’s weighted EA closely matched the corresponding CCSR values. This alignment is particularly noteworthy given that LiRA utilized the entire dataset (typically n = 12,000–15,000 per item) for more comprehensive estimates, whereas CCSR relied on a much smaller sample (n = 200 per item). For the remaining item with a CCSR value, Item 2, LiRA achieved approximately 5% higher agreement with initial human scoring than CCSR, suggesting that LiRA produced second scores more consistent with initial human scoring. Crucially, both methods identified a significant scoring issue with Item 9, which exhibited remarkably low reliability: 76.8% for CCSR and 75.6% for weighted EA. LiRA further demonstrated its scalability by providing high weighted EAs for items where CCSR values were unavailable (Items 5, 7, and 8), ranging from 86.2% to 95.8%.
Table 6. Comparison of Traditional CCSR and LiRA Weighted Exact Agreement (%).
Note. CCSR = cross-country scoring reliability; EA = exact agreement.
Table 7 presents the weighted EA achieved by LiRA across 27 countries for all nine CR items. This country-level breakdown enables a more detailed assessment of scoring reliability at both the item and country levels. The median weighted EA exceeded 85% for most items, indicating high scoring reliability across countries. However, notable variations in reliability emerged across both items and countries. Item 4, which requires the highest-order cognitive processes (i.e., evaluate and critique), exhibited notable cross-country variability in weighted EA, ranging from 76.1% to 89.4%. Even more pronounced challenges were observed with Item 9, consistent with its problematic CCSR of 76.8% and average weighted EA of 75.6%. Item 9 showed the highest cross-country variability (64.6%–85.3%), with more than half of the countries (14 out of 27) reporting weighted EAs below 75%. These results indicate persistent difficulties in achieving adequate scoring reliability for this specific item. Despite these item-specific disparities, a subset of countries (e.g., Countries V and Z) consistently demonstrated high reliability across most items.
Table 7. Weighted Exact Agreement (%) by Country.
Distribution of Majority Scores and Flags
Table 8 details the distribution of majority scores (0, 1, and 2) and response flags (“inconsistent” and “missing”). LiRA successfully assigned majority scores to over 98% of all responses across items. Both “inconsistent” (0.01%–2.28%) and “missing” (0.53%–1.38%) flags were rare. Notably, Items 4 and 9, which had lower CCSR and weighted EA values, showed the highest flag rates: 1.38% “missing” for Item 4 and 2.28% “inconsistent” for Item 9. These elevated flag rates align with their challenging scoring nature. Overall, the results show that LiRA effectively scored most responses while identifying potential response-level issues for expert review.
Table 8. Distribution of Majority Scores and Response Flags (%).
Moreover, the meaningless detection algorithm was found to be effective (see Figure 1), as nearly all flagged responses received a human score of 0. This indicates its strong capability to detect extremely brief, irrelevant, or off-topic responses. Interestingly, 93 meaningless responses for Item 4 (0.68% of 13,647 total) received a human score of 1; this finding is further examined in the Discussion section. Item 9 was the only item that had one meaningless response with a human score of 2.
Figure 1. Human Score Distribution of Meaningless Responses.
Figure 2 illustrates the high accuracy of our missing response detection algorithm. The majority of responses flagged as “missing” corresponded to a human score of 0, validating the algorithm’s performance. We observed only a minimal deviation in Item 1; 17 responses flagged as missing (0.11% of 15,535) received a human score of 1, indicating a slight divergence between algorithmic detection and human assessment. This discrepancy is addressed in the Discussion section.
Figure 2. Human Score Distribution of Untranslated (Missing) Responses.
Semantic Text Similarity Statistics
Table 9 presents STS statistics for the Top3-CoSim per response. LiRA consistently identified coherent neighbors across all responses, exhibiting minimal semantic variability (see Appendix B for plots): mean and median Top3-CoSim values were high (typically above 0.90), with very low standard deviations (SDs). The low average SD of the top three CoSim values within responses further demonstrates that these three neighbors for each response tend to form tight semantic clusters. Item 4, however, displayed relatively lower mean (0.8407) and median (0.8630) CoSim values, coupled with the highest SD of 0.1226. This pattern reflects greater semantic diversity among Item 4 responses, consistent with its recognized scoring complexities in other metrics.
Table 9. Cosine Similarity Statistics.
Discussion
Interpretation of LiRA’s Performance
This study demonstrates that LiRA provides an effective and innovative solution for measuring scoring reliability in large-scale, multilingual contexts. Unlike conventional CCSR, which depends on double-scoring small and unrepresentative samples, LiRA uses only initial human scores and applies similarity-based weighted majority voting. This approach addresses long-standing methodological concerns with double scoring reported in previous studies (e.g., Song & Lee, 2022; Wiggins, 1990) and marks a significant advance in reliability assessment. Moreover, by embedding reliability estimation within a semantic retrieval framework, LiRA supports additional use cases such as uncertainty flagging and anomaly detection, and allows for human-in-the-loop reevaluation.
The innovation of LiRA lies in its ability to automatically generate consistent “second scores” while preserving the fundamental concept of traditional reliability. It identifies semantic neighborhoods among responses from initial human scoring and subsequently applies weighted majority voting. This mechanism evaluates similarity on a continuous scale (CoSim, ranging from 0 to 1), assigning greater weight to highly similar neighbors. Consequently, LiRA can ascertain the most suitable score by considering semantic differences with enhanced granularity, a feature critical for scoring written responses. Moreover, LiRA draws on the entire dataset of initial scores for reliability estimation, providing a comprehensive analysis that overcomes the logistic limitations of subset-based CCSR approaches. This holistic approach not only yields reliability estimates comparable to, or even surpassing, established CCSR metrics but also enables more thorough and systematic evaluations of human scoring practices across all CR items and countries.
Despite LiRA’s impressive performance, the results also revealed item-specific challenges. For example, Item 4 proved particularly difficult to score, as evidenced by its lower CCSR (83.9%) and average weighted EA (83.2%). This difficulty appears closely related to its design as an “evaluate and critique” item. According to Mullis and Martin (2019), such items require students to make justified judgments about text content, structure, and language. Prior studies (Cook & Myers, 2004; Güler, 2014; Polat, 2020) have found that reliably assessing higher cognitive skills like evaluative reasoning or critical thinking is inherently challenging. This difficulty partly arises from the complex semantic judgments human raters engage in, which can lead to bias and variability in interpreting and scoring diverse possible responses.
Specifically, Item 4’s scoring guide accepted two distinct key ideas in correct answers and allowed a wide range of expressions for these concepts. Scoring was further complicated by subtle nuances that distinguished correct from seemingly similar, but incorrect, responses. For example, phrases like “they have to hunt alone,” “be self-reliant and strong,” and “need to be more independent and find food” were awarded credit. In contrast, similar responses such as “they need to live their own life” or “have to manage alone” did not earn credit due to nuanced distinctions. This highly nuanced scoring, along with syntactic and lexical diversity among correct responses, contributed to notable scoring inconsistencies across countries (weighted EA ranging from 76.1% to 89.4%) and resulted in the lowest mean Top3-CoSim (0.8407). Item 4’s scoring challenges were also reflected in its flagging rates: 93 meaningless responses received a human score of 1. Review of these cases showed they were often borderline responses subject to rater interpretation, or correct responses expressed with wording that differed from the rest of the dataset. This suggests that the meaningless detection threshold warrants further examination and potential adjustment, possibly lowering it below 0.30. Future work should involve a grid search incorporating more “evaluate and critique” type items to enhance generalizability to such item categories, and consider different thresholds based on item cognitive processes. Furthermore, while the key hyperparameters (k and WT) were optimized using the large-scale PIRLS dataset, fully deploying LiRA requires a grid search to validate their transferability to diverse data contexts, such as smaller or domain-specific datasets.
Item 9 emerged as the most challenging to score, as indicated by its low CCSR (76.8%) and the lowest average weighted EA of 75.6%. This item required students to infer two characteristics from an informational text, with the scoring guide permitting eight possible answers. This broad scope rendered Item 9 highly open-ended, as each characteristic could be expressed in various forms, from brief phrases to detailed inferences described in long sentences. High open-endedness is known to adversely affect scoring reliability (McCaffrey et al., 2022; Wolfe et al., 2016; Zhai et al., 2021), a pattern reflected in Item 9’s substantial cross-country variability (weighted EA ranging from 64.6% to 85.3%) and the highest inconsistency flag rate (2.28%). In addition, research has shown that greater variation in text features such as length, lexical diversity, and syntactic variety further increases scoring variability due to their impact on rater judgments (Palermo, 2022; Wolfe et al., 2016). The cognitive demand of Item 9, which required “straightforward inferences,” also heightened these challenges. Inference-based items that allow a wide range of valid concepts elevate cognitive demands on human raters, introducing greater interpretive variability and reducing scoring reliability (Leacock et al., 2013).
Interestingly, Item 1 showed that 17 responses (0.11% of 15,535 total) flagged as “missing” received a human score of 1, suggesting the presence of scorable content. These flagged responses contained significant misspellings or various spelling errors, often from less-resourced languages. Consistent with prior research (Robinson et al., 2023; Yan et al., 2024), translation and spelling issues from large language models (LLMs) were commonly observed in medium- to low-resource languages. These findings indicate the need for human review of flagged responses with human scores of 1 or 2. However, given their extreme rarity (0.11% of the dataset), this is expected to impose minimal operational burden on human-in-the-loop processes. Encouragingly, recent studies indicate that LLMs and transformer-based approaches can significantly improve translation and spelling correction in less-resourced languages, potentially reducing such errors in future LLM applications (Luitel et al., 2024; Turhan, 2025; Zhong et al., 2024).
Limitations and Future Research
This study has several limitations. First, all findings are based on translated responses that were generated and validated using GPT-4.1. Although LiRA’s flagging algorithms for “meaningless” and “missing” responses were generally effective, certain anomalies were observed, particularly with less-resourced languages exhibiting a variety of spelling errors. While such cases were infrequent, they underscore LiRA’s reliance on the translation quality of GPT models. Future work could explore an ensemble learning approach that utilizes multiple state-of-the-art LLMs (e.g., GPT-5 series, Claude 4.5, or Gemini models) to select the most suitable translation, particularly for less-resourced languages. In addition, future work could consider the use of modern multilingual encoders, including gemini-embedding-001, Qwen3-Embedding-8B, or Language-agnostic BERT Sentence Embedding (LaBSE) (Boizard et al., 2025; Feng et al., 2020; Nielsen et al., 2024), which can directly encode responses in multiple languages, potentially eliminating the need for English translation.
Second, while this study provided a qualitative overview of item-specific challenges, future research should adopt more systematic approaches to investigating these issues. For example, automated clustering of semantically similar responses could uncover patterns within or across countries/languages for specific items. Recent studies have shown that LLMs can facilitate high-quality, interpretable clusters while reducing computational complexity and manual effort (Huang & He, 2024; Miller & Alexander, 2025). Identifying such response patterns could help human content experts better understand sources of scoring deviations and address scoring inconsistencies more effectively.
Finally, this study computed STS using CoSim, a standard and widely adopted approach. For future work, we suggest exploring modern deep learning-based methods, such as BERTScore (Zhang et al., 2019), SimCSE (Gao et al., 2021), or Word and sentence Structure Mover’s Distance (WSMD) (Yamagiwa et al., 2022). These advanced techniques move beyond simple vector angles by considering contextual embeddings and token-level interactions, enabling the capture of deeper semantic nuances. Integrating one of these methods could thus enhance LiRA’s performance in evaluating semantic alignment with neighbors.
Conclusion
This study highlights LiRA as a robust, scalable, and resource-efficient solution for measuring scoring reliability in ILSAs like PIRLS. Employing similarity-based weighted majority voting, LiRA addresses traditional CCSR’s practical and methodological shortcomings, all while maintaining the essential role of initial human scoring. With its integrated flagging algorithms, LiRA provides detailed diagnostics at the response, item, and country/language levels. These features position LiRA as a valuable tool for reliability assessment in ILSAs, delivering granular insights into national scoring processes and aiding data quality and reporting enhancement.
Appendix A
Mock Item and Original Scoring Guide.
Item: When Jake was growing up, why was it easy for him to learn how to swim? Give two reasons.

Score 2: The response includes both reasons why it was easy for Jake to learn how to swim when he was growing up:
1. He had a lake close to his house.
• He swam in the lake every day.
• The lake was perfect for swimming.
2. His older brother taught him how to swim.
• His brother swam well enough to teach.
• His brother taught him swimming.

Score 1: The response includes one of the above reasons why it was easy for Jake to learn how to swim.

Score 0: The response does not include either of the above. Response may be vague, unrelated to the text, or repeat information in the question.
• Swimming is fun.
• Jake liked water.
Authors’ Note
This research adheres to the General Data Protection Regulation (GDPR) and uses data from the Progress in International Reading Literacy Study (PIRLS). The original data were collected under the governance of the International Association for the Evaluation of Educational Achievement (IEA) and the TIMSS & PIRLS International Study Center. Parental or guardian permission was obtained for student participation, and strict procedures for confidentiality and data protection were implemented. The resulting datasets were anonymized and made accessible exclusively for research purposes.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
