Abstract
Assessing translation and interpreting (T&I) is essential in tertiary-level T&I education, professional certification, and foreign language testing. Recently, researchers have explored automating T&I assessment, with large language models (LLMs) emerging as a promising agent for automatic scoring. This study presents one of the first large-scale empirical investigations into the scoring reliability, severity, and validity of GPT-4o and DeepSeek-R1 in English–Chinese consecutive and simultaneous interpreting assessment. Using more than 500 pre-scored samples from the Interpreting Quality Evaluation Corpus (IQEC), the study configured eight e-raters per LLM, systematically varying three scoring parameters: reference availability (zero vs. four references), scoring granularity (segment vs. document-level scoring), and model randomness (temperature 0 vs. 1). A combination of correlation, linear mixed model, and Rasch analyses revealed that: (a) both LLMs demonstrated higher reliability than human raters; (b) DeepSeek-R1 applied significantly harsher scoring patterns than GPT-4o; (c) both LLMs achieved moderately strong correlations with human raters, with overall Spearman’s correlation coefficients ranging from .586 to .700; (d) GPT-4o exhibited higher scoring accuracy than DeepSeek-R1; and (e) LLM-based e-raters’ performance varied significantly across different scoring conditions. These results have important theoretical and practical implications, providing insights into optimizing LLM-based automatic scoring for interpreting and broader language assessment contexts.
