Abstract
Assessment rubrics have increasingly been developed and deployed to evaluate the quality of language interpreting, yet understanding of rubric-based interpreting assessment remains limited. This systematic review aims to: (a) catalog rubrics, (b) examine rubric design features, (c) understand rubric use, and (d) evaluate rubric utility. A rigorous review process, involving database searching, citation tracking, and targeted review of core literature, identified 80 unique rubrics comprising a total of 265 (sub-)scales. A comprehensive analysis revealed that: (a) among 11 potential sources informing rubric development, test-external sources (e.g., literature review) were used most often, whereas test-internal sources (e.g., performance samples) were consulted far less frequently; (b) assessments primarily used analytic and task-type rubrics with an average of five performance levels and four scoring criteria, and rubric descriptors generally incorporated observable indicators of interpreting quality, employing both descriptive and evaluative language; (c) rubrics were used by three main types of raters—interpreting practitioners, trainers, and students—to assess primarily spoken-language interpreting; and (d) rubric-based scores demonstrated moderate-to-high reliability and validity overall, though meta-analysis identified three significant moderators: correlation coefficient type, assessment criterion, and rubric length. These findings are expected to provide guidance for assessment practice and research in interpreting.