
Abstract

This longitudinal study (2002–2014) investigates the stability of rating characteristics of a large group of raters over time in the context of the writing paper of a national high-stakes examination. The study uses one measure of rater severity and two measures of rater consistency. The results suggest that the rating characteristics of individual raters are not stable. Thus, predictions from one administration to the next are difficult, although not impossible. In fact, as the membership of the group of raters changes from year to year, past data on rating characteristics become less useful. When the membership of the group of raters is retained, the community of raters develops more stable characteristics. However, “cultural shocks” (low retention of raters and large numbers of newcomers) destabilize the rating characteristics of the community and predictions become more difficult. We propose practical measures to increase the stability of rating across time and offer methodological suggestions for more efficient rater effect-related research designs and analyses.

Keywords

Communities of practice, consistency, rater experience, raters, severity, stability of rating




The longitudinal stability of rating characteristics in an EFL examination: Methodological and substantive considerations
