Abstract
Two factors thought to contribute to consistency in rater scoring judgments were investigated: rater training and experience in scoring. The relative effects of scoring rubrics and exemplars on rater performance were also considered. Experienced teachers of English (N = 20) scored recorded responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses per session). Scores were analyzed using multifaceted Rasch measurement and traditional measures of rater reliability and agreement, and the frequency with which raters viewed exemplar responses was recorded. Prior to training, rater severity and internal consistency were already of a standard typical for operational language performance tests, but training increased inter-rater correlation and agreement as well as agreement with established reference scores. Additional experience gained after training appeared to have little further effect on raters’ scoring consistency, although agreement with reference scores continued to increase. The most accurate raters generally reviewed exemplar responses more often and took longer to make scoring decisions than the least accurate raters. These results raise questions regarding the relative contribution of scoring aids such as exemplars and scoring rubrics to desirable scoring patterns.
