Abstract
Two factors thought to contribute to consistency in rater scoring judgments were investigated: rater training and experience in scoring. The relative effects of scoring rubrics and exemplars on rater performance were also considered. Experienced teachers of English (N = 20) scored recorded responses from the TOEFL iBT speaking test prior to training and in three sessions following training (100 responses per session). Scores were analyzed using multifaceted Rasch measurement and traditional measures of rater reliability and agreement, and the frequency with which raters viewed exemplar responses was recorded. Prior to training, rater severity and internal consistency were already of a standard typical for operational language performance tests, but training increased inter-rater correlation and agreement as well as agreement with established reference scores. Additional experience gained after training appeared to have little further effect on raters’ scoring consistency, although agreement with reference scores continued to increase. The most accurate raters generally reviewed exemplar responses more often and took longer to make scoring decisions than the least accurate raters. These results raise questions regarding the relative contribution of scoring aids such as exemplars and scoring rubrics to desirable scoring patterns.
