Abstract
Language performance assessments typically require human raters, introducing possible error. In international examinations of English proficiency, rater language background is an especially salient factor that needs to be considered. The focus of this study is whether rater language background introduces bias into writing performance assessment. Data for this study are ratings assigned by Michigan English Language Assessment Battery (MELAB) raters to compositions written by examinees of various language backgrounds. While most of the raters are native speakers of English, four have first languages other than English: two Spanish, one Korean, and one bilingual speaker of Filipino and Chinese (Amoy). Examinees were divided into 21 language groups. The IRT application FACETS was used to estimate and control for rater severity when calculating the amount of bias reflected in each rater’s set of ratings for each language group. Results show that the magnitude of the bias terms for all raters across all language groups was minimal, with little effect on examinee scores, and that there is no pattern of language-related bias in the ratings.
