Abstract
A short training program for raters of an essay-writing task consisted of scoring 20 training essays with immediate feedback on the correct score. The same scoring session also served as a certification test for trainees. Participants with little or no previous rating experience completed this session, and the 14 trainees who passed an accuracy threshold proceeded to score additional essays. Performance of the newly trained raters was compared to that of 16 expert raters with extensive experience in scoring responses to the writing task. Results showed that scores from the newly trained raters exhibited measurement properties (mean and variability of scores, reliability and various validity coefficients, and underlying factor structure) similar to those of the experienced raters. Implications of initial training and screening for rater performance are discussed.
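The group-level properties compared here (score means, variability, and inter-rater reliability) are standard descriptive and correlational statistics. The sketch below is purely illustrative and is not the authors' analysis: the data, array names, and the use of average pairwise Pearson correlation as a simple reliability estimate are all assumptions introduced for the example.

```python
import numpy as np

# Hypothetical scores: rows are essays, columns are raters within each group.
# All values here are invented purely for illustration.
rng = np.random.default_rng(0)
true_quality = rng.normal(4.0, 1.0, size=50)                      # latent essay quality
new_raters = true_quality[:, None] + rng.normal(0, 0.5, size=(50, 3))
expert_raters = true_quality[:, None] + rng.normal(0, 0.5, size=(50, 3))

def group_summary(scores: np.ndarray) -> dict:
    """Mean, variability, and a simple inter-rater reliability estimate
    (average pairwise Pearson correlation between raters)."""
    n_raters = scores.shape[1]
    corr = np.corrcoef(scores, rowvar=False)          # rater-by-rater correlation matrix
    pairwise = corr[np.triu_indices(n_raters, k=1)]   # off-diagonal correlations only
    return {
        "mean": scores.mean(),
        "sd": scores.std(ddof=1),
        "inter_rater_r": pairwise.mean(),
    }

print("Newly trained:", group_summary(new_raters))
print("Experienced:  ", group_summary(expert_raters))
```

Comparing the two dictionaries side by side mirrors, in miniature, the kind of group-level contrast the abstract describes; the actual study also examined validity coefficients and factor structure, which are beyond this sketch.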