Scoring with the computer: Alternative procedures for improving the reliability of holistic essay scoring

Abstract

Automated essay scoring can produce reliable scores that are highly correlated with human scores, but it is limited in its evaluation of content and other higher-order aspects of writing. The increased use of automated essay scoring in high-stakes testing underscores the need for human scoring that is focused on higher-order aspects of writing. This study experimentally evaluated several alternative procedures for eliciting distinct human scores and improving their reliability. Essays written in response to the argument and issue tasks of the Analytical Writing measure of the GRE General Test were scored by experienced raters under different conditions. Criteria for evaluation included inter-rater agreement, agreement with machine scores, and cross-task reliability. First, the use of a modified scoring rubric that focused on higher-order writing skills increased reliability for one type of task but decreased it for the other. Second, scoring in batches of similar-length essays had no effect on scores. Third, scoring with automated essay scores available to raters increased the reliability of human scores, but also increased their similarity to automated scores. Finally, the use of a more refined 18-point scoring scale significantly increased reliability.
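The abstract names inter-rater agreement and agreement with machine scores as evaluation criteria but does not say which statistics were computed. As a rough illustration only, the sketch below computes two statistics commonly used for this purpose, the Pearson correlation and the quadratic weighted kappa, for two sets of integer scores. The function names, the 1 to 6 score range, and the sample ratings are all hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a rater gives constant scores")
    return sxy / sqrt(sxx * syy)


def quadratic_weighted_kappa(r1, r2, min_score, max_score):
    """Quadratic weighted kappa for two lists of integer scores.

    Disagreements are penalized by the squared distance between the
    two score categories, so near misses cost less than large gaps.
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two equal-length, non-empty score lists")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(r1)

    # Observed contingency table of (rater 1 score, rater 2 score) counts.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        observed[a - min_score][b - min_score] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all scores fall in one category")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical ratings on an assumed 1-6 scale, for illustration only.
    rater_a = [4, 3, 5, 2, 4, 6, 3, 4]
    rater_b = [4, 4, 5, 2, 3, 5, 3, 4]
    print("Pearson r:", round(pearson(rater_a, rater_b), 3))
    print("Quadratic weighted kappa:",
          round(quadratic_weighted_kappa(rater_a, rater_b, 1, 6), 3))
```

The same two functions could be applied to a human score and a machine score for each essay to gauge human-machine agreement, provided the machine scores are first rounded to the integer scale.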

Keywords

Automated scoring, essay writing assessment, reliability

