Abstract
Automated, or computer-based, scoring represents one promising possibility for improving the cost-effectiveness (and other features) of complex performance assessments, such as direct tests of writing skill, that require examinees to construct responses rather than select them from a set of multiple choices. Indeed, significant advances have been made in applying natural language processing techniques to the automatic scoring of essays. Thus far, most of the validation of automated scoring has focused appropriately (but too narrowly, we contend) on the correspondence between computer-generated scores and those assigned by human readers. Far less effort has been devoted to assessing the relation of automated scores to independent indicators of examinees' writing skills. This study examined the relationship of scores from a graduate-level writing assessment to several independent, non-test indicators of examinees' writing skills, both for automated scores and for scores assigned by trained human readers. The extent to which automated and human scores exhibited similar relations with the non-test indicators was taken as evidence of the degree to which the two methods of scoring reflect similar aspects of writing proficiency. Analyses revealed significant but modest correlations between the non-test indicators and the scores from each of the two scoring methods. These relations were somewhat weaker for automated scores than for scores awarded by human readers. Overall, however, the results provide some evidence of the validity of one specific procedure for automated scoring.
