Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability

Abstract

Automated scoring has the potential to dramatically reduce the time and costs associated with the assessment of complex skills such as writing, but its use must be validated against a variety of criteria for it to be accepted by test users and stakeholders. This study approaches validity by comparing human and automated scores on responses to TOEFL® iBT Independent writing tasks with several non-test indicators of writing ability: student self-assessment, instructor assessment, and independent ratings of non-test writing samples. Automated scores were produced using e-rater®, developed by Educational Testing Service (ETS). Correlations between both human and e-rater scores and non-test indicators were moderate but consistent, providing criterion-related validity evidence for the use of e-rater along with human scores. The implications of the findings for the validity of automated scores are discussed.

Keywords

automated scoring, essay examination, second language testing, test validity, writing assessment

