Abstract
An automated essay scoring (AES) program is a software system that uses techniques from corpus and computational linguistics and machine learning to grade essays. In this study, we aimed to describe and evaluate particular language features of Coh-Metrix for a novel AES program that would score junior and senior high school students’ essays from their large-scale assessments. Specifically, we studied nine categories of Coh-Metrix features for developing prompt-specific AES scoring models for our sample. We developed the models by capitalizing on the nine features’ informativeness as a function of dimensionality reduction, using a three-stage scoring framework. The machine scores were validated against a “gold standard” of ratings, namely those assigned by two human raters. The nine language features reliably captured the construct of the students’ writing quality. A secondary analysis comparing the scoring models with other, already established AES systems found no systematic pattern of scoring discrepancy. However, for essays with widely divergent human ratings, the scoring models were disadvantaged owing to the inherent unreliability of the human scores.
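To make the general workflow concrete, the following is a minimal sketch (not the authors' actual pipeline) of a prompt-specific scoring model: linguistic features are reduced via PCA, a linear model is fit on the reduced features, and the resulting machine scores are validated against human ratings with quadratic weighted kappa, a common AES agreement statistic. All feature values, ratings, dimensions, and the choice of PCA and least squares here are illustrative assumptions.

```python
# Illustrative sketch only: feature matrix, human ratings, the number of
# retained components, and the regression model are all synthetic stand-ins,
# not the study's actual data or method.
import numpy as np

rng = np.random.default_rng(0)
n_essays, n_features = 200, 9          # e.g., nine feature categories
X = rng.normal(size=(n_essays, n_features))
true_w = rng.normal(size=n_features)
# Synthetic human ratings on a 1-6 holistic scale
human = np.clip(np.rint(X @ true_w / 3 + 3), 1, 6).astype(int)

# Stage 1: dimensionality reduction (PCA via SVD on centered features)
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5                                   # keep the k most informative components
Z = Xc @ Vt[:k].T

# Stage 2: fit a linear scoring model on the reduced features
A = np.column_stack([Z, np.ones(n_essays)])
coef, *_ = np.linalg.lstsq(A, human, rcond=None)
machine = np.clip(np.rint(A @ coef), 1, 6).astype(int)

# Stage 3: validate machine scores against the human "gold standard"
def qwk(a, b, lo=1, hi=6):
    """Quadratic weighted kappa between two integer rating vectors."""
    levels = np.arange(lo, hi + 1)
    O = np.zeros((len(levels), len(levels)))
    for x, y in zip(a, b):
        O[x - lo, y - lo] += 1
    W = (levels[:, None] - levels[None, :]) ** 2 / (hi - lo) ** 2
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1 - (W * O).sum() / (W * E).sum()

print(round(qwk(machine, human), 3))
```

A kappa near 1 indicates close machine-human agreement; values near 0 indicate chance-level agreement, which is where widely divergent human ratings would drag the statistic down.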
