Abstract
An increasing number of language testing companies are developing and deploying deep learning-based automated essay scoring (AES) systems to replace traditional approaches that rely on handcrafted feature extraction. However, neural network approaches to AES have met with hesitation because their features are extracted automatically, making the models appear less transparent and score interpretation opaque. To compare the two approaches systematically, this paper investigated the performance of five approaches to automated essay scoring: traditional machine learning models and neural network (i.e., deep learning) models. The models were developed to assign scores to responses in the TOEFL11 learner corpus. Because the dataset and evaluation metrics were held constant, the results depend on model selection, training, and hyperparameter tuning to find the best fit for each model. Results indicate that the models performed similarly in accuracy but differed in precision and in agreement as measured with the quadratic weighted kappa (QWK) metric. Performance of the traditional models can increase as specific features aligned with the scoring criteria are added. The findings are relevant to the discussion of transparency in artificial intelligence (AI) scoring models.
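For readers unfamiliar with the agreement metric the abstract names, the following is a minimal sketch of how quadratic weighted kappa is commonly computed, using scikit-learn's `cohen_kappa_score`. The score arrays are hypothetical examples, not data from this study.

```python
# Hypothetical illustration: computing quadratic weighted kappa (QWK),
# the agreement metric named in the abstract, with scikit-learn.
# The score arrays are invented examples, not data from the study.
from sklearn.metrics import cohen_kappa_score

# Human rater scores and model-predicted scores on an ordinal scale
# (e.g., the low/medium/high TOEFL11 levels mapped to 1-3).
human_scores = [1, 2, 2, 3, 3, 1, 2, 3]
model_scores = [1, 2, 3, 3, 2, 1, 2, 3]

# weights="quadratic" penalizes disagreements by the squared distance
# between ordinal categories, so a 2-point miss costs more than a 1-point miss.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```

Because QWK weights disagreements by their squared ordinal distance, it is better suited than raw accuracy for comparing scoring models on rating scales, which is why the two metrics can diverge as the abstract reports.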
