Abstract
An increasing number of language testing companies are developing and deploying deep learning-based automated essay scoring (AES) systems to replace traditional approaches that rely on handcrafted feature extraction. However, neural network approaches to AES have met with hesitation because their features are extracted automatically, making the models appear less transparent and score interpretation opaque. To compare the two approaches systematically, this paper investigated the performance of five approaches to automated essay scoring: traditional machine learning models and neural network (i.e., deep learning) models. The models were developed to assign scores to responses in the TOEFL11 learner corpus. Because the dataset and evaluation metrics were held constant, the results depend on model selection, training, and hyperparameter tuning to find the best fit for each model. Results indicate that the models performed similarly in accuracy but differed in precision and in agreement as measured with the quadratic weighted kappa (QWK) metric. Performance of the traditional models can increase as specific features aligned with the scoring criteria are added. The findings are relevant to the discussion of transparency in artificial intelligence (AI) scoring models.
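For readers unfamiliar with the agreement metric the abstract names, the following is a minimal sketch of how quadratic weighted kappa is commonly computed, using scikit-learn's `cohen_kappa_score`. The score arrays are hypothetical examples, not data from this study.

```python
# Hypothetical illustration: computing quadratic weighted kappa (QWK),
# the agreement metric named in the abstract, with scikit-learn.
# The score arrays are invented examples, not data from the study.
from sklearn.metrics import cohen_kappa_score

# Human rater scores and model-predicted scores on an ordinal scale
# (e.g., the low/medium/high TOEFL11 levels mapped to 1-3).
human_scores = [1, 2, 2, 3, 3, 1, 2, 3]
model_scores = [1, 2, 3, 3, 2, 1, 2, 3]

# weights="quadratic" penalizes disagreements by the squared distance
# between ordinal categories, so a 2-point miss costs more than a 1-point miss.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```

Because QWK weights disagreements by their squared ordinal distance, it is better suited than raw accuracy for comparing scoring models on rating scales, which is why the two metrics can diverge as the abstract reports.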
