Abstract
This three-level meta-analysis investigated the human–machine correlation of automated speech evaluation (ASE) systems. Sixty-seven studies representing 392 effect sizes were included. The results indicated a positive overall correlation (r = .654, p < .001) between machine and human scoring of speech. Pooled effect sizes across speaking constructs showed the highest correlation for delivery (r = .784), followed by overall speaking proficiency (r = .686), fluency (r = .618), pronunciation (r = .606), content (r = .574), and grammar and vocabulary (r = .499). A clear upward trend in human–machine correlation was observed across ASE development stages, from the traditional machine learning stage (r = .597) through the deep learning application stage (r = .641) to the transformer-driven stage (r = .680). Moderator analysis revealed significant moderating effects of 10 variables on the overall human–machine correlation: publication year, publication type, unit of sample, age group, level of task constraints, rater expertise, inter-rater reliability, system developer, feature engineering, and algorithm type. However, no significant moderating effects were observed for level of task integration, scoring method, system architecture, or automated speech recognition (ASR) accuracy. Key considerations for future ASE development are proposed, offering insights for educators and policymakers on integrating ASE into education.