Abstract
This three-level meta-analysis investigated the human–machine correlation of automated speech evaluation (ASE) systems. Sixty-seven studies representing 392 effect sizes were included. The results indicated a positive overall correlation (r = .654, p < .001) between machine and human scoring of speech. Pooled effect sizes across speaking constructs showed the highest correlation for delivery (r = .784), followed by overall speaking proficiency (r = .686), fluency (r = .618), pronunciation (r = .606), content (r = .574), and grammar and vocabulary (r = .499). A clear upward trend in human–machine correlation was observed across ASE development stages, from the traditional machine learning stage (r = .597) through the deep learning application stage (r = .641) to the transformer-driven stage (r = .680). Moderator analysis revealed significant moderating effects of 10 variables on the overall human–machine correlation: publication year, publication type, unit of sample, age group, level of task constraints, rater expertise, inter-rater reliability, system developer, feature engineering, and algorithm type. However, no significant moderating effects were observed for level of task integration, scoring method, system architecture, or automated speech recognition (ASR) accuracy. Key considerations for future ASE development are proposed, offering insights for educators and policymakers on integrating ASE into education.