Monitoring the performance of human and automated scores for spoken responses

Abstract

As automated scoring systems for spoken responses are increasingly used in language assessments, testing organizations need to analyze their performance relative to human raters across several dimensions, for example on individual items or for subgroups of test takers. Testing organizations also need to establish rigorous procedures for monitoring the performance of both human and automated scoring processes during operational administrations. This paper provides an overview of the automated speech scoring system SpeechRater℠ and shows how charts and evaluation statistics can be used to monitor and evaluate automated scores and human rater scores of spoken constructed responses.

Keywords

Automated speech scoring, language assessment, score monitoring, Shewhart control chart, human and machine scoring, reliability
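
The keywords name the Shewhart control chart as one monitoring tool. As a rough illustration of the general idea only, and not the procedure used in the paper, the sketch below derives control limits from a set of baseline mean scores and flags later administrations whose mean falls outside them. The function names, the data, and the choice of three sample standard deviations as the limit width are all assumptions made for this example; classical individuals charts often estimate the spread from moving ranges instead.

```python
from statistics import mean, stdev


def shewhart_limits(baseline, k=3.0):
    """Return (lower, center, upper) control limits from baseline values.

    Assumption: spread is estimated with the sample standard deviation of
    the baseline, and limits sit k standard deviations from the center.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center, center + k * spread


def out_of_control(values, limits):
    """Return the indices of values that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical mean scores per administration (invented for illustration).
baseline_means = [2.70, 2.74, 2.68, 2.72, 2.71, 2.69, 2.73, 2.70]
new_means = [2.71, 2.69, 2.95, 2.72]

limits = shewhart_limits(baseline_means)
flagged = out_of_control(new_means, limits)

lower, center, upper = limits
print(f"limits: {lower:.3f} / {center:.3f} / {upper:.3f}")
print("flagged administrations:", flagged)
```

The same check could be run separately on human and automated mean scores, so that a shift in one scoring process but not the other becomes visible.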
