Abstract
This study explores the integration of generative artificial intelligence (GenAI) with human experts to improve the quality of distractors in multiple-choice questions (MCQs) for second language (L2) listening tests. A psychometric analysis of responses from 2267 Chinese EFL undergraduates, using the two-parameter logistic nested logit model (2PLNLM), identified problematic items and distractors. Guided by established distractor design principles, GenAI was applied iteratively to revise these distractors, with human experts providing ongoing feedback throughout the process. The revised versions were then evaluated through expert judgment and NLP-based cosine similarity analysis. The results indicate that GenAI effectively enhanced distractor quality by maintaining content and structural alignment and ensuring semantic independence; however, it struggled to fully capture listening miscomprehension patterns and contextualized language use. These preliminary findings suggest that GenAI revisions, guided by principle-based prompts and supervised by humans, can effectively improve distractor quality. This study offers practical insights into the potential and limitations of GenAI in improving L2 listening tests.
