Abstract
ChatGPT has shown considerable potential for Automated Item Generation, but the quality of ChatGPT-generated items in language assessment remains insufficiently substantiated. This research recruited 121 participants to systematically compare ChatGPT-generated and official CET-4 reading comprehension materials, examining the psychometric properties of the test items with Item Response Theory and the linguistic features of the reading passages with Coh-Metrix. Key findings are as follows: (1) the generated items underrepresented higher-order reading skills; (2) the generated items were less difficult than the official ones, showed weaker discrimination, and provided measurement information mainly for lower-performing students; (3) only 22.9% of the distractors functioned effectively, indicating insufficient distractor performance; and (4) ChatGPT-generated passages exhibited irregular lexical distribution, higher lexical complexity, and weaker cohesion than CET-4 passages, but simpler sentences. Although the ChatGPT-generated passages were less readable than the CET-4 passages, the corresponding items were easier and less discriminating. This discrepancy can be attributed to inadequate distractor functioning, which allows test-takers to eliminate options without fully comprehending the passage, and to the underrepresentation of higher-order reading skills. The findings support the conclusion that ChatGPT can function effectively as a supplementary tool in low-stakes assessment; however, substantial refinements in item quality are needed before it is applied in high-stakes testing.
