Abstract
Background
Multiple-choice item (MCI) assessments are burdensome for instructors to develop. Artificial intelligence (AI, e.g., ChatGPT) can streamline the process without sacrificing quality: prior work suggests that AI-generated MCIs are comparable in quality to those written by human experts. However, whether AI-generated MCIs are equally good across various domain- and task-specific prompts remains to be determined. Therefore, we ask whether AI can generate high-quality MCIs to assess learning outcomes from a psychology textbook chapter reading.
Objective
In an exploratory study, we used Item Response Theory analysis and expert review to assess MCIs generated by ChatGPT-4 from a psychology textbook chapter.
Method
We submitted a prompt and a textbook chapter to ChatGPT-4, requesting 20 MCIs. One hundred ninety undergraduate participants read the chapter and then responded to the MCIs. Expert reviewers assessed the MCIs for alignment with learning outcomes and overall quality.
Results
ChatGPT-4-generated MCIs were low in difficulty and high in discrimination. Expert reviewers found that nearly all items were logically sound, aligned with learning objectives, and met prevailing standards of MCI quality.
Conclusion
When carefully prompted, ChatGPT-4 can rapidly generate high-quality MCIs to test comprehension of a psychology textbook chapter. However, due to the uniformly low difficulty of the items, we recommend enlisting ChatGPT-4 to write MCIs for formative, but not summative, assessments.
