Abstract
Traditional item development methods have constrained the advancement of computerized adaptive testing (CAT), hindering the achievement of fully intelligent assessments. With the progress of natural language processing technologies, automatic item generation (AIG) based on large language models (LLMs) offers a promising solution to this challenge. This study employed three LLMs to generate Simplified Chinese Big Five personality items and evaluated the effectiveness of the resulting adaptive item bank through two rounds of empirical testing. The goal was to leverage emerging technologies to address one of the key bottlenecks in CAT development and to promote fully automated, intelligent assessment workflows. Findings indicate that LLM-based AIG can produce high-quality Big Five CAT item banks cost-effectively and efficiently. Moreover, the approach performs robustly across different LLMs, highlighting its cross-model stability and practical potential.