Sage Journals: Discover world-class research

Abstract

In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
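
The two kinds of evidence compared above can be illustrated with a minimal sketch. It assumes Lawshe's (1975) formula for the Content Validity Ratio, CVR = (n_e - N/2) / (N/2), where n_e is the number of raters who judge an item essential and N is the panel size. It also assumes that an embedding-based mapping assigns each item to the construct whose vector is most similar by cosine similarity. The function names, the three-dimensional toy vectors, and the cosine rule are illustrative assumptions; the article's actual models and pipeline are not described on this page.

```python
import math


def content_validity_ratio(n_essential, n_raters):
    """Lawshe's CVR: (n_e - N/2) / (N/2), ranging from -1 to 1."""
    if n_raters <= 0 or not 0 <= n_essential <= n_raters:
        raise ValueError("need 0 <= n_essential <= n_raters and n_raters > 0")
    half = n_raters / 2
    return (n_essential - half) / half


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_construct(item_vec, construct_vecs):
    """Return the name of the construct most similar to the item (assumed rule)."""
    return max(construct_vecs, key=lambda name: cosine(item_vec, construct_vecs[name]))


# Hypothetical toy embeddings; real ones would come from a language model.
constructs = {
    "Extraversion": [1.0, 0.0, 0.0],
    "Conscientiousness": [0.0, 1.0, 0.0],
}
item = [0.9, 0.2, 0.1]

cvr = content_validity_ratio(n_essential=8, n_raters=10)
predicted = predict_construct(item, constructs)
```

In a hybrid setup of the kind the abstract points to, an item's CVR from the human panel and its predicted construct from the embeddings could be reported side by side, so that disagreements flag items for closer review.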

Keywords

content validity, Big Five Questionnaire, Big Five Inventory, embeddings, natural language processing, large language models

Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests
