Abstract
We propose, illustrate, and evaluate the use of artificial intelligence (AI) to advance rigorous hypothesis-driven scale validation. Using a qualitative approach, we found that AI provided useful suggestions for measures to be used as criteria in scale validation research. Using data and expert predictions previously used to validate nine scales/subscales, we evaluated AI’s ability to produce precise, psychologically reasonable validity hypotheses. ChatGPT and Gemini produced hypotheses with “inter-trial consistency” similar to experts’ “inter-rater consistency,” and their hypotheses agreed strongly with experts’ hypotheses. Importantly, their hypothesized validity correlations corresponded with actual validity correlations roughly as accurately as experts’ hypotheses did. Replicating across nine scales/subscales, the results are encouraging regarding the use of AI to facilitate a precise hypothesis-driven approach to convergent and discriminant validity in a way that saves time at little-to-no cost in psychological or psychometric quality.