Integrating Ensemble Clustering and Text Embeddings for Estimating the Factor Loadings of Self-Report Scales

Abstract

Advances in large language models can provide opportunities to evaluate the characteristics of scales prior to data collection. In this study, we explore whether item text can be used to predict a scale’s psychometric properties. Specifically, we examine whether clustering consensus (i.e., the frequency with which items are grouped with other items from the same underlying factor across multiple clustering algorithms) and a cosine similarity metric (i.e., the semantic similarity of items to other items from the same factor) can be used to predict exploratory factor analysis (EFA) factor loadings. Across six scales that varied in sample size and in the number of factors and items, we found that both the cosine similarity and ensemble clustering consensus methods predicted factor loading values. While the methods share some conceptual and empirical overlap, and results vary by scale, the ensemble clustering approach explains incremental variance above and beyond cosine similarity in predicting factor loadings. Using both methods in conjunction can be a useful way to identify problematic items prior to data collection and to help researchers develop more optimal scales from the outset, thereby potentially saving time and resources and increasing the likelihood of developing sound measures.
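To make the two item-level predictors concrete, the following is a minimal sketch of how they might be computed from item embeddings and a set of clustering solutions. It is not the authors' implementation: the function names `within_factor_cosine` and `clustering_consensus`, the toy embeddings, and the toy partitions are all hypothetical, and the exact operationalization in the study may differ (for example, in which embedding model and which clustering algorithms are used).

```python
import numpy as np


def _same_factor_mask(factors):
    """Boolean matrix: True where two different items share an intended factor."""
    factors = np.asarray(factors)
    same = factors[:, None] == factors[None, :]
    np.fill_diagonal(same, False)  # an item is not compared with itself
    return same


def within_factor_cosine(embeddings, factors):
    """Mean cosine similarity of each item to the other items of its factor."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    sim = X @ X.T                                     # pairwise cosine similarity
    same = _same_factor_mask(factors)
    counts = same.sum(axis=1)
    sums = (sim * same).sum(axis=1)
    # Items with no same-factor partner get NaN instead of a division error.
    return np.divide(sums, counts, out=np.full(len(X), np.nan), where=counts > 0)


def clustering_consensus(partitions, factors):
    """Average rate at which each item is clustered with its same-factor items.

    `partitions` has shape (n_solutions, n_items); each row holds the cluster
    labels from one clustering algorithm or run (assumed to be precomputed).
    """
    P = np.asarray(partitions)
    # Co-assignment matrix: share of solutions placing items i and j together.
    co = (P[:, :, None] == P[:, None, :]).mean(axis=0)
    same = _same_factor_mask(factors)
    counts = same.sum(axis=1)
    sums = (co * same).sum(axis=1)
    return np.divide(sums, counts, out=np.full(P.shape[1], np.nan), where=counts > 0)


# Toy example (hypothetical data): four items keyed to two factors.
factors = ["A", "A", "B", "B"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
partitions = [
    [0, 0, 1, 1],  # solution 1 recovers both factors
    [0, 0, 1, 0],  # solution 2 moves the last item into the first cluster
    [0, 0, 0, 1],  # solution 3 moves the third item into the first cluster
]

cos_scores = within_factor_cosine(embeddings, factors)
consensus_scores = clustering_consensus(partitions, factors)
print(cos_scores)
print(consensus_scores)
```

In this toy case the two factor B items score lower than the factor A items on both metrics, which is the kind of pattern that would flag them for review before data collection. Whether the two scores carry distinct information could then be checked by regressing observed loadings on cosine similarity alone and on both predictors together, and comparing the variance explained.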

Keywords

text embeddings, ensemble cluster analysis, factor loadings, psychometrics, large language models
