Abstract
Many behavioral scientists do not agree on core constructs and how they should be measured. Different literatures measure related constructs, but the connections are not always obvious to readers and meta-analysts. Many measures in behavioral science are based on agreement with survey items. Because these items are sentences, computerized language models can make connections between disparate measures and constructs and help researchers regain an overview of the rapidly growing, fragmented literature. Our fine-tuned language model, the SurveyBot3000, accurately predicts the correlations between survey items, the reliability of aggregated measurement scales, and the intercorrelations between scales from item positions in semantic vector space. We measured the model's performance as the convergence between its synthetic estimates and empirical coefficients observed in human data. In our pilot study, the out-of-sample accuracy was .71 for item correlations, .89 for reliabilities, and .89 for scale correlations. In our preregistered validation study using novel items, the out-of-sample accuracy was slightly reduced: .59 for item correlations, .84 for reliabilities, and .84 for scale correlations. The synthetic item correlations showed an average prediction error of .17, with larger errors for middling correlations. Predictions generalized beyond the training data and across various domains, with some variability in accuracy. Our work shows that language models can reliably predict psychometric relationships between survey items, enabling researchers to evaluate new measures against existing scales, reduce redundancy in measurement, and work toward a more unified behavioral-science taxonomy.
Behavioral science struggles to be cumulative in part because scientists in many fields fail to agree on core constructs (Bainbridge et al., 2022; Sharp et al., 2023). The literature silos that consequently develop can appear unconnected yet pursue the same phenomena under different labels (see, e.g., "grit" and "conscientiousness"; Credé et al., 2017).
One reason why connections are lacking is the asymmetry inherent in measure and construct validation: Adding novel constructs to the pile is easier than sorting through it. Investigators can easily invent a new ad hoc measure and benefit reputationally if a new construct becomes associated with their name (Elson et al., 2023; Flake & Fried, 2020). By contrast, finding out whether a purported new construct or measure is redundant with the thousands of existing ones is cumbersome and can cause conflict with other researchers (Bainbridge et al., 2022; Elson et al., 2023). The same holds for replicating construct-validation studies and reporting evidence of overfitting or other problems (Hussey et al., 2024; Kopalle & Lehmann, 1997).
Untangling the “nomological net”—a term coined by Cronbach and Meehl (1955) to describe the relationships between measures and constructs—has become increasingly difficult given the growing number of published measures (Anvari, Alsalti, Oehler, Hussey, et al., 2025; Elson et al., 2023). Conventional construct-validation methods, although effective in mapping these relationships, do not scale to, for instance, the thousands of measures that might be related to neuroticism. To tackle this problem, Condon and Revelle (2015; see also Condon, 2017; Condon et al., 2017) championed the Synthetic Aperture Personality Assessment, in which survey participants respond to a small random selection of a large set of items from the personality literature. Over time, as the sample size grows, this procedure allows estimating pairwise correlations between all items. Although the approach is efficient, each new item requires thousands of participants to answer the survey before it can be correlated with all existing items. Hence, the approach cannot be used to quickly evaluate new proposed scales. What is missing is an efficient way to prioritize, prune the growth in constructs and measures, and sort through the disorganized pile of existing measures.
Natural language processing could provide this efficiency. In the social and behavioral sciences, subjective self-reports are one of the predominant forms of measurement. The textual nature of survey items lends itself to natural language processing. Recently, transformer models have become the state of the art in language models (Vaswani et al., 2017), displaying proficiency in understanding and generating text. They have dramatically reduced the costs of many tasks and chores, notably in programming and generating images from verbal prompts. Although capabilities for natural language generation are currently more visible in the public eye through the use of chat-like interfaces, they are backed by capabilities in natural-language understanding (e.g., classifying or extracting features from text).
On a technical level, this understanding is implemented by the so-called encoder block, which processes input text and encodes it as a high-dimensional numeric vector. The vector representation of a word such as "party" in the resulting semantic vector space is context-dependent: The same word will yield a different vector representation if it occurs in the statement "I am the life of the party" compared with "I always vote for the same party." The encoder block's ability to contextualize words is crucial for recognizing the nuances of language. At heart, the efficiency of the transformer model can largely be attributed to its self-attention mechanism (Vaswani et al., 2017). As the name suggests, it is loosely analogous to selective attention in human cognition. Instead of "memorizing" an entire corpus of text, word by word, the attention mechanism weights the relevance of the words in a context window for a given target word.
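To make the mechanism concrete, the core computation of scaled dot-product self-attention can be sketched in a few lines of NumPy. This is a toy illustration, not the SurveyBot3000's implementation; the random matrices stand in for learned projection weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X.

    Each output row is a weighted average of all value vectors, so every
    token's representation is contextualized by the other tokens.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Relevance of each word in the context window for each target word.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))  # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one contextualized vector per token
```

Because the output for each token mixes in information from the whole window, the same word can receive different vectors in different sentences, which is the property exploited below.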
Transformer models excel in transfer learning, that is, they adapt to new tasks easily (Tunstall et al., 2022). Following the pretraining stage, which establishes a base level of linguistic expertise, the models can undergo domain adaptation, which involves training the model on a corpus of text specifically curated for the task at hand. In a process called “fine-tuning,” the model then learns to carry out a specific task (e.g., text classification). Fine-tuning often involves slight architectural adjustments to the model’s output layer, although the term is used somewhat inconsistently in the literature to describe various adaptation approaches. Essentially, the model builds on the fundamental knowledge acquired during pretraining to adapt to specialized tasks, even with limited training data. High-quality annotated training data are key for the domain adaptation that turns generalists into specialists.
Using linguistic information to scaffold scientific models has a long history in personality psychology, in which the lexical hypothesis states that more important personality characteristics are more likely to be encoded as words. To find important personality dimensions, researchers had human subjects rate themselves on prominent adjectives, or “items”; identified systematic correlations between items; and applied factor-analytic techniques to reduce the number of dimensions. The most popular organizing framework, the Big Five, was distilled from personality-descriptive items in this manner (Digman, 1990).
Pretransformer-era attempts to use semantic features of items to predict associations between measurement scales using latent semantic analysis have demonstrated moderate utility (Arnulf et al., 2014; Larsen & Bong, 2016; Rosenbusch et al., 2020). As the ability to capture meaning in computerized language models has grown, researchers have sought to directly quantify relationships between adjectives from textual data (Cutler & Condon, 2023), assign items to constructs (Fyffe et al., 2024; Guenole et al., 2024), directly predict item responses (Abdurahman et al., 2024; Argyle et al., 2023), and quantify answers to open-ended questions (Kjell et al., 2019, 2024).
Wulff and Mata (2023) used large language models (LLMs) to map survey items to vector space and predict empirical item correlations. They tested various transformer models for their ability to predict properties of psychological inventories. They observed a correlation of r = .22 between the semantic similarities of items as judged by OpenAI's ada-002 model (Greene et al., 2022) and the item correlations estimated in empirical data and found that accuracy improved when aggregating vectors to the scale level. Their work shows that LLMs can approximately infer item correlations and outperform latent semantic analysis. However, their approach relied on pretrained models that were not adapted to the domain of survey items and did not account for the fact that empirical item correlations are often negative because of negation. This approach cannot be expected to unlock the latent ability of the models but, rather, gives a lower bound of their usefulness. At the same time, pretrained models can overfit to their training data. Because OpenAI's LLMs obtain knowledge from scraping large quantities of internet text, they presumably have seen items from existing measures co-occur in online studies and public-item repositories (for details on training-data leakage, see Supplementary Note 11 in the Supplemental Material available online). Performance on survey items that were inadvertently part of the training data can therefore appear more optimistic than could be expected for novel items.
We have adapted a sentence-transformer model to the domain of survey-response patterns and trained our model, the SurveyBot3000, to place items in vector space. The distances between item pairs in vector space produce what we call “synthetic” item correlations, scale correlations, and reliabilities. These synthetic estimates can potentially help to cheaply evaluate measures and constructs. We have validated that the SurveyBot3000 can approximately infer empirical item correlations beyond its training data by preregistering the model’s synthetic estimates before collecting empirical data. Based on our pilot study, we predicted that the model will exhibit substantial accuracy in inferring empirical item correlations (r = .71, 95% confidence interval [CI] = [.70, .72]) and even higher accuracy in inferring latent correlations between scales (r = .89, 95% CI = [.88, .90]) and inferring reliability coefficients (r = .89, 95% CI = [.84, .94]). We detail our predictions in Table 1.
Design Table
Note: We determined the planned precision to detect any deterioration in performance greater than .01 for item-pair correlations. Because increasing the number of scales is costlier than increasing the number of items, the sensitivity for the reliability coefficients is a compromise with feasibility. LLM = large language model.
Our model can be put to work in multiple areas. Synthetic correlations will always require careful follow-up with empirical data, but they can be used to search and prioritize. Authors can use our model as a semantic search engine to find existing constructs and measures and avoid reinventions. Synthetic correlations could be used as inputs for more realistic a priori power analyses. Scientific reviewers can use it to flag optimistic reliability coefficients and unstable factor structures, especially when researchers have not validated an ad hoc measure out of sample yet. Generally, discrepancies between reported estimates and LLM-based synthetic estimates can motivate greater attention to replication and construct validation. Finally, metascientists and measurement researchers can use the model to start sorting through the pile of tens of thousands of existing constructs and measures (Anvari, Alsalti, Oehler, Hussey, et al., 2025; Elson et al., 2023).
As a showcase, we have made the model available as an app on Hugging Face. Researchers can enter item texts, and the app will generate synthetic item correlations, scale correlations, and reliability coefficients. The app contains a prominent cautionary note to discourage researchers from taking the synthetic estimates at face value before further validation has occurred.
Method
Materials, data, and code for the present study are available on OSF: https://osf.io/z47qs/. Data preprocessing, model training, and statistical analyses were conducted using Python (Version 3.10.12; Van Rossum & Drake, 2009) and R (Version 4.3.1; R Core Team, 2023), with an Nvidia GeForce RTX 2070 Super GPU, and using the CUDA 11.7.1 toolkit (NVIDIA et al., 2022).
Ethics information
The planned research complies with the ethics guidelines of the German Society for Psychology (Berufsverband Deutscher Psychologinnen und Psychologen, 2022). Data used in model training were collected by third parties, as shown in the online supplemental section (https://osf.io/z47qs/). Participants in the validation study were recruited from the crowdsourcing platform Prolific and compensated at a median wage of $12 per hour. Informed consent was obtained from all human respondents. Ethics approval for the validation study was granted by the Institutional Review Board at Leipzig University. All necessary support is in place for the proposed research.
Pretrained language model
Our preliminary work has focused on improving the predictions of item correlations with sentence-transformer models using high-quality training corpora for domain adaptation. We modified an LLM to generate synthetic item correlations by fine-tuning a pretrained sentence-transformer model (Reimers & Gurevych, 2019). Unlike conventional transformer models used in natural-language-understanding tasks that produce vector representations of individual tokens (i.e., basic linguistic units, e.g., words or syllables), sentence transformers produce vector representations for longer sequences of text (e.g., sentences).
Sentence transformers—specifically the bi-encoder architecture used throughout this research—work by using two parallel LLMs that process text inputs independently but share the same structure and parameters. The central idea behind these models is to capture the semantic essence of a sentence. One method to accomplish this is by pooling (e.g., averaging) the contextualized token vectors for each of the two models and then combining them. The underlying neural network then learns these combined representations by predicting sentence similarities, for instance, using natural-language-inference data. In natural-language inference, a given text (i.e., the premise) is evaluated based on its relation to another text (i.e., the hypothesis), classified as either contradicting, entailing, or being neutral to it. The network’s output layer consists of three neurons, each representing one of these classes. The model’s learning effectiveness is assessed using cross-entropy loss; improvements in sentence-vector representation are achieved through backpropagation. For further details on bi-encoders, see Reimers and Gurevych (2019) and Schroff et al. (2015). For accessible in-depth introductions to transformer models and deep neural networks, see Hussain et al. (2023) and Hommel et al. (2022).
We chose the all-mpnet-base-v2 model (referred to hereafter as the "SBERT model") from the Hugging Face Model Hub (n.d.) for further fine-tuning, based on its commendable performance across 14 benchmark data sets (Pretrained Models—Sentence-Transformers Documentation, n.d.). This pretrained model is a sentence-transformer adaptation of the mpnet-base model (Song et al., 2020), initially trained on 160 gigabytes of English-language text, including Wikipedia, BooksCorpus, OpenWebText, CC-News, and Stories. The SBERT model places sentences in a 768-dimensional semantic vector space. Proximity in this space can be quantified using, for instance, cosine similarity. In our case, we hypothesized that the cosine similarity between the vector representations of any two survey items (e.g., personality statements) should correspond to the correlation coefficients obtained from survey data.
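The similarity computation itself is straightforward. The following sketch uses made-up 4-dimensional vectors in place of the model's 768-dimensional item embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (ranges from -1 to 1)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in embeddings; the SBERT model would return 768-dimensional vectors.
item_a = np.array([0.2, 0.7, -0.1, 0.4])
item_b = np.array([0.3, 0.6, 0.0, 0.5])
print(round(cosine_similarity(item_a, item_b), 2))  # → 0.97
```

Under our hypothesis, this value plays the role of a synthetic interitem correlation for the two items.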
Domain adaptation and fine-tuning
We fine-tuned the pretrained model in two steps. In the first step, we trained the model to distinguish between semantically opposing concepts. In the second step, we trained the model to predict pairwise item correlations using survey data. Figure 1 depicts the multistep training procedure.

Multistep training procedure for the SurveyBot3000, which produces synthetic estimates of interitem correlations. (a) Pretraining base model (SBERT). (b) Fine-tuning SurveyBot3000. (c) Validation. SBERT = all-mpnet-base-v2 model.
Step 1: polarity calibration
Although cosine similarity spans from −1 to 1, negative coefficients are rarely produced when comparing vector representations of sentences (cf. the croissant shape of the top left plot in Fig. 2). This limitation primarily arises because the high-dimensional vector representation of sentences encodes a range of abstract linguistic features, many of which tend to be positively correlated across text sequences. This poses a challenge in accurately predicting correlations for items of opposing scale polarities, such as those on the introversion-extraversion continuum. To illustrate, when assessing cosine similarity between items from the pretrained model, the item “I am the life of the party” produces comparable coefficients with “I make friends easily” (Θ = .32) and “I keep in the background” (Θ = .35). This occurs even though the last item represents the polar opposite of the first item.

Scatter plots of the synthetic and empirical estimates, pilot study (Stage 1). We show N = 87,153 item-pair correlations, N = 307 scale reliabilities, and N = 6,245 scale-pair correlations for the (top) pretrained SBERT model and the (bottom) fine-tuned SurveyBot3000 model. The yellow line and shaded yellow region show the prediction and the 95% prediction interval for the latent outcome according to a Bayesian multimembership regression model that allowed for heteroskedasticity and sampling error. Because the empirical estimates are estimated with sampling error, which the model adjusts for, fewer than 95% of dots are in the shaded prediction interval. Brown dots in the middle column show randomly combined scales, which we used to increase variance in the criterion. For reliabilities, 18 randomly combined scales with negative synthetic alphas according to the pretrained model are not shown for ease of presentation. SBERT = all-mpnet-base-v2 model.
We fine-tuned the pretrained model with the goal of maximizing the cosine distance between vector representations of opposing concepts. We achieved this by augmenting the Stanford Natural Language Inference (SNLI) corpus (SNLI Version 1.0; see also Supplementary Note 3 in the Supplemental Material; Williams et al., 2018) for our purposes. SNLI consists of around 570,000 sentence pairs, each labeled for textual entailment as either “contradiction,” “neutral,” or “entailment.” We relabeled each sentence pair by additionally assigning a magnitude to the semantic relationship. We let the pretrained SBERT model generate the cosine similarity of the sentence pair (e.g., “the moon is shining” and “it is a sunny day”; Θ = .46) but assigned a negative direction if the pair was labeled as “contradictory” (e.g., Θ = −.46). Hence, our new criterion combined the magnitude and direction of the similarity, capturing various forms of negation in the process. The fine-tuned model was then trained to predict this new criterion so that it would learn that similar sentences have negative cosine similarities when one sentence negates or contradicts the other (for more detailed evaluation metrics, see Supplementary Note 6 in the Supplemental Material).
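The relabeling step can be sketched as follows. Function and field names are our own illustration; in the actual pipeline, the cosine similarity would come from the pretrained SBERT model rather than being supplied directly:

```python
def signed_similarity(label: str, cosine: float) -> float:
    """Combine the magnitude of the pretrained model's cosine similarity
    with a direction derived from the natural-language-inference label."""
    if label == "contradiction":
        return -abs(cosine)  # contradicting pairs become negative targets
    return abs(cosine)       # "entailment" and "neutral" keep a positive sign

pairs = [
    ("the moon is shining", "it is a sunny day", "contradiction", 0.46),
    ("a man plays guitar", "a person makes music", "entailment", 0.62),
]
targets = [signed_similarity(label, cos) for _, _, label, cos in pairs]
print(targets)  # → [-0.46, 0.62]
```

Training the model to regress onto these signed targets is what teaches it that a negation should flip the sign of the similarity while preserving its magnitude.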
Step 2: domain adaptation
We found that the SBERT model’s predictions of item correlations were skewed by the presence of non-trait-related text in the item stems. Specifically, we identified a tendency for item correlations to be overestimated in statements containing the same adverbs of frequency. For example, the phrase “I often feel blue” from the depression facet of the Revised NEO Personality Inventory in the International Personality Item Pool exhibits similar cosine similarity to the two items “I feel that my life lacks direction” (Θ = .28) and “I often forget to put things back in their proper place” (Θ = .26) even though the first item is also from the depression facet and the second is from the orderliness facet.
To address this, we aimed to fine-tune the model to focus on text segments that convey information relevant to psychological traits and their similarity. This adjustment aimed to enhance the model’s accuracy in identifying and processing trait-relevant language and to teach it about personality structure, thus improving the validity of its synthetic correlations. We compiled training data from 29 publicly available online repositories (see Supplementary Note 4 in the Supplemental Material). Our inclusion criteria for the corpus mandated that raw item-level data be available, a minimum sample size of N ≥ 300, the use of a rating scale as response format, and clear mapping of item stems to variable names in the data sets. In preprocessing, we retained pairwise Pearson coefficients from the lower triangular matrix across all data sets and cleaned and standardized item stems. Further details on the preprocessing of data are available on OSF (https://osf.io/bfhzy). For cross-validation purposes, we distributed each item pair among training, validation, and test partitions, adhering to an 80-10-10 split. To avoid overfitting, we ensured that all items were unique to their partition. This led to the exclusion of a substantial portion of our training data. Specifically, from the initial pool of 204,424 item pairs, we retained 90,424 pairs. Of these, we randomly allocated 74,339 pairs (82%) to the training partition, 6,832 pairs (8%) to the validation partition, and 9,253 pairs (10%) to the test partition. To mitigate the risk of the model learning idiosyncratic characteristics inherent to the data set—item stems within a data set are more likely to exhibit resemblance than between data sets—we used an additional holdout data set. This data set comprised 87,153 item pairs obtained from Bainbridge et al. (2022), thereby providing a robust measure for evaluating the model’s generalizability to novel English-language items about personality and related individual differences. 
To ensure the integrity of the holdout data set, any items not exclusive to it were eliminated from the training, validation, and test partitions.
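The constraint that items be unique to their partition can be enforced by assigning items, rather than pairs, to partitions and keeping only the pairs whose two items land in the same partition. This is a simplified sketch under our own naming; the actual split additionally enforced holdout exclusivity and, as described above, discards a substantial share of pairs:

```python
import random

def item_disjoint_split(pairs, seed=0, weights=(0.8, 0.1, 0.1)):
    """Split item pairs into train/validation/test so that no item
    appears in more than one partition. Pairs whose items fall into
    different partitions are dropped, which shrinks the data set."""
    items = sorted({i for pair in pairs for i in pair})
    rng = random.Random(seed)
    names = ("train", "validation", "test")
    assignment = {i: rng.choices(names, weights=weights)[0] for i in items}
    split = {name: [] for name in names}
    for a, b in pairs:
        if assignment[a] == assignment[b]:  # keep only same-partition pairs
            split[assignment[a]].append((a, b))
    return split

pairs = [("i1", "i2"), ("i1", "i3"), ("i2", "i3"), ("i4", "i5")]
split = item_disjoint_split(pairs)
print({k: len(v) for k, v in split.items()})
```

Dropping cross-partition pairs is the price of preventing the model from scoring well merely by recognizing item stems it has already seen.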
We optimized the hyperparameters for fine-tuning the model using the Optuna library in Python (Version 3.1.1; Akiba et al., 2019), with a focus on enhancing the model’s ability in predicting item correlations within the test partition. Details of the final hyperparameter selection are available in the online supplemental material (https://osf.io/b5ua7).
Pilot study
We found that the SurveyBot3000 model was highly accurate for all partitions of the curated corpus. Empirical interitem correlations and synthetic correlations were accurately predicted in the test set, r = .69 (95% CI = [.67, .70]; df = 9,251), and in the validation set, r = .71 (95% CI = [.70, .72]; df = 6,830). That accuracy was high in both test and validation sets shows the model’s strong generalizability within the corpus.
The SurveyBot3000 model was then tested using 87,153 item pairs obtained from Bainbridge et al. (2022), the holdout data set we withheld from the training process to avoid overfitting. Adjusted for sampling error in the empirical data (see Supplementary Note 1 in the Supplemental Material), the model’s synthetic correlations predicted the empirical interitem correlations with an accuracy of r = .71 (95% CI = [.70, .72]; manifest correlation: r = .67, 95% CI = [.67, .68]; Fig. 2). This consistency with the test-set performance shows the model’s ability to generalize beyond the idiosyncratic properties of the data seen in training. Figure 2 shows the prediction of item correlations through semantic similarity, as estimated by the SBERT and SurveyBot3000 models. The SBERT model had substantially lower accuracy in predicting interitem correlations in our holdout (manifest correlation: r = .19, 95% CI = [.18, .19]).
We further investigated the model’s ability to predict scale reliabilities, which can be calculated from interitem correlation matrices. Given that scales are typically designed to exhibit high internal consistency, we observed limited variability in the internal consistency measures across the 107 scales and subscales in the holdout data set. Empirical Cronbach’s α values had a mean of .75 (SD = .10) and ranged from .35 to .93. When new scales are designed, reliability varies more widely. We therefore circumvented the problem of restricted variance by randomly sampling items to create 200 additional, varied scales. We found that synthetic reliability estimates were highly accurate at r(307) = .89, 95% CI = [.84, .94] (manifest correlation: r = .89, 95% CI = [.86, .91]). Again, the SBERT model had substantially lower accuracy (manifest correlation: r = .38, 95% CI = [.28, .48]). Accuracy was lower when we excluded the randomly formed scales (manifest correlation: r = .63, 95% CI = [.50, .73]), as expected given the restricted range in the real scales (SD = .10 compared with SD = .23 in the combined set).
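A synthetic reliability can be derived from a synthetic interitem correlation matrix. For standardized items, Cronbach's alpha reduces to α = k·r̄ / (1 + (k − 1)·r̄), where k is the number of items and r̄ the mean interitem correlation. The sketch below applies this standardized formula; the exact estimator used in our analyses may differ in detail:

```python
import numpy as np

def standardized_alpha(R):
    """Standardized Cronbach's alpha from a k x k interitem correlation matrix."""
    R = np.asarray(R, dtype=float)
    k = R.shape[0]
    off_diag = R[~np.eye(k, dtype=bool)]
    r_bar = off_diag.mean()  # mean interitem correlation
    return k * r_bar / (1 + (k - 1) * r_bar)

# Three items with a mean interitem correlation of .50 yield alpha = .75.
R = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
print(round(standardized_alpha(R), 2))  # → 0.75
```

Substituting synthetic for empirical correlations in R is all that is needed to obtain a synthetic reliability for any candidate scale, including the randomly combined scales described above.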
We subsequently investigated the model’s validity for scale-level predictions using the holdout data set. We averaged the vector representations of all items in each scale and then computed the cosine similarity of these averaged vectors. The convergence between empirical and synthetic scale correlations was remarkably high, exhibiting an accuracy of r(6,245) = .89, 95% CI = [.88, .90] (manifest correlation: r = .87, 95% CI = [.86, .87]). In other words, our fine-tuned LLM explained 80% of the latent variance in scale intercorrelations based on nothing but semantic information contained in the items (i.e., adopting the notion of distributional semantics that considers all contextual patterns as inherently semantic). Again, the SBERT model had substantially lower accuracy (manifest correlation: r = .33, 95% CI = [.30, .35]).
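The scale-level aggregation can be sketched as follows, with toy 3-dimensional vectors in place of the 768-dimensional item embeddings:

```python
import numpy as np

def scale_similarity(items_a, items_b):
    """Cosine similarity between the mean embedding vectors of two scales.

    Each argument is an (n_items x n_dimensions) array of item embeddings.
    """
    a = np.mean(items_a, axis=0)  # average item vectors to the scale level
    b = np.mean(items_b, axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for two scales of two items each.
scale_a = np.array([[0.9, 0.1, 0.0],
                    [0.8, 0.2, 0.1]])
scale_b = np.array([[0.7, 0.3, 0.0],
                    [0.9, 0.0, 0.2]])
print(round(scale_similarity(scale_a, scale_b), 3))
```

Averaging before computing similarity lets item-specific noise cancel out, which is consistent with the higher accuracy we observed at the scale level than at the item level.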
In summary, the LLM-based synthetic estimates closely approximated the empirical interitem and interscale correlations and reliability estimates and were robust to the checks detailed in Supplementary Note 2 in the Supplemental Material. Comparing predictions between the data sets used in this pilot study leads us to expect that the effects are robust and will generalize to new, previously unseen English-language items.
Design
The primary objective of our research was to test the generalizability of our model in predicting human-response patterns in survey data, that is, empirical item and scale correlations and scale reliabilities. Our model’s initial training data and our holdout represent a limited subset of the broader universe of survey items, with a skew toward personality psychology. We designed our validation study to challenge the model’s capabilities by sampling from a more varied array of psychological measures. We collected empirical data from a large online sample of English-speaking U.S. Americans, similar to most of the studies in our training data. Participants processed the scales in random order, and item order was randomized in each scale. Although we anticipated a modest reduction in effect size during Stage 2 compared with the outcomes observed in the pilot study, we expected that the LLM-based synthetic estimates would still be sufficiently accurate to be useful. For a summary of our methods and benchmarks, see Table 1.
Measures
To identify appropriate measures for our study, we conducted a comprehensive search of the APA PsycTests database. Our inclusion criteria for selecting scales were (a) utilization of rating scales as the response format; (b) items composed in the English language; (c) scales developed within the last 30 years to minimize confounding factors related to changes in the English language; (d) measures applicable to the general population, thus excluding scales applicable only to narrow target demographics, such as adoptive parents or particular professional groups; (e) measures applicable to a broad domain, thus excluding scales designed to rate specific consumer products or specific social attitudes; and (f) freely accessible, nonproprietary measures. These criteria were mainly intended to make it feasible to have an unselected sample respond to most items. Within these constraints, we sampled scales to cover a wide range of measures used in the social and behavioral sciences.
We did not always use all items in a scale so that participants would be able to respond to a larger number of scales. We included measures from industrial/organizational psychology, such as the Utrecht Work Engagement scale; social psychology, such as the Moral Foundations Questionnaire; developmental psychology, such as the Revised Adult Attachment Scale; clinical psychology, such as the Center for Epidemiological Studies Depression Scale (CES-D); emotion psychology, such as the Positive and Negative Affect Schedule (PANAS); personality psychology, such as Honesty-Humility in the HEXACO-60; and other social sciences, such as the Attitudes Toward AI in Defence Scale and the Survey Attitude Scale. For a full list of all scales, see Supplementary Note 5 in the Supplemental Material; all items are available on OSF. In all, we aimed to have participants answer 246 items distributed across 79 scales and subscales.
When possible, we adapted the response format to a 6-point Likert scale from strongly disagree to strongly agree. For the PANAS, CES-D, and the Perceived Stress Scale (PSS), we used a 6-point scale from never to most of the time to better fit the item content. Our guiding principle was that a more uniform presentation was more important than a perfectly faithful rendering of the original scale. In addition, our current model is unaware of differing response formats and cannot account for them.
Sampling plan
We used simulations to determine our number of scales, items, and survey participants. We wanted to precisely estimate the accuracy with which our synthetic estimates could approximate empirical estimates of interitem and interscale correlations. Sampling error at the participant level affects the standard error with which we estimate empirical interitem and interscale correlations and therefore would bias our accuracy estimates downward. To estimate empirical individual-item correlations, we planned to use an online panel provider to collect a U.S. quota sample of N = 450 before exclusions. In a quota sample, the panel provider attempts to approximately match the sample proportions to population proportions on three demographic variables: age, sex, and ethnicity. We had planned to limit participant recruitment to participants who have an approval rate exceeding 99% and have participated in at least 20 previous studies according to the sample provider, Prolific. However, this screener could not be combined with a quota sample, so no such limits were applied during recruitment. We paid participants regardless of whether they failed attention checks or completed the survey too quickly. In our planned analyses, we then estimated the accuracy of our manifest synthetic estimates for latent, error-free empirical estimates (see Supplementary Note 1 in the Supplemental Material).
From the APA PsycTests corpus, we sampled 246 items, which can be aggregated to 57 scales consisting of at least three items. We assumed we would retain a sample of at least 400 after exclusions. With the resulting 30,135 unique item pairs, we expected to infer the accuracy of our synthetic interitem correlations to a precision (standard error) of ±0.004, according to our simulations. Supplementing our 57 scales with 200 randomly constituted scales enabled us to infer the accuracy of our synthetic reliability estimates to a precision of ±0.03. With the resulting 1,568 unique scale pairs (excluding scale-subscale pairs), we aimed to infer the accuracy of our synthetic interscale correlations to a precision of ±0.007. The achieved precision is sufficient to detect even subtle deterioration in accuracy compared with our pilot-study estimates.
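The kind of simulation underlying such precision planning can be sketched as follows. This is a simplified stand-in for our actual simulation code: it estimates the standard error of a sample Pearson correlation at a given true correlation and number of pairs by drawing bivariate-normal samples:

```python
import numpy as np

def simulate_se_of_r(true_r, n, reps=2000, seed=1):
    """Monte Carlo standard error of a sample Pearson correlation
    for bivariate-normal data with a given true correlation."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, true_r], [true_r, 1.0]]
    x = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
    xc = x - x.mean(axis=1, keepdims=True)           # center each replicate
    num = (xc[:, :, 0] * xc[:, :, 1]).sum(axis=1)
    den = np.sqrt((xc[:, :, 0] ** 2).sum(axis=1) * (xc[:, :, 1] ** 2).sum(axis=1))
    return (num / den).std()                          # SD of r across replicates

# Compare with the analytic approximation (1 - r**2) / sqrt(n - 1) ≈ .032.
se = simulate_se_of_r(0.6, n=400)
print(round(se, 3))
```

Repeating such simulations over candidate numbers of items, scales, and participants is how target precisions like ±0.004 can be traded off against feasibility.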
Analysis plan
We followed recommendations by Goldammer et al. (2020) and Yentes (2020) for identifying and excluding participants exhibiting problematic response patterns (e.g., careless responding). Accordingly, participants were excluded if any of the following conditions were met: (a) the participant voluntarily indicated not having responded seriously, (b) the multivariate outlier statistic (Mahalanobis distance) exceeded a threshold set for 99% specificity, (c) the participant’s correlation across psychometric synonyms (defined as item pairs with a sample-level r > .60) fell below r = .22, (d) the participant’s correlation across psychometric antonyms (defined as item pairs with r ≤ −.40) exceeded r = −.03, (e) the personal even-odd consistency index across scales was low (r ≤ .45), and (f) average response times were below 2 s per item. We checked the robustness of our conclusions to differently defined exclusion criteria.
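To make the within-person consistency criteria (c) and (d) concrete, the following is a minimal sketch of how such an index can be computed. Function names and the toy data are our own illustration, not the authors' code:

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def consistency_index(responses, pairs):
    """Within-person correlation across item pairs (psychometric
    synonyms if sample-level r > .60, antonyms if r <= -.40)."""
    left = [responses[a] for a, _ in pairs]
    right = [responses[b] for _, b in pairs]
    return pearson(left, right)

# Toy participant: responses to six items on a 1-5 rating scale
responses = {"i1": 5, "i2": 5, "i3": 2, "i4": 1, "i5": 4, "i6": 4}
synonym_pairs = [("i1", "i2"), ("i3", "i4"), ("i5", "i6")]

# Exclusion rule (c): flag if the synonym index falls below .22
flag = consistency_index(responses, synonym_pairs) < 0.22
```

For criterion (d), the same index would be computed over antonym pairs, and the participant flagged if it exceeds −.03.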
We then computed all empirical interitem correlations, interscale correlations, and reliabilities. Interitem correlations used Pearson’s product-moment correlations. We aggregated scales as the means of their items after reversing reverse-coded items. Interscale correlations were then computed as manifest Pearson’s product-moment correlations. Reliability was estimated with Cronbach’s alpha based on the interitem correlations. We uploaded synthetic estimates of the SBERT model and the SurveyBot3000 model for all of these coefficients to the OSF. The code for our preregistered analyses mirrored the code from our pilot study, including the robustness checks detailed in Supplementary Note 2 in the Supplemental Material. We planned to freeze both code and point predictions as part of our preregistration, but because of a miscommunication between us, nobody froze the repository, and only point predictions for item correlations were uploaded to OSF. Because we discovered typographical errors in our version of the Moral Foundations Questionnaire, we revised the related point predictions after Stage 1 acceptance. After data collection, we merged empirical and synthetic estimates.
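As a sketch of this scoring pipeline (reverse-coding, scale means, and alpha from the interitem correlations), the standardized form of Cronbach's alpha shown below is one common variant and may differ in detail from the authors' implementation; the toy data are our own:

```python
def reverse(value, low=1, high=5):
    """Reverse-code a response on a low..high rating scale."""
    return low + high - value

def scale_score(responses, items, reversed_items, low=1, high=5):
    """Mean of a participant's item responses after reverse-coding."""
    vals = [reverse(responses[i], low, high) if i in reversed_items
            else responses[i] for i in items]
    return sum(vals) / len(vals)

def standardized_alpha(r_matrix):
    """Cronbach's alpha from a k x k interitem correlation matrix:
    alpha = k * r_bar / (1 + (k - 1) * r_bar), where r_bar is the
    mean off-diagonal correlation."""
    k = len(r_matrix)
    off = [r_matrix[i][j] for i in range(k) for j in range(k) if i != j]
    r_bar = sum(off) / len(off)
    return k * r_bar / (1 + (k - 1) * r_bar)

# Three items whose interitem correlations all equal .50
r = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
alpha = standardized_alpha(r)  # 3 * .5 / (1 + 2 * .5) = .75
```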
The central performance metric in this study is accuracy, defined as the convergence between synthetic and empirical estimates (not to be conflated with evaluation metrics of binary classifiers). We thus refer to “manifest accuracy” as the Pearson correlation between synthetic and empirical coefficients. We quantified “latent accuracy” using two complementary approaches that account for sampling error in empirical estimates. First, we used a structural-equation-modeling approach in which we fixed the residual variance of empirical estimates to the average sampling-error variance and allowed manifest synthetic estimates to correlate with the latent variable. Second, we disattenuated for the standard error of the empirical estimates using a Bayesian errors-in-variables model, which allows for heteroskedastic accuracy (see Supplementary Note 1 in the Supplemental Material). We used the latter model as our primary estimate for latent accuracy. We also report the prediction error for all three quantities and a plot similar to Figure 2. We furthermore report manifest and latent accuracies of the SBERT model, which we used as a benchmark (see Table 1).
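The logic of correcting manifest accuracy for sampling error can be sketched with a simple homoskedastic disattenuation formula. This is our illustrative simplification, not the Bayesian errors-in-variables model the authors used as their primary estimate (which additionally allows heteroskedastic accuracy), and all numbers in the example are made up:

```python
def latent_accuracy(manifest_r, observed_var, mean_se_sq):
    """Disattenuate a manifest synthetic-empirical correlation for
    sampling error in the empirical coefficients.

    The 'reliability' of the empirical estimates is the share of
    their observed variance not due to sampling error; dividing the
    manifest correlation by its square root yields latent accuracy.
    """
    reliability = 1.0 - mean_se_sq / observed_var
    return manifest_r / reliability ** 0.5

# Illustrative only: manifest r = .57, empirical estimates with
# observed variance .040 and average squared standard error .003
r_latent = latent_accuracy(0.57, 0.040, 0.003)
```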
Results
We collected data from 470 participants using Prolific’s online participant-recruitment system. Because a bug in our questionnaire disrupted participation for an initial batch of participants who later returned to the study, we exceeded our planned sample size of 450 (for deviations from preregistration, see Supplementary Note 7 in the Supplemental Material).
We preregistered overly strict exclusion criteria because we misread Goldammer et al. (2020). After applying the preregistered criteria, only 136 participants remained. Therefore, for our main analyses, we used an adapted set of criteria that more closely followed Goldammer et al.’s recommendations, leaving 387 participants (see Table S7 in the Supplemental Material). However, results for item-pair correlations were robust to different exclusion-criteria definitions and to including all participants (see Supplementary Note 9 in the Supplemental Material). After applying the adapted exclusion criteria, the remaining sample had a mean age of 46.96 years (SD = 15.58, range = 18–86) and was 47% male. Most (63%) participants identified as non-Hispanic White, 13% identified as Black, and 12% identified as Hispanic. Four participants reported no high school education, 46 had a high school degree, 80 had some college experience, and 257 reported 3 or more years of college experience. For further and more detailed demographic information, see the online codebook.
All participants responded to a set of 219 items. Participants who reported being employed answered an additional set of 27 items specific to employment; 28% of the sample (n = 110) were unemployed (or students, etc.) and did not. We calculated pairwise Pearson’s product-moment correlations between all item pairs in this set. We tested the accuracy of the preregistered SurveyBot3000 synthetic correlations against the empirical correlations of the resulting 30,135 item pairs.
Item-pair correlations
Adjusting for sampling error in the empirical data (see Supplementary Note 1 in the Supplemental Material), we found that the model’s synthetic correlations predicted the empirical interitem correlations with an accuracy of r = .59 (95% CI = [.58, .60]; manifest correlation: r = .57, 95% CI = [.56, .58]; Fig. 3). Accuracy deteriorated compared with the holdout in our pilot study (to 83% of the r = .71 in the pilot), but our model was still able to generalize to this diverse set of items. Figure 3 shows the prediction of item correlations through semantic similarity, as estimated by the SBERT and SurveyBot3000 models. The SBERT model had substantially lower accuracy in predicting interitem correlations (accuracy = .33, 95% CI = [.32, .34]). We also computed the prediction error of the SurveyBot3000, that is, how far off predictions were after accounting for sampling error in the empirical correlations. The average root mean square error (RMSE) was .17, 95% CI = [.17, .17]. However, prediction error was larger when synthetic correlations were middling (.00–.60) and smaller when they were negative or larger than .60 (see Fig. 4).
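The RMSE reported here is the root mean square difference between synthetic and empirical coefficients; binning it by the synthetic estimate shows the kind of heteroskedastic pattern that Figure 4 smooths with splines. A minimal sketch with made-up toy pairs (the bin edges and data are illustrative, not the authors' values):

```python
def rmse(pairs):
    """Root mean square error over (synthetic, empirical) pairs."""
    n = len(pairs)
    return (sum((s - e) ** 2 for s, e in pairs) / n) ** 0.5

def rmse_by_bin(pairs, edges):
    """RMSE within bins of the synthetic estimate, mirroring the idea
    that prediction error varies with the predicted value."""
    out = {}
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [(s, e) for s, e in pairs if lo <= s < hi]
        if in_bin:
            out[(lo, hi)] = rmse(in_bin)
    return out

# Toy (synthetic, empirical) correlation pairs
toy = [(-0.4, -0.35), (0.1, 0.3), (0.2, -0.05), (0.7, 0.72)]
errors = rmse_by_bin(toy, [-1.0, 0.0, 0.6, 1.0])
```

In this toy example, the middling bin (.00 to .60) shows a larger error than the high-positive bin, echoing the pattern in Figure 4.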

Scatter plots of the synthetic and empirical estimates, validation study (Stage 2). Showing N = 30,135 item-pair correlations, N = 257 scale reliabilities, and N = 1,568 scale-pair correlations for (top) the pretrained SBERT model and (bottom) the fine-tuned SurveyBot3000 model. SBERT = all-mpnet-base-v2 model.

Prediction error of the synthetic estimates, validation study (Stage 2). Our prediction model allowed the error term to vary freely according to the predictor, the synthetic estimate. The thin-plate splines show that some synthetic estimates were predictably more accurate.
Scale reliabilities
We investigated the model’s ability to predict scale reliabilities (Cronbach’s alpha), which can be calculated from interitem correlation matrices. For the 57 scales consisting of at least three items, the manifest accuracy of the synthetic alpha coefficients was .64, 95% CI = [.45, .77]. This accuracy was slightly reduced compared with the pilot (94% of r = .68). Because all scales from the literature had restricted variability in reliability coefficients, we randomly sampled items to create 200 additional, varied scales. Unlike in the pilot, we reversed items randomly (not according to empirical correlations) and did not omit scales whose empirical Cronbach’s alpha estimate was negative (see Table S7 in the Supplemental Material). We chose to make these changes to clarify that the synthetic alphas are in fact unbiased when we do not select on positive empirical alphas. We found that synthetic reliability estimates were highly accurate at r(257) = .84, 95% CI = [.79, .90] (manifest correlation: r = .85, 95% CI = [.81, .88]). The SBERT model had lower accuracy than the SurveyBot3000 but performed much better than in the pilot study (manifest correlation: r = .64, 95% CI = [.56, .71]). The average RMSE of the SurveyBot3000 estimates was .27, 95% CI = [.21, .33]. However, prediction error dropped below .10 when synthetic alphas entered the range seen in the real scales (above .60).
Scale-pair correlations
We investigated the model’s validity for scale-level predictions. For all scales with at least three items, we averaged the vector representations of all items (after reversing reverse-scored items) and then computed the cosine similarity of these averaged vectors. The accuracy of synthetic-scale correlations was r(1,568) = .84, 95% CI = [.82, .87] (excluding scale-subscale pairs; manifest correlation: r = .83, 95% CI = [.81, .85]). Our fine-tuned LLM explained 71% of the latent variance in scale intercorrelations based on nothing but semantic information contained in the items. Manifest accuracy for the 228 scale pairs in which each scale had at least five items was r = .88. Performance was slightly attenuated compared with the pilot (94% of r = .89), but this may be partly because scales in this set were slightly shorter (number of items: M = 5.75) than in the pilot (number of items: M = 6.79); see also Supplementary Note 8 in the Supplemental Material. As for synthetic reliabilities, the SBERT model had lower accuracy than the SurveyBot3000 but performed much better than in the pilot study (manifest correlation: r = .50 [.46, .54]). The average RMSE of the SurveyBot3000 estimates was .16, 95% CI = [.15, .17]. As for item correlations, prediction error was larger for middling synthetic estimates (.00–.50) than for negative and high positive estimates (Fig. 4).
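The scale-level prediction thus reduces to averaging item embeddings and taking a cosine. A minimal sketch with toy three-dimensional vectors follows (real all-mpnet-base-v2 embeddings have 768 dimensions, and how reverse-scored items are flipped in vector space depends on the model's calibration, which we do not reproduce here):

```python
def mean_vector(vectors):
    """Componentwise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity: the synthetic estimate of a scale correlation."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" for the items of two short scales
scale_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.0]]
scale_b = [[0.7, 0.3, 0.1], [0.9, 0.1, 0.2], [0.8, 0.2, 0.0]]

synthetic_scale_r = cosine(mean_vector(scale_a), mean_vector(scale_b))
```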
By domain
We investigated the accuracy of our synthetic interitem correlations by domain. We had grouped scales into five domains (attitudes, personality, clinical, social, and occupational psychology). Manifest accuracy was lowest for attitudes (within the attitude domain: r = .34; when attitude items were correlated with items in other domains: r = .31) and highest for occupational psychology (within domains: r = .75; across domains: r = .65). In all domains, the SurveyBot3000 predictions outperformed the SBERT predictions, so there was no obvious trade-off between fine-tuning and generalizability (see Fig. 5).

Accuracy by domain. Accuracy differed across domains. SurveyBot3000 accuracy (colored) was always higher than SBERT accuracy (gray). Results were largely consistent whether accuracy of items was tested (left, circle) within domains or (right, cross) across domains.
Robustness checks
We repeated all robustness checks we conducted for the pilot study and added additional checks. Because we had preregistered overly strict exclusion criteria and because we were unable to combine quota sampling with a screener for highly rated Prolific participants, we estimated the accuracy of the synthetic item correlations after applying different sets of defensible exclusion criteria. After accounting for sampling error, accuracy varied between .57 and .59 depending on the exclusion criteria, that is, not substantially (Fig. 6; see also Supplementary Note 9 in the Supplemental Material). For further robustness checks and sensitivity analyses, see Supplementary Note 8 in the Supplemental Material.

Changes in manifest and latent accuracy after applying different exclusion criteria (or none) (see Supplementary Note 7 in the Supplemental Material available online).
Discussion
We introduce a computational-linguistics approach that synthetically predicts associations between survey responses—including item-level correlations, scale-level relationships, and derived psychometric properties—with high accuracy. Using our SurveyBot3000, these synthetic estimates have a margin of error that is comparable with a small pilot study but free and instant. Our preregistered validation study confirms the convergence between synthetic predictions and empirical data sets, validating the method’s ability to mirror real-world reliability coefficients, scale correlations, and covariance patterns, even outside the content domain of personality psychology.
Accuracy in our preregistered validation was attenuated compared with our pilot study (down to 83% of the pilot study’s accuracy for item pairs) but never to the level of the pretrained model. So even though the items spanned a broader domain, the synthetic estimates had margins of error comparable with a small pilot study. Attenuation was strongest for item pairs (r = .59, 95% CI = [.58, .60]). After aggregation, accuracy was higher for scale pairs (latent correlation: r = .84, 95% CI = [.82, .87]) and for reliabilities (r = .84, 95% CI = [.79, .90]; attenuation to 94% of the pilot study’s accuracy). Our prediction model allowed for the margin of error to depend on the synthetic estimate. Indeed, because the SurveyBot3000 still sometimes predicts positive correlations instead of negative correlations, negative synthetic estimates are more accurate (see Fig. 4). For instance, a negative synthetic-scale correlation is estimated about as accurately as in an N = 80 pilot study, whereas a positive correlation is estimated only about as accurately as in an N = 20 pilot study (see Supplementary Note 10 in the Supplemental Material). The margin of error was also larger for synthetic reliabilities below commonly used cutoffs (i.e., <.60).
Recent related contributions on computational modeling for survey research (e.g., Hernandez & Nie, 2023; Schoenegger et al., 2024; Wulff & Mata, 2023) highlight the field’s growing interest in synthetic prediction of psychometric patterns. In a recent update to their work, Wulff and Mata (2025) adopted fine-tuning techniques that improve on their earlier results, yielding accuracies that approach the performance we report here but limited to absolute correlations. In another parallel effort, Schoenegger et al. (2024) reported comparable performance of the proprietary model PersonalityMap and the SurveyBot3000. However, this comparison is difficult to interpret because the SurveyBot3000 was trained on the data used as the test set and the PersonalityMap model is proprietary, which makes it difficult to assess leakage and generalizability.
Our work advances this area of synthetic survey modeling not mainly by reporting top-tier accuracy but through methodological innovations and practical tools designed to improve rigor, transparency, generalizability, and accessibility.
First, we introduce a two-step training protocol that refines sentence-transformer models for robust prediction of survey-response associations. Key safeguards include training on a diverse item corpus to minimize domain bias, strict contamination controls to prevent overfitting, and systematic hyperparameter optimization. A novel calibration step further enables the model to predict negative correlations (e.g., opposing items), more accurately reflecting the empirical distribution of coefficients. The resulting model, the SurveyBot3000, demonstrates performance exceeding known human capabilities in correlation judgment (Epstein & Teraspulsky, 1986).
Second, to ensure transparency and minimize analytic flexibility, we preregistered our validation protocol and underwent formal Stage 1 peer review before testing. This safeguards against overfitting and confirms that accuracy claims are not artifacts of post hoc adjustments.
Third, we systematically evaluate generalizability across psychological domains, including personality, clinical, social, and occupational psychology as well as social attitudes. Although item-level accuracy varies with conceptual diversity—attenuated in cross-domain tests compared with our pilot study—the SurveyBot3000 always outperformed the pretrained baseline model (i.e., SBERT), so our fine-tuning did not impede generalizability.
Finally, we provide an open-access web application (https://huggingface.co/spaces/magnolia-psychometrics/synthetic-correlations) to democratize access to synthetic psychometric predictions. The tool generates immediate estimates of internal consistency, scale structure, and interitem correlations from text inputs, offering researchers a free pretesting resource with guidance for responsible interpretation. The application can be considered a free pilot study of survey items to investigate factor structure and internal consistency. Similar to pilot studies, synthetic estimates can tell researchers “where to look” but should always be followed up with more empirical data before conclusions are drawn.
As the behavioral sciences grapple with an ever-expanding universe of often redundant measures, our line of research has the potential to reorganize the vast collection of scales accumulated over the past decades of research and to help prevent further proliferation and fragmentation in the future (Anvari, Alsalti, Oehler, Hussey, et al., 2025; Anvari, Alsalti, Oehler, Marion, et al., 2025; Elson et al., 2023). Rosenbusch et al. (2020) laid important groundwork on computational language-based methods to semantically search for psychometric scales but were constrained by the technological limitations of their time. Our results and work on the SurveyBot3000 encourage us that the technological foundation for such an ambitious undertaking has matured.
The APA PsycTests database currently holds more than 78,000 records, and most scales are used only once or twice (Anvari, Alsalti, Oehler, Hussey, et al., 2025; Elson et al., 2023). With both the methodology and the data in place, we propose that future research efforts should be dedicated toward the development of a semantic search engine. Searching such a “synthetic nomological net” could reveal potential overlap between tens of thousands of items and scales and ultimately help researchers avoid redundancy and confusing labels. A more parsimonious ontology could then enable better evidence synthesis. A semantic search engine could be a tool in the scale-development and peer-review processes to help authors and reviewers assess the incremental value of newly developed scales and proposed constructs. Potential redundancies and confusing labels (e.g., jingle/jangle fallacies; Wulff & Mata, 2023) could then be flagged for empirical follow-up. Such a system would make the search problem tractable. That is, the SurveyBot3000 could help pick, from the tens of thousands of scales in existence, those against which a novel scale should be empirically evaluated for discriminant validity. That way, humans remain in the loop. We believe that this line of work exemplifies a responsible integration of LLMs into research, which is a topic of current debate (Binz et al., 2023). Specifically, the collaborative circumstances in scale development carry minimal risk for harmful effects on the scientific ecosystem. False negatives (i.e., the model fails to detect redundant scales) would merely maintain the status quo, which has led to construct proliferation in the first place. False positives (e.g., the model incorrectly flags two measures as redundant) would require researchers to verify this empirically before drawing conclusions.
This balanced approach, in which LLMs accelerate discovery and human researchers retain interpretive authority, should characterize a productive human-artificial intelligence collaboration across the social and behavioral sciences.
To further strengthen the potential of computer-linguistic approaches to survey pattern prediction, we noted some limitations in the SurveyBot3000 that need to be addressed by future research. Despite the strong convergence between synthetic and empirical data in both the pilot and validation studies, the SurveyBot3000 occasionally struggled to infer negative correlations.
Although polarity calibration clearly improved the model’s handling of negatively worded items overall (see Supplementary Note 6 in the Supplemental Material), the synthetic estimates still had a bias toward positive signs. Of the empirical correlations, 59% were positive, whereas 67% of the synthetic correlations were. In keeping with this, a negative synthetic item correlation predicted the empirical sign incorrectly slightly less often (16%) than a positive synthetic item correlation (19%). If a human user of our app corrected the coefficient sign in such small-scale applications, manifest accuracy would improve by .11, yielding an overall convergence of .68 between synthetic estimates and empirical correlations.
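The hypothesized benefit of human sign correction can be sketched as follows. This is a toy illustration of the idea, not the authors' analysis; in practice, the user would judge the sign from the item wording:

```python
def sign_corrected(synthetic, empirical_signs):
    """Replace each synthetic correlation's sign with the known
    empirical sign while keeping its magnitude."""
    return [abs(s) * (1 if t >= 0 else -1)
            for s, t in zip(synthetic, empirical_signs)]

# Toy synthetic estimates and human judgments of the true signs
corrected = sign_corrected([0.30, -0.20, 0.15], [-1, 1, 1])
```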
Various linguistic aspects were associated with impaired predictions, but no clear pattern emerged. For example, items that avoided self-directed language were predicted less accurately. However, for such items, we did not observe any increase in accuracy after rephrasing the statements to use first-person pronouns (see Supplementary Note 8 in the Supplemental Material). In the current study, item length, self-directedness, sentence complexity, and content domain are all confounded with one another. Further efforts could be directed toward systematically manipulating and investigating lexicographic (e.g., grammatical form, item length) and item-metric (e.g., observability, temporality; Leising et al., 2014; Leistner et al., 2024) features potentially influencing accuracy in survey pattern prediction independently of content domain (Hommel, 2024).
Both the sign prediction errors and accuracy fluctuations arising from unconventional linguistic aspects could potentially be addressed by recent innovations. For example, Opitz and Frank (2022) showed that vector representations of text can be decomposed into explainable semantic features. Instead of comparing vectors monolithically, future approaches could isolate psychometrically relevant information by separating residual features in vector space. This decomposition approach may help establish theoretical upper bounds on prediction accuracy by distinguishing between different types of semantic content captured in vector space, including conceptual meaning, but also peripheral semantic information, such as survey-response tendencies.
Beyond these technical refinements, model performance could be enhanced through a more balanced training corpus, as suggested by domain-specific variations in predictive accuracy. For instance, synthetic estimates for clinical-psychology measures performed worse than for social-psychology measures, reflecting the limited representation of psychopathology items in our training data. Balancing the corpus aligns with established principles of language-model development in which capabilities consistently improve with increased training data, model size, and computational resources (Kaplan et al., 2020). However, the same note of caution as above applies because content domain is confounded with lexicographic and item-metric aspects. In addition, the low accuracy of synthetic estimates in the attitude domain can be partly attributed to the fact that attitude items have lower absolute intercorrelations on average, so there is less variance to explain. On the RMSE metric of accuracy, which does not have this issue, attitude items had middling accuracy compared with other domains.
Robust evaluation protocols are essential to systematically assess and compare the capabilities and limitations of current and future model developments. To this end, benchmark tests are usually established for specific tasks related to language modeling using infrastructure providers such as Hugging Face (n.d.) and Kaggle (Kaggle, 2025). We recommend that efforts should be undertaken to develop such a standardized holdout set to objectively track future progress in survey-pattern prediction with comparable accuracy metrics. Because many currently available fine-tuned models are trained on the same or overlapping data (chiefly the Synthetic Aperture Personality Assessment; Condon et al., 2017), it is currently difficult to compare models: Teams divide training and test partitions differently, that is, one model is trained on the data that another team uses as its benchmark. For fair comparisons, the field needs transparency about the contents of the training data, including for proprietary models, or ways to generate items that are guaranteed to be novel.
Our final report deviates from our preplanned Stage 1 protocol in several ways. We transparently communicated these deviations according to Willroth and Atherton (2024) in Supplementary Note 7 in the Supplemental Material and reported additional robustness checks to study the impact of these deviations on our results. We found that latent accuracy was largely unaffected after readjusting exclusion criteria and generally conclude that the deviations had little impact.
Sentence transformers can effectively model psychometric properties and relationships using solely the semantic information contained within item texts. Our work establishes a method that produces synthetic predictions that converge with empirical survey data and demonstrates robust generalization beyond the training domain. We see many potential applications, simplified through the web app we have released. The SurveyBot3000’s synthetic estimates have a margin of error comparable with a small pilot study. As with pilot studies, the synthetic estimates can guide an investigation but need to be followed up by human researchers with human data. By making synthetic estimates freely available, we hope to reduce ad hoc measurement culture. Researchers should now find it easier to compare existing measures and identify old and new measures with desirable psychometric properties.
Looking ahead, incorporating recent advances in computational linguistics may yield increasingly precise models that could serve as foundational tools for untangling the nomological net (Cronbach & Meehl, 1955) and constructing a unified taxonomy of psychological measures.
Supplemental Material
sj-docx-1-amp-10.1177_25152459251377093 – Supplemental material for “Language Models Accurately Infer Correlations Between Psychological Items and Scales From Text Alone” by Björn E. Hommel and Ruben C. Arslan in Advances in Methods and Practices in Psychological Science
Acknowledgements
We thank Stefan Schmukle, Anne Scheel, Julia Rohrer, Malte Elson, Taym Alsalti, Ian Hussey, Saloni Dattani, David Condon, Dirk Wulff, and Jan Arnulf for helpful discussions. We also thank Jan-Paul Ries, Lorenz Oehler, and Sarah Lennartz for comments on an earlier version of this article. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We have shared all key materials, including the training and analysis code, on OSF at https://osf.io/z47qs/. The existing data used for training and in the pilot study have been openly shared, and we link to the original sources. Anonymized data for the validation study have also been shared on OSF. Stage 1 preprint: https://osf.io/preprints/psyarxiv/kjuce_v1; statistical reports and interactive plots: https://synth-science.github.io/surveybot3000/; app: https://huggingface.co/spaces/magnolia-psychometrics/synthetic-correlations.
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra
Author Contributions
B.E. Hommel and R. C. Arslan contributed equally to this article.