Abstract
Language-based assessments (LBAs), quantitative estimates of scientific constructs based on language, have advanced methods in the psychological and social sciences for more than a decade. LBAs that use large language models to score individuals’ prompted descriptions of their psychological states and traits have shown strong convergence with the corresponding rating scales (r > .80) and have often surpassed rating scales in predicting theoretically relevant behaviors (external criteria). Despite their high validity across numerous psychological outcomes and contexts, the broader adoption of LBA models (LBAMs) has been limited. Even when made available alongside research publications, these models often remain inaccessible because of technical complexities, inconsistent documentation, and the absence of a standardized repository. In this tutorial, we introduce the Language-Based Assessment Models (L-BAM) Library, a framework targeted to social and psychological scientists for accessibly sharing models with others, and a toolkit for easily using LBAMs via the text package in R. L-BAM covers a wide range of models for assessing mental-health disorders (e.g., depression, anxiety), well-being (e.g., satisfaction with life, harmony in life), implicit motives (need for power, affiliation, and achievement), and more. The L-BAM Library aims to increase the availability and resource efficiency of LBAs of psychological constructs while encouraging replication, independent validation, and the broad application of preexisting LBAMs.
The language people use to describe themselves and their state of mind can answer research questions concerning what they think (Al-Mosaiwi & Johnstone, 2018), how they feel (Pennebaker et al., 2003; Zimmermann et al., 2017), what they do (Hu et al., 2016), who they are (e.g., J. Chen et al., 2020; Kwantes et al., 2016), how they interact with others (Bayram & Ta, 2019; Ireland et al., 2011), how they make sense of the world (Fausey & Boroditsky, 2010; Sterling et al., 2020), how they behave (Mehl et al., 2007; Tidwell et al., 2025), and much more. Language provides rich psychological information that extends beyond traditional closed-ended assessment methods (Boyd et al., 2024; Kjell, Kjell, & Schwartz, 2024).
Language-based assessments (LBAs) can be viewed as a new family of psychological-measurement tools based on the assumption that language reliably reflects underlying states, traits, values, thoughts, feelings, and so on (Boyd et al., 2024; Kjell et al., 2019; Park et al., 2015). Unlike traditional closed-ended scales, which constrain responses to predefined items, LBAs leverage the natural expressiveness of language to derive quantitative assessment scores. For example, a large language model can be used to convert natural language (e.g., social media posts) into numerical representations, which are then used as predictors in a regression model to estimate depression-severity scores. Such models are called “LBA models” (LBAMs; see Argamon et al., 2007; Boyd & Schwartz, 2021; Kjell et al., 2024; Park et al., 2015; Tausczik & Pennebaker, 2010). They may serve as a complementary method to traditional assessment methods, such as informant reports, behavioral tasks, or physiological recordings, but with unique advantages tied to linguistic richness.
Using language to quantitatively assess psychological states and traits offers several advantages. First, natural language is the primary means through which individuals express complex psychological experiences (e.g., Tausczik & Pennebaker, 2010). Second, natural language possesses great measurement properties, including, for example, broad range, fine resolution, and openness (Kjell et al., 2024). The broad range of language (e.g., close to a million words in English vs. five, seven, or 11 scale steps) allows individuals to express extreme states (e.g., from hopeless to ecstatic), and its fine resolution enables distinctions between subtle emotional nuances (e.g., differentiating between worried, uneasy, tense, and panicked). Finally, the openness of language enables individuals to generate personalized responses, overcoming the limitations of predefined response categories found in traditional assessments.
In recent years, researchers have analyzed language quantitatively by transforming it into numbers (e.g., with large language models) and using those numbers to develop regression models that predict psychological outcomes. These models, the LBAMs introduced above, use the nuances of language to predict a given criterion (see Argamon et al., 2007; Boyd & Schwartz, 2021; Kjell et al., 2024; Park et al., 2015; Tausczik & Pennebaker, 2010). Recently, several LBAMs have been developed and validated to assess psychological constructs, such as depression severity (Gu et al., 2024), harmony in life (Kjell et al., 2022), and implicit motives (Nilsson et al., 2025). Despite their validity across numerous psychological outcomes and contexts, the broader adoption of LBAMs has been limited. In this tutorial, we introduce and describe the Language-Based Assessment Model (L-BAM) Library, an open library for sharing pretrained LBAMs, from which the models can easily be used with one function from the R package text. The L-BAM Library is targeted toward social and psychological scientists and aims to facilitate the reproducibility, comparability, and accessibility of LBAMs by providing standardized tools and methodologies for researchers. By making these models easily available, we encourage independent validation, broader application across diverse psychological domains, and more efficient use of existing resources. We outline how researchers can use LBAMs to assess psychological constructs and provide guidelines for contributing new models to the library.
LBAs Can Improve Psychological Science
Accurate quantification of mental states and traits is essential for psychological science, enabling researchers to systematically assess, compare, and track psychological constructs and experiences across individuals. Over the last 90 years, rating scales based on narrowly defined questions coupled with closed-ended response formats (i.e., Likert scales; Likert, 1932) have come to dominate the assessment of psychological constructs. Although the rating-scale method has led to important findings, the format comes with limitations, such as preventing respondents from comprehensively describing their unique experiences and state of mind. Although language is initially more complex to analyze than rating scales, recent advancements in artificial intelligence (AI) and natural language processing now allow researchers to translate rich language descriptions into meaningful numerical assessments that align with and even enhance traditional psychometric measures.
Methodological flexibility
LBAs offer considerable methodological flexibility and have been increasingly applied across a wide range of psychological constructs and related behaviors (e.g., Boyd & Schwartz, 2021; Kjell, Kjell, & Schwartz, 2024; Mihalcea et al., 2024). LBAs have enabled researchers to assess, among others, personality (Park et al., 2015; Schwartz et al., 2013), implicit motives (Brede et al., 2025; Nilsson et al., 2025), well-being (Jaidka et al., 2020; Sametoglu et al., 2024), and mental illness, such as depression (Gu et al., 2025; Perlis et al., 2024), anxiety (Gu et al., 2025; Teferra & Rose, 2023), and posttraumatic stress disorder (Son et al., 2020). LBAs can also be developed for behaviors that are theoretically relevant to psychological constructs, such as alcohol consumption (Jose et al., 2022; Nilsson et al., 2024), cooperation (Kjell et al., 2021), and suicide (Y. Chen et al., 2024; W. Zhou et al., 2023); somatic diseases (e.g., heart disease, Eichstaedt et al., 2015; or cancer, S. Zhou et al., 2022); and demographic variables, including age and gender (Ganesan et al., 2021; Sarwar et al., 2024).
LBAs can be applied both to probed language data (elicited through targeted open-ended questions) and to already existing language data gathered from natural contexts. Probed LBAs ask individuals to describe their state of mind, personal experiences, or specific topics in their own words. These assessments have demonstrated very strong convergent validity with traditional rating scales, with an accuracy approaching or reaching the scales’ reliability, which is the theoretical upper limit of concurrent accuracy (r > .80; Gu et al., 2024; Kjell et al., 2022; Nilsson et al., in review). In addition, there are several other examples of using probed language, which include asking participants to describe their activities (Nilsson et al., 2022) and themselves in various ways (e.g., Kwantes et al., 2016), recall various memories (Yeung et al., 2024), report their stream of consciousness (Sripada & Taxali, 2020), and so on.
LBAs using already existing language data have demonstrated the ability to assess a wide range of physical and psychological outcomes. For instance, social media language has been linked to mental- and physical-health markers (Eichstaedt et al., 2015, 2018; Kjell, Giorgi, Schwartz, & Eichstaedt, 2023), and transcripts of everyday speech have been used to capture emotional fluctuations throughout the day (Sun et al., 2020). There are many existing sources of language that can be analyzed using LBAs, including chats, blogs, text messages, emails, letters, personal diaries, and song lyrics. In addition, language from more specialized settings, such as therapy-session transcripts (Lalk et al., 2024), medical notes (Shah, 2024), and political speeches (Liu, Zhang, et al., 2022), can offer valuable insights for psychological analysis. Together, these two methodological approaches—probed and naturalistic—allow researchers to tailor LBAs to a wide range of research designs and data sources.
Theoretical depth
LBAs also hold promise for advancing psychological theory. For instance, LBAs can be used to explore how individuals naturally express psychological phenomena, offering bottom-up insights that can refine existing theories about constructs (see e.g., Bucur et al., 2021; Coppersmith et al., 2014; Gu et al., 2025; Liu, Ungar, et al., 2022; Nilsson et al., 2024; Stade et al., 2023). Furthermore, because LBAMs can be applied on a large scale quite easily, they open the door to new applications that can help expand research in a field. For example, implicit-motive (i.e., subconscious needs) assessments have historically been resource-intensive because manually coding text for motives requires considerable time and effort, which has limited research and theory development. With automated coding from implicit-motive LBAMs (e.g., Brede et al., 2025; Nilsson et al., 2025), it is possible to assess implicit motives at a much larger scale than was ever practically possible before. This applies both to assessing implicit motives via the classic picture-story exercise and to applying the implicit-motive LBAMs to other texts, such as company reports or social media texts (which the coding manual theoretically allows; Winter, 1991), to understand, for example, whether power-oriented companies are more or less successful.
By aligning psychological measurement with natural human expression, LBAs enable individuals to communicate their experiences in their own words, offering a powerful complement to traditional closed-ended assessments. For an overview of research studies developing LBAs included in the L-BAM Library, see Table 1.
Example Uses of Language-Based Assessment Models
The Need for the L-BAM Library
Despite solid evidence for the broad applicability, usefulness, validity, and reliability of LBAs, the sharing of models so that they can be easily used by others is currently limited, and there is no centralized library facilitating information and model sharing. Even when made available alongside research publications, these models often remain inaccessible because of technical complexities, inconsistent documentation, and the absence of a standardized library. These limitations restrict resource efficiency, hinder replicability, and impede independent evaluation and systematic testing of generalizability. We believe sharing LBAMs is essential for five key reasons: (a) It is resource-efficient because not every research group needs to develop its own models; (b) it supports replication and independent validation, which are critical for tackling psychology’s replication crisis (Simmons et al., 2011); (c) it ensures increased comparability when the same models are applied across studies; (d) it allows concerns about generalizability to be systematically addressed because researchers can evaluate performance across diverse samples, languages, and settings using the same tools; and (e) for LBAs to have a practical impact on psychology (or other fields, e.g., medicine), models need to be shared in an accessible format that allows the broader scientific community to implement them effectively. Just as researchers have successfully shared validated questionnaires, they can also share LBAMs. Although there are repositories for uploading models, such as GitHub, Hugging Face, and OSF, the L-BAM Library is just that, a library: It catalogs and documents models, whereas the actual models are hosted on those repository platforms.
Tutorial
In this tutorial, we use the text package (Version 1.8), an R package that lets users download and use large language models and develop LBAMs (Kjell, Giorgi, & Schwartz, 2023). There exist other packages for advanced language analysis, such as DLATK (Schwartz et al., 2017), Keras (Chollet et al., 2015), and PyTorch (Paszke et al., 2019) in Python, but we focus here on the text package in R, which streamlines these analyses in a user-friendly way tailored for social and behavioral scientists. We aim to increase the sharing of LBAMs by introducing two resources. First, we describe how the textAssess function can automatically download models, preprocess language data, and apply models for assessment, prediction, or classification. Second, we introduce the L-BAM Library, where researchers can discover existing models and describe their own models with instructions for how they can be downloaded, used, and cited. We encourage researchers to contribute new models to the library, promoting collaboration and advancing open science in language-based analysis. We predominantly cover models that predict outcomes from language. For researchers interested in interpretability and theory-driven exploration based on language analysis, we have developed a complementary tutorial describing multiple methods for visualizing human language (Eijsbroek et al., 2026, under review), which introduces methods such as keyword extraction, topic modeling, and AI-based visualizations that can be used alongside LBAs to provide further psychological insights. Before diving into the tutorial, we want to emphasize some caveats about generalization.
Essential caveat about generalization
LBAs can be developed in one setting (e.g., social media) and applied in another (e.g., clinical interview). Often, however, they do not generalize across all contexts: A model’s generalizability depends on several factors, such as the setting, population, distribution of target psychological measure, and language domain (i.e., the similarity between the training-data language and the target language being assessed). Therefore, users must take responsibility for evaluating the appropriateness of each model for their specific context. This includes assessing whether the model’s training and evaluation contexts and language distributions are sufficiently similar to their target data (for more information, see the Supplemental Material available online) and if needed, validating the model’s performance on a subset of their own data before drawing any substantive conclusions. This is why it is essential to carefully describe each model—its training data, performance metrics, and so on—as outlined in the section The L-BAM Library: Reproducibility, Replication, and Generalizability. Comprehensive documentation helps users evaluate whether a model is appropriate for their specific context and supports transparent, reproducible science. We discuss this further in the Responsible Applications and Generalizability section.
Theoretical overview of the L-BAM Library phases
There are three core phases involved in training LBAMs and applying models from the L-BAM Library. First, the language is converted into numerical representations (i.e., word embeddings) using a large language model (Fig. 1a); second, these word embeddings are used to train a model to assess or predict a criterion variable (Fig. 1b); and third, these models can be applied to new data for assessment or classification (Fig. 1c). How to transform text to word embeddings and train LBAMs using the text package has been described in detail before (see Kjell, Giorgi, & Schwartz, 2023). Below, we provide an overview of the first two phases (Figs. 1a and 1b) before describing the application and use of the L-BAM Library (Fig. 1c).

Overview of the Language-Based Assessment Model (L-BAM) Library components. (a) Convert language to word embeddings. (b) Develop and share language-based assessment models. (c) Apply language-based assessment models to new data.
Word embeddings
Language can be represented numerically through word embeddings, which are lists of values that capture the latent meaning of words in a structured format. Essentially, the word-embedding process transforms language into numerical representations, making it possible to analyze linguistic patterns computationally. This transformation is powered by large language models, such as GPT-4 or BERT, which are trained on vast amounts of text data from the internet, books, and other sources to develop a generalizable representation of language. During training, the large language models’ main task is to predict words from their context (e.g., the next word given the preceding words, as in GPT models, or a masked word given the surrounding words, as in BERT). Through this vast training procedure, the models create a multidimensional semantic space in which words, and even entire texts, can be positioned based on their contextualized meaning and usage; each model represents language in several slightly different versions (i.e., layers; for more details, see Devlin et al., 2019; Vaswani et al., 2017). Text can be transformed into word embeddings using the textEmbed function from the text package, as described in detail by Kjell et al. (2023). Thus, this function transforms raw language data into meaningful numbers representing the language.
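To make the idea of a semantic space concrete, the following is a minimal base-R sketch, not output from the text package. The three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions and depend on the context in which a word appears), and the helper names `cosine_similarity` and `embed_text` are hypothetical.

```r
# Toy 3-dimensional "word embeddings" (invented values, for illustration only).
word_vectors <- rbind(
  happy   = c(0.9, 0.1, 0.3),
  joyful  = c(0.8, 0.2, 0.4),
  anxious = c(-0.7, 0.6, 0.1)
)

# Cosine similarity: values near 1 mean two vectors point in a similar direction.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# A simple text-level embedding: the mean of the word vectors in the text.
embed_text <- function(words, vectors) {
  colMeans(vectors[words, , drop = FALSE])
}

sim_close <- cosine_similarity(word_vectors["happy", ], word_vectors["joyful", ])
sim_far   <- cosine_similarity(word_vectors["happy", ], word_vectors["anxious", ])
text_embedding <- embed_text(c("happy", "joyful"), word_vectors)
```

In this toy space, words with related meanings lie closer together than words with unrelated meanings, and a whole text receives a position of its own. Averaging is only one possible way to aggregate word-level vectors into a text-level vector.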
Training LBAMs
With word embeddings, it is possible to create linear regression models to predict psychological constructs and relevant outcomes. Normally in psychology, researchers build multiple linear regression models with predictor variables, such as the Big Five personality traits, age, and gender, to predict outcomes of interest, such as mental health, as the criterion variable. The models we introduce here are similar. The criterion variable works exactly the same. But instead of personality traits and demographics as the predictor variables, word embeddings are the predictor variables. Compared with personality traits and demographics as predictor variables (seven predictors), word embeddings commonly consist of hundreds or even thousands of dimensions (referred to here as “predictor variables”). An observant reader will notice that such a model is likely to suffer from multicollinearity (i.e., highly correlated predictor variables), which standard multiple regression does not handle well. To deal with this, models in the L-BAM Library use slightly more advanced forms of multiple linear regression (e.g., ridge regression) that reduce the impact of irrelevant predictors through a penalty (the penalty, represented by a number, pushes the regression coefficients of the predictors toward 0, and the higher the penalty is, the more strongly it does so). Furthermore, in a standard multiple regression, the predictors are fit to the criterion in one single model without testing whether this fit works on new data. The models here are instead first fitted to the criterion, using various penalties, on a portion of the data. They are then tested on the remaining portion of the data for their predictive accuracy, and the degree of penalty is also evaluated. This process, known as “cross-validation,” is essential for estimating how well models generalize to new data. The most common way to develop LBAMs is by using ridge regression via the
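The logic of penalized regression with cross-validation can be sketched in base R as follows. This is an illustrative sketch on simulated data, not the training routine of the text package: the helper names `fit_ridge` and `predict_ridge`, the candidate penalties, and the use of five folds are all assumptions, and the predictors are only centered rather than standardized.

```r
set.seed(1)

# Simulated data: 60 "texts", each with 20 embedding dimensions (assumed values).
n <- 60
p <- 20
x <- matrix(rnorm(n * p), nrow = n)
y <- as.vector(x[, 1:3] %*% c(1, -0.5, 0.5) + rnorm(n))

# Ridge estimate on centered data: (X'X + penalty * I)^-1 X'y.
fit_ridge <- function(x, y, penalty) {
  x_means <- colMeans(x)
  y_mean <- mean(y)
  xc <- sweep(x, 2, x_means)
  beta <- solve(crossprod(xc) + penalty * diag(ncol(x)), crossprod(xc, y - y_mean))
  list(beta = beta, x_means = x_means, y_mean = y_mean)
}

predict_ridge <- function(fit, x_new) {
  as.vector(sweep(x_new, 2, fit$x_means) %*% fit$beta + fit$y_mean)
}

# 5-fold cross-validation: fit on four folds, predict the held-out fold.
folds <- sample(rep(1:5, length.out = n))
penalties <- c(0.1, 1, 10, 100)
cv_error <- sapply(penalties, function(penalty) {
  predicted <- numeric(n)
  for (k in 1:5) {
    test <- folds == k
    fit <- fit_ridge(x[!test, , drop = FALSE], y[!test], penalty)
    predicted[test] <- predict_ridge(fit, x[test, , drop = FALSE])
  }
  mean((y - predicted)^2)  # mean squared error on held-out predictions
})

best_penalty <- penalties[which.min(cv_error)]
```

Larger penalties shrink the coefficients more strongly, and the penalty with the lowest held-out error is retained. Because the same folds are used here both to choose the penalty and to estimate accuracy, the resulting error estimate is optimistic; nested cross-validation separates these two steps.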
Applying models from the L-BAM Library
Finally, the LBAMs can be applied on new data, a fully automated process that takes the user’s new text data as input and provides a predicted score as outcome. This step is achieved with the
The textAssess() function
In the following, we describe how L-BAM Library users can quickly implement LBAMs on their own data in R using the text package. The function for doing this is called
For an example of how to download and apply an LBAM using the
Example on Depression Severity
Install the text Package
Code Box 2 shows how to set up the text package after installation, which is necessary the first time using the package. Because Python (another coding language) is used at the forefront of most large-language-model development and deployment, the text package relies on Python-based tools to access cutting-edge functionality. Under the hood, text sets up a dedicated Python environment using Miniconda (a lightweight collection of prebuilt tools for managing Python environments and packages). It then automatically installs key libraries, such as Hugging Face Transformers and PyTorch. This setup enables R users to seamlessly access powerful language models without needing to manually install or configure all the Python dependencies.
Some Python libraries require system-level dependencies that vary across operating systems and platforms. The text package automatically checks for these dependencies and if any are missing, provides instructions on how to install them. In some cases, this may require you to download and install tools using the Terminal. More information about platform-specific requirements and troubleshooting is available at https://r-text.org/articles/ext_install_guide.html. To ensure broad compatibility, the installation process and most of the package functionality are automatically and continuously tested on GitHub Actions across macOS, Windows, and Ubuntu systems.
For users who prefer not to install anything locally, we offer the ability to run the tutorial directly in Google Colab, requiring no setup on your own machine. When using this option, make sure that uploading your data to a third-party service is compatible with the privacy requirements that apply to them.
Code Box 3 shows an example of how to download an LBAM to assess valence and apply it on satisfaction-with-life descriptions using the
Example on Valence and Well-Being
Code Box 4 shows another example of how to download LBAMs for assessing implicit motives and apply them to harmony-in-life descriptions using the
Example of Implicit Motives
Code Box 5 shows how the L-BAM Library can be examined in R. It is possible to import the library as a data frame using the
Examine Language-Based Assessment Model (L-BAM) Library
Input features: language, word embeddings, and other variables
The textAssess function requires predictors in the form of language features and/or other variables (see Table 2). Most models can take either raw language (
Main Function Parameters and Arguments for the textAssess() Function
The model and word embeddings will automatically be saved in the working directory when using a model trained with the text package. The function first checks if the working directory already has computed word embeddings for a given text; if not, the function retrieves them from the specified large language model (using the
Furthermore, some models have been trained using additional predictors other than language, that is, more than word embeddings, such as gender and age. In those cases, these variables are appended as a data frame using the x_append parameter. To know whether additional predictors are needed, you can access the model object to see if and what variables are needed for x_append (see information under “x_append” in the
Fine-tuned models
The textAssess function by default uses a model object trained in R with the text package (called “text-trained”) but can also use fine-tuned models. A text-trained model, which most of the models in the L-BAM Library are, is typically based on a predictive model algorithm (e.g., ridge regression) that has been trained on word embeddings to predict an outcome using a text-train function of the text package (e.g.,
The L-BAM Library: reproducibility, replication, and generalizability
To make model sharing more straightforward and accessible, we introduce the L-BAM Library. The L-BAM Library is a searchable online database (https://r-text.org/articles/LBAM.html) in which users can search for models and filter according to different model characteristics, such as the type of construct, the model’s predictive accuracy, or the type of language.
The L-BAM Library aims to comprise the most relevant information for model sharing, balancing thoroughness with practicality. We expect that most well-validated models will be accompanied by additional documentation, such as a peer-reviewed article, a model card describing the model in depth (Mitchell et al., 2019), and/or a data sheet clearly describing the training data set (Gebru et al., 2021). We also encourage reporting guidelines, such as the TRIPOD(+AI) statement for transparent reporting of research that develops, validates, or extends (updates) prediction models (Collins et al., 2024) and the LEADING statement for comprehensive reporting of how best-estimate assessments are achieved for training/evaluating prediction models that assess (psychiatric or medical) conditions lacking a more objective truth or “gold standard” (Eijsbroek et al., 2025).
Using and contributing to the L-BAM Library
Next, we outline the key information types of the L-BAM Library, offering a standardized format for describing the models, including outlining aspects of the outcome and training data, model performance and ethical considerations, and metadata and access (Table 3). These components are relevant for users to understand the models they use and for contributors who should report these when adding models to the L-BAM Library. Note that they are described briefly here and in more detail in the L-BAM Library (see https://docs.google.com/spreadsheets/d/14PcfTwQJZCKbSh6ylOq1Qm1VT44X4RD0aR4-dt6bink/edit?gid=194707973#gid=194707973) and in Tables S1 through S3 in the Supplemental Material.
Overview of the Language-Based Assessment Model Library
The content of the categories is detailed in Tables S1 through S3 in the Supplemental Material available online, including examples of them.
The outcome section describes what the model predicts, assesses, or classifies, such as the psychological construct or behavior (e.g., depression through PHQ-9), and gives details about the specific outcome the model was trained to predict and the type of language used to train the model (see Table S1 in the Supplemental Material).
The training-data section details the data set used to train the model, including the number of observations, where the data were obtained (e.g., online, clinic), participants (e.g., demographics), and the type of labels used in training (e.g., self-reported). It also describes whether the model includes information about the language distribution (a word-frequency table) used in training to assess language similarity with new data (see Table S1 in the Supplemental Material).
The model section focuses on the technical aspects of the model, including the type of model used for prediction (e.g., ridge regression) and the features used for prediction, such as word embeddings and/or demographic information (see Table S2 in the Supplemental Material).
The model-performance section presents the key performance metrics of the model. Contributors can report one primary metric (e.g., Pearson r or area under the curve), by which the library can be filtered, and then additional relevant validation metrics (e.g., mean absolute error, sensitivity). Contributors can also report accuracy from several evaluation frameworks, including (nested) cross-validation, held-out accuracy, and Sequential Evaluation with Model Pre-registration (SEMP; Kjell, Ganesan, et al., 2024; see Table S2 in the Supplemental Material). SEMP aims to address concerns that predictive models often underperform in independent or prospective samples (e.g., Chekroud et al., 2024; Kernbach & Staartjes, 2022; Spasic & Nenadic, 2020; also see Essential Caveat About Generalization section). It essentially involves preregistering LBAMs and expected outcomes before applying them to held-out evaluation data. If the results replicate with similar effect sizes, this adds strong evidence for the model’s robustness and generalizability. Note that not all models will include these estimates because they depend on how the model was developed and evaluated.
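As an illustration of the metrics named above, the following base-R sketch computes a primary metric (Pearson r) and two additional metrics (mean absolute error and sensitivity) from held-out predictions. The scores are invented, and the cutoff of 10 used to define a "case" is a hypothetical choice for this example only.

```r
# Hypothetical observed scores and model predictions on held-out data.
observed  <- c(4, 9, 12, 7, 15, 3, 10, 8)
predicted <- c(5, 8, 13, 6, 13, 4, 11, 9)

# Primary metric: correlation between observed and predicted scores.
pearson_r <- cor(observed, predicted)

# Additional metric: average absolute distance between prediction and observation.
mean_absolute_error <- mean(abs(observed - predicted))

# Sensitivity at an assumed cutoff: share of true cases the model also flags.
cutoff <- 10
true_cases <- observed >= cutoff
sensitivity <- sum(predicted >= cutoff & true_cases) / sum(true_cases)
```

Reporting several metrics side by side is useful because they answer different questions: the correlation reflects rank-order agreement, the mean absolute error reflects how far predictions are from observed scores in the scale's own units, and sensitivity reflects classification performance at a given threshold.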
The ethical-considerations section includes the ethical-approval application ID associated with the model’s development (if applicable) and outlines ethical considerations or concerns addressed during the development and testing phases and those to consider in future applications (see Table S2 in the Supplemental Material).
The model-metadata-and-access section describes the study type (e.g., development or replication), citation details, licensing restrictions, and where the model can be accessed (see Table S3 in the Supplemental Material). If specific commands for using the models are applicable, they should also be mentioned here. We once again stress that all the information of what to add in the documentation is described in the L-BAM Library at https://r-text.org/articles/LBAM.html. Furthermore, we have uploaded a template with the headings of the library so that researchers can fill everything in on their local computer before adding all information to the library itself.
The L-BAM Library versus other repositories
The L-BAM Library focuses on social sciences and in particular, psychology, offering a standardized and accessible collection of models that are fully compatible with R and can be easily applied using
Ethical considerations, responsibility, and AI safety
The scores from LBAs can be used for further statistical analyses, such as standard hypothesis testing or predictive modeling. However, using LBAs comes with several ethical considerations.
Privacy
Language data are typically highly informative and personal, making them hard to anonymize, and researchers must consider ethical challenges and privacy issues in all steps of data collection, storage, and analyses of natural language (see e.g., Leidner & Plachouras, 2017). The L-BAM Library includes models that can be downloaded and run locally in the user’s own environment, allowing the user to avoid sharing sensitive information with a third party (e.g., ChatGPT). However, when uploading models, it is important to remember that certain types, such as fine-tuned large language models and text-trained models with language-distribution data, may contain sensitive information; hence, privacy concerns must be considered before sharing models. A text-trained model without a saved language distribution contains none of the language data on which it was trained.
Responsible applications and generalizability
The L-BAM Library does not involve peer review of models or provide any warranty for the models it includes. Research has shown that LBAs, as a class of techniques, have validity and reliability comparable with or exceeding those of traditional rating scales (Kjell, Kjell, & Schwartz, 2024). However, each specific instance of such an assessment must undergo rigorous evaluation for validity and reliability for target populations and use contexts before being trusted (just as any rating scale would). Although information about the models and how to access them is provided in the L-BAM Library, it is essential for users to independently assess the accuracy, suitability, validity, and reliability of each model for their specific research needs (for details about responsible sharing and usage of LBAMs, see Box 1).
Box 1. Responsible Sharing and Use of Language-Based Assessment Models (LBAMs)
Importantly, evaluating the suitability of a model includes explicit evaluation of generalizability. Our proposed “gold standard” for testing generalizability is to collect, in a subset of the new data, the same kind of assessments that the model was trained on. For example, suppose an LBAM has been trained to assess depression ratings from clinical interviews and a researcher wants to assess depression severity from social media language. In that case, the researcher should make sure there are enough participants with both social media language and depression-severity scores to test the model’s generalizability. The required sample size should be determined by the desired precision in estimating the model’s accuracy in the new sample, with attention to the width of its confidence interval (e.g., the confidence interval around a correlation coefficient). We propose this procedure as the “gold standard” because it tests the model on a subset of the data to which it will subsequently be applied.
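The sample-size reasoning above can be made concrete with a short calculation. The following is a minimal Python sketch, not a procedure taken from the L-BAM Library or the text package: it assumes that model accuracy is summarized as a Pearson correlation between LBA scores and the criterion scores, and it uses the standard Fisher z approximation for the confidence interval. The function names `correlation_ci` and `smallest_n_for_width`, the example correlation of .50, and the target width of .20 are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z transformation, which assumes roughly bivariate
    normal data; treat the result as an approximation.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                      # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the interval limits to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def smallest_n_for_width(r, max_width, level=0.95, n_max=100_000):
    """Smallest n whose interval around r is no wider than max_width.

    Returns None if no n up to n_max is large enough.
    """
    for n in range(4, n_max + 1):
        lower, upper = correlation_ci(r, n, level)
        if upper - lower <= max_width:
            return n
    return None


# Hypothetical planning values: expected accuracy r = .50 in the new
# sample and a desired 95% interval no wider than .20.
expected_r = 0.50
n_needed = smallest_n_for_width(expected_r, max_width=0.20)
lower, upper = correlation_ci(expected_r, n_needed)
print(f"n = {n_needed}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Because the interval narrows only with the square root of the sample size, halving its width requires roughly four times as many participants with paired language and criterion data, which is worth considering when planning the validation subset.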
When such paired data are unavailable, researchers may instead explore differences or similarities in language distributions between the training and target data sets. In the Supplemental Material, we show that one such approach—calculating target recall between training and test data—correlates meaningfully with generalizability performance (rs = .38–.39; n = 68 tests; see Table Box 1 in the Supplemental Material). This suggests target recall may offer a useful proxy for estimating generalizability, although further research is needed to refine and validate such distributional metrics.
Ultimately, the long-term goal is to accumulate enough well-documented model evaluations across diverse settings to enable meta-analyses and potentially predictive benchmarks of generalizability in new language contexts. This requires community-wide participation and is a key direction for future development of the LBAM ecosystem.
Ethical principles
Finally, there are several ethical principles (Jobin et al., 2019; Peters et al., 2020), regulations, and legal frameworks (European Commission, 2023; Hauglid & Mahler, 2023; U.S. Food & Drug Administration, 2021; Veale & Zuiderveen Borgesius, 2021; White House Office of Science and Technology Policy, 2022) concerning the development and use of AI and large language models. A review of more than 80 international guidelines identified five key ethical principles: transparency, justice and fairness, nonmaleficence, responsibility, and privacy (Jobin et al., 2019). We encourage users to explore, apply, and stay current with these resources.
Summary
In this tutorial, we presented the
We encourage researchers to expand the library with models predicting both psychological constructs and other social-science outcomes. We hope the library will help researchers to (a) use previously developed models, (b) upload information about their models, and (c) report relevant research that further validates existing models. By encouraging independent validation and transparency, the L-BAM Library can hopefully help strengthen research rigor on LBAs and advance the field.
Footnotes
Acknowledgements
A preprint of this manuscript is available on PsyArXiv. Open code and data available at https://r-text.org/, https://cloud.r-project.org/web/packages/text/index.html, https://github.com/OscarKjell/text/, and
.
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra
