Abstract
Artificial intelligence (AI) shows promise in identifying psychopathology through language, but replicability in AI models remains challenging. We develop an AI-based language assessment of posttraumatic-stress-disorder (PTSD) severity and introduce sequential evaluation with model preregistration to rigorously evaluate its validity and replicability. This design includes two phases: development with preregistration, followed by evaluation. Data included development (N = 1,437) and prospective (N = 346) samples, in which participants described their lives during automated interviews. In the prospective sample, preregistered models correlated with PTSD CheckList scores (r = .38, p < .001) and converged with PTSD diagnosis (area under the curve [AUC] = .76; outperforming demographics and trauma exposures: AUC = .61, p < .01). For each standard-deviation increase in language-assessed PTSD severity, mental-health-care expenditure rose by $696.50 (p < .001). Our preregistered PTSD model assessments replicated in prospectively collected clinical data and showed external validity against expense criteria. With further development, such models can be used to screen for PTSD or monitor treatment response, especially in telehealth or automated interviews, in which deployment can be seamless.
Psychological trauma is common. Seventy percent of the general population experience trauma at some point in life (Kessler et al., 2017). Only a minority of people exposed to trauma develop posttraumatic stress disorder (PTSD; Galatzer-Levy et al., 2018; Kessler et al., 2017). This subgroup needs mental-health services, but their needs are often unrecognized (Goldstein et al., 2016; Lewis et al., 2019; Wang et al., 2005). Challenges to the detection of PTSD are twofold. First, the “gold-standard” assessment is a diagnostic interview, but these interviews require extensive training, and most medical clinics lack such personnel (Bruchmüller et al., 2011; Ventura et al., 1998). Second, self-report screeners are more scalable but prone to biases, such as social desirability, acquiescence, limited insight, recall biases, and others (Neal et al., 2022; Schuler et al., 2021). Moreover, the lack of objective measures of PTSD symptoms limits the ability of clinicians to monitor treatment progress and identify treatment nonresponders, which presents significant challenges to care. For instance, responders and survivors of the World Trade Center (WTC) attacks are entitled to free mental-health treatment, but 17% have clinically significant PTSD symptoms even 2 decades after the disaster, and rates of PTSD have not decreased (Lowell et al., 2018; Waszczuk et al., 2022).
Recent advances in language analyses based on artificial intelligence have demonstrated promise in addressing this gap and delivering behavioral assessments of mental health (Boyd & Schwartz, 2021; Kjell et al., 2024). However, existing language-based PTSD models have typically been limited to (a) text-based sentiment models (Sawalha et al., 2022) or unigram models (He et al., 2017) rather than large language models to detect PTSD, (b) models being trained to assess other conditions (e.g., AI models for detecting depression, anxiety, and sentiment evaluated against PTSD-symptom severity; Oltmanns et al., 2021; Sawalha et al., 2022; Son et al., 2021), or (c) models trained on nonclinical proxies for PTSD (e.g., public self-disclosures of PTSD; Coppersmith, Dredze, & Harman, 2014; Coppersmith, Harman, & Dredze, 2014; Preotiuc-Pietro et al., 2015; Todorov et al., 2020). Moreover, many of these studies used social media language (e.g., Coppersmith, Harman, & Dredze, 2014; Preotiuc-Pietro et al., 2015; Todorov et al., 2020) rather than data collected in clinical settings, and they are rarely evaluated against external clinical outcomes, such as clinician-assigned PTSD diagnoses or mental-health-care expenditure. The most significant limitation is that none of the existing models were preregistered and tested in a new, prospectively collected sample. Hence, their replicability is unknown, which is especially concerning given recent evidence that AI models can fail to perform when evaluated prospectively (Chekroud et al., 2024; Kernbach & Staartjes, 2022; Spasic & Nenadic, 2020).
Solutions to the “replication crisis” have been developed in many fields (Nosek et al., 2015), but some solutions do not directly translate to AI-based approaches. Whereas standard preregistration practices are suitable for many research goals, they fall short of meeting the needs for developing and evaluating robust models. For example, some preregistration practices require specifying exact data-processing steps, but larger and more complex data sets (e.g., open-ended language) needed for AI-model training often require unexpected bug fixes, several preprocessing steps, and hyper-parameter tuning that come to light only during model-development stages (Maharana et al., 2022; Tabassum & Patil, 2020). Furthermore, it is best practice in AI to evaluate models over held-out samples to control for overfit among these complex models (i.e., cross-validation; Hastie et al., 2001) rather than relying on hypothesis testing—the analytic approach assumed by standard preregistration. In short, AI-model development often involves an iterative refinement process that does not fit well within the standard, a priori preregistration paradigm developed for traditional hypothesis testing.
To address these limitations, we propose a new evaluation paradigm, sequential evaluation with model preregistration (SEMP), that aims to combine good scientific practices (i.e., preregistration) with robust AI-model-development practices (i.e., bug fixes, hyper-parameter tuning, and out-of-sample test). SEMP calls for first developing the model and then registering it (i.e., declaring the model code and weights). We first registered the steps for training the models and then registered the models before testing them on new prospective data (see Fig. 1). By registering the models before testing, the approach mitigates the overestimation of accuracies for clinical prediction models by (a) preventing overfitting of hyper-parameters (whether purposeful or accidental); (b) mitigating the risk of test-data leaks—when data from the test set inadvertently influence decisions in the training process, thereby yielding overestimated accuracy for new, unseen data; and (c) when using a prospective test set, more realistically testing the model application to a sample later in time, as would happen in practice.

Fig. 1. Overview of the sequential evaluation with model preregistration (SEMP). The stages include data split (i.e., prospective data), preregistering the prediction-model development (e.g., ridge regression), n-fold cross-validation (e.g., 10 folds), hypotheses (H1–H4; in which hypotheses can change across stages, as visualized with the different colors), registered prediction models (i.e., the exact models to validate), preprocessing (e.g., removal of parts in which research assistants speak), and held-out prospective set evaluation (e.g., control variables).
SEMP integrates established reproducibility conventions—preregistration in psychology (e.g., Nosek et al., 2018), prospective evaluation in clinical prediction (e.g., see Collins, Dhiman, et al., 2024), and locked-algorithms and hidden-test-set frameworks common in machine-learning competitions (shared task; e.g., see Coppersmith et al., 2015)—into a unified, practical workflow tailored to clinical research and models. Specifically, we preregistered the data-cleaning pipeline, feature construction, final model weights, and analysis scripts before accessing a later-in-time test set, thereby minimizing leakage and overfitting and mimicking real clinical settings. This design preserves the freedom of iterative model development before preregistration while enforcing a fully frozen, prospective evaluation phase. Finally, SEMP is aligned with clinical-AI reporting guidance and emphasis on transparency in prediction-modeling research (Collins, Moons, et al., 2024).
In this study, we aimed to advance the application of AI in PTSD assessment through three key objectives. First, we aimed to develop AI models that assess PTSD-symptom severity, including overall severity and the four symptom clusters, from natural spoken language. These models were designed for use in clinical settings with populations exposed to documented trauma. Second, we rigorously evaluated the replicability of these models using the SEMP framework. Third, we validated these models against practitioner-relevant criteria, including PTSD diagnoses from electronic health records and the use of mental-health services. The sample comprises 1,783 WTC responders who completed a health-monitoring visit at the WTC Health Program (a setting akin to primary care) in 2021–2022.
Transparency and Openness
The two preregistration reports and the preregistered language-based assessment models are available on OSF (https://osf.io/ebgp3/?view_only=7f13f644dc594102a1e995729bd44ac3). The open materials are presented in the Supplemental Material available online. We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study. The Institutional Review Board at Stony Brook University approved the study (IRB No. 604113).
Materials and Method
Design: SEMP
SEMP is a two-phase procedure to preregister models before using held-out test data. The development phase comprises developing the models; the evaluation phase involves a preregistration locking in the exact models and preprocessing (data cleaning) code. The two-phase procedure enables iteratively developing the preprocessing and modeling choices (i.e., “hyperparameter optimization”) without using the test data such that final accuracies will be established more robustly on held-out, optionally prospective data. Instead of testing whether models produce assessment accuracies better than chance, the development phase enables one to register specific effect-size intervals for the evaluation phase.
In the SEMP procedure, we first developed the AI-based PTSD-symptoms models using automatically transcribed language from video-recorded automated clinical interviews over a development “training-set” sample. This was done for preregistration Phase 1 (the development phase) to produce and preregister pretrained models. In the second phase (the evaluation phase), we applied the preregistered models to a “prospective held-out test” sample consisting of new participants. The base hypothesis for the evaluation phase is that the models trained during the development phase will continue to predict their intended outcomes on the unseen data.
To strengthen hypotheses, we included the following expected correlational-accuracy ranges: The registered language-based assessments of PTSD produce scores that (a) are positively associated with PTSD-symptom severity; (b) achieve a correlation r ≥ .35 for overall symptom severity and r ≥ .18 for the four individual-symptom cluster dimensions (the preregistered correlations are based on training-set cross-validated r = .41, 99% confidence interval [CI] = [.35, .46], N = 1,437, for the combined-symptom severity and r = .25, 99% CI = [.18, .31] for the individual-symptom dimensions, N = 1,422); and (c) are predictive above and beyond the pretrained models using baseline demographics (age, gender, occupation, and race). In our case, we also had hypotheses beyond testing an AI model (i.e., those based on preexisting language-based models) that also accompanied the SEMP process (depicted at the bottom of Fig. 1 as “sequential hypothesis preregistration”; for these hypotheses, see Section 3 in the Supplemental Material).
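The preregistered lower bounds (r ≥ .35 and r ≥ .18) correspond to the lower limits of the 99% CIs reported above. The following sketch shows one way such intervals can be obtained, assuming a Fisher z-transformation interval; the source does not state how the CIs were computed, and the function names are hypothetical. Under that assumption, the inputs r = .41 (N = 1,437) and r = .25 (N = 1,422) give limits that round to the reported values.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, level: float = 0.99) -> tuple:
    """Confidence interval for a Pearson r via the Fisher z-transformation.

    Assumption: the interval method is illustrative; the source does not
    say which method produced its reported CIs.
    """
    z = math.atanh(r)                                # map r onto the z scale
    se = 1.0 / math.sqrt(n - 3)                      # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # two-sided critical value
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def meets_preregistered_floor(r_observed: float, floor: float) -> bool:
    """True if a prospective correlation reaches the preregistered lower bound."""
    return r_observed >= floor


# Cross-validated development-set correlations reported in the text.
total_ci = fisher_ci(0.41, 1437)     # overall symptom severity
cluster_ci = fisher_ci(0.25, 1422)   # individual symptom clusters

# Prospective total-score correlation (r = .38) against the r >= .35 floor.
total_replicates = meets_preregistered_floor(0.38, 0.35)
```

Fixing the floor at the lower CI limit before the prospective data are accessed is what turns the evaluation into a test of a specific effect-size range rather than a test against chance.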
Participants
Participants were recruited from the Stony Brook WTC Health and Wellness Program, where their health has been monitored over several years. Participant data were split into two temporally nonoverlapping parts: the “development” data (September 9, 2021–July 29, 2022) and the held-out prospective “evaluation” data (August 1, 2022–September 30, 2022). The development data totaled 1,437 participants (female = 7%, male = 93%; age: M = 57.9 years, SD = 8.0; 14.5% with reported PTSD diagnosis in their medical record). The prospective data include an additional 346 participants (female = 9%, male = 91%; age: M = 58.5 years, SD = 7.8; 15.6% with reported PTSD diagnosis in their medical record) enrolled after the participants in the development data set (see Section 1 in the Supplemental Material).
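The date-based split described above can be sketched as follows. This is a minimal illustration that uses the date windows given in the text; the function name and the idea of keying the split on a single visit date per participant are assumptions, not details taken from the study's pipeline.

```python
from datetime import date
from typing import Optional

# Date windows as stated in the text.
DEV_START, DEV_END = date(2021, 9, 9), date(2022, 7, 29)
EVAL_START, EVAL_END = date(2022, 8, 1), date(2022, 9, 30)


def assign_split(visit: date) -> Optional[str]:
    """Assign a visit to the development or prospective evaluation set.

    Hypothetical helper: returns None for visits outside both windows.
    """
    if DEV_START <= visit <= DEV_END:
        return "development"
    if EVAL_START <= visit <= EVAL_END:
        return "evaluation"
    return None


# The windows do not overlap, so no visit can fall into both sets.
windows_are_disjoint = DEV_END < EVAL_START
```

Because every evaluation visit occurs after the last development visit, the held-out set tests the models on a sample later in time, as would happen in practice.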
Measures and materials
Automated clinical interviews: video-recorded answers about life
Participants were recorded while answering questions automatically shown on a screen in a private room during a clinical visit (i.e., an automated clinical interview). Questions probed respondents to describe positive (e.g., “What are the three things in your life that you look forward to the most right now?”) and negative aspects of their (past, present, and future) life in general (e.g., nicest and worst things, challenges, support network) and about serious events (e.g., COVID-19 and 9/11; e.g., “How does 9/11 affect you now?”; for all questions, see Section 2 in the Supplemental Material). To maximize the generalizability of content, the questions were aimed at being broad and using layman’s terms (rather than, e.g., asking about specific clinical symptoms). The questions were presented on a screen with instructions not to read the questions out loud and to try to spend at least 60 s answering each question. The questions were the same for everyone in the evaluation tests, although the questions were updated and changed over three iterations of the development phase to increase engagement and elicit more detailed answers. The recording for individuals responding with at least 150 words (the preregistered threshold) took, on average, 7.5 min (SD = 4.1; range = 1.1–43.0).
The PTSD CheckList
The PTSD CheckList (PCL; Blanchard et al., 1996) comprises 17 items assessing PTSD-symptom severity based on the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (American Psychiatric Association, 1994). Respondents are asked to rate symptoms over the past month “in relation to 9/11” using a severity scale from 1 (not at all) to 5 (extremely). We computed the total mean score and the four subscales (King et al., 1998; Ruggero et al., 2013), including Re-Experiencing (e.g., intrusive thoughts of trauma), Avoidance (e.g., avoiding thoughts of trauma), Emotional Numbing (e.g., inability to recall aspects of trauma), and Hyperarousal (e.g., sleep disturbance). Cronbach’s alphas were acceptable for all scales in both data sets (≥ .70; see Section 1 in the Supplemental Material).
PTSD diagnosis in medical record
Diagnoses in the medical records are certifications that the participant has been diagnosed with a WTC-related condition (i.e., a WTC-related PTSD diagnosis) at any point since September 11, 2001. The psychiatrists at the Stony Brook WTC Health and Wellness Program made diagnoses based on clinical history in the medical records and the semistructured Diagnostic Interview Schedule (see Dasaro et al., 2017).
Mental-health-care-service usage
Mental-health-care-service usage was operationalized as the total cost of services received by a given patient over the past 12 months, extracted from electronic health records of the WTC program.
WTC exposure
WTC exposure was assessed using a clinical interview at the initial monitoring visit (Dasaro et al., 2017). We use 10 dichotomous (yes/no) WTC-exposure variables that were associated with increased risk of PTSD and other health outcomes in prior work (Bromet et al., 2016; Pietrzak et al., 2014; Zvolensky, Farris et al., 2015; Zvolensky, Kotov et al., 2015).
Demographics
Self-reported age, gender, occupation, and race were collated from the monitoring data of the Stony Brook WTC Health and Wellness Program.
Procedure
The video recordings were collected in a clinical setting at the Stony Brook WTC Health and Wellness Program. All participants consented to participate and were informed about the study and their rights to withdraw at any time. A research assistant instructed the participants on how to conduct the automated interview. Last, the participants were debriefed.
Statistical analysis
The analyses were conducted using the Differential Language Analysis Toolkit (DLATK; Version 26; Schwartz et al., 2017). Alpha was set at .05, with Benjamini-Hochberg adjustment to control for false-discovery (Type 1 error) rates. For analyses specific to preprocessing, the development of models, and Preregistration 1, see Section 1 in the Supplemental Material.
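For readers unfamiliar with the Benjamini-Hochberg adjustment, the step-up procedure can be sketched as follows. This is a generic illustration of the standard procedure, not DLATK's implementation; the function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one True/False rejection decision per p value (BH step-up).

    Generic textbook procedure; not the toolkit's own code.
    """
    m = len(p_values)
    # Indices of the p values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p values for illustration only.
example = benjamini_hochberg([0.04, 0.045, 0.01])
```

In the example, the middle-ranked value (.04) exceeds its own threshold (2/3 × .05) but is still rejected because the largest value (.045) falls below its threshold (.05); this "step-up" behavior distinguishes the procedure from a simple per-test cutoff.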
Linguistic-feature extractions
Two types of linguistic features were extracted for the mental-health assessment: (a) word embeddings (i.e., numeric representations of words that capture their meaning based on their context) from a large language model (RoBERTa-large, Layer 23; Liu et al., 2019) and (b) topics’ (N = 300) prevalence scores based on the topic model created in the development data set using latent Dirichlet allocation (LDA; Blei et al., 2003).
Machine learning
Models were developed using cross-validation on the development set. Following the procedure outlined in Preregistration 1, we trained the models using all development data with L2-penalized (ridge) linear regression. L2 regularization is a method that shrinks regression coefficients by applying a penalty to the maximum likelihood parameter estimates, reducing overfitting. Using 10-fold cross-validation, we partitioned the development data into 10 similar-sized subsets (folds). For each fold, the model was trained on nine folds and tested on the remaining fold, ensuring that each subset served as the test set once. This process was repeated across penalties ranging from 10¹ to 10⁶ to identify the penalty minimizing prediction error. The penalty that yielded the best performance was then used to develop the final models. We created models for the PCL total score and its four subscales based on three input sets: (a) language only, (b) demographic controls only, and (c) a combination of language and demographic controls.
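The penalty-selection procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the DLATK pipeline used in the study: the data are synthetic, the function names are hypothetical, and the grid of integer powers of 10 is an assumption (the text gives only the range of penalties).

```python
import numpy as np


def fit_ridge(X, y, penalty):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + penalty * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def cv_mse(X, y, penalty, n_folds=10, seed=0):
    """Mean squared error over held-out folds for one penalty value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        coef, intercept = fit_ridge(X[train_idx], y[train_idx], penalty)
        pred = X[test_idx] @ coef + intercept
        sq_errors.append((y[test_idx] - pred) ** 2)
    return float(np.mean(np.concatenate(sq_errors)))


def select_penalty(X, y, penalties, n_folds=10):
    """Return the penalty with the lowest cross-validated error, plus all errors."""
    errors = {p: cv_mse(X, y, p, n_folds) for p in penalties}
    return min(errors, key=errors.get), errors


# Synthetic stand-in for language features and a symptom score.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = X[:, :3] @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=200)

penalties = [10.0 ** k for k in range(1, 7)]   # 10^1 ... 10^6 (assumed grid)
best_penalty, cv_errors = select_penalty(X, y, penalties)

# The final model is refit on all development data with the selected penalty.
final_coef, final_intercept = fit_ridge(X, y, best_penalty)
```

Under SEMP, it is the output of the last step (the fitted coefficients and intercept) that is frozen and preregistered before any prospective data are scored.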
Preregistered models, lexica, and topics
Preregistered PTSD-severity models
We preregistered models for PCL total score and the four subscales based on (a) only language, (b) only the demographic controls, and (c) language and demographic controls (for more details about how these were developed, see the Supplemental Material).
Preregistered pretrained n-gram models and word-count lexica of theoretically grounded dimensions
To quantify the associations between PTSD severity and related theoretically grounded dimensions, we use pretrained n-gram models (weighted lexica) trained on Facebook and Twitter language (Park et al., 2015; Schwartz et al., 2014) to predict users’ self-reported neuroticism, depression, and anxiety. We also use word-count lexica, namely categories from the Linguistic Inquiry and Word Count (Boyd et al., 2022) covering death, first-person singular and plural pronouns, and word lengths, because these were significantly related to PTSD in previous research (Son et al., 2021, p. 20). To assess respondents’ reexperience of the WTC attack, we selected two lexica that combine five and seven LDA topics, respectively, relating to reexperiencing the attack.
Open vocabulary topics: preregistered topics (word clouds)
After controlling for demographics, we plotted topics significantly associated with PCL scores using DLATK defaults. Three topics are preregistered from analyses of the development data set (for more details, see the Supplemental Material).
Results
Participants’ PCL scores did not significantly differ between the development (M = 26.29, SD = 11.7) and the prospective (M = 26.04, SD = 10.5) data sets (t = 0.36, df = 1,781, p = .722). The proportion of participants with a PCL score ≥ 44 was 10.0% in the development set and 8.4% in the prospective set. Likewise, the mean number of words in the development (M = 838, SD = 629) and prospective (M = 776, SD = 475) automated interviews (t = −1.74, df = 1,781, p = .082) did not differ significantly between the two data sets.
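The PCL comparison above can be reproduced from the reported summary statistics. The sketch below assumes a pooled-variance (Student) two-sample t test, which is consistent with the reported df = 1,781 (1,437 + 346 − 2) but is not stated explicitly in the text; the function name is hypothetical.

```python
import math


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Two-sample t statistic and df from summary statistics (pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / se, df


# PCL total scores: development vs. prospective samples, as reported above.
t_pcl, df_pcl = pooled_t(26.29, 11.7, 1437, 26.04, 10.5, 346)
```

With these rounded inputs, the statistic rounds to the reported t = 0.36 on 1,781 degrees of freedom.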
Language-based assessment of PTSD severity
In the prospective-language data, the preregistered pretrained language-based assessments of PTSD-symptom severity produced scores that significantly correlated with the PCL total scores (r = .38; Fig. 2 and Table SM11 in the Supplemental Material) and subscales (rs = .28–.37). All correlations are above the preregistered cutoffs based on the cross-validated correlations from the development set. Furthermore, all the PTSD-symptom-severity models based on language and demographics, except for the Re-Experience subscale, produced significantly less error than the preregistered baseline models using only demographics (rs = .10–.15; age, gender, occupation, and race). Overall, the preregistered models yielded correlations in the prospective data set corresponding to the cross-validated correlations in the development training data set (r difference range = .02–.08).

Fig. 2. (a) Models based on embeddings (Layer 23, RoBERTa-large) and topics; demographics = age at visit, gender, occupation (police or not), and race. Asterisks indicate significance, *p < .05, **p < .01, ***p < .001. ↑ = the model accuracy is significantly higher than the corresponding demographics model (i.e., produces a significantly lower error; ↑ = p < .05, ↑↑ = p < .01, ↑↑↑ = p < .001); PCL = the Posttraumatic Stress Disorder (PTSD) CheckList. These results do not include adjustments to account for shrinkage via regularization in the machine-learning models (see Section 3 in the Supplemental Material available online). (b) Receiver operating characteristic (ROC) curves and classification-accuracy metrics for PTSD diagnosis in medical records. The preregistered language model (red) significantly outperforms a World Trade Center (WTC) exposure model (light blue; DeLong’s test: Z = 2.70, p = .007, with predictors associated with PTSD in previous research; Bromet et al., 2016; Pietrzak et al., 2014; Zvolensky, Farris et al., 2015; Zvolensky, Kotov et al., 2015) and a depression model (dark blue; DeLong’s test: Z = 3.47, p < .001), which was the most accurate language-based assessment for PTSD severity in Son et al. (2021). (c) The mental-health-care expenditure stratified by language-assessed PTSD-severity quintiles.
Clinical validation
The preregistered model for PTSD severity was finally validated against participants’ PTSD diagnosis from their medical records. For the maximum balanced-accuracy score (.72), the preregistered model for PTSD severity yields a sensitivity of .80 and a specificity of .64. The preregistered model yields an area under the curve (AUC) of .76 (Fig. 2); it significantly outperforms the demographics-based model (AUC = .61, p = .006), an exposure-based model (AUC = .61, p = .007) with predictors related to PTSD in previous research (Bromet et al., 2016; Pietrzak et al., 2014; Zvolensky, Farris et al., 2015; Zvolensky, Kotov et al., 2015), a model combining demographics and exposure (AUC = .61, p < .005), and a depression n-gram model (AUC = .60, p < .001).
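Balanced accuracy is the mean of sensitivity and specificity, so the reported maximum of .72 follows from (.80 + .64) / 2. The sketch below illustrates how an AUC and the threshold maximizing balanced accuracy can be computed from continuous scores and binary diagnoses. The toy scores and labels are hypothetical and are not study data, and the function names are illustrative rather than those of the analysis scripts.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive case outscores a random
    negative case, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_balanced_threshold(scores, labels):
    """Return (threshold, balanced accuracy, sensitivity, specificity) for the
    cutoff that maximizes balanced accuracy; scores >= threshold are positive."""
    n_pos, n_neg = labels.count(1), labels.count(0)
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        balanced = (sens + spec) / 2
        if best is None or balanced > best[1]:
            best = (t, balanced, sens, spec)
    return best


# Hypothetical toy data for illustration only (1 = diagnosis on record).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 1]

toy_auc = auc(toy_scores, toy_labels)
toy_best = best_balanced_threshold(toy_scores, toy_labels)

# Consistency check on the values reported in the text.
reported_balanced_accuracy = (0.80 + 0.64) / 2
```

Because the threshold here is chosen on the same data it is evaluated on, the resulting sensitivity and specificity describe the best achievable operating point rather than a preregistered cutoff.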
Furthermore, the AI-based measure demonstrated significant incremental validity over PCL symptom severity with respect to the external criterion of health-care expenditure. A multiple linear regression analysis was conducted to examine the contributions of PCL ratings and language-assessed PTSD severity (both standardized) in predicting mental-health-care expenditure. The model was statistically significant overall (p < .001). For each standard-deviation increase in language-assessed PTSD severity, expenditure rose by approximately $696.50 (p < .001). This relationship was stronger than the one observed with standardized PCL scores, for which a standard-deviation increase corresponded to only a $254.10 increase that was not statistically significant (p = .183). Note that individuals in the top quintile of language-assessed PTSD-severity scores accounted for 72% of total mental-health-care expenditure.
Lexical assessments of PTSD-symptom severity
Lexical (word-based) assessments of depression (r = .18) and anxiety (r = .16) correlated significantly with overall PCL scores in the prospective language (Table 1). The difference in correlation between the development and prospective data sets ranges from .01 to .04, demonstrating that the relationship is robust but small.
Table 1. Association of Lexicon (Word-Based) Assessments With PTSD-Symptom Severity
Note: β coefficients are from a standardized linear regression with demographics as covariates. (+) = positive association preregistered; (–) = negative association preregistered; — = not preregistered; reexperiencing the WTC attack 1 (+) = five topics; reexperiencing the WTC attack 2 (+) = seven topics; demographics = age at visit, gender, occupation (police or not), and race; PTSD = posttraumatic stress disorder; WTC = World Trade Center.
*p < .05. **p < .01. ***p < .001.
In addition, the three preregistered topics were also significantly associated with the PCL scores (Fig. 3): Topic 288 (on “stress, anxiety, and pain”), Topic 286 (on “control”), and Topic 65 (on “mental health issues”). The difference in standardized beta between the development and prospective data sets ranged from 0.03 to 0.05, showing that the topic model generalized to future data by producing reliable results.

Fig. 3. Topics (automatically grouped similar words) from the automated interview significantly related to posttraumatic-stress-disorder severity. Scores are standardized linear coefficients (beta) controlling for demographics. Topics under (a) were preregistered, and those in (b) were also significant in the development set but did not meet the power-analysis criteria to be tested in the smaller test set.
Discussion
In the present study, we developed the first replicated language-based AI models of PTSD symptoms, covering overall severity and the four symptom clusters, and introduced the SEMP framework for their prospective evaluation. The models were trained in a sample of 1,437 WTC responders to assess PCL scores and evaluated in another sample of 346 responders. We used the SEMP design to ensure rigorous replication and found that the models performed comparably in the prospective held-out sample and in the sample in which they were trained. Moreover, the models showed external validity in explaining PTSD diagnosis in medical records and the use of mental-health-care services.
According to the SEMP design, the models were preregistered before the evaluation was conducted in the prospective sample so that no changes could be made to the models, addressing a common concern about overfitting in machine-learning models for clinical prediction (Chekroud et al., 2024). This test is particularly stringent compared with typical retrospective validation practices, which often do not evaluate model generalizability in a new prospective sample (Son et al., 2021). Nevertheless, we found that each model produced predictions that correlated with its target (Re-Experiencing, Avoidance, Numbing, and Hyperarousal subscales and PCL total), in line with the preregistered cutoffs. The SEMP procedure enables precise hypotheses regarding the effect size (which is uncommon in current practice). Furthermore, all the assessment models based on language and demographics, except for reexperiencing, produced significantly less error than the baseline models based on only demographics.
Importantly, the preregistered models were subjected to external validation against clinically salient criteria derived from medical records and thus methodologically independent from the collected language. Language-assessed PTSD severity discriminated patients with PTSD diagnosis with an AUC of .76, which is significantly higher than for other types of predictors, including demographics, exposures, and a previous state-of-the-art depression-severity model (Son et al., 2021). This AUC score was well above the cutoff for common clinical standards (.70; e.g., see the consensus-based standards for the selection of health measurement instruments, Prinsen et al., 2018). Furthermore, language-assessed PTSD severity demonstrated incremental validity over self-reported PCL ratings. When both were included as predictors of mental-health-service usage, only language-assessed PTSD severity was statistically significant, and the top quintile accounted for more than 70% of total mental-health-care expenditures.
Note that the interview questions were not designed to maximize convergence with PTSD-symptom measures, as they would have been had patients been asked to describe their symptoms. Rather, patients were asked more natural questions that invited them to speak freely about their past and present and their aspirations in life. Thus, the associations observed reflect psychological signals embedded in self-expression that is more spontaneous than explicit symptom reporting. For such contexts of comparing behavior with a psychological score, the modal effect size is between 0.1 and 0.4 (Meyer et al., 2001; Roberts et al., 2007). This puts the model’s convergent validity (r = .38, AUC = .76) at the upper end of that range, representing a substantial, meaningful effect, comparable with correlations typically observed between naturalistic behaviors and psychological constructs (Baumeister et al., 2007).
Pretrained n-gram models for depression and anxiety were positively correlated with observed PTSD severity. Furthermore, the three preregistered topics were significantly related to PTSD severity. Greater symptom severity was associated with greater use of topics indicating struggles with stress, anxiety, and pain; control; and mental-health issues. These findings provide insights into the specific language related to PTSD severity. They are consistent with research showing that a stressful life contributes to the exacerbation of these symptoms (Gold et al., 2005; Zvolensky, Farris et al., 2015; Zvolensky, Kotov et al., 2015) and with evidence of high comorbidity between PTSD, depression, anxiety, and pain (Brady et al., 2000; Caramanica et al., 2014; Sharp & Harvey, 2001).
Implications
With further development, language-based AI models can have multiple potential applications. First, AI can facilitate clinical research by offering systematic, fine-grained, and reliable measures of PTSD. For example, these behavioral markers may serve as complementary outcome metrics in randomized clinical trials (e.g., to address placebo effects) or as targets for studies investigating the neural underpinnings of PTSD. In addition, techniques to understand the performance of AI in predicting PTSD from language can help to elucidate psychological mechanisms underpinning this disorder (Boyd & Schwartz, 2021).
Second, language-based AI can be deployed as a screener. Existing screeners, such as the PCL, are short and easy to administer. Nevertheless, they pose a certain burden on patients and clinicians and as a result, are not routinely administered in many medical settings (Greene et al., 2016). In contrast, AI-based language assessments can be seamlessly integrated into telemedicine (Haleem et al., 2021), unobtrusively estimating the likelihood of PTSD after a session. High language scores may trigger the administration of a traditional screener or the referral to a mental-health provider for an evaluation. Evidence is emerging that AI-based screening may improve patient outcomes (Rollwage et al., 2024).
Third, once PTSD treatment is initiated, language-based AI can be used to monitor treatment response objectively, estimating symptom severity after each session. These AI assessments would avoid biases inherent in self-report measures (Neal et al., 2022; Schuler et al., 2021) and offer clinical utility over these scales, as we found in the present incremental-validity analyses. Implementing AI is not limited to telehealth and can be performed with any audio-recorded visits. Certainly, the use of AI in clinical settings should be combined with best practices for ensuring patient privacy, informed consent, transparency, and data security to form a responsible and ethical implementation of AI in mental-health care (for a more elaborate discussion on responsible AI and ethics, see Peters et al., 2020).
Fourth, by introducing the SEMP study design, we provide a rigorous framework for developing and evaluating clinical AI models, helping to safeguard results against accuracy overestimations and poor generalization. Accuracy estimations are crucial when implementing clinical AI models (Prinsen et al., 2018). The SEMP design could play a critical role in clinical AI assessments similar to the randomized-controlled-trial design used to evaluate interventions. These reliable accuracy estimates would be combined with best practices for ensuring patient privacy, informed consent, transparency, and data security to form a responsible and ethical implementation of AI in mental-health care (for a more elaborate discussion on responsible AI and ethics, see Peters et al., 2020).
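The two SEMP phases can be illustrated with a minimal sketch. It is not the implementation used in this study. It assumes a toy linear model, and the names `fingerprint`, `predict`, `evaluate_preregistered`, `feat_a`, and `feat_b`, along with all numeric values, are hypothetical. The idea shown is only that the model is frozen and fingerprinted before prospective data exist, and that evaluation refuses to run if the model has changed since preregistration.

```python
import hashlib
import json
import math


def fingerprint(weights: dict) -> str:
    """SHA-256 digest of a canonical JSON dump of the frozen model parameters."""
    payload = json.dumps(weights, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def predict(weights: dict, features: dict) -> float:
    """Toy linear model (an assumption): intercept plus weighted feature sum."""
    return weights["intercept"] + sum(
        coef * features.get(name, 0.0) for name, coef in weights["coefs"].items()
    )


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def evaluate_preregistered(weights: dict, registered_hash: str, sample: list) -> float:
    """Phase 2: verify the model is unchanged, then score the prospective sample."""
    if fingerprint(weights) != registered_hash:
        raise ValueError("Model differs from the preregistered version.")
    predicted = [predict(weights, row["features"]) for row in sample]
    observed = [row["pcl"] for row in sample]
    return pearson_r(predicted, observed)


# Phase 1 (development): freeze the model and record its fingerprint,
# e.g., in a preregistration document. Weights are invented placeholders.
frozen_model = {"intercept": 20.0, "coefs": {"feat_a": 5.0, "feat_b": -2.0}}
registered_hash = fingerprint(frozen_model)

# Phase 2 (evaluation): prospective data collected after preregistration.
# These four rows are fabricated toy values, not study data.
prospective_sample = [
    {"features": {"feat_a": 1.0, "feat_b": 0.5}, "pcl": 30},
    {"features": {"feat_a": 2.0, "feat_b": 0.0}, "pcl": 41},
    {"features": {"feat_a": 0.0, "feat_b": 1.0}, "pcl": 22},
    {"features": {"feat_a": 3.0, "feat_b": 1.0}, "pcl": 38},
]
r = evaluate_preregistered(frozen_model, registered_hash, prospective_sample)
print(f"Prospective correlation with PCL (toy data): r = {r:.2f}")
```

In this sketch, any retuning after preregistration changes the fingerprint and halts evaluation, which is one simple way to keep the prospective accuracy estimate free of information from the evaluation sample.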
Limitations
Although we suggest a procedure to assess the generalizability of AI models, the context for our analyses should still be considered. First, the AI models were developed and validated in a relatively homogeneous population and linguistic context: responders to the WTC attacks, who were mainly male, shared similar occupational and trauma histories, and were all assessed through a standardized interview. Such sample homogeneity may reduce linguistic variability, so the performance observed here could overestimate performance in other populations or settings. Furthermore, the context in which language is elicited (e.g., intake vs. follow-up sessions, in person vs. telehealth, structured vs. open interviews) can influence both how patients express symptoms and how language-based AI models perform. Occupational role was the only available proxy for socioeconomic status (SES); direct measures such as income, education, or composite SES indices were not collected, which limits our ability to characterize socioeconomic diversity in the sample and assess the generalizability of the findings across SES subgroups. Therefore, future research should evaluate and validate model performance across more diverse populations, trauma types, and language contexts.
Second, to elicit language generalizable to multiple outcomes, the interview questions concerned participants’ lives and experiences overall rather than specific clinical symptoms (see, e.g., Kjell et al., 2019, 2022), but further work is needed to verify whether this language actually generalizes better. Third, the analyses are based on cross-sectional evaluations and do not examine the models’ sensitivity to treatment-related change. Finally, because the models were trained to predict self-reported PTSD severity, they may partly reflect biases inherent in self-report measures (e.g., social desirability, recall errors, limited insight). Although other options for PTSD assessment exist (e.g., the Clinician-Administered PTSD Scale; Weathers et al., 2018), our validation against PTSD diagnosis from medical records and service utilization mitigates this limitation.
Conclusions
This is the first preregistered and prospectively validated language-based model of PTSD severity—including both total- and cluster-level-severity assessments—developed from clinically collected, automated interview data in the unified SEMP workflow. Our preregistered PTSD-severity models demonstrated reliable and accurate assessments in prospective data, showing convergent validity with self-report and external validity with PTSD diagnoses from medical records and mental-health-service usage. The accuracy of our language-based assessments exceeded that of models based on demographics, exposures, and depression. By introducing new safeguards to evaluate models, we demonstrate robust results supporting the potential of language-based assessments in psychiatric research and practice. Analyses of behavioral, observable markers in automated interview language produced robust, scalable psychiatric assessments, overcoming limitations found in traditional assessments.
Supplemental Material
Supplemental material, sj-docx-1-cpx-10.1177_21677026261439026 for Replicability and Validity of a New Artificial-Intelligence Assessment of Posttraumatic Stress Disorder From Patient Language: A Sequential Evaluation With Model Preregistration by Oscar Kjell, Adithya V. Ganesan, Ryan L. Boyd, Joshua Oltmanns, Alfredo Rivero, Scott Feltman, Melissa A. Carr, Jorge Alves, Benjamin Luft, Roman Kotov and H. Andrew Schwartz in Clinical Psychological Science
Footnotes
Acknowledgements
We express our gratitude to the rescue and recovery workers of the World Trade Center (WTC) attacks for their selfless dedication following the WTC attacks and for participating in this continuous research. Our thanks also extend to the clinical staff of the World Trade Center Medical Monitoring and Treatment Programs for their unwavering commitment and to the labor and community organizations for their ongoing support.
Transparency
Action Editor: Jennifer Lau
Editor: Jennifer L. Tackett