Real-world implementation and predictive performance of suicide prediction models: A systematic review and meta-analysis

Abstract

Background

Machine learning (ML) models for suicide risk prediction show promise, but their real-world discriminative ability and the sources of performance variation remain unclear.

Methods

We systematically searched PubMed, Embase, Scopus, and Web of Science for studies evaluating statistical or ML suicide prediction models in real-world clinical settings. Random-effects meta-analyses pooled the area under the receiver operating characteristic curve (AUC) overall and by outcome type. We conducted exploratory univariable meta-regression with cluster-robust inference for outcome type, evaluation timing, and model type. Sensitivity analyses assessed the influence of within-study dependence.

Results

Nine studies (20 evaluations) were included. The overall pooled AUC was 0.849 (95% CI 0.827–0.869). Pooled AUCs were 0.865 (0.840–0.887) for suicide attempts, 0.835 (0.825–0.844) for suicidal ideation, and 0.842 (0.797–0.878) for suicide death. Heterogeneity was extreme (I² = 99.9%). Sensitivity analyses yielded similar pooled estimates. Exploratory univariable meta-regression showed no clear associations for outcome type, evaluation timing, or model type.

Conclusions

Suicide prediction models show good real-world discrimination, but performance varies substantially across evaluations. We found no clear evidence that outcome type, evaluation timing, or model type explained this heterogeneity. These findings support rigorous local validation and workflow-sensitive implementation rather than assuming that more complex algorithms will perform better in practice.

Keywords

suicide, machine learning, implementation science, systematic review, meta-analysis

1. Introduction

Suicide is a major global public health challenge, accounting for more than 700,000 deaths annually.¹ Because it is a leading cause of preventable mortality, effective prevention depends on identifying individuals at high risk accurately and in a timely manner. However, traditional risk assessment approaches, including clinical interviews and standardized questionnaires, have shown limited predictive validity. Their accuracy has changed little over the past 50 years.² The subjective nature and low predictive power of these approaches remain major barriers to suicide prevention.

Recent advances in artificial intelligence (AI), particularly machine learning (ML), together with the increasing availability of large-scale digital data, have created new opportunities for suicide risk prediction. ML models can identify subtle and complex patterns in diverse data sources, such as electronic health records (EHRs), administrative claims, survey responses, and social media data. As a result, they may offer more objective, scalable, and timely predictions than conventional approaches. Interest in this field has grown rapidly, and many systematic reviews have reported promising predictive performance, often with pooled area under the receiver operating characteristic curve (AUC) values above 0.85.^3,4

Despite these encouraging findings, important questions remain about whether such performance translates into real-world clinical effectiveness. First, most published models have been evaluated retrospectively using historical data. In contrast, prospective studies in real-world clinical settings, in which predictions may trigger alerts or interventions, remain rare.⁵ This research-practice gap is a major obstacle to understanding the true clinical utility of suicide prediction models. Second, the literature shows substantial heterogeneity in reported performance. This variation may reflect differences in target outcomes, such as suicidal ideation, attempt, or death, as well as differences in patient populations and care settings. In addition, although more complex ML algorithms are increasingly favored, it remains unclear whether they outperform traditional and more interpretable statistical models, such as logistic regression, when applied in practice.⁶ Without a clearer understanding of whether these study-level characteristics and algorithmic choices explain performance differences,⁷ it is difficult to develop practical guidance for clinical adoption.

Therefore, this study aimed to synthesize current evidence on the real-world applicability and effectiveness of suicide prediction models by estimating pooled predictive accuracy using AUC. We also explored potential sources of heterogeneity through subgroup analyses and exploratory univariable meta-regression focused on clinically important study-level characteristics, including outcome type, evaluation timing, and model type. By examining the gap between technical performance and real-world clinical effectiveness, we sought to suggest evidence-based directions for the future implementation of AI-based suicide prediction models.

2. Methods

2.1. Protocol and reporting guideline

This systematic review and meta-analysis were conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.⁸

2.2. Search strategy

A comprehensive literature search was conducted in PubMed, Embase, Scopus, and Web of Science for studies published up until September 30, 2025. The search strategy was developed around four core concepts joined by the ‘AND’ operator: (1) the outcome of interest (suicide), (2) predictive modeling terms, (3) real-world implementation terms, and (4) effectiveness and performance metrics.

For the first concept, terms for suicide were used (e.g., suicide, “Suicide” [Mesh]). The second concept included keywords for predictive models, such as “prediction model” and “risk score”. The third concept captured the implementation context using terms like “implementation” and “real-world”. The final concept focused on evaluation with keywords such as “effectiveness”, “validation”, and “performance”.

The search strategy combined both controlled vocabulary (i.e., MeSH in PubMed and Emtree in Embase) and free-text keywords searched in titles and abstracts ([Title/Abstract]). No language or date restrictions were applied during the initial search to ensure a comprehensive retrieval of all potentially relevant studies. Additionally, the reference lists of all included articles were manually screened for further eligible publications. The full, detailed search query for each database is provided in Supplementary Table S1.

2.3. Study selection and eligibility criteria

2.3.1. Study selection process

The study selection was performed in a multi-stage process by two independent reviewers, following standard procedures recommended by the Cochrane Handbook for Systematic Reviews of Interventions.⁹ First, titles and abstracts of all retrieved records were screened for initial eligibility. Next, the full texts of potentially relevant articles were thoroughly assessed. Finally, a third screening was conducted to confirm that the studies met all criteria for quantitative synthesis in the meta-analysis.

Any disagreements between the reviewers at any stage were resolved through discussion and consensus or, if necessary, by consulting a third reviewer. This entire process was conducted and reported transparently in adherence with the PRISMA 2020 statement.⁸

2.4. Inclusion and exclusion criteria

2.4.1. Studies were included in the systematic review if they met the following criteria

• Population: Studies involving any human population where suicide risk was assessed.

• Intervention/Exposure: Studies that developed or validated a statistical or machine learning model to predict suicide-related outcomes (i.e., suicidal ideation, suicide attempt, or suicide death).

• Outcomes: Studies that reported at least one quantitative performance metric for the prediction model. For inclusion in the meta-analysis, studies were required to report an Area Under the Curve (AUC) value with its corresponding 95% confidence interval (CI), or provide sufficient data to calculate them.

• Study Design: Original, peer-reviewed research articles.

2.4.2. Studies were excluded based on the following criteria, identified during the screening process

• Studies that were not prediction models (e.g., scale validation studies, cost-effectiveness analyses without model application).

• Studies where the primary predicted outcome was not a suicide-related event (e.g., treatment-resistant depression, opioid overdose).

• Review articles, editorials, commentaries, or conference abstracts.

• Studies that did not report sufficient quantitative performance data to assess the model’s validity or to be included in the meta-analysis (e.g., missing AUC values).

2.5. Data extraction and quality assessment

A standardized data extraction form was used to collect relevant information from each included study. Two reviewers independently extracted the data, and any discrepancies were resolved by consensus after reviewing the source article.

The following variables were extracted from each study: study information (e.g., first author, publication year), study characteristics (e.g., country, sample size), model and implementation characteristics (e.g., model type, data source, implementation), outcome, and performance metrics (e.g., AUC with 95% CI).

The methodological quality of the included studies was assessed by two reviewers during the data extraction process, focusing on key domains of the Prediction model Risk Of Bias Assessment Tool (PROBAST) framework.¹⁰ We specifically evaluated the clarity regarding participants, predictors, and outcomes, as well as the appropriateness of the statistical analysis, to understand the overall robustness of the evidence base.

2.6. Statistical analysis

The primary effect measure for all quantitative syntheses was the area under the receiver operating characteristic curve (AUC) for discrimination.¹¹ For each model evaluation, we extracted the reported AUC and its 95% confidence interval (CI), when available. Model evaluations that did not report a CI were summarized descriptively in tables but were not included in the meta-analytic pooling. When multiple model evaluations were reported within a single study, they were treated as independent estimates in the primary analysis, acknowledging this as a pragmatic approach for variance estimation in the presence of extreme heterogeneity. Because AUC values are restricted to the interval [0, 1], we transformed AUCs to the logit scale before meta-analysis to stabilize variances and approximate normality.¹² For each model i, the effect size was defined as

$$θ_{i} = \operatorname{logit}(\mathrm{AUC}_{i}) = \ln\left(\frac{\mathrm{AUC}_{i}}{1 - \mathrm{AUC}_{i}}\right)$$

When 95% CIs were available, standard errors of logit (AUC) were derived from the reported CI bounds on the AUC scale. Specifically, the lower and upper CI limits were first transformed using the logit function, and the standard error was calculated as

$$\mathrm{SE}(θ_{i}) = \frac{\operatorname{logit}(\mathrm{UCL}_{i}) - \operatorname{logit}(\mathrm{LCL}_{i})}{2 \times 1.96}$$

This approach provides an approximate standard error because it relies on reported confidence interval bounds rather than individual-level data and assumes interval symmetry after logit transformation.

Inverse-variance weights were then obtained as $w_{i} = 1 / \mathrm{SE}(θ_{i})^{2}$.
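
As a minimal sketch of the transformation described above (not the authors' R code), the following Python snippet converts a reported AUC and its 95% CI into a logit-scale effect size, an approximate standard error, and an inverse-variance weight. The function name `logit_auc_effect` is hypothetical; the example row is the Walker (2021) prospective attempt evaluation from Table 1.

```python
import math


def logit(p: float) -> float:
    """Log-odds transform; requires 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logit_auc_effect(auc: float, lcl: float, ucl: float, z: float = 1.96):
    """Return (theta, se, weight) on the logit scale.

    The standard error is approximated from the reported CI bounds,
    assuming the interval is symmetric after the logit transform.
    """
    theta = logit(auc)
    se = (logit(ucl) - logit(lcl)) / (2.0 * z)
    weight = 1.0 / se ** 2
    return theta, se, weight


# Table 1 row: AUC 0.864 (95% CI 0.860 to 0.869)
theta, se, w = logit_auc_effect(0.864, 0.860, 0.869)
print(f"theta={theta:.3f}, se={se:.4f}, weight={w:.0f}")
```

Because the weight is the inverse of the squared standard error, evaluations with very narrow reported intervals (typically those from very large samples) dominate a fixed-effect synthesis, which is one reason the random-effects estimates are preferred here.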

We conducted random-effects meta-analyses of logit-transformed AUCs using a restricted maximum likelihood (REML) estimator for the between-study variance $τ^{2}$.¹³ The REML estimator was selected due to its superior efficiency and reduced bias in estimating between-study variance compared to the traditional DerSimonian–Laird approach.¹³

For each synthesis, we first calculated the Cochran Q statistic and then estimated $τ^{2}$ and the pooled effect as

$$\hat{θ}_{RE} = \frac{\sum_{i} w_{i}^{*} θ_{i}}{\sum_{i} w_{i}^{*}}, \quad w_{i}^{*} = \frac{1}{\mathrm{SE}(θ_{i})^{2} + τ^{2}}$$

Although both fixed- and REML random-effects models were fitted and displayed in the forest plots, our primary interpretations rely on the random-effects estimates due to the anticipated extreme between-study heterogeneity. The pooled effect and its 95% CI on the logit scale were subsequently back-transformed to the AUC scale using the inverse logit function to facilitate interpretation. Between-study heterogeneity was quantified using $τ^{2}$ and the I² statistic, defined as the proportion of total variability attributable to heterogeneity rather than sampling error.¹⁴
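
The pooling step can be illustrated with a small Python sketch. It assumes the commonly used fixed-point iteration for the REML estimate of τ², a normal-based 95% CI on the logit scale, and an I² computed from the "typical" within-study variance; whether the authors' metafor fit used exactly these settings is not stated in the text. The function name and the three toy inputs are hypothetical and do not come from Table 1.

```python
import math


def inv_logit(x: float) -> float:
    """Back-transform from the logit scale to the AUC scale."""
    return 1.0 / (1.0 + math.exp(-x))


def reml_random_effects(thetas, variances, max_iter=1000, tol=1e-12):
    """Random-effects pooling with a REML estimate of tau^2 (needs k >= 2)."""
    k = len(thetas)
    if k < 2:
        raise ValueError("need at least two estimates")
    tau2 = 0.0
    for _ in range(max_iter):
        w = [1.0 / (v + tau2) for v in variances]
        sw = sum(w)
        mu = sum(wi * yi for wi, yi in zip(w, thetas)) / sw
        num = sum(wi ** 2 * ((yi - mu) ** 2 - vi)
                  for wi, yi, vi in zip(w, thetas, variances))
        # REML fixed-point update, truncated at zero
        new_tau2 = max(0.0, num / sum(wi ** 2 for wi in w) + 1.0 / sw)
        done = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if done:
            break
    # Final random-effects weights, pooled estimate, and its standard error
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, thetas)) / sw
    se = math.sqrt(1.0 / sw)
    # "Typical" sampling variance used in the I^2 ratio
    w_fe = [1.0 / v for v in variances]
    s2 = (k - 1) * sum(w_fe) / (sum(w_fe) ** 2 - sum(wi ** 2 for wi in w_fe))
    return {
        "tau2": tau2,
        "pooled_logit": mu,
        "se": se,
        "pooled_auc": inv_logit(mu),
        "ci_auc": (inv_logit(mu - 1.96 * se), inv_logit(mu + 1.96 * se)),
        "i2": 100.0 * tau2 / (tau2 + s2),
    }


# Toy logit-scale inputs with equal sampling variances (hypothetical values)
res = reml_random_effects([1.0, 2.0, 3.0], [0.01, 0.01, 0.01])
print(res)
```

With equal sampling variances, the REML estimate reduces to the sample variance of the effects minus the common sampling variance, which gives a simple sanity check for the iteration. The confidence limits are computed on the logit scale and only then back-transformed, so the interval on the AUC scale is asymmetric and stays inside (0, 1).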

Pre-specified subgroup meta-analyses were performed according to outcome type: (1) suicide attempt–related outcomes, (2) suicidal ideation, and (3) suicide death or death-containing composite outcomes. Within each subgroup, a separate random-effects model was fitted on logit (AUC), and pooled estimates with 95% CIs and I² were reported. Forest plots were constructed to display study-specific AUCs with 95% CIs and the corresponding pooled AUCs for the overall analysis and each outcome subgroup.

In the primary analysis, multiple model evaluations reported within the same study were treated as separate estimates. To assess the potential influence of within-study dependence, we conducted two sensitivity analyses. First, we aggregated multiple evaluations within each study into a single study-level estimate using inverse-variance weighting and repeated the random-effects meta-analysis. Second, we applied cluster-robust variance estimation using study as the clustering unit in the evaluation-level model.
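
Both sensitivity analyses can be sketched in a few lines of Python. The sketch makes two loudly stated assumptions: the within-study aggregate treats a study's evaluations as uncorrelated when computing its variance, and the cluster-robust standard error is a basic sandwich estimator without small-sample correction, which may differ from the adjustment applied in the authors' software. Function names and the toy rows are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_by_study(rows):
    """Collapse (study, theta, variance) rows to one estimate per study.

    Uses inverse-variance weighting within each study and assumes the
    evaluations within a study are uncorrelated (an assumption).
    """
    groups = defaultdict(list)
    for study, theta, var in rows:
        groups[study].append((theta, var))
    out = {}
    for study, items in groups.items():
        w = [1.0 / v for _, v in items]
        pooled = sum(wi * t for wi, (t, _) in zip(w, items)) / sum(w)
        out[study] = (pooled, 1.0 / sum(w))
    return out


def cluster_robust_mean(rows, tau2):
    """Pooled mean with a sandwich SE that uses study as the cluster."""
    w = [1.0 / (var + tau2) for _, _, var in rows]
    sw = sum(w)
    mu = sum(wi * theta for wi, (_, theta, _) in zip(w, rows)) / sw
    # Sum weighted residuals within each study before squaring
    score = defaultdict(float)
    for wi, (study, theta, _) in zip(w, rows):
        score[study] += wi * (theta - mu)
    var_mu = sum(s ** 2 for s in score.values()) / sw ** 2
    return mu, math.sqrt(var_mu)


# Toy logit-scale rows (hypothetical): study A contributes two evaluations
rows = [("A", 1.0, 0.04), ("A", 2.0, 0.04), ("B", 3.0, 0.01)]
agg = aggregate_by_study(rows)
mu, se = cluster_robust_mean(rows, tau2=0.0)
print(agg, mu, se)
```

The aggregated study-level estimates would then be passed to the same random-effects model as in the primary analysis, whereas the cluster-robust approach keeps every evaluation and only changes how the uncertainty of the pooled estimate is computed.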

To explore potential sources of heterogeneity, we conducted exploratory moderator analyses with logit(AUC) as the dependent variable. Given the limited number of model evaluations and the risk of overfitting, we did not retain a multivariable meta-regression model including seven study-level covariates. Instead, we focused on three clinically and methodologically central covariates (outcome type, evaluation timing, and model type), each examined separately in univariable random-effects meta-regression models.

Because some studies contributed multiple model evaluations, we additionally applied cluster-robust inference using study as the clustering unit to obtain more conservative standard errors and confidence intervals. These moderator analyses were treated as exploratory and underpowered, and were interpreted cautiously. All statistical tests were two-sided, with p < 0.05 considered statistically significant. Analyses were conducted in R 4.4.0¹⁵ using the metafor package,¹⁶ with cluster-robust inference implemented for the univariable meta-regression analyses. Small-study effects were explored using funnel plots and Egger’s regression test.¹⁷
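
For intuition, a univariable meta-regression with a single binary moderator can be written as a weighted least-squares fit. The Python sketch below is illustrative only: it takes the residual heterogeneity τ² as given rather than estimating it, reports a model-based (not cluster-robust) standard error, and uses hypothetical toy data in which x = 1 marks, for example, machine learning models and x = 0 logistic regression.

```python
import math


def wls_binary_moderator(thetas, variances, x, tau2):
    """Fit theta = b0 + b1 * x with weights 1 / (v + tau2).

    Returns (b0, b1, se_b1). tau2 is supplied by the caller, which is
    an assumption of this sketch.
    """
    w = [1.0 / (v + tau2) for v in variances]
    s_w = sum(w)
    s_wx = sum(wi * xi for wi, xi in zip(w, x))
    s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    s_wy = sum(wi * yi for wi, yi in zip(w, thetas))
    s_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, thetas))
    det = s_w * s_wxx - s_wx ** 2
    # Closed-form solution of the 2x2 weighted normal equations
    b0 = (s_wxx * s_wy - s_wx * s_wxy) / det
    b1 = (s_w * s_wxy - s_wx * s_wy) / det
    se_b1 = math.sqrt(s_w / det)
    return b0, b1, se_b1


# Toy logit(AUC) values (hypothetical): two reference and two comparison models
b0, b1, se_b1 = wls_binary_moderator(
    thetas=[1.0, 1.2, 2.0, 2.4],
    variances=[0.01, 0.01, 0.01, 0.01],
    x=[0, 0, 1, 1],
    tau2=0.03,
)
print(b0, b1, se_b1)
```

With a binary moderator, the slope equals the difference between the weighted mean logit(AUC) of the two groups, which is how the coefficients relative to a reference category can be read.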

3. Results

3.1. Characteristics of included studies and models

A total of 9 studies reporting 20 evaluations of suicide prediction models were included in this review (Figure 1). The predicted outcomes encompassed suicide attempts, suicidal ideation, and suicide deaths, with models targeting suicide death being the most common. Several studies contributed multiple evaluations because models were assessed using different prediction horizons and/or modeling approaches.

Figure 1.

PRISMA 2020 flow diagram of study selection via databases and registers. The diagram summarizes the number of records identified through database and registry searches, the number remaining after deduplication, records excluded after title/abstract and full-text screening (with reasons), and the final number of studies included in the meta-analysis.

Most evaluations were conducted retrospectively using routinely collected clinical or administrative data, whereas only a subset of studies evaluated models prospectively or in near real-time. Clinical contexts ranged from mental health specialty care to general medical care, and a smaller number of evaluations were based on non-clinical or research-oriented data sources, including survey-based cohorts and population cohort/biobank data. For consistency, we classified data sources into four categories: Clinical (routine care and administrative data), Cohort (population cohort or biobank data), Survey (survey-based or student cohorts), and Overall (combined sources). A detailed summary of study designs, data sources, outcomes, prediction windows, and modeling approaches is provided in Table 1.

Table 1.

Characteristics of included model evaluations and discriminative performance (AUC). Each row represents a distinct model evaluation within a given study. The table summarizes the study, country, sample size, predicted outcome, data source, evaluation timing, model type, prediction window, and the AUC with its 95% confidence interval (lower and upper bounds). LR = logistic regression; ML = machine learning.

Study	N	Country	Outcome	Data source	Evaluation timing	Model type	Prediction window	AUC	95% CI lower	95% CI upper
Walsh (2021) ¹⁸	77,973	USA	attempt	Overall	Prospective	ML	30 days	0.797	0.796	0.798
Walsh (2021) ¹⁸	77,973	USA	ideation	Overall	Prospective	ML	30 days	0.836	0.836	0.837
Walker (2021) ¹⁹	2,327,499	USA	attempt	Clinical	Prospective	LR	90 days	0.864	0.86	0.869
Walker (2021) ¹⁹	4,799,175	USA	death	Clinical	Prospective	LR	90 days	0.806	0.79	0.822
Walker (2021) ¹⁹	2,773,976	USA	death	Clinical	Prospective	LR	90 days	0.804	0.782	0.829
Walker (2021) ¹⁹	4,073,012	USA	attempt	Clinical	Retrospective	LR	90 days	0.862	0.86	0.864
Zang (2024) ²⁰	285,320	USA	attempt	Clinical	Retrospective	LR	24 days	0.879	0.876	0.882
Zang (2024) ²⁰	285,320	USA	attempt	Clinical	Retrospective	ML	24 days	0.887	0.884	0.89
Zang (2024) ²⁰	285,320	USA	attempt	Clinical	Retrospective	ML	24 days	0.901	0.896	0.905
Nielsen (2023) ²¹	912,118	Denmark	attempt	Clinical	Retrospective	ML	30 days	0.85	0.84	0.85
Nielsen (2023) ²¹	912,118	Denmark	death	Clinical	Retrospective	ML	30 days	0.71	0.7	0.73
Arunpongpaisal (2024) ²²	3,324	Thailand	death	Clinical	Retrospective	LR	Concurrent	0.902	0.886	0.917
Wang (2024) ²³	325,473	Canada	death	Clinical	Retrospective	LR	60 days	0.79	0.78	0.79
Wang (2024) ²³	331,202	Canada	death	Clinical	Retrospective	LR	60 days	0.85	0.84	0.86
Wang (2023) ²⁴	4,683	UK	death	Cohort	Retrospective	ML	1 year	0.919	0.852	0.985
Wang (2023) ²⁴	4,683	UK	death	Cohort	Retrospective	ML	1 year	0.901	0.821	0.981
Wang (2023) ²⁴	17,493	UK	death	Cohort	Retrospective	ML	6 years	0.892	0.844	0.94
Wang (2023) ²⁴	17,493	UK	death	Cohort	Retrospective	ML	6 years	0.885	0.834	0.936
He (2024) ²⁵	2,814	China	ideation	Survey	Retrospective	LR	Concurrent	0.85	0.82	0.88
Kaminsky (2024) ²⁶	1,524	Canada	ideation	Survey	Retrospective	ML	30 days	0.816	0.79	0.842

Risk-of-bias assessment using PROBAST indicated that methodological concerns were identified most often in the analysis domain, particularly in relation to limited reporting of calibration. Additional concerns in some studies related to outcome ascertainment based on unvalidated administrative codes and, in a smaller number of cases, to applicability issues in specific validation settings.

3.2. Overall discriminative performance of suicide prediction models

Across the 20 model evaluations that reported AUCs with 95% confidence intervals, the random-effects meta-analysis produced a pooled AUC of 0.849 (95% CI 0.827–0.869) for predicting any suicide-related outcome (Figure 2). This pooled estimate indicates that, on average, suicide prediction models evaluated in real-world settings achieve good discriminative performance, although the individual evaluation estimates show variability.

Figure 2.

Forest plot of AUCs for all suicide prediction model evaluations. Each line represents an individual model evaluation with its AUC and 95% confidence interval; the diamond at the bottom indicates the pooled AUC from the random-effects meta-analysis for any suicide-related outcome.

Although the pooled discrimination was good, heterogeneity was extremely high (I² = 99.9%, τ² = 0.121), indicating that predictive performance varied markedly across model evaluations. Such variability may reflect differences in the target outcome (attempt, ideation, or death), data source and care context, prediction horizon, modeling approach, and the extent of real-world implementation. Evaluation-specific AUC estimates with 95% confidence intervals are shown in the overall forest plot, and pooled results by outcome group are summarized in Table 2.

Table 2.

Pooled discriminative performance and heterogeneity for overall and outcome-specific suicide prediction models. Separate restricted maximum likelihood (REML) random-effects meta-analyses were conducted for models predicting any suicide-related outcome, suicide attempts, suicidal ideation, and suicide deaths (including death-containing composite outcomes). “Model evaluations” gives the number of model evaluations (k) included in each synthesis. “Pooled AUC” indicates the pooled area under the receiver operating characteristic curve with its 95% confidence interval (LCL = lower confidence limit, UCL = upper confidence limit). I² represents the percentage of total variability attributable to between-study heterogeneity.

Model group	Model evaluations	Pooled AUC	95% CI (LCL–UCL)	I² (%)
Any suicide-related outcome (overall)	20	0.849	0.827 – 0.869	99.9
Suicide attempt	7	0.865	0.840 – 0.887	99.8
Suicidal ideation	3	0.835	0.825 – 0.844	22.8
Suicide death	10	0.842	0.797 – 0.878	98.6

Potential small-study effects were explored in the overall analysis using Egger’s regression test. Although the test was not statistically significant (p = 0.110) (Supplementary Figure S1), this result should be interpreted cautiously because the extreme between-evaluation heterogeneity limits the validity and interpretability of funnel plot–based asymmetry assessments.

3.3. Discriminative performance by outcome type

We conducted outcome-specific random-effects meta-analyses (Figure 3(a)–(c)). The pooled AUC was 0.865 (95% CI 0.840–0.887) for suicide attempt, 0.835 (95% CI 0.825–0.844) for suicidal ideation, and 0.842 (95% CI 0.797–0.878) for suicide death. Heterogeneity differed across outcome strata: inconsistency was very high for suicide attempt and suicide death (I² = 99.8% and 98.6%, respectively) and low for suicidal ideation (I² = 22.8%), although the suicidal ideation subgroup included only three evaluations and its I² estimate is therefore imprecise.

Figure 3.

Forest plots of pooled discriminative performance by outcome type. Forest plots of area under the receiver operating characteristic curve (AUC) from random-effects meta-analyses stratified by suicide-related outcome. Diamonds represent pooled AUCs with 95% confidence intervals, and squares represent individual model evaluations weighted by the inverse of their variance. (a) Suicide attempt–related outcomes (pooled AUC = 0.865, 95% CI 0.840–0.887). (b) Suicidal ideation outcomes (pooled AUC = 0.835, 95% CI 0.825–0.844). (c) Suicide death–related outcomes, including death-containing composite outcomes (pooled AUC = 0.842, 95% CI 0.797–0.878).

Differences between outcome strata were assessed using a test for subgroup differences (meta-ANOVA). Pairwise contrasts were not statistically significant: suicide death vs. suicidal ideation (p = .917), suicide death vs. suicide attempt (p = .217), and suicidal ideation vs. suicide attempt (p = .306). Pairwise contrasts are summarized in Table 3.

Table 3.

Pairwise comparisons of pooled discriminative performance by outcome type (meta-ANOVA). Pairwise contrasts were obtained from a test for subgroup differences (meta-ANOVA) implemented in the dmetar package. Estimates represent differences in pooled effects between outcome groups on the logit(AUC) scale used in the meta-analysis; SE denotes the standard error, and p values are two-sided. Pairwise comparisons were conducted for exploratory purposes and were not adjusted for multiple comparisons.

Comparison	Estimate	SE	P-value
Suicide death vs suicidal ideation	0.025	0.238	0.917
Suicide death vs suicide attempt	-0.221	0.179	0.217
Suicidal ideation vs suicide attempt	-0.246	0.240	0.306
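
As a quick arithmetic check on Table 3, a two-sided p value can be recomputed from each estimate and its standard error with a normal-approximation Wald test. This is an assumption about how the contrasts were tested, not a confirmed description of the authors' procedure, although the recomputed values agree with the reported ones to within about 0.001 (differences of that size are expected from rounding of the tabulated estimates). The function name `two_sided_p` is hypothetical.

```python
import math


def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p value for z = estimate / se under a standard normal."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Estimates and standard errors as reported in Table 3
contrasts = {
    "Suicide death vs suicidal ideation": (0.025, 0.238),
    "Suicide death vs suicide attempt": (-0.221, 0.179),
    "Suicidal ideation vs suicide attempt": (-0.246, 0.240),
}

for name, (est, se) in contrasts.items():
    print(f"{name}: z = {est / se:.3f}, p = {two_sided_p(est, se):.3f}")
```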

Sensitivity analyses addressing within-study dependence produced results similar to the primary analysis. The pooled AUC was 0.853 (95% CI 0.827–0.878) in the study-level aggregation analysis and 0.849 (95% CI 0.820–0.874) in the cluster-robust analysis, indicating that the overall findings were not substantially altered. These sensitivity results are summarized in Supplementary Table S2.

3.4. Meta-regression analyses

To explore potential sources of heterogeneity, exploratory univariable meta-regression analyses with cluster-robust inference were conducted for three clinically central covariates: outcome type, evaluation timing, and model type (Table 4). Relative to suicide attempt, the coefficients were -0.246 for suicidal ideation (p = .229) and -0.221 for suicide death (p = .398). The coefficient for prospective versus retrospective evaluation was -0.263 (p = .221), and the coefficient for machine learning versus logistic regression was 0.011 (p = .958). No covariate was significantly associated with logit-transformed AUC values.

Table 4.

Exploratory univariable meta-regression analyses of logit-transformed AUC values by outcome type, evaluation timing, and model type. Each coefficient represents the difference in logit (AUC) relative to the corresponding reference category. Suicide attempt, retrospective evaluation, and logistic regression served as the reference categories.

Covariate	Comparison	β (logit AUC)	SE	95% CI	p
Outcome type	Suicidal ideation vs attempt	-0.246	0.166	-0.755 to 0.263	0.229
Outcome type	Suicide death vs attempt	-0.221	0.237	-0.854 to 0.412	0.398
Evaluation timing	Prospective vs retrospective	-0.263	0.144	-0.939 to 0.414	0.221
Model type	Machine learning vs logistic regression	0.011	0.198	-0.478 to 0.500	0.958

4. Discussion

4.1. Principal findings

In this systematic review and meta-analysis of real-world suicide prediction models, we found that models evaluated in routine clinical settings showed generally good overall discrimination. This suggests that data-driven suicide risk prediction can retain meaningful predictive value beyond development-only studies. At the same time, performance estimates varied substantially across evaluations, indicating that real-world effectiveness is not uniform across settings. Taken together, these findings support a balanced interpretation: suicide prediction models show promise in practice, but their performance is strongly context-dependent and should not be assumed to generalize automatically across institutions or workflows.

Importantly, our exploratory moderator analyses did not identify clear associations between discriminative performance and three clinically central study-level characteristics: outcome type, evaluation timing, and model type. These findings do not rule out the possibility that such factors matter in specific contexts. However, they suggest that broad study-level labels alone may be insufficient to explain the marked heterogeneity observed across evaluations. Instead, real-world performance may also depend on more localized factors, such as data quality, documentation patterns, feature construction, case-mix, and clinical workflow integration.

4.2. Interpretation and implications

A notable finding from the exploratory moderator analyses was the absence of a clear association between model type and discriminative performance. Within the limits of the current evidence base, this suggests that greater algorithmic complexity does not necessarily translate into better real-world discrimination.^3,27–29 From a clinical implementation perspective, this is important because simpler and more interpretable approaches, such as logistic regression, may remain reasonable options when their performance is not clearly inferior.^27–29

We also did not observe a clear association between evaluation timing and model discrimination. This should not be interpreted as evidence that implementation timing is unimportant. Rather, it suggests that prospective versus retrospective labeling alone may not fully explain the wide variability in real-world performance.^30,31 Taken together, these findings shift attention away from the search for a universally superior algorithm and toward the practical conditions under which models are validated, calibrated, and integrated into local workflows.^30–32 Because discrimination alone does not guarantee accurate absolute risk estimation, future evaluations should report calibration and clinical utility metrics alongside AUC.^30,32

4.3. Comparison with previous literature

Previous reviews have highlighted both the promise of machine learning for suicide prediction and the substantial heterogeneity across studies.^3,4,28,29 Implementation-oriented reviews have further emphasized that predictive performance alone is insufficient for clinical value unless models are supported by effective workflow integration, governance, and intervention pathways.^30–32 Our findings are broadly consistent with this literature and extend it by focusing specifically on real-world evaluations. Together, these results suggest that performance differences across health systems are likely shaped by contextual factors beyond broad study-level categories, reinforcing the need for careful local validation before transport across settings.^29,31

4.4. Strengths and limitations

The main strength of this study is its strict focus on model evaluations conducted in real-world or pragmatic contexts, which directly informs health-system deployment decisions. We synthesized discrimination using logit-transformed AUCs and explored potential sources of variability through meta-regression, allowing us to examine both average performance and the extent of real-world variability.^12,14

Several limitations should be acknowledged. First, the available evidence focused mainly on discrimination, whereas calibration and clinical utility metrics were rarely reported. This limits conclusions about absolute risk estimation and net clinical benefit.^30,32 In addition, the PROBAST-based assessment suggested that methodological concerns were most often concentrated in the analysis domain, particularly because of limited reporting of calibration. Therefore, strong AUC values should not be interpreted as evidence that these models are ready for implementation without further evaluation of calibration, threshold performance, and clinical usefulness. Second, heterogeneity remained extreme and largely unexplained despite exploratory moderator analyses, which may reflect residual differences in local clinical processes. Third, although we included 20 model evaluations from 9 studies, some subgroups were small, so the meta-regression findings should be regarded as exploratory. Fourth, multiple evaluations from the same study were treated as independent in the primary analysis to maximize data use, an approach that ignores within-study correlation and can overstate precision. We addressed this issue through sensitivity analyses, and the overall findings remained materially unchanged. Fifth, the moderator analyses were limited to a small number of clinically central covariates. Finally, the assessment of small-study effects should be interpreted cautiously. Although Egger’s regression test was not statistically significant, the extreme heterogeneity in effect estimates limits the interpretability of funnel plot asymmetry methods in this review.

4.5. Future directions

Future research on suicide prediction should move beyond a purely algorithm-centered agenda focused on small gains in AUC and place greater emphasis on implementation science.³¹ In highly heterogeneous real-world settings, the key challenge is no longer identifying a universally superior algorithm. Instead, it is determining how models can be effectively validated, calibrated, and integrated into local clinical environments.^29,30

Accordingly, future studies should prioritize rigorous local validation and prospective testing across diverse institutions. Models should be embedded into context-specific clinical workflows so that alerts are actionable, acceptable to clinicians, and capable of improving decision-making without increasing alert fatigue.^30–32 Routine validation should also include fairness assessments across vulnerable subgroups and continuous post-deployment surveillance to monitor performance drift and calibration decay over time.^10,29 Future implementation studies should therefore move beyond discrimination alone and evaluate whether predicted risks are well calibrated and clinically actionable in practice.^30,32

5. Conclusions

Statistical and machine learning suicide prediction models demonstrate good discriminative performance in real-world clinical settings, with a pooled AUC of 0.849 (95% CI 0.827–0.869). However, this overall performance masks substantial heterogeneity that was not clearly explained by outcome type, evaluation timing, or model type. These findings suggest that the future of suicide prediction lies not in a universally optimal algorithm, but in careful local validation and workflow-sensitive implementation. Ultimately, the clinical utility of these models depends on adapting them to the realities of specific health systems while maintaining fairness, calibration, and actionability.

Supplemental material

Supplemental material - Real-world implementation and predictive performance of suicide prediction models: A systematic review and meta-analysis

Supplemental material for Real-world implementation and predictive performance of suicide prediction models: A systematic review and meta-analysis by KangHyun Kim, JuHee Kim, Myung-Gwan Kim, Hyun Wook Han in DIGITAL HEALTH

Footnotes

ORCID iDs

KangHyun Kim

JuHee Kim

Myung-Gwan Kim

Hyun Wook Han

Author contributions

KH.K. conceptualized the study, conducted the main analyses, and drafted the manuscript. JH.K. assisted with data collection and conducted the literature review for the meta-analysis. MG.K. provided critical statistical and methodological advice. HW.H. supervised the overall research process, provided the research infrastructure, and critically revised the manuscript. All authors read and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-ICAN (ICT Challenge and Advanced Network of HRD) grant funded by the Korea government (Ministry of Science and ICT) (IITP-2026-2710093245).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

AI use disclosure

The authors used an artificial intelligence–based language tool only for language editing, translation support, and improvement of readability during manuscript revision. The authors reviewed and approved all revisions and take full responsibility for the final content of the manuscript.

Supplemental material

Supplemental material for this article is available online.

References

1. World Health Organization. Suicide worldwide in 2019: Global Health Estimates. WHO, 2021.

2. Franklin, Ribeiro, Fox, et al. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychol Bull 2017; 143(2): 187–232. https://doi.org/10.1037/bul0000084

3. Bernert, Homer, Shah, et al. Artificial intelligence and suicide prevention: a systematic review of machine learning investigations. Int J Environ Res Public Health 2020; 17(16): 5929. https://doi.org/10.3390/ijerph17165929

4. Belsher, Smolenski, Pruitt, et al. Prediction models for suicide attempts and deaths: a systematic review and simulation. JAMA Psychiatry 2019; 76(6): 642–651. https://doi.org/10.1001/jamapsychiatry.2019.0174

5. D'Hotman, Lombardo. AI enabled suicide prediction tools: a qualitative narrative review. BMJ Health Care Inform 2020; 27(3): e100175.

6. Christodoulou, Collins, et al. A systematic review shows no evidence of clinical utility of machine learning over logistic regression for clinical prediction models. Lancet Digit Health 2019; 1(2): e82–e90.

7. Futoma, Simons, Panch, et al. The myth of generalizability in clinical risk prediction: the importance of local context. Lancet Digit Health 2020; 2(9): e489–e491.

8. Page, McKenzie, Bossuyt, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021; 372: n71. https://doi.org/10.1136/bmj.n71

9. Higgins JPT, Thomas, Chandler, et al. (eds). Cochrane Handbook for Systematic Reviews of Interventions. 2nd ed. John Wiley & Sons, 2019.

10. Wolff, Moons, Riley, et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 2019; 170(1): 51–58. https://doi.org/10.7326/M18-1376

11. Hanley, McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143(1): 29–36. https://doi.org/10.1148/radiology.143.1.7063747

12. Snell, Ensor, Debray, et al. Meta-analysis of prediction model performance across multiple studies: Which scale helps ensure between-study normality for the AUC and C-statistic? Stat Med 2018; 37(24): 3505–3518.

13. Viechtbauer. Bias and efficiency of meta-analytic variance estimators in the random-effects model. J Educ Behav Stat 2005; 30(3): 261–293. https://doi.org/10.3102/10769986030003261

14. Higgins, Thompson. Quantifying heterogeneity in a meta-analysis. Stat Med 2002; 21(11): 1539–1558. https://doi.org/10.1002/sim.1186

15. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2024.

16. Viechtbauer. Conducting meta-analyses in R with the metafor package. J Stat Softw 2010; 36(3): 1–48. https://doi.org/10.18637/jss.v036.i03

17. Egger, Davey Smith, Schneider, et al. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997; 315(7109): 629–634. https://doi.org/10.1136/bmj.315.7109.629

18. Walsh, Johnson, Ripperger, et al. Prospective validation of an electronic health record-based, real-time suicide risk model. JAMA Netw Open 2021; 4(3): e211428. https://doi.org/10.1001/jamanetworkopen.2021.1428

19. Walker, Shortreed, Ziebell, et al. Evaluation of electronic health record-based suicide risk prediction models on contemporary data. Appl Clin Inform 2021; 12(4): 778–787. https://doi.org/10.1055/s-0041-1733908

20. Zang, Hou, Lyu, et al. Accuracy and transportability of machine learning models for adolescent suicide prediction with longitudinal clinical records. Transl Psychiatry 2024; 14(1): 316. https://doi.org/10.1038/s41398-024-03034-3

21. Nielsen, Christensen RHB, Madsen, et al. Prediction models of suicide and non-fatal suicide attempt after discharge from a psychiatric inpatient stay: A machine learning approach on nationwide Danish registers. Acta Psychiatr Scand 2023; 148(6): 525–537. https://doi.org/10.1111/acps.13629

22. Arunpongpaisal, Assanangkornchai, Chongsuvivatwong. Developing a risk prediction model for death at first suicide attempt-Identifying risk factors from Thailand's national suicide surveillance system data. PLoS One 2024; 19(4): e0297904. https://doi.org/10.1371/journal.pone.0297904

23. Wang, Kharrat FGZ, Gariépy, et al. Predicting the population risk of suicide using routinely collected health administrative data in Quebec, Canada: Model-based synthetic estimation study. JMIR Public Health Surveill 2024; 10: e52773. https://doi.org/10.2196/52773

24. Wang, Qiu, Zhu, et al. Prediction of suicidal behaviors in the middle-aged population: Machine learning analyses of UK Biobank. JMIR Public Health Surveill 2023; 9: e43419. https://doi.org/10.2196/43419

25. Pang, Yang, et al. Development of a prediction model for suicidal ideation in patients with advanced cancer: A multicenter, real-world, pan-cancer study in China. Cancer Med 2024; 13(13): e7439. https://doi.org/10.1002/cam4.7439

26. Kaminsky, McQuaid, Hellemans KGC, et al. Machine learning-based suicide risk prediction model for suicidal trajectory on social media following suicidal mentions: Independent algorithm validation. J Med Internet Res 2024; 26: e49927. https://doi.org/10.2196/49927

27. Riley, Collins. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J 2023; 65(8): e2200302. https://doi.org/10.1002/bimj.202200302

28. Collins, Reitsma, Altman, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015; 350: g7594. https://doi.org/10.1136/bmj.g7594

29. Collins, Moons KGM, Dhiman, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024; 385: e078378. https://doi.org/10.1136/bmj-2023-078378

30. Van Calster, McLernon, van Smeden, et al. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol 2016; 74: 167–176. https://doi.org/10.1016/j.jclinepi.2015.12.005

31. Kirtley, van Mens, Hoogendoorn, et al. Translating promise into practice: a review of machine learning in suicide research and prevention. Lancet Psychiatry 2022; 9(3): 243–252. https://doi.org/10.1016/S2215-0366(21)00254-6

32. Vickers, Elkin. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006; 26(6): 565–574. https://doi.org/10.1177/0272989X06295361
