Abstract
Objective
This study aimed to evaluate the performance of large language models—ChatGPT-4o and Gemini 1.5 Pro—in assessing suicide risk and guiding treatment in adolescents presenting to the emergency department with suicidal ideation and/or attempts.
Materials and Methods
A retrospective review was conducted on child psychiatry consultation notes from 36 adolescents evaluated between February and March 2024. Structured clinical data were entered into ChatGPT and Gemini, and the resulting decisions were compared to those made by clinicians regarding hospitalization, sedation need, medication initiation, follow-up timing, and notification of social services or law enforcement.
Results
ChatGPT showed higher concordance with clinicians than Gemini, particularly for hospitalization decisions (41.6% agreement) and sedation decisions (100% agreement). ChatGPT recommended hospitalization in 58.3% of cases, compared with 33.3% by clinicians and 36.1% by Gemini. For outpatient cases, ChatGPT demonstrated partial alignment with clinical decisions on medication and follow-up, while Gemini's responses were often uncertain or incomplete.
Conclusion
Large language models show promise as decision-support tools in adolescent psychiatric emergencies. ChatGPT was more consistent with clinical judgments than Gemini. However, limitations remain, and further studies involving broader populations are needed before routine clinical integration.
Plain Language Summary
Suicide attempts in adolescents are serious and complex situations that require careful evaluation by clinicians. In this study, we compared how two artificial intelligence (AI) systems, ChatGPT and Gemini, perform in supporting clinical decisions for adolescents presenting to the emergency department after a suicide attempt. We used real clinical cases and asked both AI systems to make decisions about hospitalization, need for sedation, medication use, and follow-up timing. We then compared these decisions with those made by experienced clinicians. Our findings showed that ChatGPT generally performed closer to clinicians, especially in decisions such as sedation and follow-up planning. However, it also tended to recommend hospitalization more often, suggesting a more cautious approach. Gemini, on the other hand, showed more uncertainty and lower agreement with clinicians. Although AI systems showed some strengths in structured decision-making, they were not consistent across all areas and relied entirely on the information provided by clinicians. This means that they cannot replace human judgment. Overall, AI tools may be helpful as support systems, but final decisions should always be made by trained healthcare professionals, especially in sensitive situations such as adolescent mental health emergencies.
