Abstract

Purpose: To compare the diagnostic and management accuracy of large language model chatbots with that of humans in outpatient retina triage of on-call telephone emergencies.

Methods: Four large language model chatbots, 3 vitreoretinal surgery fellows, and 3 certified ophthalmic technicians with on-call experience were presented with 10 simulated retina cases representing after-hours telephone calls from patients. Diagnosis and triage recommendations were obtained from chatbots and humans and were graded for each chatbot and human respondent.

Results: Human respondents were significantly more accurate than chatbots in diagnosis (95% vs 76.7%, respectively; P < .01) and follow-up recommendations (85% vs 70%, respectively; P = .03). However, chatbot performance varied. ChatGPT (OpenAI; 90%, P = .4) and Claude (Anthropic; 83.3%, P = .11) were noninferior to humans in diagnosis, while Meta (Meta Platforms Inc; 76.7%, P = .01) and Gemini (Google LLC; 56.7%, P < .001) performed significantly worse than humans. ChatGPT (93.3%, P = .32) and Claude (90%, P = .74) were also noninferior to humans in follow-up recommendations, but Gemini (50%, P < .001) and Meta (46.7%, P < .001) were worse than humans.

Conclusions: Overall, this pilot study found that humans performed better than large language model–based chatbots in diagnosing and triaging retina-specific on-call telephone emergencies. However, chatbot accuracy was variable, with ChatGPT and Claude showing noninferior performance compared with humans. These findings suggest that, with further validation, certain large language models could serve as useful aids for managing emergency telephone calls of varying medical urgency.
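As a reading aid, the pooled chatbot accuracies in the Results (76.7% for diagnosis, 70% for follow-up) can be checked against the per-chatbot figures. The sketch below assumes that every chatbot contributed the same number of graded responses, which the abstract does not state; under that assumption the pooled rate is the unweighted mean. The function name `pooled_accuracy` is hypothetical and is not taken from the study.

```python
# Per-chatbot accuracies (%) exactly as reported in the abstract.
diagnosis = {"ChatGPT": 90.0, "Claude": 83.3, "Meta": 76.7, "Gemini": 56.7}
follow_up = {"ChatGPT": 93.3, "Claude": 90.0, "Gemini": 50.0, "Meta": 46.7}


def pooled_accuracy(per_chatbot):
    """Unweighted mean of the per-chatbot accuracies.

    ASSUMPTION: each chatbot was graded on the same number of responses,
    so the unweighted mean equals the pooled rate. The abstract does not
    confirm this.
    """
    return sum(per_chatbot.values()) / len(per_chatbot)


pooled_diagnosis = round(pooled_accuracy(diagnosis), 1)
pooled_follow_up = round(pooled_accuracy(follow_up), 1)

print(f"Pooled diagnosis accuracy: {pooled_diagnosis}%")
print(f"Pooled follow-up accuracy: {pooled_follow_up}%")
```

Under the equal-weight assumption, both means agree with the pooled values reported in the abstract, so the per-chatbot and overall figures are internally consistent.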

Keywords

large language models, artificial intelligence, retina triage, retina emergency

Large Language Models Triage of Retina Patient Emergency Telephone Calls: A Pilot Study
