Using Large Language Models in the Diagnosis of Acute Cholecystitis: Assessing Accuracy and Guidelines Compliance

Abstract

Background

Large language models (LLMs) are advanced tools capable of understanding and generating human-like text. This study evaluated the accuracy of several commercial LLMs in addressing clinical questions related to the diagnosis and management of acute cholecystitis, as outlined in the Tokyo Guidelines 2018 (TG18). We assessed how closely their responses agreed with the expert panel discussions presented in the guidelines.

Methods

We evaluated ChatGPT4.0, Gemini Advanced, and GPTo1-preview on ten clinical questions. Eight were derived from TG18, and two were formulated by the authors. Two authors independently rated the accuracy of each LLM’s responses on a four-point scale: (1) accurate and comprehensive, (2) accurate but not comprehensive, (3) partially accurate, partially inaccurate, and (4) entirely inaccurate. A third author resolved any scoring discrepancies. We then compared the performance of ChatGPT4.0 with that of the newer LLMs, Gemini Advanced and GPTo1-preview, on the same set of questions to delineate their respective strengths and limitations.
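The rating procedure described above can be sketched in a few lines of Python. This is only an illustration of the two-rater scheme with third-author adjudication; the names `Rating` and `final_rating` are hypothetical and are not taken from the study.

```python
from enum import IntEnum
from typing import Optional


class Rating(IntEnum):
    """Four-point accuracy scale from the Methods (1 is best, 4 is worst)."""
    ACCURATE_COMPREHENSIVE = 1
    ACCURATE_NOT_COMPREHENSIVE = 2
    PARTIALLY_ACCURATE = 3
    ENTIRELY_INACCURATE = 4


def final_rating(rater_a: Rating, rater_b: Rating,
                 adjudicator: Optional[Rating] = None) -> Rating:
    """Return the agreed score, or the third author's score on disagreement.

    Hypothetical helper: the abstract states only that a third author
    resolved discrepancies, not how the decision was recorded.
    """
    if rater_a == rater_b:
        return rater_a
    if adjudicator is None:
        raise ValueError("raters disagree; a third rating is required")
    return adjudicator
```

For example, `final_rating(Rating(2), Rating(3), Rating(3))` returns `Rating.PARTIALLY_ACCURATE`, while two matching scores are returned unchanged without consulting the adjudicator.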

Results

ChatGPT4.0 provided consistent responses for 90% of the questions. It delivered “accurate and comprehensive” answers for 4/10 (40%) questions and “accurate but not comprehensive” answers for 5/10 (50%). One response (10%) was rated as “partially accurate, partially inaccurate.” Gemini Advanced demonstrated higher accuracy on some questions but yielded a similar percentage of “partially accurate, partially inaccurate” responses. Notably, neither model produced “entirely inaccurate” answers.
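As a worked check of the percentages reported for ChatGPT4.0, the sketch below tallies ten final ratings into counts and percentages. Only the category counts (4, 5, 1, 0) come from the abstract; the per-question ordering in `chatgpt_ratings` and the function name `rating_distribution` are illustrative assumptions.

```python
from collections import Counter

# Labels of the four-point scale used in the Methods.
LABELS = {
    1: "accurate and comprehensive",
    2: "accurate but not comprehensive",
    3: "partially accurate, partially inaccurate",
    4: "entirely inaccurate",
}


def rating_distribution(ratings):
    """Map each scale point to (count, percentage of all rated questions)."""
    counts = Counter(ratings)
    total = len(ratings)
    return {score: (counts[score], 100 * counts[score] / total)
            for score in LABELS}


# Assumed ordering; only the totals per category are reported in the abstract.
chatgpt_ratings = [1] * 4 + [2] * 5 + [3] * 1

for score, (count, pct) in rating_distribution(chatgpt_ratings).items():
    print(f"{LABELS[score]}: {count}/10 ({pct:.0f}%)")
```

The same function could be applied to the Gemini Advanced and GPTo1-preview ratings, which the abstract does not report in full.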

Discussion

LLMs such as ChatGPT and Gemini Advanced demonstrate potential in accurately addressing clinical questions regarding acute cholecystitis. With awareness of their limitations, careful implementation, and ongoing refinement, LLMs could serve as valuable resources for physician education and patient information, potentially improving clinical decision-making in the future.

Keywords

large language models, GPT-4, Gemini Advanced, acute cholecystitis, Tokyo Guidelines, clinical decision support


