Abstract
Background
Artificial intelligence (AI), particularly large language models (LLMs), has gained attention for its clinical applications. While LLMs have shown utility in various medical fields, their performance in inguinal hernia repair (IHR) remains understudied. This study evaluated the accuracy and readability of LLM-generated responses to IHR-related questions, as well as their performance across distinct clinical categories.
Methods
Thirty questions were developed based on clinical guidelines for IHR and categorized into four subgroups: diagnosis, perioperative care, surgical management, and other. Questions were entered into Microsoft Copilot®, Google Gemini®, and OpenAI ChatGPT-4®. Responses were anonymized and evaluated by six fellowship-trained minimally invasive surgeons using a validated 5-point Likert scale. Readability was assessed with six validated formulae.
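The six readability formulae are not named in this abstract; as an illustration of how such scoring works, the sketch below implements one widely used metric, the Flesch-Kincaid Grade Level, with a naive vowel-group syllable counter. This is a minimal, hypothetical example for intuition only, not the validated instruments used in the study.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: each run of consecutive vowels counts as one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Longer words and sentences push the estimated grade level up,
# which is why jargon-heavy LLM answers score at college level.
simple = "The cat sat. The dog ran."
complex_text = ("Laparoscopic inguinal hernia repair demonstrates "
                "comparable perioperative outcomes.")
print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(complex_text))
```

Validated implementations of this and related formulae (e.g. Gunning Fog, SMOG) use dictionary-based syllable counts and are preferable for research use.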
Results
GPT-4 and Gemini outperformed Copilot in overall mean scores for response accuracy (Copilot: 3.75 ± 0.99, Gemini: 4.35 ± 0.82, and GPT-4: 4.30 ± 0.89; P < 0.001). Subgroup analysis revealed significantly higher scores for Gemini and GPT-4 in perioperative care (P = 0.025) and surgical management (P < 0.001). Readability scores were comparable across models, with all responses at college to college-graduate reading levels.
Discussion
This study highlights the variability in LLM performance, with GPT-4 and Gemini producing higher-quality responses than Copilot for IHR-related questions. However, the consistently high reading level of responses may limit accessibility for patients. These findings underscore the potential of LLMs to serve as valuable adjunct tools in surgical practice, with ongoing advancements expected to further enhance their accuracy, readability, and applicability.
