Sage Journals: Discover world-class research

Abstract

Objective

This study aimed to evaluate the performance of large language models (LLMs) in answering questions from the American Board of Surgery In-Training Examination (ABSITE).

Methods

Multiple-choice ABSITE quiz questions were entered as prompts into the most popular LLMs: ChatGPT-4 (OpenAI), Copilot (Microsoft), and Gemini (Google). The study comprised 170 questions from 2017 to 2022, divided into four subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. All questions were submitted to the LLMs between October 1 and October 5, 2024, and the correct answer rate of each LLM was evaluated.

Results

Across all questions, the correct response rates were 79.4% for ChatGPT, 77.6% for Copilot, and 52.9% for Gemini, with Gemini significantly lower than both other LLMs (P < 0.001). In the Definitions category, the correct response rates were 93.5% for ChatGPT, 90.3% for Copilot, and 64.5% for Gemini, with Gemini significantly lower (P = 0.005 and P = 0.015, respectively). In the Biochemistry/Pharmaceutical category, the correct response rate was identical for all three LLMs (83.3%). In the Case Scenario category, the correct response rates were 76.3% for ChatGPT, 72.8% for Copilot, and 46.5% for Gemini, with Gemini significantly lower (P < 0.001). In the Treatment & Surgical Procedures category, the correct response rates were 69.2% for ChatGPT, 84.6% for Copilot, and 53.8% for Gemini. Although Gemini had the lowest accuracy in this category, the difference was not statistically significant (P = 0.236).
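To illustrate how the overall comparison above can be checked, the following is a minimal Python sketch, not the authors' analysis. The abstract does not name the statistical test used, so the Pearson chi-square test here is an assumption, as is treating the three sets of answers as independent samples (the same 170 questions were put to every model, and a paired test would need per-question data that the abstract does not give). The correct-answer counts are back-calculated from the reported percentages, and the helper names `implied_correct` and `chi_square_2x2` are hypothetical.

```python
import math

TOTAL_QUESTIONS = 170  # stated in the Methods

# Overall correct response rates (percent) as reported in the Results.
reported_rates = {"ChatGPT": 79.4, "Copilot": 77.6, "Gemini": 52.9}


def implied_correct(rate_percent, n):
    """Whole-number count of correct answers closest to a reported rate.

    Assumption: the abstract reports percentages only, so counts are inferred.
    """
    return round(rate_percent / 100 * n)


def chi_square_2x2(correct_a, correct_b, n):
    """Pearson chi-square test (no continuity correction) for two models
    that each answered n questions.

    Returns (statistic, p_value). With 1 degree of freedom the p-value
    has the closed form erfc(sqrt(statistic / 2)).
    """
    wrong_a, wrong_b = n - correct_a, n - correct_b
    # Both groups have n questions, so expected counts split evenly.
    exp_correct = (correct_a + correct_b) / 2
    exp_wrong = (wrong_a + wrong_b) / 2
    statistic = sum(
        (observed - expected) ** 2 / expected
        for observed, expected in [
            (correct_a, exp_correct),
            (correct_b, exp_correct),
            (wrong_a, exp_wrong),
            (wrong_b, exp_wrong),
        ]
    )
    return statistic, math.erfc(math.sqrt(statistic / 2))


counts = {
    model: implied_correct(rate, TOTAL_QUESTIONS)
    for model, rate in reported_rates.items()
}

for first, second in [("ChatGPT", "Gemini"), ("Copilot", "Gemini"), ("ChatGPT", "Copilot")]:
    stat, p = chi_square_2x2(counts[first], counts[second], TOTAL_QUESTIONS)
    print(f"{first} vs {second}: chi2 = {stat:.2f}, p = {p:.3g}")
```

Under these assumptions, both comparisons against Gemini give p-values below 0.001, while the ChatGPT versus Copilot comparison does not reach significance, which is consistent with the reported results.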

Conclusion

On the ABSITE quiz questions, ChatGPT and Copilot performed similarly, whereas Gemini was significantly less accurate.

Keywords

American Board of Surgery, artificial intelligence, ChatGPT, Gemini, Copilot
