Abstract
Objective
This study aimed to directly compare the performance of 2 successive versions of ChatGPT (ChatGPT-4o and ChatGPT-5) in answering questions from the American Board of Surgery In-Training Examination (ABSITE) Quiz.
Methods
A total of 170 multiple-choice ABSITE Quiz questions (2017-2022) were categorized into 4 subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. Each question was entered into both ChatGPT versions using the same question set. Correct answer rates were recorded, and paired comparisons were conducted using McNemar’s test.
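For readers unfamiliar with the paired analysis described above, McNemar's test compares two classifiers on the same question set using only the discordant pairs (questions one model answered correctly and the other did not). A minimal sketch of the exact two-sided test, with hypothetical discordant counts (the abstract does not report them):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test.

    b = questions model A answered correctly but model B missed;
    c = the reverse. Under the null, discordant outcomes follow
    Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided p-value: double the smaller binomial tail, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

# Hypothetical discordant counts for illustration only (not from the study):
# 5 questions ChatGPT-5 missed that ChatGPT-4o got, 18 the other way.
p_value = mcnemar_exact(5, 18)
```

With these illustrative counts the test rejects the null of equal accuracy at the 0.05 level; the concordant pairs (both correct or both wrong) play no role in the statistic.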
Results
Overall accuracy was 79.4% for ChatGPT-4o and 87.1% for ChatGPT-5, with the improvement reaching statistical significance.
Conclusion
ChatGPT-5 demonstrated significantly higher accuracy than ChatGPT-4o in ABSITE Quiz questions, particularly in case-based scenarios requiring clinical reasoning. These findings suggest that newer LLM versions may provide more reliable support in surgical education and exam preparation, though further validation in multimodal and real exam settings is needed.
