Abstract
Objective
This study aimed to directly compare the performance of 2 successive versions of ChatGPT (ChatGPT-4o and ChatGPT-5) in answering questions from the American Board of Surgery In-Training Examination (ABSITE) Quiz.
Methods
A total of 170 multiple-choice ABSITE Quiz questions (2017-2022) were categorized into 4 subgroups: Definitions, Biochemistry/Pharmaceutical, Case Scenario, and Treatment & Surgical Procedures. Each question was entered into both ChatGPT versions using the same question set. Correct answer rates were recorded, and paired comparisons were conducted using McNemar’s test.
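For readers unfamiliar with the paired analysis described above, McNemar's test compares two classifiers on the same question set using only the discordant pairs (questions one model answered correctly and the other did not). A minimal sketch of the exact two-sided test, with hypothetical discordant counts (the abstract does not report them):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test.

    b = questions model A answered correctly but model B missed;
    c = the reverse. Under the null, discordant outcomes follow
    Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided p-value: double the smaller binomial tail, capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

# Hypothetical discordant counts for illustration only (not from the study):
# 5 questions ChatGPT-5 missed that ChatGPT-4o got, 18 the other way.
p_value = mcnemar_exact(5, 18)
```

With these illustrative counts the test rejects the null of equal accuracy at the 0.05 level; the concordant pairs (both correct or both wrong) play no role in the statistic.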
Results
Overall accuracy was 79.4% for ChatGPT-4o and 87.1% for ChatGPT-5, with the improvement reaching statistical significance.
Conclusion
ChatGPT-5 demonstrated significantly higher accuracy than ChatGPT-4o in ABSITE Quiz questions, particularly in case-based scenarios requiring clinical reasoning. These findings suggest that newer LLM versions may provide more reliable support in surgical education and exam preparation, though further validation in multimodal and real exam settings is needed.
