Abstract
Background
The rapid advancement of artificial intelligence (AI) has led to its increasing application in the medical field, particularly in providing accurate and reliable information for complex medical queries.
Purpose
This study evaluates the performance of four AI engines (Perplexity, ChatGPT, DeepSeek, and Gemini) in answering 100 multiple-choice questions derived from the American Board of Surgery In-Training Examination (ABSITE). The questions covered five surgical subspecialties: colorectal surgery, acute care and trauma surgery (ACS), upper GI surgery, breast and endocrine surgery, and hepatopancreatobiliary (HPB) surgery.
Data collection
The main objective was to evaluate these AI engines' ability to provide accurate and focused medical knowledge. The study was conducted from January 1, 2025, to March 28, 2025. All AI engines received identical questions, and their responses were classified as correct or incorrect against the ABSITE answer key. Each question was entered manually into the chatbots, ensuring no memory retention bias.
Statistical analysis
Statistical analysis was performed with JASP software to compare performance across subspecialties and AI engines using univariate and multivariate analyses.
Results
Among the AI tools evaluated, DeepSeek produced the most accurate responses (74%), followed by ChatGPT (70%), Gemini (69%), and Perplexity (65%). ChatGPT achieved 83.3% accuracy in colorectal surgery, while DeepSeek scored highest in HPB surgery (84.6%) and ACS (67.6%). Perplexity achieved 100% accuracy in breast and endocrine surgery, the highest score recorded in the study. ChatGPT exhibited significant performance variability across surgical subspecialties (P < .05), especially in acute care and trauma surgery. Logistic regression indicated that Gemini and Perplexity produced the most consistent answers among the AI systems, with a significant odds ratio of 2.5 (P < .01). The AI engines showed different combinations of precision and reliability when answering surgical questions, with DeepSeek remaining the most reliable overall.
Conclusions
AI models intended for medical applications require further development, as performance differed substantially across surgical subspecialties.
