Abstract
Introduction
Visceral aneurysms pose diagnostic and therapeutic challenges in vascular surgery. Large language models (LLMs) may assist in clinical decision-making, but their application requires rigorous validation. Traditional validation methods are labor-intensive and difficult to scale.
Objective
We examined the capability of an LLM in managing visceral aneurysms and explored an automated framework for validating AI-generated clinical responses.
Methods
Using Python with the Pandas library and the OpenAI API, we probed the Society for Vascular Surgery (SVS) clinical practice guidelines on visceral aneurysm management. ChatGPT-4o-mini was instructed to review guideline recommendations, generate clinical scenarios, propose management strategies, and evaluate its own responses using a four-tier rubric (1 = completely correct; 2 = partially correct; 3 = partially incorrect; 4 = no correct information). Human evaluators independently assessed the same responses, graded the AI-generated questions as good, fair, or poor in quality, and noted whether they were leading. A minimal sketch of such a pipeline is shown below; the prompt wording, guideline text, and output file name are illustrative assumptions, not the study's actual materials.
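```python
# Minimal sketch of the automated pipeline described above, assuming the
# current OpenAI Python SDK (>=1.x) and an OPENAI_API_KEY in the environment.
# Prompts, the example recommendation, and column names are illustrative.
import pandas as pd
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(prompt: str) -> str:
    """Send a single prompt to the model and return its text reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Hypothetical guideline recommendation used as input to the pipeline.
recommendation = "Splenic artery aneurysms >3 cm should be considered for repair."

# 1. Generate a clinical scenario from the guideline recommendation.
scenario = ask(
    "Based on this SVS recommendation, write a short clinical scenario "
    f"ending in a management question:\n{recommendation}"
)

# 2. Propose a management strategy for the generated scenario.
answer = ask(f"Propose a management strategy for this scenario:\n{scenario}")

# 3. Ask the model to grade its own answer on the four-tier rubric.
self_grade = ask(
    "Grade the following answer against the SVS recommendation using this "
    "rubric: 1 = completely correct, 2 = partially correct, "
    "3 = partially incorrect, 4 = no correct information.\n"
    f"Recommendation: {recommendation}\nAnswer: {answer}\n"
    "Reply with the number only."
)

# Collect results in a DataFrame for subsequent human review and analysis.
df = pd.DataFrame(
    [{"recommendation": recommendation, "scenario": scenario,
      "ai_answer": answer, "ai_self_grade": self_grade}]
)
df.to_csv("visceral_aneurysm_scenarios.csv", index=False)
```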
Results
Eighty visceral aneurysm scenarios were generated and evaluated. ChatGPT-4o-mini self-assessed 89% of responses as correct (scores 1-2), compared to 67% by human evaluators (chi-square, P < 0.0001), with the greatest discrepancy in the partially correct category. Most AI-generated questions were of good quality (56%), though 44% were considered leading questions.
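For illustration only, a chi-square comparison of this kind could be run as follows. The counts are rough approximations derived from the reported percentages (89% vs. 67% of 80 responses graded 1-2), not the study's actual contingency table, so the resulting P value will not match the published one.

```python
# Illustrative chi-square test comparing AI self-grades with human grades.
from scipy.stats import chi2_contingency

#           graded 1-2   graded 3-4
table = [[71, 9],    # ChatGPT-4o-mini self-assessment (~89% of 80)
         [54, 26]]   # human evaluators (~67% of 80)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.4f}")
```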
Conclusion
An automated validation framework for AI-generated clinical responses is feasible. However, the 67% correctness rate and the systematic self-overestimation by the AI indicate that current LLMs remain unsuitable for independent clinical use, reinforcing the need for expert oversight. The integration of Python-driven automation, structured AI inference, and expert review holds promise for increasing the efficiency of evaluating LLMs at scale across clinical domains.
