Abstract
Introduction
Visceral aneurysms pose diagnostic and therapeutic challenges in vascular surgery. Large language models (LLMs) may assist in clinical decision-making, but their application requires rigorous validation. Traditional validation methods are labor-intensive and difficult to scale.
Objective
We examined the capability of an LLM in managing visceral aneurysms and explored an automated framework for validating AI-generated clinical responses.
Methods
Using Python with the Pandas library and the OpenAI API, we probed the Society for Vascular Surgery (SVS) clinical practice guidelines on visceral aneurysm management. ChatGPT-4o-mini was instructed to review guideline recommendations, generate clinical scenarios, propose management strategies, and evaluate its own responses using a four-tier rubric (1 = completely correct; 2 = partially correct; 3 = partially incorrect; 4 = no correct information). Human evaluators independently assessed the same responses, graded the AI-generated questions as good, fair, or poor in quality, and noted whether they were leading. A minimal sketch of such a pipeline is shown below; the prompt wording, guideline text, and output file name are illustrative assumptions, not the study's actual materials.
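```python
# Minimal sketch of the automated pipeline described above, assuming the
# current OpenAI Python SDK (>=1.x) and an OPENAI_API_KEY in the environment.
# Prompts, the example recommendation, and column names are illustrative.
import pandas as pd
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(prompt: str) -> str:
    """Send a single prompt to the model and return its text reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Hypothetical guideline recommendation used as input to the pipeline.
recommendation = "Splenic artery aneurysms >3 cm should be considered for repair."

# 1. Generate a clinical scenario from the guideline recommendation.
scenario = ask(
    "Based on this SVS recommendation, write a short clinical scenario "
    f"ending in a management question:\n{recommendation}"
)

# 2. Propose a management strategy for the generated scenario.
answer = ask(f"Propose a management strategy for this scenario:\n{scenario}")

# 3. Ask the model to grade its own answer on the four-tier rubric.
self_grade = ask(
    "Grade the following answer against the SVS recommendation using this "
    "rubric: 1 = completely correct, 2 = partially correct, "
    "3 = partially incorrect, 4 = no correct information.\n"
    f"Recommendation: {recommendation}\nAnswer: {answer}\n"
    "Reply with the number only."
)

# Collect results in a DataFrame for subsequent human review and analysis.
df = pd.DataFrame(
    [{"recommendation": recommendation, "scenario": scenario,
      "ai_answer": answer, "ai_self_grade": self_grade}]
)
df.to_csv("visceral_aneurysm_scenarios.csv", index=False)
```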
Results
Eighty visceral aneurysm scenarios were generated and evaluated. ChatGPT-4o-mini self-assessed 89% of responses as correct (scores 1-2), compared to 67% by human evaluators (chi-square, P < 0.0001), with the greatest discrepancy in the partially correct category. Most AI-generated questions were of good quality (56%), though 44% were considered leading questions.
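For illustration only, a chi-square comparison of this kind could be run as follows. The counts are rough approximations derived from the reported percentages (89% vs. 67% of 80 responses graded 1-2), not the study's actual contingency table, so the resulting P value will not match the published one.

```python
# Illustrative chi-square test comparing AI self-grades with human grades.
from scipy.stats import chi2_contingency

#           graded 1-2   graded 3-4
table = [[71, 9],    # ChatGPT-4o-mini self-assessment (~89% of 80)
         [54, 26]]   # human evaluators (~67% of 80)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.4f}")
```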
Conclusion
An automated validation framework for AI-generated clinical responses is feasible. However, the 67% correctness rate and the systematic self-overestimation by the AI indicate that current LLMs remain unsuitable for independent clinical use, reinforcing the need for expert oversight. The integration of Python-driven automation, structured AI inference, and expert review holds promise for increasing the efficiency of evaluating LLMs at scale across clinical domains.
