Abstract
This study compared the macroscopic adhesion scoring performance of large language models (LLMs: ChatGPT-o3, ChatGPT-5, Gemini-2.5 Pro) with that of novice veterinary surgeons, using expert consensus as the reference standard. Eighty standardized postoperative laparotomy cases in Wistar rats were photographed and scored on the Nair 0–4 adhesion scale. Two novice surgeons and three LLMs evaluated each case independently; the expert reference was defined by a surgeon and a pathologist. Group differences were analyzed with the Kruskal–Wallis test and Dunn–Bonferroni post hoc comparisons, correlations with Bonferroni-adjusted Spearman coefficients, interobserver reliability among human raters with the intraclass correlation coefficient, ICC(A,1), and agreement with the expert with quadratic-weighted Cohen’s κ and exact-match accuracy. Overall group differences were significant. ChatGPT-o3, ChatGPT-5, Gemini-2.5 Pro, and Novice 1 assigned lower scores, whereas Novice 2 assigned higher scores. Correlations with the expert were significant for Novice 1 (ρ = 0.706), Novice 2 (ρ = 0.593), and ChatGPT-o3 (ρ = 0.617), but not for ChatGPT-5 or Gemini-2.5 Pro. Interobserver reliability among the human raters was moderate (ICC = 0.55). Importantly, exact-match accuracies were modest for all evaluators, with the highest observed for Novice 1 (33.8%) and ≤26.3% for the LLMs. Although the novices outperformed the models, these findings highlight the intrinsic difficulty of fine-grained Nair 0–4 adhesion scoring on two-dimensional intraoperative images and suggest that current LLMs are better suited as calibrated decision-support tools than as stand-alone raters.
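For readers wishing to reproduce the per-rater agreement metrics named above, the following is a minimal Python sketch, not the authors' code; the score vectors, variable names, and values are hypothetical and serve only to illustrate how Spearman correlation, quadratic-weighted Cohen's κ, and exact-match accuracy against an expert reference can be computed.

```python
# Minimal sketch (hypothetical data): agreement of one rater with the expert reference
# on the ordinal Nair 0-4 adhesion scale.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

expert = np.array([0, 1, 2, 3, 4, 2, 1, 3])   # hypothetical expert consensus scores
rater  = np.array([0, 1, 1, 3, 3, 2, 0, 4])   # hypothetical rater scores (novice or LLM)

rho, p = spearmanr(expert, rater)                              # Spearman rank correlation
kappa = cohen_kappa_score(expert, rater, weights="quadratic")  # quadratic-weighted Cohen's kappa
accuracy = np.mean(expert == rater)                            # exact-match accuracy

print(f"rho = {rho:.3f} (p = {p:.3f}), kappa = {kappa:.3f}, exact match = {accuracy:.1%}")
```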