Abstract
This study compared the macroscopic adhesion scoring performance of large language models (LLMs: ChatGPT-o3, ChatGPT-5, Gemini-2.5 Pro) with that of novice veterinary surgeons, using expert consensus as the reference standard. Eighty standardized postoperative laparotomy cases in Wistar rats were photographed and scored on the Nair 0–4 adhesion scale. Two novice surgeons and three LLMs evaluated each case independently; the expert reference was defined by a surgeon and a pathologist. Group differences were analyzed with the Kruskal–Wallis test and Dunn–Bonferroni post hoc comparisons, correlations with Bonferroni-adjusted Spearman coefficients, interobserver reliability among human raters with the intraclass correlation coefficient, ICC(A,1), and agreement with the expert with quadratic-weighted Cohen’s κ and exact-match accuracy. Overall group differences were significant. ChatGPT-o3, ChatGPT-5, Gemini-2.5 Pro, and Novice 1 assigned lower scores, whereas Novice 2 assigned higher scores. Correlations with the expert were significant for Novice 1 (ρ = 0.706), Novice 2 (ρ = 0.593), and ChatGPT-o3 (ρ = 0.617), but not for ChatGPT-5 or Gemini-2.5 Pro. Interobserver reliability among the human raters was moderate (ICC = 0.55). Importantly, exact-match accuracies were modest for all evaluators, with the highest observed for Novice 1 (33.8%) and ≤26.3% for the LLMs. Although the novices outperformed the models, these findings highlight the intrinsic difficulty of fine-grained Nair 0–4 adhesion scoring on two-dimensional intraoperative images and suggest that current LLMs are better suited as calibrated decision-support tools than as stand-alone raters.
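For readers wishing to reproduce the per-rater agreement metrics named above, the following is a minimal Python sketch, not the authors' code; the score vectors, variable names, and values are hypothetical and serve only to illustrate how Spearman correlation, quadratic-weighted Cohen's κ, and exact-match accuracy against an expert reference can be computed.

```python
# Minimal sketch (hypothetical data): agreement of one rater with the expert reference
# on the ordinal Nair 0-4 adhesion scale.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

expert = np.array([0, 1, 2, 3, 4, 2, 1, 3])   # hypothetical expert consensus scores
rater  = np.array([0, 1, 1, 3, 3, 2, 0, 4])   # hypothetical rater scores (novice or LLM)

rho, p = spearmanr(expert, rater)                              # Spearman rank correlation
kappa = cohen_kappa_score(expert, rater, weights="quadratic")  # quadratic-weighted Cohen's kappa
accuracy = np.mean(expert == rater)                            # exact-match accuracy

print(f"rho = {rho:.3f} (p = {p:.3f}), kappa = {kappa:.3f}, exact match = {accuracy:.1%}")
```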