Abstract
This study critically evaluates the performance of Generative Pre-trained Transformer (GPT)-4-based large language models (LLMs) for extracting imaging findings from oncology records, with a primary focus on quantifying the impact of reference data quality on measured performance.
A two-phase study was conducted on 40 oncology medical records. In Phase 1, model outputs were compared against existing, uncurated reference summaries. In Phase 2, outputs for a 20-record subset were re-evaluated against a new “gold standard” of expert-curated, standardized summaries created by a board-certified radiologist. We systematically tested two model versions (text-only GPT-4.0 vs. multimodal GPT-4.1), two prompt designs, two input modalities (text vs. image), and two document scopes. Performance was assessed using lexical metrics (BLEU, ROUGE, METEOR) and a semantic alignment metric (Kullback–Leibler [KL] Divergence).
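The lexical and semantic metrics named above can be illustrated with a minimal sketch. The snippet below is not the study's evaluation pipeline; it shows one common formulation of ROUGE-1 F1 (unigram overlap) and a smoothed unigram KL divergence, with whitespace tokenization and the smoothing constant chosen here purely for illustration.

```python
from collections import Counter
import math


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def kl_divergence(candidate: str, reference: str, eps: float = 1e-9) -> float:
    """KL(reference || candidate) over smoothed unigram distributions."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    vocab = set(cand) | set(ref)
    n_c, n_r = sum(cand.values()), sum(ref.values())
    kl = 0.0
    for w in vocab:
        p = (ref[w] + eps) / (n_r + eps * len(vocab))
        q = (cand[w] + eps) / (n_c + eps * len(vocab))
        kl += p * math.log(p / q)
    return kl
```

Under this formulation, a model summary sharing two of three unigrams with the reference scores ROUGE-1 F1 ≈ 0.67, and KL divergence approaches zero as the two token distributions converge.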
A profound performance disparity was observed between phases. Phase 1 evaluation against uncurated references yielded modest scores (e.g., max ROUGE-1 ≈ 0.45, BLEU ≈ 0.15) and high semantic divergence (KL > 7.7). In contrast, Phase 2 evaluation against the gold-standard references resulted in substantial improvements across all configurations. The top-performing configuration—multimodal GPT-4.1 using image-based input on the full document—achieved a ROUGE-1 of 0.57, BLEU of 0.25, and a significantly lower KL Divergence of 5.96, closely approaching the expert standard.
The quality and consistency of the reference standard are the most critical drivers of measured LLM performance in clinical information extraction tasks. Standard NLP metrics can be misleading when applied to uncurated "ground truth." With a clinically validated reference, advanced multimodal models such as GPT-4.1 demonstrate a powerful capability to accurately summarize complex oncology reports, highlighting the necessity of co-developing AI models and their evaluation frameworks.