Abstract
This study critically evaluates the performance of Generative Pre-trained Transformer (GPT)-4-based large language models (LLMs) for extracting imaging findings from oncology records, with a primary focus on quantifying the impact of reference data quality on measured performance.
A two-phase study was conducted on 40 oncology medical records. In Phase 1, model outputs were compared against existing, uncurated reference summaries. In Phase 2, outputs for a 20-record subset were re-evaluated against a new “gold standard” of expert-curated, standardized summaries created by a board-certified radiologist. We systematically tested two model versions (text-only GPT-4.0 vs. multimodal GPT-4.1), two prompt designs, two input modalities (text vs. image), and two document scopes. Performance was assessed using lexical metrics (BLEU, ROUGE, METEOR) and a semantic alignment metric (Kullback–Leibler [KL] Divergence).
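The lexical and semantic metrics named above can be illustrated with a minimal sketch. The snippet below is not the study's evaluation pipeline; it shows one common formulation of ROUGE-1 F1 (unigram overlap) and a smoothed unigram KL divergence, with whitespace tokenization and the smoothing constant chosen here purely for illustration.

```python
from collections import Counter
import math


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def kl_divergence(candidate: str, reference: str, eps: float = 1e-9) -> float:
    """KL(reference || candidate) over smoothed unigram distributions."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    vocab = set(cand) | set(ref)
    n_c, n_r = sum(cand.values()), sum(ref.values())
    kl = 0.0
    for w in vocab:
        p = (ref[w] + eps) / (n_r + eps * len(vocab))
        q = (cand[w] + eps) / (n_c + eps * len(vocab))
        kl += p * math.log(p / q)
    return kl
```

Under this formulation, a model summary sharing two of three unigrams with the reference scores ROUGE-1 F1 ≈ 0.67, and KL divergence approaches zero as the two token distributions converge.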
A profound performance disparity was observed between phases. Phase 1 evaluation against uncurated references yielded modest scores (e.g., max ROUGE-1 ≈ 0.45, BLEU ≈ 0.15) and high semantic divergence (KL > 7.7). In contrast, Phase 2 evaluation against the gold-standard references resulted in substantial improvements across all configurations. The top-performing configuration—multimodal GPT-4.1 using image-based input on the full document—achieved a ROUGE-1 of 0.57, BLEU of 0.25, and a significantly lower KL Divergence of 5.96, closely approaching the expert standard.
The quality and consistency of the reference standard are the most critical drivers of measured LLM performance in clinical information extraction tasks. Standard NLP metrics can be misleading when applied to uncurated "ground truth." With a clinically validated reference, advanced multimodal models such as GPT-4.1 demonstrate a powerful capability to accurately summarize complex oncology reports, highlighting the necessity of co-developing AI models and their evaluation frameworks.