Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis

Abstract

Study Design

Diagnostic accuracy study.

Objective

Prior evaluations of frontier models as radiology decision-support tools relied on 2-dimensional images or text reports; their ability to interpret volumetric data remains unclear. This study assessed Google Gemini 3 Pro for grading lumbar spinal canal stenosis on video lumbar magnetic resonance imaging (MRI) and evaluated diagnostic accuracy, agreement with neuroradiologist consensus, and the effect of localizer-assisted input.

Methods

The Radiological Society of North America (RSNA) 2024 Lumbar Spine Degenerative Classification Dataset, with American Society of Neuroradiology (ASNR) consensus labels, served as a reference benchmark; interobserver agreement among contributing readers was not reported. 100 examinations yielded 500 disc-level observations (371 normal/mild, 74 moderate, 55 severe), demonstrating marked class imbalance. Native imaging series were converted into synchronized video montages. Gemini 3 Pro generated one grade per disc level with and without localizer overlays. Primary outcome was linearly weighted kappa (κw); secondary outcomes included class-wise performance, severe-case error patterns, and overall accuracy.

Results

Without localizer, overall accuracy was 75.6% (378/500) with fair agreement (κw = 0.39). Severe stenosis sensitivity was 41.8%; 43.6% of severe cases were downgraded to normal/mild, and 58.2% to non-severe. With localizer overlays, accuracy was 73.2% (366/500) with κw = 0.32, and severe sensitivity decreased to 30.9%; severe-to-normal/mild misses increased to 52.7%. Differences were not significant.

Conclusions

Gemini 3 Pro showed fair agreement with the neuroradiologist consensus benchmark, but apparent overall accuracy was inflated by the majority normal/mild class and masked clinically unacceptable under-detection of severe stenosis. Localizer-assisted input did not improve performance.

Keywords

multimodal vision frontier models lumbar spinal canal stenosis MRI cine video diagnostic accuracy inter-rater reliability

Introduction

The medical image interpretation performance of current multimodal vision frontier models (M-VFMs), which are state-of-the-art large language models with integrated visual processing capabilities, is heavily biased toward text rather than true image-based analysis.^1–4 Khazanchi et al and Ziegeler et al demonstrated that M-VFMs are effective at simplifying language or extracting structured findings in lumbar spine images, but frequently omit or misrepresent clinically essential details.^5,6 These models tend to produce clinically aligned conclusions when they report incongruent image findings.

Importantly, M-VFMs natively accept Digital Imaging and Communications in Medicine (DICOM) files, thereby losing modality-specific metadata, such as sequence parameters and slice spacing.^3,7 Lee et al described a technique for converting DICOM stacks into video-like formats that allow multimodal models to treat them as images.⁸ However, this conversion sacrifices radiologic fidelity, resulting in irreversible loss of quantitative radiographic features, including Hounsfield units, radiomic signatures, and intensity data.^9,10

In late 2025, Gemini 3 Pro (Google DeepMind, London, UK) was released as a general-purpose M-VFM capable of interpreting medical images and achieving strong performance on public benchmarks, such as the Visual Question Answering in Radiology benchmark.^11,12 On Radiology’s Last Exam (RadLE v1), a benchmark requiring free-text diagnosis of expert-curated static radiologic images, Gemini 3 Pro outperformed radiology trainees (51% vs 45%) but not board-certified radiologists (∼83%).¹³ Despite its superior performance relative to other commercial models, the medical literature has not yet evaluated its radiographic image interpretation capabilities. Prior studies have demonstrated that accurate video-based radiologic data interpretation is achievable with deep learning (DL) models; however, these models differ from M-VFMs, as they require training data and validation pipelines, rendering them clinically impractical.^14,15 Whether a generalist multimodal model can safely identify severe stenosis is clinically important because high-grade canal narrowing can influence specialist referral and surgical decision-making when aligned with symptoms. Gemini 3 Pro was selected a priori as the index model because it was a commercially accessible frontier M-VFM that was operationally compatible with our full-video lumbar MRI workflow. This pragmatic choice also aligned with the broader emergence of video-capable multimodal systems for radiology and 3-dimensional medical image interpretation.^7,8

In our pilot workflow, we provide the first systematic evaluation of the general-purpose multimodal foundation model Gemini 3 Pro, using full-video lumbar spine magnetic resonance imaging (MRI) for stenosis grading that requires spatial orientation across many frames, compared with expert neuroradiologist consensus.

Methods

Study Design and Objective

This study evaluates the ability of Gemini 3 Pro to interpret 3-dimensional video-based MRI inputs for grading lumbar spinal stenosis and to determine whether the multimodal vision frontier model could achieve diagnostic agreement with a consensus panel of 60 volunteer neuroradiologists using an ordinal three-tier scale (normal/mild, moderate, severe) across five lumbar disc levels (L1/2-L5/S1). Model performance was assessed by comparing it with expert consensus to establish benchmarking accuracy metrics. Because case-level interobserver agreement for contributing readers was not reported, this consensus was treated as a pragmatic benchmark rather than a formal gold standard.

Data Source and Reference Standard

Reference standard labels were sourced from the 2024 Radiological Society of North America (RSNA) Lumbar Spine Degenerative Classification Dataset. Expert consensus grading was conducted under the oversight of the American Society of Neuroradiology (ASNR). It is a collection of over 2600 MRI scans of the lumbar spine annotated by approximately 60 volunteer radiologists recruited by RSNA, the American Society for Spine Radiology, and ASNR to identify the location and severity of five degenerative conditions across the five intervertebral disc levels (L1/L2, L2/L3, L3/L4, L4/L5, and L5/S1). The imaging data, including over 8500 image series (sagittal T2, axial T2, and sagittal T1), were provided by 12 institutions worldwide. Initially assembled in 2024 for the RSNA Lumbar Spine Degenerative Classification artificial intelligence challenge hosted on the Kaggle platform,¹⁶ it represents the largest publicly available collection of its kind. Additional information about the dataset and how to use it is provided in the data resource publication listed below, as well as on the Kaggle competition website, which also offers access to models developed during the competition. Accordingly, the RSNA labels were used as a pragmatic reference benchmark for this study. However, the dataset documentation does not report interobserver agreement among contributing readers, so the benchmark reflects curated consensus labels rather than a validated gold standard.

Central spinal canal stenosis was graded using the RSNA-standardized three-level ordinal scale: normal/mild (preserved thecal sac with visible surrounding cerebrospinal fluid [CSF] and cauda equina rootlets), moderate (compressed/narrowed thecal sac with clumped rootlets but residual visible CSF signal), or severe (slit-like/barely discernible thecal sac with obliteration of the CSF space and rootlets compressed into a tight bundle).¹⁷

Sample Size and Power Considerations

The primary endpoint was agreement between Gemini 3 Pro and the neuroradiologist consensus, quantified by linearly weighted Cohen’s kappa (κw) for a three-level ordinal spinal canal stenosis grade using linear weights (1 = exact agreement, 0.5 = adjacent disagreement, 0 = extreme disagreement). Power calculations targeted a one-sided test of H0: κw ≤ 0.40 vs H1: kw = 0.60 at α = 0.05 with 80% power. Expected category prevalences for planning were based on the anticipated study cohort. The asymptotic variance of κw was obtained using a delta-method approximation. Evaluating the variance conservatively at κw = 0.40 gave a requirement of approximately 162 independent disc-level observations. Each examination contributes five-disc levels, clustering observations. Hence, the disc-level requirement was inflated by a design effect. Assuming ρ = 0.25 (sensitivity analysis ρ = 0.10-0.30), this corresponds to approximately 324-disc levels (range 227-357)/65 examinations (range 46-72). Therefore, the protocol targeted 100 examinations (500 disc levels), providing>80% power under clustering (0.926-0.962 at κw = 0.60 with ρ = 0.25, depending on whether the variance is evaluated at κw = 0.40 or 0.60).

MRI Inputs and Series Identification

For each examination, sagittal and axial T2-weighted DICOM series were used. A metadata table with the dataset was parsed in Python to map each study identifier to the corresponding sagittal and axial T2 series identifiers based on the series description fields. Only studies with both sequences were included in video generation.

Video Generation, Model Inference, and Structured Prompting

A step-by-step outline of video generation is provided in Supplement A. All video montages were analyzed with Gemini 3 Pro through an application programming interface (API)-based multimodal inference workflow. A single orchestration script submits a standardized prompt (Supplement B) with the uploaded video. The prompt instructed the model to behave as a neuroradiologist, review the sagittal/axial montages, and assign a stenosis grade to each of the five-disc motion segments (L1/2 through L5/S1) using the RSNA-defined three-tier criteria. Outputs were constrained to a machine-readable JavaScript Object Notation (JSON) object that lists disc motion segments and grades to support automated scoring. After inference, the uploaded file was deleted from the remote file store during the pipeline cleanup routine. The pipeline recorded explicit failure states if an API request failed or if a response could not be parsed as valid JSON. An exemplary video montage without the segment localizer is provided in Video 1, while the corresponding montage with the integrated localizer overlay is shown in Video 2. Both videos were processed within the same multimodal inference pipeline described above, ensuring that the presence or absence of the localizer constituted the only difference between the analyzed inputs.

Outcomes and Statistical Analysis

The primary endpoint was inter-rater agreement between Gemini 3 Pro and the reference standard, measured using linearly weighted Cohen’s kappa (κw). Secondary endpoints emphasized error characteristics for severe stenosis, including sensitivity for severe disease and misclassification of diagnoses. Overall accuracy was computed as the proportion of correct three-class predictions for disc motion segments. Given the imbalanced class distribution, overall accuracy was interpreted alongside class-wise operating characteristics and severe-case error patterns rather than as a standalone measure of clinical utility. Class-wise operating characteristics (sensitivity, specificity, and positive predictive value) were calculated in a one-vs-rest framework. Diagnostic utility metrics, including likelihood ratios, were derived from the same binary decomposition.

Confidence intervals (95% CIs) for proportions were computed using binomial methods and are reported in the text, tables, and figure annotations where applicable. Paired comparisons between standard montage and localizer-overlay workflows were conducted at the disc motion segments using paired categorical tests appropriate for matched predictions, and two-sided P-values <0.05 were considered statistically significant.

Software and Reproducibility

All preprocessing, video generation, API model orchestration, and data parsing were performed in Python (v3.14.4). Specifically, DICOM handling and pixel extraction were performed using a DICOM library; numerical operations employed array-based computation; montage rendering relied on image-processing utilities; and video creation used a video-writing backend. Parsing tasks included mapping study identifiers from dataset metadata and structuring the model's JSON outputs. Statistical analyses and figure generation were performed in R (v4.6.0) via the RStudio integrated development environment. Processing parameters were fixed a priori and applied uniformly across examinations for reproducibility.

Results

A total of 100 lumbar stenosis cases (500 disc motion segments) were evaluated against neuroradiologist vignettes as reference standard labels (normal/mild: 371 [74.2%]; moderate: 74 [14.8%]; severe: 55 [11.0%]). This distribution was skewed toward normal/mild cases. Using the standard workflow without a localizer, the model assigned 395 normal/mild, 45 moderate, and 60 severe predictions. The overall results are summarized in Supplement C.

Agreement With Case Reference Labels

The diagnostic agreement is summarized in the confusion matrix (Figure 1), which displays reference-standard grades on the x-axis and Gemini 3 Pro predictions on the y-axis with cell-level case counts. Correct classifications were concentrated in the normal/mild category (332/371, 89.5%), whereas performance was lower for moderate (23/74, 31.1%) and severe (23/55, 41.8%).

Figure 1.

Confusion matrix of reference-standard vs Gemini 3 Pro stenosis grades (without localizer).

Misclassification patterns showed a predominance of downgrading. For moderate cases, the most common error was under-calling as normal/mild (39/74, 52.7%), with fewer over-calls to severe (12/74, 16.2%). In severe cases, 24/55 (43.6%) were classified as normal/mild and 8/55 (14.5%) as moderate, indicating that 32/55 (58.2%) were classified as non-severe.

Clinical Error Characteristics and Accuracy

Key error characteristics are shown in Figure 2 and Table 1, with case counts and 95% confidence intervals annotated. Sensitivity for detecting severe stenosis was 41.8% (95% CI, 28.7-55.9%). The critical miss rate for severe cases downgraded to normal/mild was 43.6% (95% CI, 30.3-57.7%), and the overall non-severe miss rate for severe disease was 58.2% (32/55). Although overall multiclass accuracy was 75.6% (95% CI, 71.6-79.3%), this aggregate value was driven by the dominant normal/mild class and does not indicate clinically acceptable severe-case detection.

Figure 2.

Severe stenosis safety profile (n = 55 reference severe cases).

Table 1.

Clinical Safety and Accuracy Metrics of the Gemini 3 Pro Model (Without Localizer)

Clinical Metric	Observed Result (%)	95% CI
Sensitivity (“severe” cases)	41.8	28.7-55.9%
Severe cases called “normal/Mild”	43.6	30.3-57.7%
Accuracy	75.6	71.6-79.3%

Summary of key clinical performance indicators, focusing on the model’s ability to detect severe pathology and the rate of severe cases called normal/mild, alongside overall model accuracy. Because the dataset was imbalanced toward normal/mild disease, overall accuracy should be interpreted together with the severity-specific metrics shown above. Data are presented with 95% confidence intervals (95% CIs).

Abbreviations: CI, Confidence Interval.

Performance Stratified by Severity Grade

Severity-stratified performance metrics are summarized in Figure 3, which annotates case counts (normal/mild n = 371, moderate n = 74, severe n = 55) and 95% confidence intervals. In one-vs-rest analyses, sensitivity was 89.5% (95% CI, 85.9-92.4%) for normal/mild, 31.1% (95% CI, 20.8-42.9%) for moderate, and 41.8% (95% CI, 28.7-55.9%) for severe stenosis. The positive predictive value for severe predictions was 38.3% (95% CI, 26.1-51.8%).

Figure 3.

One-vs-rest diagnostic performance of Gemini 3 Pro by stenosis severity grade.

Reliability, Heterogeneity, and Direction of Errors

Statistical reliability metrics are provided in Table 2. A positive likelihood ratio (LR+) of 5.03 provided greater confidence when the model predicted severe, while a negative likelihood ratio (LR-) of 0.63 offered limited reassurance for non-severe outcomes. McNemar’s test demonstrated a significant asymmetry in discordant classifications (P = 0.006), consistent with a tendency to undergrade.

Table 2.

Statistical Reliability Metrics of the Gemini 3 Pro Model (Without Localizer)

Statistical Test	Result	Clinical Translation
Positive likelihood ratio	5.03	Confidence in “severe”
Negative likelihood ratio	0.63	Safety of “normal/Mild”
McNemar’s test	P = 0.006	Errors are not random
Direction of error	Tendency to under-grade	Systematic tendency

Assessment of the model’s diagnostic utility using Likelihood Ratios and an evaluation of systematic bias using McNemar’s test. The “Direction of error” indicates the model’s general tendency when disagreements occur (eg, under-grading vs over-grading).

Abbreviations: LR+, Positive Likelihood Ratio; LR-, Negative Likelihood Ratio; P, P-value.

Pairwise heterogeneity analyses (Table 3) demonstrated that the model differentiated severe vs normal/mild with high accuracy (83.3% [95% CI, 79.4-86.7]) and moderate vs normal/mild with high accuracy (79.8% [95% CI, 75.7-83.4]). In contrast, discrimination between severe vs moderate was poorer (35.7% [95% CI, 27.4-44.6]), identifying diagnostic confusion.

Table 3.

Statistical Heterogeneity and Pairwise Diagnostic Accuracy of the Gemini 3 Pro Model (Without Localizer)

Comparison Pair	Differentiation Accuracy (95%CI)	Sensitivity for Higher Grade (95% CI)
“Severe” vs “moderate”	35.7% (27.4–44.6)	41.8% (28.7–55.9)
“Severe” vs “normal/Mild”	83.3% (79.4–86.7)	41.8% (28.7–55.9)
“Moderate” vs “normal/Mild”	79.8% (75.7–83.4)	31.1% (20.8–42.9)

Detailed breakdown of the model’s accuracy in distinguishing between specific pairs of severity. This table highlights specific areas of diagnostic uncertainty, particularly in distinguishing between “severe” and “moderate” stenosis.

Abbreviations: CI, Confidence Interval.

Bias testing by pair (Table 4) localized systematic error primarily to the moderate vs normal/mild boundary, where the model showed a significant tendency to under-grade (P = 0.001). For comparisons involving severe (severe vs moderate; severe vs normal/mild), no statistically significant directional bias was observed (P = 0.502 and P = 1.000, respectively).

Table 4.

Pairwise Bias Analysis and Direction of Diagnostic Errors of the Gemini 3 Pro Model (Without Localizer)

Comparison Pair	Direction of Error	McNemar P-value	Interpretation
“Severe” vs “moderate”	Tendency to over-grade	0.502	Random errors
“Severe” vs “normal/Mild”	Tendency to over-grade	1.000	Random errors
“Moderate” vs “normal/Mild”	Tendency to under-grade	0.001	Significant bias

Analysis of systematic errors for specific diagnostic pairs. McNemar’s test determines if errors are random or biased. The “Direction of Error” indicates whether the model tends to under- or over-estimate severity in discordant cases.

Abbreviations: P, P-value.

Anatomical Performance by Lumbar Level

Diagnostic accuracy varied by spinal level (Figure 4), with 95% confidence intervals annotated for each level (n = 100 per level), ranging from 51.0% (95% CI, 40.8-61.1%) at L4/5 to 95.0% (95% CI, 88.7-98.4%) at L5/S1. Intermediate accuracies were 65.0% (95% CI, 54.8-74.3%) at L3/4, 78.0% (95% CI, 68.6-85.7%) at L2/3, and 89.0% (95% CI, 81.2-94.4%) at L1/2, indicating level-dependent performance.

Figure 4.

Level-specific diagnostic accuracy across lumbar disc motion segments.

Head-To-Head Comparison: Standard Versus Localizer-assisted Workflow

In direct workflow comparison (Table 5), adding a localizer did not improve performance. Overall accuracy was 75.6% (95% CI, 71.6-79.3%) without the localizer vs 73.2% (95% CI, 69.1-77.0%) with the localizer (P = 0.303). Sensitivity for severe stenosis decreased from 41.8% (95% CI, 28.7-55.9%) to 30.9% (95% CI, 19.1-44.8%) (P = 0.286), and severe-to-normal/mild critical misses increased from 43.6% (24/55) to 52.7% (29/55).

Table 5.

Head-To-Head Comparison of Standard vs Localizer-Assisted Gemini 3 Pro Workflows

Metric	Gemini 3 Pro (Without Localizer)	Gemini 3 Pro (With Localizer)	Comparison (P-Value)
Accuracy (95% CI)	75.6% (71.6-79.3)	73.2% (69.1-77.0)	0.303
Sensitivity for “severe” (95% CI)	41.8% (28.7-55.9)	30.9% (19.1-44.8)	0.286

Comparative analysis of the standard Gemini 3 Pro model without localizer against a workflow utilizing a localizer for the specific vertebral level. P-values indicate the statistical significance of differences in accuracy and sensitivity for “severe” stenosis between the two methods.

Abbreviations: CI, Confidence Interval; P, P-value.

Localizer-Assisted MRI Analysis

When workflow incorporated a vertebral-level localizer, overall multiclass accuracy was 73.2% (95% CI, 69.1-77.0), and sensitivity for severe stenosis decreased to 30.9% (95% CI, 19.1-44.8). Misclassification rate increased to 52.7% (95% CI, 38.8-66.3), and the overall non-severe miss rate for severe disease reached 69.1% (38/55). In the corresponding classification breakdown, correct predictions were highest for normal/mild (330/371, 88.9%), with lower performance for moderate (19/74, 25.7%) and severe (17/55, 30.9%). Most severe cases were downgraded (29/55 to normal/mild and 9/55 to moderate). Severity-stratified operating characteristics were normal/mild sensitivity 0.89, specificity 0.47, PPV 0.83; moderate sensitivity 0.26, specificity 0.95, PPV 0.45; and severe sensitivity 0.31, specificity 0.90, PPV 0.28. Reliability metrics showed LR + 3.20 and LR- 0.76, with McNemar testing indicating non-random discordance (P = 0.003). Pairwise differentiation remained poorest for severe vs moderate (27.9% [95% CI, 20.4-36.5]), while separation of severe vs normal/mild (81.5% [95% CI, 77.4-85.0]) and moderate vs normal/mild (78.4% [95% CI, 74.3-82.2]) was higher. Directional bias under-grading remained significant at the moderate vs normal/mild boundary (P = 0.001), whereas comparisons involving severe did not show significant directional bias (P = 0.230 and P = 0.894). Level-wise accuracy ranged from 56% at L3/4 to 92% at L5/S1.

Parity With the Neuroradiologist

To provide a “patient-level” measure, each lumbar MRI was treated as a single case comprising five-disc levels (L1/2 through L5/S1). Concordance was defined as the number of disc levels matching the neuroradiologist’s reference grade (range 0 to 5).

In the standard workflow, the model agreed with the neuroradiologist reference on average at 3.78/5-disc levels per examination (75.6% disc-level accuracy). Complete parity was observed in 34 of 100 cases, while 31 had agreement at four of five levels; 17 matched at three levels, 15 matched at two levels, and 3 matched at only one level. In clinical terms, this corresponds to complete parity in roughly 3-4 out of 10 patients and to agreement at four or more levels in approximately 6-7 out of 10 patients.

With localizer overlays, the mean parity was 3.66/5-disc levels per examination (73.2% disc-level accuracy). Complete parity occurred in 29 of 100 cases, while 29 matched at four of five levels; 23 matched at three levels, 17 matched at two levels, and 2 matched at one level. In other words, with localizer assistance, full agreement occurred in about 3 of 10 patients, and agreement at 4 or more levels occurred in about 6 of 10 patients.

Discussion

Model Performance Characteristics

Gemini 3 Pro demonstrated fair agreement with expert neuroradiologists (κw = 0.39) in grading lumbar spinal canal stenosis, with the highest sensitivity for correctly identifying normal/mild disease (89.5%). Performance declined substantially in the moderate (31.1%) and severe (41.8%) stenosis categories. Because 371/500 disc levels (74.2%) belonged to the normal/mild class, the headline accuracy of 75.6% was heavily weighted toward the majority class and should not be interpreted as uniform performance across clinically relevant severities. We defined normal/mild stenosis as preserved thecal sac morphology and CSF surrounding freely mobile nerve roots, features that are readily apparent on imaging. In contrast, the moderate category is characterized by nuanced morphologic findings, such as nerve root aggregation with partial preservation of cerebrospinal fluid signal. These features are more subtle and likely contribute to reduced sensitivity. Prior studies demonstrate that task-specific DL architectures, including grading systems based on convolutional neural networks (CNNs), anatomically guided detection pipelines, and You Only Look Once (YOLO)-style localization frameworks, can automate MRI evaluation for lumbar spinal stenosis, achieving multiclass severity grading with kappa values ranging from 0.54 to 0.96.^18–20 As summarized in Table 6, these specialist systems outperform κw of 0.39 observed for Gemini 3 Pro because they are optimized for a constrained task with curated labels, explicit localization, and task-specific training. The trade-off is that such models are narrower in scope and require dedicated development pipelines, whereas a generalist M-VFM offers flexible zero- and few-shot inference across tasks but poorer calibration for clinically critical minority classes.

Table 6.

Published Deep-Learning Studies That Grade Lumbar Spinal Stenosis on Magnetic Resonance Imaging

Author (Year)	Architecture	Dataset & Input	How the Localizer Works	How the Classifier Uses the Localizer	Performance	Benchmark and Evidence Localization Helped
Jamaludin et al²¹ (2017)	SpineNet (multi-task CNN)	2009 MRI scans; sagittal T2	Automatic disc-level detector	Each disc level is cropped and fed to the classifier; the full scan is never seen	∼95% accuracy for central canal stenosis; 70-95% across other tasks	Matched an expert radiologist; the whole pipeline is built around the localizer
Hallinan et al²⁰ (2021)	Two-stage CNN (detector + classifier)	446 internal +100 external scans; axial T2, sagittal T1	Detector draws a box around each region of interest	Classifier sees only the cropped region – not the box itself	κ 0.89-0.96 internal; 0.95-0.96 external	Matched two subspecialty radiologists; two-stage design chosen specifically to focus the classifier on relevant anatomy
Bharadwaj et al¹⁸ (2023)	Segmentation + CNN, with a parallel decision tree	200 scans; axial T2	Segments the dural sac and disc; rules then draw boxes for facet and foramen	CNN branch: Cropped region only. Decision-tree branch: Uses the localizer’s measurements (eg, dural-sac ratios), not pixels	Multi-class κ 0.54 (CNN) vs 0.80 (decision tree); AUROC 0.92 (foraminal), 0.93 (facet)	Direct in-paper comparison: Using the localizer’s measurements instead of pixels added 0.26 to κ; inter-reader κ 0.74-0.86
Su et al²² (2022)	Single-stage multi-task CNN	Large multi-center axial T2 dataset	Disc-level detector sits before the classifier heads	Each cropped disc-level patch is fed to three heads (disc herniation, foraminal, canal)	79.7-87.0% accuracy (internal + external) across three stenosis endpoints	The multi-task head only works well once the detector restricts input to the right disc level
Won et al²³ (2024)	Three-stage CNN (detector + two classifiers)	Multi-level lumbar axial scans; two expert raters	Detector learns expert-drawn canal boxes; one box per slice	Classifier sees only the localized canal patch; surrounding tissue is cropped out	Classifier–rater agreement 91.5-92.7% (F1 75.7-78.4%)	Model–rater agreement is higher than the two raters agreed with each other (89.2%)
Tumko et al²⁴ (2024)	Three-stage CNN (segmentation, detection, grading)	1635 training +150 external scans graded by 7 radiologists; sagittal + axial T2	Stage 1 segments anatomy; stage 2 detects the canal, recess, and foramen	Only the cropped, level-matched region enters the grading head; sagittal and axial are fused	AUROC 0.96/0.91/0.95 (canal/recess/foramen); multi-class κ 0.64 (model) vs 0.62 (radiologists)	On par with a 7-radiologist panel; authors credit the dedicated detection stage
Shahzadi et al²⁵ (2023)	CNN with segmentation front-end	1545 axial scans (L3–L5); public dataset	Segments four structures (disc, posterior element, thecal sac, and the AP diameter)	Tests two setups: Wider crop (all four structures) vs tighter crop (only the most relevant one)	Wider crop 97.01% vs tighter crop 97.71% accuracy	Direct ablation – tighter cropping improved accuracy and reduced per-class variance
van der graaf et al²⁶ (2025)	3-Dimensional segmentation CNN + random-forest classifier	186 patients; 683 disc levels labeled by 4 radiologists; sagittal T2 only	3-Dimensional segmentation of the canal and discs yields measurements at every disc level	Classifier uses the measurements (canal area, diameter), not raw pixels	Multi-class weighted κ 0.86; binary κ 0.94	Matched radiologists who read sagittal + axial together (κ 0.73-0.85); sagittal-only inputs only work because of the segmentation step
RSNA 2024 challenge pipeline (2024, preprint)²⁷	YOLO v8 detector + CNN grader	RSNA 2024 multi-site dataset; sagittal T1/T2 + axial T2	YOLO detector trained on expert coordinates	Detected region is cropped and passed to the grader; the detection box is not drawn on the image	89% (canal), 81% (foraminal), 74% (subarticular) accuracy; F1 0.81-0.91	No direct radiologist comparison; YOLO is used as pre-processing, as in the CNN pipelines above
Lee et al ²⁸ (2025)	CNN vs transformer (head-to-head)	564 scans (464 train/100 test) + 100 external; axial T2, sagittal T1	Shared detector for both architectures	Detected region is cropped and fed to the grader; detection output is not drawn on the image	Internal κ: Canal 0.99/0.99; recess 0.98/0.94; foramen 0.98/0.95 (CNN/transformer). External κ: 0.97-0.99 (CNN), 0.90-0.97 (transformer)	CNN beat – and transformer matched or beat – 2 radiologists, 2 residents, 2 orthopedists, and 2 orthopedic trainees

Abbreviations: AUROC, area under the ROC curve (a measure of how well the model separates classes); CNN, convolutional neural network (the standard type of image-recognition model); F1, a combined score of precision and recall; κ, Cohen’s kappa (a measure of agreement between two raters, where 1.0 is perfect and 0 is chance); MRI, magnetic resonance imaging; ROC, receiver-operating-characteristic.

Note: The 10 studies cover a range of model designs and a κ range of 0.54-0.99. In every one of them, the localizer is used to crop or measure the image before the classifier sees it – the localizer’s box is never drawn onto the image itself. This contrasts with the overlay condition we tested for Gemini 3 Pro in the main text (κw 0.39), in which the box was drawn on top of the scan. Only two studies directly test with vs without localization: Shahzadi et al (tighter vs wider crops) and Bharadwaj et al (measurements vs raw pixels, which added 0.26 to κ). For the other studies, the evidence comes from how the pipelines were designed rather than from a direct comparison. Numbers are taken from each paper as reported – κ and accuracy scores measure different things and are not directly comparable across studies.

Importantly, this study did not “train” Gemini 3 Pro; it evaluated fixed model weights during inference (prompted prediction). In machine learning, training refers to iteratively updating a model’s parameters using data to minimize a defined loss. Inputs are paired with targets – either expert labels (supervised learning) or self-supervised signals such as next-token prediction and image–text alignment – after curation and preprocessing (eg, de-identification, quality control, normalization/harmonization, and explicit train/validation/test splits). Optimization typically uses gradient-based updates over many iterations, and model/hyperparameter choices are guided by validation performance rather than the test set.^29–31 Training design also emphasizes generalization: reducing overfitting to the development set via regularization/augmentation and confirming performance with external validation to detect dataset shift. When a general model underperforms on a clinical task, transfer learning (fine-tuning) further updates pretrained weights using high-quality, task-specific labels to improve clinically salient error modes (eg, severe-case sensitivity), while explicitly monitoring subgroup bias and privacy risks in patient data. Reproducible training requires transparent reporting of data provenance, preprocessing, model versioning, and evaluation protocols.^32–36

Error Patterns

Over half of severe cases (58.2%) were misclassified as non-severe, and 43.6% were downgraded to normal/mild, highlighting a clinically unacceptable failure mode in high-grade stenosis classification. In practice, severe central canal stenosis is often the imaging finding most likely to prompt closer clinical correlation, specialist referral, or surgical discussion when concordant with neurogenic claudication or neurologic deficits. A model that systematically undercalls this group could therefore generate false reassurance if used without expert oversight. Given that severe stenosis is defined as complete CSF obliteration with tightly compressed nerve root bundles across multiple contiguous slices, slice-level variation and volumetric context can be misinterpreted when MRIs are converted from three-dimensional imaging into video format.⁸ Video conversion may reduce radiologic fidelity through resampling, grayscale compression, and loss of native DICOM metadata, while a general-purpose multimodal model may further rely on global visual heuristics rather than disciplined slice-by-slice morphologic comparison. Non-task-specific models, such as Gemini 3 Pro, are not trained to recognize the distinct radiographic features required for stenosis grading; their performance may therefore be limited by both input degradation and inherent model limitations in subtle morphologic interpretation. Despite a specificity of 91.7% for severe disease, a negative likelihood ratio of 0.63 indicates that normal/mild and moderate predictions provide limited reassurance in excluding severe stenosis. Positive predictive value was lowest for severe predictions (38.3%), showing that the model was unreliable in detecting severe stenosis. Pairwise analysis reveals that the model can strongly discriminate between normal/mild and severe stenosis but cannot accurately differentiate between moderate and severe stenosis. Model accuracy across lumbar levels ranged from 51.0% at L4/L5 to 95.0% at L5/S1, with intermediate performance observed across levels. The lowest accuracy at L4/L5 is notable, as this transitional segment has a high prevalence of degenerative pathology due to substantial axial load and mobility in the sagittal plane.³⁷ Given this level’s greater anatomic and pathologic variability, interpretation is more complex and may contribute to reduced model performance compared with other lumbar levels.

Outcomes of Localizer-Assisted Workflow

The localizer is cropped to the target level and serves as a visual guide on sagittal MRI, helping the M-VFM spatially orient the corresponding axial slice during video playback. In specialist DL pipelines, localizers can improve scan-plane isolation and level identification when they are incorporated as trained detection or preprocessing modules (Table 6) rather than displayed as inference-time overlays.^38,39 We hypothesized that this approach would improve identification of the correct spinal level and the location of stenosis. However, in our inference-only workflow, it reduced accuracy, agreement, and sensitivity for severe stenosis. Several mechanisms may explain this counterintuitive result. First, the overlay introduced extra arrows, boxes, and cropped context that may have distracted the model from subtle canal morphology. Second, any mismatch between the highlighted sagittal level and the axial frame sequence may have amplified spatial confusion rather than reduced it. Third, unlike the CNN- and YOLO-based localization workflows summarized in Table 6, our model was not trained to use the localizer as a structured cue; it encountered the overlay as additional out-of-distribution visual content at inference time. This distinction is methodologically important, as it introduces a degree of non-equivalence in the comparison: CNN- and YOLO-based approaches remain fundamentally grounded in 2-dimensional image analysis rather than true volumetric or temporally integrated interpretation. Although transformer-based models may be subject to a similar limitation, this cannot be conclusively determined without direct examination of the model architecture and learned weights. In analogous contexts, even minor extraneous inputs have been shown to mislead multimodal models.¹ Thus, the discrepancy with prior DL literature likely reflects the difference between task-specific, trained localization modules and a generalist M-VFM confronted with a visually complex overlay without retraining or textual grounding.^18,40

Future Research and Limitations

Despite fair agreement and strong performance for normal/mild disease, Gemini 3 Pro under-detected severe stenosis, limiting its reliability for identifying clinically significant lumbar central canal stenosis. Given the multifactorial nature of spinal pathology, future work should evaluate model performance across additional lumbar imaging features, including foraminal stenosis, facet arthropathy, and degenerative disc disease, and extend the assessment to the cervical and thoracic spine to evaluate global stenosis and myelopathy. Integrating imaging with expert radiologic interpretation and relevant electronic health record context may improve downstream applicability.

Beyond reporting the diagnostic accuracy of a single model, this study also provides a time-stamped benchmark of the current status quo for general-purpose multimodal AI in spine MRI interpretation. This is relevant not only scientifically, but also clinically and societally, because members of the public are increasingly using large language model tools for health information, while radiology workflows are simultaneously moving toward patient-facing AI communication and simplification.^41–44 In that context, our findings help spine clinicians, professional societies, and patient organizations contextualize AI-generated impressions by showing that apparently acceptable aggregate accuracy can coexist with clinically important under-detection of severe disease. The present work should therefore be viewed not merely as a model-specific accuracy study, but as an early benchmark that can guide subsequent investigations into why these failure modes occur and how prompt design, multimodal workflow refinement, task-specific tuning, and prospective validation might improve future performance before unsupervised patient-facing deployment is considered.

This study has several limitations. First, only a single multimodal model was evaluated; performance may differ across other frontier multimodal models, radiology-tuned systems, future model versions, and workflows that use native DICOM or alternative prompting strategies. A formal head-to-head evaluation of Gemini against other frontier multimodal models under identical imaging, prompting, and parsing conditions remains an important next step.^7,8,45 Second, the study used one public dataset with pronounced class imbalance, limiting generalizability to populations with different disease prevalence, acquisition parameters, and care pathways. Third, video-based rather than native volumetric MRI was evaluated, and the absence of native volumetric data and radiology-specific pretraining likely limits tasks that require fine 3-dimensional and radiomic context. Fourth, the cohort size was limited to 100 scans; larger samples, supported by patient-level power analyses and external validation, would improve robustness. Finally, the dataset lacked reported interobserver agreement among contributing expert readers, so the reference standard should be viewed as a curated consensus benchmark rather than a definitive gold standard.

Conclusions

Gemini 3 Pro demonstrated fair agreement with neuroradiologist consensus, but its apparently reasonable overall accuracy was driven largely by correct normal/mild classifications in an imbalanced dataset. The clinically unacceptable under-detection of severe stenosis argues against unsupervised use for lumbar canal stenosis grading. Localizer-assisted inputs did not improve performance and worsened several severe-case safety metrics. These findings highlight current performance limitations and underscore the need for targeted optimization, better-characterized reference standards, and prospective, clinically grounded validation before wider deployment. While generalist models currently underperform relative to task-specific DL systems, their strong industry backing and widespread use make them a relevant target for further training in clinical radiographic interpretation.

Supplemental Material

Supplemental Material - Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis

Supplemental Material for Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis by Harry Gebhard, Ahmet Kartal, Noel F. Manalil, Lawrance K. Chung, Sohail R. Daulat, Chiungwen D. Cheng, Ayesha Akbar Waheed, Andia Shahzadi, Ibrahim Hussain, Roger Härtl, Galal A. Elsayed in Global Spine Journal

Supplemental Material

Supplemental Material - Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis

Supplemental Material

Supplemental Material - Benchmarking Multimodal Vision Frontier Models With Lumbar Spine MRIs for Grading Lumbar Spinal Stenosis

Supplemental Material

Footnotes

ORCID iD

Ahmet Kartal

Ethical Considerations

This project does not constitute human subjects research because it used only published, de-identified case information; therefore, Institutional Review Board (IRB) approval was not required.

Consent for Publication

Patient consent was not required because the study analyzed information abstracted from previously published, de-identified case reports, collected no identifiable data, and involved no contact with human subjects.

Author Contributions

Conceptualization: H.G., A.K., I.H., R.H., G.A.E.

Methodology: A.K., G.A.E.

Data curation: A.K., N.F.M., C.D.C.

Formal analysis and investigation: A.K.

Writing – original draft: H.G., A.K., N.F.M., L.K.C., S.R.D., C.D.C., A.A.W., A.S., I.H., R.H., G.A.E.

Writing – Review & Editing: H.G., A.K., N.F.M., L.K.C., S.R.D., C.D.C., A.A.W., A.S., I.H., R.H., G.A.E.

Supervision: H.G., I.H., R.H., G.A.E.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The RSNA 2024 Lumbar Spine Degenerative Classification Dataset is publicly available through RSNA/Kaggle in accordance with the dataset’s terms of use. The Python scripts used for DICOM preprocessing, MP4 montage generation (with and without localizer overlays), and inference orchestration, as well as the R scripts used for statistical analysis, are available from the authors upon reasonable request. Derived montage videos may be shared with qualified investigators, subject to the dataset’s data-use and redistribution requirements.*

Disclosure of Financial Relationships/Potential Conflicts of Interest

Roger Härtl reports consulting relationships with DePuy Synthes, Brainlab, and Aclarion and serves as an advisor to RealSpine. He also has financial relationships with Realists and OnPoint outside the submitted work.

Supplemental Material

Supplemental material for this article is available online.

References

Buckley

Diao

Rajpurkar

Rodman

Manrai

. Multimodal foundation models exploit text to make medical image predictions. arXiv e-prints. 2023;05591. doi:10.48550/arXiv.2311.05591. arXiv: 2311.

Sorich

Mangoni

Bacchi

Menz

Hopkins

. The triage and diagnostic accuracy of frontier large language models: updated comparison to physician performance. J Med Internet Res. 2024;26:e67409. doi:10.2196/67409. (In eng).

AlSaad

Abd-Alrazaq

Boughorbel

, et al. Multimodal large language models in health care: applications, challenges, and future outlook. J Med Internet Res. 2024;26:e59505. doi:10.2196/59505. (In eng).

Dai

Zhao

, et al. Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making. arXiv preprint arXiv:251213747. doi:10.48550/arXiv.2512.13747.

Khazanchi

Chen

Desai

, et al. Assessing the ability of large language models to simplify lumbar spine imaging reports into patient-facing text: a pilot study of GPT-4. Skelet Radiol. 2026;55(2):361-366. doi:10.1007/s00256-025-05027-9. (In eng).

Ziegeler

Kreutzinger

Tong

, et al. Information extraction from lumbar Spine MRI radiology reports using GPT4: accuracy and benchmarking against research-grade comprehensive scoring. Diagnostics (Basel). 2025;15(7):930. doi:10.3390/diagnostics15070930. In eng.

Nam

Kim

Kyung

, et al. Multimodal large language models in medical imaging: current state and future directions. Korean J Radiol. 2025;26(10):900-923. doi:10.3348/kjr.2025.0599. (In eng).

Lee

Zhou

Berzin

Sodickson

Rajpurkar

. Multimodal generative AI for interpreting 3D medical images and videos. npj Digit Med. 2025;8(1):273. doi:10.1038/s41746-025-01649-4. (In eng).

Larobina

. Thirty years of the DICOM standard. Tomography. 2023;9(5):1829-1838. doi:10.3390/tomography9050145. (In eng).

10.

Miranda-Viana

Fontenele

Nogueira-Reis

, et al. DICOM file format has better radiographic image quality than other file formats: an objective study. Braz Dent J. 2023;34(4):150-157. doi:10.1590/0103-6440202305499. (In eng).

11.

Doshi

. Gemini 3 Pro: The Frontier of Vision AI. The Keyword; 2025. Published. https://blog.google/technology/developers/gemini-3-pro-vision/. Accessed 27 December 2025.

12.

Lau

Gayen

Ben Abacha

Demner-Fushman

. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018;5:180251. doi:10.1038/sdata.2018.251. (In eng).

13.

Datta

Buchireddygari

Kaza

LVC

Karnwal

Bhatti

HBS

Singh

. Gemini 3.0 Pro Surpasses Radiology Trainees on Radiology's Last Exam (Radle). CRASH Lab. Published November 20, 2025. https://drdattaaiims.github.io/Gemini-3.0-Radiology-2025.html. Accessed 27 December 2025.

14.

Dai

Xiong

Zhu

, et al. Deep learning assessment of small renal masses at contrast-enhanced multiphase CT. Radiology. 2024;311(2):e232178. doi:10.1148/radiol.232178. (In eng).

15.

Wang

, et al. Predicting invasiveness of lung adenocarcinoma from chest CT with few-shot vision-language ternary classification model. npj Digit Med. 2025;9:60. doi:10.1038/s41746-025-02229-2. (In eng).

16.

Radiological Society of North America (RSNA) . RSNA 2024 Lumbar Spine Degenerative Classification Dataset. Kaggle. https://www.kaggle.com/competitions/rsna-2024-lumbar-spine-degenerative-classification. Accessed 17 March 2026.

17.

Richards

Flanders

Colak

, et al. 2025;The RSNA lumbar degenerative imaging Spine classification (LumbarDISC) dataset. arXiv preprint arXiv:2506.09162. doi:10.48550/arXiv.2506.09162.

18.

Bharadwaj

Christine

, et al. Deep learning for automated, interpretable classification of lumbar spinal stenosis and facet arthropathy from axial MRI. Eur Radiol. 2023;33(5):3435-3443. doi:10.1007/s00330-023-09483-6. (In eng).

19.

Wang

Zhang

, et al. A novel deep learning system for automated diagnosis and grading of lumbar spinal stenosis based on spine MRI: model development and validation. Neurosurg Focus. 2025;59(1):E6. doi:10.3171/2025.4.Focus24670. (In eng).

20.

Hallinan

Zhu

Yang

, et al. Deep learning model for automated detection and classification of central canal, lateral recess, and neural foraminal stenosis at Lumbar Spine MRI. Radiology. 2021;300(1):130-138. doi:10.1148/radiol.2021204289. (In eng).

21.

Jamaludin

Lootus

Kadir

, et al. ISSLS prize in bioengineering science 2017: automation of reading of radiological features from magnetic resonance images (MRIs) of the lumbar spine without human intervention is comparable with an expert radiologist. Eur Spine J. 2017;26(5):1374-1383. doi:10.1007/s00586-017-4956-3. (In eng).

22.

Liu

Yang

, et al. Automatic grading of disc herniation, central canal stenosis and nerve roots compression in lumbar magnetic resonance image diagnosis. Front Endocrinol. 2022;13:890371. doi:10.3389/fendo.2022.890371. (In eng).

23.

Won

Lee

Park

. Lumbar spinal stenosis grading in multiple level magnetic resonance imaging using deep convolutional neural networks. Glob Spine J. 2025;15(4):2309-2317. doi:10.1177/21925682241299332. (In eng).

24.

Tumko

Kim

Uspenskaia

, et al. A neural network model for detection and classification of lumbar spinal stenosis on MRI. Eur Spine J. 2024;33(3):941-948. doi:10.1007/s00586-023-08089-2. (In eng).

25.

Shahzadi

Ali

Majeed

, et al. Nerve root compression analysis to find lumbar spine stenosis on MRI using CNN. Diagnostics (Basel). 2023;13(18):2975. doi:10.3390/diagnostics13182975. (In eng).

26.

van der Graaf

Brundel

van Hooff

, et al. AI-based lumbar central canal stenosis classification on sagittal MR images is comparable to experienced radiologists using axial images. Eur Radiol. 2025;35(4):2298-2306. doi:10.1007/s00330-024-11080-0. (In eng).

27.

Mousavi

. Lumbar spine degenerative classification using YOLO v8 and DeepScoreNet. medRxiv 2024;2024, 12. 06.24318595. doi:10.1101/2024.12.06.24318595

28.

Lee

Liu

Ting

, et al. Deep learning models for lumbar spinal stenosis on MRI: model comparison and clinical benchmarking. Eur Spine J. 2026;35(2):548-558. doi:10.1007/s00586-025-09443-2. (In eng).

29.

LeCun

Bengio

Hinton

. Deep learning. Nature. 2015;521(7553):436-444. doi:10.1038/nature14539. (In eng).

30.

Litjens

Kooi

Bejnordi

, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60-88. doi:10.1016/j.media.2017.07.005. (In eng).

31.

Willemink

Koszek

Hardell

, et al. Preparing medical imaging data for machine learning. Radiology. 2020;295(1):4-15. doi:10.1148/radiol.2020192224. (In eng).

32.

Char

Shah

Magnus

. Implementing machine learning in health care - addressing ethical challenges. N Engl J Med. 2018;378(11):981-983. doi:10.1056/NEJMp1714229. (In eng).

33.

Geis

Brady

, et al. Ethics of Artificial Intelligence in Radiology: summary of the joint European and North American multisociety statement. Radiology. 2019;293(2):436-440. doi:10.1148/radiol.2019191586. (In eng).

34.

Liu

Cruz Rivera

Moher

Calvert

Denniston

SPIRIT-AI and CONSORT-AI Working Group . Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364-1374. doi:10.1038/s41591-020-1034-x. (In eng).

35.

Tajbakhsh

Shin

Gurudu

, et al.

Convolutional neural networks for medical image analysis: full training or fine tuning?

IEEE Trans Med Imag. 2016;35(5):1299-1312. doi:10.1109/tmi.2016.2535302. (In eng).

36.

Zech

Badgeley

Liu

Costa

Titano

Oermann

. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15(11):e1002683. doi:10.1371/journal.pmed.1002683. (In eng).

37.

White

Panjabi

. Clinical Biomechanics of the Spine. 2nd ed. Philadelphia, PA: J. B. Lippincott; 1990. https://www.leomed.at/listhoscan/white_90.pdf. Accessed 27 December 2025.

38.

de Vos

Wolterink

de Jong

Leiner

Viergever

Isgum

. ConvNet-Based localization of anatomical structures in 3-D medical images. IEEE Trans Med Imag. 2017;36(7):1470-1481. doi:10.1109/tmi.2017.2673121. (In eng).

39.

Zhu

Shen

Sun

, et al. Deep learning-based automated scan plane positioning for brain magnetic resonance imaging. Quant Imag Med Surg. 2024;14(6):4015-4030. doi:10.21037/qims-23-1740. (In eng).

40.

Shi

Qiu

Abaxi

SMD

Wei

Yuan

. Generalist vision foundation models for medical imaging: a case study of Segment Anything Model on zero-shot medical segmentation. Diagnostics (Basel). 2023;13(11):1947. doi:10.3390/diagnostics13111947. (In eng).

41.

Alabed

Anderson

Maiter

, et al. Large language models for simplifying radiology reports: a systematic review and meta-analysis of patient, public, and clinician evaluations. Lancet Digit Health. 2026;8(2):100960. doi:10.1016/j.landig.2025.100960. (In eng).

42.

Herwald

Shah

Johnston

Olsen

Delbrouck

Langlotz

. RadGPT: a system based on a large language model that generates sets of patient-centered materials to explain radiology report information. J Am Coll Radiol. 2025;22(9):1050-1059. doi:10.1016/j.jacr.2025.06.013. (In eng).

43.

Mendel

Singh

Mann

Wiesenfeld

Nov

. Laypeople's use of and attitudes toward large language models and search engines for health queries: Survey study. J Med Internet Res. 2025;27:e64290. doi:10.2196/64290. (In eng).

44.

Yun

Bickmore

. Online health information-seeking in the era of large language models: cross-sectional web-based survey study. J Med Internet Res. 2025;27:e68560. doi:10.2196/68560. (In eng).

45.

Nakaura

Kobayashi

Masuda

, et al. Performance of State-of-the-Art multimodal large language models on an image-rich radiology board examination: Comparison to human examinees. Acad Radiol. 2026;33(2):348-358. doi:10.1016/j.acra.2025.10.051. (In eng).

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.02 MB

0.11 MB

0.00 MB

0.11 MB