Introduction
Artificial intelligence (AI) has substantially influenced ophthalmology, augmenting diagnostic capabilities, improving disease monitoring, and enhancing patient outcomes through timely intervention and precision medicine approaches.1,2 Retinal diseases, such as diabetic retinopathy (DR), age-related macular degeneration (AMD), and glaucoma, are some of the leading causes of visual impairment globally.3–5 The effective diagnosis and management of these conditions frequently rely on multimodal imaging technologies and detailed clinical assessments, areas increasingly supported by AI.6,7
Large language models, advanced AI architectures designed to process and generate human-like text, have emerged as powerful tools in medicine, offering considerable potential to streamline diagnostic workflows and decision-making.8,9 Models such as GPT-4o, Claude, and Gemini are distinguished by their capacity for multimodal integration, combining textual and visual data-processing capabilities, which is particularly relevant in the evaluation of retinal disease.10–12 Recent studies suggest that these advanced large language models could approach or even surpass human performance in specific medical assessments, such as the United States Medical Licensing Examination and board-style ophthalmology questions.13–15
Previous research into AI applications within ophthalmology has predominantly focused on image-based diagnostics, employing deep learning frameworks like convolutional neural networks for detecting conditions such as DR, glaucoma, and AMD.16,17 These models have shown high accuracy in clinical trials and real-world settings, highlighting their potential for improving diagnostic efficiency and accuracy.18,19 However, the integration of textual clinical information alongside imaging data remains relatively unexplored, representing a substantial opportunity for improvement in diagnostic precision and clinical decision-making.
Despite the rapidly expanding body of evidence supporting AI applications in ophthalmology, significant gaps persist in our understanding of how effectively multimodal large language models integrate complex clinical narratives with detailed imaging, particularly within retina-specific contexts. Specifically, the comparative effectiveness of multimodal large language models for retinal conditions using real-world data has been poorly characterized. Using a series of real-world retinal cases derived from the University of Iowa’s EyeRounds repository, the current study systematically evaluated and compared the diagnostic accuracy of 3 prominent multimodal large language models: GPT-4o, Claude 3.7 Sonnet, and Google Gemini 2.5 Pro.20 By examining each model’s capabilities in interpreting complex clinical narratives alongside detailed ophthalmic images, we strove to elucidate their respective strengths and limitations.
Methods
This study, carried out in March 2025, systematically evaluated and compared the diagnostic accuracy of 3 advanced multimodal large language models, GPT-4o (OpenAI), Claude 3.7 Sonnet (Anthropic), and Google Gemini 2.5 Pro, on a curated dataset of retinal cases. All 40 retinal cases from the University of Iowa’s publicly available EyeRounds clinical repository were included, ensuring coverage of a diverse range of retinal disorders, including DR, AMD, retinal vascular disorders, and retinal detachments (RDs).20 Each case included comprehensive clinical data encompassing detailed patient histories, thorough ophthalmic examination results, and multimodal imaging, such as fundus photographs, optical coherence tomography, and fluorescein angiography.
The selected EyeRounds cases were manually processed and organized into a structured Excel datasheet, categorizing each case by clinical history, examination details, imaging findings, and ancillary test results. Any explanatory commentary or educational annotation originally present in the repository was removed to prevent bias. The study involved publicly available, de-identified educational case data; therefore, it was exempt from institutional review board review. For evaluation purposes, the cases were presented to each large language model in 2 distinct formats. The first format, “full clinical context,” provided complete textual narratives, including patient history, examination data, descriptive annotations of imaging, and ancillary testing results. The second format, “image-only context,” presented the raw images without the descriptive imaging annotations, requiring the large language model to interpret the imaging findings independently.
GPT-4o, Claude 3.7 Sonnet, and Google Gemini 2.5 Pro were assessed under both full clinical text and image-only conditions. Uniform prompts explicitly guided each model to identify relevant clinical signs and symptoms, generate appropriate differential diagnoses, determine the most likely primary diagnosis, and propose suitable treatment recommendations. All outputs from the models were collected verbatim without any postprocessing or alterations. Each EyeRounds case was presented independently to the models, and outputs were generated in a single interaction. No feedback, corrections, or case-to-case carryover occurred, ensuring that the models’ performance reflected their static capabilities at the time of evaluation rather than iterative learning. Prompts were administered in a standardized manner using a structured spreadsheet containing complete case information. For the full clinical context condition, each row included the patient history, examination findings, and descriptive imaging annotations, which were uploaded and referenced directly in the model prompt (Figure 1). For the image-only condition, the raw clinical images were provided along with the patient history and examination findings, but without the descriptive imaging annotations. A researcher sequentially presented each case to each large language model, which was instructed to generate its diagnostic impression, differential diagnoses, signs and symptoms, and treatment recommendations. Models were not permitted to reference external resources or revisit previous cases.

Figure 1. Example of large language model input and output for determining the key signs and symptoms associated with Case 1 using GPT-4o.
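To make the prompting workflow concrete, the sketch below shows how a standardized per-case prompt could be assembled for the two evaluation conditions. It is an illustrative reconstruction only: the template wording, the field names, and the build_prompt helper are assumptions made for exposition, not the study’s verbatim prompt materials.

```python
# Illustrative reconstruction of the standardized per-case prompt; the
# wording and field names are assumptions, not the study's actual prompt.

PROMPT_TEMPLATE = (
    "You are given a retinal case. Based only on the material provided, "
    "(1) list the key signs and symptoms, (2) provide differential "
    "diagnoses, (3) state the single most likely primary diagnosis, and "
    "(4) recommend treatment.\n\n"
    "History: {history}\n"
    "Examination: {exam}\n"
    "{imaging_section}"
)

def build_prompt(case: dict, condition: str) -> str:
    """Assemble one case prompt for either evaluation condition."""
    if condition == "full":
        # Full clinical context: descriptive imaging annotations included.
        imaging = f"Imaging findings: {case['imaging_annotations']}\n"
    else:
        # Image-only: raw images are attached separately; the textual
        # imaging annotations are withheld.
        imaging = "Imaging: see attached images (no textual description).\n"
    return PROMPT_TEMPLATE.format(
        history=case["history"], exam=case["exam"], imaging_section=imaging
    )
```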
Each large language model’s response underwent independent evaluation by 2 trained reviewers using a standardized scoring protocol. Diagnostic accuracy was determined as a binary outcome, defined by whether the model’s primary diagnosis exactly matched the reference diagnosis provided by EyeRounds. Partial matches (eg, responses containing some correct features but not the correct final diagnosis) were not given partial credit. Accuracy in identifying clinical signs and symptoms, differential diagnoses, and recommended treatments was assessed as the proportion of matches with the reference content, with class-level matches accepted for treatments (eg, if the reference was “corticosteroids,” a model response of “prednisone” was counted as correct) (Table 1); a schematic sketch of these scoring rules follows the table. For differential diagnoses, the proportion of matches listed by each model was recorded in comparison to the EyeRounds reference diagnoses. Additionally, in cases where the primary diagnosis was incorrect, performance was assessed based on whether the correct reference diagnosis appeared within the model’s differential list. The models were not prompted to provide descriptive explanations of findings; rather, they were evaluated solely on their ability to generate diagnostic outputs from the provided case materials.
Table 1. Comparison of Claude, Gemini, and GPT-4o Responses for Retina Case #10 (Central Retinal Artery Occlusion).
Abbreviations: DDx, differential diagnoses; Dx, primary diagnosis; S/S, key signs and symptoms; Tx, treatment recommendation.
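The sketch below schematizes the scoring rules described above: binary matching for the primary diagnosis and proportion-based matching for list outputs, with class-level equivalence accepted for treatments. The synonym map and normalization are simplified assumptions for illustration; in the study, the reviewers applied these rules manually.

```python
# Simplified schematic of the scoring protocol; reviewers applied these
# rules manually, and the synonym map below is an illustrative assumption.

# Hypothetical class-level treatment equivalences (eg, "prednisone"
# counts as a match for a "corticosteroids" reference).
TREATMENT_CLASSES = {
    "prednisone": "corticosteroids",
    "dexamethasone": "corticosteroids",
}

def normalize(term: str) -> str:
    term = term.strip().lower()
    return TREATMENT_CLASSES.get(term, term)

def primary_dx_correct(model_dx: str, reference_dx: str) -> bool:
    # Binary outcome: the primary diagnosis must match the EyeRounds
    # reference exactly; partial matches receive no credit.
    return normalize(model_dx) == normalize(reference_dx)

def proportion_matched(model_items: list[str], reference_items: list[str]) -> float:
    # Proportion of reference items (signs/symptoms, differentials, or
    # treatments) that appear in the model's output.
    model_set = {normalize(item) for item in model_items}
    hits = sum(1 for ref in reference_items if normalize(ref) in model_set)
    return hits / len(reference_items) if reference_items else 0.0
```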
All available retina cases within the Iowa EyeRounds repository were used in this multimodal clinical reasoning study. Statistical analysis was performed with SPSS (version 29, IBM Corp), with a significance level of α = 0.05. Differences in primary diagnostic accuracy across the 3 models were assessed using the Cochran Q test for related samples, followed by post hoc pairwise comparisons using McNemar tests with Bonferroni correction for multiple comparisons. For continuous clinical reasoning performance metrics (signs/symptoms recognition, differential diagnosis generation, and treatment recommendations), repeated-measures analysis of variance was used to compare model performance within each metric. The Greenhouse-Geisser correction was applied when sphericity assumptions were violated. Within-model performance differences between full case and image-only conditions were evaluated using paired-sample t tests.
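For readers who prefer code to prose, the sketch below reproduces the main comparisons in Python rather than SPSS; the accuracy arrays are random placeholders, not study data, and the repeated-measures analysis of variance step is omitted for brevity.

```python
# Python equivalent of the main tests (the study itself used SPSS); all
# data below are random placeholders, not study results.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
# acc[i, j] = 1 if model j diagnosed case i correctly
# (columns: GPT-4o, Claude, Gemini).
acc = rng.integers(0, 2, size=(40, 3))

# Omnibus Cochran Q test across the 3 related binary outcomes.
q = cochrans_q(acc)
print(f"Cochran Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# Post hoc pairwise McNemar tests with Bonferroni correction.
pairs = [(0, 1), (0, 2), (1, 2)]
for a, b in pairs:
    table = np.zeros((2, 2))
    for i in range(acc.shape[0]):
        table[acc[i, a], acc[i, b]] += 1
    p = mcnemar(table, exact=True).pvalue
    print(f"models {a} vs {b}: adjusted p = {min(p * len(pairs), 1.0):.4f}")

# Within-model full-context vs image-only comparison of a continuous
# metric (eg, proportion of signs/symptoms matched) via paired t test.
full_scores, image_scores = rng.random(40), rng.random(40)
t, p = ttest_rel(full_scores, image_scores)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```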
Results
Significant differences in primary diagnostic accuracy were identified among the models. With access to the full clinical context, GPT-4o achieved the highest accuracy, at 78.4%, closely followed by Claude, at 73.0%; both significantly surpassed Gemini’s 29.7% accuracy (Cochran Q=24.33).
Comparative Performance of Multimodal Large Language Models in Clinical Reasoning Match.
Significantly higher than Gemini (Bonferroni-corrected pairwise comparisons).
Significantly higher than Claude (Bonferroni-corrected pairwise comparisons).

Primary diagnostic accuracy by large language model and condition.
When analyzing recognition of signs and symptoms, GPT-4o’s 64.0% accuracy significantly outperformed Claude’s 51.2% under the full clinical context condition.
Differential diagnosis accuracy presented a notable challenge for all large language models, but GPT-4o (42.9%) and Claude (37.0%) consistently and significantly outperformed Gemini (25.9%).
Substantial intermodel agreement was noted between GPT-4o and Claude for image-based diagnoses (κ=0.658), whereas agreement involving Gemini was minimal (κ≤0.196).
Cohen’s Kappa Analysis of Large Language Model Diagnostic Agreement.
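For reference, pairwise agreement of the kind summarized in the table can be computed with scikit-learn’s implementation of Cohen’s kappa; the per-case diagnosis vectors below are placeholders, not study data.

```python
# Pairwise inter-model diagnostic agreement via Cohen's kappa; the
# per-case diagnosis labels below are placeholders, not study data.
from sklearn.metrics import cohen_kappa_score

gpt4o_dx = ["CRAO", "AMD", "RD", "DR"]     # placeholder diagnoses
claude_dx = ["CRAO", "AMD", "RD", "CRVO"]  # placeholder diagnoses

kappa = cohen_kappa_score(gpt4o_dx, claude_dx)
print(f"GPT-4o vs Claude: kappa = {kappa:.3f}")
```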
Conclusions
This study underscores the robust diagnostic performance of multimodal large language models in interpreting real-world retinal cases, particularly highlighting the capabilities of GPT-4o and Claude 3.7 Sonnet. GPT-4o demonstrated superior diagnostic accuracy (78.4%) when provided with comprehensive clinical narratives, aligning closely with previous research that emphasized its strong capabilities in complex clinical reasoning and integration of detailed patient histories.10,21,22 Claude notably excelled in the image-only condition, achieving a diagnostic accuracy of 73.7%, consistent with previous studies using deep learning algorithms that reported similarly high accuracy levels (>70%) for retinal conditions such as DR and AMD.6,16,17
Intermodel agreement analyses revealed substantial consistency between GPT-4o and Claude in image-based diagnostic decisions (κ=0.658). However, minimal agreement involving Gemini (κ≤0.196) highlighted fundamental differences in model diagnostic approaches. Such discrepancies underscore the necessity of rigorously selecting and validating AI tools tailored specifically to ophthalmic clinical tasks before their deployment, ensuring alignment with clinical needs and expectations.23,24
Moreover, Gemini consistently demonstrated significantly lower diagnostic performance across all evaluation metrics, achieving only 29.7% accuracy in the full-context condition and 31.6% in the image-only condition. Such limitations suggest intrinsic weaknesses within Gemini’s model architecture or potential deficits in ophthalmology-specific training data. Similarly, previous studies have reported Gemini’s substantial performance challenges in medical contexts. For instance, Pal and Sankarasubbu25 found that Gemini achieved significantly lower accuracy (approximately 61%) compared with the 88% achieved by GPT-4V (GPT-4 with vision capabilities), with notable reasoning errors and hallucinations in multimodal medical assessments. Furthermore, Yan et al26 documented Gemini’s performance as below random chance in specialized medical visual question-answering tasks, underscoring the model’s inherent reliability concerns. Additionally, Carlà et al27 found that Gemini correctly suggested surgical approaches for RD cases in only 70% of instances compared with the 84% achieved by ChatGPT-4, and it failed to generate any surgical plan in 10% of the evaluated cases. Gemini also demonstrated significantly lower Global Quality Scores (mean, 3.5) relative to ChatGPT-4 (mean, 4.2), underscoring its limitations in handling complex ophthalmic imaging and planning scenarios.27
Claude’s unique improvement from the full-context (51.2%) to the image-only (63.3%) condition in signs and symptoms recognition highlights its specific strength in visual interpretation. This modality-specific strength has important implications for teleophthalmology and remote screening programs, which are increasingly critical for early detection and management in underserved populations or regions lacking specialist access, and it aligns with landmark studies demonstrating the effectiveness of AI-driven retinal imaging analyses in screening for DR and other retinal diseases.3,6,16,17 Claude’s superior performance may stem from its advanced vision transformer architecture or alignment strategies, which emphasize carefully curated multimodal training datasets and reinforcement learning from human feedback, optimizing its ability to accurately interpret ophthalmic images even in the absence of textual context.28,29
In the current study, Claude demonstrated superior diagnostic performance compared with GPT-4o, achieving 73.7% accuracy in image-only scenarios for retinal disease diagnostics. In contrast, Jiao et al30 reported GPT-4o as the best-performing model for corneal diseases, with an accuracy of 80.0%, while Claude achieved slightly lower accuracy, at 70.0%. The divergent outcomes between the current retina-focused study and Jiao et al’s cornea-focused research highlight important modality-specific strengths among large language models. In particular, Claude’s unique improvement when processing image-only cases suggests a pronounced capability in interpreting ophthalmic images, such as fundus photographs, which are more routinely used in retina practice. In contrast, corneal diagnostics rely more heavily on slitlamp photography, which is less common and might not be as effectively represented in Claude’s training corpus. Thus, Claude’s performance advantage may be attributed to the greater prevalence and clarity of retinal imaging data, enhancing its image-processing proficiency. Understanding these nuances is crucial for selecting and optimizing AI models tailored to specific clinical ophthalmologic applications, thus offering the potential to improve patient outcomes through enhanced diagnostic accuracy.
The clinical implications of these findings are significant, particularly for retinal practice. Retinal diseases, including DR, AMD, and RD, often require prompt diagnosis and accurate monitoring to preserve visual function and prevent irreversible blindness.16,17 The demonstrated capability of GPT-4o to effectively synthesize detailed clinical narratives makes it a powerful adjunct tool for comprehensive patient assessments, especially in complex cases where nuanced clinical histories are essential.
The reliability of these large language models must be carefully assessed, given the documented risks of AI-generated hallucinations: outputs that are plausible yet clinically inaccurate and could potentially harm patient care.31 Transparency about model limitations and careful clinician oversight remain essential to prevent overreliance on these tools, particularly in cases where AI systems are tasked with interpreting complex multimodal clinical inputs without adequate human verification. Therefore, to ensure accuracy, reliability, and patient safety in ophthalmic practice, ongoing education and structured integration of AI into clinical workflows are imperative.
Despite promising primary diagnostic accuracy, making a differential diagnosis, which inherently requires nuanced clinical reasoning and the simultaneous consideration of multiple potential conditions, remained challenging for all 3 models (GPT-4o: 42.9%, Claude: 37.0%, Gemini: 25.9%). These findings are consistent with previous research that similarly identified challenges for AI models in generating comprehensive differentials, especially in ophthalmology.7,21
This study has several limitations that should be acknowledged. First, the relatively modest sample size may limit the generalizability of the findings, particularly given the broad spectrum of retinal pathologies encountered clinically. Larger studies using more diverse and extensive datasets are necessary to confirm these results. This study was also conducted at a time when GPT-4o and Claude 3.7 Sonnet represented the most advanced large language models available. Since then, newer versions, such as GPT-5 and Claude Opus 4.1, have been released. Although these updates offer incremental performance improvements, they do not fundamentally alter the core capabilities under evaluation, namely the first generation of multimodal models capable of processing both ophthalmic images and clinical text. Future studies evaluating these newer models will be important to determine whether these improvements translate into meaningful differences in ophthalmic performance. In addition, the complexity and variability of real-world clinical decision-making may not be fully captured by our study’s retrospective and standardized experimental design. Prospective validation studies using larger, diverse datasets are essential to comprehensively validate these findings and assess these models’ practical use in clinical settings.32,33
Another limitation involves potential previous exposure of the publicly available EyeRounds cases during model training, which could inflate the observed diagnostic accuracies. Future studies should explicitly address and mitigate dataset overlap to ensure genuine model evaluation. Additionally, this study did not stratify diagnostic performance by retinal disease subtype (eg, DR, AMD, RD, vascular occlusions). Although the dataset encompassed a broad spectrum of cases, the number of examples within each category was insufficient to permit meaningful subgroup analyses. Future studies with larger and more diverse datasets will be necessary to determine whether model performance varies systematically across different categories of retinal diseases. Lastly, the structured prompts and high-quality imaging provided in this controlled setting might not reflect typical clinical variability, possibly leading to an overestimation of real-world performance. Further investigation into AI-related inaccuracies, patient safety implications, and clinician acceptance will be essential before widespread clinical implementation.
In conclusion, this study highlights the diagnostic capabilities of GPT-4o, Claude, and Gemini for retinal diseases, with each model demonstrating distinct modality-specific strengths. GPT-4o excels with comprehensive clinical narratives, while Claude shows superior performance with image-only data. These findings emphasize the necessity of thoughtful selection and integration of multimodal AI tools in clinical ophthalmology. Future advancements should prioritize enhancing visual reasoning and differential diagnosis capabilities in retinal disease management to maximize clinical effectiveness and improve patient care.
Footnotes
Acknowledgements
Odum Institute for Research in Social Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Ethical Approval
This study was deemed exempt from institutional review board review, as it involved the use of publicly available, de-identified clinical cases from the University of Iowa’s EyeRounds repository.
Statement of Informed Consent
No human subjects were directly involved, and no identifiable patient data were used.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
