Abstract

Purpose

To evaluate the performance of three large language models (LLMs) in automated recognition of IOLMaster 700 reports and preoperative toric intraocular lens (IOL) planning.

Methods

The retrospective study analyzed preoperative examination reports of patients who underwent cataract surgery with toric IOL implantation. Three models (ChatGPT-5, ChatGPT-5 Thinking and DeepSeek Thinking) were instructed to extract key biometric parameters, evaluate a patient’s suitability for toric IOL implantation, and generate a plan. Model performance was evaluated based on structured-data recognition, refractive prediction outcomes and thinking times.

Results

Fifty-four eyes of 54 patients were analyzed. ChatGPT-5 Thinking achieved the highest agreement with the clinical reference for all extracted parameters except axial length, and extracted axis information more reliably than the other models. ChatGPT-5 showed intermediate performance, while DeepSeek Thinking was the least consistent in axis-dependent fields but performed adequately for basic biometry. Refractive and axis prediction errors were smallest with ChatGPT-5 Thinking, yielding the largest proportion of cases within prespecified clinical thresholds and the highest concordance with the calculator-based reference plan. Analysis of thinking times showed that longer processing did not necessarily correlate with better accuracy.

Conclusions

Advanced LLMs show promise for automated interpretation of ophthalmic biometry reports and calculator-based toric IOL planning workflows. These findings support the feasibility of LLM-assisted workflow automation, with ChatGPT-5 Thinking providing the most favorable balance of accuracy and efficiency in this setting.

Keywords

toric intraocular lens, cataract surgery, IOLMaster 700, large language model, clinical workflow automation

Introduction

Cataract surgery with intraocular lens (IOL) implantation is the most commonly performed surgical procedure worldwide.¹ Nowadays, cataract surgery has evolved from a sight-restoring procedure to a refractive intervention, with growing patient expectations for high-quality uncorrected vision and spectacle independence.² It has been reported that approximately one-fourth of patients with cataracts have ≥ 1.0 diopter (D) of corneal astigmatism worldwide.^3–5 With the development of toric IOLs, surgeons can provide satisfactory refractive results as patients demand spectacle independence, even in cases of corneal astigmatism.

The accuracy of postoperative refraction depends on three factors: precise preoperative ocular biometry with reliable IOL power and axis calculations, precise intraoperative alignment, and the postoperative position of the toric IOL.^6,7 According to Hirnschall et al., preoperative corneal measurement is the largest source of error in toric IOL power calculation, contributing 27% of refractive astigmatic error, followed by IOL misalignment (14.4%) and IOL tilt (11.3%).⁶ In addition, one study indicates that each degree of rotational misalignment decreases the effectiveness of a toric IOL by approximately 3%.⁸ Therefore, the persistent challenge of minimizing these human-dependent errors motivates the adoption of automated technologies to improve preoperative accuracy.

Large language models (LLMs) exhibit remarkable capabilities in processing and interpreting large volumes of complex data, and they have already supported clinicians in diagnosis and treatment planning. In ophthalmology, LLMs have been applied to tasks such as diabetic retinopathy screening and glaucoma detection.^9,10 As the technology advances, LLMs now accept multimodal inputs, allowing them to analyze images and PDF documents in addition to text. Recent feasibility work has begun to explore the use of multimodal large language models (MLLMs) in cataract surgery–related quantitative tasks, including IOL power calculation and comparison with standard formula outputs. These studies suggest that MLLMs may serve as workflow-support or backup tools, while also highlighting that current evidence remains focused on feasibility and formula replication rather than outcome-based clinical validation.^11,12 A vision-enabled LLM could automate data extraction, streamline planning, reduce manual transcription errors, and provide surgeons with quicker, data-driven assistance for toric IOL planning. However, research into LLM-assisted report recognition remains limited. Accordingly, this study compares three LLMs for their accuracy in identifying biometric parameters in IOLMaster 700 reports and in supporting toric IOL planning for cataract surgery.

Methods

Study design and approval

This single-center, retrospective and methodological study was conducted at the Eye & ENT Hospital of Fudan University, Shanghai, China, utilizing patient data collected from November 2024 to May 2025. The study was approved by the hospital’s Institutional Review Board (Approval No. 2025275) and adhered to the principles of the Declaration of Helsinki. All participants provided written informed consent, and all patient-identifying information was anonymized prior to data analysis.

Patient selection

Eligible participants were cataract patients who had undergone phacoemulsification and implantation of an AcrySof IQ Toric monofocal IOL (Alcon Laboratories, TX, USA). For inclusion, eyes were required to have a postoperative corrected distance visual acuity (CDVA) of 0.10 logMAR or better and an absolute IOL rotation of less than 10° at the 1-month follow-up examination. The exclusion criteria were as follows: (1) incomplete biometric data on the examination report; (2) a history of previous ocular surgery or ocular trauma; (3) the occurrence of intraoperative complications, such as an anterior capsular tear or posterior capsular rupture; and (4) the development of significant postoperative complications, including but not limited to severe intraocular infection or inadequate pupillary dilation. Given the high degree of similarity between the two eyes of the same individual, only one eye per patient was included in the analysis.

Biometric measurement

The IOLMaster 700 (Carl Zeiss Meditec AG) uses swept-source optical coherence tomography (SS-OCT) with a 1055-nm laser, enabling an acquisition speed of 2000 A-scans per second, a tissue penetration depth of 44 mm, and six line scans with an axial resolution of 22 µm.¹³ Each patient’s axial length (AL), anterior chamber depth (ACD), lens thickness (LT), white-to-white (WTW) distance, keratometry (K) and total keratometry (TK) were measured. For toric IOL implantation, the spherical power (Sph), cylindrical power (Cyl), and intended axis were determined using the manufacturer’s online toric IOL calculator, which is the standard planning tool routinely used by cataract surgeons in clinical practice. Each IOLMaster 700 report was exported from the device software and converted into a full-page JPG image. During this process, all patient-identifying information was removed to ensure anonymization prior to analysis. Aside from anonymization, no additional manual annotation or modification of the biometric data was performed.

LLM evaluation

The evaluation of LLM performance was conducted using three different models: ChatGPT-5 (Model A; OpenAI, USA), ChatGPT-5 Thinking (Model B; OpenAI, USA), and DeepSeek Thinking (Model C; DeepSeek AI, China). All models were accessed through their official web interfaces using default configurations, without parameter tuning, plugin integration, or external tools. For each evaluation, a new session was initiated to prevent carryover memory from previous interactions. The same prompt instructions and identical image inputs were used across all models to ensure consistent experimental conditions.

The LLMs were required to complete the following tasks: (1) extract key biometric parameters from the IOLMaster 700 report, including AL, ACD, LT, WTW, K1/K2 and their axes, and TK1/TK2 and their axes; (2) calculate the corneal astigmatism values, including ΔK and ΔTK; (3) determine the suitability for implantation of a toric IOL; and (4) select the most appropriate spherical power for the IOL. If a toric IOL was recommended for implantation, the models were also asked to calculate the cylinder power and the optimal implantation axis based on a fixed surgically induced astigmatism (SIA) of 0.2 D for the left eye and a standard incision location (140°). The full prompt is provided in the Supplementary material.
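Task (2) can be illustrated with a minimal sketch. It assumes that ΔK and ΔTK are the difference between the steep and flat readings in diopters, rounded to two decimals as on a typical biometry printout; the class name, field names, and example values are hypothetical and are not taken from the study prompt.

```python
from dataclasses import dataclass


@dataclass
class Keratometry:
    """One pair of readings (K1/K2 or TK1/TK2). Hypothetical container."""
    flat_d: float     # flat meridian power in diopters (K1 or TK1)
    flat_axis: int    # flat meridian axis in degrees
    steep_d: float    # steep meridian power in diopters (K2 or TK2)
    steep_axis: int   # steep meridian axis in degrees

    def astigmatism(self) -> float:
        # Assumption: the astigmatism index is steep minus flat power.
        return round(self.steep_d - self.flat_d, 2)


# Illustrative values only, not patient data.
example = Keratometry(flat_d=43.25, flat_axis=5, steep_d=44.50, steep_axis=95)
delta_k = example.astigmatism()
print(delta_k)
```

Under exact-match scoring, a derived index of this kind can only be correct when both underlying readings were extracted correctly and the subtraction was performed without rounding drift.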

To account for potential output variability, each case was evaluated three times per model in independent sessions, with identical inputs and prompts and without access to previous outputs. These repeated runs were performed to assess response variability and repeatability and were not treated as independent clinical samples. In addition, system-reported completion times (ChatGPT-5 Thinking and DeepSeek Thinking) were recorded for exploratory analysis of processing duration.

Outcome measures

The primary outcomes of this study were structured recognition and refractive prediction performance. Structured recognition accuracy was quantified for each parameter and model using observed agreement and correctness (defined as the proportion of exactly matched parameters among all values). Model responses were provided in text format. The relevant biometric parameters and IOL recommendations were extracted from the model outputs by identifying the numerical values explicitly reported for each requested variable. These extracted values were then compared with the corresponding reference values from the original IOLMaster 700 reports. If a response contained multiple candidate values or ambiguous expressions, the value clearly labeled for the corresponding parameter was recorded. Responses that did not provide a valid numerical value for a required parameter were classified as invalid outputs and were excluded from agreement analyses. Refractive performance was evaluated for Sph, Cyl, and axis using mean error (ME), mean absolute error (MAE), and median absolute error (MedAE). Clinically relevant thresholds were pre-specified, including absolute spherical error ≤ 0.50 D, ≤ 1.00 D, and ≤ 1.50 D; absolute cylindrical error ≤ 0.50 D, ≤ 0.75 D, and ≤ 1.50 D; and absolute axis error ≤ 5°, ≤ 10°, and ≤ 15°.
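The error metrics described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: the function names and sample values are hypothetical, and the 180° wrap-around for axis differences is an assumption, since the text does not state how axis errors near the 0°/180° boundary were handled.

```python
from statistics import mean, median


def signed_errors(predicted, reference):
    """Signed error per case: model output minus calculator reference."""
    return [p - r for p, r in zip(predicted, reference)]


def axis_error(predicted_deg, reference_deg):
    """Signed axis difference wrapped to [-90, 90) degrees.

    Assumption: axes are 180-degree periodic, so 178 vs 2 differs by 4 degrees.
    """
    return (predicted_deg - reference_deg + 90) % 180 - 90


def summarize(errors, thresholds):
    """Return ME, MAE, MedAE and the share of cases within each threshold."""
    abs_err = [abs(e) for e in errors]
    return {
        "ME": mean(errors),
        "MAE": mean(abs_err),
        "MedAE": median(abs_err),
        "within": {t: sum(a <= t for a in abs_err) / len(abs_err)
                   for t in thresholds},
    }


# Illustrative spherical powers in diopters, not study data.
sph_pred = [20.0, 21.5, 19.0, 22.0]
sph_ref = [20.5, 21.5, 20.0, 21.5]
summary = summarize(signed_errors(sph_pred, sph_ref), (0.50, 1.00, 1.50))
print(summary)
```

Reporting ME alongside MAE separates systematic bias (a consistently negative ME indicates under-powered predictions) from overall error magnitude.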

Statistical analyses

All statistical analyses were performed using SPSS Statistics for Windows (v. 22.0, IBM Corp) and R statistical software (v. 4.3.3). Normality was assessed with the Shapiro–Wilk test. Continuous variables are summarized as mean ± standard deviation (SD) or median (interquartile range, IQR) and compared across models by t-test, one-way ANOVA, or Kruskal–Wallis tests, as appropriate. Categorical variables are presented as n (%) and compared by χ² or Fisher’s exact tests with 95% confidence intervals (CIs). Agreement for structured recognition was quantified per parameter and model by Cohen’s kappa (κ) with 95% CIs. Furthermore, the association between model thinking time and per-case recognition accuracy was modeled as a continuous exposure using nonlinear smoothing (LOESS). All tests were two-sided, and p < 0.05 was considered statistically significant.
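For readers unfamiliar with Cohen’s kappa, the sketch below shows the standard unweighted definition, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. It assumes that each extracted value is treated as a nominal category and compared with the report value as a string; this is an assumption about how the statistic was applied here, and the function name and sample values are hypothetical.

```python
from collections import Counter


def cohen_kappa(reference, extracted):
    """Unweighted Cohen's kappa for two equally long sequences of labels."""
    if len(reference) != len(extracted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement: share of exact matches.
    p_o = sum(r == e for r, e in zip(reference, extracted)) / n
    # Chance agreement from the marginal frequency of each label.
    ref_counts = Counter(reference)
    ext_counts = Counter(extracted)
    p_e = sum(ref_counts[c] * ext_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative axial lengths in mm as strings; one value is misread.
report_values = ["23.45", "24.10", "22.98", "25.02"]
model_values = ["23.45", "24.10", "22.89", "25.02"]
kappa = cohen_kappa(report_values, model_values)
print(round(kappa, 4))
```

When nearly every case has a distinct numerical value, p_e approaches zero and κ approaches the raw proportion of exact matches, which would explain why kappa values and accuracy rates track each other closely for such data.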

Results

This study comprised 54 eyes of 54 patients. Each case was evaluated three times per model in independent sessions to assess response variability and repeatability. Patient demographics and baseline characteristics are summarized in Table S1.

Parameter recognition accuracy

Parameter-level recognition accuracy and agreement with the original report differed significantly among the models (Table 1 and S2). Model B showed the highest recognition accuracy for every parameter except AL, for which Model A was marginally higher (98.1% vs 96.3%), with pairwise comparisons reported in Table S2 (all p < 0.05 for B vs A and B vs C). Model B also showed the strongest agreement with the original report, achieving almost perfect agreement (κ ≥ 0.81) for all directly measured parameters (Table 2); for the derived indices ΔK and ΔTK, its agreement was substantial and moderate, respectively. Model A showed almost perfect agreement for basic biometry parameters (AL, ACD, LT), but its accuracy and agreement were significantly lower for keratometric parameters (K1, K2) and astigmatism-related indices (ΔK, ΔTK), where its agreement grade dropped to slight-to-moderate levels. Model C generally achieved almost perfect or substantial agreement, but its performance also declined for astigmatism-related parameters (ΔK, ΔTK), showing only moderate agreement.

Table 1.

The accuracy rates of parameter identification for the three models.

Parameter	Model A (n, %)	Model B (n, %)	Model C (n, %)	Cochran Q	p value
AL	159 (98.1)	156 (96.3)	141 (87.0)	26.5714	< 0.001
ACD	144 (88.9)	159 (98.1)	147 (90.7)	10.5	0.005
LT	158 (97.5)	159 (98.1)	150 (92.6)	7.6842	0.021
WTW	104 (64.2)	159 (98.1)	114 (70.4)	54.2105	< 0.001
K1	95 (58.6)	155 (95.7)	137 (84.6)	64.6364	< 0.001
K1 axis	127 (78.4)	159 (98.1)	122 (75.3)	39	< 0.001
K2	32 (19.8)	153 (94.4)	133 (82.1)	185.6029	< 0.001
K2 axis	129 (79.6)	159 (98.1)	102 (63.0)	58.7711	< 0.001
ΔK	20 (12.3)	116 (71.6)	95 (58.6)	144.1698	< 0.001
TK1	30 (18.5)	156 (96.3)	117 (72.2)	178.3286	< 0.001
TK1 axis	72 (44.4)	150 (92.6)	124 (76.5)	84.5	< 0.001
TK2	68 (42.0)	156 (96.3)	132 (81.5)	125.4141	< 0.001
TK2 axis	81 (50.0)	153 (94.4)	109 (67.3)	69.9469	< 0.001
ΔTK	13 (8.0)	90 (55.6)	65 (40.1)	114.2963	< 0.001
Recommend Toric IOL	151 (93.2)	162 (100)	152 (93.8)	11.1	0.004

AL = axial length, ACD = anterior chamber depth, LT = lens thickness, WTW = white-to-white distance, K1 = flat keratometry, K2 = steep keratometry, TK = total keratometry, IOL = intraocular lens.
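The Cochran Q column in Table 1 can be related to its p values with a short sketch. Two points are assumptions rather than details stated in the text: that each row of the input is one paired output scored as correct (1) or incorrect (0) for each of the three models, and that Q is referred to a chi-square distribution with k − 1 = 2 degrees of freedom. For 2 degrees of freedom the upper-tail probability is exp(−Q/2), which reproduces the tabulated p values (for example, Q = 10.5 gives 0.005). Function names and the toy data are hypothetical.

```python
import math


def cochran_q(rows):
    """Cochran's Q for a list of per-case tuples of 0/1, one entry per model."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when no case differs across models")
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom


def p_value_three_models(q):
    """Chi-square upper tail for 2 degrees of freedom (three models)."""
    return math.exp(-q / 2)


# Toy data: six cases scored for Models A, B, C (not study data).
toy = [(1, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 0), (1, 1, 1), (0, 1, 0)]
q_toy = cochran_q(toy)
print(q_toy, round(p_value_three_models(q_toy), 4))
```

The closed-form tail probability holds only for three models; with a different number of models a general chi-square survival function would be needed.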

Table 2.

Cohen’s κ for agreement between the clinical reference and three models.

Parameter	Model A		Model B		Model C
Parameter	Kappa (95% CI)	Grade	Kappa (95% CI)	Grade	Kappa (95% CI)	Grade
AL	0.98 (0.96 to 1.00)	Almost perfect	0.96 (0.93 to 0.99)	Almost perfect	0.87 (0.81 to 0.92)	Almost perfect
ACD	0.89 (0.84 to 0.94)	Almost perfect	0.98 (0.96 to 1.00)	Almost perfect	0.90 (0.86 to 0.95)	Almost perfect
LT	0.97 (0.95 to 1.00)	Almost perfect	0.98 (0.96 to 1.00)	Almost perfect	0.92 (0.88 to 0.97)	Almost perfect
WTW	0.62 (0.54 to 0.70)	Substantial	0.98 (0.96 to 1.00)	Almost perfect	0.67 (0.59 to 0.75)	Substantial
K1	0.58 (0.51 to 0.66)	Moderate	0.96 (0.92 to 0.99)	Almost perfect	0.84 (0.78 to 0.90)	Almost perfect
K1 Axis	0.78 (0.71 to 0.84)	Substantial	0.98 (0.96 to 1.00)	Almost perfect	0.74 (0.68 to 0.81)	Substantial
K2	0.20 (0.13 to 0.26)	Slight	0.94 (0.91 to 0.98)	Almost perfect	0.82 (0.76 to 0.88)	Almost perfect
K2 Axis	0.79 (0.72 to 0.85)	Substantial	0.98 (0.96 to 1.00)	Almost perfect	0.62 (0.55 to 0.70)	Substantial
ΔK	0.12 (0.07 to 0.17)	Slight	0.71 (0.64 to 0.78)	Substantial	0.58 (0.50 to 0.65)	Moderate
TK1	0.18 (0.12 to 0.24)	Slight	0.96 (0.93 to 0.99)	Almost perfect	0.72 (0.65 to 0.79)	Substantial
TK1 Axis	0.43 (0.35 to 0.51)	Moderate	0.94 (0.91 to 0.98)	Almost perfect	0.76 (0.70 to 0.83)	Substantial
TK2	0.42 (0.34 to 0.49)	Moderate	0.98 (0.96 to 1.00)	Almost perfect	0.83 (0.77 to 0.89)	Almost perfect
TK2 Axis	0.49 (0.41 to 0.57)	Moderate	0.96 (0.93 to 0.99)	Almost perfect	0.67 (0.59 to 0.74)	Substantial
ΔTK	0.07 (0.03 to 0.12)	Slight	0.55 (0.47 to 0.62)	Moderate	0.41 (0.33 to 0.49)	Moderate

CI = confidence interval, AL = axial length, ACD = anterior chamber depth, LT = lens thickness, WTW = white-to-white distance, K1 = flat keratometry, K2 = steep keratometry, TK = total keratometry.
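The grade labels in Table 2 match the widely used Landis and Koch scale for interpreting kappa. The sketch below assumes that this scale was the one applied, which the text does not state explicitly; the function name is hypothetical.

```python
def kappa_grade(kappa):
    """Map a kappa estimate to a Landis and Koch agreement label."""
    if kappa < 0:
        return "Poor"
    # Upper bounds are inclusive: 0.20 is still "Slight", 0.80 "Substantial".
    bands = [(0.20, "Slight"), (0.40, "Fair"),
             (0.60, "Moderate"), (0.80, "Substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"


# Point estimates taken from Table 2.
for value in (0.12, 0.20, 0.41, 0.58, 0.62, 0.87):
    print(value, kappa_grade(value))
```

Because the label depends only on the point estimate, a confidence interval that straddles a band boundary (for example, 0.58 with an interval of 0.50 to 0.65) is compatible with more than one grade.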

Refractive prediction and toric IOL planning assistance

The refractive prediction results indicated that Model B exhibited the smallest absolute errors in Sph, Cyl, and axis (Table 3 and S3). The differences from Model A and Model C were significant for Sph and axis (MedAE: all p < 0.001 for A vs. B and B vs. C) but not for Cyl (MedAE: p = 0.081 for A vs. B and p = 0.435 for B vs. C). Moreover, Model B showed the largest proportions within pre-specified clinical thresholds (e.g., ≤ 0.50 D for Sph and ≤ 5° for axis). For the toric candidacy component of the planning workflow, the accuracy of recommending toric IOL implantation across repeated model outputs was highest for Model B (100%), compared with 93.2% for Model A and 93.8% for Model C (p = 0.004; Table 1). For visualization purposes, the distribution of prediction errors was summarized as the percentage of valid cases falling within predefined absolute error thresholds (Figure S1).

Table 3.

Comparison of ME, MAE, and MedAE for predicted Sph, Cyl, and axis among three models.

Parameter		Model A	Model B	Model C	A vs B	A vs C	B vs C	Total
ME (SD)	Sph	-0.91 (1.24)	-0.43 (0.82)	-0.82 (1.69)	< 0.001	0.001	< 0.001	< 0.001
	Cyl	-0.14 (0.70)	-0.06 (0.57)	0.06 (0.67)	0.159	0.003	0.109	0.123
	Axis	13.34 (45.54)	5.91 (28.10)	1.70 (57.06)	0.001	0.084	0.786	< 0.001
MAE (SD)	Sph	1.03 (1.15)	0.56 (0.74)	1.00 (1.59)
	Cyl	0.46 (0.55)	0.34 (0.46)	0.41 (0.53)
	Axis	29.95 (36.74)	13.17 (25.50)	42.63 (37.81)
MedAE (IQR)	Sph	0.64 (0.50 to 1.00)	0.50 (0.00 to 0.50)	0.50 (0.27 to 1.00)	< 0.001	0.001	< 0.001	< 0.001
	Cyl	0.00 (0.00 to 0.75)	0.00 (0.00 to 0.75)	0.00 (0.00 to 0.75)	0.081	0.294	0.435	0.287
	Axis	8.00 (2.00 to 81.00)	4.00 (1.00 to 8.25)	35.50 (3.25 to 84.00)	< 0.001	0.005	< 0.001	< 0.001

ME = Mean error, MAE = Mean absolute error, MedAE = Median absolute error, SD = Standard deviation, IQR = Interquartile range, Sph = Spherical, Cyl = Cylindrical.

Thinking times analysis

Analyses of thinking times showed that Model C required longer thinking times than Model B (p < 0.001) (Figure 1). To further explore the relationship between reasoning duration and model performance, we analyzed the association between thinking time and parameter recognition accuracy, defined as the proportion of correctly extracted biometric parameters relative to the reference report (Figure 2). The LOESS smoothing curve shows an initial increase in recognition accuracy with increasing thinking time, followed by a gradual stabilization. This pattern suggests diminishing improvements in accuracy beyond a certain reasoning duration, which we describe as a plateau in the accuracy-time relationship. Representative example cases spanning oblique, with-the-rule (WTR), and against-the-rule (ATR) astigmatism phenotypes are provided in Supplementary Table S4 to improve clinical interpretability of the model outputs.

Figure 1.

Comparison of thinking time between Model B and Model C.

Figure 2.

Relationship between thinking time and accuracy for Model B and Model C.

Discussion

In recent years, numerous studies have indicated that LLMs hold considerable potential in text-based tasks within clinical ophthalmology.^14,15 More recently, feasibility studies have begun to explore the use of multimodal large language models (MLLMs) for cataract surgery–related quantitative tasks, including IOL power calculation and comparison with standard formula outputs. These studies suggest that MLLMs may support certain computational or workflow-assistance tasks in cataract surgery planning.¹² To our knowledge, this study provides a systematic comparison of three advanced LLMs in ophthalmic biometry parameter recognition, refractive prediction accuracy, and preoperative workflow assistance for toric IOL planning. The study demonstrates significant differences in performance across the three models. Specifically, ChatGPT-5 Thinking achieved the highest accuracy for nearly all extracted parameters and the lowest predictive errors relative to ChatGPT-5 and DeepSeek Thinking, particularly in refractive prediction and toric IOL planning assistance.

Characteristics of LLMs in ophthalmic data extraction and agreement

Our results indicate that advanced LLMs can accurately extract complex parameters from ophthalmic reports. ChatGPT-5 Thinking achieved “almost perfect” (κ ≥ 0.81) agreement with the clinical reference for all measured biometric and keratometric parameters. This finding aligns closely with the emerging trend of utilizing LLMs in medical domains, specifically for processing ophthalmic reports.¹⁶ Previous research has demonstrated that LLMs can extract critical parameters such as AL, ACD, and corneal astigmatism from raw biometry reports with high accuracy, ranging from 95% to 100%.¹¹ The performance of ChatGPT-5 Thinking, especially in handling complex K and TK data, suggests that sophisticated LLMs can go beyond simple text extraction to process the relationships among, and the precise numerical values of, the parameters in ophthalmic reports.

However, we observed that the agreement of ChatGPT-5 and DeepSeek Thinking dropped markedly for the astigmatism-related indices (ΔK and ΔTK), to “Slight” and “Moderate”, respectively. This highlights that not all LLMs possess equal complex reasoning capabilities. Accurate identification of astigmatism indices requires the model to perform precise multivariable reasoning and calculation.⁷ This limitation underscores the critical need for rigorous validation of LLMs’ domain-specific task performance before their integration into clinical workflows.

LLMs in toric IOL planning

For astigmatic patients, accurate toric IOL planning is essential to achieving satisfactory uncorrected vision, as errors in candidacy assessment, cylindrical power choice, or alignment may lead to significant residual astigmatism.¹⁷ Toric IOL implantation is generally indicated for patients with corneal astigmatism exceeding 0.75–1.0 D.^18,19 Our study revealed clear differences in the models’ reliability for recommending toric IOL implantation. ChatGPT-5 Thinking achieved the highest accuracy rate at 100.0%, compared with 93.2% for ChatGPT-5 and 93.8% for DeepSeek Thinking. This performance suggests that ChatGPT-5 Thinking may serve as a reliable workflow-assistance tool for the toric candidacy component of preoperative planning by translating biometry data into an appropriate binary recommendation. Interestingly, the high accuracy of the LLMs suggests that their decision-making process is not reliant on a simple numerical threshold. Previous research recommends distinct clinical thresholds for toric IOL implantation: toric IOLs are considered for with-the-rule (WTR) astigmatism when the value exceeds 1.5 D, but for against-the-rule (ATR) astigmatism, implantation is advised when the value exceeds only 0.4 D.^20,21 The high accuracy (>90%) demonstrated by all three LLMs is therefore consistent with the models not relying solely on a simple ΔK threshold. Instead, their handling of toric IOL recommendations suggests that they integrated decision rules that account for astigmatism type.

Clinical superiority in refractive prediction

The accurate calculation of toric IOL power is a key factor for minimizing postoperative refractive error and ensuring spectacle independence in astigmatic patients.^6,22 Regarding refractive prediction, ChatGPT-5 Thinking exhibited an advantage compared with both ChatGPT-5 and DeepSeek Thinking. This result is consistent with the current trend in cataract refractive surgery to utilize AI to enhance IOL calculation accuracy.^23,24 Moreover, ChatGPT-5 Thinking had the largest proportion of cases falling within the tightest clinically acceptable thresholds (e.g., Sph ≤ 0.50 D). The excellent refractive prediction performance of ChatGPT-5 Thinking suggests potential utility in calculator-based toric IOL planning workflows. However, a significant limitation stems from the non-transparent nature of LLM computation. While the model’s strong performance suggests it successfully learned and internalized the underlying optical principles and empirical relationships of IOL calculation from its massive medical training data, the exact basis of its numerical outputs is not publicly disclosed. Consequently, it is plausible that these calculations are heavily reliant on, or derived from, existing, validated IOL calculation formulas (such as Barrett Universal II).

Analysis of efficiency and time-accuracy balance

The efficiency and computational cost of LLMs are critical factors for practical deployment. This is especially true within a high-throughput clinical setting, such as an ophthalmology clinic. Our thinking time analysis revealed that DeepSeek Thinking required a longer thinking time than ChatGPT-5 Thinking. However, its accuracy did not improve correspondingly. This result underscores the trade-off between efficiency and accuracy.²⁵ Furthermore, the plateau observed in the accuracy-time curve suggests a limit to the model’s thinking efficiency. Beyond a certain optimal processing time, additional reasoning does not yield significant marginal gains in recognition accuracy. Future development of LLMs for ophthalmic applications should focus on efficiency optimization to ensure their practicality within the fast-paced environment of an ophthalmology clinic.

Limitations

Our study has several limitations. First, the cohort consisted of 54 eyes, and each case was evaluated three times per model in independent sessions to assess response variability and repeatability. These repeated runs represent response-level repeatability rather than independent clinical samples. Second, the reference standard used in this study was the manufacturer’s toric IOL calculator rather than surgeon-selected IOL plans or postoperative refractive outcomes. Therefore, the findings should be interpreted primarily as evidence of feasibility for automated report interpretation and workflow assistance, rather than direct validation of clinical refractive outcomes. Third, the sample size was relatively small and did not include important subpopulations such as highly myopic eyes or other complex ocular conditions. In clinical practice, surgeons often intentionally target slight residual myopia in these cases to reduce the risk of postoperative hyperopic shift and improve patient satisfaction, which deviates from the emmetropic target used in standard calculations. Fourth, the study was conducted at a single institution in one geographic region. Finally, the LLM outputs were evaluated using measurements from a single optical biometer, whereas real-world cataract planning typically integrates multimodal clinical data. Therefore, the current findings should be interpreted primarily as preliminary evidence supporting the feasibility of automated report interpretation using LLMs. Future studies should validate the performance of LLMs across more diverse patient populations, incorporate multimodal clinical measurements, and evaluate locally deployed open-source distilled models, which may offer advantages in deployment cost, data governance, and integration into hospital information systems for real-world clinical implementation.

Conclusion

In summary, this study systematically compared the performance of three advanced LLMs in preoperative toric IOL planning. The results demonstrate that ChatGPT-5 Thinking was significantly better than both ChatGPT-5 and DeepSeek Thinking, achieving high accuracy in both ophthalmic biometry parameter recognition and refractive prediction. These findings support the feasibility of applying general-purpose LLMs to automated interpretation of ophthalmic biometry reports and toric IOL planning workflows. LLMs possess substantial potential to evolve into reliable and efficient workflow-assistance tools in ophthalmology.

Supplemental material

Supplemental material - Comparison of three large language models in recognizing ophthalmological examination and supporting preoperative toric IOL planning

Supplemental material for Comparison of three large language models in recognizing ophthalmological examination and supporting preoperative toric IOL planning by Xuanqiao Lin, Yizhou Yang, Songlian Wang, Lei Cai, and Jin Yang in Digital Health.

Footnotes

ORCID iDs

Xuanqiao Lin

Jin Yang

Ethical considerations

This single-center retrospective methodological study was approved by the Institutional Review Board of the Eye & ENT Hospital of Fudan University (Approval No. 2025275) and conducted in accordance with the Declaration of Helsinki.

Author contributions

X.L. and Y.Y. contributed equally to study design, development of the LLM evaluation protocol, data collection, statistical analysis, data interpretation, visualization, and drafting of the manuscript. S.W. was responsible for data management, coding support, and assisted with statistical analysis and result validation. L.C. and J.Y. conceived and designed the study, provided clinical supervision and resources, oversaw project administration, and critically revised the manuscript for important intellectual content. All authors read and approved the final version of the manuscript and agree to be accountable for all aspects of the work.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by the National Natural Science Foundation of China (Grant number 82171039).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The de-identified datasets analyzed during the current study are not publicly available because they will be used for subsequent related research, but are available from the corresponding author on reasonable request.

AI use disclosure

AI-assisted tools were used only for language editing and improvement of expression during manuscript preparation. All scientific content, data interpretation, and final revision of the manuscript were performed and approved by the authors.

Guarantor

Xuanqiao Lin is the guarantor of this work and accepts full responsibility for the integrity of the data and the accuracy of the data analysis.

Supplemental material

Supplemental material for this article is available online.

References

1. Shekhawat, Stock, Baze, et al. Impact of First Eye versus Second Eye Cataract Surgery on Visual Function and Quality of Life. Ophthalmology 2017; 124(10): 1496–1503. https://doi.org/10.1016/j.ophtha.2017.04.014

2. Chang. The Continuing Evolution of Cataract Surgery. Asia Pac J Ophthalmol (Phila) 2017; 6(4): 308. https://doi.org/10.22608/APO.2017191

3. Chen, Zuo, Chen, et al. Prevalence of corneal astigmatism before cataract surgery in Chinese patients. J Cataract Refract Surg 2013; 39(2): 188–192. https://doi.org/10.1016/j.jcrs.2012.08.060

4. Guan, Yuan, et al. Analysis of corneal astigmatism in cataract surgery candidates at a teaching hospital in Shanghai, China. J Cataract Refract Surg 2012; 38(11): 1970–1977. https://doi.org/10.1016/j.jcrs.2012.07.025

5. Kaur, Shaikh, Falera, et al. Optimizing outcomes with toric intraocular lenses. Indian J Ophthalmol 2017; 65(12): 1301–1313. https://doi.org/10.4103/ijo.IJO_810_17

6. Hirnschall, Findl, Bayer, et al. Sources of Error in Toric Intraocular Lens Power Calculation. J Refract Surg 2020; 36(10): 646–652. https://doi.org/10.3928/1081597X-20200729-03

7. Jin, Zhang, et al. Effect of Posterior Corneal Astigmatism Measured With Different Biometers on Toric IOL Power Calculation. J Refract Surg 2025; 41(10): e1032–e1041. https://doi.org/10.3928/1081597X-20250930-02

8. Novis. Astigmatism and toric intraocular lenses. Curr Opin Ophthalmol 2000; 11(1): 47–50. https://doi.org/10.1097/00055735-200002000-00007

9. Guan, Wang, et al. Integrated image-based deep learning and language models for primary diabetes care. Nat Med 2024; 30(10): 2886–2896. https://doi.org/10.1038/s41591-024-03139-8

10. Huang, Raja, Madadi, et al. Predicting Glaucoma Before Onset Using a Large Language Model Chatbot. Am J Ophthalmol 2024; 266: 289–299. https://doi.org/10.1016/j.ajo.2024.05.022

11. Tan JCK. Using a large language model to process biometry reports and select intraocular lens for cataract surgery. J Cataract Refract Surg 2025; 51(4): 351–352. https://doi.org/10.1097/j.jcrs.0000000000001620

12. Jun, Ryu, Yoo. Multimodal large language models for IOL power calculation in cataract surgery: a feasibility study. AJO International 2025; 2: 100205. https://doi.org/10.1016/j.ajoint.2025.100205

13. Budiman, Knoch AMH, Boesoirie, et al. Agreement between IOLMaster 700 and Pentacam AXL for IOL power measurement in patients with high myopia. Indian J Ophthalmol 2024; 72(7): 1021–1025. https://doi.org/10.4103/IJO.IJO_1350_23

14. Jin, Yuan, et al. Exploring large language model for next generation of artificial intelligence in ophthalmology. Front Med (Lausanne) 2023; 10: 1291404. https://doi.org/10.3389/fmed.2023.1291404

15. Lin, Bai, Zhao, et al. Online platform vs. doctors: a comparative exploration of congenital cataract patient education from virtual to reality. Front Artif Intell 2025; 8: 1548385. https://doi.org/10.3389/frai.2025.1548385

16. Biswas, Davies, Sheppard, et al. Utility of artificial intelligence-based large language models in ophthalmic care. Ophthalmic Physiol Opt 2024; 44(3): 641–671. https://doi.org/10.1111/opo.13284

17. Schallhorn, Hettinger, Pelouskova, et al. Effect of residual astigmatism on uncorrected visual acuity and patient satisfaction in pseudophakic patients. J Cataract Refract Surg 2021; 47(8): 991–998. https://doi.org/10.1097/j.jcrs.0000000000000560

18. Bissen-Miyajima, Ota, Yaguchi, et al. Clinical Results of a Trifocal Toric Intraocular Lens Using the Holladay Total Surgically Induced Astigmatism Formula for Correcting Low Corneal Astigmatism in Japanese Patients. Clin Ophthalmol 2024; 18: 755–763. https://doi.org/10.2147/OPTH.S448427

19. Mohankumar, Mohan

20. Leon, Pastore, Zanei, et al. Correction of low corneal astigmatism in cataract surgery. Int J Ophthalmol 2015; 8(4): 719–724. https://doi.org/10.3980/j.issn.2222-3959.2015.04.14

21. Pahuja, Ashar, Garg. Long-term change in corneal astigmatism after sutureless cataract surgery. Am J Ophthalmol 2011; 152(6): 1084–1085, author reply 1084-5. https://doi.org/10.1016/j.ajo.2011.08.014

22. Yang, Han, Lee. Comparative Accuracy Of Five Modern Toric Intraocular Lens Formulas. Am J Ophthalmol 2025; 274(1-8): 1–8. https://doi.org/10.1016/j.ajo.2025.02.028

23. Wang, Burwinkel, Bensaid, et al. Evaluation of an artificial intelligence-based intraocular lens calculator: AI-based IOL-optimized formula. J Cataract Refract Surg 2024; 51(4): 332–336. https://doi.org/10.1097/j.jcrs.0000000000001603

24. Stopyra, Voytsekhivskyy, Grzybowski. Prediction of Seven Artificial Intelligence-Based Intraocular Lens Power Calculation Formulas in Medium-Long Caucasian Eyes. Life (Basel) 2025; 15(1): 45. https://doi.org/10.3390/life15010045

25. Yang, Lin, et al. Towards thinking-optimal scaling of test-time compute for LLM reasoning. arXiv preprint arXiv:2502.18080, 2025.