Abstract
Objectives
Lipedema is a chronic disorder characterized by pain and disproportionate fat distribution, and its diagnosis is frequently overlooked. The aim of this study was to evaluate and compare the responses generated by contemporary artificial intelligence models—ChatGPT-5o, Gemini-3, and Perplexity AI—to structured clinical questions developed in accordance with the 2024 S2k Lipedema Guideline. The models were analyzed in terms of clinical accuracy, readability, and reference reliability to assess their performance in delivering guideline-based medical information.
Methods
This cross-sectional, comparative study submitted 30 structured clinical questions, prepared on the basis of the relevant guideline, to three large language models. Responses, collected on 10 February 2026, were rated for reliability on a seven-point Likert scale and for accuracy on a five-point scale. Text readability was assessed using six established indices, including the Flesch Reading Ease Score (FRES), Flesch–Kincaid Grade Level (FKGL), and Gunning Fog Index (GFOG). Reference reliability was examined by analyzing hallucination tendencies as defined in the literature.
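For orientation, the three readability indices named above can be sketched from their standard published formulas. This is a minimal illustration, not the tooling used in the study: the function names are hypothetical, and the word, sentence, syllable, and complex-word counts are assumed to have been obtained beforehand (how they are counted varies between tools and affects the scores).

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score (FRES): higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL): approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index (GFOG); complex words are those with 3+ syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a single model response (illustrative only).
    words, sentences, syllables, complex_words = 100, 5, 150, 10
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(gunning_fog(words, sentences, complex_words), 2))
```

Under these formulas, longer sentences and more syllables per word lower the FRES and raise the FKGL, which is why a higher FKGL is read as requiring a higher educational level.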
Results
Reliability differed significantly among the models (p = .041): Perplexity (4.95 ± 1.20) scored significantly higher than ChatGPT-5o (4.38 ± 1.05) (p = .038). In the readability analyses, Perplexity's responses (FKGL 12.80 ± 2.10) required a significantly higher educational level than those of both ChatGPT-5o (p = .041) and Gemini-3 (p = .036). Regarding reference reliability, ChatGPT-5o outperformed Perplexity in source verifiability (p = .031), bibliographic precision (p = .044), and total RHS scores (p = .027), emerging as the most robust model in this domain. No statistically significant differences were found among the models in clinical accuracy or usefulness (p > .05). Inter-rater agreement was excellent (Kappa: 0.92–0.97).
Conclusion
In this study, ChatGPT-5o distinguished itself in reference quality, whereas Perplexity demonstrated superior reliability. However, the complex linguistic structures accompanying efforts to maintain high medical accuracy may constitute a significant barrier for individuals with limited e-health literacy. Although these systems show strong potential as medical information resources, they cannot yet replace expert physician oversight in terms of patient safety. A balanced approach between technical reliability and patient-centered simplification remains necessary.
