Sage Journals: Discover world-class research

Abstract

Objective

This study aimed to systematically evaluate five leading LLMs—ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity—in providing MHD-related health information. The primary objectives were to determine (1) the reliability of MHD-related information generated by LLMs and (2) whether its readability meets the recommended standards for patient educational materials.

Methods

A cross-sectional comparative design was adopted. The approximate timeframe during which the responses were generated was October 2025. Seventeen frequently asked MHD-related questions were identified using Google Trends and two online patient–caregiver forums. Each query was input into the five LLMs (ChatGPT-4o, Copilot, Gemini 2.5 Pro, Perplexity Pro, and DeepSeek-V3.2-Exp), and their responses were assessed using DISCERN, EQIP, JAMA, and GQS criteria for reliability, alongside FKGL, FRES, SMOG, CLI, ARI, and LWF readability indices. A heatmap analysis was also conducted to evaluate intra-model response variability.

Results

High inter-rater reliability was confirmed between the two experts (ICC for average measures ranged from 0.851 to 0.879, all P < .001). Significant differences were observed among the five LLMs in both reliability and readability. Overall reliability scores were relatively low; however, Perplexity consistently achieved higher DISCERN, EQIP, and JAMA scores compared with Gemini, ChatGPT, Copilot, and DeepSeek (P < .001). In terms of readability, all models produced texts exceeding the sixth-grade reading level. Their ARI, GFI, FKGL, CLI, and SMOG scores were notably higher than recommended, while FRES scores were substantially below the 80–90 range. Heatmap analysis further demonstrated that although Perplexity and ChatGPT maintained relatively stable mean scores, they exhibited higher variability across different queries.

Conclusions

Current large language models (LLMs) exhibit significant variability in delivering maintenance hemodialysis information. While all five evaluated models demonstrated limitations in information quality, transparency, and readability, Perplexity performed relatively better overall. However, persistent deficiencies in source attribution, language accessibility, and response consistency limit their immediate clinical and educational utility. Future LLM development should prioritize readability optimization and context-aware customization to better support patient education.

Keywords

Large language models health communication maintenance hemodialysis reliability readability

End-stage renal disease (ESRD), characterized by its irreversibility and the need for lifelong treatment, has become a major global public health concern threatening human health worldwide.¹ Maintenance hemodialysis (MHD) is one of the primary renal replacement therapies sustaining the lives of patients with ESRD.² It is estimated that more than 3.9 million individuals globally are receiving renal replacement therapy, among whom approximately 69% undergo MHD.^3,4 Although MHD remains an effective treatment for ESRD, patient education is one of the key determinants influencing self-management capacity.^5,6 Patients who possess a better understanding of the etiology, pathophysiology, therapeutic strategies, and preventive measures of their disease are generally more likely to engage actively in treatment decisions and demonstrate higher treatment adherence.⁷ This study represents, to our knowledge, the first comprehensive comparison of five leading publicly accessible LLMs specifically within the domain of MHD.

LLMs in patient education: Applications and challenges

While healthcare professionals remain the primary source of health information, the widespread availability of the Internet has changed the way patients seek knowledge. Studies have shown that approximately 80% of Internet users search for health-related information online before consulting a medical professional, highlighting the Internet's significant role in shaping patients’ understanding and expectations.⁸ In recent years, large language models (LLMs) based on artificial intelligence (AI) have emerged as an increasingly popular supplementary tool for public health information acquisition due to their convenience and interactivity. Prominent models such as ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity can instantly generate patient-oriented explanations and provide accessible educational content on disease management. However, the quality and presentation of such AI-generated information remain questionable. Issues such as potentially misleading content, inconsistent information quality, and limited readability have raised growing concerns.^9,10 To ensure the comprehensibility of patient education materials, authoritative organizations—including the American Medical Association (AMA), the National Institutes of Health (NIH), and the US Department of Health and Human Services (HHS)—recommend that written materials be kept at or below a sixth-grade reading level to facilitate understanding and action among individuals with limited health literacy.^11,12

Several medical fields have begun to evaluate both the reliability and readability of AI-generated content, where reliability is typically assessed based on structural quality, transparency, and presentation rather than direct factual accuracy.^13–15 Prior research has shown considerable variability in the performance of different LLMs regarding both response quality and readability. Some responses were characterized by a lack of source transparency or excessive linguistic complexity, potentially impairing patients’ comprehension and application of key health information. For instance, a recent evaluation in the field of sexual health education revealed significant differences in the reliability of responses generated by various LLMs, with none achieving ideal readability standards.¹³ Similarly, an assessment of AI-generated responses in palliative care indicated that five leading LLMs (Bard, Copilot, Perplexity, ChatGPT, and Gemini) produced content that exceeded recommended readability levels and achieved only modest quality scores.¹⁴ Another study investigating common questions related to rhinoplasty found that ChatGPT outperformed Gemini and Claude in terms of accuracy and overall quality; however, all three models relied heavily on medical terminology, resulting in readability levels equivalent to those of college education, which may hinder comprehension by the general public.¹⁵

However, research evaluating LLMs specifically within the context of maintenance hemodialysis remains scarce. In this context, the present study aims to systematically evaluate the performance of ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity in responding to questions related to maintenance hemodialysis. Specifically, this research focuses on analyzing the reliability and readability of AI-generated content, with the goal of providing scientific evidence and developmental insights for the standardized application of LLMs in MHD-related health communication.

Materials and methods

Study design and data collection

This study adopted a cross-sectional comparative design to investigate the performance of five large language models (LLMs) in generating health information related to maintenance hemodialysis (MHD). This study adheres to the Chatbot Assessment Reporting Tool (CHART) statement guidelines.¹⁶ The five LLMs were selected to reflect diversity in both regional origin and algorithmic architecture. The evaluated models included: ChatGPT-4o (OpenAI; accessed via ChatGPT Plus), Copilot (Microsoft), Gemini 2.5 Pro (Google; accessed via Gemini Advanced), Perplexity Pro (Perplexity.ai; configured with Claude 4.5 Opus model), and DeepSeek-V3.2-Exp (DeepSeek AI, China). All responses were generated between October 1 and 20, 2025. To reflect real-world public usage patterns, all models were accessed through their official web interfaces using default system settings, and a new conversation session was initiated for each query to ensure consistency.

Initially, standardized terminologies related to maintenance hemodialysis (MHD) were identified through the Medical Subject Headings (MeSH) database to ensure conceptual comprehensiveness and consistency of expression. The identified MeSH entries included “maintenance hemodialysis,” “hemodialysis,” and “maintenance hemodialysis symptoms.” These terms were subsequently entered into Google Trends (parameters: Worldwide; All categories; Popular searches; time frame: September 2020 to September 2025) to extract a unified topic, “maintenance hemodialysis (MHD).” According to Google's official definition, a topic represents a group of search terms that share the same concept across multiple languages and reflect global search trends.¹⁷ However, this approach may not capture all aspects of MHD-related queries. To enhance topic coverage, the researchers additionally included two online discussion platforms where MHD patients and their caregivers actively exchange experiences:

National Kidney Foundation—Online Communities (https://www.kidney.org/treatment-support/communities)

Home Dialysis Central Forums (https://forums.homedialysis.org)

Based on the principle of data saturation, 17 common questions related to MHD were identified from these sources (Table 1). In addition, ten patients undergoing maintenance hemodialysis were invited to optimize the phrasing of prompt questions. Each question was then standardized and entered into five LLMs (ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity) to obtain English-language responses. To ensure consistency and isolate the effect of each query, every one of the 17 questions was submitted in a separate, new chat session for each LLM. Prompts were entered sequentially (one after another) within the same session of a given model, rather than concurrently across different models or browser tabs. This sequential, session-isolated approach was strictly followed for all five models. Before each query was input, browsing history and personalized data were cleared to minimize potential bias. This procedure ensured that the responses generated by the AI models relied solely on their internal algorithms, without influence from cached data or prior interactions. All queries were submitted from Anhui, China in October 2025 (IP address location: China).

Table 1.
Seventeen common questions of concern to patients regarding maintenance hemodialysis (MHD) worldwide, 2020–2025.

Question

1 What is hemodialysis?

2 What is a hemodialysis machine?

3 What is a hemodialysis fistula?

4 What is a hemodialysis catheter?

5 What is home hemodialysis?

6 What is a hemodialysis nurse?

7 What is the difference between peritoneal dialysis and hemodialysis?

8 What are the treatments for kidney failure?

9 What are the symptoms associated with dialysis?

10 What are the complications of hemodialysis?

11 How is vascular access maintained?

12 What are the dietary guidelines for hemodialysis patients?

13 What are the exercise guidelines for hemodialysis patients?

14 What is dry weight?

15 What is the buttonhole technique?

16 How is hemodialysis performed for infectious patients?

17 What is heparin?

Readability and reliability evaluation

Readability was measured using six established indices (ARI, FRES, FKGL, SMOG, GFI, CLI), capturing different dimensions of syntactic complexity and semantic density. These metrics, while objective, are not without limitations—particularly in evaluating multilingual or culturally nuanced content. Reliability was assessed using four validated tools: DISCERN (content integrity),¹⁸ EQIP (presentation clarity),¹⁹ Global Quality Score (narrative coherence),²⁰ and JAMA benchmarks (source transparency).²¹ All evaluations were independently performed by two medical researchers in accordance with the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines and their clinical expertise.²² In cases of disagreement between the two raters, a senior investigator reviewed and adjudicated the final score.

Readability evaluation

To assess the accessibility of AI-generated outputs, six well-established readability metrics were applied via an online readability calculator (https://readabilityformulas.com):

Automated Readability Index (ARI)²³:

$4.71 (\frac{characters}{words}) + 0.5 (\frac{words}{sentences}) - 21.43$

Flesch Reading Ease Score (FRES)²³:

$206.835 - 1.015 (\frac{words}{sentences}) - 84.6 (\frac{syllables}{words})$

Gunning Fog Index (GFI)²⁴:

$0.4 [(\frac{words}{sentences}) + 100 (\frac{complex words}{words})]$ +100( $\frac{complex words}{words}$

Flesch–Kincaid Grade Level (FKGL)²⁴:

$0.39 (\frac{words}{sentences}) + 11.8 (\frac{syllables}{words}) - 15.59$

Coleman–Liau Index (CL)²⁵:

$5.89 (\frac{characters}{words}) - 0.3 (\frac{sentences}{words}) - 15.8$

Simple Measure of Gobbledygook (SMOG)²⁶:

$1.0430 \times \sqrt{polysyllables \times \frac{30}{sentences}} + 3.1291$

The readability results were compared against benchmarks established by the American Medical Association (AMA) and National Institutes of Health (NIH),^27,28 which recommend that health materials be written at or below the sixth-grade reading level. An FRES score ≥80 and grade-level scores <6 were considered acceptable.

Reliability and quality evaluation

To evaluate the reliability and quality of information provided by each LLM, four validated instruments were applied:

DISCERN Instrument: A 16-item questionnaire assessing quality of health information, particularly treatment choices.¹⁸ Scores were categorized as: excellent (63–75), good (51–62), fair (39–50), poor (27–38), and very poor (16–26).²⁹

Ensuring Quality Information for Patients (EQIP): A 20-item tool scoring clarity, structure, and presentation of health information.³⁰ Final scores (percentage scale) were classified as: 76–100% (excellent), 51–75% (good), 26–50% (serious quality issues), and 0–25% (severe quality problems).³¹

Global Quality Scale (GQS): A 5-point Likert scale rating overall quality, flow, and usability, with scores from 1 (very poor) to 5 (excellent).

JAMA Benchmark Criteria: Evaluates authorship, attribution, disclosure, and currency.²¹ Each component scored 0 or 1, with a maximum total of 4 points.

Statistical analysis

Nonparametric methods were applied for the analysis of reliability scores, whereas parametric methods were used for readability scores. The Kolmogorov–Smirnov test was first used to assess the normality of the data distribution, which revealed that reliability scores did not conform to the assumption of normality, while readability scores did. Consequently, the Kruskal–Wallis rank-sum test was employed to examine overall differences in reliability among the five large language models (LLMs). When statistically significant differences were observed, Dunn's post-hoc tests with Bonferroni correction were subsequently performed for pairwise comparisons using the dunn.test package.

For the readability analysis, scores were calculated for each LLM based on the six indices. As these data satisfied the normality assumption, one-way analysis of variance (ANOVA) was used to compare intergroup differences. To determine whether the readability levels met the recommended sixth-grade threshold, the one-sample t-test was conducted to compare the AI-generated scores with the established benchmark.

To assess the reliability of the rating process, both test-retest and inter-rater reliability were quantified using the Intraclass Correlation Coefficient (ICC). Specifically, a two-way random-effects model was employed via the psych package. All statistical analyses were performed using R software (version 4.4.2, R Foundation for Statistical Computing, Vienna, Austria), and a two-sided p-value < 0.05 was considered statistically significant.

Results

Reliability analysis

The reliability of the responses generated by the large language models (LLMs) was evaluated using four validated instruments: DISCERN, EQIP, JAMA, and the Global Quality Score (GQS). The intraclass correlation coefficient (ICC) for inter-tool consistency was 0.867, indicating good agreement. The descriptive statistics for each reliability indicator, expressed as median (P25, P75), are summarized in Table 2.

Table 2.
Reliability scores across five LLMs [M (P₂₅, P₇₅)].

DISCERN EQIP GQS JAMA

ChatGPT 42(39,46) 50(40,60) 3(2,3) 0(0,0)

Copilot 42(33,44) 55(50,70) 2(1,2) 0(0,1)

DeepSeek 50(35,53) 60(55,65) 3(2,3) 0(0,0)

Gemini 50(42,58) 65(60,70) 3(2,3) 0(0,0)

Perplexity 52(50,59) 65(55,70) 4(4,5) 1(1,1)

H 24.238 19.619 45.150 57.840

P <.001 <.001 <.001 <.001

According to the DISCERN scores, Perplexity achieved the highest median score of 52 (50, 59), indicating good quality information. The other four LLMs scored between 39 and 50, corresponding to fair quality levels.

The EQIP results showed that ChatGPT scored 50 (40, 60), suggesting serious quality issues, whereas the other four LLMs scored between 51 and 75, which were classified as good quality with minor issues.

For the GQS assessment, Perplexity again obtained the highest score of 4 (4, 5), rated as good, while ChatGPT, DeepSeek, and Gemini each scored 3 (2, 3), indicating fair quality and moderately informational quality.

Finally, the JAMA scores demonstrated that Perplexity achieved the highest median score of 1 (1, 1), as shown in Figure 1.

Figure 1.
Raincloud plot of reliability scores for answers to common questions in maintenance hemodialysis patients generated by five LLMs.

Pairwise comparisons using the Dunn's test revealed that Perplexity consistently achieved higher scores on both the DISCERN and EQIP assessments compared with Gemini, ChatGPT, Copilot, and DeepSeek, particularly in contrast to ChatGPT (Table 3).

Table 3.
Dunn's test results for reliability scores (P-values).

Pairwise comparison DISCERN EQIP GQS JAMA

ChatGPT vs Copilot 0.818 0.173 0 . 014 0.007

ChatGPT vs DeepSeek 0.097 0.080 0.712 1.000

Copilot vs DeepSeek 0.066 0.553 0.046 0.008

ChatGPT vs Gemini 0.043 <0.001 0.861 0.708

Copilot vs Gemini 0.030 0.038 0.019 0.023

DeepSeek vs Gemini 0.645 0.106 0.771 0.796

ChatGPT vs Perplexity <0.001 0.007 <0.001 <0.001

Copilot vs Perplexity <0.001 0.162 <0.001 0.002

DeepSeek vs Perplexity 0.066 0.359 <0.001 <0.001

Gemini vs Perplexity 0.144 0.426 <0.001 <0.001

Readability analysis

The readability of the responses generated by the large language models (LLMs) was evaluated using six established indices: Automated Readability Index (ARI), Flesch Reading Ease Score (FRES), Gunning Fog Index (GFI), Flesch–Kincaid Grade Level (FKGL), Coleman–Liau Index (CLI), and Simple Measure of Gobbledygook (SMOG). The Wilcoxon signed-rank test indicated that the readability scores of all LLMs failed to meet the benchmark level corresponding to the sixth-grade reading standard (Table 4).

Table 4.
Readability scores across LLMs (means ± SD).

ARI^a CL^b FKGL^c FRES^d GFI^e SMOG^f

ChatGPT 12.19 ± 1.89 15.56 ± 2.33 11.76 ± 2.03 26.76 ± 14.12 14.49 ± 2.37 9.26 ± 1.42

Copilot 11.88 ± 2.29 14.34 ± 2.52 11.53 ± 2.13 32.47 ± 14.38 14.06 ± 2.12 9.97 ± 1.38

DeepSeek 11.91 ± 2.09 14.35 ± 2.13 11.52 ± 1.85 33.65 ± 11.85 13.74 ± 1.87 10.01 ± 1.36

Gemini 13.30 ± 1.94 15.51 ± 2.24 12.48 ± 1.98 29.18 ± 13.12 14.11 ± 2.03 10.63 ± 1.51

Perplexity 15.00 ± 2.75 15.25 ± 2.45 14.24 ± 2.09 27.00 ± 12.57 14.70 ± 2.39 12.30 ± 1.71

6th-grade level score 6 6 6 80–90 6 6

a
ARI: Automated Readability Index.

b
CL: Coleman–Liau Index.

c
FKGL: Flesch–Kincaid Grade Level.

d
FRES: Flesch Reading Ease Score.

e
GFI: Gunning Fog Index.

f
SMOG: Simple Measure of Gobbledygook.

Specifically, the FRES values were significantly below the recommended range of 80–90 for sixth-grade readability, while both GFI and FKGL scores demonstrated that the text complexity exceeded the comprehension capacity of general readers (Figure 2). As shown in Table 4 and Figure 2, DeepSeek performed best among the evaluated models (ARI: 11.91 ± 2.09; CLI: 14.35 ± 2.13; FKGL: 11.52 ± 1.85; GFI: 13.74 ± 1.87; FRES: 33.65 ± 11.85), followed closely by Copilot. In contrast, Perplexity and Gemini consistently produced outputs that exceeded acceptable complexity thresholds (e.g., Perplexity ARI: 15.00 ± 2.75).

Figure 2.
Bar plot of readability scores for answers to common questions in maintenance hemodialysis patients generated by five LLMs.

Table 4 and Figure 2 present the mean readability scores of each model across different indices, with particular emphasis on deviations from the sixth-grade benchmark. The red dashed line in the figure represents the threshold for acceptable readability. Figure 2 visually illustrates the readability scores corresponding to each query, facilitating the comparison of variability and consistency among different prompts.

Among all readability indices, DeepSeek and Copilot consistently generated more comprehensible texts compared with Perplexity, Gemini, and ChatGPT. However, none of these large language models met the sixth-grade reading standard recommended by the US National Institutes of Health (NIH) and the American Medical Association (AMA). As shown in Table 4, significant differences were observed across all six readability metrics (all p < 0.001, Kruskal–Wallis test).

Query-level variability: Heatmap and distribution insights

The heatmap and Kruskal–Wallis test results (Table 2, all p < 0.001 for H values) further highlighted another critical pattern—intra-model inconsistency. Within the same model, response quality varied significantly depending on the specific query. For instance, while Perplexity and ChatGPT maintained relatively stable average scores across queries, their standard deviations were notably higher. This finding suggests heterogeneity in content structure and factual rigor across different topics.

The heatmap (Figure 3) provides a visual comparison of the performance of each model in terms of reliability and readability: the left panel (Figure 3(a)) displays reliability scores, while the right panel (Figure 3(b)) presents readability scores. The visualization clearly reveals the distributional differences in performance across multiple indicators for the different models.

Figure 3.
Heatmap of LLMs outputs for (a) reliability and (b) readability.

Discussion

Principal findings

This comparative study identified three key findings in its evaluation of five leading LLMs: (1) overall reliability scores across all five LLMs were relatively low, Perplexity demonstrated relatively superior performance across the reliability metrics; (2) none of the evaluated models met the sixth grade reading comprehension standard, underscoring the persistent barrier to accessible health information; and (3) significant variability was observed among models, indicating that response quality fluctuates depending on the query context.

These findings suggest that, although AI-based tools hold substantial potential for scalable health education, their current deployment still requires human oversight and domain-specific refinement to ensure accuracy, safety, and accessibility.

Comparison with prior work

Although LLMs have shown favorable results in many medical fields.^32,33 Our study represents the first comparative analysis of five leading LLMs—ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity—in the field of maintenance hemodialysis. Previous studies have focused on the extraction of three LLMs or specific types of reports.^34,35 Our study extends these findings by demonstrating that all five LLMs demonstrated limited reliability and failed to meet established readability standards. This result aligns with the study by Özduran et al.,³⁶ which reported poor readability and low reliability and quality scores in AI-generated responses to back pain–related questions. Together, these findings highlight the persistent limitations of current mainstream LLMs in delivering highly specialized and personalized medical guidance.

Several factors may account for these deficiencies. First, LLMs are designed to cover a broad range of topics, structural reliability and comprehensiveness of information. They may lack exposure to or comprehensive understanding of the most recent Clinical Practice Guidelines for Nephrology, authoritative textbooks, and up-to-date peer-reviewed literature. Second, the Internet contains a vast amount of medical information of variable quality, including outdated therapies, commercially biased content, and anecdotal personal experiences. Without rigorous quality filtering during training, content with limited transparency and structural integrity. Third, LLMs are constrained by their knowledge update latency. Medicine evolves rapidly, with clinical guidelines and treatment standards frequently revised. Because model training relies on static datasets with fixed cut-off dates, these systems cannot integrate the most recent evidence in real time. Consequently, their responses may fail to reflect the latest recommendations in dialysis management—such as updates in pharmacological protocols or standards of dialysis adequacy.

Although all five models exhibited fair-to-moderate reliability scores, Perplexity achieved comparatively higher scores. This indicates that Perplexity tended to provide more structured and source-attributed information within the constraints of the current models, rather than implying high absolute reliability. In our analysis, the DISCERN instrument—which emphasizes clarity, balance, and comprehensiveness of health information—indicated that Perplexity generated more balanced and source-based responses. In contrast, ChatGPT and Copilot scored significantly lower due to issues related to transparency and citation of information sources. While both Gemini and Perplexity provided hyperlinks, many of Gemini's links were invalid. Perplexity, however, consistently referenced authoritative and verifiable sources, resulting in higher scores on the JAMA benchmark, which evaluates credibility based on authorship, attribution, disclosure, and currency. This advantage is likely attributable to its built-in citation features.

Readability represented another major focus of this study. However, it is important to note that readability indices are formula-based proxies for text complexity and may not fully capture the actual comprehension experience of patients with varying health literacy levels. Given the health literacy level of patients undergoing maintenance hemodialysis (MHD), the comprehensibility of educational materials is of particular importance. Based on six established readability indices—ARI, FRES, GFI, FKGL, CLI, and SMOG—DeepSeek produced simpler and more understandable text, whereas Perplexity and Gemini generated more complex narratives. This finding raises important questions about how to make AI-generated educational content accessible to individuals with lower educational attainment. Despite the apparent reliability of these models, none achieved the sixth-grade reading standard recommended by the U.S. National Institutes of Health (NIH) and the American Medical Association (AMA). These results indicate that while LLMs can generate internally consistent and structured content, their linguistic complexity may hinder comprehension for the general public. This observation is consistent with the findings of Ensari et al.,³⁷ who reported similar readability limitations in AI-generated responses to pediatric dialysis–related queries.

Fundamentally, current AI systems function as “knowledge translators” rather than “patient-centered communicators.” Their insufficient readability stems from the lack of algorithmic adaptability to audiences with limited health literacy, thereby constraining their practical value in patient education. For example, ChatGPT's responses regarding hemodialysis complications often include technical terms such as “hypotension caused by rapid ultrafiltration” and “dialysis adequacy (Kt/V),” which may be confusing for lay users. A more accessible phrasing—such as “During dialysis, removing fluid and toxins too quickly from the body may lead to low blood pressure”—would simplify understanding and enhance patient engagement.

Previous studies have demonstrated that the ChatGPT series can reduce text complexity when trained with specific prompts or instructions.^38,39 This suggests that future applications could improve readability through prompt engineering focused on simplified language generation. Similarly, the study by Mondal et al. examined several LLM-based chatbots—including ChatGPT, Claude, Copilot, Gemini, Meta AI, and Perplexity—and confirmed their potential in producing plain language summaries (PLSs) from scientific abstracts. These findings indicate that such tools can assist non-native English speakers in overcoming linguistic barriers and substantially enhance the accessibility of scientific knowledge.⁴⁰

Readability is a fundamental determinant of effective patient education. Long and syntactically complex sentences can diminish reader confidence and impair comprehension of written health information. Research has shown that reducing sentence length by 8–10 words can significantly improve the readability of health materials.⁴⁰ More readable content enhances health literacy, which in turn improves treatment adherence, shortens hospital stays, and reduces emergency visits.¹⁴ Education constitutes a cornerstone of disease management.⁵ For dialysis patients, adherence to treatment schedules, compliance with fluid and dietary restrictions, and adherence to hygiene recommendations collectively determine both disease progression and treatment outcomes.^6,41 Studies have further demonstrated that when patients receive personalized and comprehensible health information, their adherence to medical advice improves significantly, leading to better health outcomes.⁹

Nevertheless, despite the growing integration of medical technology and artificial intelligence into healthcare delivery, critical concerns remain—AI-generated content may still contain misleading information, inconsistent quality, and limited readability for the general public.¹¹ These challenges underscore the need for continued human oversight, health literacy–sensitive algorithm development, and the inclusion of clinical expertise in AI-based health communication systems.

Strengths and limitations

Our study had several strengths. To the best of our knowledge, it represents the first comparative analysis of five leading LLMs—ChatGPT, DeepSeek, Copilot, Gemini, and Perplexity—in the field of maintenance hemodialysis, thereby filling an important gap in the literature. By adopting a systematic, multidimensional evaluation framework and employing validated assessment tools, this research provides a robust and reproducible analysis of reliability, content quality, and readability. Importantly, the study not only assesses the current performance of artificial intelligence in specialized medical question-answering but also offers insights into its future development. The findings may serve as a practical reference for healthcare professionals and patients increasingly reliant on digital health information systems.

However, several limitations of this study must be acknowledged. First, the evaluation of LLMs in terms of reliability, informational quality, and readability may face challenges related to timeliness, as these algorithms are continuously updated and iteratively optimized. Consequently, the performance snapshot captured in this study reflects the models at a specific point in time, and the absolute scores reported here may not be directly generalizable to future versions. This inherent dynamism underscores the need for continuous, longitudinal assessment of LLMs in clinical contexts. Second, since the analysis was based solely on English-language responses, the generalizability of the findings to non-English contexts may be limited. Third, although the 17 selected questions were structurally reasonable, they did not fully capture the individualized variations arising from diverse patient characteristics, which may have affected the comprehensiveness of the generated answers. Furthermore, the use of default prompts may not accurately reflect real-world patient queries, limiting the generalizability of our findings. Fourth, our assessment of readability relied on a single online calculator (https://readabilityformulas.com). While it provides a standardized set of indices, its algorithmic implementation, potential handling of medical terminology, and lack of validation against patient comprehension in this specific context constitute a methodological limitation. This reliance on a single tool may not fully capture the nuanced readability of medical texts for the target patient population. Finally, while each query was initiated in a new chat session to clear the immediate context window, all prompts were entered sequentially within the same user environment. We acknowledge that this approach may not have fully bypassed sophisticated system-level caching or session-persistent algorithms employed by LLM providers. Consequently, although “New Chat” sessions isolate short-term memory, potential carryover effects or interaction biases between consecutive queries cannot be definitively ruled out. Future studies should consider using fully independent, randomized, and parallel sessions across different accounts or devices to achieve absolute isolation.

Future directions

Based on the findings of this study, future development of artificial intelligence in health information delivery should place greater emphasis on personalized interaction and content accessibility. First, AI systems could differentiate between healthcare professionals and patients through preliminary interactions (e.g., by asking about the user's role), thereby tailoring the level of expertise and style of communication accordingly. Second, systems may assess a user's educational background or preferred reading difficulty, enabling the generation and dissemination of content—either educational or professional—that is precisely matched to the user's literacy level, optimizing both efficiency and acceptability of information transfer. Finally, AI algorithms should prioritize the retrieval and integration of information from authoritative medical sources, including peer-reviewed journals, clinical practice guidelines (e.g., KDIGO), and regulatory agencies. These references should be explicitly cited in responses to enhance transparency and credibility.

Future research should focus on translating large language models into reliable, patient-centered clinical adjuncts through deliberate design that prioritizes professionalization, safety, and usability. This entails developing systems anchored in evidence-based, specialized knowledge bases, integrating readability optimization and safety escalation mechanisms for high-risk topics, and establishing rigorous clinical validation and audit frameworks. Concurrently, policy efforts must promote an interdisciplinary AI quality assurance framework, including specialty-specific certification, standardized guideline testing, and post-market surveillance. Finally, responsible science communication by media is essential to shape informed public expectations and foster a balanced discourse that supports innovation while ensuring accountability.

Conclusions

The evaluation of large language models (LLMs) in responding to common questions related to maintenance hemodialysis revealed that Perplexity outperformed ChatGPT, DeepSeek, Copilot, and Gemini in terms of reliability and content quality. However, Perplexity's tendency to employ more complex linguistic structures poses a clear limitation for users with lower health literacy. While all LLMs demonstrated satisfactory performance in delivering basic health information, their reliability scores and information quality decreased when addressing topics that require clinical judgment—such as risk management and recovery processes. These findings indicate that, although LLMs cannot replace professional medical consultation, they may serve as valuable informational tools by providing structured and standardized educational content for patients.

Future research should focus on both algorithmic advancement and ethical governance, promoting the systematic integration of artificial intelligence into medical communication. Continuous monitoring and optimization are also needed to enhance the readability, content consistency, and reference adequacy of domain-specific AI models, particularly in chronic disease management.

	Question
1	What is hemodialysis?
2	What is a hemodialysis machine?
3	What is a hemodialysis fistula?
4	What is a hemodialysis catheter?
5	What is home hemodialysis?
6	What is a hemodialysis nurse?
7	What is the difference between peritoneal dialysis and hemodialysis?
8	What are the treatments for kidney failure?
9	What are the symptoms associated with dialysis?
10	What are the complications of hemodialysis?
11	How is vascular access maintained?
12	What are the dietary guidelines for hemodialysis patients?
13	What are the exercise guidelines for hemodialysis patients?
14	What is dry weight?
15	What is the buttonhole technique?
16	How is hemodialysis performed for infectious patients?
17	What is heparin?

	DISCERN	EQIP	GQS	JAMA
ChatGPT	42(39,46)	50(40,60)	3(2,3)	0(0,0)
Copilot	42(33,44)	55(50,70)	2(1,2)	0(0,1)
DeepSeek	50(35,53)	60(55,65)	3(2,3)	0(0,0)
Gemini	50(42,58)	65(60,70)	3(2,3)	0(0,0)
Perplexity	52(50,59)	65(55,70)	4(4,5)	1(1,1)
H	24.238	19.619	45.150	57.840
P	<.001	<.001	<.001	<.001

Pairwise comparison	DISCERN	EQIP	GQS	JAMA
ChatGPT vs Copilot	0.818	0.173	0 . 014	0.007
ChatGPT vs DeepSeek	0.097	0.080	0.712	1.000
Copilot vs DeepSeek	0.066	0.553	0.046	0.008
ChatGPT vs Gemini	0.043	<0.001	0.861	0.708
Copilot vs Gemini	0.030	0.038	0.019	0.023
DeepSeek vs Gemini	0.645	0.106	0.771	0.796
ChatGPT vs Perplexity	<0.001	0.007	<0.001	<0.001
Copilot vs Perplexity	<0.001	0.162	<0.001	0.002
DeepSeek vs Perplexity	0.066	0.359	<0.001	<0.001
Gemini vs Perplexity	0.144	0.426	<0.001	<0.001

	ARI^a	CL^b	FKGL^c	FRES^d	GFI^e	SMOG^f
ChatGPT	12.19 ± 1.89	15.56 ± 2.33	11.76 ± 2.03	26.76 ± 14.12	14.49 ± 2.37	9.26 ± 1.42
Copilot	11.88 ± 2.29	14.34 ± 2.52	11.53 ± 2.13	32.47 ± 14.38	14.06 ± 2.12	9.97 ± 1.38
DeepSeek	11.91 ± 2.09	14.35 ± 2.13	11.52 ± 1.85	33.65 ± 11.85	13.74 ± 1.87	10.01 ± 1.36
Gemini	13.30 ± 1.94	15.51 ± 2.24	12.48 ± 1.98	29.18 ± 13.12	14.11 ± 2.03	10.63 ± 1.51
Perplexity	15.00 ± 2.75	15.25 ± 2.45	14.24 ± 2.09	27.00 ± 12.57	14.70 ± 2.39	12.30 ± 1.71
6th-grade level score	6	6	6	80–90	6	6

Footnotes

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments. We also extend our gratitude to the patients for their voluntary participation and to the developers and contributors of the large language models studied herein. Their work provided the foundational tools for this research. Additionally, during the statistical analysis phase, various artificial intelligence (AI) tools were employed to assist in the code development for generating the heatmaps.

ORCID iDs

Jinjin Cao

Zhishui Wu

Ya Hu

Yiying Liu

Ethical approval

This study was determined to be exempt from ethical review and was granted approval by the Ethics Committee of the Fuyang People's Hospital. Although this study was granted ethical exemption, specific measures were implemented during patient involvement in optimizing question phrasing. All participating patients were informed of the project objectives and the voluntary nature of their participation. Written informed consent was obtained from each participant prior to discussion. To protect participant privacy, all feedback was anonymized, and no identifiable personal information was recorded or linked to the responses.

Contributorship

All listed authors made significant contributions to this study, whether in research design, data collection, organization and analysis, or in the writing and revision of the manuscript. All listed authors gave final approval of the version to be published, agreed on the journal to which the article has been submitted, and agree to be accountable for all aspects of the work.

Conceptualization: Jinjin Cao, Zhishui Wu. Data curation: Jinjin Cao, Zhonghua Wu, Saiwen Dai. Formal analysis: Jinjin Cao, Zhishui Wu. Funding acquisition: Yiying Liu. Investigation: Jinjin Cao, Zhonghua Wu. Methodology: Jinjin Cao, Zhishui Wu. Project administration: Yiying Liu. Supervision: Yiying Liu. Validation: Yiying Liu, Ya Hu. Visualization: Jinjin Cao. Writing—original draft: Jinjin Cao, Yiying Liu. Writing—review and editing: Jinjin Cao, Zhonghua Wu, Zhishui Wu, Ya Hu, Saiwen Dai, Yiying Liu.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Clinical Medical Research Transformation Special Project of the Fuyang Key Research and Development Program (grant number FK20245554-1).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Guarantor

The authors take responsibility for the manuscript. All authors take responsibility for any liabilities regarding this case.

References

Adibifard

Bozorgi

Kolangi

, et al. Effects of Pistacia genus on gastrointestinal tract disorders: a systematic and comprehensive review. Fitoterapia 2024; 176: 106038.

Bello

Okpechi

Osman

, et al. Epidemiology of haemodialysis outcomes. Nat Rev Nephrol 2023; 18: 378–395.

Jager

Kovesdy

Langham

, et al. A single number for advocacy and communication—worldwide more than 850 million individuals have kidney diseases. Nephrol Dial Transplant 2019; 34: 1803–1805.

Pecoits-Filho

Okpechi

Donner

, et al. Capturing and monitoring global differences in untreated and treated end-stage kidney disease, kidney replacement therapy modality, and outcomes. Kidney Int Suppl 2020; 10: e3–e9.

Wolf

Gazmararian

Baker

. Health literacy and functional health status among older adults. Arch Intern Med 2005; 165: 1946–1952.

Dewalt

Berkman

Sheridan

, et al. Literacy and health outcomes: a systematic review of the literature. J Gen Intern Med 2004; 19: 1228–1239.

Ozduran

Hanci

Erkin

. Evaluating the readability, quality and reliability of online patient education materials on chronic low back pain. Natl Med J India 2024; 37: 124–130.

Ozduran

Hanci

. YouTube as a source of information about stroke rehabilitation during the COVID-19 pandemic. Neurol Asia 2023; 28: 907–915.

Grippaudo

Nigrelli

Patrignani

, et al.

Quality of the information provided by ChatGPT for patients in breast plastic surgery: are we already in the future?

JPRAS Open 2024; 40: 99–105.

10.

Gül

Erdemir

Hanci

, et al. How artificial intelligence can provide information about subdural hematoma: assessment of readability, reliability, and quality of ChatGPT, Bard, and Perplexity responses. Medicine (Baltimore) 2024; 103: e38009.

11.

Kutner

Greenberg

Jin

, et al. The health literacy of America's adults: results from the 2003 National Assessment of Adult Literacy. Washington (DC): National Center for Education Statistics, 2006.

12.

Marquez

Ladd

. Promoting health literacy: finding consumer health resources and writing health materials for patients. J Hosp Librariansh 2019; 19: 156–164.

13.

Lyu

Huang

Zhang

, et al. Quality, bias, and readability of responses generated by ChatGPT for sexually transmitted diseases: a cross-sectional study. JMIR Public Health Surveill 2023; 9: e46999.

14.

Hancı

Ergün

Gül

, et al. Assessment of readability, reliability, and quality of ChatGPT, Bard, Gemini, Copilot, and Perplexity responses on palliative care. Medicine (Baltimore) 2024; 103: e39305.

15.

Meyer

MKR

Kandathil

Davis

, et al. Evaluation of rhinoplasty information from ChatGPT, gemini, and claude for readability and accuracy. Aesthet Plast Surg 2024; 49: 1–6.

16.

CHART Collaborative. Reporting guideline for chatbot health advice studies: the Chatbot assessment reporting tool (CHART) statement. BMJ Med 2025; 4: e001632.

17.

Watad

Mahroum

, et al. Forecasting the West Nile virus in the United States: an extensive data-based time series analysis and structural equation modeling of digital searching behavior. JMIR Public Health Surveill 2019; 5: e9176.

18.

Charnock

Shepperd

Needham

, et al. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health 1999; 53: 105–111.

19.

Moult

Franck

Brady

. Ensuring quality information for patients: development and preliminary validation of a new instrument to improve patient education materials. Patient Educ Couns 2004; 54: 273–280.

20.

Bernard

Langille

Linton

, et al. Identification and assessment of online patient education resources for maternal-fetal medicine. J Obstet Gynaecol Can 2007; 29: 969–973.

21.

Silberg

Lundberg

Musacchio

. Assessing, controlling, and assuring the quality of medical information on the internet: caveat lector et viewor—let the reader and viewer beware. JAMA 1997; 277: 1244–1245.

22.

Eckardt

Delgado

Heerspink

HJL

, et al. Trends and perspectives for improving quality of chronic kidney disease care: conclusions from a KDIGO controversies conference. Kidney Int 2023; 104: 888–903.

23.

Smith

Senter

. Automated readability index. AMRL-TR Aerospace Med Res Lab 1967: 1–14.

24.

Kincaid

PJ.

Derivation of new readability formulas. Chief of Naval Technical Training, Naval Air Station Memphis, 1975.

25.

Coleman

Liau

. A computer readability formula designed for machine scoring. J Appl Psychol 1975; 60: 283–284.

26.

Hedman

. Using the SMOG formula to revise a health-related document. Am J Health Educ 2008; 39: 61–64.

27.

Weiss

BD.

Health literacy and patient safety: help patients understand. Chicago (IL): American Medical Association Foundation, 2007, vol. 2, pp.1–48.

28.

Karten

. Easy to write? Creating easy-to-read patient education materials. Clin J Oncol Nurs 2007; 11: 506–510.

29.

Weil

Bojanowski

Jamart

, et al. Evaluation of the quality of information on the Internet available to patients undergoing cervical spine surgery. World Neurosurg 2014; 82: e31–e39.

30.

Moult

Franck

Brady

. Ensuring quality information for patients: development and preliminary validation of a new instrument to improve the quality of written health care information. Health Expect 2004; 7: 165–175.

31.

Hain

. Improving the quality of health information: the contribution of C-H-i-Q. Health Expect 2002; 5: 270–273.

32.

World Health Organization. Global action plan for the prevention and control of noncommunicable diseases 2013–2020. Geneva: WHO, 2013.

33.

Gilson

Safranek

Huang

, et al. How does ChatGPT perform on the United States medical licensing examination (USMLE)? the implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023; 9: e45312.

34.

Musheyev

Pan

Loeb

, et al.

How well do artificial intelligence Chatbots respond to the top search queries about urological malignancies?

Eur Urol 2024; 85: 13–16.

35.

Wang

Yun

. Accuracy and readability of kidney stone patient information materials generated by a large language model compared to official urologic organizations. Urology 2024; 189: 12.

36.

Ozduran

Hancı

Erkin

, et al. Assessing the readability, quality and reliability of responses produced by ChatGPT, gemini, and perplexity regarding most frequently asked keywords about low back pain. PeerJ 2025; 13: e18847.

37.

Ensari

Onder

ENA

Ertan

. A comparative assessment of large language models in pediatric dialysis: reliability, quality and readability. Ther Apher Dial 2025; 29: 739–746.

38.

Garcia Valencia

Thongprayoon

Miao

, et al. Empowering inclusivity: improving readability of living kidney donation information with ChatGPT. Front Digit Health 2024; 6: 1366967.

39.

Zaki

Mai

Abdel-Megid

, et al. Using ChatGPT to improve readability of interventional radiology procedure descriptions. Cardiovasc Intervent Radiol 2024; 47: 1134–1141.

40.

Mondal

Gupta

Sarangi

, et al. Assessing the capability of large language model chatbots in generating plain language summaries. Cureus 2025; 17: e80976.

41.

Berkman

Sheridan

Donahue

, et al. Low health literacy and health outcomes: an updated systematic review. Ann Intern Med 2011; 155: 97–107.

The reliability and readability of large language models in answering patient questions on maintenance hemodialysis: A comparative study

Abstract

Objective

Methods

Results

Conclusions

Keywords

LLMs in patient education: Applications and challenges

Materials and methods

Study design and data collection

Readability and reliability evaluation

Readability evaluation

Reliability and quality evaluation

Statistical analysis

Results

Reliability analysis

Readability analysis

Query-level variability: Heatmap and distribution insights

Discussion

Principal findings

Comparison with prior work

Strengths and limitations

Future directions

Conclusions

Footnotes

Acknowledgments

ORCID iDs

Ethical approval

Contributorship

Funding

Declaration of conflicting interests

Data availability statement

Guarantor

References