Abstract
Background
AI tools are becoming primary information sources for patients with chronic kidney disease (CKD). However, as AI sometimes generates false or inaccurate information, the reliability of the information it provides must be assessed.
Methods
This study assessed AI-generated responses to frequently asked questions about CKD. We entered Japanese prompts containing the top CKD-related search keywords into ChatGPT, Copilot, and Gemini. The Quality Analysis of Medical Artificial Intelligence (QAMAI) tool was used to evaluate the reliability of the information.
Results
We included 207 AI responses from 23 prompts. The AI tools generated reliable information overall, with a median QAMAI score of 23 (interquartile range: 7) out of 30. However, accuracy and the provision of sources varied across tools (median (IQR) QAMAI score: ChatGPT versus Copilot versus Gemini = 18 (2) versus 25 (3) versus 24 (5), p < 0.01). Among the three tools, ChatGPT provided the least accurate information and did not provide any sources.
Conclusion
The quality of AI responses on CKD was generally acceptable. While most of the information provided was reliable and comprehensive, some of it lacked accuracy and references.
Introduction
Chronic kidney disease (CKD), defined by the KDIGO 2024 guidelines as abnormalities of kidney structure or function present for a minimum of 3 months,1 is a serious health problem worldwide. A systematic analysis as of 2017 found 697.5 million cases of all-stage CKD, corresponding to a global prevalence of 9.1%.1 As CKD progresses, it considerably impairs patients' quality of life and causes economic problems due to high medical costs.2 Self-management from the early stages of diagnosis is essential for preventing the progression of CKD, and information that supports patients' health behaviors needs to be provided in a tailored and continuous manner.
Since OpenAI released ChatGPT, the first artificial intelligence (AI) chatbot based on large language models (LLMs), in November 2022, AI has been applied in various fields. In healthcare, AI provides highly personalized information for patients and the general public, bridging the knowledge gap between patients and healthcare professionals. The field of CKD is no exception, and AI can be a crucial information source and consultant for patients with CKD.
However, AI may create an illusion of intelligence, producing so-called hallucinations.3 Hallucination is a phenomenon in which AI generates incorrect information. It occurs when a large training dataset contains inconsistent data or when prompts are ambiguous or leading.4,5
The objective of this study is to determine the reliability of AI-generated patient information about CKD and its potential application to patient education. We therefore pose the following research questions:
How reliable is the AI-generated patient information about CKD?
Which elements pose challenges to the reliability of AI-generated patient information about CKD?
Which AI platform can generate the most reliable patient information about CKD?
Materials and methods
This is a cross-sectional study that quantitatively analyzed the responses generated by AI tools. This study was exempted from approval by the Research Ethics Committee as the materials were available to the public and did not include patient records or personal information.
Data collection
This study included responses from the top three AI chatbots by market share as of August 2024: ChatGPT-4o mini, Microsoft Copilot, and Gemini.10 For each AI chatbot, we used the version available online free of charge, without account registration. We selected keywords that patients with CKD would frequently use. We used Google Trends to identify the most frequently searched keywords related to CKD in the one-year period from July 22, 2023. We excluded the following keywords: (1) overlapping keywords (e.g., “chronic kidney disease CKD”), (2) unrelated keywords (e.g., “chronic kidney disease cat”), (3) keywords that sound unnatural when entered as prompts (e.g., “Can you tell me about ‘chronic kidney disease about’”), and (4) keywords for healthcare professionals (e.g., “chronic kidney disease nursing”). From July 27 to 29, 2024, we entered the prompt “Tell me about [keyword]” into each AI tool. The same prompt was entered three times to allow for possible fluctuations in the AI tools’ responses to the same question, and all responses were exported to Word (Microsoft Corp.) each time. The input history was cleared before entering the next prompt. We did not provide the AI tools with any specialized medical training for generating medical information. Instead, reflecting the typical usage patterns of patients and the general public, we employed simple, keyword-based prompts without additional context or guidance.
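To make the resulting study design concrete, the following is a minimal sketch in R (the software used for the analyses) that reconstructs the prompt grid described above: 23 keywords entered into three tools, three times each, yielding the 207 responses evaluated. The keyword vector is a placeholder; the actual keywords are those listed in Table 1.

# Minimal R sketch of the prompt grid (illustrative; not the authors' code)
keywords <- paste("keyword", 1:23)   # placeholder for the 23 CKD-related keywords
tools    <- c("ChatGPT-4o mini", "Microsoft Copilot", "Gemini")
reps     <- 1:3                      # each prompt entered three times
grid <- expand.grid(keyword = keywords, tool = tools, rep = reps,
                    stringsAsFactors = FALSE)
grid$prompt <- paste0("Tell me about ", grid$keyword)
nrow(grid)                           # 23 x 3 x 3 = 207 responses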
Evaluation methods
Previous studies have reported that AI-generated information is not always accurate,8,9 and there is a high risk of misinformation if AI answers are given to patients and the general public as is. Therefore, it is necessary to confirm that the generated information is medically accurate. In this study, we used the Quality Analysis of Medical Artificial Intelligence (QAMAI) tool11 to analyze the reliability of the responses generated by the AI tools. The DISCERN criteria, on which this instrument is based, were developed in 1999 by a University of Oxford research team to evaluate the quality and reliability of consumer health information on treatment options.12 Singh et al. modified DISCERN into the mDISCERN tool,13 which is based on a five-point Likert scale examining goals, reliability of information sources, bias, areas of uncertainty, and additional sources. In 2024, Vaira et al. developed the QAMAI criteria, consisting of six elements (accuracy, clarity, relevance, completeness, provision of sources and references, and usefulness), to adapt mDISCERN for evaluating AI-generated health information.11 Each of the six items is rated on a five-point scale, giving total scores ranging from 6 to 30 (Supplemental Table 1). The threshold for each item was set at four points or higher. The tool has been validated for structural validity, internal consistency, inter-rater reliability, and retest reliability. To evaluate accuracy and clarity, we used international guidelines14 and standards of good clinical practice as references.
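The scoring arithmetic is straightforward; the following R sketch, using hypothetical ratings rather than study data, computes a total QAMAI score and the per-item threshold flags applied in this study.

# Illustrative QAMAI scoring in R (hypothetical ratings, not study data)
ratings <- c(accuracy = 4, clarity = 5, relevance = 4,
             completeness = 4, sources = 2, usefulness = 4)
total <- sum(ratings)   # total score ranges from 6 to 30
met   <- ratings >= 4   # per-item threshold: four points or higher
total                   # 23 for this hypothetical response
met                     # all items meet the threshold except "sources"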
Statistical analysis
Descriptive statistics were used to summarize the characteristics of the AI responses and the QAMAI scores. The Kruskal-Wallis test was performed to compare scores, as nonparametric data, across the AI tools. The chi-square test was used for comparisons of binary data. Because the QAMAI tool is a subjective measure, two board-certified internal medicine specialists independently evaluated one-fifth of the AI responses. The intraclass correlation coefficient (ICC) was calculated to examine inter-rater reliability; an ICC of 0.75 or higher was considered to indicate good reliability. All statistical analyses were conducted using R (version 4.4.0; 2024-04-24). P < 0.05 was considered statistically significant.
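As a sketch of how these analyses map onto R, the calls below assume a hypothetical data frame df with one row per response (columns tool, qamai_total, and per-item scores) and two placeholder vectors of duplicate ratings; the irr package supplies the ICC.

# Illustrative R calls for the reported tests (assumed data layout)
kruskal.test(qamai_total ~ factor(tool), data = df)  # QAMAI totals across the three tools
chisq.test(table(df$tool, df$accuracy >= 4))         # binary "met threshold" outcome by tool

library(irr)                             # inter-rater reliability on the
icc(cbind(rater1, rater2),               # double-rated one-fifth subset
    model = "twoway", type = "agreement")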
Results
Table 1. List of questions provided to the AI tools.
a Search demand: the values displayed in Google Trends are relative to the total number of searches, from a minimum of 0 to a maximum of 100.
Issues regarding the quality of AI-generated responses
For each domain of the QAMAI tool, the median (IQR) values for accuracy, clarity, relevance, completeness, provision of sources and references, and usefulness were 4 (2), 4 (1), 4 (1), 4 (1), 3 (4), and 4 (1), respectively. The numbers of AI responses that met the item criterion (≥4 points) were 122 (58.9%), 118 (57.0%), 197 (95.2%), 152 (73.4%), 100 (48.3%), and 105 (50.7%), respectively. For provision of sources and references, which had the lowest median score among the items, the score distribution varied widely. Although 129 (62.3%) AI responses provided reliable information sources created by medical institutions or government agencies, 4 (1.9%) gave invalid URLs, and 9 (4.3%) provided the name of a website but no URL. In addition, 5 (2.4%) presented only the URL of a top page, making it difficult for patients to access information about CKD.
Comparison of response quality among AI tools
Among the AI tools, Copilot exhibited the highest quality (median (IQR) QAMAI score: ChatGPT versus Copilot versus Gemini = 18 (2) versus 25 (3) versus 24 (5), p < 0.01, Kruskal-Wallis test) (Figure 1). Table 2 shows the percentage of responses that met the criteria for each QAMAI item for each AI tool. For accuracy, none of the ChatGPT responses satisfied the criterion. In particular, three (4.3%) of the ChatGPT responses about CKD medications referred to fictitious drug names, and one response gave a false statement of a drug's mechanism (e.g., that spironolactone is used to promote the excretion of potassium). For provision of sources and references, none of the ChatGPT responses provided references, while 67 (97.1%) and 33 (47.8%) of the Copilot and Gemini responses, respectively, offered reliable references. However, two of the Copilot responses provided links to specific dietary supplements lacking clinical evidence in the description of CKD treatment. Relevance was the only item for which there was no apparent difference between the AI tools in the percentage of responses that met the criterion (ChatGPT vs Copilot vs Gemini = 95.7% vs 97.1% vs 92.8%, p = 0.61, chi-square test).
Figure 1. Details of the QAMAI score distribution.
Table 2. Number of responses that met the threshold for each QAMAI item, by AI tool (chi-square test).
Discussion
This study showed that the AI tools returned good-quality responses to common patient questions about CKD. In particular, the AI responses scored highly on relevance and completeness. The variation in the quality of answers within the same AI tool was small, indicating stable generation of quality medical information for patients. However, there were significant differences in performance among the AI tools. ChatGPT, which has the largest number of users, had the lowest QAMAI score of the three tools, especially in terms of accuracy of information and presentation of sources. Even Copilot and Gemini, which returned higher-quality responses than ChatGPT, were not fully user-friendly in some areas, such as incomplete source presentation in some cases.
Our findings are consistent with those of previous studies, which showed that AI-generated responses were mostly appropriate.9,15–17 Prior research comparing the quality of responses by AI tools in 2023 and 2024 has shown that quality varied depending on the tool. Some earlier studies found that ChatGPT was more accurate than other AI tools, including Bing AI and Google Bard, the predecessors of Copilot and Gemini, respectively.18 However, more recent studies focusing on urological health information in 2024 reported that responses generated by ChatGPT were more difficult to read and understand than those produced by other tools.19,20
This study is the first to test the reliability of AI-generated information in the field of CKD. The results can provide a better understanding of how LLMs handle frequently asked questions about CKD and offer suggestions for the future development and updating of AI tools. For healthcare professionals involved in patient education, the results offer guidance on what to keep in mind when disseminating AI-generated medical information about CKD to patients. Healthcare professionals must be aware that their patients are at greater risk than ever of being exposed to misinformation through AI tools. Academic organizations and medical institutions have traditionally produced educational materials that simplify evidence-based clinical guidelines for patients and the general public. Professional organizations should supplement these materials with guidance on the risks of misinformation that patients and the public are likely to face and on how to critically appraise information provided by AI.
This study has several limitations. First, it is limited in comprehensiveness. Since AI tools generate customized responses for each user, this study could not cover the information presented to all users. In addition, while users may explore their questions further based on AI responses, this study did not include follow-up questions. Second, our study evaluated responses from the generative AI tools as of July 2024; AI tools may be updated over time, which could alter their performance and the validity of the present results. Additionally, since this research focused exclusively on text-based medical information, generative AI tools for images or videos were beyond the scope of our evaluation. Nonetheless, we recognize that audiovisual formats (e.g., videos, podcasts) are gaining popularity in patient education, and future studies may benefit from exploring a broader range of AI modalities to accommodate these evolving trends. Furthermore, although the QAMAI tool has been verified for reliability and validity, it is unclear whether it is applicable in the Japanese healthcare context, since we directly applied the original English version. Further research will be needed to develop and validate a Japanese version of the QAMAI tool.
Conclusion
This study demonstrated that AI responses to frequently asked questions about CKD were of acceptable quality. They provided mostly reliable and complete information; however, a few AI responses contained critical misinformation. Moreover, some responses lacked references, such as specific URLs, making it difficult for patients to access detailed information. As maintaining reliable information is crucial to safeguarding laypeople from potential harm, healthcare professionals should be aware of the quality of AI-generated medical information.
Footnotes
Acknowledgments
We thank Dr Luigi Angelo Vaira of the University of Sassari for sharing material on coding the QAMAI tool.
Ethical considerations
This study was exempted from approval by the Research Ethics Committee of the University of Tokyo Graduate School of Medicine and Faculty of Medicine as the materials were available to the public and did not include patient records or personal information.
Author contributions
EF designed the study, the main conceptual ideas, and the proof outline. EF and YN collected and analyzed the data. EF and YN analyzed and interpreted the results and drafted the manuscript. EF, TO, and HO aided in interpreting the results and worked on the manuscript. TK supervised the project.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Grants-in-Aid for Scientific Research (KAKEN) [grant number 24K23676].
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author, Emi Furukawa, upon reasonable request.
Supplemental Material
Supplemental material for this article is available online.
References
