Abstract
Background
Professional opinion polling has become a popular means of seeking advice for complex nephrology questions in the #AskRenal community on X. ChatGPT is a large language model with remarkable problem-solving capabilities, but its ability to provide solutions for real-world clinical scenarios remains unproven. This study seeks to evaluate how closely ChatGPT's responses align with current prevailing medical opinions in nephrology.
Methods
Nephrology polls from X were submitted to ChatGPT-4, which generated answers without prior knowledge of the poll outcomes. Its responses were compared with the poll results (inter-rater agreement) and with a second set of responses generated after a one-week interval (intra-rater agreement) using Cohen's kappa statistic (κ). Subgroup analysis was performed based on question subject matter.
Results
Our analysis comprised two rounds of testing ChatGPT on 271 nephrology-related questions. In the first round, ChatGPT's responses agreed with poll results for 163 of the 271 questions (60.2%; κ = 0.42, 95% CI: 0.38–0.46). In the second round, conducted to assess reproducibility, agreement improved slightly to 171 out of 271 questions (63.1%; κ = 0.46, 95% CI: 0.42–0.50). Comparison of ChatGPT's responses between the two rounds demonstrated high internal consistency, with agreement in 245 out of 271 responses (90.4%; κ = 0.86, 95% CI: 0.82–0.90). Subgroup analysis revealed stronger performance in the combined areas of homeostasis, nephrolithiasis, and pharmacology (κ = 0.53, 95% CI: 0.47–0.59 in both rounds), compared to other nephrology subfields.
Conclusion
ChatGPT-4 demonstrates modest capability in replicating prevailing professional opinion in nephrology polls overall, with varying performance levels between question topics and excellent internal consistency. This study provides insights into the potential and limitations of using ChatGPT in medical decision making.
Background
Healthcare is a constantly evolving landscape in which medical professionals frequently encounter complex clinical scenarios without straightforward solutions that align neatly with established guidelines.1,2 This highlights the necessity of drawing on the expertise of colleagues who can share unique insight gleaned from personal experience.3–6 Nephrology is a field in which this practice has become common due to the increased prevalence of co-morbid conditions, new etiologies of kidney injury (e.g. immunotherapies), and novel therapeutics (e.g. SGLT2 inhibitors, GLP-1 receptor agonists, endothelin receptor antagonists) that have added to disease complexity and burden.7,8 The intricacy of nephrology cases and the absence of clear guidelines for specific situations accentuate the need for expert opinion and consensus in clinical decision making.9,10
Professional collaboration with the goal of optimizing patient care is critically important, as kidney diseases impact millions worldwide. The Global Burden of Disease study indicates that chronic kidney disease (CKD) was a leading cause of morbidity and mortality globally in 2017, with an estimated prevalence of 9.1%.11 As the field of nephrology adapts to meet these evolving challenges, professionals have recognized the importance of staying connected and leveraging collective expertise to address the complexities inherent in managing kidney disease. To that end, the #AskRenal community on X has emerged as a valuable platform for nephrologists to exchange knowledge, seek advice, and engage in discussions about challenging cases.12 The community uses an automated account to engage the nephrology community by broadcasting questions related to the field. This approach has allowed those with smaller social media followings or who are still in training to participate in discussions, enabling the widespread dissemination of nephrology knowledge to all members of the community.12 Queries are typically posed using X's polling feature, which allows users to quickly survey a large audience of nephrologists. The collective intelligence gained from this practice not only helps in navigating intricate medical scenarios but also fosters a deeper sense of community among specialists who might otherwise be isolated by the specifics of their practice.
ChatGPT, a sophisticated large language model (LLM) developed by OpenAI, has demonstrated remarkable capabilities in various fields including healthcare.13–18 Its proficiency in interpreting natural language inputs and using deep learning techniques to produce human-like responses has spurred considerable interest in its potential applications in medical decision making.19,20 However, its ability to address real-world scenarios that arise in day-to-day clinical practice remains unproven. We set out to examine this ability in the present study by assessing ChatGPT's effectiveness in answering polls posted by the nephrology community on X. These queries reflect typical issues encountered in practice, ranging from diagnosis and management of kidney diseases to patient care decisions underpinned by intricate medical data. Our objective was to determine how well ChatGPT aligns with the current medical consensus among nephrologists. By doing this, we aimed to identify both the strengths and potential limitations of using such advanced AI systems in a healthcare environment. This evaluation not only helps in understanding ChatGPT's capabilities but also assists in pinpointing areas where the model might need further refinement or additional training to enhance its utility in clinical decision support systems. Through this analysis, we seek to contribute to the broader discourse on the integration of AI technologies like ChatGPT in medicine, emphasizing the importance of aligning these tools with professional healthcare practices and standards.
Methods
This study was designed as an observational analysis comparing responses to nephrology-related polls from the social media platform X with those generated by the large language model ChatGPT-4. The research was conducted at Mayo Clinic in Rochester, Minnesota, USA, over a 2-week period from April 1 to April 15, 2024. We utilized publicly available poll data from the #AskRenal community on X, which represents an international group of medical professionals and individuals interested in nephrology-related topics. The study aimed to evaluate the alignment between prevailing professional opinions in nephrology and AI-generated responses across various subspecialty areas within the field.
#AskRenal dataset
Nephrology-related opinion polls were obtained from posts by independent users on the social media site X. Posts targeted toward the professional nephrology community were identified by their inclusion of the hashtag #AskRenal, and all polls posted between April 2021 and March 2024 were considered. To mitigate the potential impact of nonexpert responses, we implemented strict inclusion criteria for the polls. Each poll under consideration was reviewed qualitatively by members of our team. Polls were included if they were deemed to pose a medically relevant question pertaining to a topic within nephrology and had a definitive voting result (a majority of respondents selecting a particular answer). Exclusion criteria included: non-multiple-choice format, irrelevant topics (issues unrelated to nephrology or those soliciting personal opinions on nonmedical topics), insufficient response (fewer than 10 respondents), and lack of clarity (e.g. excessive typos, unclear phrasing, or extraneous text that created ambiguity in how the query could be interpreted). These criteria yielded 271 polls, which were then submitted to ChatGPT-4, the most recent version of ChatGPT available from OpenAI at the time of the study (April 2024). Each poll was submitted to ChatGPT in its complete form with the answer choices provided. Polls were proofread prior to submission, and edits were made in rare circumstances for the sake of clarity; for instance, if major typos were present that could cause reader confusion. In each circumstance, the content and phrasing of the text were preserved as much as possible to match the original post. Extraneous text and hashtags were removed if they were not germane to the query.
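These screening rules amount to a simple conjunctive filter. As a minimal illustration (the column names below are hypothetical, and the actual screening was performed qualitatively by the study team rather than programmatically), the logic could be expressed in R as follows:

```r
# Hedged sketch of the inclusion screen; column names are hypothetical.
# In the study itself, each poll was reviewed qualitatively by the team.
eligible <- subset(
  all_polls,
  is_multiple_choice &        # exclude non-multiple-choice formats
    is_nephrology_related &   # exclude irrelevant topics
    n_respondents >= 10 &     # exclude insufficient response
    is_clearly_phrased &      # exclude ambiguous or garbled wording
    has_majority_answer       # require a definitive voting result
)
nrow(eligible)                # 271 polls met all criteria in this study
```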
ChatGPT queries
ChatGPT was provided with the following prompt a single time at the beginning of the inquiry process: “I am going to ask you a multiple choice question. Please pick the best answer choice of the options provided.” The 271 polls were then entered individually into the ChatGPT interface (Figure 1). ChatGPT was blinded to the poll results and generated each response without knowledge of the outcome of the popular vote. Every response was recorded. In cases where ChatGPT did not commit to a single answer, it was re-prompted with the phrase “please choose the single best answer.” Agreement was documented if ChatGPT's response matched the popular vote for a given poll, and disagreement was documented otherwise. This process was performed twice, with the two inquiry rounds spaced one week apart. The sequence in which the polls were entered into the ChatGPT interface was randomized for each round.

Figure 1. Examples of ChatGPT-4 responses to nephrology polls posted in the #AskRenal community on X.
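Although responses were collected by hand through the ChatGPT interface, the per-round bookkeeping described above (randomizing poll order and flagging agreement with the poll majority) is straightforward to express in R. The sketch below is illustrative only; the file and column names are hypothetical and not drawn from the study's actual workflow.

```r
# Illustrative bookkeeping for one inquiry round (hypothetical file and
# column names; ChatGPT responses were entered and recorded manually).
polls <- read.csv("askrenal_polls.csv")    # the 271 screened polls
set.seed(2024)                             # order re-randomized each round
polls <- polls[sample(nrow(polls)), ]      # shuffle presentation order

# After recording ChatGPT's choice for each poll:
polls$agree <- polls$chatgpt_answer == polls$poll_majority
mean(polls$agree)                          # raw proportion of agreement
```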
Quantitative analysis
The percentage of agreement between ChatGPT's responses and the polling outcomes was recorded. In addition, Cohen's kappa statistic (κ) was calculated to quantify the degree of inter-rater agreement between ChatGPT-4 and the poll results for each of the two inquiry rounds separately, as well as the intra-rater agreement between the two rounds themselves. Cohen's kappa was chosen for our analysis as it provides a straightforward measure of agreement that accounts for chance agreement and is particularly suitable for categorical data. Kappa values, which range from 0 to 1 for agreement, were interpreted using the following thresholds: ≤0.20 (no agreement), 0.21–0.39 (minimal agreement), 0.40–0.59 (weak agreement), 0.60–0.79 (moderate agreement), 0.80–0.90 (strong agreement), and >0.90 (almost perfect agreement).21 Each of the 271 questions was then categorized into one of three categories based on its subject matter for subgroup analysis to explore variability in agreement across medical topics. Data were managed and analyzed using R statistical software (version 4.1.0).
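For reference, Cohen's kappa can be computed in base R without additional packages. The sketch below is not the study's analysis script: the answer vectors are invented for illustration, and the bootstrap confidence interval is an assumed method, as the paper does not state how its intervals were derived.

```r
# Cohen's kappa for two vectors of categorical answers (base R).
cohens_kappa <- function(r1, r2) {
  lv  <- union(r1, r2)
  tab <- table(factor(r1, levels = lv), factor(r2, levels = lv))
  n   <- sum(tab)
  po  <- sum(diag(tab)) / n                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Interpretation thresholds used in this study (McHugh, 2012):
interpret_kappa <- function(k) {
  cut(k, breaks = c(-Inf, 0.20, 0.39, 0.59, 0.79, 0.90, Inf),
      labels = c("none", "minimal", "weak", "moderate", "strong",
                 "almost perfect"))
}

# Hypothetical example: poll majorities vs. ChatGPT's answers.
poll    <- c("A", "B", "A", "C", "D", "B", "A", "C")
chatgpt <- c("A", "B", "C", "C", "D", "A", "A", "C")
k <- cohens_kappa(poll, chatgpt)
interpret_kappa(k)

# Bootstrap 95% CI (assumed method; not specified in the paper).
set.seed(1)
boots <- replicate(2000, {
  i <- sample(length(poll), replace = TRUE)
  cohens_kappa(poll[i], chatgpt[i])
})
quantile(boots, c(0.025, 0.975))
```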
Qualitative analysis
A qualitative assessment of ChatGPT's responses was conducted to complement the quantitative analysis, with the aim of identifying patterns, strengths, and limitations in the answers. Two nephrologists on our team independently reviewed each response. Responses were evaluated with attention to answer accuracy (alignment with established medical knowledge and guidelines), relevance (appropriateness of the response to the question asked), depth (the level of detail and explanation provided), and clarity (the ease with which the response could be understood by a medical professional). The responses were also assessed thematically to identify common themes, strengths, and potential areas for improvement in ChatGPT's performance. The two nephrologists' evaluations for each question were compared, and any major discrepancies were resolved through discussion and consensus.
Results
Dataset characteristics
The dataset comprised 271 nephrology-focused poll questions spanning a diverse range of medical topics. A substantial portion of the polls received dozens to hundreds of responses within a few hours of being posted. All but one of the polls were written in English; the remaining poll was written in Spanish. The questions were categorized into three broad subject areas: 1) CKD, end-stage renal disease (ESRD), and kidney transplantation (n = 117); 2) glomerular disease, hypertension, acute kidney injury (AKI), and critical care (n = 79); and 3) homeostasis, nephrolithiasis, and pharmacology (n = 75) (Table 1). The homeostasis, nephrolithiasis, and pharmacology category comprised all questions concerning electrolyte and acid-base disorders; mineral, bone, and stone diseases; and pharmacotherapy.
Table 1. Reliability assessment of responses against poll results across two rounds and internal comparison between rounds for different medical categories.
# (%) refers to the number of items and the percentage of the total.
CKD: chronic kidney disease; ESRD: end-stage renal disease.
Inter-rater agreement
ChatGPT responses agreed with the poll results for 163 of the 271 questions (60.2%; κ = 0.42, 95% CI: 0.38–0.46) in the first round of inquiry and 171 out of 271 (63.1%; κ = 0.46, 95% CI: 0.42–0.50) in the second (Table 1). Agreement was highest for questions related to homeostasis, nephrolithiasis, and pharmacology, with the same level of inter-rater agreement (66.7%; κ = 0.53, 95% CI: 0.47–0.59) observed across both rounds. For questions related to CKD, ESRD, and kidney transplantation, there was 62.4% agreement (κ = 0.43, 95% CI: 0.37–0.49) between ChatGPT and the poll results in the first round of inquiry and 64.1% (κ = 0.45, 95% CI: 0.39–0.51) in the second. The glomerular disease, hypertension, AKI, and critical care category had the lowest agreement rates: 50.6% (κ = 0.28, 95% CI: 0.20–0.36) in the first inquiry round and 58.2% (κ = 0.39, 95% CI: 0.31–0.47) in the second. Inter-rater results by subject are summarized in Figure 2(a).

Figure 2. Agreement by question category.
Intra-rater agreement
Comparison of the two sets of responses given by ChatGPT demonstrated internal agreement in 245 out of 271 responses (90.4%; κ = 0.86, 95% CI: 0.82–0.90) overall. Agreement was highest for questions related to homeostasis, nephrolithiasis, and pharmacology at 94.7% (κ = 0.93, 95% CI: 0.87–0.99). This was followed by the CKD, ESRD, and kidney transplantation category (89.7%; κ = 0.85, 95% CI: 0.79–0.91). Agreement was slightly lower for the glomerular disease, hypertension, AKI, and critical care category (87.3%; κ = 0.82, 95% CI: 0.74–0.90). Internal agreement results by subject are summarized in Figure 2(b).
Discussion
Previous studies have investigated the performance of LLMs in various medical disciplines, such as radiology and dermatology.22–25 To our knowledge, this is the first study to assess the agreement between an LLM and expert opinion polls in nephrology. Published work thus far has generally shown the effectiveness of AI and machine learning models in processing and analyzing medical data, particularly in diagnostic imaging and patient data management. Their application in directly aiding clinical decision making through interpreting complex case questions remains less explored.26–28 Assessing the alignment between a language model's responses and popular poll outcomes in complex nephrology cases offers insights into the potential of advanced LLMs to support healthcare professionals in navigating challenging situations. Furthermore, it identifies areas where a model's performance requires further refinement, and it contributes to the expanding knowledge base on artificial intelligence (AI) applications in healthcare, setting the stage for future research and development on the potential applications and limitations of AI in healthcare decision making.24–26
We found that ChatGPT demonstrated modest overall ability in replicating prevailing medical opinion in nephrology polls in terms of the percentage of agreement between its answers and the poll results, with slight improvement from the first round of inquiry to the second. Cohen's kappa scores indicated weak overall inter-rater agreement in both rounds. Inter-rater agreement was minimal for questions related to glomerular disease, hypertension, AKI, and critical care nephrology in both rounds. ChatGPT exhibited excellent internal consistency in its answers between rounds, with near-perfect intra-rater agreement for questions related to homeostasis, nephrolithiasis, and pharmacology. Though other suitable metrics of inter-rater agreement exist, Cohen's kappa was chosen for this study as it is a robust, validated, and widely accepted measure that accounts for chance agreement.29 The variability ChatGPT exhibited between different nephrology topics suggests that its performance may be influenced by the depth and quality of its training data in specific medical subtopics. Areas with extensive and well-represented data likely led to better alignment with nephrologist responses, while those with less representation could have resulted in poorer performance. Though these results show some initial promise, further development is required to establish ChatGPT's utility in real-world clinical practice.
One major consideration regarding the use of ChatGPT and similar LLMs in research and medicine is their propensity for generating false or fabricated information, often referred to as “hallucinations.”30,31 Our qualitative analysis of ChatGPT's answers to the multiple-choice questions did not reveal any obvious hallucinations or falsified information. However, it is important to note that the format of our study (multiple-choice questions) may have limited the opportunity for such fabrications to occur. Responses were found to be generally accurate and relevant to the questions posed. The model demonstrated a solid understanding of nephrology concepts, particularly in areas with well-established clinical guidelines and those concerning physiology, such as homeostasis, nephrolithiasis, and pharmacology. The depth of ChatGPT's explanations varied across questions. While some responses were detailed and provided comprehensive explanations, others were more simplistic. It typically excelled with questions requiring straightforward factual knowledge but showed limitations in navigating complex scenarios requiring nuanced clinical judgment. An interesting observation made during the inquiry process was that, for questions in which “it depends” was one of the available answer choices, ChatGPT would almost invariably choose it over one of the other pre-defined answer choices (Figure 1). In doing so, it seemed to give itself room to “hedge” by elaborating on why the other answer choices might be equally valid but more appropriate in specific settings. This approach, while cautious, may reflect an understanding of the complexity of medical decision making, and could prove useful in clinical practice when a given problem may have multiple solutions. In contrast to ChatGPT, the poll respondents tended to favor predefined answers. ChatGPT's performance was less consistent in areas with less clear guidelines or that were more situationally dependent, such as glomerular disease, hypertension, AKI, and critical care. The consensus among the nephrologists on our team was that the model would require further refinement before it could be clinically useful in these areas. Impressively, there was no lapse in performance with the use of emojis (e.g. up or down arrows) to replace text, or with context-specific abbreviations (e.g. p-uria for proteinuria). These findings suggest that ChatGPT is capable of processing a wide range of nontraditional natural language inputs and could potentially be utilized in diverse linguistic and clinical settings, enhancing its adaptability and usefulness in global healthcare environments.
Machine learning-driven systems have been developed to assist nephrologists in specific aspects of clinical practice, such as predicting the development of ESRD in patients with chronic kidney disease and optimizing renal allograft allocation.32,33 While the potential of AI tools in nephrology is promising, it is important to note that their impact on patient outcomes and healthcare costs remains largely theoretical at this stage. LLMs like ChatGPT could potentially support clinical practice, but rigorous clinical studies are needed to validate their effectiveness and safety in real-world healthcare settings. It is crucial to emphasize the importance of human oversight and collaboration in AI-assisted nephrology decision making.23 While AI tools can provide valuable insights and support, they should not replace the expertise and judgment of nephrologists. Collaborative decision-making processes that combine AI-generated insights with the knowledge and experience of nephrologists are essential for ensuring the safe and effective use of AI in managing disease.
To facilitate the responsible integration of AI tools like ChatGPT into nephrology, several key aspects must be addressed. First, there is a need for standardized evaluation frameworks and benchmarks to assess the performance and reliability of AI tools in various nephrological domains, such as glomerular diseases, tubular disorders, and electrolyte imbalances. This would enable the comparison of different AI models and help ensure their safe and effective integration into nephrological practice. Second, the continual monitoring and updating of AI models in nephrology are essential to ensure their performance remains optimal as nephrological knowledge evolves and new clinical data on kidney diseases becomes available. The potential for real-time learning and adaptation could enhance the long-term utility of AI tools in managing kidney health. Finally, interdisciplinary collaboration among nephrologists, AI researchers, ethicists, and policymakers is crucial for driving the responsible development and deployment of AI tools in nephrology. Such collaborations can help address the technical, ethical, and regulatory challenges associated with AI-assisted nephrology decision making, ultimately improving patient care and outcomes in the management of kidney diseases.
The integration of advanced AI tools like ChatGPT into medical decision-making processes highlights a transformative phase in healthcare, particularly in specialized fields such as nephrology. The utilization of the #AskRenal dataset to assess the agreement between ChatGPT's generated responses and popular poll outcomes is an innovative approach that not only tests the applicability of AI in real-world medical scenarios but also explores its potential as a supportive tool for healthcare professionals. As AI continues to permeate the medical field, addressing ethical considerations and ensuring transparency in AI decision-making processes is paramount. It is essential to maintain a clear protocol for AI's use in clinical settings, safeguard patient privacy, and provide transparent documentation of AI's reasoning paths, which could be crucial for gaining trust among healthcare providers and patients alike.27,34,35 The study underscores the potential and challenges of using AI tools like ChatGPT in medical decision making. As AI technology evolves, its integration into healthcare could significantly enhance the efficiency and accuracy of medical consultations and patient care strategies.
Limitations
Despite the promising findings, this study has several limitations. First, the #AskRenal dataset, while diverse, may not be representative of all nephrology questions encountered in clinical practice. Second, the study relied on a specific version of ChatGPT (4.0) and may not reflect the performance of other language models or future iterations. Given the dynamic nature of AI models like ChatGPT, future research should establish protocols for regular re-evaluation of these tools in nephrology contexts. This could involve creating a standardized set of nephrology questions that can be used to benchmark different versions of AI models over time, allowing for tracking of performance improvements or changes across iterations. Third, the multiple-choice nature of the poll questions may not capture the full spectrum of expert opinions. Fourth, while the polls were specifically sourced from nephrologists and meant to be answered by the nephrology community, they were open to public engagement, and it was not possible to ascertain the qualifications of all the respondents. We aimed to use polls with a large number of respondents to minimize the effect of this potentially confounding factor and to best capture prevailing professional medical opinions. Analyzing data from platforms such as X demonstrates how AI tools like ChatGPT perform in practical real-world settings where information is crowdsourced and not always curated. However, we acknowledge that this approach may not provide the same level of certainty as a comparison with verified nephrology experts. Future studies should consider comparing ChatGPT's performance against responses from a panel of verified nephrology experts. This could involve creating a standardized set of nephrology questions and having both ChatGPT and a group of certified nephrologists answer them. Such an approach would provide a more controlled evaluation of ChatGPT's capabilities and could serve as a complementary analysis to the real-world, crowdsourced data used in the present study. Lastly, future studies with larger datasets are needed to conduct additional subgroup analyses based on question types (e.g. diagnosis vs. data interpretation vs. management).
Future directions
This study's reliance on a single AI model and the static nature of the dataset may limit its applicability to the full range of clinical nephrology practice. To establish the generalizability of these findings, further validation using different AI platforms across various medical settings and disciplines is necessary. Future research should explore factors influencing ChatGPT's performance in different nephrology topics and investigate strategies to improve its accuracy and consistency. Incorporating a broader and more varied dataset, as well as integrating domain-specific knowledge and expert feedback into the training process, may enhance AI performance. Training AI models on specific subfields within nephrology could improve their agreement with human expert opinions and provide more precise support to nephrologists, accommodating the unique challenges of each subfield. Additionally, the development of specialized AI tools tailored for different subfields and implementing systems capable of real-time learning from ongoing medical cases and expert feedback could refine AI's decision-making capabilities, making them more adaptable to the dynamic nature of medical knowledge. Expanding AI capabilities through diversified training and interactive medical settings and integrating AI in multidisciplinary teams could lead to more personalized and precise medical care, enhancing decision-making processes in nephrology and beyond.
Conclusion
This study provides insights into the potential and limitations of using AI-based language models like ChatGPT-4 in medical decision making, specifically in complex nephrology cases. The language model demonstrated modest capability and excellent internal consistency in replicating prevailing professional judgments. While this indicates that ChatGPT has potential to provide relevant medical insights in real-world scenarios, its full capabilities as a clinical adjunct remain unproven. The complexity of medical decision making will require continuous enhancements in AI technology. This study contributes to the understanding of AI's current capabilities and limitations in healthcare and sets the stage for future advancements that could revolutionize how medical knowledge is utilized and disseminated in the field of nephrology. As AI becomes an integral part of medical practice, continuous evaluation and adaptation will be essential to fully realize its potential in improving patient care outcomes. It will also be crucial to engage healthcare professionals, researchers, and ethicists in the development and evaluation of these technologies to ensure their safe, effective, and equitable integration into clinical practice.
Footnotes
Contributorship
JHP and WC were involved in conceptualization, funding acquisition, visualization, and writing—original draft; JHP in data curation, formal analysis, investigation, and resources; JHP, JM, and IMC in methodology; CT and SS in project administration; CT, JM, and IMC in supervision; WC in validation; and CT, SS, JM, and IMC in writing—review & editing. All authors have read and agreed to the published version of the manuscript.
Data availability statement
The data underlying this article will be shared on reasonable request to the corresponding author.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethics approval
This study does not require Ethics Committee or Institutional Review Board approval because it does not involve human or animal subjects, nor does it include patient information or identifiable personal data. Consequently, participant consent was waived for the same reasons.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Language model use
The use of ChatGPT in this study was strictly limited to the response-generating protocol described in the methods section. ChatGPT was not used for data analysis, writing, or any other aspects of the production of this manuscript.
Guarantor
WC
