Abstract
Introduction
Artificial intelligence (AI)-powered chatbots, such as ChatGPT-4 and DeepSeek, are increasingly utilized in providing medical information. However, their accuracy, comprehensiveness, and reliability, particularly in specialized fields such as colorectal cancer, remain under-evaluated. This study aimed to compare the performance of ChatGPT-4 and DeepSeek in responding to both community- and expert-oriented questions related to colorectal cancer.
Materials and methods
A total of 30 questions were formulated based on clinical experience, including 15 community-focused and 15 expert-oriented questions. On February 13, 2025, ChatGPT-4 (OpenAI, version 4.0) and DeepSeek-R1 (initial January 2025 release) were queried simultaneously in a single session. Responses were independently evaluated by four colorectal surgery experts for appropriateness (0-100), comprehensiveness (0-100), and reference provision (yes/no). Statistical analyses included Mann-Whitney U and chi-square tests, with significance set at p < 0.05.
Results
ChatGPT-4 and DeepSeek demonstrated comparable appropriateness scores (94.0 vs. 92.25), with no statistically significant difference between the two models. ChatGPT-4 achieved significantly higher comprehensiveness scores for community-oriented questions, and neither chatbot provided any scientific references or verifiable sources in its responses.
Discussion
Both chatbots exhibited distinct strengths and limitations. ChatGPT-4 demonstrated superior comprehensiveness in community-oriented responses, whereas DeepSeek responses were rated with slightly higher inter-rater consistency. The absence of scientific references represents a major limitation, restricting clinical applicability and reliability. Enhancing reference support and response consistency is essential before AI-powered chatbots can be safely integrated into colorectal cancer-related clinical decision-making.
Keywords
Introduction
Artificial intelligence (AI)-supported information systems (chatbots), especially large-language models such as ChatGPT, are rapidly becoming widespread in the field of medical information. Although these models are seen as a potential source of information for patients and health professionals, there is insufficient research on important issues such as the accuracy, comprehensiveness, and citation of their answers.1–3 The contribution of AI-powered chatbots to health information provision is increasing, and models such as ChatGPT-4 and DeepSeek stand out as important tools that aim to facilitate access to medical information for both patients and healthcare professionals.4–8 However, especially in specialized medical fields such as colorectal cancer, the adequacy of these models in terms of important criteria such as accuracy, comprehensiveness, and reference support is still limited.9–11
Given the high prevalence and clinical complexity of colorectal cancer, evaluating how effectively AI chatbots can convey accurate and comprehensible information in this field is particularly relevant. Colorectal cancer is one of the most common malignancies in both men and women.12 Providing accurate and comprehensive information is critical for both patients and healthcare providers to make informed decisions. Incorrect or incomplete information can lead to misunderstandings, increased anxiety, and poor clinical outcomes. Therefore, it is of great importance to evaluate the performance of AI-powered chatbots in providing reliable and detailed information.13 Although some previous studies have compared the performance of AI chatbots in different medical subjects, data remain limited in the field of colorectal cancer.14–16 In addition, whether chatbots provide verifiable sources in their responses, together with the appropriateness of their content, is a decisive factor in terms of reliability. Understanding the extent to which these criteria are met can contribute to optimizing these models for clinical use. In this context, key evaluation metrics such as appropriateness (the factual accuracy and contextual relevance of responses) and comprehensiveness (the breadth and depth of information provided) were considered essential for assessing chatbot performance.
Recent years have witnessed a surge in the use of AI chatbots for oncological purposes, including patient education, symptom triage, emotional support, and general information delivery. Multiple platforms—such as ChatGPT, Perplexity, Bing AI, and DeepSeek—have demonstrated promising results in answering common cancer-related queries with relatively low rates of misinformation and satisfactory DISCERN (a standardized instrument for judging the quality of written consumer health information on treatment choices) scores.17–19 However, when tasked with more complex clinical duties—such as generating chemotherapy regimens or making treatment recommendations—the performance of these tools markedly declines. For instance, a comparative evaluation showed that ChatGPT and Bing provided correct chemotherapy suggestions in only 5/9 and 4/9 cases, respectively.20
Beyond accuracy, chatbots vary considerably in terms of comprehensibility, actionability, and tone. Studies have revealed that many chatbot outputs are composed at a college reading level,17–19 potentially limiting accessibility for the general public. Moreover, although platforms like Claude AI have been rated highly for empathy and clarity in patient forums,21 their actionability remains limited, with Patient Education Materials Assessment Tool scores for practical advice often falling below 40%. These limitations raise concerns about the equitable and safe deployment of such tools in clinical settings. Furthermore, the “black-box” nature of large-language models makes it difficult to verify the sources or logic behind their outputs, and bias in training datasets may perpetuate health disparities.22,23
In this study, we compared the responses of two chatbots, ChatGPT-4 and DeepSeek, to patient (community) and expert questions about colorectal cancer. Our aim was to evaluate the ability of these models to provide reliable medical information on colorectal cancer and to identify aspects that need improvement.24 The comparison was based on key performance measures, including appropriateness, comprehensiveness, and referencing, across both patient-oriented and professional-level questions. To our knowledge, this is the first study to directly compare ChatGPT-4 and DeepSeek in colorectal cancer within a dual-audience framework, highlighting their differential performance in patient education and clinical decision support.
Materials and methods
This study was designed to compare the responses generated by two large-language model-based chatbots, ChatGPT-4 (OpenAI, version 4.0) and DeepSeek-R1 (initial release January 2025), to community-oriented (patient) and expert-level questions related to colorectal cancer. All chatbot queries were performed on February 13, 2025, during a single session. The primary aim of the study was to evaluate and compare the quality of medical information provided by these models in terms of appropriateness, comprehensiveness, and reference provision.
Within the scope of the study, a total of 30 questions were developed based on clinical experience, including 15 community-oriented (patient-focused) questions and 15 expert-level (professional) questions. The questions were formulated by two colorectal surgery experts. Community-oriented questions addressed basic informational topics frequently asked by the general public, whereas expert-level questions focused on complex issues relevant to clinical decision-making.
Sample size calculation
Prior to study initiation, a prospective power analysis was conducted. Based on the effect sizes observed in similar comparative studies of AI chatbots, a total sample size of 30 questions (15 per group) was determined to be sufficient to achieve 80% statistical power with an alpha level of 0.05.
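The manuscript does not specify the test or effect size assumed in this power analysis. As an illustration only, the sketch below reproduces such a calculation in Python using the two-sample t-test approximation in statsmodels with an assumed large effect size (Cohen's d ≈ 1.1); under these assumptions, approximately 14–15 questions per group are required for 80% power at α = 0.05.

```python
# Illustrative power-analysis sketch using statsmodels' two-sample t-test
# approximation. The effect size (Cohen's d ~ 1.1, a large effect) is an
# assumption for illustration; the manuscript does not report the exact value.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the required sample size per group at alpha = 0.05 and power = 0.80.
n_per_group = analysis.solve_power(
    effect_size=1.1,      # assumed large effect (not reported in the study)
    alpha=0.05,
    power=0.80,
    alternative="two-sided",
)
print(f"Required questions per group: {n_per_group:.1f}")  # ~14-15
```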
Question formulation
Community-oriented questions were simulated to reflect common patient concerns, such as “If I have a family history of colorectal cancer, is my risk increased?” and “When and how should colorectal cancer screening be performed?” Expert-level questions addressed complex clinical topics relevant to specialist decision-making, including “What is the prognostic and predictive value of microsatellite instability (MSI) and mismatch repair (MMR) defects in colorectal cancer?” and “What is the current evidence on the use of microRNAs as diagnostic and prognostic markers in colorectal cancer?” The questions were not derived from real patient encounters but were developed based on the clinical knowledge and experience of colorectal surgery experts.
All questions were directed to both chatbots (ChatGPT-4, OpenAI, version 4.0; and DeepSeek-R1, initial release January 2025) in the same format. The queries were performed on February 13, 2025, in a single session, and the responses were then coded for evaluation. The text responses received from the chatbots were presented for evaluation without any editing. The evaluation and blinding procedure is described in the following section.
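The study posed each question to the chatbot interfaces directly; the following is only a hedged sketch of how such a standardized, single-session query run could be scripted, assuming the OpenAI Python SDK and DeepSeek's OpenAI-compatible endpoint. The API keys, base URL, and model identifiers shown are assumptions for illustration, not part of the study protocol.

```python
# Illustrative sketch only: scripting a standardized query run against both
# models via their public APIs. Base URL and model identifiers are assumptions.
from openai import OpenAI

questions = [
    "If I have a family history of colorectal cancer, is my risk increased?",
    "What is the prognostic and predictive value of MSI and MMR defects in colorectal cancer?",
    # ... remaining community-oriented and expert-level questions
]

clients = {
    "ChatGPT-4": (OpenAI(api_key="OPENAI_KEY"), "gpt-4"),
    "DeepSeek-R1": (OpenAI(api_key="DEEPSEEK_KEY",
                           base_url="https://api.deepseek.com"), "deepseek-reasoner"),
}

responses = {}
for name, (client, model) in clients.items():
    for q in questions:
        # Each question is sent in the same format to both models.
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": q}],
        )
        responses[(name, q)] = reply.choices[0].message.content
```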
Evaluation protocol and blinding
Chatbot responses were evaluated by four experienced colorectal surgery experts. To minimize bias, all chatbot outputs were anonymized by removing identifiers such as chatbot names, timestamps, and formatting differences, and were subsequently randomized. Responses were presented in a shuffled order, preventing side-by-side comparison. Evaluators were blinded to the identity of the chatbot generating each response and were instructed to focus solely on content quality using predefined criteria. Each response was scored using a 0–100 numerical rating scale, where 0 indicated completely irrelevant or incomplete content and 100 indicated fully appropriate and comprehensive responses, with intermediate scores reflecting expert judgment. Inter-rater reliability for appropriateness and comprehensiveness scores was assessed using the intraclass correlation coefficient (ICC) with a two-way random-effects model for absolute agreement; ICC values were interpreted as <0.50 (poor), 0.50–0.75 (moderate), 0.75–0.90 (good), and >0.90 (excellent).
Consistency in this study refers exclusively to inter-rater agreement among the evaluators, measured using the ICC. We did not evaluate within-model response reproducibility across repeated queries.
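A minimal sketch of how the inter-rater reliability analysis could be reproduced is shown below, assuming the pingouin package and long-format data with hypothetical column names; the ICC2 row corresponds to the two-way random-effects, absolute-agreement, single-measurement model described above.

```python
# Sketch of the inter-rater reliability calculation with pingouin.
# Column names ('question', 'rater', 'score') are hypothetical; long-format
# data with one row per question-rater pair is assumed.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "question": sorted([1, 2, 3, 4] * 4),        # question identifier
    "rater": ["A", "B", "C", "D"] * 4,           # four expert evaluators
    "score": [90, 95, 92, 88, 85, 80, 83, 86,
              70, 75, 72, 78, 96, 94, 98, 92],   # appropriateness (0-100), illustrative
})

icc = pg.intraclass_corr(data=scores, targets="question",
                         raters="rater", ratings="score")

# ICC2 corresponds to a two-way random-effects model, absolute agreement,
# single measurement; CI95% gives the 95% confidence interval.
print(icc[icc["Type"] == "ICC2"][["ICC", "CI95%"]])
```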
The evaluation was based on three main criteria:
Appropriateness (0–100)
Appropriateness measured how accurately and directly the chatbot's response addressed the question, considering factual correctness, clinical relevance, clarity and consistency of terminology, and suitability for the intended audience (community or expert). Higher scores indicated responses that were accurate, contextually appropriate, and free from misleading or hallucinated information.
Comprehensiveness (0–100)
Comprehensiveness assessed the breadth and depth of information provided in each response. Higher scores were assigned to answers that covered multiple relevant aspects of the question, included sufficient explanations or examples, and demonstrated an appropriate level of detail according to the question type (community or expert).
Reference provision (yes/no)
Reference provision was recorded as a binary outcome indicating whether the chatbot response included any scientific citation, clinical guideline, or verifiable source.
The distribution of continuous data was assessed using the Shapiro–Wilk test. Appropriateness and comprehensiveness scores were compared between chatbots using the Mann–Whitney U test, and reference provision was compared using the chi-square test. Statistical significance was set at p < 0.05.
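For illustration, the statistical comparisons described above could be carried out in Python with SciPy as sketched below; the score values and contingency counts shown are hypothetical.

```python
# Minimal sketch of the statistical comparisons described above (SciPy).
# chatgpt_scores / deepseek_scores are assumed to hold per-question mean
# appropriateness or comprehensiveness scores (0-100); values are illustrative.
from scipy import stats

chatgpt_scores = [94, 90, 96, 88, 92]
deepseek_scores = [92, 95, 90, 93, 91]

# Normality check (Shapiro-Wilk) guiding the choice of a non-parametric test.
print(stats.shapiro(chatgpt_scores))

# Mann-Whitney U test for appropriateness/comprehensiveness scores.
print(stats.mannwhitneyu(chatgpt_scores, deepseek_scores, alternative="two-sided"))

# Chi-square test for reference provision (yes/no) as a 2x2 contingency table:
# rows = chatbots, columns = [provided references, did not provide].
table = [[2, 13], [5, 10]]   # illustrative counts only
print(stats.chi2_contingency(table))
```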
Results
Primary outcomes: appropriateness and comprehensiveness
Based on overall average scores, ChatGPT-4 and DeepSeek demonstrated comparable performance in terms of appropriateness (92.25 vs. 94.0, respectively), with no statistically significant difference between the two models.

Comparison of appropriateness and comprehensiveness scores of ChatGPT-4 and DeepSeek chatbots.
Comparison of ChatGPT-4 and DeepSeek chatbots’ appropriateness and comprehensiveness scores by evaluator.
Comprehensiveness in community-oriented questions
When analyses were restricted to community-oriented questions, ChatGPT-4 demonstrated significantly higher comprehensiveness scores compared with DeepSeek (p < 0.05).

Boxplot comparing the comprehensiveness scores of ChatGPT-4 and DeepSeek chatbots for community-oriented questions. A statistically significant difference was observed between the two models (p < 0.05).

Violin plot showing the distribution of the comprehensiveness scores of ChatGPT-4 and DeepSeek chatbots for community-oriented questions.
Reference provision
Neither ChatGPT-4 nor DeepSeek provided any scientific references, clinical guidelines, or verifiable sources in their responses across all evaluated questions.
Inter-rater reliability
Inter-rater reliability was assessed using the ICC with a two-way random-effects model for absolute agreement. As shown in Table 2, ChatGPT-4 demonstrated good reliability for community-oriented questions (appropriateness: ICC = 0.812, 95% confidence interval (CI) [0.735–0.871]; comprehensiveness: ICC = 0.743, 95% CI [0.648–0.818]) and moderate reliability for professional-level questions (appropriateness: ICC = 0.751, 95% CI [0.659–0.823]; comprehensiveness: ICC = 0.652, 95% CI [0.543–0.742]). DeepSeek showed slightly higher reliability overall, with good agreement for community questions (appropriateness: ICC = 0.831, 95% CI [0.759–0.885]; comprehensiveness: ICC = 0.769, 95% CI [0.678–0.839]) and good to moderate agreement for professional questions (appropriateness: ICC = 0.772, 95% CI [0.683–0.841]; comprehensiveness: ICC = 0.684, 95% CI [0.579–0.769]). All ICC values were statistically significant (p < 0.05).
Inter-rater reliability results based on intraclass correlation coefficient (ICC) for both chatbots (ChatGPT-4 and DeepSeek) and question types (community and professional).
CI: confidence interval.
The ICC values were computed separately for each chatbot (ChatGPT-4 and DeepSeek) and question type (community-oriented and professional-level).
Discussion
This study provides a direct comparison of ChatGPT-4 and DeepSeek in responding to colorectal cancer-related questions addressed to both community (patient) and professional (expert) audiences. Overall, the findings demonstrate that both chatbots are capable of generating generally relevant information, with comparable performance in appropriateness, while differences were observed in comprehensiveness depending on the target audience. These results contribute to the growing literature on the role of large-language models in oncology-related medical information delivery and extend prior work by directly comparing two distinct models within a cancer-specific context.25–28
The inclusion of both patient-oriented and clinician-level questions offers insight into chatbot performance across different levels of informational complexity. This dual-audience approach allows for a more comprehensive evaluation of how AI-generated responses align with the needs of diverse users, an aspect that has been less frequently addressed in previous chatbot assessments.
Inter-rater reliability analysis demonstrated good to moderate agreement among evaluators for both chatbots, supporting the consistency of the expert assessments. Higher agreement was observed for community-oriented questions compared with professional-level questions, indicating greater consistency in evaluating patient-directed information than responses addressing specialized clinical topics. Across all conditions, appropriateness scores showed higher agreement than comprehensiveness scores, suggesting that evaluators more consistently agreed on whether responses were suitable for the question than on the extent to which all relevant aspects were covered.
From a clinical perspective, these findings suggest that AI-powered chatbots may have potential utility as supplementary tools for patient education and general informational support. The higher consistency observed in community-oriented responses supports their possible role in patient-facing contexts, where clarity and consistency are essential. In contrast, the lower agreement observed for professional-level comprehensiveness underscores the current limitations of chatbot outputs in supporting complex clinical discussions without additional validation or oversight.
Although DeepSeek achieved slightly higher reliability coefficients across most evaluation conditions, overall appropriateness scores did not differ significantly between the two models. This indicates that both chatbots perform at a similar level in terms of relevance and contextual suitability of information. Differences in comprehensiveness were more pronounced at the community level, where ChatGPT-4 demonstrated significantly higher scores. This finding aligns with prior reports suggesting that ChatGPT models tend to provide more detailed responses for lay audiences.29,30
A critical finding of this study is the complete absence of scientific references in chatbot responses. The lack of verifiable source attribution substantially limits the transparency and clinical reliability of the information provided. As emphasized in previous studies, reference-supported responses are essential for ensuring trustworthiness and safe use of AI-generated medical content.1,27,30,31 Without such support, chatbot outputs cannot be reliably integrated into clinical decision-making processes. Previous studies have similarly reported that even when citations are provided, they are often generic, incomplete, or unverifiable, limiting transparency and clinical trustworthiness of chatbot-generated content.32,33
Taken together, these findings highlight both the potential and the current constraints of large-language models in colorectal cancer information delivery. While AI chatbots can generate relevant and, in some contexts, comprehensive responses, significant challenges remain regarding reliability, transparency, and suitability for professional-level use. Addressing these issues will be essential for the responsible advancement of AI applications in healthcare.
Limitations of the study
Several limitations of this study should be acknowledged. First, the question set was developed by two colorectal surgery experts based on clinical experience rather than derived from validated patient questionnaires or standardized guideline-based frameworks. Although this approach reflects real-world practice, it may introduce selection bias and limit the generalizability of the findings. Future studies should consider incorporating validated patient FAQs or guideline-based question sets to improve external validity.
Second, the relatively limited number of questions and the evaluation of only two chatbot models restrict the scope of inference, and the results may not be generalizable to other AI systems. In addition, chatbot responses may vary over time and across different sessions. To minimize temporal variability and ensure identical testing conditions, all queries were intentionally conducted within a single standardized session; however, this approach does not capture potential longitudinal variability in chatbot performance.
Third, the evaluation relied on expert judgment, which inherently involves a degree of subjectivity. Although explicit scoring anchors were defined (0 = completely irrelevant/incomplete; 100 = fully appropriate/comprehensive), no validated rubric currently exists for assessing AI-generated medical content. To mitigate this limitation, multiple blinded evaluators were employed, and inter-rater reliability was assessed using the ICC; nevertheless, some degree of interpretative variability remains unavoidable.
Finally, additional dimensions such as readability level, linguistic complexity, and the influence of language and regional clinical guidelines were not systematically evaluated. Differences between guideline frameworks, such as National Comprehensive Cancer Network (NCCN) and European Society for Medical Oncology (ESMO) recommendations, may affect the accuracy and applicability of AI-generated medical information across healthcare settings.34 Future studies addressing these factors may provide a more comprehensive evaluation of chatbot performance.
Conclusion
In conclusion, this study demonstrates that AI-powered chatbots such as ChatGPT-4 and DeepSeek are capable of generating relevant and generally appropriate information on colorectal cancer. However, their current utility is fundamentally limited by the complete absence of verifiable scientific references, which compromises the transparency and reliability of their outputs. While ChatGPT-4 provided more informative and detailed responses for community-oriented questions, this increased informativeness was not accompanied by source verification. Conversely, although DeepSeek demonstrated slightly higher internal consistency, it did not overcome the lack of evidence-based referencing. Together, these findings highlight a persistent and unresolved trade-off between informativeness, consistency, and verifiability in current large-language models. The integration of evidence-based medical databases—such as PubMed, UpToDate, and clinical guideline repositories—through structured API connections should be prioritized in future development to enable transparent, reproducible, and clinically trustworthy AI-generated medical information for both patients and clinicians.
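As one hypothetical example of such an integration (not part of this study's methodology), PubMed identifiers could be attached to a generated answer through the public NCBI E-utilities API, as sketched below.

```python
# One possible approach to attaching verifiable references to a chatbot answer:
# querying PubMed through the public NCBI E-utilities esearch endpoint.
# This is an illustrative sketch, not the study's implementation.
import requests

def pubmed_references(query: str, max_results: int = 3) -> list[str]:
    """Return PubMed links for a search term via the esearch endpoint."""
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        "db": "pubmed",
        "term": query,
        "retmode": "json",
        "retmax": max_results,
        "sort": "relevance",
    }
    data = requests.get(url, params=params, timeout=30).json()
    ids = data["esearchresult"]["idlist"]
    return [f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/" for pmid in ids]

# Example: supplement a generated answer on MSI/MMR testing with citations.
for link in pubmed_references("microsatellite instability colorectal cancer prognosis"):
    print(link)
```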
Supplemental Material
sj-docx-1-dhj-10.1177_20552076261425149 - Supplemental material for Comparative informative capacity of artificial intelligence (AI)-powered chatbots in colorectal cancer: ChatGPT-4 versus DeepSeek
Supplemental material, sj-docx-1-dhj-10.1177_20552076261425149 for Comparative informative capacity of artificial intelligence (AI)-powered chatbots in colorectal cancer: ChatGPT-4 versus DeepSeek by Nurhilal Kızıltoprak, Ömer Faruk Özkan, Fevzi Cengiz, Erdinç Kamer and İlker Sücüllü in DIGITAL HEALTH
Footnotes
Ethical approval
No patient data were used in this study, and therefore ethical approval was not required. The study was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki.
Author contributions
Nurhilal Kızıltoprak: study design, data collection, analysis, manuscript writing, and final approval.
Erdinç Kamer: data collection.
Ömer Faruk Özkan: data collection, statistical analysis, and final approval.
Fevzi Cengiz: data collection.
İlker Sücüllü: data collection.
An artificial intelligence-based language model was used during literature review and drafting, but all final decisions were made by the authors.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Availability of data and materials
The data used in this study are securely stored in compliance with ethical guidelines. Since no patient data were used, confidentiality regulations were not applicable. Relevant data can be shared with authorized institutions upon reasonable request. At certain stages of this study, an artificial intelligence-based language model was used to support the literature review and writing process. However, the final evaluation and content approval were carried out by the researchers.
Guarantor
Nurhilal Kızıltoprak.
Supplemental material
Supplemental material for this article is available online.
References
