Abstract
Background
OpenAI developed ChatGPT as an advanced artificial intelligence (AI)-driven natural language processing system. ChatGPT is capable of generating responses through statistical pattern recognition established during pretraining.
Objective
To ascertain whether ChatGPT could respond to patients with breast cancer in a way consistent with evidence-based medical practice and a breast cancer clinical guideline (a practical pocketbook based on the latest evidence that takes national data into account), and to evaluate the ability of AI to provide accurate and up-to-date information to patients, potentially serving as a supplementary resource for medical professionals.
Methods
The research team designed a series of tests to assess ChatGPT's responses to specific questions related to breast cancer diagnosis, treatment options, and post-treatment care. Thirty clinically validated breast cancer questions spanning diagnosis, prognosis, treatment, and pharmacotherapy were administered in three iterative trials to (1) GPT-3.5 and GPT-4.0 (with a 5-min interval between trials) and (2) three breast surgeons stratified by expertise (high/medium/low). Each response was scored dichotomously against the standard answer (1 = guideline-consistent; 0 = inconsistent), and the sum of the three raters' scores yielded a total of 0 to 3 per question. Data analysis included mean score comparisons (analysis of variance with post hoc Tukey tests), subgroup analyses by question category, and inter-rater reliability assessment.
Results
Performance comparison between GPT-3.5 and GPT-4.0 across breast surgery subspecialties and question types revealed that GPT-4.0 generally outperformed GPT-3.5, despite the absence of significant differences in mean scores for most items. We found that GPT-3.5 has the same medical response ability as lower-qualified breast surgeons, while GPT-4.0 has the same ability as higher-qualified breast surgeons.
Keywords
Introduction
ChatGPT, an artificial intelligence (AI) application developed by OpenAI, demonstrates significant potential within the medical industry by utilizing large language models to generate text that resembles that created by humans. 1 This tool facilitates the provision of medical information through a conversational and interactive format, employing a human-machine dialogue interface to perform question-and-answer tasks. 2 Since its inception, numerous studies have explored the application of ChatGPT in the medical field. Recent research has extended its potential use to diverse areas such as environmental impact, 3 healthcare education, 4 and medical writing. 5 In the context of Clinical Decision Support Systems (CDSS), ChatGPT can offer medical information and recommendations for diagnoses and treatments. CDSS integrate a substantial amount of medical knowledge into computational systems, which subsequently employ algorithms to simulate clinical diagnostic and treatment strategies, functioning either independently or in conjunction with clinicians. Research indicates that ChatGPT can enhance clinical decision support, thereby optimizing decision-making processes across various clinical domains. In several clinical areas, including coronavirus disease-2019, 6 cancer, 7 drug discovery, 8 and diabetes, 9 researchers have assessed the value of ChatGPT in CDSS. For example, Khalid Raza and his team generated personalized treatment plans for individual patients by analyzing multidimensional data, such as clinical and genetic data in lung cancer, thereby optimizing the process and saving time and effort for bioscientists. 10
Breast cancer continues to be the most prevalent oncological diagnosis among women, with over 2.1 million new cases reported annually worldwide. 11 To enhance the prognosis for patients with breast cancer, individualized treatment approaches are essential. A significant advancement in medical practice is the implementation of CDSS, which can assist healthcare professionals in making informed clinical decisions, thereby improving decision-making efficiency and overall patient care while simultaneously reducing costs. 12 The utilization and accessibility of CDSS in the management of breast cancer are on the rise within medical institutions. However, there may be variations in the methodologies employed to develop specific CDSS, the data they incorporate, the recommendations they provide, and their practical applications in clinical settings. Despite the heightened necessity for CDSS in this field compared with other clinical specialties, further investigation regarding ChatGPT-assisted CDSS for breast cancer is warranted. Haver et al. have demonstrated the potential of ChatGPT in offering information on breast cancer prevention and the feasibility of screening recommendations. 13 Based on these findings, we hypothesized that AI may eventually supplant physicians and fundamentally transform contemporary clinical medicine.
The purpose of this analysis was to evaluate the accuracy of evidence-based responses of ChatGPT, particularly in the context of advancements in technology, and to compare its proficiency in the domain of breast surgery against physicians who rely on scientific data for their clinical decisions. The objective was to determine whether ChatGPT, a state-of-the-art language model, can demonstrate a level of expertise comparable to that of a trained medical professional in this specialized surgical field. The aim of this assessment was to provide insights into the potential applications and limitations of AI in medical diagnosis and treatment.
Methods
Study design
We compared ChatGPT's responses with those of breast surgeons with various levels of experience. The specific research questions were as follows: (1) How accurate are the answers of ChatGPT 3.5 and ChatGPT 4.0 in comparison with the treatment plans suggested in the guidelines? (2) How accurate are the answers provided by breast surgeons with varying levels of experience relative to the treatment plans suggested in the guidelines? (3) Do the answers of ChatGPT 3.5 and ChatGPT 4.0 correspond to the level of breast surgeons with a particular degree of experience? Since this was not a study involving human participants, neither informed consent nor institutional review board approval was necessary. Even when a question is simulated, if it involves real case features (such as details of a rare disease), it may inadvertently leak private information or violate data protection laws such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR).
Study data
The descriptive analysis was undertaken on 17 June 2023. A total of 30 questions about breast cancer, based on the BREAST CANCER National Comprehensive Cancer Network (NCCN) Guidelines (Version 4, 2022) and clinical experience and covering screening, prevention, treatment options, and postoperative complications, were produced (as indicated in the Supplemental Appendix table). We posed the above 30 questions to GPT-3.5 and GPT-4.0 simultaneously and collected their answers. A separate chat session was used for each question, and no diagnostic or patient-identifying information was provided to ChatGPT. Each question was re-asked after an interval of 5 min and repeated three times to assess the consistency of the answers. As a quality control measure, the same clinical questions were provided to experts. We also enlisted breast surgeons with high, medium, and low qualifications to respond to the aforementioned queries and promptly recorded their responses. Breast surgeons in this study were categorized according to the standards established by the National Health Commission, based on their experience and position: (1) low: attending physician; (2) middle: associate chief physician; and (3) high: chief physician; all were qualified physicians with more than three years of experience in breast surgery.
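The query protocol described above can be illustrated with a short script. The following is a minimal sketch only, not the authors' code: it assumes the OpenAI Python client, the illustrative model identifiers "gpt-3.5-turbo" and "gpt-4", and a hypothetical function name ask_three_times; each question is sent in a fresh request (standing in for a separate chat session), three times per model, with a 5-min pause between trials.

```python
# Hedged sketch of the repeated-query protocol (illustrative, not the authors' code)
import time
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

MODELS = ["gpt-3.5-turbo", "gpt-4"]  # assumed model identifiers
TRIALS = 3
PAUSE_SECONDS = 5 * 60  # 5-minute interval between repeated trials


def ask_three_times(question: str) -> dict[str, list[str]]:
    """Collect three independent answers per model for one breast cancer question."""
    answers: dict[str, list[str]] = {m: [] for m in MODELS}
    for trial in range(TRIALS):
        for model in MODELS:
            # A new request with no prior messages stands in for a separate chat
            # session, so no diagnostic or patient-identifying context carries over.
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            answers[model].append(response.choices[0].message.content)
        if trial < TRIALS - 1:
            time.sleep(PAUSE_SECONDS)  # wait 5 min before re-asking the same question
    return answers
```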
In addition, we engaged three senior breast surgeon experts with greater training and experience than the high-, middle-, and low-qualified breast surgeons who answered the questions above. These experts were asked to answer the 30 questions using the most concise wording possible, such as "yes" or "no," so that their answers could be compared accurately with the answers recommended in the guideline. The replies of the breast surgeons participating in this study were scored by these three senior experts, who determined whether each answer was "consistent" or "inconsistent." Each response was thus assigned a score of 0 to 3 according to its consistency.
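The scoring scheme can be summarized in a few lines. This is a minimal sketch under the stated assumptions (three raters, dichotomous ratings); the function name is illustrative.

```python
# Hedged sketch of the 0-3 consistency score: three senior experts each rate a
# response as consistent (1) or inconsistent (0); the question score is their sum.
def question_score(expert_ratings: list[int]) -> int:
    """Sum three dichotomous ratings (1 = consistent, 0 = inconsistent) into a 0-3 score."""
    assert len(expert_ratings) == 3 and all(r in (0, 1) for r in expert_ratings)
    return sum(expert_ratings)


# Example: two of three experts judge the answer guideline-consistent -> score 2
print(question_score([1, 0, 1]))  # 2
```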
Data measurement
The thirty questions listed above were divided into four categories, prognosis (4 items), treatment (8 items), drug (7 items), and diagnosis (11 items), to compare ChatGPT's response statistics with those of the breast surgeons. Mean scores were analyzed statistically by question type.
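The per-category comparison amounts to grouping the 0-3 scores by question type and respondent and averaging within each cell. The sketch below is illustrative only, with placeholder values and an assumed long-format table layout, not the study dataset.

```python
# Hedged sketch: mean score by question category and respondent group (placeholder data)
import pandas as pd

# Long-format table: one row per question per respondent, with its 0-3 score
df = pd.DataFrame({
    "respondent": ["GPT-3.5", "GPT-4.0", "attending", "associate chief", "chief"] * 2,
    "category":   ["diagnosis"] * 5 + ["treatment"] * 5,
    "score":      [2, 3, 3, 2, 3, 3, 3, 2, 3, 3],  # placeholder values
})

mean_by_type = (
    df.groupby(["category", "respondent"])["score"]
      .mean()
      .unstack("respondent")  # rows: question category, columns: respondent group
)
print(mean_by_type)
```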
Statistical analysis
SPSS statistical software 25.0 was used to examine the data. Numerical variables are expressed as mean ± standard deviation, and qualitative variables as absolute counts. The Student's t-test was applied to assess the data's conformity to the normal distribution. Group comparisons were performed using repeated-measures analysis of variance. The level of statistical significance was set at P < .05.
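The study used SPSS 25.0; an equivalent open-source workflow for the omnibus comparison and the post hoc Tukey tests mentioned in the abstract is sketched below. This is a hedged illustration with randomly generated placeholder scores, not the authors' analysis.

```python
# Hedged sketch: ANOVA across respondent groups with post hoc Tukey HSD (placeholder data)
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
groups = ["GPT-3.5", "GPT-4.0", "attending", "associate chief", "chief"]
# Placeholder scores: 30 questions per group, each scored 0-3
scores = {g: rng.integers(0, 4, size=30) for g in groups}

# Omnibus one-way ANOVA across the five respondent groups
f_stat, p_value = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, P = {p_value:.3f}")

# Post hoc pairwise Tukey HSD to locate which group means differ (alpha = .05)
flat_scores = np.concatenate(list(scores.values()))
labels = np.repeat(groups, 30)
print(pairwise_tukeyhsd(flat_scores, labels, alpha=0.05))
```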
Results
Baseline characteristics
Of the responses provided by GPT-3.5, 76.67% (23/30) complied with guidelines, whereas 13.33% (4/30) were inconsistent, and the remaining 10.00% (3/30) had at least one improper answer. For GPT-4.0, these values were 90.00% (27/30), 0% (0/30), and 10.00% (3/30), respectively. Of the responses provided by the three attending physicians, 56.67% (17/30) were in accordance with guidelines, 13.33% (4/30) were inconsistent, and 30.00% (9/30) had one or two inappropriate responses. For the three associate chief physicians, these values were 70.00% (21/30), 6.67% (2/30), and 23.33% (7/30), respectively. For the three chief physicians, these values were 93.33% (28/30), 0% (0/30), and 13.33% (4/30), respectively.
Mean score comparison
The mean score was lower for GPT-3.5 compared with GPT-4.0 (2.47 ± 1.07 vs. 2.90 ± 0.305, respectively, P = .038). The mean score was 2.50 ± 0.630, 2.53 ± 0.820, and 2.93 ± 0.254 for the attending, associate chief, and chief physicians, respectively. The mean score significantly differed between GPT-3.5 and associate chief physicians (P = .024), as well as between GPT-3.5 and chief physicians (P < .001). The mean score also significantly differed between GPT-4.0 and attending physicians (P = .003), as well as between GPT-4.0 and associate chief physicians (P = .025). We did not record differences in the mean score between GPT-3.5 and attending physicians (P = .88) or between GPT-4.0 and chief physicians (P = .65) (Figure 1).
Comparison of mean scores based on question type
Concerning diagnosis questions, the mean score was 2.46 ± 1.11 for GPT-3.5 and 2.93 ± 0.262 for GPT-4.0 (P = .04). The mean score was 2.50 ± 0.638, 2.61 ± 0.685, and 2.93 ± 0.262 for the attending, associate chief, and chief physicians, respectively. The mean score of GPT-3.5 for diagnosis was significantly lower than that of the chief physicians (P < .05). We did not record differences between the other groups. Regarding prognosis questions, the mean score was lower for GPT-3.5 than for GPT-4.0 (2.90 ± 0.310, P = .04). The mean score was 2.48 ± 0.634, 2.52 ± 0.829, and 2.93 ± 0.258 for the attending, associate chief, and chief physicians, respectively. The mean score of GPT-3.5 was significantly lower than that of the chief physicians (P = .02). The mean score of GPT-4.0 was significantly higher than those of the attending and associate chief physicians (P = .001 and P = .01, respectively). We did not record differences between the other groups. For treatment questions, the mean score was 2.50 ± 1.06 for GPT-3.5 and 2.96 ± 0.204 for GPT-4.0 (P = .04). The mean score was 2.46 ± 0.658, 2.63 ± 0.647, and 2.92 ± 0.282 for the attending, associate chief, and chief physicians, respectively. The mean score of GPT-4.0 was significantly higher than those of the attending and associate chief physicians (P = .001 and P = .02, respectively). However, there were no significant differences in scores between the other groups. For drug questions, the mean score was 2.64 ± 0.929 for GPT-3.5 and 3.00 for GPT-4.0. The mean score was 2.50 ± 0.650 for the attending physicians, 2.64 ± 0.633 for the associate chief physicians, and 3.00 for the chief physicians. The mean score of GPT-4.0 was significantly higher than those of the attending and associate chief physicians (P < .05 for both). We did not record differences between the other groups (Table 1).

Table 1. Comparison of mean scores according to question type. Plus–minus values are means ± SD. *P < .05, **P < .01, and ***P < .001.
Discussion
Comparison to prior work
Numerous studies have assessed the potential use of ChatGPT in the medical sector since its introduction. 14 ChatGPT is a valuable tool for scientific research, clinical practice, and education, and offers promise for these applications. Among its many benefits, ChatGPT can help researchers articulate themselves clearly and discuss research ideas and findings when conducting a thorough assessment of the literature. Regarding medical education, ChatGPT can provide invaluable assistance for physician qualification exams, facilitate medical student learning, and enhance communication skill training.
Furthermore, ChatGPT has the potential to streamline healthcare processes, which can lead to cost reductions and increased efficiency. The use of ChatGPT in breast surgery mostly entails administrative support, clinical decision making, and medical and surgical education. Most previous studies have only compared the gap between ChatGPT and the responses of healthcare professionals to medical questions, without conducting comparisons across different levels of experience. 15–17 Similar to AI training, the knowledge of medical personnel is enriched by the volume of clinical challenges they encounter, just as program upgrades increase the internal knowledge reserves of the system. The experimental results are therefore logically consistent: senior physicians have a higher accuracy rate, just as GPT-4.0 is more advanced than GPT-3.5.
Principal findings
Evaluating the competence of ChatGPT against evidence-based breast surgeons
GPT-4.0 demonstrated superior performance compared with GPT-3.5 in our study when we examined their capacity to respond according to evidence-based medicine. GPT-4.0 can generally provide reasonably accurate answers based on evidence-based medicine and patient information. Moreover, it is consistent with the opinions of experts and treatment guidelines, with its rate of correct answers reaching 100.00%. The excellent responses of GPT-4.0 are probably attributable to its training on more recent medical data, allowing it to provide answers that encompass the latest medical knowledge.
This is likely because, although ChatGPT's suggestions are based on recent scientific research and adhere to professional guidelines and treatment plans, GPT-3.5's dataset draws on the breast cancer treatment guidelines from 2022 and is not updated in real time, so its accuracy cannot reach that of GPT-4.0. The human body is complex; hence, new discoveries are constantly being recorded in the field of life sciences. Experienced human medical experts are motivated by this to continuously learn, comprehend the most recent information, and enhance their professional expertise. Cancer, in particular, has multiple causes, develops rapidly, involves several connections, and requires expert knowledge. By rapidly learning the most recent knowledge and basing professional judgments and interpretations on it, ChatGPT can successfully bridge this gap. The findings of this study also support this conclusion. We believe that the professional capabilities of ChatGPT will advance with further development, aiding human experts in accelerating the advancement of contemporary clinical medicine.
In this research, we compared the responses of ChatGPT with those of breast surgeons. According to the statistical findings, the capacity of GPT-3.5 and GPT-4.0 for evidence-based medical response is comparable with that of less experienced and more experienced breast surgeons, respectively. It appears that ChatGPT, particularly GPT-4.0, provides accurate information more rapidly than the majority of breast surgeons. We believe that ChatGPT, through its ongoing development, will advance modern medicine in new ways.
AI is continuously learning and gaining knowledge, which helps physicians make more precise clinical diagnoses and treatment decisions as a result of the ongoing accumulation of information. However, when faced with complicated and challenging medical knowledge and clinical circumstances, it is frequently necessary to integrate personal experience and the rigorous logical reasoning process of physicians. This should be taken into consideration from the perspective of professional knowledge reserves. It is unlikely that ChatGPT will replace clinicians as the combination of logical reasoning with technological tools may be required to achieve more accurate assessments.
Meanwhile, and more interestingly, at the time of submitting this article, ChatGPT had launched its 5.0 version, which aroused our curiosity. We therefore asked ChatGPT 5.0 the same questions and followed the same procedure (the answers for version 5.0 are provided in the Supplemental Appendix table). Surprisingly, the answers provided by the 5.0 version to the aforementioned questions were almost identical to those of the 4.0 version, further supporting the conclusion that AI cannot replace clinicians. A likely reason is that the tumor diagnosis and treatment guidelines have not been updated, so the data available to the model differ little from those of version 4.0, which explains the nearly identical results.
Comparison of mean scores according to classification
GPT-4.0 was generally better than GPT-3.5 when the 30 questions concerning breast cancer were compared across different question types. Among the chief physicians, almost all of the responses (28/30; 93.3%) included at least one treatment method in accordance with NCCN guidelines. However, 10% of GPT-3.5's responses and 6.67% of GPT-4.0's responses included one or more inconsistent answers, which were occasionally difficult to reconcile with other reasonable guidance. Inconsistent treatment recommendations were defined as recommendations that were only partially correct. Of note, 10% of the cases had the same score for GPT-3.5 and GPT-4.0, which reflects the complexity of the NCCN guidelines and shows that ChatGPT's output may be ambiguous or difficult to interpret. Depending on the type of question, we discovered that the deficiencies of GPT-3.5 in prevention, therapy, and diagnosis were improved in GPT-4.0. In addition, GPT-4.0 performed better than breast surgeons (i.e., attending physicians and associate chief physicians). This finding demonstrates how, with the development of ChatGPT, clinical medical practice can be supported more effectively, the most recent advances in science and technology can be more seamlessly incorporated, and the standard of patient care can be raised even higher.
We also discovered that, depending on the condition, GPT-4.0 performed better than breast surgeons (i.e., attending physicians and associate chief physicians). Although ChatGPT has already been employed in the medical field, the findings of this study demonstrate its potential in medical diagnosis and therapy. ChatGPT can provide detailed, specific, and individualized replies to medical problems in many subspecialties of breast surgery. Detailed answers to specific questions from each group are presented in the Supplemental Appendix table.
Ethical consideration
This study did not involve patient data; therefore, ethical approval was not required. Numerous studies to date have shown that ChatGPT can help improve the effectiveness of clinical decision making. However, the majority are retrospective investigations without clinical trials to support their findings. Therefore, it is important to exercise caution and closely examine the clinical applicability of ChatGPT. Researchers must base their clinical judgments on established procedures and pertinent laws. The information provided by ChatGPT is drawn from a wide range of databases and information sources, which frequently raises problems such as plagiarism. While ChatGPT, as an AI product, is exempt from legal obligations, clinical practitioners who use it must evaluate and distinguish this type of information and carefully weigh the legal and ethical implications. This point needs to be properly taken into account when using ChatGPT to make clinical decisions.
ChatGPT, a machine learning method that simulates responses by drawing on pre-existing data and knowledge, is characterized by a lag and an inability to promptly incorporate new information to offer the most recent options for decision making. In contrast, humans are capable of rapidly obtaining the most recent information and fusing it with experience and reason to create the most effective treatment strategy for patients. ChatGPT will continue to improve as science and technology advance; nevertheless, it may not completely replace physicians in clinical diagnosis and treatment. Considering the consistency in responses observed in this study, physicians can use ChatGPT as a tool to optimize their clinical judgments.
Limitations and future directions
This study had several limitations. Firstly, the investigation was limited by the relatively small number of questions analyzed, which may have introduced statistical error in the findings due to the scarcity of data. To overcome this limitation, future research should include larger sample sizes to ensure more robust results.
Secondly, this study included only a limited number of breast surgeons. To address this, future analyses should involve a larger number of breast surgeons with high, middle, and low qualifications, classified by professional title and other factors. This would enhance the comprehensiveness of studies and the accuracy of the findings. Additionally, the inclusion of more senior specialists could provide further insights.
Thirdly, to further improve the validity of the results, studies in the future should focus on refining the classification of issues and ensuring a balanced representation of each group. This approach would minimize potential biases and enhance the reliability of the conclusions drawn from the analysis.
Fourthly, future work could explore additional algorithms and big data models, integrating their results to enhance the accuracy of answers. For instance, one study carefully and rigorously employed convolutional neural networks, recurrent neural networks, generative adversarial networks, variational autoencoders, and other AI techniques, integrating multiple dimensions and datasets to provide more accurate and personalized treatment plans. 18
In summary, by addressing these limitations in research, we can strive for more comprehensive and robust findings in this field.
Conclusion
In this study, we compared answers to 30 questions regarding breast cancer from breast surgeons of varying seniority and from GPT-3.5 and GPT-4.0. The mean score was 2.47 ± 1.07 for GPT-3.5 and 2.90 ± 0.305 for GPT-4.0 (P = .038). The mean score significantly differed between GPT-3.5 and the middle-seniority group (2.47 ± 1.07 vs. 2.53 ± 0.819, P = .024) and between GPT-3.5 and the high-seniority group (2.47 ± 1.07 vs. 2.93 ± 0.254, P < .001). The mean score also significantly differed between GPT-4.0 and the low-seniority group (2.90 ± 0.305 vs. 2.50 ± 0.630, P = .003) and between GPT-4.0 and the middle-seniority group (P = .025). Performance comparison between GPT-3.5 and GPT-4.0 across breast surgery subspecialties and question types revealed that GPT-4.0 generally outperformed GPT-3.5, despite no significant difference in mean scores for most items. We found that GPT-3.5 has the same medical response ability as lower-qualified breast surgeons, while GPT-4.0 has the same ability as higher-qualified breast surgeons. At present, ChatGPT does not match the ability of highly skilled breast surgeons to provide accurate responses to medical questions. Nonetheless, it may be able to provide responses of this level with future system modifications and the ongoing accumulation of data. Breast surgeons must also keep in mind the advancement of this AI system in the clinical treatment and diagnosis of breast cancer and consider whether to incorporate it into their clinical workflow.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076261431491, for "The potential of ChatGPT as an artificial intelligence enhancement therapy consultant for patients with breast cancer" by Xiaoyu Shi, Yao Li and Chengliang Yin in DIGITAL HEALTH.
Footnotes
Ethical approval
The Ethics Committee of the Second Hospital of Anhui Medical University (Hefei, China) approved the study, which was conducted in accordance with the ethical guidelines established in the World Medical Congress's Helsinki Declaration and its updated versions.
Author contributions
Xiaoyu Shi: Literature search; Study design; Data collection; Data interpretation; Writing.
Yao Li: Provided feedback on manuscript texts.
Chengliang Yin: Study design; Provided feedback on all manuscript texts.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Guarantor
CY.
Supplemental material
Supplemental material for this article is available online.
