Evaluation of free-access artificial intelligence chatbots in preoperative patient education about general anesthesia: A comparative study of ChatGPT, Gemini, and Copilot

Abstract

Objective

Artificial intelligence (AI) chatbots are increasingly used by patients seeking medical information. However, the accuracy and educational quality of such tools in the context of anesthesia remain unclear. This study aimed to evaluate and compare the appropriateness of responses generated by three widely accessible AI platforms—ChatGPT, Gemini, and Copilot—regarding frequently asked questions about general anesthesia.

Methods

Fifty anesthesia-related questions were developed by two anesthesiologists and categorized into four domains: General Information and Process; Safety and Risks; Pain, Comfort, and Recovery; and Preoperative Preparation. Each question was entered in English into the free, publicly available versions of ChatGPT, Gemini, and Copilot. Ten blinded anesthesiologists rated the responses using a 5-point Likert scale (1 = very inappropriate to 5 = very appropriate). Mean scores were compared using one-way ANOVA with Tukey’s post-hoc tests, and inter-rater reliability was assessed using Cronbach’s α.

Results

ChatGPT achieved the highest overall mean score (4.68 ± 0.50), followed by Gemini (4.22 ± 0.63) and Copilot (3.28 ± 0.50), with significant differences among all platforms (p < 0.001). ChatGPT consistently outperformed the others across all four domains. Qualitative observations from evaluator comments suggested that ChatGPT’s concise summaries improved readability, Gemini provided more structured responses with more scholarly-style references, and Copilot was clear but often less detailed. Inter-rater reliability was high (Cronbach’s α = 0.89).

Conclusion

Among free-access AI chatbots, ChatGPT provided the most accurate and comprehensive explanations regarding general anesthesia. While Gemini and Copilot offered partial value, professional oversight remains essential to ensure safe and contextually accurate patient education in preoperative care.

Keywords

ChatGPT, Gemini, Copilot, anesthesia education, artificial intelligence, preoperative counseling, patient information, AI chatbot

1. Introduction

General anesthesia is a pharmacologically induced reversible state of unconsciousness, analgesia, amnesia, and muscle relaxation that enables safe performance of surgical procedures. Each year, millions of patients worldwide undergo general anesthesia, and despite its routine nature, the process remains associated with significant patient concerns. Preoperative anxiety is common among surgical patients: a meta-analysis of 28 studies (n = 14 652) reported a prevalence of ∼48% (95% CI: 39–57%).¹

Anxiety related specifically to anesthesia (for example, fear of not waking up or of nausea/vomiting) has been reported in ∼30% of patients before hospital admission.²

In parallel, the digital era has transformed how patients seek medical information. Instead of relying exclusively on clinicians, many access online resources, and more recently, artificial intelligence (AI)-based systems and chatbots are increasingly being used for health-related queries. Chatbots in healthcare have demonstrated benefits in patient navigation and education, but the evidence also highlights important limitations in accuracy, completeness, transparency of sources and potential biases.³,⁴

AI chatbots powered by large language models (LLMs), such as ChatGPT, Gemini and Copilot, are now readily accessible to the public and often used for general informational purposes. While they hold promise for patient education, especially in preoperative settings, their scientific accuracy, adequacy and relevance to anesthesia-specific patient concerns remain under-explored. A recent review found that although chatbots can improve patient education on common medical questions, they may still provide incomplete or incorrect information, particularly when used without oversight.⁵,⁶

Therefore, the aim of this study was to evaluate the accuracy, sufficiency and educational value of responses generated by these popular AI chatbots (ChatGPT, Gemini, Copilot) to the questions patients most frequently ask about general anesthesia. By comparing these systems, we sought to determine whether AI platforms are useful tools for preoperative patient education and to assess their potential role in supporting anesthesia teams in the informed consent process.

2. Methodology

2.1. Study design

This descriptive comparative study was designed to evaluate the quality of AI-generated medical information related to general anesthesia.

2.2. Question selection

A total of 50 frequently asked questions about general anesthesia were identified by two board-certified anesthesiologists based on clinical experience, commonly encountered patient concerns, and questions reported during routine preoperative consultations. The questions were reviewed by the authors for content relevance and clarity prior to use and were organized into four domains (general information and process; safety and risks; pain, comfort, and recovery; and preoperative preparation) to ensure comprehensive coverage of key aspects of patient education. The full list of questions used in this study is provided in Supplemental Appendix A to enhance transparency and reproducibility.

2.3. AI platforms

Each question was entered separately and identically into three widely used AI chatbots: ChatGPT (OpenAI), Gemini (Google), and Copilot (Microsoft). Each platform was accessed via its free, publicly available version, and no paid or premium memberships were used. All responses were generated between May 18 and May 25, 2025; restricting queries to this window was intended to minimize variability due to model updates. Each platform was used in the most up-to-date version publicly available at the time of data collection. Because specific model version identifiers were not consistently disclosed within the user interfaces, the platforms were evaluated as publicly accessible tools rather than as specific model versions. Each query was entered in a new chat session to prevent any influence from previous interactions or contextual memory. All prompts were entered in English through the standard free-access user interface of each platform with default settings; no platform-specific modes or custom options were enabled or disabled. This approach reflects real-world patient use conditions and supports the external validity of the findings.

2.4. Evaluation of responses

The responses were independently evaluated by ten anesthesiology specialists who were blinded to the source of the responses. Each answer was rated using a 5-point Likert scale to assess the appropriateness and accuracy of the information: 1 – Very inappropriate, 2 – Inappropriate, 3 – Average, 4 – Appropriate, 5 – Very appropriate. The raters independently evaluated each response using this predefined scale based on overall clinical appropriateness and accuracy. Before scoring, responses were randomized, platform identifiers were removed, and all outputs were presented in the same format to support blinding. No formal reference answer key or external guideline-based scoring template was used. In addition to numerical scoring, evaluators were invited to provide optional free-text comments, which were used to support the qualitative interpretation of response characteristics. For each question, mean scores were calculated across raters, and platform-level performance was subsequently compared across all questions.
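The blinding and per-question averaging steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the neutral response codes, and the example ratings are hypothetical and are not taken from the study, which does not describe its tooling for this step.

```python
import random
from statistics import mean

PLATFORMS = ["ChatGPT", "Gemini", "Copilot"]

def blind_responses(responses, seed=0):
    """Shuffle responses and replace platform names with neutral codes.

    responses: list of (question_id, platform, text) tuples.
    Returns (blinded, key): blinded holds (code, question_id, text) items
    shown to raters; key maps each code back to (question_id, platform).
    """
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    shuffled = list(responses)
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for i, (qid, platform, text) in enumerate(shuffled, start=1):
        code = f"R{i:03d}"  # hypothetical neutral identifier
        key[code] = (qid, platform)
        blinded.append((code, qid, text))
    return blinded, key

def question_means(ratings, key):
    """Average the raters' Likert scores for each (question, platform)."""
    return {key[code]: mean(scores) for code, scores in ratings.items()}

# Hypothetical example: one question, three platforms, three raters.
responses = [(1, platform, f"example answer {i}") for i, platform in enumerate(PLATFORMS)]
blinded, key = blind_responses(responses, seed=42)
ratings = {code: [4, 5, 5] for code, _, _ in blinded}
means = question_means(ratings, key)
```

Keeping the code-to-platform key separate from the material shown to raters is what allows scores to be unblinded only after rating is complete.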

2.5. Statistical analysis

Descriptive statistics (mean ± SD) were used to summarize the scores for each platform. The mean scores of ChatGPT, Gemini, and Copilot were compared using one-way ANOVA, followed by Tukey’s HSD post-hoc pairwise comparisons when the overall test was significant. Inter-rater reliability was assessed using Cronbach’s α. Statistical significance was set at p < 0.05. Data analysis was performed using IBM SPSS Statistics (version 25, IBM Corp., Armonk, NY, USA).
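For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed by hand as in the minimal sketch below. The study itself used SPSS; the function name and the small rating lists here are hypothetical, and the post-hoc Tukey step is not implemented.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical Likert ratings for three platforms (not study data).
example_groups = [[5, 5, 4, 5], [4, 4, 5, 4], [3, 3, 4, 3]]
f_stat, df_between, df_within = one_way_anova(example_groups)
```

A large F indicates that the differences between platform means are large relative to the spread of ratings within each platform.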

2.6. Ethical considerations

Since this study did not involve patients, patient data, or clinical interventions, Institutional Review Board approval was not required. Only anesthesiologists participated as expert raters. Written informed consent was not obtained; however, all participants were informed about the study and voluntarily agreed to participate prior to data collection.

3. Results

3.1. Overall comparison between AI platforms

A total of 50 anesthesia-related questions were evaluated across three AI platforms — ChatGPT, Gemini, and Copilot — by ten board-certified anesthesiologists. Each response was rated on a 5-point Likert scale (1 = very inappropriate, 5 = very appropriate).

The overall mean scores were as follows: ChatGPT (4.68 ± 0.50), Gemini (4.22 ± 0.63), and Copilot (3.28 ± 0.50) (Table 1).

Table 1.

Mean ± SD appropriateness scores for AI platforms.

Platform	Mean ± SD	p-value	Ranking
ChatGPT	4.68 ± 0.50	< 0.001	1
Gemini	4.22 ± 0.63	< 0.001 vs ChatGPT	2
Copilot	3.28 ± 0.50	< 0.001 vs both	3

A one-way ANOVA revealed a significant difference among platforms (F(2, 1497) = 842.7, p < 0.001).
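The reported degrees of freedom are consistent with treating each individual rating as one observation. This is an inference from the design (50 questions, 10 raters, 3 platforms), not something the text states explicitly:

```latex
N = 50 \times 10 \times 3 = 1500, \qquad
df_{\text{between}} = k - 1 = 3 - 1 = 2, \qquad
df_{\text{within}} = N - k = 1500 - 3 = 1497
```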

Post-hoc Tukey’s HSD tests confirmed that all pairwise differences were significant: ChatGPT outperformed both Gemini and Copilot, and Gemini outperformed Copilot (p < 0.001 for all pairs).

The overall ranking was therefore ChatGPT > Gemini > Copilot.

3.2. Per-question and thematic analysis

To explore item-level variation, the mean score for each of the 50 questions was calculated separately for all three platforms. For interpretability, the questions were grouped into four main domains based on content relevance:

A) General Information and Process (1–15)

B) Safety and Risks (16–23)

C) Pain, Comfort, and Recovery (24–35)

D) Preoperative Preparation (36–50)
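The grouping above can be expressed as a simple lookup, sketched here with the question ranges exactly as listed. The names `DOMAINS` and `domain_of` are illustrative only.

```python
# Question-number ranges per domain, as listed above (upper bounds exclusive).
DOMAINS = {
    "General Information and Process": range(1, 16),
    "Safety and Risks": range(16, 24),
    "Pain, Comfort, and Recovery": range(24, 36),
    "Preoperative Preparation": range(36, 51),
}

def domain_of(question_number):
    """Return the domain name for a question number from 1 to 50."""
    for name, numbers in DOMAINS.items():
        if question_number in numbers:
            return name
    raise ValueError(f"question number out of range: {question_number}")

# Number of questions per domain: 15, 8, 12, and 15.
domain_sizes = {name: len(numbers) for name, numbers in DOMAINS.items()}
```

The domains are unequal in size, which matters when domain-level means are combined into an overall mean.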

Figure 1 presents a heatmap illustrating the mean appropriateness scores from 1 to 5 for each of the 50 questions answered by ChatGPT, Gemini, and Copilot. ChatGPT achieved consistently higher ratings across almost all questions, demonstrating a clear overall advantage. Gemini provided moderately accurate and structured responses, while Copilot generally yielded lower scores, particularly for questions requiring contextual or procedural understanding.

Figure 1.

Per-question mean scores from 1 to 5 across platforms.

Figure 2 summarizes the average scores across the four thematic domains. Domain-level mean scores showed that ChatGPT achieved the highest scores across all four domains, followed by Gemini and Copilot. The mean scores for ChatGPT were 4.69 in General Information and Process, 4.64 in Safety and Risks, 4.64 in Pain, Comfort, and Recovery, and 4.73 in Preoperative Preparation. The corresponding values for Gemini were 3.96, 4.01, 4.11, and 4.67, whereas Copilot scored 3.32, 3.36, 3.23, and 3.23, respectively. The greatest between-platform difference was observed in the Preoperative Preparation domain, indicating that chatbot performance diverged most in areas with direct implications for preoperative instructions and perioperative decision-making. Additional descriptive summaries of domain-level scores and the questions showing the greatest between-platform differences are provided in Supplemental Tables S1 and S2.
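As a consistency check, the domain means above can be combined into overall means by weighting each domain by its number of questions (15, 8, 12, and 15). The sketch below assumes every question contributes equally, which the text implies but does not state, and uses only the values reported in this section.

```python
# Questions per domain: A (1-15), B (16-23), C (24-35), D (36-50).
DOMAIN_SIZES = [15, 8, 12, 15]

# Domain-level means as reported in the text, in domain order A to D.
DOMAIN_MEANS = {
    "ChatGPT": [4.69, 4.64, 4.64, 4.73],
    "Gemini": [3.96, 4.01, 4.11, 4.67],
    "Copilot": [3.32, 3.36, 3.23, 3.23],
}

def weighted_mean(means, weights):
    """Mean of domain means weighted by the number of questions per domain."""
    return sum(m * w for m, w in zip(means, weights)) / sum(weights)

overall = {
    platform: round(weighted_mean(means, DOMAIN_SIZES), 2)
    for platform, means in DOMAIN_MEANS.items()
}
# overall == {'ChatGPT': 4.68, 'Gemini': 4.22, 'Copilot': 3.28}

# Gap between the best and worst platform within each domain.
gaps = [max(column) - min(column) for column in zip(*DOMAIN_MEANS.values())]
```

The weighted means reproduce the overall scores in Table 1, and the largest gap (about 1.50 points, between ChatGPT and Copilot) falls in the Preoperative Preparation domain, in line with the statement above.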

Figure 2.

Domain-level mean appropriateness scores across platforms.

3.3. Expert qualitative feedback

In addition to numerical scoring, qualitative observations derived from anesthesiologists’ optional free-text comments provided further insight into platform performance:

ChatGPT: Praised for its concise “in short” summaries, improving readability and patient comprehension. However, evaluators noted a lack of reference citations, reducing academic traceability.

Gemini: Valued for its structured tables and apparent use of more scholarly-style references, lending academic credibility.

Copilot: Commended for brevity and clarity, although it appeared more likely to rely on non-academic or popular web sources, reducing reliability.

These comments emphasize that linguistic clarity, citation transparency, and reference quality meaningfully affect perceived trust and educational value beyond quantitative scoring.

3.4. Inter-rater reliability

The consistency among the ten evaluators was high (Cronbach’s α = 0.89), demonstrating robust internal reliability across raters.
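Cronbach’s α can be computed from a ratings matrix as sketched below. The sketch assumes that raters are treated as the "items" and rated responses as the cases; the study does not specify how the coefficient was set up in SPSS, and the small matrix is hypothetical.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with one row per rated response
    and one column per rater (raters treated as items)."""
    k = len(scores[0])
    # Sample variance of each rater's scores across responses.
    rater_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the summed score per response.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)

# Hypothetical ratings: five responses scored by three raters.
example_scores = [
    [5, 5, 4],
    [4, 4, 4],
    [3, 4, 3],
    [5, 4, 5],
    [2, 3, 2],
]
alpha = cronbach_alpha(example_scores)  # about 0.90 for this toy matrix
```

Values close to 1 indicate that raters rank the responses similarly; they do not by themselves show that raters agree on absolute scores.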

3.5. Access level

All chatbot responses were obtained from the free, publicly accessible versions of ChatGPT, Gemini, and Copilot. No paid or premium memberships were used, ensuring that the findings represent the performance level available to general users and patients seeking preoperative information online.

3.6. Summary of findings

ChatGPT generated the most accurate, comprehensive, and patient-friendly explanations about general anesthesia, followed by Gemini and Copilot (p < 0.001). Qualitative observations suggested that Gemini tended to provide more structured responses with more scholarly-style references, whereas Copilot’s shorter responses appeared clearer but less detailed and less academically oriented. Free-access AI models varied notably in accuracy, structure, and perceived source transparency, underscoring the importance of professional oversight in patient communication.

4. Discussion

This study compared three widely accessible artificial intelligence (AI) chatbots—ChatGPT, Gemini, and Copilot—in their ability to answer frequently asked patient questions about general anesthesia. ChatGPT provided the most accurate and clinically appropriate responses, followed by Gemini and Copilot, with statistically significant differences among all platforms (p < 0.001).

4.1. Comparison with previous studies

Our findings are consistent with earlier research showing that newer ChatGPT versions outperform earlier ones and similar AI systems in accuracy and linguistic coherence. Choi et al.⁷ reported that ChatGPT 4.0 generated more appropriate anesthesia-related responses than ChatGPT 3.5, and Jin et al.⁸ demonstrated its ability to produce accurate educational materials for anesthesiology residents. These results support the growing evidence that AI chatbots can serve as valuable adjuncts in anesthesia education and patient communication, although their reliability depends on data transparency and model design.

In our study, ChatGPT achieved a mean score of 4.68 ± 0.50, categorized as very appropriate, reflecting high linguistic clarity and contextual accuracy. Gemini (4.22 ± 0.63) provided structured and moderately detailed responses, while Copilot (3.28 ± 0.50) offered brief but often incomplete answers, which may reflect architectural and training differences among the models. The superior performance of ChatGPT may be attributed to several factors. Compared to other platforms, ChatGPT consistently provided more structured, comprehensive, and patient-centered responses, which likely contributed to higher appropriateness scores. In addition, differences in training data, reinforcement learning strategies, and optimization for conversational clarity may have enhanced its ability to generate accurate and accessible medical information. While Gemini demonstrated relatively structured outputs and Copilot produced simpler responses, these platforms may have been less effective in balancing medical accuracy with patient-friendly communication.

4.2. Qualitative evaluation of responses

In addition to numerical scoring, qualitative observations were derived from the optional free-text comments provided by the anesthesiologist evaluators. A descriptive qualitative approach was used to systematically identify recurring themes across responses, including structure, readability, level of detail, and source transparency. ChatGPT’s concise “in short” summaries improved readability and patient understanding, though its answers rarely included references. Gemini was appreciated for structured formatting and the apparent use of more scholarly-style references, whereas Copilot produced clear but oversimplified responses and appeared to rely more frequently on non-academic or popular web sources. These observations emphasize that clarity, citation transparency, and reference quality meaningfully affect perceived trust and educational value beyond quantitative scoring. These findings should be interpreted as qualitative observations derived from evaluator comments rather than formal comparative citation analysis.

4.3. Access level of evaluated chatbots

All analyzed outputs were obtained from free, publicly available versions of each chatbot. This ensures that the study reflects the performance level accessible to typical patients rather than premium or enterprise models, improving the external validity of the findings.

4.4. Clinical and educational implications

AI chatbots are emerging as supplemental tools for patient education and preoperative communication. Accurate and accessible information may help reduce preoperative anxiety and strengthen informed consent discussions. However, these systems must be viewed as assistive aids—not replacements—for professional counseling, as human empathy, context awareness, and clinical judgment remain indispensable in anesthesia practice.

A clinically meaningful interpretation of these findings extends beyond overall platform ranking. The greatest between-platform differences were observed in the preoperative preparation domain, particularly in topics such as fasting instructions, medication management, and anesthesia during intercurrent illness, suggesting that chatbot performance diverged most in areas with direct implications for preoperative instructions, perioperative safety, and patient decision-making. In real patient use, deficiencies in these areas may be more concerning than limitations in general descriptive questions, because they could lead to false reassurance, misunderstanding of preoperative instructions, or inappropriate self-management before surgery. In addition, some platforms appeared to perform relatively well in readability while providing less medical nuance, suggesting that patient-friendly language alone is not sufficient if important context or safety-related detail is omitted.

These findings should also be interpreted within important patient safety and implementation limits. AI-generated information may be factually correct yet still inadequate for informed consent if individualized risks, complications, or procedural nuances are omitted. Similarly, highly readable summaries may create false reassurance when important safety-related details are simplified or absent. In addition, the real-world performance of free-access AI chatbots may vary according to geographic region, browser integration, language setting, and update cycle, which limits standardization across users and settings. For these reasons, such tools may be most appropriately used as supportive informational aids rather than as independent substitutes for clinician-led preoperative counseling.

4.5. Limitations and future work

The study analyzed 50 standardized English-language questions, which may limit the generalizability of the findings to non-English-speaking populations and may not fully reflect linguistic variability in real-world patient inquiries. The questions were developed by anesthesiologists, which may introduce selection bias and may not fully capture the diversity and variability of real patient inquiries in different clinical and sociocultural contexts. The rapidly evolving nature of AI models may affect the reproducibility of results over time, as ongoing updates to model architecture and training data can lead to variations in response quality across different time points. The formal evaluation framework was limited to appropriateness and accuracy, which may not fully capture other clinically relevant dimensions of patient education quality, such as completeness, safety, readability, potential for harmful omission, consistency with established guidelines, and the transparency and reliability of cited sources. Reference provision and citation quality were not systematically quantified in this study; therefore, observations regarding source transparency and reference characteristics should be interpreted as qualitative findings derived from evaluator comments rather than formal comparative citation analysis. No formal reference answer key or calibration exercise was used before scoring, which may have increased the potential for shared subjective bias among similarly trained evaluators. Although inter-rater reliability was high (Cronbach’s α = 0.89), subjective variation among evaluators cannot be entirely excluded. Future work should explore multilingual outputs, citation accuracy, and patient comprehension, and track how evolving AI model updates affect content quality. The findings of the present study may inform future research incorporating clinician-generated responses to further evaluate the clinical reliability and applicability of these systems.

Overall, ChatGPT demonstrated the strongest balance of accuracy, readability, and patient-centered language among the evaluated chatbots. These findings support the responsible integration of free-access AI tools into preoperative patient education, provided they remain under expert supervision.

5. Conclusion

This study systematically evaluated the appropriateness and scientific accuracy of responses generated by three free-access AI chatbots—ChatGPT, Gemini, and Copilot—regarding frequently asked patient questions about general anesthesia.

ChatGPT demonstrated the highest overall performance, providing the most accurate, comprehensive, and clinically relevant information, while Gemini performed moderately well and Copilot yielded the least appropriate responses.

These results suggest that AI chatbots, particularly ChatGPT, can serve as valuable adjuncts for preoperative patient education when supervised by healthcare professionals.

AI chatbots hold promising potential to enhance patient understanding and reduce preoperative anxiety, provided their use remains under expert guidance within anesthesiology practice.

Supplemental material

Supplemental material - Evaluation of Free-Access Artificial Intelligence Chatbots in Preoperative Patient Education About General Anesthesia: A Comparative Study of ChatGPT, Gemini, and Copilot

Supplemental material for Evaluation of Free-Access Artificial Intelligence Chatbots in Preoperative Patient Education About General Anesthesia: A Comparative Study of ChatGPT, Gemini, and Copilot by Fatih Oluş and Hüseyin Babun in DIGITAL HEALTH.

Footnotes

Acknowledgments

The authors sincerely thank the anesthesiologists who participated as expert raters for their valuable insights and contributions.

ORCID iDs

Fatih Oluş

Hüseyin Babun

Ethical considerations

This study did not involve patients, patient data, or clinical interventions and therefore did not require approval from an ethics committee.

Consent to participate

Written informed consent was not obtained. All anesthesiologists who participated as expert evaluators were informed about the study and voluntarily agreed to participate prior to data collection.

Authors' contributions

F.O. contributed to the study conceptualization, question design, data organization, and critical revision of the manuscript. H.B., as the corresponding author, supervised the entire project, performed the statistical analyses, prepared the figures and tables, and led manuscript drafting and editing. Both authors reviewed and approved the final version of the manuscript and agree to be accountable for all aspects of the work.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated and analyzed during the current study (including chatbot responses and expert ratings) are available from the corresponding author upon reasonable request.

Use of Artificial Intelligence

AI tools were used for language editing, improving clarity, and assisting in the design of figures. The authors take full responsibility for the content of the manuscript.

Supplemental material

Supplemental material for this article is available online.

References

1. Abate, Chekol, Basu. Global prevalence and determinants of preoperative anxiety among surgical patients: a systematic review and meta-analysis. Int J Surg Open 2020; 25: 6–16. https://doi.org/10.1016/j.ijso.2020.05.010

2. Bello, Nuebling, Koster, et al. Patient-reported perioperative anaesthesia-related anxiety is associated with impaired patient satisfaction: a secondary analysis from a prospective observational study in Switzerland. Sci Rep 2023; 13(1): 16301. https://doi.org/10.1038/s41598-023-43447-6

3. Clark, Bailey. Chatbots in health care: connecting patients to information. Can J Health Technol 2024; 4(1). https://doi.org/10.51731/cjht.2024.818

4. Goodman, Patrinely, Stone, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open 2023; 6(10): e2336483. https://doi.org/10.1001/jamanetworkopen.2023.36483

5. Garcia, Emile, Linkeshwaran, et al. A literature review on the role of artificial intelligence–based chatbots in patient education in colorectal surgery. Surgery 2025; 183: 109393. https://doi.org/10.1016/j.surg.2025.109393

6. Shiferaw, Zheng, Winter, et al. Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions. BMC Med Inform Decis Mak 2024; 24(1): 404. https://doi.org/10.1186/s12911-024-02824-5

7. Choi, Park, et al. Evaluation of the quality and quantity of artificial intelligence-generated responses about anesthesia and surgery: using ChatGPT 3.5 and 4.0. Front Med (Lausanne) 2024; 11: 1400153. https://doi.org/10.3389/fmed.2024.1400153

8. Jin, Abola, Bargnes, et al. The utility of generative artificial intelligence chatbot (ChatGPT) in generating teaching and learning material for anesthesiology residents. Front Artif Intell 2025; 8: 1582096. https://doi.org/10.3389/frai.2025.1582096
