Abstract
Background:
Artificial intelligence (AI) chatbots are increasingly used by patients to obtain medical information, yet comparisons between platforms that incorporate specialty-specific physician assessment remain limited. This study compares the quality, factual accuracy, readability, and consistency of responses generated by four publicly available AI chatbots when answering patient-centered questions about thyroid radiofrequency ablation (RFA).
Methods:
We conducted a cross-sectional analysis of chatbot-generated responses to 20 standardized clinical questions about thyroid RFA. Responses from ChatGPT-4, Gemini, Copilot, and Perplexity were rated for global quality and factual accuracy on 5-point Likert scales by six blinded physician reviewers experienced in thyroid RFA; higher scores indicated better performance. Readability and response length were analyzed with established metrics. Statistical significance was defined as p < 0.05.
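The abstract does not specify which readability formulas were applied. As a minimal illustrative sketch only, assuming the "established metrics" include Flesch Reading Ease and Flesch-Kincaid grade level computed with the Python textstat package (an assumption, not the authors' stated toolchain), per-response readability and length could be profiled as follows:

```python
# Illustrative sketch (assumed tooling, not the authors' method):
# profile one chatbot response for readability and length.
import textstat


def readability_profile(response_text: str) -> dict:
    """Return assumed readability and length measures for one chatbot response."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(response_text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(response_text),
        "word_count": len(response_text.split()),
    }


# Example usage on a placeholder response
sample = "Thyroid radiofrequency ablation is a minimally invasive procedure..."
print(readability_profile(sample))
```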
Results:
Gemini achieved the highest mean scores for global quality (4.08 ± 0.87) and accuracy (3.76 ± 1.05), with significantly better performance than ChatGPT-4 and Copilot (p < 0.005). ChatGPT-4 responses were significantly longer and more readable. Score variability across questions was lowest for Gemini. Copilot and Perplexity ranked lowest across most domains. Question-level analysis identified specific prompts that best discriminated between platforms.
Conclusions:
AI chatbot performance varied across platforms for thyroid RFA queries. Chatbots were generally reliable for straightforward factual information but were less dependable for judgment or context-dependent assessments. These AI tools should supplement, not replace, clinician-vetted patient education and institutional materials.
