Abstract
Background:
Artificial intelligence (AI) chatbots are increasingly used by patients to obtain medical information, yet comparisons between platforms that incorporate specialty-specific physician assessment remain limited. This study compares the quality, factual accuracy, readability, and consistency of responses generated by four publicly available AI chatbots when answering patient-centered questions about thyroid radiofrequency ablation (RFA).
Methods:
We conducted a cross-sectional analysis of chatbot-generated responses to 20 standardized clinical questions about thyroid RFA. Responses from ChatGPT-4, Gemini, Copilot, and Perplexity were rated for global quality and factual accuracy on 5-point Likert scales by six blinded physician reviewers experienced in thyroid RFA; higher scores indicated better performance. Readability and response length were analyzed with established metrics. Statistical significance was defined as p < 0.05.
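The abstract does not specify which readability formulas were applied. As a minimal illustrative sketch only, assuming the "established metrics" include Flesch Reading Ease and Flesch-Kincaid grade level computed with the Python textstat package (an assumption, not the authors' stated toolchain), per-response readability and length could be profiled as follows:

```python
# Illustrative sketch (assumed tooling, not the authors' method):
# profile one chatbot response for readability and length.
import textstat


def readability_profile(response_text: str) -> dict:
    """Return assumed readability and length measures for one chatbot response."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(response_text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(response_text),
        "word_count": len(response_text.split()),
    }


# Example usage on a placeholder response
sample = "Thyroid radiofrequency ablation is a minimally invasive procedure..."
print(readability_profile(sample))
```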
Results:
Gemini achieved the highest mean scores for global quality (4.08 ± 0.87) and accuracy (3.76 ± 1.05), with significantly better performance than ChatGPT-4 and Copilot (p < 0.005). ChatGPT-4 responses were significantly longer and more readable. Score variability across questions was lowest for Gemini. Copilot and Perplexity ranked lowest across most domains. Question-level analysis identified specific prompts that best discriminated between platforms.
Conclusions:
AI chatbot performance varied across platforms for thyroid RFA queries. Chatbots were generally reliable for straightforward factual information but were less dependable for judgment or context-dependent assessments. These AI tools should supplement, not replace, clinician-vetted patient education and institutional materials.
