Abstract
Background
Artificial intelligence (AI)-based chatbots are increasingly used as sources of medical information. Given the high prevalence of neck pain as a musculoskeletal symptom, patients are likely to consult such tools for health-related guidance.
Objective
To evaluate and compare the performance of ChatGPT 4.0 and Google Gemini in addressing commonly asked patient questions and clinical case scenarios related to neck pain, focusing on their accuracy, quality, understandability, readability, reliability, and usability.
Methods
Twenty-four patient-oriented questions and four clinical case scenarios regarding neck pain were submitted to ChatGPT 4.0 and Google Gemini. Responses were evaluated using validated tools: the modified DISCERN (mDISCERN) for reliability, the Global Quality Scale (GQS) for quality, the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) for understandability and actionability, and the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) for readability. Case-based responses were assessed for accuracy, safety, and usability on a 7-point Likert scale by two experienced physicians.
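For reference, the FRE and FKGL indices are computed from average sentence length and syllables per word; the standard formulas (not restated in the original abstract) are:

\[ \text{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \]

\[ \text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59 \]

Higher FRE scores indicate easier text (scores of roughly 30-50 correspond to "difficult," college-level material), whereas FKGL expresses readability as a US school grade level.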
Results
Gemini demonstrated significantly higher reliability (mDISCERN, p < 0.001), whereas ChatGPT 4.0 achieved slightly higher GQS and PEMAT-P scores, though these differences were not statistically significant. Readability metrics were similar: ChatGPT's FRE was 48.78 and FKGL 9.08; Gemini's FRE was 47.12 and FKGL 9.11. Both models' outputs were rated difficult to read. In the clinical scenarios, both chatbots showed comparable accuracy, safety, and usability, with minor omissions noted.
Conclusion
ChatGPT 4.0 and Google Gemini showed similar performance in addressing neck pain-related queries. While both may support patient education, their limited readability and occasional omissions suggest they should complement, rather than replace, professional medical guidance.
