Do ChatGPT and Gemini’s Recommendations Align With Established Guidelines for Hand and Upper Extremity Surgery?

Abstract

Background:

The use of large language models (LLMs) such as ChatGPT and Gemini in clinical settings has surged, with potential benefits in reducing administrative workload and enhancing patient communication. However, concerns about the clinical accuracy of these tools persist. This study evaluated the concordance of ChatGPT and Gemini’s recommendations with American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines (CPGs) for carpal tunnel syndrome, distal radius fractures, and glenohumeral joint osteoarthritis.

Methods:

ChatGPT (version 4o) and Gemini (version 1.5 Flash) were queried using structured text-based prompts aligned with AAOS CPGs. The LLMs’ outputs were analyzed by blinded reviewers to determine concordance with the guidelines. Concordance rates were compared across models, topics, and guideline strength using descriptive statistics and McNemar’s test. The transparency of responses, including source citation, was also assessed.
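As an illustration of the paired comparison described above, the following is a minimal sketch of an exact McNemar test written with the Python standard library only. The abstract does not state whether the exact or the chi-square form of the test was used, so the exact binomial form here is an assumption. The function names and the example ratings are hypothetical and are not study data.

```python
from math import comb


def discordant_counts(model_a, model_b):
    """Count paired items where exactly one model was concordant.

    model_a, model_b: equal-length sequences of booleans, one entry per
    guideline recommendation (True = concordant with the guideline).
    Returns (b, c): b = only model A concordant, c = only model B concordant.
    """
    if len(model_a) != len(model_b):
        raise ValueError("ratings must be paired, one per recommendation")
    b = sum(1 for x, y in zip(model_a, model_b) if x and not y)
    c = sum(1 for x, y in zip(model_a, model_b) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    Under the null hypothesis, the smaller discordant count follows a
    Binomial(b + c, 0.5) distribution; concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ratings for six recommendations (illustrative only).
ratings_a = [True, True, False, True, False, True]
ratings_b = [True, False, False, True, True, False]
b, c = discordant_counts(ratings_a, ratings_b)
p_value = mcnemar_exact(b, c)
```

Only the recommendations on which the two models disagree contribute to the test, which is why a paired design needs the item-level ratings and not just each model's overall concordance rate.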

Results:

A total of 174 recommendations were generated, with an overall concordance rate of 62.1%. Concordance did not differ significantly between ChatGPT and Gemini (66.7% vs 57.5%, P = .131). Concordance varied by topic and guideline strength, with ChatGPT performing best for moderately supported guidelines. Citation transparency was low for both models, although Gemini provided sources for significantly more recommendations than ChatGPT (39.1% vs 3.5%, P < .0001).
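The reported concordance rates can be checked for internal consistency with a short calculation. This sketch assumes the 174 recommendations were split evenly between the two models (87 each), which the abstract does not state explicitly. The implied counts are derived from the reported percentages and are not taken from the study.

```python
TOTAL_RECOMMENDATIONS = 174
# Assumption: each model answered the same prompts, so the total splits evenly.
PER_MODEL = TOTAL_RECOMMENDATIONS // 2


def implied_count(rate_pct, n):
    """Nearest whole number of concordant answers implied by a percentage."""
    return round(rate_pct / 100 * n)


chatgpt_concordant = implied_count(66.7, PER_MODEL)
gemini_concordant = implied_count(57.5, PER_MODEL)

# Pooled rate across both models, as a percentage with one decimal place.
overall_pct = round(
    100 * (chatgpt_concordant + gemini_concordant) / TOTAL_RECOMMENDATIONS, 1
)

print(chatgpt_concordant, gemini_concordant, overall_pct)  # 58 50 62.1
```

Under the equal-split assumption, the per-model rates and the pooled rate of 62.1% agree with one another.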

Conclusions:

Both models showed only modest concordance with the AAOS CPGs and had important limitations, including variability across topics and guideline strengths and limited citation transparency. These findings highlight the challenges of integrating LLMs into clinical practice and the need for further refinement and evaluation before their adoption in hand surgery.

Keywords

LLMs (large language models); CPGs (clinical practice guidelines); artificial intelligence in healthcare; clinical accuracy and transparency; clinical decision-making


