Abstract
Introduction
As artificial intelligence (AI) becomes increasingly integrated into medicine and surgery, its applications are expanding rapidly—from aiding clinical documentation to providing patient information. However, its role in medical decision-making remains uncertain. This study evaluates an AI language model’s alignment with clinical consensus statements in foot and ankle surgery.
Methods
Clinical consensus statements from the American College of Foot and Ankle Surgeons (ACFAS; 2015-2022) were collected and rated by ChatGPT-o1 as inappropriate, neither appropriate nor inappropriate, or appropriate. Across ten repetitions, the statements were entered into ChatGPT-o1 in random order, and the model was prompted to assign a rating to each. The AI-generated ratings were compared with the expert panel's ratings, and intra-rater reliability analysis was performed.
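For readers less familiar with the agreement statistic reported below, the following is a minimal illustrative sketch, not the study's actual analysis code, of how chance-corrected agreement between panel and model ratings can be computed; the data shown are hypothetical.

```python
# Illustrative sketch only: hypothetical ratings on the same
# three-level scale used in this study.
from sklearn.metrics import cohen_kappa_score

panel_ratings = [
    "appropriate", "appropriate", "neither appropriate nor inappropriate",
    "inappropriate", "appropriate",
]
chatgpt_ratings = [
    "appropriate", "neither appropriate nor inappropriate",
    "neither appropriate nor inappropriate", "appropriate", "appropriate",
]

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa = cohen_kappa_score(panel_ratings, chatgpt_ratings)
print(f"Cohen's kappa: {kappa:.2f}")
```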
Results
The analysis of 9 clinical consensus documents and 129 statements revealed an overall Cohen's kappa of 0.29 (95% CI: 0.12, 0.46), indicating fair agreement between expert panelists and ChatGPT. By topic, ankle arthritis and heel pain showed the highest concordance at 100%, while flatfoot showed the lowest agreement at 25%, reflecting variability between ChatGPT and expert panelists. Among the repeated ChatGPT ratings, intra-rater Cohen's kappa values ranged from 0.41 to 0.92, indicating variable internal reliability across topics.
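For context, Cohen's kappa corrects the observed proportion of agreement for the agreement expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed agreement and $p_e$ the chance-expected agreement. By the commonly used Landis and Koch benchmarks, values of 0.21 to 0.40 are interpreted as fair agreement and 0.41 to 0.60 as moderate agreement, consistent with the descriptors used here.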
Conclusion
ChatGPT achieved fair overall agreement and demonstrated variable consistency when repeatedly rating ACFAS expert panel clinical consensus statements spanning a variety of topics. These data reflect the need for further study of the causes, impacts, and potential solutions for this disparity between artificial intelligence and human intelligence.
Level of Evidence:
Level IV: Retrospective cohort study