Abstract
Category:
Other; Basic Sciences/Biologics
Introduction/Purpose:
Large Language Models (LLMs) such as ChatGPT and Bard have emerged as promising but not risk-free tools in science, offering specialized answers to context-dependent queries. In Foot and Ankle (FA) surgery, efficient triage is crucial given the variety of conditions and limited surgical time. This study evaluates the ability of LLMs to guide patients toward appropriate medical or surgical management, compared with a panel of board-certified FA surgeons.
Methods:
Forty-four fictitious clinical scenarios were created, varying in chronicity, onset, and anatomic localization. For each scenario, the likelihood of requiring surgical management was rated on a 5-point Likert scale, along with the three most probable diagnoses and the two most indicated imaging modalities. Responses from two FA surgeons and the LLMs ChatGPT and Bard were compared, with agreement assessed using Fleiss' and Cohen's kappa.
Results:
Initial agreement on the 5-point Likert scale (Fleiss' kappa = 0.233) indicated low concordance. Recategorizing outcomes as binary (surgical vs. medical orientation of patients) improved agreement to fair (0.423). Pairwise comparison using Cohen's kappa showed slight to moderate agreement between the LLMs and the surgeons, with Bard aligning more closely with the surgeons (77.27% agreement) than ChatGPT.
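For illustration, the multi-rater agreement statistic reported above (Fleiss' kappa) can be computed from a table of rating counts. The sketch below uses invented binary ratings, not the study's data; the rater counts and values are hypothetical.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a subjects-by-categories table of rating counts.

    Each row holds, for one clinical scenario, how many raters chose
    each category; every row must sum to the same number of raters.
    """
    n_subjects = len(table)
    n_raters = sum(table[0])
    total = n_subjects * n_raters

    # Per-subject agreement: proportion of concordant rater pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the marginal category proportions.
    n_categories = len(table[0])
    p_j = [sum(row[j] for row in table) / total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 raters (e.g., 2 surgeons + 2 LLMs) classify
# 5 scenarios as surgical vs. medical; the counts are invented.
ratings = [[4, 0], [4, 0], [0, 4], [2, 2], [3, 1]]
kappa = fleiss_kappa(ratings)
```

A kappa near 0 indicates chance-level agreement, while 1.0 indicates perfect agreement, which is how thresholds such as "fair" are interpreted.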
Conclusion:
LLMs show promise in FA triage but require refinement before they can be considered clinically reliable. Bard's higher agreement with surgeons suggests that some models may better capture the nuances of clinical judgment. Future research should enhance LLM interpretive algorithms and explore their supportive role in medical decision-making.
