Abstract
Background:
ChatGPT-4 has demonstrated potential in offering treatment recommendations for orthopaedic conditions following American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines, including those pertaining to foot and ankle pathology. Although prior studies have explored its performance in triaging causes of knee pain, the application of ChatGPT-4o to triaging patients into appropriate health care settings remains largely unexamined, and its performance in foot and ankle triage is incompletely characterized. This study evaluated ChatGPT-4o’s ability to generate differential diagnoses, recommend appropriate triage destinations, and formulate treatment plans when provided with expanded clinical information.
Methods:
In this exploratory, hypothesis-generating, vignette-based study, 24 standardized foot and ankle complaints were input into ChatGPT-4o, with memory reset between entries to minimize bias. Twelve cases assessed ChatGPT-4o’s ability to generate differential diagnoses and triage decisions (Primary Care Physician, Foot and Ankle Specialist, or Emergency Department/Urgent Care), which were compared against evaluations by 2 fellowship-trained orthopaedic foot and ankle surgeons. An additional 12 expanded clinical vignettes were used to prompt a primary diagnosis and treatment recommendations, which were then graded for accuracy and suitability.
Results:
ChatGPT-4o generated differentials that were considered clinically appropriate for all triage conditions. The top diagnosis matched that of the surgeons in 9 of 12 cases (75%) and appeared within the first or second position of the differential list in 10 of 12 cases (83.3%). Across all differential lists, 26 of 36 diagnoses (72.2%) were identical. ChatGPT-4o’s triage recommendations matched the surgeons’ decisions in 6 cases (50%). With expanded clinical information, ChatGPT-4o maintained diagnostic accuracy (75%) and generated appropriate management plans in 11 of 12 cases (91.7%).
Conclusion:
ChatGPT-4o was able to generate clinically reasonable differentials for foot and ankle conditions. Although triage decision making showed variability, these findings support a limited role for ChatGPT-4o as an adjunct to central scheduling workflows, helping streamline patient triage and health care delivery.
Level of Evidence:
Level V, expert opinion, vignette-based study.
Introduction
The rapid advancement of artificial intelligence (AI) has transformed various aspects of health care, particularly with the emergence of large language models (LLMs) such as ChatGPT.1-3 As a subset of AI, generative models like ChatGPT are capable of synthesizing novel information by identifying patterns across massive data sets, making them uniquely suited to support complex clinical reasoning, diagnostic interpretation, and patient communication.2,4-9 These systems are designed to emulate human reasoning, decision making, and problem solving, enabling them to assist in tasks that historically required human expertise. ChatGPT-4, a recent iteration of OpenAI’s generative AI chatbot, has demonstrated notable proficiency in medical applications, including passing professional licensing examinations10,11 and supporting clinical decision-making workflows.12 ChatGPT’s potential in health care has garnered significant attention, with growing interest in its ability to diagnose, triage, and provide treatment recommendations across various medical specialties.13 In the realm of medical licensing examinations, for instance, Kung et al10 demonstrated that ChatGPT successfully passed all 3 stages of the United States Medical Licensing Examination (USMLE) with approximately 60% accuracy, offering detailed, clinically informed responses without specialized input or reinforcement. Additionally, Kung et al11 demonstrated that ChatGPT-4 outperformed the average postgraduate year-5 orthopaedic surgery resident on the Orthopaedic In-Training Examination. This performance illustrates ChatGPT’s potential as a tool for medical education and decision support. Although its utility in orthopaedics has been explored in broader contexts, ChatGPT’s specific role in triaging patients within the subspecialty of orthopaedic foot and ankle surgery remains an area of growing interest.
Recent literature has demonstrated ChatGPT-4’s ability to generate accurate differential diagnoses, align treatment recommendations with clinical guidelines, and provide reliable patient education materials.14-16 Hartman et al16 highlighted ChatGPT-4’s strengths in diagnosing soft tissue conditions while also identifying limitations in its ability to provide comprehensive information and alternative treatment options. In particular, ChatGPT-4 was less effective in offering a range of management strategies for peroneal tendon tears, which often require nuanced, patient-specific decision making. In foot and ankle surgery, patients present with a range of conditions, from soft tissue injuries to degenerative pathologies, each requiring tailored management strategies, which makes this anatomic region particularly complex. Effective triage is essential in the foot and ankle injury setting because it ensures that patients receive timely, appropriate care while making the best use of available resources. The ability to accurately prioritize patients based on the severity of their foot and ankle injuries is especially valuable in resource-limited settings or for frontline health care providers who may lack specialized orthopaedic training. ChatGPT-4 has demonstrated the potential to improve emergency department (ED) triage accuracy, but significant variability and potential biases have limited its immediate integration into the clinical setting.13 By serving as a virtual assistant, ChatGPT-4 has the potential to enhance diagnostic accuracy, reduce delays in care, and streamline workflows, particularly for conditions that require early intervention to prevent complications.
This study explores the utility of ChatGPT-4o, the latest version of ChatGPT as of April 2025, in triaging patients within the context of orthopaedic foot and ankle surgery. Specifically, it evaluates ChatGPT-4o’s ability to generate differential diagnoses and to identify the appropriate health care setting for evaluating various foot and ankle pathologies, as judged against fellowship-trained orthopaedic foot and ankle surgeons. ChatGPT-4o’s diagnostic accuracy, treatment recommendations, and alignment with clinical guidelines in the foot and ankle setting were additionally evaluated.
Methods
Institutional review board (IRB) approval was not required for this study because no patient information was used in this assessment. Role prompting and chain-of-thought (CoT) prompting were used because today’s LLMs respond to the words and sentences a user presents, and such prompting has been shown to improve a chatbot’s reasoning and performance.17 The role prompt included “As Dr. GPT, a professional orthopaedic surgeon specializing in foot and ankle surgery, your role is to provide expert guidance” and “I, myself, am also an orthopaedic surgeon specializing in foot and ankle surgery.” These prompts direct the chatbot to take on the persona of a professional foot and ankle provider whose work will be verified by another foot and ankle orthopaedic surgeon, helping align ChatGPT-4o’s responses with the user’s requirements and expectations. Additionally, the CoT prompt included “For each scenario provided, you should analyze possible diagnoses in-depth. Work this out in a step-by-step way to be sure you have the right answer.”
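For readers scripting a similar evaluation, the role and CoT prompt components above can be assembled programmatically. The sketch below is a hypothetical illustration only (the `build_messages` helper is our own construct, not part of the study protocol); it builds a fresh message list for each complaint, mirroring the study's memory reset between prompting sessions.

```python
# Hypothetical sketch: combining the role prompt and the chain-of-thought
# (CoT) prompt quoted above into one system message per session.
ROLE_PROMPT = (
    "As Dr. GPT, a professional orthopaedic surgeon specializing in foot "
    "and ankle surgery, your role is to provide expert guidance. I, myself, "
    "am also an orthopaedic surgeon specializing in foot and ankle surgery."
)
COT_PROMPT = (
    "For each scenario provided, you should analyze possible diagnoses "
    "in-depth. Work this out in a step-by-step way to be sure you have the "
    "right answer."
)

def build_messages(complaint: str) -> list[dict]:
    """Return a fresh message list for one complaint.

    Creating the list anew each time mirrors the study's design of
    resetting memory and deleting chat history between sessions.
    """
    return [
        {"role": "system", "content": ROLE_PROMPT + " " + COT_PROMPT},
        {"role": "user", "content": complaint},
    ]

messages = build_messages("Sharp heel pain with the first steps each morning.")
```

A list in this shape could then be passed to a chat-completion API; no API call is shown here because the study used the consumer ChatGPT-4o interface rather than a scripted pipeline.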
Triage and Vignette Prompts
The initial focus of this study was to assess the ability of ChatGPT-4o to generate a differential diagnosis and triage patients to the appropriate clinical setting (Primary Care Physician; Foot and Ankle Specialist, i.e., a podiatrist or a foot and ankle-trained orthopaedic surgeon; or Emergency Department/Urgent Care) from a brief clinical complaint. The subsequent focus was to assess the ability of ChatGPT-4o to provide an accurate diagnosis when presented with a clinical vignette, in accordance with American Academy of Orthopaedic Surgeons guidelines. Twelve triage prompts and 12 clinical vignettes were selected. These prompts were developed from the senior author’s clinical experience and common foot and ankle injuries reported in the literature to represent typical presentations.18-21 ChatGPT-4o was prompted in April 2025 to provide 3 possible diagnoses for each triage prompt. The vignettes did not include any medical history, physical examination findings, or diagnostic imaging. For each management method suggested for the vignettes, we asked ChatGPT-4o to explain the applications, preoperative planning, risks, limitations, and expected results. Prior to the prompting session, the memory of ChatGPT-4o was turned off to prevent the chatbot from referencing saved memories about the user. Additionally, the chat history was deleted after every prompting session, and no additional queries were entered after the chatbot’s response to either the triage scenarios or the vignettes.
Expert Evaluation
Two fellowship-trained foot and ankle surgeons independently completed the same triage and vignette prompts. These 2 surgeons were not involved in creating the prompts. For each triage prompt, the surgeons provided a differential diagnosis and a triage location, whereas for the clinical vignettes, they provided only a primary diagnosis and proposed management. For each prompt, an aggregate differential diagnosis was formed using a weighted scoring system: a surgeon’s first diagnosis received 3 points, the second 2 points, and the third 1 point. Each diagnosis’s points were summed, and the aggregate differential diagnosis was ordered by decreasing total score. For each clinical vignette prompt, ChatGPT-4o’s proposed management was graded by the 2 fellowship-trained physicians for diagnostic accuracy and suitability, the latter defined by treatment relevance, clarity, and comprehensiveness. Accuracy was scored on a 0-2 scale: 0 points for inaccurate responses, 1 point for somewhat accurate responses, and 2 points for accurate responses. A 0-2 scale was also used for assessing the suitability of the chatbot’s responses: 0 points for responses that were not clinically relevant or suitable and had the potential to mislead patients; 1 point for responses that were somewhat clinically appropriate but not comprehensive, that is, did not fully include the necessary guidance or management; and 2 points for responses that were completely relevant to the scenario, with information that could be used to provide care equivalent to that of a foot and ankle subspecialist without ambiguity. A similar scale was used to assess clarity and comprehensiveness.22 Last, the percentage alignment between the chatbot’s and the physicians’ differential diagnoses was computed.
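As a concrete illustration of the weighted scoring described above, the following sketch aggregates 2 raters' ranked differentials into a single consensus list. The diagnosis names are hypothetical examples, not study data.

```python
from collections import defaultdict

# Weights for 1st, 2nd, and 3rd diagnoses, per the scoring system
# described in the Methods (3, 2, and 1 points, respectively).
WEIGHTS = [3, 2, 1]

def aggregate_differential(rater_lists):
    """Combine ranked 3-item differentials from several raters.

    Each diagnosis accumulates points by rank position; the aggregate
    differential is the list of diagnoses sorted by decreasing total score.
    """
    scores = defaultdict(int)
    for ranked in rater_lists:
        for rank, diagnosis in enumerate(ranked):
            scores[diagnosis] += WEIGHTS[rank]
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example differentials from 2 surgeons:
surgeon_a = ["ankle sprain", "peroneal tendon tear", "fibula fracture"]
surgeon_b = ["ankle sprain", "fibula fracture", "syndesmotic injury"]
print(aggregate_differential([surgeon_a, surgeon_b]))
# ankle sprain scores 6, fibula fracture 3, peroneal tendon tear 2,
# syndesmotic injury 1, so the aggregate list follows that order.
```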
Statistical Analysis
All statistical analyses were performed using Microsoft Excel and RStudio (version 2023; Posit arm64), with statistical significance defined at a threshold of P < .05 in all circumstances. The overall proportion of differential diagnoses that aligned with the surgeons’ consensus was expressed using descriptive statistics as percentages. Accuracy and clinical suitability scores were described as medians and interquartile ranges (IQRs). Interrater reliability between raters for the accuracy and clinical suitability evaluations was calculated using the Cohen κ statistic. The following κ thresholds were used to classify the quality of agreement: 0.20 or less as poor, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, 0.61 to 0.80 as good, 0.81 to 0.90 as very good, and 0.91 to 1.00 as excellent.
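For illustration, unweighted Cohen κ and the agreement bands listed above can be computed as in the sketch below (the study itself used RStudio; the rater scores shown are hypothetical, not study data).

```python
def cohen_kappa(r1, r2):
    """Unweighted Cohen kappa for two raters' categorical scores."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    categories = set(r1) | set(r2)
    # Observed agreement: proportion of items where the raters match.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement by chance, from each rater's marginal frequencies.
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def agreement_label(kappa):
    """Map kappa to the qualitative bands used in this study."""
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "good"), (0.90, "very good"), (1.00, "excellent")]
    for upper, label in bands:
        if kappa <= upper:
            return label

# Hypothetical accuracy scores (0-2 scale) from 2 raters across 6 vignettes:
rater1 = [2, 2, 1, 0, 2, 1]
rater2 = [2, 1, 1, 0, 2, 2]
k = cohen_kappa(rater1, rater2)
print(round(k, 3), agreement_label(k))  # kappa = 5/11, i.e., "moderate"
```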
Results
Triage Evaluation and Disposition Recommendations
A parallel analysis was conducted on the same 12 clinical vignettes to evaluate triage accuracy, disposition recommendations, and the appropriateness of differential diagnoses provided by physicians vs ChatGPT-4o. Results are detailed in Table 1.
Table 1. Comparison of Physician and ChatGPT-4o Triage Responses Across 12 Clinical Vignettes.a
Abbreviations: PCP, primary care physician; ED, emergency department.
Suitability is rated on a 0-2 scale, where 0 indicates responses that are not clinically relevant, 1 indicates responses that are clinically relevant but not comprehensive, and 2 indicates responses that are clinically relevant and comprehensive.
ChatGPT-4o and physicians achieved 100% diagnostic agreement in 3 of 12 vignettes, 66% agreement in 7 of 12 vignettes, and 33% agreement in 2 of 12 vignettes. The overall accuracy of ChatGPT-4o’s differentials was equal to or closely aligned with the physicians’ consensus in most cases, with accuracy scores ranging from 33% to 100%.
In terms of triage disposition, ChatGPT-4o matched physician scheduling recommendations in 7 of 12 vignettes (58%), showing general consistency in determining the urgency and type of provider required. Differences in triage recommendations appeared in complex or ambiguous presentations. For example, in the vignette involving pain and stiffness in the left ankle at night, the physicians recommended evaluation by a primary care provider, whereas ChatGPT-4o opted for a foot and ankle surgeon, reflecting different interpretations of the need for surgical involvement. In acute trauma scenarios (eg, Lisfranc injury, calcaneal fracture), both ChatGPT-4o and physicians consistently recommended emergency or urgent care, suggesting shared recognition of red-flag signs requiring expedited intervention.
Lower accuracy scores in several triage vignettes reflected 1 or more recurring patterns: ambiguous or nonspecific symptom presentations, which allowed for multiple plausible differentials and reduced exact match rates; overlapping anatomic or etiologic possibilities, where ChatGPT-4o and physicians proposed equally reasonable diagnoses that differed in terminology or specificity (eg, bursitis vs tendinopathy); and variation in interpretation of injury mechanisms, particularly in trauma cases, leading to divergent yet clinically sound prioritizations of potential injuries. These patterns suggest that the lower accuracy scores often reflected semantic or prioritization differences rather than truly incorrect reasoning.
Diagnostic Accuracy and Suitability
Twelve clinical vignettes encompassing a variety of foot and ankle conditions were evaluated to compare diagnostic responses between the physicians and ChatGPT-4o. Table 2 summarizes these comparisons in terms of diagnostic concordance, accuracy, and clinical suitability.
Table 2. Comparative Diagnostic Responses and Evaluations for Clinical Vignettes.a
The accuracy of ChatGPT-4o’s responses was scored on a 0-2 scale, with 0 indicating inaccurate responses, 1 indicating somewhat accurate responses, and 2 indicating accurate responses. Suitability was rated on a 0-2 scale, where 0 indicates responses that are not clinically relevant, 1 indicates responses that are clinically relevant but not comprehensive, and 2 indicates responses that are clinically relevant and comprehensive.
ChatGPT-4o and the physicians produced identical or nearly identical primary diagnoses in 7 of 12 cases (58%), including cases involving plantar fasciitis, osteochondral lesion of the talus, Achilles tendon rupture, Lisfranc injury, and ankle arthritis. These matched diagnoses received full scores (2/2) for accuracy and suitability, indicating a high degree of agreement in straightforward musculoskeletal cases.
In cases with partial diagnostic overlap or differing primary diagnoses, the accuracy and suitability scores were still generally consistent, indicating that ChatGPT-4o’s differentials, although not always identical, were clinically reasonable. For instance, in vignette 2, the physicians diagnosed lumbar stenosis whereas ChatGPT-4o suggested tarsal tunnel syndrome; both are relevant considerations for foot pain with neuropathic features, yielding a moderate accuracy and suitability score of 1.5 each.
However, notable diagnostic divergence occurred in vignettes 6, 8, and 12. In vignette 6, the physicians suspected a syndesmotic (high ankle) sprain, whereas ChatGPT-4o prioritized an Achilles rupture with possible soft tissue injury or deep vein thrombosis (DVT) given the calf pain. In vignette 8, the physicians diagnosed foot arthritis, whereas ChatGPT-4o proposed plantar fasciitis with possible arthritic components, suggesting a broader differential inclusive of structural pathology. In vignette 12, the physicians identified an ankle fracture-dislocation, whereas ChatGPT-4o proposed a Lisfranc injury with possible dislocation/fracture. Although anatomically different, both indicate severe traumatic injury requiring urgent evaluation, justifying identical accuracy and suitability scores (2/2).
Discussion
The primary findings of this study, from a controlled vignette-based comparison, are as follows: (1) ChatGPT-4o was able to generate differential diagnoses for foot and ankle pain that were plausible and overlapped with those proposed by fellowship-trained foot and ankle surgeons; (2) ChatGPT-4o achieved a diagnostic accuracy of 75% when prompted with expanded clinical vignettes, with treatment recommendations that were generally reasonable but not consistently optimal; and (3) ChatGPT-4o’s performance in triaging patients to the appropriate health care setting (primary care physician, foot and ankle specialist, or emergency department/urgent care) was correct in only 50% of triage cases, highlighting substantial limitations in its current reliability for clinical triage and the need for further refinement prior to any clinical application.
These results align with the growing body of literature evaluating the role of LLMs in musculoskeletal triage and clinical decision support. Prior studies have demonstrated that ChatGPT-4 can reliably generate differential diagnoses and treatment recommendations consistent with established guidelines across a range of orthopaedic conditions.16,22 For instance, Kunze et al22 showed that ChatGPT-4 achieved 70% diagnostic accuracy for triaging knee pain complaints and reached 100% accuracy when additional clinical context was provided. Our findings mirror these trends in the foot and ankle domain: ChatGPT-4o performed better when provided with expanded clinical information, underscoring the importance of prompt engineering and detailed patient input when using AI for clinical support.
Notably, although the differential diagnoses generated by ChatGPT-4o were consistently plausible, the model showed clinically significant error rates when tasked with triaging patients based solely on minimal information. In only half of the triage scenarios did ChatGPT-4o recommend the same health care setting as the 2 fellowship-trained foot and ankle physicians. One possible explanation for these inaccuracies is the lack of weighting for red-flag symptoms and the absence of nuanced context that expert clinicians instinctively integrate into decision making. This underscores the critical need for careful supervision and refinement before considering LLMs for autonomous triage roles. However, it is important to note that this study evaluated unprompted central scheduling, so agreement between physicians and ChatGPT-4o may increase if additional context about institutional triage preferences is provided. ChatGPT-4o’s performance in providing treatment recommendations was promising, with 91.7% of cases yielding appropriate and accurate guidance consistent with evidence-based management. In particular, ChatGPT-4o was adept at outlining conservative management strategies, surgical approaches, and associated risks and benefits in a manner that was accurate yet accessible for patients. This finding suggests that ChatGPT-4o may serve as an adjunct to providers by assisting with patient education and counseling in orthopaedic care settings, a finding similarly noted by Kirchner et al14 when assessing AI’s role in improving orthopaedic patient literacy.
The clinical relevance of these findings is significant. Foot and ankle conditions such as Achilles tendon ruptures, peroneal tendon injuries, and chronic ankle instability require timely diagnosis and management to prevent long-term disability. Delays in triage or mismanagement at early stages can profoundly impact outcomes.23,24 By facilitating more efficient and accurate triage, ChatGPT-4o could help expedite access to specialist care and optimize resource allocation, particularly in settings where orthopaedic expertise is limited. Moreover, in rural or underserved areas where patients often face barriers to timely specialty care, the incorporation of AI tools like ChatGPT-4o could bridge critical access gaps.
Nevertheless, important limitations of this study must be acknowledged. First, the sample size of clinical scenarios was limited, which may restrict the generalizability of the findings. Second, although chain-of-thought prompting improved performance, the study was dependent on prompt design, and alternative prompt strategies may have yielded different results. Third, although 2 independent foot and ankle specialists graded the responses, the subjective nature of assessing “plausibility” and “suitability” introduces potential bias, even though efforts were made to mitigate this with independent review. Fourth, the study compared the accuracy of ChatGPT-4o only with the assessments of 2 foot and ankle surgeons; future studies should include a larger number of clinicians to improve the robustness and validity of the reference standard. Fifth, this study assessed ChatGPT-4o’s responses at a single point in time; given that LLM outputs can vary based on server load and model updates, results may not be reproducible across time points. Sixth, no formal correction for multiple comparisons was applied; therefore, all secondary analyses should be interpreted as exploratory. Finally, although the study simulated realistic patient queries, it did not evaluate performance in live patient interactions, which could introduce additional complexity.
Overall, these findings suggest that LLMs, such as ChatGPT-4o, may have limited and preliminary potential as adjunctive clinical support tools in orthopaedics, but their current performance indicates that they should not be relied on in isolation for diagnosis or triage and require substantial improvement and validation before any routine clinical use. As ChatGPT-4o and future models continue to evolve—particularly with multimodal capabilities incorporating imaging and structured data inputs—their accuracy, contextual understanding, and clinical safety may improve. However, rigorous validation in prospective clinical trials, ethical scrutiny, and appropriate regulatory oversight will be necessary before these models can be safely and effectively integrated into clinical workflows. In the meantime, ChatGPT-4o may offer value as an adjunct for patient education, second-opinion resources, and central scheduling systems seeking to streamline care pathways for patients presenting with foot and ankle complaints.
Conclusion
ChatGPT-4o demonstrated the capacity to generate clinically plausible differential diagnoses and treatment recommendations for a range of common foot and ankle complaints. When provided with expanded clinical information, the model’s performance showed partial overlap with that of fellowship-trained orthopaedic foot and ankle surgeons, suggesting a limited and exploratory role as an adjunctive clinical support tool. However, significant limitations were observed in ChatGPT-4o’s ability to triage patients to the appropriate health care setting based on limited information, underscoring the need for human oversight and further model refinement.
The implications of these findings are 2-fold. For patients, the use of LLMs such as ChatGPT-4o could eventually provide improved access to preliminary guidance, facilitate health literacy, and expedite appropriate referrals, particularly in resource-limited settings where specialist care may not be readily available. For providers, ChatGPT-4o could serve as a triage adjunct, helping to streamline scheduling pathways, prioritize higher-acuity cases, and offload administrative burdens, ultimately optimizing workflow efficiency. However, reliance on AI outputs without clinical validation remains inappropriate at this time.
Possible mechanisms underlying the model’s performance include its ability to rapidly synthesize the wide range of orthopaedic clinical data and guidelines embedded within its training set. The use of role prompting and CoT prompting likely enhanced its reasoning and contextual analysis, mirroring clinicians’ stepwise thinking. Nevertheless, ChatGPT-4o’s limitations in integrating nuanced, patient-specific factors such as comorbidities and disease severity reflect current technological constraints and reinforce the importance of maintaining clinician judgment at the center of patient care.
With ongoing advancements in model development, including multimodal input integration and real-world clinical validation, AI platforms like ChatGPT-4o may ultimately augment musculoskeletal care delivery, improving both diagnostic efficiency and patient satisfaction across diverse health care settings.
Supplemental Material
Supplemental material, sj-pdf-1-fao-10.1177_24730114261425946, for “Evaluating ChatGPT’s Triage and Diagnostic Capabilities in Patients Presenting With Common Causes of Foot and Ankle Pain” by Joseph Mullen, Abdulganeey Olawin, Rachit Saggar, Warren Austin, Glenn Reeves, Mohammedanwar Idress, Andrew Cramer, Amin Karimi, Lauren Lewis, Peter Mangone and MaCalus Hogan in Foot & Ankle Orthopaedics.
Footnotes
Ethical Considerations
Ethical approval was not sought for the present study because the study did not involve human subjects research or use of identifiable patient data.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Disclosure forms for all authors are available online.
References
