Abstract
Background:
ChatGPT-4 has demonstrated potential in offering treatment recommendations for orthopaedic conditions following American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines, including those pertaining to foot and ankle pathology. Although prior studies have explored its performance in triaging causes of knee pain, the application of ChatGPT-4o to triaging patients into appropriate health care settings remains largely unexamined, and its performance in foot and ankle triage is incompletely characterized. This study evaluated ChatGPT-4o’s ability to generate differential diagnoses, recommend appropriate triage destinations, and formulate treatment plans when provided with expanded clinical information.
Methods:
In this exploratory, hypothesis-generating, vignette-based study, 24 standardized foot and ankle complaints were input into ChatGPT-4o, with memory reset between entries to minimize bias. Twelve cases assessed ChatGPT-4o’s ability to generate differential diagnoses and triage decisions (Primary Care Physician, Foot and Ankle Specialist, or Emergency Department/Urgent Care), which were compared against evaluations by 2 fellowship-trained orthopaedic foot and ankle surgeons. An additional 12 expanded clinical vignettes were used to prompt a primary diagnosis and treatment recommendations, which were then graded for accuracy and suitability.
Results:
ChatGPT-4o generated differentials that were considered clinically appropriate for all triage conditions. The top diagnosis matched that of the surgeons in 9 of 12 cases (75%) and appeared within the first or second position of the differential list in 10 of 12 cases (83.3%). Across all differential lists, 26 of 36 diagnoses (72.2%) were identical. ChatGPT-4o’s triage recommendations matched the surgeons’ decisions in 6 cases (50%). With expanded clinical information, ChatGPT-4o maintained diagnostic accuracy (75%) and generated appropriate management plans in 11 of 12 cases (91.7%).
Conclusion:
ChatGPT-4o was able to generate clinically reasonable differentials for foot and ankle conditions. Although triage decision making showed variability, these findings support a limited role for ChatGPT-4o as an adjunct to central scheduling workflows, helping streamline patient triage and health care delivery.
Level of Evidence:
Level V, expert opinion, vignette-based study.
Introduction
The rapid advancement of artificial intelligence (AI) has transformed various aspects of health care, particularly with the emergence of large language models (LLMs) such as ChatGPT.1-3 As a subset of AI, generative models like ChatGPT are capable of synthesizing novel information by identifying patterns across massive data sets, making them uniquely suited to support complex clinical reasoning, diagnostic interpretation, and patient communication.2,4-9 These systems are designed to emulate human reasoning, decision making, and problem solving, enabling them to assist in tasks that historically required human expertise. ChatGPT-4, a recent iteration of OpenAI’s generative AI chatbot, has demonstrated notable proficiency in medical applications, including passing professional licensing examinations10,11 and supporting clinical decision-making workflows.12 ChatGPT’s potential in health care has garnered significant attention, with growing interest in its ability to diagnose, triage, and provide treatment recommendations across various medical specialties.13 In the realm of medical licensing examinations, for instance, Kung et al10 demonstrated that ChatGPT successfully passed all 3 stages of the United States Medical Licensing Examination (USMLE) with approximately 60% accuracy, offering detailed, clinically informed responses without specialized input or reinforcement. Additionally, Kung et al11 demonstrated that ChatGPT-4 outperformed the average postgraduate year-5 orthopaedic surgery resident on the Orthopaedic In-Training Examination. This performance illustrates ChatGPT’s potential as a tool for medical education and decision support. Although its utility in orthopaedics has been explored in broader contexts, ChatGPT’s specific role in triaging patients within the subspecialty of orthopaedic foot and ankle surgery remains an area of growing interest.
Recent literature has demonstrated ChatGPT-4’s ability to generate accurate differential diagnoses, align treatment recommendations with clinical guidelines, and provide reliable patient education materials.14-16 Hartman et al16 highlighted ChatGPT-4’s strengths in diagnosing soft tissue conditions while also identifying limitations in its ability to provide comprehensive information and alternative treatment options. In particular, ChatGPT-4 was less effective in offering a range of management strategies for peroneal tendon tears, which often require nuanced, patient-specific decision making. In foot and ankle surgery, patients present with a range of conditions, from soft tissue injuries to degenerative pathologies, each requiring tailored management strategies, which makes this anatomic region particularly complex. Effective triage is essential in the foot and ankle injury setting because it ensures that patients receive timely, appropriate care while making the best use of available resources. The ability to accurately prioritize patients based on the severity of their foot and ankle injuries is especially valuable in resource-limited settings or for frontline health care providers who may lack specialized orthopaedic training. ChatGPT-4 has demonstrated the potential to improve emergency department (ED) triage accuracy, but significant variability and potential biases have limited its immediate integration into the clinical setting.13 By serving as a virtual assistant, ChatGPT-4 has the potential to enhance diagnostic accuracy, reduce delays in care, and streamline workflows, particularly for conditions that require early intervention to prevent complications.
This study explores the utility of ChatGPT-4o, the latest version of ChatGPT as of April 2025, in triaging patients within the context of orthopaedic foot and ankle surgery. Specifically, it evaluates ChatGPT-4o’s ability to generate differential diagnoses and to identify the appropriate health care setting for evaluating various foot and ankle pathologies, as judged against fellowship-trained orthopaedic foot and ankle surgeons. ChatGPT-4o’s diagnostic accuracy, treatment recommendations, and alignment with clinical guidelines in the foot and ankle setting were additionally evaluated.
Methods
Institutional review board (IRB) approval was not required for this study because no patient information was used in this assessment. Role prompting and chain-of-thought (CoT) prompting were used because today’s LLMs respond to the words and sentences a user presents, and such prompting has been shown to improve a chatbot’s reasoning and performance.17 The role prompt included “As Dr. GPT, a professional orthopaedic surgeon specializing in foot and ankle surgery, your role is to provide expert guidance” and “I, myself, am also an orthopaedic surgeon specializing in foot and ankle surgery.” These prompts direct the chatbot to take on the persona of a professional foot and ankle provider whose work will be verified by another foot and ankle orthopaedic surgeon, helping align ChatGPT-4o’s responses with the user’s requirements and expectations. Additionally, the CoT prompt included “For each scenario provided, you should analyze possible diagnoses in-depth. Work this out in a step-by-step way to be sure you have the right answer.”
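For readers scripting a similar evaluation, the role and CoT prompt components above can be assembled programmatically. The sketch below is a hypothetical illustration only (the `build_messages` helper is our own construct, not part of the study protocol); it builds a fresh message list for each complaint, mirroring the study's memory reset between prompting sessions.

```python
# Hypothetical sketch: combining the role prompt and the chain-of-thought
# (CoT) prompt quoted above into one system message per session.
ROLE_PROMPT = (
    "As Dr. GPT, a professional orthopaedic surgeon specializing in foot "
    "and ankle surgery, your role is to provide expert guidance. I, myself, "
    "am also an orthopaedic surgeon specializing in foot and ankle surgery."
)
COT_PROMPT = (
    "For each scenario provided, you should analyze possible diagnoses "
    "in-depth. Work this out in a step-by-step way to be sure you have the "
    "right answer."
)

def build_messages(complaint: str) -> list[dict]:
    """Return a fresh message list for one complaint.

    Creating the list anew each time mirrors the study's design of
    resetting memory and deleting chat history between sessions.
    """
    return [
        {"role": "system", "content": ROLE_PROMPT + " " + COT_PROMPT},
        {"role": "user", "content": complaint},
    ]

messages = build_messages("Sharp heel pain with the first steps each morning.")
```

A list in this shape could then be passed to a chat-completion API; no API call is shown here because the study used the consumer ChatGPT-4o interface rather than a scripted pipeline.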
Triage and Vignette Prompts
The initial focus of this study was to assess the ability of ChatGPT-4o to generate a differential diagnosis and triage patients to the appropriate clinical setting (Primary Care Physician; Foot and Ankle Specialist, i.e., a podiatrist or a foot and ankle-trained orthopaedic surgeon; or Emergency Department/Urgent Care) from a brief clinical complaint. The subsequent focus was to assess the ability of ChatGPT-4o to provide an accurate diagnosis when presented with a clinical vignette, in accordance with American Academy of Orthopaedic Surgeons guidelines. Twelve triage prompts and 12 clinical vignettes were selected. These prompts were developed from the senior author’s clinical experience and common foot and ankle injuries reported in the literature to represent typical presentations.18-21 ChatGPT-4o was prompted in April 2025 to provide 3 possible diagnoses for each triage prompt. The vignettes did not include any medical history, physical examination findings, or diagnostic imaging. For each management method suggested for the vignettes, we asked ChatGPT-4o to explain the applications, preoperative planning, risks, limitations, and expected results. Prior to the prompting session, the memory of ChatGPT-4o was turned off to prevent the chatbot from referencing saved memories about the user. Additionally, the chat history was deleted after every prompting session, and no additional queries were entered after the chatbot’s response to either the triage scenarios or the vignettes.
Expert Evaluation
Two fellowship-trained foot and ankle surgeons independently completed the same triage and vignette prompts. These 2 surgeons were not involved in creating the prompts. For each triage prompt, the surgeons provided a differential diagnosis and a triage location, whereas for the clinical vignettes, they provided only a primary diagnosis and proposed management. For each prompt, an aggregate differential diagnosis was formed using a weighted scoring system: a surgeon’s first diagnosis received 3 points, the second 2 points, and the third 1 point. Each diagnosis’s points were summed, and the aggregate differential diagnosis was ordered by decreasing total score. For each clinical vignette prompt, ChatGPT-4o’s proposed management was graded by the 2 fellowship-trained physicians for diagnostic accuracy and suitability, the latter defined by treatment relevance, clarity, and comprehensiveness. Accuracy was scored on a 0-2 scale: 0 points for inaccurate responses, 1 point for somewhat accurate responses, and 2 points for accurate responses. A 0-2 scale was also used for assessing the suitability of the chatbot’s responses: 0 points for responses that were not clinically relevant or suitable and had the potential to mislead patients; 1 point for responses that were somewhat clinically appropriate but not comprehensive, that is, did not fully include the necessary guidance or management; and 2 points for responses that were completely relevant to the scenario, with information that could be used to provide care equivalent to that of a foot and ankle subspecialist without ambiguity. A similar scale was used to assess clarity and comprehensiveness.22 Last, the percentage alignment between the chatbot’s and the physicians’ differential diagnoses was computed.
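As a concrete illustration of the weighted scoring described above, the following sketch aggregates 2 raters' ranked differentials into a single consensus list. The diagnosis names are hypothetical examples, not study data.

```python
from collections import defaultdict

# Weights for 1st, 2nd, and 3rd diagnoses, per the scoring system
# described in the Methods (3, 2, and 1 points, respectively).
WEIGHTS = [3, 2, 1]

def aggregate_differential(rater_lists):
    """Combine ranked 3-item differentials from several raters.

    Each diagnosis accumulates points by rank position; the aggregate
    differential is the list of diagnoses sorted by decreasing total score.
    """
    scores = defaultdict(int)
    for ranked in rater_lists:
        for rank, diagnosis in enumerate(ranked):
            scores[diagnosis] += WEIGHTS[rank]
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example differentials from 2 surgeons:
surgeon_a = ["ankle sprain", "peroneal tendon tear", "fibula fracture"]
surgeon_b = ["ankle sprain", "fibula fracture", "syndesmotic injury"]
print(aggregate_differential([surgeon_a, surgeon_b]))
# ankle sprain scores 6, fibula fracture 3, peroneal tendon tear 2,
# syndesmotic injury 1, so the aggregate list follows that order.
```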
Statistical Analysis
All statistical analyses were performed using Microsoft Excel and RStudio (version 2023; Posit arm64), with statistical significance defined at a threshold of P < .05 in all circumstances. The overall proportion of differential diagnoses that aligned with the surgeons’ consensus was expressed using descriptive statistics as percentages. Accuracy and clinical suitability scores were described as medians and interquartile ranges (IQRs). Interrater reliability between raters for the accuracy and clinical suitability evaluations was calculated using the Cohen κ statistic. The following κ thresholds were used to classify the quality of agreement: 0.20 or less as poor, 0.21 to 0.40 as fair, 0.41 to 0.60 as moderate, 0.61 to 0.80 as good, 0.81 to 0.90 as very good, and 0.91 to 1.00 as excellent.
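For illustration, unweighted Cohen κ and the agreement bands listed above can be computed as in the sketch below (the study itself used RStudio; the rater scores shown are hypothetical, not study data).

```python
def cohen_kappa(r1, r2):
    """Unweighted Cohen kappa for two raters' categorical scores."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    categories = set(r1) | set(r2)
    # Observed agreement: proportion of items where the raters match.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement by chance, from each rater's marginal frequencies.
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

def agreement_label(kappa):
    """Map kappa to the qualitative bands used in this study."""
    bands = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "good"), (0.90, "very good"), (1.00, "excellent")]
    for upper, label in bands:
        if kappa <= upper:
            return label

# Hypothetical accuracy scores (0-2 scale) from 2 raters across 6 vignettes:
rater1 = [2, 2, 1, 0, 2, 1]
rater2 = [2, 1, 1, 0, 2, 2]
k = cohen_kappa(rater1, rater2)
print(round(k, 3), agreement_label(k))  # kappa = 5/11, i.e., "moderate"
```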
Results
Triage Evaluation and Disposition Recommendations
A parallel analysis was conducted on the same 12 clinical vignettes to evaluate triage accuracy, disposition recommendations, and the appropriateness of differential diagnoses provided by physicians vs ChatGPT-4o. Results are detailed in Table 1.
Table 1. Comparison of Physician and ChatGPT-4o Triage Responses Across 12 Clinical Vignettes.a
Abbreviations: PCP, primary care physician; ED, emergency department.
Suitability is rated on a 0-2 scale, where 0 indicates responses that are not clinically relevant, 1 indicates responses that are clinically relevant but not comprehensive, and 2 indicates responses that are clinically relevant and comprehensive.
ChatGPT-4o and physicians achieved 100% diagnostic agreement in 3 of 12 vignettes, 66% agreement in 7 of 12 vignettes, and 33% agreement in 2 of 12 vignettes. The overall accuracy of ChatGPT-4o’s differentials was equal to or closely aligned with the physicians’ consensus in most cases, with accuracy scores ranging from 33% to 100%.
In terms of triage disposition, ChatGPT-4o matched physician scheduling recommendations in 7 of 12 vignettes (58%), showing general consistency in determining the urgency and type of provider required. Differences in triage recommendations appeared in complex or ambiguous presentations. For example, in the vignette involving pain and stiffness in the left ankle at night, the physicians recommended evaluation by a primary care provider, whereas ChatGPT-4o opted for a foot and ankle surgeon, reflecting different interpretations of the need for surgical involvement. In acute trauma scenarios (eg, Lisfranc injury, calcaneal fracture), both ChatGPT-4o and physicians consistently recommended emergency or urgent care, suggesting shared recognition of red-flag signs requiring expedited intervention.
Lower accuracy scores in several triage vignettes reflected 1 or more recurring patterns: ambiguous or nonspecific symptom presentations, which allowed for multiple plausible differentials and reduced exact match rates; overlapping anatomic or etiologic possibilities, where ChatGPT-4o and physicians proposed equally reasonable diagnoses that differed in terminology or specificity (eg, bursitis vs tendinopathy); and variation in interpretation of injury mechanisms, particularly in trauma cases, leading to divergent yet clinically sound prioritizations of potential injuries. These patterns suggest that the lower accuracy scores often reflected semantic or prioritization differences rather than truly incorrect reasoning.
Diagnostic Accuracy and Suitability
Twelve clinical vignettes encompassing a variety of foot and ankle conditions were evaluated to compare diagnostic responses between the physicians and ChatGPT-4o. Table 2 summarizes these comparisons in terms of diagnostic concordance, accuracy, and clinical suitability.
Table 2. Comparative Diagnostic Responses and Evaluations for Clinical Vignettes.a
The accuracy of ChatGPT-4o’s responses was scored on a 0-2 scale, with 0 indicating inaccurate responses, 1 indicating somewhat accurate responses, and 2 indicating accurate responses. Suitability was rated on a 0-2 scale, where 0 indicates responses that are not clinically relevant, 1 indicates responses that are clinically relevant but not comprehensive, and 2 indicates responses that are clinically relevant and comprehensive.
ChatGPT-4o and the physicians produced identical or nearly identical primary diagnoses in 7 of 12 cases (58%), including cases involving plantar fasciitis, osteochondral lesion of the talus, Achilles tendon rupture, Lisfranc injury, and ankle arthritis. These matched diagnoses received full scores (2/2) for accuracy and suitability, indicating a high degree of agreement in straightforward musculoskeletal cases.
In cases with partial diagnostic overlap or differing primary diagnoses, the accuracy and suitability scores were still generally consistent, indicating that ChatGPT-4o’s differentials, although not always identical, were clinically reasonable. For instance, in vignette 2, the physicians diagnosed lumbar stenosis whereas ChatGPT-4o suggested tarsal tunnel syndrome; both are relevant considerations for foot pain with neuropathic features, yielding a moderate accuracy and suitability score of 1.5 each.
However, notable diagnostic divergence occurred in vignettes 6, 8, and 12. In vignette 6, the physicians suspected a syndesmotic (high ankle) sprain, whereas ChatGPT-4o prioritized an Achilles rupture with possible soft tissue injury or deep vein thrombosis (DVT) given the calf pain. In vignette 8, the physicians diagnosed foot arthritis, whereas ChatGPT-4o proposed plantar fasciitis with possible arthritic components, suggesting a broader differential inclusive of structural pathology. In vignette 12, the physicians identified an ankle fracture-dislocation, whereas ChatGPT-4o proposed a Lisfranc injury with possible dislocation/fracture. Although anatomically different, both indicate severe traumatic injury requiring urgent evaluation, justifying identical accuracy and suitability scores (2/2).
Discussion
The primary findings of this study, from a controlled vignette-based comparison, are as follows: (1) ChatGPT-4o was able to generate differential diagnoses for foot and ankle pain that were plausible and overlapped with those proposed by fellowship-trained foot and ankle surgeons; (2) ChatGPT-4o achieved a diagnostic accuracy of 75% when prompted with expanded clinical vignettes, with treatment recommendations that were generally reasonable but not consistently optimal; and (3) ChatGPT-4o’s performance in triaging patients to the appropriate health care setting (primary care physician, foot and ankle specialist, or emergency department/urgent care) was correct in only 50% of triage cases, highlighting substantial limitations in its current reliability for clinical triage and the need for further refinement prior to any clinical application.
These results align with the growing body of literature evaluating the role of LLMs in musculoskeletal triage and clinical decision support. Prior studies have demonstrated that ChatGPT-4 can reliably generate differential diagnoses and treatment recommendations consistent with established guidelines across a range of orthopaedic conditions.16,22 For instance, Kunze et al22 showed that ChatGPT-4 achieved 70% diagnostic accuracy for triaging knee pain complaints and reached 100% accuracy when additional clinical context was provided. Our findings mirror these trends in the foot and ankle domain: ChatGPT-4o performed better when provided with expanded clinical information, underscoring the importance of prompt engineering and detailed patient input when using AI for clinical support.
Notably, although the differential diagnoses generated by ChatGPT-4o were consistently plausible, the model showed clinically significant error rates when tasked with triaging patients based solely on minimal information. In only half of the triage scenarios did ChatGPT-4o recommend the same health care setting as the 2 fellowship-trained foot and ankle physicians. One possible explanation for these inaccuracies is the lack of weighting for red-flag symptoms and the absence of nuanced context that expert clinicians instinctively integrate into decision making. This underscores the critical need for careful supervision and refinement before considering LLMs for autonomous triage roles. However, it is important to note that this study evaluated unprompted central scheduling, so agreement between physicians and ChatGPT-4o may increase if additional context about institutional triage preferences is provided. ChatGPT-4o’s performance in providing treatment recommendations was promising, with 91.7% of cases yielding appropriate and accurate guidance consistent with evidence-based management. In particular, ChatGPT-4o was adept at outlining conservative management strategies, surgical approaches, and associated risks and benefits in a manner that was accurate yet accessible for patients. This finding suggests that ChatGPT-4o may serve as an adjunct to providers by assisting with patient education and counseling in orthopaedic care settings, a finding similarly noted by Kirchner et al14 when assessing AI’s role in improving orthopaedic patient literacy.
The clinical relevance of these findings is significant. Foot and ankle conditions such as Achilles tendon ruptures, peroneal tendon injuries, and chronic ankle instability require timely diagnosis and management to prevent long-term disability. Delays in triage or mismanagement at early stages can profoundly impact outcomes.23,24 By facilitating more efficient and accurate triage, ChatGPT-4o could help expedite access to specialist care and optimize resource allocation, particularly in settings where orthopaedic expertise is limited. Moreover, in rural or underserved areas where patients often face barriers to timely specialty care, the incorporation of AI tools like ChatGPT-4o could bridge critical access gaps.
Nevertheless, important limitations of this study must be acknowledged. First, the sample size of clinical scenarios was limited, which may restrict the generalizability of the findings. Second, although chain-of-thought prompting improved performance, the study was dependent on prompt design, and alternative prompt strategies may have yielded different results. Third, although 2 independent foot and ankle specialists graded the responses, the subjective nature of assessing “plausibility” and “suitability” introduces potential bias, even though efforts were made to mitigate this with independent review. Fourth, the study compared the accuracy of ChatGPT-4o only with the assessments of 2 foot and ankle surgeons; future studies should include a larger number of clinicians to improve the robustness and validity of the reference standard. Fifth, this study assessed ChatGPT-4o’s responses at a single point in time; given that LLM outputs can vary based on server load and model updates, results may not be reproducible across time points. Sixth, no formal correction for multiple comparisons was applied; therefore, all secondary analyses should be interpreted as exploratory. Finally, although the study simulated realistic patient queries, it did not evaluate performance in live patient interactions, which could introduce additional complexity.
Overall, these findings suggest that LLMs, such as ChatGPT-4o, may have limited and preliminary potential as adjunctive clinical support tools in orthopaedics, but their current performance indicates that they should not be relied on in isolation for diagnosis or triage and require substantial improvement and validation before any routine clinical use. As ChatGPT-4o and future models continue to evolve—particularly with multimodal capabilities incorporating imaging and structured data inputs—their accuracy, contextual understanding, and clinical safety may improve. However, rigorous validation in prospective clinical trials, ethical scrutiny, and appropriate regulatory oversight will be necessary before these models can be safely and effectively integrated into clinical workflows. In the meantime, ChatGPT-4o may offer value as an adjunct for patient education, second-opinion resources, and central scheduling systems seeking to streamline care pathways for patients presenting with foot and ankle complaints.
Conclusion
ChatGPT-4o demonstrated the capacity to generate clinically plausible differential diagnoses and treatment recommendations for a range of common foot and ankle complaints. When provided with expanded clinical information, the model’s performance showed partial overlap with that of fellowship-trained orthopaedic foot and ankle surgeons, suggesting a limited and exploratory role as an adjunctive clinical support tool. However, significant limitations were observed in ChatGPT-4o’s ability to triage patients to the appropriate health care setting based on limited information, underscoring the need for human oversight and further model refinement.
The implications of these findings are 2-fold. For patients, the use of LLMs such as ChatGPT-4o could eventually provide improved access to preliminary guidance, facilitate health literacy, and expedite appropriate referrals, particularly in resource-limited settings where specialist care may not be readily available. For providers, ChatGPT-4o could serve as a triage adjunct, helping to streamline scheduling pathways, prioritize higher-acuity cases, and offload administrative burdens, ultimately optimizing workflow efficiency. However, reliance on AI outputs without clinical validation remains inappropriate at this time.
Possible mechanisms underlying the model’s performance include its ability to rapidly synthesize the wide range of orthopaedic clinical data and guidelines embedded within its training set. The use of role prompting and CoT prompting likely enhanced its reasoning and contextual analysis, mirroring clinicians’ stepwise thinking. Nevertheless, ChatGPT-4o’s limitations in integrating nuanced, patient-specific factors such as comorbidities and disease severity reflect current technological constraints and reinforce the importance of maintaining clinician judgment at the center of patient care.
With ongoing advancements in model development, including multimodal input integration and real-world clinical validation, AI platforms like ChatGPT-4o may ultimately augment musculoskeletal care delivery, improving both diagnostic efficiency and patient satisfaction across diverse health care settings.
Supplemental Material
Supplemental material, sj-pdf-1-fao-10.1177_24730114261425946, for “Evaluating ChatGPT’s Triage and Diagnostic Capabilities in Patients Presenting With Common Causes of Foot and Ankle Pain” by Joseph Mullen, Abdulganeey Olawin, Rachit Saggar, Warren Austin, Glenn Reeves, Mohammedanwar Idress, Andrew Cramer, Amin Karimi, Lauren Lewis, Peter Mangone and MaCalus Hogan in Foot & Ankle Orthopaedics.
Footnotes
Ethical Considerations
Ethical approval was not sought for the present study because the study did not involve human subjects research or use of identifiable patient data.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Disclosure forms for all authors are available online.
References
