Abstract
Background:
Few studies have evaluated whether large language models, such as ChatGPT, can provide accurate guidance to clinicians in the field of foot and ankle surgery. This study aimed to assess the accuracy of ChatGPT's responses regarding ankle cartilage repair by comparing them with the consensus statements from foot and ankle experts as a standard reference.
Methods:
The large language model ChatGPT-4 (OpenAI) was asked to answer a total of 14 questions on debridement, curettage, and bone marrow stimulation for ankle cartilage lesions, selected from those developed at the 2017 International Consensus Meeting on Cartilage Repair of the Ankle. The ChatGPT responses were compared with the consensus statements developed at this international meeting. A Likert scale (scores, 1-5) was used to evaluate the similarity of the answers by ChatGPT to the consensus statements. Four scoring categories (Accuracy, Overconclusiveness, Supplementary, and Incompleteness) were also used to evaluate the quality of the ChatGPT answers, following previous studies.
Results:
The mean Likert scale score for the similarity of ChatGPT’s answers to the consensus statements was 3.1 ± 0.8. In the 4 scoring categories, the percentages of answers rated “yes” for Accuracy, Overconclusiveness, Supplementary, and Incompleteness were 71.4% (10/14), 35.7% (5/14), 78.6% (11/14), and 14.3% (2/14), respectively.
Conclusion:
This study showed that ChatGPT-4 often provides responses that diverge from expert consensus regarding surgical treatment of ankle cartilage lesions.
Level of Evidence:
Level V, expert opinion.
Introduction
Chondral or osteochondral lesions of the ankle are common injuries in young active subjects and are often associated with trauma such as ankle sprains or fractures.21,30 Several surgical procedures for ankle cartilage lesions have been reported with favorable outcomes. Among these, bone marrow stimulation (BMS) is the most widely performed, and its favorable long-term outcomes have been reported.3,22,25 However, the literature on ankle cartilage repair is still based on lower-quality scientific evidence,23 and no superior surgical treatment for either primary or secondary osteochondral lesions of the talus exists.2,5,13 Thus, evidence-based consensus statements following the International Consensus Meeting on Cartilage Repair of the Ankle have emerged as a way to provide guidance for the surgical treatment of ankle cartilage injuries.6,10,24
The application of artificial intelligence (AI) to medicine is highly alluring and has gained a high degree of interest and popularity recently.1 Rapid advancements in large language models (LLMs), such as Chat Generative Pre-trained Transformer (ChatGPT), Google Bard, and Microsoft’s Bing Chat, have the potential to open a new era of clinical practice and research. These LLMs are freely and readily available to patients and clinicians, and their reliability and accuracy have recently been investigated in several studies on orthopaedic surgery.4,9,16,29 It has been shown that ChatGPT can perform at the same level as orthopaedic surgery residents on the Orthopaedic In-Training Examination.9,17 Some previous studies have also reported that ChatGPT can correctly provide evidence-based responses to questions frequently raised by patients.11,20 However, other studies have shown that these LLMs do not provide clinically supported answers for hip and knee arthroplasty.18,31
In the field of foot and ankle surgery, there is a lack of evidence regarding the reliability and usefulness of LLMs. Therefore, whether ChatGPT can offer adequate expertise in foot and ankle surgery remains unclear. This study aimed to assess the accuracy of ChatGPT's responses regarding ankle cartilage repair by comparing them with the consensus statements from foot and ankle experts as a standard reference.
Methods
This study was exempt from institutional review board approval because broadly available free chatbots were used and human or animal subjects were not included.
Questions and Consensus Statements
Of the consensus statements developed at the 2017 International Consensus Meeting on Cartilage Repair of the Ankle, the present study collected the questions and consensus statements on “Debridement, Curettage and Bone Marrow Stimulation.”10 These questions and statements on debridement, curettage, and bone marrow stimulation (BMS) for osteochondral lesions (OCLs) of the ankle were selected as the standard reference because BMS is one of the most commonly performed surgical procedures for ankle OCLs.26 Furthermore, BMS is among the most widely available procedures in low-resource health care systems.26 In addition, long-term outcomes of BMS are available in the literature.22,25
In this study, a total of 15 questions regarding debridement, curettage, and BMS were formalized by a working group of 7 international foot and ankle experts.10 Agreement among the experts was pursued using the modified Delphi format reported by Linstone and Turoff.15 The grades of consensus were defined as follows: consensus, 51% to 74% agreement; strong consensus, 75% to 99% agreement; and unanimous, 100% agreement. Of the 15 questions and consensus statements, 1 achieved unanimous consensus, 12 achieved strong consensus, 1 achieved consensus, and 1 question was removed. Thus, the 14 retained questions and consensus statements from this reference work (1 unanimous, 12 strong consensus, and 1 consensus)10 were used in the present study to evaluate the answers provided by ChatGPT.
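For illustration, this grading rule is a simple threshold mapping from the percentage of expert agreement to a consensus grade. A minimal sketch in Python follows; the function name and the handling of agreement below 51% are our own assumptions, not part of the published methodology.

```python
def consensus_grade(agreement_pct: float) -> str:
    """Map a percentage of expert agreement to the consensus grade
    defined at the 2017 International Consensus Meeting."""
    if agreement_pct == 100:
        return "unanimous"
    if agreement_pct >= 75:
        return "strong consensus"
    if agreement_pct >= 51:
        return "consensus"
    # Agreement below 51% is assumed to mean that no consensus was
    # reached (this case is not described in the text above).
    return "no consensus"


print(consensus_grade(82.0))  # -> "strong consensus"
```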
A new user account for the LLM chatbot ChatGPT-4 (OpenAI; https://chat.openai.com/) was created to prevent past prompts from influencing the responses of ChatGPT-4 through its memory function. The aforementioned 14 questions were fed into ChatGPT on February 6, 2025. Each answer by ChatGPT was then recorded with no clarification or follow-up. To simulate the behavior of a nonspecialized user outside of a research environment, such as a patient or a clinician in clinical practice, the questions were fed directly to the LLM without any special prompting techniques.
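The questions in this study were entered through the ChatGPT web interface. For readers who wish to reproduce a comparable single-turn query programmatically, a minimal sketch using the OpenAI Python client is shown below; this is an illustrative assumption, not the method used in this study, and both the example question and the "gpt-4" model identifier are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Hypothetical example question; not one of the 14 study questions.
question = (
    "Is bone marrow stimulation an appropriate first-line treatment "
    "for a small osteochondral lesion of the talus?"
)

# Each question is sent as a fresh, single-turn conversation with no
# system prompt, mirroring the study's no-special-prompting protocol.
response = client.chat.completions.create(
    model="gpt-4",  # assumed identifier; the study used the ChatGPT-4 web product
    messages=[{"role": "user", "content": question}],
)
print(response.choices[0].message.content)
```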
The Evaluation of Answers by ChatGPT
The answer by ChatGPT to each question was compared with the corresponding consensus statement.10 Two senior orthopaedic surgeons independently performed the comparison, and the Likert scale method and 4 scoring categories were used to evaluate the quality of the answers by ChatGPT. When discrepancies between the 2 examiners were detected, a third senior orthopaedic surgeon provided the final decision. The Likert scale was used to evaluate the similarity of the answers by ChatGPT to the consensus statements as follows: 5, completely identical; 4, almost identical; 3, neither the same nor different; 2, slightly different; and 1, completely different.27,29
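A minimal sketch of this rating-resolution rule follows; the function name and signature are our own, for illustration only.

```python
from typing import Optional


def final_rating(examiner_1: int, examiner_2: int,
                 third_examiner: Optional[int] = None) -> int:
    """Return the final rating: the shared value when the 2 examiners
    agree; otherwise the third senior surgeon's rating decides."""
    if examiner_1 == examiner_2:
        return examiner_1
    if third_examiner is None:
        raise ValueError("Discrepancy detected: a third examiner's rating is required")
    return third_examiner


print(final_rating(3, 3))     # -> 3 (examiners agree)
print(final_rating(2, 3, 3))  # -> 3 (resolved by the third examiner)
```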
Four scoring categories were used to evaluate the quality of the ChatGPT answers based on the evaluation framework used in previous studies7,19 (Table 1).
Table 1. Four Scoring Categories for the Assessment of Responses by ChatGPT.
Statistical Analyses
Statistical analyses were performed using SPSS software (version 26.0; IBM Corp, Armonk, NY). The interrater reliability of the assessments by the 2 examiners was evaluated using the κ statistic. The κ coefficient for interrater agreement was graded using the Landis classification as follows: slight, 0.0 to 0.20; moderate, 0.21 to 0.60; and substantial, 0.61 to 1.00.12,14
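Cohen's κ adjusts the observed agreement between 2 raters for the agreement expected by chance, κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the chance agreement. A minimal Python sketch of this computation is shown below; the Likert ratings are invented placeholders for illustration, not the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Likert scores (1-5) assigned by the 2 examiners to the
# 14 ChatGPT answers; these values are placeholders, not study data.
examiner_1 = [3, 2, 3, 4, 3, 2, 3, 4, 2, 3, 4, 3, 2, 3]
examiner_2 = [3, 2, 3, 4, 3, 3, 3, 4, 2, 3, 4, 3, 2, 3]

# Cohen's kappa: (p_o - p_e) / (1 - p_e).
kappa = cohen_kappa_score(examiner_1, examiner_2)
print(f"kappa = {kappa:.2f}")
```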
Results
Interrater Reliability of the Assessment of Answers by ChatGPT
The interrater reliabilities of the 2 examiners' assessments of the answers by ChatGPT were graded as substantial for both the Likert scale and the 4 scoring categories: Likert scale, κ = 0.77; Accuracy, κ = 0.81; Overconclusiveness, κ = 0.84; Supplementary, κ = 0.66; and Incompleteness, κ = 0.76.
The Evaluation of Responses by ChatGPT
The similarity of the answers by ChatGPT to those of the experts is shown in Table 2. The mean Likert scale score was 3.1 ± 0.8. Six answers (42.9%) were graded as Likert 3, and 3 (21.4%) were graded as Likert 2. The results for the 4 scoring categories are shown in Table 3. The percentages of answers rated “yes” for Accuracy, Overconclusiveness, Supplementary, and Incompleteness were 71.4% (10/14), 35.7% (5/14), 78.6% (11/14), and 14.3% (2/14), respectively. The questions and the answers by both ChatGPT and the experts are shown in Tables 4 and 5.
Table 2. Consistency of Responses by ChatGPT to Consensus Statements.
a Likert scale: 1, completely different; 2, slightly different; 3, neither the same nor different; 4, almost identical; 5, completely identical. The concordance rate of the Likert scale scores between the examiners was 85.7% (12/14).
b Disagreement between the Likert scale scores of the 2 examiners; resolved by a third senior orthopaedic surgeon.
Table 3. Results of the 4 Scoring Categories of Answers by ChatGPT.
a The concordance rates of the answers between the examiners: Accuracy, 92.9% (13/14); Overconclusiveness, 92.9% (13/14); Supplementary, 85.7% (12/14); and Incompleteness, 92.9% (13/14).
b Disagreement between the category assessments of the 2 examiners; resolved by a third senior orthopaedic surgeon.
Table 4. List of the 14 Questions Used in This Study.
Table 5. Questions and Answers by ChatGPT and Foot and Ankle Experts.
Discussion
The most important finding of the present study was that ChatGPT provided inaccurate answers regarding the surgical treatment of ankle cartilage lesions, and its answers were not sufficiently consistent with those of foot and ankle experts. In addition, ChatGPT provided overconclusive answers for approximately one-third of the included questions. However, ChatGPT was able to offer an answer consistent with the 2017 International Consensus Meeting on Cartilage Repair of the Ankle for approximately 70% of the questions on the surgical treatment of ankle cartilage lesions. This study also demonstrated strong interobserver reliability in evaluating the similarity between the ChatGPT-generated responses and the answers by the foot and ankle experts. The clinical relevance of this study is that improvement is still required before ChatGPT can be used reliably by clinicians. This divergence should be interpreted in the context of both the evolving nature of LLMs and the limitations of fixed expert references from earlier time points regarding how best to surgically treat patients with ankle cartilage lesions.
LLMs are broadly available to the general population and health care providers. Recently, the utility of LLMs has been investigated in various medical fields.16,28 For clinicians to use LLMs as support tools when making clinical decisions, the models must provide reliable, evidence-based answers to clinical questions. However, the usefulness and safety of LLMs in medicine remain controversial. Several authors have evaluated the accuracy of ChatGPT responses to frequently asked questions from patients.11,20,29 Villarreal-Espinosa et al29 reported that 80% (8/10) of the responses by ChatGPT to frequently asked questions regarding anterior cruciate ligament surgery were regarded as completely accurate by one of the 2 attending physicians. Mika et al20 also reported highly accurate answers by ChatGPT to common patient questions regarding total hip arthroplasty. These previous studies support the usefulness of ChatGPT for patients undergoing orthopaedic surgery. Other authors have evaluated the accuracy of responses by ChatGPT using clinical practice guidelines (CPGs) as a standard reference.7,31 Yang et al31 reported that ChatGPT encouraged the use of nonrecommended treatments for hip and knee osteoarthritis in 30% of the queries. Duey et al7 found that 92% of the answers by ChatGPT-4 regarding thromboembolic prophylaxis in spine surgery were correct.
In the present study, no answer by ChatGPT-4 was completely identical (Likert 5) to the corresponding answer by the foot and ankle experts, whereas 42.9% were graded as neither the same nor different (Likert 3) and 21.4% as slightly different (Likert 2). The findings of the present study therefore suggest that ChatGPT-4, at present, does not provide answers to clinical queries regarding the surgical treatment of ankle cartilage lesions that are sufficiently aligned with expert consensus. Although the utility of ChatGPT in clinical practice is promising, further studies are needed to clarify the role of LLMs in clinical practice. In particular, the origin of the data provided by ChatGPT has not been clarified, and its training text corpus could contain false or outdated information. In the future, if ChatGPT could correctly guide clinicians to literature with a high level of evidence to find the best answer to clinical questions, it would be greatly beneficial for clinicians.
There were several limitations to the present study. First, the consensus statements and answers by experts in this study were developed at the 2017 International Consensus Meeting on Cartilage Repair of the Ankle10; thus, these statements may potentially be outdated. This limitation has also been noted in previous studies with similar designs.7,31 Second, the ChatGPT model and the prompts used for querying could have affected the study findings.8 It could be argued that newer models, such as “reasoning” models, could have provided more accurate answers. Additionally, the model was fed the questions without any advanced prompting techniques; it is possible that “forcing” the model to adopt a persona, such as “an orthopaedic surgeon advising a colleague,” would have yielded different answers. Third, this study used only 14 questions on BMS for ankle cartilage lesions, limiting the ability to broadly investigate the accuracy of answers by ChatGPT. Fourth, the present study evaluated the answers by ChatGPT-4 only for questions concerning “Debridement, Curettage and Bone Marrow Stimulation”10; therefore, this study could not assess the accuracy of answers by ChatGPT for other surgical procedures for ankle cartilage pathologies, such as autologous osteochondral or allograft transplantation, fixation of ankle cartilage or osteochondral lesions, and autologous chondrocyte implantation. Fifth, the evaluation of ChatGPT’s responses was performed by only 2 examiners. Finally, the current study did not evaluate the accuracy of answers from previous versions of ChatGPT or from LLMs other than ChatGPT. Despite these limitations, this study has clarified the present reliability of ChatGPT regarding medical knowledge on the study topic and may contribute to enhancing interactions between contemporary AI systems and health care providers. Although ChatGPT’s answers did not consistently match the consensus statements from experts, the findings of this study suggest potential utility of ChatGPT in patient education or even for nonexpert clinicians. However, cautious implementation and further evaluation are recommended, considering the limitations of the present study and the potential LLM capabilities that were not explored in this study. Additionally, ChatGPT responses may change over time because of ongoing model updates, and thus the findings of this study may not generalize to future versions.
Conclusion
This study showed that ChatGPT-4 often generated responses that differed from the expert consensus statements developed at the International Consensus Meeting on Cartilage Repair of the Ankle.
Supplemental Material
Supplemental material, sj-pdf-1-fao-10.1177_24730114251352494 for ChatGPT-4 Responses on Ankle Cartilage Surgery Often Diverge from Expert Consensus: A Comparative Analysis by Takuji Yokoe, Giulia Roversi, Nuno Sevivas, Naosuke Kamei, Pedro Diniz and Hélder Pereira in Foot & Ankle Orthopaedics
Footnotes
Author’s Note
Pedro Diniz currently has the following affiliations: Department of Orthopaedic Surgery, Centre Hospitalier Universitaire Brugmann, Brussels, Belgium; Department of Orthopaedic Surgery, Hôpital Universitaire de Bruxelles - Hôpital Erasme, Brussels, Belgium; and Department of Bioengineering and iBB - Institute for Bioengineering and Biosciences, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal.
Ethical Approval
No ethical approval was needed for this study. No patients were included in this study.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Disclosure forms for all authors are available online.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
