Abstract
Background:
There has been a rapid increase in research applying artificial intelligence (AI) to various subspecialties of orthopaedic surgery, including foot and ankle surgery. The purpose of this systematic review is to (1) characterize the topics and objectives of studies using AI in foot and ankle surgery, (2) evaluate the performance of their models, and (3) evaluate their validity (internal or external validation).
Methods:
A systematic literature review was conducted using PubMed/MEDLINE and Embase databases in December 2022. All studies that used AI or its subsets machine learning (ML) and deep learning (DL) in the setting of foot and ankle surgery relevant to orthopaedic surgeons were included. Studies were evaluated for their demographics, subject area, outcomes of interest, model(s) tested, model(s)’ performance, and validity (internal or external).
Results:
A total of 31 studies met inclusion criteria: 14 studies investigated AI for image interpretation, 13 studies investigated AI for clinical predictions, and 4 studies were grouped as “other.” Studies commonly explored AI for ankle fractures, calcaneus fractures, hallux valgus, Achilles tendon pathologies, plantar fasciitis, and sports injuries. For studies reporting the area under the receiver operating characteristic curve (AUC), AUCs ranged from 0.64 (poor) to 0.99 (excellent). Two studies (6.45%) reported external validation.
Conclusion:
Applications of AI in the field of foot and ankle surgery are expanding, particularly for image interpretation and clinical predictions. Current model performances range from poor to excellent, and most studies lack external validation, demonstrating a need for further research prior to deploying AI-based clinical applications.
Level of Evidence:
Level III, retrospective cohort study.
Introduction
Artificial intelligence (AI) and its subsets machine learning (ML) and deep learning (DL) are being increasingly explored for applications in medicine and orthopaedic surgery.3,7,8,10,12,15,19,25,29,31,49 The essentials of AI, ML, and DL for orthopaedic surgeons, clinicians, and researchers have been thoroughly described in previous literature.6,12,34,37,38,45 Briefly, AI and its subsets involve the use of technology to simulate human intelligence. Algorithms or models can be developed that learn and understand complex relationships from data sets. These models can then be applied for many different purposes, such as automating analysis of radiographic images, predicting surgical outcomes, or predicting injuries in athletes.
AI models are being developed in nearly all orthopaedic subspecialties, including hip, knee, spine, and pediatric surgery.21,27,28,35,55 Klemt et al 27 developed and validated ML models for predicting the risk of early revision surgery after primary total hip arthroplasty (THA). Jo et al 21 developed and validated an ML model for predicting the risk of transfusion following primary total knee arthroplasty (TKA). Merali et al 35 developed and validated a DL model for detecting cervical spinal cord compression in magnetic resonance imaging (MRI) scans. Kunze et al 28 trained and tested several ML models for predicting patients that would achieve the minimal clinically important difference (MCID) in Hip Outcome Score-Sports Subscale (HOS-SS) following hip arthroscopy for femoroacetabular impingement syndrome. Xu et al 55 developed a DL-assisted system for automated measurements and classifications pertinent to developmental dysplasia of the hip directly from plain pelvic radiographs.
Potential applications for AI in foot and ankle surgery are vast and are at least partly similar to other orthopaedic subspecialties. Given the impact that AI and its subsets may have on clinical and operative practice, it is important for surgeons to understand the current advancements that have been made thus far in applying AI in foot and ankle surgery. Therefore, the purpose of this systematic review is to (1) characterize the topics and objectives of studies using AI in foot and ankle surgery, (2) evaluate the performance of their models, and (3) evaluate their validity (internal or external validation). We hypothesized that most studies would investigate AI for imaging analysis, have models that are not performing excellently, and have models that are not externally validated.
Methods
Search Strategy
We performed a systematic literature review in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Two reviewers independently completed structured searches using the PubMed/MEDLINE and Embase databases on December 11, 2022, capturing all articles available in the databases before that date. The search query used the following terms: (artificial intelligence OR machine learning OR deep learning) AND (foot OR ankle OR hallux valgus OR tibial tendon insufficiency OR hallux rigidus OR Lisfranc OR Achilles OR peroneal OR metatarsal OR plantar fasciitis OR midfoot OR talus OR cuboid OR ankle arthroscopy OR ankle arthroplasty). Two experienced orthopaedic researchers independently screened all titles, abstracts, and full-text articles. The reference lists of the final articles were also reviewed and cross-referenced to identify any additional pertinent studies not found through the keyword search. The search strategy used in this study is displayed in Figure 1.

PRISMA diagram.
Eligibility Criteria
Standardized inclusion and exclusion criteria were used to determine study eligibility. Any disagreements or discrepancies were resolved by consensus. Inclusion criteria were as follows: (1) involve foot and ankle surgery; (2) involve AI; (3) clinically or operatively relevant to orthopaedic surgeons; (4) published in English; (5) available between January 1, 2005, and December 11, 2022; (6) original studies with level I to IV evidence; (7) published studies providing extractable outcome data. Exclusion criteria were as follows: (1) not involving foot and ankle surgery; (2) not involving AI; (3) not clinically or operatively relevant to orthopaedic surgeons; (4) not published in English; (5) no original, extractable clinical data (ie, review articles, commentaries, letters to the editor); (6) full text not available; and (7) systematic reviews, meta-analyses, abstracts, or conference proceedings.
Data Items
The primary outcomes of interest were (1) subject area in which AI was being applied, (2) best model performance metrics, and (3) whether the model(s) were internally or externally validated. Other variables for which data were sought included outcomes of interest, number of participants, median or average age of patients, percentage of males in the study, and the models evaluated.
Studies were grouped into 3 categories based on their subject area: clinical predictions, image interpretation, or other. Image interpretation studies were any that used AI for detection, classification, or diagnosis using plain radiographs, magnetic resonance imaging (MRI), computed tomography (CT), or ultrasonographic images.
The best performance metrics were only recorded for studies applying AI for clinical predictions or image interpretation. The primary metrics used for evaluating the performance of models with categorical dependent variables were the area under the receiver operating characteristic (ROC) curve (AUC), accuracy, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). The ROC curve plots a test's sensitivity on the y axis against 1 − specificity on the x axis. AUC values range from 0 to 1.0; a value of 1.0 indicates perfect discriminative ability. AUC values were interpreted as follows: >0.90 was considered excellent performance, 0.80 to 0.89 good, 0.70 to 0.79 fair, and 0.51 to 0.69 poor. 32 The primary metrics used for evaluating the performance of models with continuous dependent variables were root mean squared error (RMSE) and the coefficient of determination (R2). If none of the aforementioned metrics were available, any other pertinent metrics reported by the study were recorded.
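For readers less familiar with these metrics, the AUC and the interpretation bands above can be made concrete with a short illustrative sketch. This is not code from any of the reviewed studies; the labels and model scores below are made-up example values.

```python
def roc_auc(y_true, y_score):
    """AUC computed as the probability that a randomly chosen positive
    case receives a higher score than a randomly chosen negative case
    (ties count as half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def interpret_auc(auc):
    """Interpretation bands used in this review."""
    if auc > 0.90:
        return "excellent"
    if auc >= 0.80:
        return "good"
    if auc >= 0.70:
        return "fair"
    return "poor"

# Hypothetical labels (1 = fracture present) and model scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9]
auc = roc_auc(y_true, y_score)
print(auc, interpret_auc(auc))
```

The pairwise formulation used here is equivalent to integrating the ROC curve and matches the standard rank-based (Mann-Whitney) definition of the AUC.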
Validation method was recorded for studies applying AI for clinical predictions or image interpretation. Internal validation was defined as testing a model on a population similar to the one on which it was trained. External validation was defined as evaluating the performance of an algorithm applied to an external cohort, such as one from a different institution or national database. Studies in which data from a single population were split into training, validation, and independent test sets were not considered to have externally validated their models. Determining whether a model has been externally validated is useful for assessing its generalizability.
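The distinction between internal and external validation can be illustrated with a small synthetic sketch: a toy one-feature threshold classifier is fit on one institution's cohort, checked on a held-out split of the same cohort (internal validation), and then on a second cohort whose feature distribution is shifted (external validation). The cohorts, institutions, and threshold "model" here are all hypothetical.

```python
import random

random.seed(0)

# Synthetic cohorts of (feature, label) pairs. Institution B's negative
# cases are shifted upward to mimic a different patient population.
inst_a = [(random.gauss(0.40, 0.10), 0) for _ in range(100)] + \
         [(random.gauss(0.70, 0.10), 1) for _ in range(100)]
inst_b = [(random.gauss(0.52, 0.10), 0) for _ in range(100)] + \
         [(random.gauss(0.72, 0.10), 1) for _ in range(100)]

random.shuffle(inst_a)
split = int(0.7 * len(inst_a))
train, internal_test = inst_a[:split], inst_a[split:]  # same population
external_test = inst_b                                  # different "institution"

def fit_threshold(data):
    """Stand-in for model training: pick the cutoff that maximizes
    accuracy on the training data."""
    return max((x for x, _ in data),
               key=lambda t: sum((x >= t) == y for x, y in data))

def accuracy(t, data):
    return sum((x >= t) == y for x, y in data) / len(data)

t = fit_threshold(train)
print(f"internal accuracy: {accuracy(t, internal_test):.2f}")
print(f"external accuracy: {accuracy(t, external_test):.2f}")
```

Because institution B's negatives sit closer to the learned cutoff, the externally tested accuracy will typically fall below the internal one, illustrating why internal metrics alone can be overly optimistic about generalizability.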
Data Analysis
No pooled analysis of AUC, accuracy, or other performance metrics could be performed because of significant methodological heterogeneity, including in the models tested, the types of outcomes, and patient characteristics, which increases the risk of bias and inaccurate conclusions.
Results
A total of 31 studies met criteria for inclusion in the final analysis. No additional articles were identified after cross-referencing and reviewing the reference lists. Fourteen studies investigated AI for image interpretation, 13 studies investigated AI for clinical predictions, and 4 studies were grouped as “other.”
Image Interpretation
Of the 14 image interpretation studies, topics included general foot and ankle fractures, Lisfranc malalignment, hallux valgus parameters, calcaneus fractures, and the Achilles tendon (Table 1). Two of the 14 studies externally validated their models (14.3%). DL models were used in all of the studies except for 1 (92.9%) (Table 2). Of the 14 studies, 8 studies reported AUCs, for which the best values ranged from 0.85 (good) to 0.99 (excellent). Eight studies reported accuracies, for which the best values ranged from 72% to 99%.
Artificial Intelligence for Image Interpretation in Foot and Ankle Surgery.
Summary of Artificial Intelligence Models for Image Interpretation in Foot & Ankle Surgery.
Abbreviations: AUROC, area under the receiver operating characteristic curve; CT, computed tomography; DCNN, deep convolutional neural network; DL, deep learning; HVA, hallux valgus angle; ICC, intraclass correlation coefficient; ML, machine learning; NPV, negative predictive value; PPV, positive predictive value.
Highest accuracy when ensuring that patient samples from the training set are not in the test set.
Ashkani-Esfahani et al 5 internally validated 2 deep convolutional neural networks (DCNNs) for identifying ankle fractures from radiographs and achieved a near-perfect AUC of 0.99. Kitamura et al 26 internally validated 5 separate CNNs for detecting ankle fractures from plain radiographs and achieved a fair fracture detection accuracy of 81%. Prijs et al 44 internally and externally validated a DL model for detecting, classifying, and localizing ankle fractures from plain radiographs and achieved an excellent AUC of 0.92 and accuracy of 99% (classifying “no fracture”) on external validation. Guermazi et al 14 internally validated a DL model for detecting fractures from foot and ankle plain radiographs, which performed excellently with an AUC of 0.97, sensitivity per patient of 93%, and specificity per patient of 93%. Olczak et al 39 internally validated neural network models for classifying ankle fractures from radiographs according to the AO Foundation/Orthopaedic Trauma Association (AO/OTA) 2018 classification, whose performance ranged from fair to excellent, with AUCs of 0.79 to 0.99 across AO types. Pinto Dos Santos et al 42 internally validated a CNN for detecting fractures in anteroposterior ankle radiographs, which performed well, with a good AUC of 0.85.
Li et al 30 internally validated a DL model for automated detection of 18 anatomical landmarks and measurement of the first-second intermetatarsal angle (IMA), hallux interphalangeal angle (HIA), hallux valgus angle (HVA), and distal metatarsal articular angle (DMAA) from weightbearing, dorsoplantar radiographs. The observed (manual by radiologist) and predicted (model) values of the 4 angles correlated well (ICC 0.89-0.96, r 0.81-0.97). 30
Wang et al 53 internally validated several radiomics-based ML models for diagnosing Achilles tendinopathy from ultrasonographic images in skiers and achieved an excellent AUC of 0.99, a sensitivity of 90%, and a specificity of 100%. Kapiński et al 22 internally validated several DL models for classifying Achilles tendons as injured or healthy from MRI images and achieved a maximum accuracy of 97.6%, sensitivity of 98.3%, and specificity of 99.45%.
Wang et al 54 internally and externally tested a DL system for detecting and grading fatigue fractures (a type of stress fracture) from plain radiographs, which performed excellently (AUC 0.911, sensitivity 90.8%) in detecting fatigue fractures in foot images and well (AUC 0.877, sensitivity 85.5%) in tibiofibula images. External validity for grading of fatigue fractures was not demonstrated, as the DL system performed poorly, with an overall accuracy of 62.9% for the tibiofibula images and 61.1% for the foot images.
Ashkani-Esfahani et al 4 internally validated 2 DCNN models for detecting Lisfranc instability from single-view (anteroposterior) and 3-view radiographs (anteroposterior, lateral, oblique), which performed excellently with AUCs ranging from 0.925 to 0.994.
Day et al 9 aimed to assess the performance of an AI-based software that automatically measures the M1-M2 IMA from weightbearing cone beam computed tomography (WBCT) scans in hallux valgus patients. The AI-based software was faster than manual measurements, correlated well with manual measurements, and had higher and nearly perfect test-retest reliability (0.99 intrasoftware intraclass correlation coefficient for both 3D and 2D IMA). 9
Aghnia Farda et al 1 internally validated a CNN model for classifying calcaneal fractures on CT images into the Sanders system, achieving a classification accuracy of nearly 72% after augmenting the data. Pranata et al 43 internally validated 2 separate DCNN models for detecting the presence or absence of calcaneal fractures on CT images and achieved an excellent accuracy of 98%.
Clinical Predictions
Of the 13 clinical prediction studies, topics were wide ranging and included predicting outcomes following surgery for ankle fractures, predicting lower extremity sports injuries, predicting recovery of peroneal nerve palsy, and more (Table 3). None of the 13 studies externally validated their models (0%). The number of ML and DL models tested ranged from 1 model to 11 models (Table 4). Of the 13 studies, 9 studies reported AUCs, for which the best values ranged from 0.64 (poor) to 0.97 (excellent). Six studies reported accuracies, for which the best values ranged from 70.4% to 93.18%.
Artificial Intelligence for Clinical Predictions in Foot and Ankle Surgery.
Summary of Artificial Intelligence Models for Clinical Predictions in Foot & Ankle Surgery.
Abbreviations: AUROC, area under the receiver operating characteristic curve; BDT, boosting decision tree; BPM, Bayes point machine; GB, gradient boosting; LOS, length of stay; LR, logistic regression; MLKI, multiligamentous knee injury; RF, random forest; SMO, sequential minimal optimization; VAS, visual analog scale.
Deemed best model in the study.
Diniz et al 11 internally validated one ML model for predicting whether soccer players would return to a similar level of match participation following an Achilles tendon rupture, which achieved a good AUC of 0.81 and Brier score loss of 0.12.
Lu et al 33 internally validated many ML models for predicting the occurrence of a lower extremity muscle strain (calf, groin, quadriceps, hamstring) in professional basketball players, among which the XGBoost model achieved the highest AUC of 0.840 and was deemed the best-performing model when also considering Brier score and calibration. Jauhiainen et al 20 internally validated 2 ML models for predicting moderate and severe knee and ankle injuries in young basketball and floorball players (age ≤ 21 years), which performed poorly with an AUC of 0.63 for the random forest model and 0.65 for the logistic regression model. Ruiz-Pérez et al 46 internally validated many ML models to predict lower extremity noncontact soft tissue injury in elite futsal players, which generally performed fairly, with the best model achieving an AUC of 0.767, sensitivity of 85.1%, and specificity of 62.1%.
Vasavada et al 51 internally validated one random forest model for predicting complete recovery of a peroneal nerve palsy following a multiligamentous knee injury, which performed poorly with an AUC of 0.64, accuracy of 75%, and F1 score of 0.86.
Wang et al 52 internally validated a support vector machine model for classifying hallux valgus patients as having painful feet or pain-free feet using radiographic metrics such as hallux valgus angle (HVA), intermetatarsal angle (IMA), and distal metatarsal articular angle (DMAA), which performed fairly, with an accuracy of 76.4%.
Hendrickx et al 17 internally validated 4 ML and DL models for predicting which patients with tibial shaft fractures have an occult posterior malleolar fracture. The models performed well, with good AUCs ranging from 0.81 to 0.89.
Oosterhoff et al 40 internally validated 5 models for predicting posterior malleolar involvement in distal tibial shaft fractures using the same data set as the previously described study by Hendrickx et al. 16 Oosterhoff et al 40 found that all models performed well, with AUCs >0.80 (highest 0.89), and 4 of the 5 had a Brier score of 0.11.
Suda et al 50 internally validated several support vector machine models for classifying running experience level based on foot-ankle kinematic and kinetic patterns to potentially assist with running rehabilitation and training. The models performed well with classification accuracies of 88.5% for less experienced runners, 87.2% for moderately experienced runners, and 84.6% for experienced runners. 50
Merrill et al 36 internally validated logistic regression and gradient boosting models for predicting short-term complications, including readmissions and mortality, following open reduction and internal fixation of ankle fractures. Both models performed similarly, with AUCs ranging from 0.6979 to 0.7580 for gradient boosting and from 0.7101 to 0.7583 for logistic regression. 36
Yin et al 56 internally validated a neural network model for predicting patients that would achieve the minimum clinically successful therapy (decrease in visual analog score [VAS] by 60% or more from baseline) at 6 months after extracorporeal shock wave therapy for chronic plantar fasciitis. The model performed well, with an overall accuracy of 92.5%, sensitivity of 95.0%, and specificity of 90.0%. 56
Sharif Bidabadi et al 47 internally validated many models for classifying gait patterns as normal or due to L5 radiculopathy using data from sensors called inertial measurement units (IMUs). Their best model performed excellently as evidenced by an AUC of 0.97 and accuracy of 93.18%. 47
Keijsers et al 24 internally validated a neural network model for differentiating patients who have forefoot pain from those who do not using plantar pressure data, which performed satisfactorily with an accuracy of 70.4%.
Other
Ardhianto et al 2 applied DL to help with automated measurement of the foot progression angle (FPA) from plantar pressure images to help clinicians assess gait abnormalities. Pakhomov et al 41 applied ML to automate identification and classification of foot examination findings from clinical notes as normal, abnormal, or not assessed, and their models performed well with overall accuracies ranging from 81% to 87%. Hernigou et al 18 applied AI and ML to assist in conducting their study for developing a method of defining the ideal and patient-specific motion axes of the tibiotalar joint, with the goal of improving how total ankle arthroplasty is performed with robotics. Zhu et al 57 aimed to assess whether ultrasonography-guided needle knife therapy with AI assistance can improve patient outcomes for plantar fasciitis better than the same therapy without AI. The AI technology used in this study assisted with processing of the ultrasonographic images. Those receiving the intervention with AI had significantly lower plantar fascia thickness, lower plantar fascia elasticity scores, and higher American Orthopaedic Foot & Ankle Society (AOFAS) ankle-hindfoot scores at 2, 4, and 8 weeks posttreatment compared to those without AI assistance. 57
Discussion
There is early optimism about the transformative impact that AI may have on the health care system and on how we practice medicine. As such, it is necessary for orthopaedic surgeons to be aware of advancements in AI in their respective areas. This systematic review is the first of its kind in orthopaedic foot and ankle surgery to explore the subject areas in which AI is being applied, the performance of AI models, and the validity of those models. This study found that most studies are using AI for image interpretation, especially for ankle fractures, calcaneus fractures, and hallux valgus. The performance of current AI models is wide ranging, from poor to excellent, but there is significant heterogeneity in study methodologies that prevents any pooled analysis. Additionally, very few studies have externally validated their models.
This systematic review found that most current studies involve AI applications for imaging analysis, particularly fracture identification and classification. This is a common trend seen in other orthopaedic subspecialties as well. For example, in total joint arthroplasty (TJA), Karnuta et al 23 externally validated a DL system for classifying hip arthroplasty femoral implants from radiographs that performed excellently, with a near-perfect AUC of 0.999. Many investigators are likely driven to explore AI’s utility in image analysis because of their optimism that AI will outperform or augment humans in speed and accuracy, translating to potential time savings, cost savings, and better patient outcomes. 13
This systematic review found that models for image interpretation are mostly performing excellently, with 75% (9 of 12) of those reporting accuracy or AUC achieving a value ≥0.90. In contrast, almost no clinical prediction model achieved excellent performance (8.33%, 1 of 12 studies with an AUC or accuracy ≥0.90). There is a need for more research on improving the performance of clinical prediction models. Many factors, including the quality and size of the data sets, the types of models, and how models are optimized, influence model performance and need to be further investigated for foot and ankle surgery. Clinicians can play a vital role in ensuring high-quality data are available to train and test models by helping with accurate data collection, data annotation, and data auditing.
Internal validation may lead to false optimism as it does not allow assessment of how generalizable a model is to other populations, such as those of a different region, age, or insurance status. It has been shown that predictive models often perform significantly worse during external validation. 48 Thus, external validation is necessary prior to clinical translation of any ML or DL model. None of the clinical prediction studies in this systematic review performed external validation of their models and only 2 of the imaging interpretation studies did. Therefore, it is important that clinicians are aware that most models developed in current foot and ankle surgery studies, although promising, are not yet ready for clinical translation.
Conclusion
AI applications are being increasingly explored in foot and ankle surgery, but most models lack external validation. Most models are being used for image interpretation and are performing excellently in doing so, but model performance is not robust for clinical predictions. More subject areas need to be explored in foot and ankle surgery, and models with better performance and external validation are needed.
Footnotes
Author Contributions
All authors were involved in conceiving the study, researching background literature, data analysis, and manuscript writing. All authors reviewed and edited the manuscript and approved the final version of the manuscript.
Ethical approval
Ethical approval for this study was waived by the Institutional Review Board.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. ICMJE forms for all authors are available online.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
