Abstract
Background:
Accurate Current Procedural Terminology (CPT) coding is crucial for appropriate billing and reimbursement in foot and ankle surgery but is often time-consuming and prone to error. Large language models (LLMs) offer an approach to automate coding and reduce administrative burden, yet evidence on their performance in orthopaedic subspecialties remains limited. This study sought to evaluate the accuracy of 5 publicly available LLMs (ChatGPT-5 Mini, Google Gemini 2.5 Flash, Claude 4.0 Sonnet, Deepseek V3, and Perplexity) in correctly generating CPT codes for single-code (simple) foot and ankle procedures.
Methods:
Twenty-one common procedures identified by a single CPT code were selected. Each LLM was queried 4 times with standardized prompts requesting CPT codes. Accuracy was assessed based on the correct identification of CPT codes, with statistical analyses comparing performance across models and trials.
Results:
Perplexity achieved the highest accuracy (92.9%, 95% CI = 87.4%-98.4%), whereas Deepseek V3 performed worst (48.2%, 95% CI = 37.5%-58.9%). Global χ2 testing showed significant differences in coding accuracy among models (P < .001). Pairwise comparisons revealed Perplexity outperformed ChatGPT-5 Mini, Claude 4.0 Sonnet, and Deepseek V3; Google Gemini 2.5 Flash outperformed Deepseek V3 and ChatGPT-5 Mini; and Claude 4.0 Sonnet outperformed Deepseek V3.
Conclusion:
LLM performance in CPT coding for simple foot and ankle procedures was highly variable, with some models demonstrating acceptable accuracy whereas others performed poorly. These findings highlight that current LLMs are not sufficiently reliable for independent clinical use. Select models may serve as first-pass aids when combined with careful human verification.
Introduction
Accurate documentation is essential in orthopaedic surgery to identify correct descriptive terms and Current Procedural Terminology (CPT) codes, ensuring proper billing and reimbursement.1,2 Accurate CPT coding requires adherence to standardized guidelines; however, complexity arises because many procedures may require multiple or add-on CPT codes depending on the specific components of the case. 1
The adoption of electronic health records (EHRs) has substantially increased orthopaedic surgeons’ administrative workload, consuming more than 58% of their clinical time for order entry and documentation. 3 This reduces patient interaction, raises error risk, and contributes to administrative burden, with about 70% of physicians attributing it to EHR use. 3 To address these challenges, artificial intelligence, particularly large language models (LLMs), offers potential solutions, such as automating CPT code generation to reduce this burden.
LLMs are deep learning algorithms that generate and analyze content. 4 In health care, they have been applied to diagnostics, patient communication, scribing, administrative tasks, and procedural coding.5,6 Within orthopaedics, AI applications extend beyond coding to imaging annotation, implant evaluation, and predictive modeling for personalized surgical planning.7-10 The Mayo Clinic orthopaedic AI laboratory currently develops tools for automated image annotation, risk stratification, and text processing to support diagnostics, surgical planning, and clinical data extraction. 11 Collectively, these studies and real-world applications demonstrate the broader potential of AI to augment orthopaedic practice; however, its application to administrative tasks such as CPT coding remains underexplored. Commercial platforms like GaleAI have also emerged to automate coding and improve revenue capture, reflecting growing industry interest in AI-assisted billing solutions. 12
Although ChatGPT has demonstrated educational utility, achieving a 47% score on the Orthopaedic In-Training Examination (OITE), research evaluating its clinical or administrative accuracy in orthopaedics is limited. 13 LLMs have been successfully applied to CPT code generation in specialties such as neurosurgery and aesthetic plastic surgery.14,15 A recent evaluation of LLMs in hand surgery coding found that Perplexity and Bard achieved the highest accuracy for simple procedures, with Perplexity correctly identifying all 15 CPT codes and Bard correctly identifying 14. 16 These results suggest that LLMs may effectively automate routine coding tasks for straightforward cases; however, their performance declines in more complicated scenarios. In hand procedures with more than 1 required CPT code for billing, Perplexity and Bard accurately coded only 3 of 5 procedures, whereas Bing AI and ChatGPT failed to generate any accurate CPT codes.
To date, no research has examined the accuracy of LLMs in identifying CPT codes specific to foot and ankle surgery, and it remains unclear how their performance may change as models evolve. This study therefore aimed to assess publicly available LLMs in generating CPT codes for common foot and ankle procedures and to compare results across 2 time points spaced 1 month apart. We hypothesized that LLMs would perform well on simple procedures with a single code, where ambiguity is minimal and code retrieval is straightforward, and that their accuracy would improve on retesting after 1 month.
Methods
In this retrospective study, the senior author reviewed the most commonly performed foot and ankle procedures at our institution and selected 21 procedures for analysis. CPT codes were identified through review of published orthopaedic literature on PubMed and subsequently verified using Codify by the American Academy of Professional Coders. All codes are provided in Supplemental Table 1. 17 The names of these procedures were then used to generate standardized query prompts in the following format: “What is the most appropriate CPT code(s) for [procedure]?”
Five LLMs were selected for comparison: ChatGPT-5 Mini, Google Gemini 2.5 Flash, Claude 4.0 Sonnet, Deepseek V3, and Perplexity. Each LLM was queried with the complete set of prompts in 4 trials conducted 24 hours apart. Following the final trial, researchers implemented a 1-month interval before retesting the individual CPT codes, employing the same evaluation schema to ensure consistency and reliability of results.
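The query design described above can be sketched in a few lines of Python. This is an illustration only, not the authors' tooling: the function names `build_prompt` and `build_query_plan` are hypothetical, only 2 of the 21 procedures are listed, and the 1-month retest is not modeled. It mainly shows where the denominator of 84 responses per model (21 procedures × 4 trials) comes from.

```python
from itertools import product

# Prompt template quoted from the Methods section.
PROMPT_TEMPLATE = "What is the most appropriate CPT code(s) for {procedure}?"

MODELS = [
    "ChatGPT-5 Mini",
    "Google Gemini 2.5 Flash",
    "Claude 4.0 Sonnet",
    "Deepseek V3",
    "Perplexity",
]

# Two of the 21 procedures from the Appendix, listed for illustration only.
EXAMPLE_PROCEDURES = [
    "ORIF of calcaneal fracture",
    "Primary Achilles tendon repair without graft",
]

N_PROCEDURES = 21  # full procedure set in the study
N_TRIALS = 4       # trials conducted 24 hours apart


def build_prompt(procedure: str) -> str:
    """Fill the standardized template with one procedure name."""
    return PROMPT_TEMPLATE.format(procedure=procedure)


def build_query_plan(models, procedures, n_trials):
    """Return one (model, trial, prompt) tuple per planned query."""
    trials = range(1, n_trials + 1)
    return [
        (model, trial, build_prompt(procedure))
        for model, trial, procedure in product(models, trials, procedures)
    ]


# Each model answers every procedure once per trial.
responses_per_model = N_PROCEDURES * N_TRIALS
total_responses = responses_per_model * len(MODELS)
```

Under these assumptions, `responses_per_model` is 84 and `total_responses` is 420, which matches the per-model totals reported in the Results.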
Responses were evaluated using a graded correctness framework. A response was deemed fully correct only if the language model accurately identified and explicitly specified the single most appropriate CPT code for the procedure. Responses were classified as partially correct if multiple CPT codes were provided and 1 of them was correct. Responses that did not include the correct CPT code were classified as incorrect.
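The grading rules above can be expressed as a short Python sketch. The source does not say how codes were read from the responses, so the automated extraction here is an assumption: the helper `extract_cpt_codes` is hypothetical and treats any standalone 5-digit number as a candidate CPT code.

```python
import re


def extract_cpt_codes(response_text: str) -> list:
    """Return distinct 5-digit numbers in the response, in order of appearance.

    Assumption: every standalone 5-digit number is a candidate CPT code.
    """
    codes = []
    for code in re.findall(r"\b\d{5}\b", response_text):
        if code not in codes:
            codes.append(code)
    return codes


def grade_response(response_text: str, correct_code: str) -> str:
    """Apply the three-level framework to one model response."""
    codes = extract_cpt_codes(response_text)
    if codes == [correct_code]:
        # Only the single most appropriate code was given.
        return "correct"
    if correct_code in codes:
        # Several codes were offered and one of them is right.
        return "partially correct"
    return "incorrect"
```

For example, `grade_response("Consider 27650 or 27652.", "27652")` yields `"partially correct"`, whereas a response naming only 27652 would be graded `"correct"`.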
Data Analysis
All statistical analyses were conducted using Python (version 3.7, Python Software Foundation, https://www.python.org/). Fisher’s exact tests were used to evaluate the overall association between the AI model used and the accuracy of responses and to compare intertrial accuracy for each model. For pairwise comparisons, accuracy data were collapsed into 2 categories (correct vs incorrect) and analyzed using Fisher’s exact tests. This test was chosen because of its suitability for contingency tables with small expected frequencies. Percent accuracy was calculated for each LLM, with 95% CIs calculated using the Wald formula. Statistical significance was defined as P < .05 for all analyses.
Results
The numbers of correct, partially correct, and incorrect responses for simple procedures by AI model are listed in Table 1. Perplexity displayed the best performance, with 78 fully correct responses, 0 partially correct responses, and 6 incorrect responses (92.9% accuracy, 95% CI = 87.4%-98.4%). Deepseek V3 demonstrated the worst performance, with 40 fully correct responses, 1 partially correct response, and 43 incorrect responses (48.2% accuracy, 95% CI = 37.5%-58.9%). Google Gemini 2.5 Flash displayed 82.1% accuracy (95% CI = 73.9%-90.3%), Claude 4.0 Sonnet demonstrated 72.6% accuracy (95% CI = 63.1%-82.1%), and ChatGPT-5 Mini demonstrated 59.5% accuracy (95% CI = 49.0%-70.0%).
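As a worked illustration of the Wald interval, the following Python sketch recomputes the Perplexity estimate from the reported counts (78 fully correct of 84 responses). It assumes the standard formula p ± z·sqrt(p(1 − p)/n) with z = 1.96; the function name `wald_ci` is ours, and the original analysis script is not available.

```python
import math


def wald_ci(p_hat: float, n: int, z: float = 1.96):
    """Wald confidence interval for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


n_responses = 84              # 21 procedures x 4 trials
p_hat = 78 / n_responses      # Perplexity: 78 fully correct responses
low, high = wald_ci(p_hat, n_responses)
print(f"accuracy = {p_hat:.1%}, 95% CI = {low:.1%} to {high:.1%}")
```

With these inputs the sketch gives an interval that agrees with the reported 87.4%-98.4% to within rounding. The Wald formula is known to behave poorly for proportions near 0 or 1, which is worth keeping in mind for the highest-accuracy model.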
Table 1. Number of Correct, Partially Correct, and Incorrect Responses for Simple Procedures by AI Model. Fisher exact tests assessed differences in accuracy between trials. Boldface indicates significance (P < .05).
There were significant differences in coding accuracy across trials when using Google Gemini 2.5 Flash (P = .010), Deepseek V3 (P = .003), and Perplexity (P = .048). A global comparison of all 5 models using χ2 testing demonstrated a statistically significant difference in overall accuracy for simple procedures (P < .001). Results for pairwise comparisons of simple procedure coding accuracy between models are shown in Table 2. Perplexity performed significantly better than ChatGPT-5 Mini (P < .001), Claude 4.0 Sonnet (P = .001), and Deepseek V3 (P < .001).
Table 2. P Values for Pairwise Comparisons of Coding Accuracy for Simple Procedures Between LLMs. Partially correct answers were reclassified as incorrect to facilitate analysis. Boldface indicates significance (P < .05).
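The pairwise procedure can be illustrated with a small pure-Python implementation of the two-sided Fisher exact test for a 2 × 2 table. This is a sketch rather than the authors' code: the function `fisher_exact_two_sided` is hypothetical, it requires Python 3.8 or later for `math.comb` (the study reports version 3.7), and it uses the common convention of summing all tables no more probable than the observed one. The example compares Perplexity with Deepseek V3 using the counts reported above, with the single partially correct Deepseek V3 response counted as incorrect.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k "correct" responses in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: model; columns: (correct, not correct) after collapsing categories.
perplexity = (78, 6)    # 78 fully correct, 0 partial + 6 incorrect
deepseek = (40, 44)     # 40 fully correct, 1 partial + 43 incorrect
p_value = fisher_exact_two_sided(*perplexity, *deepseek)
print(f"Perplexity vs Deepseek V3: P = {p_value:.2e}")
```

Under these assumptions the P value falls well below .001, in line with the reported comparison. A library routine such as SciPy's `fisher_exact` would normally be preferred in practice.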
Discussion
Accurate use of CPT codes is fundamental to ensuring correct billing and reimbursement in orthopaedic surgery. However, the process of thorough documentation and code assignment is often time-consuming and detracts from valuable clinical time spent with patients. 18 In this context, LLMs have emerged as potentially promising tools that may help address administrative burden by offering rapid, automated support for CPT code generation.19,20 Their ability to process natural language and map clinical documentation to structured outputs positions them as potential accelerators of efficiency in health care coding workflows. This study systematically evaluated the coding performance of 5 publicly available LLMs on single-CPT–coded foot and ankle surgeries. Our findings show that simple procedures can potentially be coded with reasonable fidelity by some models; however, overall reliability remains inconsistent.
Performance on coding procedures was highly variable, with accuracy ranging from 48.2% to 92.9%, consistent with prior reports across neurosurgery, aesthetic plastic surgery, hand surgery, and spine billing when tasks are narrowly defined and unambiguous.14-16,21 However, intertrial stability varied by model. Although some models demonstrated stable performance across trials, others exhibited noticeable variability despite high overall accuracy. This observation is clinically relevant, as reproducibility, rather than peak performance alone, is essential for real-world deployment. Models that maintain consistent accuracy across repeated prompts may be better suited for routine coding support, where clerical fatigue and oversight commonly contribute to human error. 22 In these scenarios, LLMs could offer practical value by accelerating documentation and reducing administrative burden without compromising reliability. However, in the present study, the observed error rates across several models raise substantial concerns, as accuracy below acceptable clinical thresholds renders these current systems unsuitable for independent use in real-world billing scenarios.
Among the models tested, Perplexity demonstrated the highest overall accuracy (92.9%), significantly outperforming ChatGPT-5 Mini, Claude 4.0 Sonnet, and Deepseek V3. Google Gemini 2.5 Flash also performed significantly better than Deepseek V3 and ChatGPT-5 Mini (P < .001 and P = .001, respectively), and Claude 4.0 Sonnet outperformed Deepseek V3; all of these pairwise P values were ≤ .001. These findings underscore meaningful differences among LLMs, likely reflecting variation in training data, architectural design, and optimization strategies.23,24 Importantly, the observed variability in performance further emphasizes the need for careful model selection and rigorous validation before implementation, as well as potential improvement through domain‑specific training or access to official coding knowledge bases rather than relying solely on general pretrained knowledge.25-27
The clinical implications of these findings are significant. For simple, single-code procedures, LLMs could help reduce time spent on administrative tasks, enabling physicians and coders to focus more on patient-facing responsibilities. 28 Perplexity and Google Gemini 2.5 Flash may be best positioned as first-pass aids, generating preliminary coding suggestions that coders or clinicians can rapidly verify, which would enhance efficiency without displacing human oversight. This potential role aligns with broader calls for AI systems to alleviate documentation burden and reduce physician burnout in health care.
Future development should focus on domain-specific fine-tuning with curated coding data sets, as well as hybrid systems that integrate LLMs with rule-based logic engines to improve adherence to CPT coding rules. Beyond accuracy, further evaluations should examine reproducibility, transparency of reasoning, and error-detection capabilities, all of which are critical for safe integration into clinical workflows. Broader validation across additional surgical subspecialties is also needed to establish generalizability, as coding challenges vary substantially between domains. Policy considerations, including accountability, liability, and regulatory compliance, must be addressed before widespread adoption of LLM-assisted coding can occur.
Limitations
This study is the first to systematically evaluate multiple publicly available LLMs in generating CPT codes specifically for foot and ankle procedures. However, the simplified categorization of procedures and reliance on standardized text prompts may not fully replicate the complexity of clinical documentation or decision-making. Additionally, only a snapshot of current LLMs was assessed, without accounting for ongoing model updates. The pairwise comparisons between models were exploratory: 10 comparisons were conducted at α = 0.05 without multiplicity correction, so the findings are subject to inflated type I error, and small differences between groups should be interpreted cautiously. Finally, the study’s focus on foot and ankle procedures limits generalizability to other subspecialties or broader coding domains.
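To make the multiplicity concern concrete, the following Python sketch works through the arithmetic for 10 pairwise comparisons at α = 0.05. It is illustrative only. The Bonferroni correction was not applied in the study, the family-wise error figure assumes independent tests (pairwise comparisons that share a model are not independent, so it is only a rough indication), and the dictionary holds upper bounds taken from the P values reported in the Results and Discussion, not exact values.

```python
from math import comb

ALPHA = 0.05
N_MODELS = 5

# Every unordered pair of the 5 models gives one comparison.
n_comparisons = comb(N_MODELS, 2)

# Chance of at least one false positive if all tests were independent.
fwer_if_independent = 1 - (1 - ALPHA) ** n_comparisons

# Bonferroni-adjusted per-comparison threshold (not used in the study).
bonferroni_alpha = ALPHA / n_comparisons

# Upper bounds on the significant pairwise P values reported in the text
# ("P < .001" and "P = .001" are both entered as 0.001).
reported_p_upper_bounds = {
    ("Perplexity", "ChatGPT-5 Mini"): 0.001,
    ("Perplexity", "Claude 4.0 Sonnet"): 0.001,
    ("Perplexity", "Deepseek V3"): 0.001,
    ("Google Gemini 2.5 Flash", "Deepseek V3"): 0.001,
    ("Google Gemini 2.5 Flash", "ChatGPT-5 Mini"): 0.001,
    ("Claude 4.0 Sonnet", "Deepseek V3"): 0.001,
}

still_significant = {
    pair: p <= bonferroni_alpha for pair, p in reported_p_upper_bounds.items()
}

print(f"{n_comparisons} comparisons, Bonferroni threshold = {bonferroni_alpha:.3f}")
print(f"Family-wise error rate if independent = {fwer_if_independent:.1%}")
```

Under these assumptions, the chance of at least one false-positive finding is roughly 40%, and the Bonferroni threshold is .005. The six significant comparisons reported in the text, each with P ≤ .001, would fall below that threshold; the remaining four comparisons are not reported in this section and cannot be assessed here.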
Conclusion
This study systematically evaluated multiple LLMs for CPT code generation in foot and ankle procedures, focusing on simple, single-code tasks. Accuracy varied widely, from 48.2% (Deepseek V3) to 92.9% (Perplexity). Although select currently available models may serve as preliminary decision support tools, reliance on their outputs without human verification may result in significant clinical and financial consequences.
Supplemental Material
sj-pdf-1-fao-10.1177_24730114261448207 – Supplemental material for Evaluation of Large Language Models for Automated Simple CPT Coding in Foot and Ankle Surgery
Supplemental material, sj-pdf-1-fao-10.1177_24730114261448207 for Evaluation of Large Language Models for Automated Simple CPT Coding in Foot and Ankle Surgery by Eve R. Glenn, Ariana Rowshan, Eric Mao, David Ryu, Yesha Parekh, Nigel N. Hsu, John M. Thompson and Amiethab A. Aiyer in Foot & Ankle Orthopaedics
Footnotes
Appendix
Current Procedural Terminology Codes for Simple Orthopaedic Procedures Evaluated Across 5 Large Language Model Platforms.
| Procedure | CPT Code(s) |
|---|---|
| 1. Open reduction and internal fixation (ORIF) of medial malleolar ankle fracture | 27766 |
| 2. ORIF of bimalleolar ankle fracture | 27814 |
| 3. ORIF of calcaneal fracture | 28415 |
| 4. Primary Achilles tendon repair without graft | 27650 |
| 5. Primary Achilles tendon repair with graft | 27652 |
| 6. Secondary Achilles tendon repair without graft | 27654 |
| 7. Ankle arthroscopy with limited debridement | 29897 |
| 8. Ankle arthroscopy with extensive debridement | 29898 |
| 9. Distal single metatarsal (MT) osteotomy bunionectomy for hallux valgus | 28296 |
| 10. Double MT osteotomy bunionectomy for hallux valgus | 28299 |
| 11. First metatarsophalangeal (MTP) joint fusion | 28750 |
| 12. Subtalar arthrodesis (single joint) | 28725 |
| 13. Triple arthrodesis (subtalar, talonavicular, calcaneocuboid joints) | 28715 |
| 14. Single excision of Morton’s neuroma | 28080 |
| 15. Excision of ganglion cyst of the distal foot | 28090 |
| 16. Removal of deep implant (e.g., screw or plate) | 20680 |
| 17. Tibia-only osteotomy | 27705 |
| 18. Primary total ankle arthroplasty with implant | 27702 |
| 19. Revision total ankle arthroplasty with implant | 27703 |
| 20. ORIF of trimalleolar ankle fracture without posterior malleolar fixation | 27822 |
| 21. ORIF of trimalleolar ankle fracture with posterior malleolar fixation | 27823 |
Ethical Considerations
Ethical approval was not sought as this study involved only publicly available large language models and predefined procedural descriptions, with no human participants or patient data.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Disclosure forms for all authors are available online.
Supplemental material
Supplementary material is available online with this article.
References
