Evaluation of Large Language Models for Automated Simple CPT Coding in Foot and Ankle Surgery

Abstract

Background:

Accurate Current Procedural Terminology (CPT) coding is crucial for appropriate billing and reimbursement in foot and ankle surgery but is often time-consuming and prone to error. Large language models (LLMs) offer an approach to automate coding and reduce administrative burden, yet evidence on their performance in orthopaedic subspecialties remains limited. This study evaluated the accuracy of 5 publicly available LLMs (ChatGPT-5 Mini, Google Gemini 2.5 Flash, Claude 4.0 Sonnet, Deepseek V3, and Perplexity) in generating CPT codes for single-code (simple) foot and ankle procedures.

Methods:

Twenty-one common procedures identified by a single CPT code were selected. Each LLM was queried 4 times with standardized prompts requesting CPT codes. Accuracy was assessed based on the correct identification of CPT codes, with statistical analyses comparing performance across models and trials.

Results:

Perplexity achieved the highest accuracy (92.9%, 95% CI = 87.4%-98.4%), whereas Deepseek V3 performed worst (48.2%, 95% CI = 37.5%-58.9%). Global χ² testing showed significant differences in coding accuracy among models (P < .001). Pairwise comparisons revealed Perplexity outperformed ChatGPT-5 Mini, Claude 4.0 Sonnet, and Deepseek V3; Google Gemini 2.5 Flash outperformed Deepseek V3 and ChatGPT-5 Mini; and Claude 4.0 Sonnet outperformed Deepseek V3.

Conclusion:

LLM performance in CPT coding for simple foot and ankle procedures was highly variable, with some models demonstrating acceptable accuracy whereas others performed poorly. These findings highlight that current LLMs are not sufficiently reliable for independent clinical use. Select models may serve as first-pass aids when combined with careful human verification.

Keywords

Large language models, artificial intelligence, automated medical coding, billing accuracy

Introduction

Accurate documentation is essential in orthopaedic surgery to identify correct descriptive terms and Current Procedural Terminology (CPT) codes, ensuring proper billing and reimbursement.^1,2 Accurate CPT coding requires adherence to standardized guidelines; however, complexity arises because many procedures may require multiple or add-on CPT codes depending on the specific components of the case.¹

The adoption of electronic health records (EHRs) has substantially increased orthopaedic surgeons’ administrative workload, consuming more than 58% of their clinical time for order entry and documentation.³ This reduces patient interaction, raises error risk, and contributes to administrative burden, with about 70% of physicians attributing it to EHR use.³ To address these challenges, artificial intelligence, particularly large language models (LLMs), offers potential solutions, such as automating CPT code generation to reduce this burden.

LLMs are deep learning algorithms that generate and analyze content.⁴ In health care, they have been applied to diagnostics, patient communication, scribing, administrative tasks, and procedural coding.^5,6 Within orthopaedics, AI applications extend beyond coding to imaging annotation, implant evaluation, and predictive modeling for personalized surgical planning.^7-10 The Mayo Clinic orthopaedic AI laboratory currently develops tools for automated image annotation, risk stratification, and text processing to support diagnostics, surgical planning, and clinical data extraction.¹¹ Collectively, these studies and real-world applications demonstrate the broader potential of AI to augment orthopaedic practice; however, its application to administrative tasks such as CPT coding remains underexplored. Commercial platforms like GaleAI have also emerged to automate coding and improve revenue capture, reflecting growing industry interest in AI-assisted billing solutions.¹²

Although ChatGPT has demonstrated educational utility, achieving a 47% score on the Orthopaedic In-Training Examination (OITE), research evaluating its clinical or administrative accuracy in orthopaedics is limited.¹³ LLMs have been successfully applied to CPT code generation in specialties such as neurosurgery and aesthetic plastic surgery.^14,15 A recent evaluation of LLMs in hand surgery coding found that Perplexity and Bard achieved the highest accuracy for simple procedures, with Perplexity correctly identifying all 15 CPT codes and Bard correctly identifying 14.¹⁶ These results suggest that LLMs may effectively automate routine coding tasks for straightforward cases; however, their performance declines in more complicated scenarios. In hand procedures with more than 1 required CPT code for billing, Perplexity and Bard accurately coded only 3 of 5 procedures, whereas Bing AI and ChatGPT failed to generate any accurate CPT codes.

To date, no research has examined the accuracy of LLMs in identifying CPT codes specific to foot and ankle surgery, and it remains unclear how their performance may change as models evolve. This study therefore aimed to assess publicly available LLMs in generating CPT codes for common foot and ankle procedures and to compare results across 2 time points spaced 1 month apart. We hypothesized that LLMs would perform well on simple procedures with a single code, where ambiguity is minimal and code retrieval is straightforward, and that their accuracy would improve on retesting after 1 month.

Methods

In this retrospective study, the senior author reviewed the most commonly performed foot and ankle procedures at our institution and selected 21 for analysis. CPT codes were identified through review of published orthopaedic literature on PubMed and subsequently verified using Codify by the American Academy of Professional Coders.¹⁷ All codes are provided in Supplemental Table S1. The procedure names were then used to generate standardized query prompts in the following format: “What is the most appropriate CPT code(s) for [procedure]?”
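The prompt construction described above can be sketched in a few lines of Python. This is an illustration only: the template string is quoted from the text, whereas the helper name `build_prompts` and the two example procedures (taken from the supplemental table) are choices made here, not a description of the authors' tooling.

```python
# Template quoted from the Methods; everything else is a hypothetical helper.
PROMPT_TEMPLATE = "What is the most appropriate CPT code(s) for {procedure}?"


def build_prompts(procedures):
    """Return one standardized prompt per procedure name."""
    return [PROMPT_TEMPLATE.format(procedure=name.strip()) for name in procedures]


example_procedures = [
    "ORIF of bimalleolar ankle fracture",
    "Primary Achilles tendon repair without graft",
]

for prompt in build_prompts(example_procedures):
    print(prompt)
```

Keeping the wording fixed in a single template means that differences between responses can be attributed to the model or the trial rather than to the phrasing of the question.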

Five LLMs were selected for comparison: ChatGPT-5 Mini, Google Gemini 2.5 Flash, Claude 4.0 Sonnet, Deepseek V3, and Perplexity. Each LLM was queried with the complete set of prompts in 4 trials conducted 24 hours apart. Following the final trial, researchers implemented a 1-month interval before retesting the individual CPT codes, employing the same evaluation schema to ensure consistency and reliability of results.

Responses were evaluated using a graded correctness framework. A response was deemed fully correct only if the language model accurately identified and explicitly specified the single most appropriate CPT code for the procedure. Responses were classified as partially correct if multiple CPT codes were provided and 1 of them was correct. Responses that did not include the correct CPT code were classified as incorrect.
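The three-level grading rule can be expressed as a small function. The sketch below assumes that every 5-digit number in a response is a proposed CPT code, which is a simplification: a response that mentions a second code only to rule it out would be graded as partially correct here, whereas a human reviewer might grade it as correct. The function names are hypothetical.

```python
import re


def extract_cpt_codes(response_text):
    """Collect the distinct 5-digit numbers in a response (assumed to be CPT codes)."""
    return set(re.findall(r"\b\d{5}\b", response_text))


def grade_response(response_text, reference_code):
    """Apply the graded correctness framework to one response."""
    codes = extract_cpt_codes(response_text)
    if codes == {reference_code}:
        return "correct"            # exactly one code, and it is the right one
    if reference_code in codes:
        return "partially correct"  # right code offered alongside others
    return "incorrect"              # right code absent


print(grade_response("The CPT code is 27814.", "27814"))
print(grade_response("Consider 27814 or 27822.", "27814"))
print(grade_response("Use 27822.", "27814"))
```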

Data Analysis

All statistical analyses were conducted using Python (version 3.7, Python Software Foundation, https://www.python.org/). Fisher’s exact tests were used to evaluate the overall association between the AI model used and the accuracy of responses. Fisher’s exact tests were also used to compare intertrial accuracy for each model. For pairwise comparisons, accuracy data were collapsed into 2 categories (correct vs incorrect) and analyzed using Fisher’s exact tests. This test was chosen because of its suitability for contingency tables with small expected frequencies. Percent accuracy was calculated for each LLM, with 95% CIs calculated using the Wald formula. Statistical significance was defined as P < .05 for all analyses.
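For readers unfamiliar with the Wald formula, the interval is the observed proportion plus or minus z times the square root of p(1 - p)/n. A minimal sketch follows; the function name `wald_ci` and the choice of z = 1.96 are assumptions made for illustration, as the implementation used in the study is not described.

```python
import math


def wald_ci(p_hat, n, z=1.96):
    """Wald confidence interval for a proportion, clipped to [0, 1]."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Example: 78 fully correct responses out of 84 queries.
low, high = wald_ci(78 / 84, 84)
print(f"{low:.1%} to {high:.1%}")
```

For 78 of 84 responses, the result agrees with the interval reported for Perplexity to within about 0.1 percentage point; the small difference is consistent with rounding of the half-width before it was applied. The Wald interval is known to behave poorly for proportions near 0 or 1, which is worth keeping in mind for the best-performing models.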

Results

The numbers of correct, partially correct, and incorrect responses for simple procedures by AI model are listed in Table 1. Perplexity displayed the best performance, with 78 fully correct responses, 0 partially correct responses, and 6 incorrect responses (92.9% accuracy, 95% CI = 87.4%-98.4%). Deepseek V3 demonstrated the worst performance, with 40 fully correct responses, 1 partially correct response, and 43 incorrect responses (48.2% accuracy, 95% CI = 37.5%-58.9%). Google Gemini 2.5 Flash displayed 82.1% accuracy (95% CI = 73.9%-90.3%), Claude 4.0 Sonnet demonstrated 72.6% accuracy (95% CI = 63.1%-82.1%), and ChatGPT-5 Mini demonstrated 59.5% accuracy (95% CI = 49.0%-70.0%).
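The reported percentages can be checked against the Table 1 counts. The sketch below sums each model's counts over the 4 trials and assumes that a partially correct response earns half credit. That weighting is not stated in the text; it is inferred here because it reproduces the reported 48.2% and 59.5%, whereas counting only fully correct responses would give 47.6% and 58.3% for Deepseek V3 and ChatGPT-5 Mini.

```python
# (correct, partially correct, incorrect), summed over the 4 trials in Table 1
counts = {
    "ChatGPT-5 Mini": (49, 2, 33),
    "Google Gemini 2.5 Flash": (69, 0, 15),
    "Claude 4.0 Sonnet": (61, 0, 23),
    "Deepseek V3": (40, 1, 43),
    "Perplexity": (78, 0, 6),
}

PARTIAL_CREDIT = 0.5  # assumption: inferred from the reported values, not stated


def percent_accuracy(correct, partial, incorrect, partial_credit=PARTIAL_CREDIT):
    """Percent accuracy with fractional credit for partially correct responses."""
    total = correct + partial + incorrect
    return 100 * (correct + partial_credit * partial) / total


for model, (c, p, i) in counts.items():
    print(f"{model}: {percent_accuracy(c, p, i):.1f}%")
```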

Table 1.

Number of Correct, Partially Correct, and Incorrect Responses for Simple Procedures by AI model.^a

AI Model	Pair 1						Pair 2						Intertrial P Value
	Trial 1			Trial 2			Trial 1			Trial 2
	Correct	Partially Correct	Incorrect	Correct	Partially Correct	Incorrect	Correct	Partially Correct	Incorrect	Correct	Partially Correct	Incorrect
ChatGPT-5 Mini	11	0	10	10	1	10	15	0	6	13	1	7	.576
Google Gemini 2.5 Flash	13	0	8	16	0	5	20	0	1	20	0	1	.010
Claude 4.0 Sonnet	17	0	4	16	0	5	15	0	6	13	0	8	.553
Deepseek V3	13	0	8	16	1	4	2	0	19	9	0	12	.003
Perplexity	19	0	2	17	0	4	21	0	0	21	0	0	.048
Total	73	0	32	75	2	28	73	0	32	76	1	28

Fisher exact tests assessed differences in accuracy between trials. Boldface indicates significance (P < .05).

There were significant differences in coding accuracy across trials when using Google Gemini 2.5 Flash (P = .010), Deepseek V3 (P = .003), and Perplexity (P = .048). A global comparison of all 5 models using χ² testing demonstrated a statistically significant difference in overall accuracy for simple procedures (P < .001). Results for pairwise comparisons of simple procedure coding accuracy between models are shown in Table 2. Perplexity performed significantly better than ChatGPT-5 Mini (P < .001), Claude 4.0 Sonnet (P = .001), and Deepseek V3 (P < .001).

Table 2.

P Values for Pairwise Comparisons of Coding Accuracy for Simple Procedures Between LLMs.^a

	ChatGPT-5 Mini	Google Gemini 2.5 Flash	Claude 4.0 Sonnet	Deepseek V3
ChatGPT-5 Mini
Google Gemini 2.5 Flash	.001
Claude 4.0 Sonnet	.074	.196
Deepseek V3	.216	<.001	.002
Perplexity	<.001	.060	.001	<.001

Partially correct answers were reclassified as incorrect to facilitate analysis. Boldface indicates significance (P < .05).
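A pairwise comparison of this kind reduces to a 2 × 2 table (model × correct/incorrect). The following standard-library sketch computes a two-sided Fisher’s exact P value by summing the hypergeometric probabilities of all tables that are no more likely than the observed one. It is an illustration under stated assumptions: the function name is hypothetical, the study's own routine is not described, and the output has not been checked against the published P values.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(x_min, x_max + 1)]
    # Small tolerance so that ties with the observed table are included
    total = sum(p for p in probs if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Perplexity (78 correct, 6 not) vs Deepseek V3 (40 correct, 44 not),
# with the single partially correct response counted as incorrect.
p_value = fisher_exact_two_sided(78, 6, 40, 44)
print(f"P = {p_value:.2e}")
```

Summing over tables that are "as or less probable" is one common definition of the two-sided test; other software may use a different convention and return slightly different values.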

Discussion

Accurate use of CPT codes is fundamental to ensuring correct billing and reimbursement in orthopaedic surgery. However, the process of thorough documentation and code assignment is often time-consuming and detracts from valuable clinical time spent with patients.¹⁸ In this context, LLMs have emerged as potentially promising tools that may help address administrative burden by offering rapid, automated support for CPT code generation.^19,20 Their ability to process natural language and map clinical documentation to structured outputs positions them as potential accelerators of efficiency in health care coding workflows. This study systematically evaluated the coding performance of 5 publicly available LLMs on single-CPT–coded foot and ankle surgeries. Our findings show that simple procedures can potentially be coded with reasonable fidelity by some models; however, overall reliability remains inconsistent.

Performance on coding procedures was highly variable, with accuracy ranging from 48.2% to 92.9%, consistent with prior reports across neurosurgery, aesthetic plastic surgery, hand surgery, and spine billing when tasks are narrowly defined and unambiguous.^14-16,21 However, intertrial stability varied by model. Although some models demonstrated stable performance across trials, others exhibited noticeable variability despite high overall accuracy. This observation is clinically relevant, as reproducibility, rather than peak performance alone, is essential for real-world deployment. Models that maintain consistent accuracy across repeated prompts may be better suited for routine coding support, where clerical fatigue and oversight commonly contribute to human error.²² In these scenarios, LLMs could offer practical value by accelerating documentation and reducing administrative burden without compromising reliability. However, in the present study, the observed error rates across several models raise substantial concerns, as accuracy below acceptable clinical thresholds renders these current systems unsuitable for independent use in real-world billing scenarios.

Among the models tested, Perplexity demonstrated the highest overall accuracy (92.9%), significantly outperforming ChatGPT-5 Mini, Claude 4.0 Sonnet, and Deepseek V3. Google Gemini 2.5 Flash also performed significantly better than Deepseek V3 and ChatGPT-5 Mini (P < .001 and P = .001, respectively), whereas Claude 4.0 Sonnet outperformed Deepseek V3 (P = .002). These findings underscore meaningful differences among LLMs, likely reflecting variation in training data, architectural design, and optimization strategies.^23,24 Importantly, the observed variability in performance further emphasizes the need for careful model selection and rigorous validation before implementation, as well as potential improvement through domain‑specific training or access to official coding knowledge bases rather than relying solely on general pretrained knowledge.^25-27

The clinical implications of these findings are significant. For simple, single-code procedures, LLMs could help reduce time spent on administrative tasks, enabling physicians and coders to focus more on patient-facing responsibilities.²⁸ Perplexity and Google Gemini 2.5 Flash may be best positioned as first-pass aids, generating preliminary coding suggestions that coders or clinicians can rapidly verify, enhancing efficiency without displacing human oversight. This role aligns with broader calls for AI systems to alleviate documentation burden and reduce physician burnout in health care.

Future development should focus on domain-specific fine-tuning with curated coding data sets, as well as hybrid systems that integrate LLMs with rule-based logic engines to improve adherence to CPT coding rules. Beyond accuracy, further evaluations should examine reproducibility, transparency of reasoning, and error-detection capabilities, all of which are critical for safe integration into clinical workflows. Broader validation across additional surgical subspecialties is also needed to establish generalizability, as coding challenges vary substantially between domains. Policy considerations, including accountability, liability, and regulatory compliance, must be addressed before widespread adoption of LLM-assisted coding can occur.

Limitations

This study is the first to systematically evaluate multiple publicly available LLMs in generating CPT codes specifically for foot and ankle procedures. However, the simplified categorization of procedures and reliance on standardized text prompts may not fully replicate the complexity of clinical documentation or decision-making. Additionally, only a snapshot of current LLMs was assessed, without accounting for ongoing model updates. The 10 pairwise comparisons between models were exploratory and were conducted at α = 0.05 without multiplicity correction; the findings are therefore subject to inflated type I error, and small differences between models should be interpreted cautiously. Finally, the study’s focus on foot and ankle procedures limits generalizability to other subspecialties or broader coding domains.
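To gauge how sensitive the pairwise findings are to multiplicity, a Bonferroni adjustment can be applied to the 10 P values in Table 2. The sketch below is a supplementary illustration, not an analysis performed in the study. It enters values reported as "<.001" at their upper bound of 0.001, which is a conservative assumption.

```python
# P values transcribed from Table 2; "<.001" is entered as 0.001 (upper bound).
pairwise_p = {
    ("Google Gemini 2.5 Flash", "ChatGPT-5 Mini"): 0.001,
    ("Claude 4.0 Sonnet", "ChatGPT-5 Mini"): 0.074,
    ("Claude 4.0 Sonnet", "Google Gemini 2.5 Flash"): 0.196,
    ("Deepseek V3", "ChatGPT-5 Mini"): 0.216,
    ("Deepseek V3", "Google Gemini 2.5 Flash"): 0.001,
    ("Deepseek V3", "Claude 4.0 Sonnet"): 0.002,
    ("Perplexity", "ChatGPT-5 Mini"): 0.001,
    ("Perplexity", "Google Gemini 2.5 Flash"): 0.060,
    ("Perplexity", "Claude 4.0 Sonnet"): 0.001,
    ("Perplexity", "Deepseek V3"): 0.001,
}

ALPHA = 0.05
bonferroni_threshold = ALPHA / len(pairwise_p)  # 0.05 / 10 comparisons

unadjusted = [pair for pair, p in pairwise_p.items() if p < ALPHA]
adjusted = [pair for pair, p in pairwise_p.items() if p <= bonferroni_threshold]

print(f"Bonferroni threshold: {bonferroni_threshold:.3f}")
print(f"Significant before adjustment: {len(unadjusted)}")
print(f"Significant after adjustment: {len(adjusted)}")
```

Under these assumptions, the same 6 comparisons that are significant at .05 also fall at or below the Bonferroni threshold of .005, so the ranking of models does not appear to depend on the lack of correction. The intertrial comparisons in Table 1 are not covered by this check.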

Conclusion

This study systematically evaluated multiple LLMs for CPT code generation in foot and ankle procedures, focusing on simple, single-code tasks. Accuracy for these procedures varied widely, ranging from 48.2% (Deepseek V3) to 92.9% (Perplexity). Although select currently available models may serve as preliminary decision support tools, reliance on their outputs without human verification may result in significant clinical and financial consequences.

Supplemental Material

Supplemental material, sj-pdf-1-fao-10.1177_24730114261448207 for Evaluation of Large Language Models for Automated Simple CPT Coding in Foot and Ankle Surgery by Eve R. Glenn, Ariana Rowshan, Eric Mao, David Ryu, Yesha Parekh, Nigel N. Hsu, John M. Thompson and Amiethab A. Aiyer in Foot & Ankle Orthopaedics

Footnotes

Appendix

Supplemental Table S1.

Current Procedural Terminology Codes for Simple Orthopaedic Procedures Evaluated Across 5 Large Language Model Platforms.

Procedure	CPT Code(s)
1. Open reduction and internal fixation (ORIF) of medial malleolar ankle fracture	27766
2. ORIF of bimalleolar ankle fracture	27814
3. ORIF of calcaneal fracture	28415
4. Primary Achilles tendon repair without graft	27650
5. Primary Achilles tendon repair with graft	27652
6. Secondary Achilles tendon repair without graft	27654
7. Ankle arthroscopy with limited debridement	29897
8. Ankle arthroscopy with extensive debridement	29898
9. Distal single MT osteotomy bunionectomy for hallux valgus	28296
10. Double MT osteotomy bunionectomy for hallux valgus	28299
11. First metatarsophalangeal (MTP) joint fusion	28750
12. Subtalar arthrodesis (single joint)	28725
13. Triple arthrodesis (subtalar, talonavicular, calcaneocuboid joints)	28715
14. Single excision of Morton’s neuroma	28080
15. Excision of ganglion cyst of the distal foot	28090
16. Removal of deep implant (e.g., screw or plate)	20680
17. Tibia-only osteotomy	27705
18. Primary total ankle arthroplasty with implant	27702
19. Revision total ankle arthroplasty with implant	27703
20. ORIF of trimalleolar ankle fracture without posterior malleolar fixation	27822
21. ORIF of trimalleolar ankle fracture with posterior malleolar fixation	27823

ORCID iDs

Eve R. Glenn, ScB

Ariana Rowshan, BS

David Ryu, BA

Nigel N. Hsu, MD

Amiethab A. Aiyer, MD

Ethical Considerations

Ethical approval was not sought as this study involved only publicly available large language models and predefined procedural descriptions, with no human participants or patient data.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Disclosure forms for all authors are available online.

Supplemental material

Supplementary material is available online with this article.

References

1. Rogero, Rider, Grear, Richardson, Murphy, Bettin CC. Coding patterns and implications for reimbursement in foot-and-ankle surgery. Cureus. 2025;17(4):e81955. doi:10.7759/cureus.81955

2. Filler BC. Coding basics for orthopaedic surgeons. Clin Orthop Relat Res. 2007;457:105-113. doi:10.1097/blo.0b013e31803372b8

3. Kesler, Wynn, Pugely AJ. Time and clerical burden posed by the current electronic health record for orthopaedic surgeons. J Am Acad Orthop Surg. 2022;30(1):e34-e43. doi:10.5435/JAAOS-D-21-00094

4. Chatterjee, Bhattacharya, Pal, Lee, Chakraborty. ChatGPT and large language models in orthopedics: from education and surgery to research. J Exp Orthop. 2023;10(1):128. doi:10.1186/s40634-023-00700-1

5. Alowais, Alghamdi, Alsuhebany, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689. doi:10.1186/s12909-023-04698-z

6. Mess, Mackey, Yarowsky DE. Artificial intelligence scribe and large language model technology in healthcare documentation: advantages, limitations, and recommendations. Plast Reconstr Surg Glob Open. 2025;13(1):e6450. doi:10.1097/GOX.0000000000006450

7. Kumar, Roche, Overman, et al. What is the accuracy of three different machine learning techniques to predict clinical outcomes after shoulder arthroplasty? Clin Orthop Relat Res. 2020;478(10):2351-2363. doi:10.1097/CORR.0000000000001263

8. Lisacek-Kiosoglous, Powling, Fontalis, Gabr, Mazomenos, Haddad FS. Artificial intelligence in orthopaedic surgery. Bone Joint Res. 2023;12(7):447-454. doi:10.1302/2046-3758.127.BJR-2023-0111.R1

9. Richardson ML. MR protocol optimization with deep learning: a proof of concept. Curr Probl Diagn Radiol. 2021;50(2):168-174. doi:10.1067/j.cpradiol.2019.10.004

10. Kurmis, Ianunzio JR. Artificial intelligence in orthopedic surgery: evolution, current state and future directions. Arthroplasty. 2022;4(1):9. doi:10.1186/s42836-022-00112-z

11. Orthopedic Surgery Artificial Intelligence Laboratory: Cody C. Wyles, Michael J. Taunton - Overview. Mayo Clinic. Accessed December 28, 2025. https://www.mayo.edu/research/labs/orthopedic-surgery-artificial-intelligence-laboratory/overview

12. GaleAI. Effortless medical coding with AI. Accessed December 28, 2025. https://www.galeai.co/

13. Kung, Marshall, Gauthier, Gonzalez, Jackson JB. Evaluating ChatGPT performance on the orthopaedic in-training examination. JB JS Open Access. 2023;8(3):e23.00056. doi:10.2106/JBJS.OA.23.00056

14. O’Malley, Sarwar, Cassimatis, et al. Can publicly available artificial intelligence successfully identify Current Procedural Terminology codes for common procedures in neurosurgery? World Neurosurg. 2024;183:e860-e870. doi:10.1016/j.wneu.2024.01.043

15. Isch, Sambangi, Somers, et al. The role of artificial intelligence in enhancing CPT coding accuracy for aesthetic plastic surgery: insight into large language models. J Plast Reconstr Aesthet Surg. 2025;103:226-228. doi:10.1016/j.bjps.2025.02.031

16. Isch, Lee, Self, et al. Artificial intelligence in surgical coding: evaluating large language models for Current Procedural Terminology accuracy in hand surgery. J Hand Surg Glob Online. 2025;7(2):181-185. doi:10.1016/j.jhsg.2024.11.013

17. Medical Coding & Billing Tools - CPT®, ICD-10, HCPCS Codes, & Modifiers - Codify by AAPC. Accessed December 28, 2025. https://www.aapc.com/codes/

18. Kirsch, Fakhry, Bernard, Tominaga GT. Documentation and coding for trauma and surgical critical care: updates and tips. Trauma Surg Acute Care Open. 2024;9(1):e001532. doi:10.1136/tsaco-2024-001532

19. Clusmann, Kolbinger, Muti, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3(1):141. doi:10.1038/s43856-023-00370-1

20. Zhang, Meng, Yan, et al. Revolutionizing health care: the transformative impact of large language models in medicine. J Med Internet Res. 2025;27:e59069. doi:10.2196/59069

21. Zaidat, Lahoti, Mohamed, Cho, Kim JS. Artificially intelligent billing in spine surgery: an analysis of a large language model. Global Spine J. 2025;15(2):1113-1120. doi:10.1177/21925682231224753

22. Aghighi, Aryankhesal, Raeissi. Factors affecting the recurrence of medical errors in hospitals and the preventive strategies: a scoping review. J Med Ethics Hist Med. 2022;15:7. doi:10.18502/jmehm.v15i7.11049

23. Blanchfield, Heffernan, Osgood, Sheehan, Meyer GS. Saving billions of dollars—and physicians’ time—by streamlining billing practices. Health Affairs. 2010;29(6):1248-1254. doi:10.1377/hlthaff.2009.0075

24. Pavuluri, Sangal, Sather, Taylor RA. Balancing act: the complex role of artificial intelligence in addressing burnout and healthcare workforce dynamics. BMJ Health Care Inform. 2024;31(1):e101120. doi:10.1136/bmjhci-2024-101120

25. Simmons, Takkavatakarn, McDougal, et al. Extracting international classification of diseases codes from clinical documentation using large language models. Appl Clin Inform. 2025;16(2):337-344. doi:10.1055/a-2491-3872

26. Roy, Self, Isch, et al. Evaluating large language models for automated CPT code prediction in endovascular neurosurgery. J Med Syst. 2025;49(1):15. doi:10.1007/s10916-025-02149-4

27. Hou, Liu, Bian, Zhuang. Enhancing medical coding efficiency through domain-specific fine-tuned large language models. NPJ Health Syst. 2025;2(1):14. doi:10.1038/s44401-025-00018-3

28. Bai, Luo, Zhang, et al. Assessment and integration of large language models for automated electronic health record documentation in emergency medical services. J Med Syst. 2025;49(1):65. doi:10.1007/s10916-025-02197-w
