Sage Journals: Discover world-class research

Abstract

Background

Accurate procedural coding is essential for resource allocation, billing integrity, and quality reporting within critical care. Current Procedural Terminology (CPT) coding is largely manual and error-prone, especially in high-acuity environments such as the cardiovascular surgical intensive care unit (CVSICU), where complex procedures like extracorporeal membrane oxygenation (ECMO) are common. Large Language Models (LLMs) may offer scalable solutions for automated coding, but their performance in the CVSICU has not been systematically evaluated.

Methods

Six publicly accessible LLMs (GPT-4, Claude 3.7 Sonnet, Perplexity, DeepSeek, Google Gemini 2.5 Pro, Mistral) were tested on CPT code assignment for 47 CVSICU procedures, including 7 ECMO-related interventions, from a single tertiary center (July 2023 to May 2025). Models received prompts in a standardized format and were evaluated on code accuracy. Statistical comparisons assessed inter-model performance differences for ECMO-related and non-ECMO procedures.
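The abstract does not describe the scoring procedure in detail. As a minimal sketch, assuming exact-match scoring of one predicted code per procedure, per-model accuracy for ECMO and non-ECMO procedures could be computed as follows. The function name, the data layout, and all codes shown are hypothetical placeholders, not the study's data or real CPT codes.

```python
from collections import defaultdict


def accuracy_by_group(gold, predictions):
    """Return {model: {group: fraction correct}} using exact-match scoring.

    gold:        {procedure_id: (group, correct_code)}
    predictions: {model: {procedure_id: predicted_code}}
    """
    results = {}
    for model, preds in predictions.items():
        correct = defaultdict(int)
        total = defaultdict(int)
        for proc_id, (group, code) in gold.items():
            total[group] += 1
            # A missing answer or any non-identical code counts as incorrect.
            if preds.get(proc_id, "").strip() == code:
                correct[group] += 1
        results[model] = {g: correct[g] / total[g] for g in total}
    return results


# Hypothetical toy data: placeholder codes, NOT real CPT codes or study data.
gold = {
    "p1": ("ecmo", "11111"),
    "p2": ("ecmo", "22222"),
    "p3": ("non_ecmo", "33333"),
    "p4": ("non_ecmo", "44444"),
}
predictions = {
    "model_a": {"p1": "11111", "p2": "99999", "p3": "33333", "p4": "44444"},
    "model_b": {"p1": "99999", "p2": "99999", "p3": "33333"},  # p4 unanswered
}

scores = accuracy_by_group(gold, predictions)
print(scores)
```

Exact matching is the strictest choice. A study could instead give partial credit, for example for a correct code family, which would raise the reported accuracies.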

Results

For non-ECMO procedures, Gemini 2.5 Pro and Perplexity achieved the highest accuracy (88%), followed by DeepSeek (78%), Claude 3.7 Sonnet (75%), Mistral (68%), and GPT-4 (56%). For ECMO-related codes, Perplexity outperformed all other models (86%), followed by Gemini 2.5 Pro (71%), Mistral (43%), DeepSeek (29%), Claude 3.7 Sonnet (14%), and GPT-4 (0%). Pairwise comparisons revealed statistically significant inter-model differences.
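The abstract does not name the statistical test behind the pairwise comparisons. The sketch below is one plausible approach, not the authors' method: it infers correct-answer counts for the 7 ECMO-related procedures from the reported percentages, then runs a two-sided Fisher exact test on one pair of models. The inferred counts and the choice of test are assumptions. Because every model was scored on the same procedures, a paired test such as McNemar's would also be a reasonable choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k successes in row 1, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


N_ECMO = 7  # ECMO-related procedures, as stated in the Methods

# Reported ECMO accuracies (percent), taken from the Results.
reported = {"Perplexity": 86, "Gemini 2.5 Pro": 71, "Mistral": 43,
            "DeepSeek": 29, "Claude 3.7 Sonnet": 14, "GPT-4": 0}

# ASSUMPTION: correct counts inferred as the k for which k/7 rounds to the
# reported percentage; the abstract does not report raw counts.
inferred_correct = {"Perplexity": 6, "Gemini 2.5 Pro": 5, "Mistral": 3,
                    "DeepSeek": 2, "Claude 3.7 Sonnet": 1, "GPT-4": 0}

consistent = all(round(100 * k / N_ECMO) == reported[m]
                 for m, k in inferred_correct.items())

# Example pair: Perplexity (6 correct, 1 wrong) vs GPT-4 (0 correct, 7 wrong).
p_value = fisher_exact_two_sided(6, 1, 0, 7)
print(consistent, round(p_value, 4))
```

The non-ECMO percentages are left out on purpose: they do not all correspond to whole-number counts out of 40 procedures, so their denominators cannot be inferred from the abstract alone.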

Conclusions

While LLMs such as Perplexity and Gemini show promise for automated coding, their limited grasp of procedural context, specifically the context-dependent nuances of ECMO coding, remains a key barrier. Future work should focus on domain-specific fine-tuning that captures procedural context before these models are deployed in high-acuity clinical settings.

Keywords

artificial intelligence; cardiovascular surgical intensive care unit (CVSICU); clinical documentation; current procedural terminology (CPT) coding; extracorporeal membrane oxygenation (ECMO); large language models (LLMs)

Artificial intelligence for procedural coding in cardiac critical care: Evaluating large language models for current procedural terminology accuracy
