Abstract
Background
International Classification of Diseases (ICD) coding is essential for health insurance reimbursement, healthcare delivery, and public health management, supporting quality assessment, cost control, and clinical research. Traditional ICD coding relies on manual processes that are labor-intensive, time-consuming, and prone to human error. Large language models (LLMs) offer a promising approach for automating medical record coding; however, their clinical application is limited by the complexity of medical records and the highly specialized nature of clinical knowledge.
Objective
This study aims to evaluate the effects of different knowledge-based prompting strategies on LLMs’ ICD coding performance, identify optimal combinations of models and prompts, and assess their effectiveness in real-world medical record coding tasks.
Methods
A total of 800 discharge summaries from the Department of Urology at the First Affiliated Hospital of Soochow University, dated between 1 January and 31 May 2025, were randomly selected to construct a standardized dataset. The study was conducted in two stages. First, five prompting strategies were evaluated using GPT-4o across primary diagnosis, secondary diagnosis, and surgical procedure coding to identify the optimal strategy. Second, this strategy was applied to multiple LLMs to compare coding performance.
Results
Contextual prompting tailored to medical specialties achieved the best performance with GPT-4o, with accuracies of 84%, 85%, and 82% for primary diagnosis, secondary diagnosis, and surgical procedure coding, respectively. Using this strategy, DeepSeek-V3 achieved the highest overall performance, with accuracies of 89.5%, 88.6%, and 93.3% on the same tasks.
Conclusion
An integrated framework combining contextual prompting with DeepSeek-V3 substantially improves automated ICD coding accuracy and efficiency, demonstrating strong potential for clinical application.
Background
The International Classification of Diseases (ICD), developed by the World Health Organization (WHO), is a globally standardized system for disease classification. It systematically categorizes diseases according to etiology, pathology, clinical manifestations, and anatomical location, thereby providing a unified foundation for global disease statistics and health management. 1 Since its introduction, the ICD has become a core classification system in the international medical community and plays a pivotal role in the healthcare systems of most countries. ICD coding serves as a critical link among health insurance reimbursement, healthcare service delivery, insurance claims, and medical information management. It also constitutes a core data source for hospital operations and resource allocation. 2 The accuracy of ICD coding directly influences healthcare quality assessment, cost containment, clinical research, and public health policy formulation. Coding errors may lead to clinical misguidance, biased management decisions, insurance disputes, financial losses, and distortion of research data, resulting in serious systemic consequences. 3 Ensuring accurate ICD coding is therefore critically important. However, traditional ICD coding relies predominantly on manual processes performed by professionally trained coders. This process is laborious, time-consuming, and costly, and it is highly susceptible to human error due to variations in coder expertise, work fatigue, and the complexity of clinical documentation, making stable coding accuracy difficult to maintain. 4
To address these challenges, automated ICD coding has long been a major research focus in medical informatics and natural language processing. 5 Early studies primarily focused on rule-based algorithms and traditional machine learning models, such as Bayesian classifiers and support vector machines. However, these approaches commonly suffer from limited annotated data, poor recognition of low-frequency codes, weak generalization, and insufficient interpretability. 6 Currently, the accuracy of computer-assisted ICD coding generally remains between 60% and 80%. This severely constrains the progress of Diagnosis-Related Group (DRG) reforms and adversely affects insurance reimbursement and hospital operations. 7
In recent years, large language models (LLMs), including the GPT series and DeepSeek-V3, have demonstrated strong capabilities in natural language understanding and generation, offering new directions for automated ICD coding. 8 Several studies have explored the application of LLMs to ICD coding. For example, Yoo 9 designed a specialized fine-tuning framework; however, its performance remained inferior to that of human coders. Falis 10 reported that GPT-3.5 shows potential in medical text generation and coding but has not yet reached clinical usability standards. Simmons et al. 11 further demonstrated that LLMs underperform human coders in extracting ICD-10-CM codes. Dai et al. 4 developed an auxiliary system based on GPT-2, achieving an acceptability rate of 76.02%. Despite these preliminary efforts, current approaches still face two fundamental challenges. First, model robustness remains insufficient when handling the pervasive noise, ambiguity, and long-tailed distributions in clinical texts, making it difficult to achieve clinically acceptable reliability and consistency. Second, existing studies primarily focus on model modification, including domain-adapted training, complex hybrid architectures, and fine-tuning. Although these methods can improve performance on specific datasets, they depend heavily on computational resources, extensive annotated data, and complex architectures, leading to high costs, poor reproducibility, and limited clinical scalability.
Given the limitations of existing studies, this research adopts a paradigm-innovation perspective, shifting the focus from model modification based on large volumes of annotated data to model guidance through sophisticated prompt engineering, thereby fully leveraging the latent capabilities of general LLMs. Using real-world electronic medical record (EMR) data, this study systematically evaluates multiple knowledge injection strategies. The core findings demonstrate that, for exact-matching tasks such as ICD coding, contextual prompting strategies significantly outperform mainstream retrieval-augmented generation (RAG) in both accuracy and robustness, revealing the intrinsic advantage of precise, appropriately scaled direct knowledge injection over complex retrieval pipelines. A general LLM combined with precise contextual prompting achieves clinical-level coding accuracy without complex fine-tuning, with surgical procedure coding accuracy reaching 93.3%. This approach offers substantial advantages in operational simplicity, cost efficiency, and ease of clinical integration, providing an efficient and scalable technological pathway for intelligent medical record coding.
Materials and methods
Study design
This study is an observational study approved by the Ethics Committee of the First Affiliated Hospital of Soochow University (Approval No. 2025550). It aimed to evaluate the performance of LLMs in medical record coding. The study did not involve information that could identify patients or any clinical interventions; therefore, the requirement for informed consent was waived. The study was conducted in three sequential stages (Figure 1): (i) retrospective extraction of 800 discharge summaries from the Department of Urology to construct a standardized dataset; (ii) design of five distinct prompting strategies and evaluation of their performance using GPT-4o for primary diagnosis coding, secondary diagnosis coding, and surgical procedure coding, to identify the optimal prompting strategy; and (iii) application of the selected optimal prompting strategy to uniformly evaluate the coding performance of multiple LLMs, thereby identifying the optimal combination of model and prompt and providing empirical support for subsequent clinical implementation.

Figure 1. Study design diagram.
Dataset description
This study used all discharge records from the Department of Urology in the hospital EMR system between 1 January and 31 May 2025, as the sampling frame. A total of 800 discharge summaries with a hospitalization duration exceeding 24 h were randomly selected to construct a standardized dataset. The following 11 variables were extracted from each discharge summary: sex, age, marital status, admission diagnosis, admission time, surgical procedure name, surgery time, discharge diagnosis, discharge time, admission condition, and diagnostic and therapeutic course. After excluding corrupted files with quality issues such as garbled text or inaccessible formats, all data were anonymized to protect patient privacy.12,13 Two researchers independently validated the data to ensure integrity and accuracy.14,15 Owing to the overall high quality of the EMR system, all 800 records were ultimately included in the final analysis. The dataset was subsequently divided into a prompt development set and a test set at a ratio of 1:3. In addition, the corresponding coded data were extracted from the front pages of inpatient records and used as the reference standard for diagnostic prediction.
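For concreteness, the sampling and 1:3 partition described above can be expressed as the minimal Python sketch below; the record fields and the fixed random seed are our own illustrative assumptions, not part of the study protocol.

```python
import random

# Hypothetical stand-ins for the 800 extracted summaries; each record would
# carry the 11 EMR variables plus the gold-standard codes from the front page.
records = [{"id": i, "discharge_diagnosis": "...", "gold_codes": []} for i in range(800)]

random.seed(42)  # fixed seed purely for reproducibility of this sketch
random.shuffle(records)

# 1:3 split: 200 records for prompt development, 600 for testing.
dev_set, test_set = records[:200], records[200:]
assert len(dev_set) == 200 and len(test_set) == 600
```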
Constructing prompt strategy
All discharge summaries, prompt strategies, and contextual coding knowledge were presented in Chinese to ensure linguistic consistency and minimize cross-language bias. Guided by the Best Practices for Text Generation Prompts 16 and the CRISPE framework, 17 base prompts were designed and iteratively refined through multiple rounds of testing with the prompt encoder, yielding prompts that were stable, clear, and well organized (example prompts can be found in Supplemental Material 1). The samples used for this testing were collected independently and excluded from the main experimental dataset. Medical record coding was conducted with reference to both the ICD-10 diagnostic coding system and the ICD-9-CM-3 procedural coding system. Accordingly, five prompt strategies were designed following a progressive paradigm moving from no-resource settings to limited-resource settings, and finally to full-resource settings; a sketch contrasting the retrieval-based and contextual variants follows this list.

Prompt Strategy 1 (Base Prompt + Few-Shot). Given the poor performance of zero-shot methods reported in prior studies, a few-shot prompting approach was adopted: three annotated, standardized examples were appended to the base prompt, and no external coding database was introduced. This strategy evaluated the model's ability to perform coding using only internal parametric knowledge and limited examples, and served as the baseline for subsequent strategies.

Prompt Strategy 2 (Base Prompt + RAG–Specialty Coding). A dedicated RAG knowledge base was constructed by mapping the official “Urology Specialty Subset” of the Chinese National Clinical ICD Standard Database, a specialized component of the standard database containing approximately 500 ICD-10 and ICD-9-CM-3 codes. During implementation, semantic retrieval based on cosine similarity matched vectorized discharge summaries against knowledge-base entries, retrieving the most relevant codes for each case.

Prompt Strategy 3 (Base Prompt + RAG–General Coding). Building on Strategy 2, the retrieval scope was expanded to the full Chinese National Clinical ICD Standard Database (approximately 50,000 ICD-10 and ICD-9-CM-3 codes) to evaluate the impact of large-volume knowledge-base retrieval on model performance. The retrieval mechanism remained unchanged.

Prompt Strategy 4 (Base Prompt + Contextual–Specialty Coding). The approximately 500 standard urology ICD entries (the same official “Urology Specialty Subset”) were embedded into the prompt as structured tabular attachments. The model could reference this static contextual knowledge throughout code generation, enabling evaluation of the effect of compact, high-quality contextual information on coding performance.

Prompt Strategy 5 (Base Prompt + Contextual–General Coding). Based on Strategy 4, the contextual content was expanded to the complete Chinese National Clinical ICD coding database (approximately 50,000 codes), statically embedded into the prompt in the same structured tabular format, to evaluate the impact of very long contexts on model performance and serve as a direct comparison with Strategy 4.

All ICD code entries used in this study are provided in Supplemental Material 2.
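The sketch below contrasts how the two knowledge-injection routes assemble a prompt: Strategies 2 and 3 retrieve a handful of semantically similar entries, whereas Strategies 4 and 5 attach the whole table as static context. The `embed` function is a toy stand-in for a real sentence-embedding model, and the entry fields (`code`, `title`) are hypothetical; the study's actual prompts are given in Supplemental Material 1.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in for a real sentence-embedding model: hash character
    # trigrams into a fixed-size, L2-normalized bag-of-features vector.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve_codes(summary, icd_entries, k=10):
    # Strategies 2/3 (RAG): rank knowledge-base entries by cosine similarity
    # to the discharge summary and keep only the top-k for the prompt.
    q = embed(summary)
    return sorted(icd_entries, key=lambda e: -float(q @ embed(e["title"])))[:k]

def build_prompt(base_prompt, summary, icd_entries, mode="contextual"):
    if mode == "rag":                      # Strategies 2 and 3
        entries = retrieve_codes(summary, icd_entries)
    else:                                  # Strategies 4 and 5: whole table
        entries = icd_entries
    table = "\n".join(f"{e['code']}\t{e['title']}" for e in entries)
    return f"{base_prompt}\n\n[ICD reference table]\n{table}\n\n[Discharge summary]\n{summary}"
```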
Model selection and configuration
The experimental period of this study spanned from 1 April to 30 August 2025 and was divided into two phases. The first phase aimed to identify the optimal prompt strategy for medical record coding. Given the widespread influence of the ChatGPT series among global LLMs, and the methodological consistency between its prompting framework and that adopted in this study, GPT-4o was selected as the pilot experimental model. 18 Because the study focused on evaluating publicly accessible chatbot versions rather than interfaces built on application programming interfaces (APIs), which are costlier and more technically demanding, all experiments were conducted via web-based user interfaces under standardized hardware and network conditions. During this phase, five researchers independently applied the five prompt strategies by sequentially feeding 200 discharge summaries from the prompt development set into GPT-4o and comparing the model's generated codes with the gold standards. Coding accuracy was used as the primary evaluation metric for selecting the optimal prompt strategy. In the second phase, the optimal prompt strategy identified in Phase I was used as a unified input condition to systematically evaluate the real-world performance of multiple LLMs in medical record coding. This study explored a lightweight coding paradigm without domain-specific fine-tuning, and model selection followed four guiding principles: (1) the ability to process ultra-long contexts (at least 50,000 characters); (2) strong Chinese-language comprehension and instruction-following capabilities; (3) ecological diversity and reproducibility across international proprietary models, domestic open-source models, and mainstream commercial models; and (4) prioritization of models with transparent architectures, mature deployment ecosystems, open APIs, or local deployment support to ensure clinical operability and scalability. Ultimately, five models (GPT-4o, DeepSeek-V3, Qwen-3, Kimi-2, and Doubao-1.5) were included for comparative evaluation (detailed information for each model can be found in Supplemental Material 1).
Evaluation indicators and data collection methods
During the model performance evaluation phase, five senior coders with over five years of clinical coding experience were selected to form an independent review panel. The ICD codes confirmed through expert review of inpatient record front pages were used as the gold standard, and model outputs were evaluated using a blinded, randomized assessment protocol. The evaluation procedure consisted of the following steps: (1) Blinding: all model predictions were stripped of model identity before review to prevent labeling effects and evaluation bias; (2) Random allocation: each coder received only a randomly numbered summary sheet, remained unaware of model identities, and did not communicate with other reviewers, ensuring independence and objectivity of the evaluation process. The evaluation dimensions and decision criteria included: (1) Primary diagnosis coding accuracy: the predicted primary diagnosis and its ICD-10 subcategory code were required to exactly match the gold standard to be considered correct; (2) Secondary diagnosis coding accuracy: all secondary diagnoses and their corresponding ICD-10 subcategory codes generated by the model were required to fully cover the corresponding gold standard entries. Excess generation was permitted; however, omission of any required item was considered an error. (3) Surgical procedure coding accuracy: all surgical procedures and their corresponding ICD-9-CM-3 detailed codes generated by the model were required to fully cover the corresponding gold-standard entries. Over-generation was permitted; however, omission of any required item was considered an error. The evaluation standards adopted in this study strictly followed current manual medical record coding guidelines and represent one of the most rigorous assessment frameworks currently in use.
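Read as code, the three decision criteria amount to an exact-match check for the primary diagnosis and coverage checks (over-generation allowed, omission penalized) for the other two tasks. The sketch below is our illustration of these rules, not the panel's actual tooling; `pred` and `gold` are hypothetical code lists.

```python
def primary_correct(pred: str, gold: str) -> bool:
    # Criterion 1: the predicted ICD-10 subcategory code must exactly
    # match the gold standard.
    return pred == gold

def coverage_correct(pred: list, gold: list) -> bool:
    # Criteria 2 and 3: every gold-standard code must appear among the
    # model's outputs; extra (over-generated) codes are permitted,
    # but omitting any required code counts as an error.
    return set(gold) <= set(pred)

def accuracy(pairs: list, check) -> float:
    # pairs: list of (prediction, gold standard) tuples for one task.
    return sum(check(p, g) for p, g in pairs) / len(pairs)
```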
Statistical analysis
This study employed paired McNemar's tests to compare the performance of different prompt strategies and LLMs across three coding tasks: primary diagnosis coding, secondary diagnosis coding, and surgical procedure coding. To control for Type I errors resulting from multiple comparisons, Bonferroni corrections were applied separately to two analytical categories: (i) prompt strategy comparisons, involving pairwise comparisons among five strategies across three coding tasks; and (ii) model performance comparisons, involving pairwise comparisons among five LLMs across the same three coding tasks. Each category comprised C(5,2) × 3 = 30 independent tests, which were treated as independent families of hypothesis tests. The Bonferroni-adjusted significance level was set at α=0.05/30≈0.002. All statistical analyses were performed using SPSS Statistics (version 27.0; IBM Corp., USA).
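For reference, a single McNemar comparison of two paired correctness vectors can be run as below; the arrays are hypothetical, and statsmodels is used purely for illustration (the study itself used SPSS).

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired correctness indicators (1 = correct code) for two
# prompt strategies evaluated on the same discharge summaries.
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]

# 2x2 agreement/discordance table: rows = strategy A, cols = strategy B.
n11 = sum(x == 1 and y == 1 for x, y in zip(a, b))
n10 = sum(x == 1 and y == 0 for x, y in zip(a, b))
n01 = sum(x == 0 and y == 1 for x, y in zip(a, b))
n00 = sum(x == 0 and y == 0 for x, y in zip(a, b))

result = mcnemar([[n11, n10], [n01, n00]], exact=True)  # exact binomial test
alpha = 0.05 / 30  # Bonferroni correction across the 30 pairwise tests
print(f"p = {result.pvalue:.4f}, significant: {result.pvalue < alpha}")
```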
Results
Coding accuracy of different prompt strategy
This study compared the performance of the five prompt strategies implemented in GPT-4o across three clinical coding tasks: principal diagnosis, secondary diagnosis, and procedure coding (Figure 2). Thirty pairwise comparisons were performed using the McNemar test, revealing statistically significant differences among prompt strategies across the tasks. Detailed comparative data are available in Supplemental Material 1. Prompt Strategy 4 demonstrated the highest coding accuracy across all tasks, achieving 84% for principal diagnosis coding, 85% for secondary diagnosis coding, and 82% for procedure coding. After Bonferroni correction for multiple comparisons, no statistically significant differences were observed between Prompt Strategies 4 and 2 across the three tasks (P > 0.002). However, from a clinical perspective, Prompt Strategy 4 consistently achieved slightly higher accuracy than Prompt Strategy 2, with more pronounced advantages in clinically critical tasks such as secondary diagnosis coding. Prompt Strategy 2 also performed strongly, achieving accuracies of 80% for principal diagnosis coding, 78% for secondary diagnosis coding, and 79% for procedure coding, and significantly outperformed Prompt Strategy 1 (P < 0.002). Prompt Strategy 1 exhibited the lowest accuracy across all tasks, at 41%, 27%, and 31%, respectively. Prompt Strategies 3 and 5 demonstrated moderate performance, with no statistically significant difference between them (P > 0.002). Overall, Prompt Strategy 4 demonstrated superior performance across all three clinical coding tasks, indicating the highest potential for clinical application under the current experimental conditions.

Figure 2. Comparison of the accuracy of different prompt strategies across medical record coding tasks (n = 200).
Coding performance across different models
Using Prompt Strategy 4, this study systematically evaluated five LLMs (DeepSeek-V3, Qwen-3, GPT-4o, Kimi-2, and Doubao-1.5) across three coding tasks: principal diagnosis, secondary diagnosis, and procedure coding. The results (Figure 3) revealed statistically significant differences in accuracy among the models across all tasks (P < 0.002). In the principal diagnosis coding task, DeepSeek-V3 achieved the highest performance, with an accuracy of 89.5%, an F1 score of 0.90, and a Kappa coefficient of 0.89 (performance metrics for each model are presented in Table 1). This performance demonstrates near-expert-level coding capability and validates the effectiveness of the “moderate and precise contextual prompting” strategy. Qwen-3 (accuracy 80.2%, F1 = 0.81, Kappa = 0.79) and GPT-4o (accuracy 81.1%, F1 = 0.80, Kappa = 0.80) showed comparable performance, with no statistically significant difference (P > 0.002). Notably, GPT-4o demonstrated strong cross-language transferability and cross-domain generalization. Kimi-2 achieved an accuracy of 72.1%, whereas Doubao-1.5 showed the lowest performance, with an accuracy of 61.0% and corresponding F1 and Kappa values of 0.60 and 0.59. In the secondary diagnosis coding task, DeepSeek-V3 (88.6%) and Qwen-3 (87.8%) achieved the highest accuracies, with no statistically significant difference between them; GPT-4o (78.6%), Kimi-2 (76.6%), and Doubao-1.5 (72.1%) likewise showed no statistically significant differences among themselves. In the procedure coding task, DeepSeek-V3 remained the leading model, with an accuracy of 93.3%; Qwen-3 (84.6%) and GPT-4o (80.0%) were comparable, whereas Kimi-2 (73.3%) and Doubao-1.5 (69.0%) performed at relatively lower levels. Overall, DeepSeek-V3 significantly outperformed the other models across all three coding tasks, demonstrating comprehensive and consistent superiority. Qwen-3 and GPT-4o showed comparable, stable coding capabilities across multiple tasks, whereas Kimi-2 and Doubao-1.5 exhibited weaker overall performance, with particularly pronounced gaps in principal diagnosis coding.

Figure 3. Accuracy comparison of the five large language models across medical record coding tasks (n = 600).
Table 1. Performance comparison of the five large language models on the primary diagnosis coding task.
Note: The precision, recall, and F1-score values reported in the table are weighted averages, computed by weighting each class-specific metric by its class's sample size, thereby reflecting the model's overall performance across categories.
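These weighted averages match the standard scikit-learn definitions; a minimal sketch on hypothetical gold and predicted code lists (not the study's data) follows.

```python
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

# Hypothetical gold-standard and predicted primary-diagnosis codes.
gold = ["N40", "C64", "N20.0", "N40", "C67", "N40"]
pred = ["N40", "C64", "N20.0", "N13", "C67", "N40"]

# average="weighted" weights each class-specific metric by its support,
# i.e. the class's sample size, as described in the table note.
p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, average="weighted", zero_division=0
)
kappa = cohen_kappa_score(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} kappa={kappa:.2f}")
```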
Error frequency distribution of primary diagnosis coding
An analysis of the errors made by the five models in the primary diagnosis coding task revealed significant differences in their error frequency distributions (Figure 4). Overall, Doubao-1.5 recorded the highest frequencies across all five error categories (core treatment omission, insufficient diagnostic specificity, tumor staging deviation, misuse of mixed codes, and diagnostic coding inconsistency), with particularly prominent issues in “diagnostic coding inconsistency,” where the error count reached 60. In contrast, DeepSeek-V3 demonstrated the greatest robustness, showing the lowest error frequencies across all categories and achieving zero errors in “diagnostic coding inconsistency.” The remaining models, such as Kimi-2 and GPT-4o, showed moderate error frequencies. These findings suggest that considerable differences remain in the accuracy and consistency of current LLMs for clinical coding, with DeepSeek-V3 exhibiting a relative advantage in mitigating common coding errors.

Figure 4. Distribution of error frequencies in the primary diagnosis coding task for the five LLMs.
Discussion
Precise and appropriate knowledge injection is key to enhancing the performance of LLMs in medical coding
The results of this study demonstrate that precise and appropriately scaled knowledge injection is critical for unlocking the full coding potential of LLMs. 19 This effect arises from the highly specialized nature of medical coding, which requires models to accurately map unstructured clinical narratives to highly structured and precise ICD codes. This study designed five progressive prompting strategies to systematically compare different levels of knowledge augmentation, ranging from internal model knowledge to external knowledge integration. Prompt Strategy 1 achieved an accuracy of only 27–41%, indicating that reliance solely on embedded general medical knowledge is insufficient for reliable medical coding. 11 In professional clinical coding scenarios, therefore, the integration of external coding knowledge is essential rather than optional, consistent with prior studies. 20 On this basis, this study further clarifies how external knowledge can be delivered to models more effectively. Notably, Prompt Strategy 4 achieved the best performance, whereas Strategy 5 showed a marked decline. This contrast reveals a critical insight: effective knowledge injection depends on precision rather than quantity. 21 The ICD system contains tens of thousands of codes, whereas a single specialty typically involves only 500–1000 relevant entries. Restricting the injected knowledge to the relevant specialty subset therefore significantly reduces the model's cognitive load, allowing attention to focus on plausible candidates and thereby improving coding accuracy. This mechanism aligns with the label-attention strategies proposed by Yoo et al. 9 for managing the large ICD label space, further validating the effectiveness of precise knowledge injection in enhancing coding performance.
Compared with previous studies, this study demonstrates substantially improved performance. Early coding systems based on rules or traditional machine learning typically reported accuracies of 60–80%.6,7 Even with the introduction of LLMs, the GPT-2-based system proposed by Dai et al. achieved an “acceptability rate” of only 76.02%. 4 Simmons et al. further reported that LLM-based systems continue to underperform human coders in ICD coding tasks.11,22 In contrast, this study adopts a “general LLM + precise contextual prompting” framework, achieving substantially higher performance in both primary diagnosis coding (89.5%) and procedure coding (93.3%). This performance advantage can be attributed to the synergistic effects of three key factors. First, the generational advancement of model capabilities: the latest LLMs used in this study, including DeepSeek-V3 and GPT-4o, exhibit substantial advances over earlier models in language understanding, instruction following, and contextual reasoning, 23 providing a stronger foundation for parsing complex clinical narratives. Second, innovation in prompt engineering strategies: unlike prior studies that emphasized model fine-tuning, 24 this study focuses on prompt design and demonstrates that, for closed-domain tasks that require exact matching, contextual prompts that directly supply the relevant coding knowledge outperform RAG. This precise, targeted knowledge strategy avoids inefficient retrieval processes and improves both decision efficiency and accuracy. Finally, differences in task design and evaluation criteria also contribute to the observed performance gap: some studies 25 adopt general department settings, multi-label tasks, or relatively lenient evaluation metrics, whereas this study focuses on a single department and uses a strict exact-match accuracy metric. The nearly 90% performance achieved indicates that, in specific clinical scenarios, general LLMs combined with precise knowledge injection can approach the coding performance of human experts.
Contextual prompting outperforms RAG in closed-domain exact matching tasks
In closed-domain tasks that require exact matching, such as clinical record coding, contextual prompting outperforms RAG in both accuracy and robustness. Systematic evaluation shows that Strategy 4 consistently outperforms Strategy 2 in both coding accuracy and robustness. This finding diverges from prevailing mainstream perspectives. 26 It does not, however, negate the value of the RAG paradigm; rather, it suggests that the effectiveness of knowledge integration mechanisms strongly depends on the task. The task examined in this study is a closed-domain task requiring exact matching, not an open-domain knowledge question-answering task. Under these conditions, the fragility of the RAG retrieval-and-generation pipeline becomes more pronounced. Its reliance on semantic retrieval makes it highly sensitive to subtle variations in clinical terminology (e.g. “post thyroidectomy left” vs “post thyroidectomy status”), which increases the risk of retrieval errors and subsequent coding deviations. 27 This contrasts with the open-domain tasks focused on semantic integration for which RAG is primarily designed. 28 By contrast, contextual prompting places the entire coding table within the model's context window, transforming the task into a global matching problem. Through the self-attention mechanism, the model performs parallel comparisons between the input description and all candidate codes, enabling more comprehensive, robust decision making and removing the single point of failure at the retrieval step of the RAG pipeline. Although long-context inputs may introduce the “lost in the middle” problem, 29 the scale of the coding table in this study remains manageable and well within the processing capacity of modern LLMs, ensuring the feasibility of the proposed strategy. These conclusions are consistent with the split-retrieve-synthesize strategy recommended in the OpenAI guidelines for structured tasks. 30 Therefore, for closed-domain tasks requiring exact matching, such as clinical record coding, contextual prompting achieves a more favorable balance among performance, robustness, and implementation complexity: directly providing the complete set of candidate answers as contextual input may be simpler and more effective than constructing complex retrieval pipelines.
Performance variations and core bottlenecks in LLMs for medical record encoding
Under the experimental conditions of this study, DeepSeek-V3 demonstrated the strongest performance in medical record coding tasks, Qwen-3 and GPT-4o ranked next, and Kimi-2 and Doubao-1.5 exhibited comparatively weaker performance. This finding is consistent with prior studies reporting substantial performance heterogeneity among LLMs in clinical tasks. 31 Despite overall improvements in model capabilities, substantial performance gaps persist in specialized Chinese medical record coding tasks. Errors in primary diagnosis coding were classified into five categories: core treatment omission, insufficient diagnostic specificity, tumor staging deviation, misuse of mixed codes, and diagnostic coding inconsistency. Among these, core treatment omission and diagnostic coding inconsistency were the most prevalent error types across models. Specifically, DeepSeek-V3 exhibited significantly better error control than the other models in insufficient diagnostic specificity and diagnostic coding inconsistency, whereas Doubao-1.5 showed the highest error rates across all categories. These error patterns closely mirror the difficulties encountered in manual medical coding, indicating that LLMs reproduce typical cognitive challenges faced by human coders. 32 Core treatment omission manifests as a failure to infer core diagnoses from urological interventions: for example, when encountering the description “transurethral resection of the prostate,” models frequently encode “benign prostatic hyperplasia” directly while omitting the more specific diagnosis “bladder outlet obstruction,” revealing deficiencies in clinical causal reasoning. Insufficient diagnostic specificity and tumor staging deviation are particularly evident in urological oncology coding: models often generalize “clear cell renal carcinoma” as “malignant renal tumor” and fail to integrate pathological and imaging reports for accurate tumor staging, revealing fundamental limitations in integrating information from multiple sources and applying complex classification rules. Misuse of mixed codes reflects confusion regarding the clinical focus of urological cases, while diagnostic coding inconsistency indicates differences in semantic alignment stability across models: weaker models tend to generate illogical associations in complex clinical contexts, revealing deficiencies in task comprehension and contextual reasoning. In contrast, DeepSeek-V3 demonstrates superior performance, an advantage that may derive from its dynamic network architecture and enhanced capacity to retain key information, 33 providing important insights for future optimization.
Furthermore, the experimental results indicate that mitigating one type of error often amplifies others, suggesting that prompt engineering alone is insufficient and that systematic architectural and methodological upgrades are required. First, at the level of knowledge integration, a “clinical coding decision logic” framework should be established that structurally embeds rules such as therapy–diagnosis linkage, specificity prioritization, and staging evidence integration into the reasoning process, rather than merely injecting coding tables. Second, at the input processing stage, a prior information-structuring module should standardize the extraction of key elements from medical records, dividing the task into two stages (information extraction and logical coding) and thereby reducing the model's cognitive burden. Third, at the system design stage, we recommend a three-step closed-loop workflow: the model first rapidly screens cases for which it has high confidence; predefined rules then automatically detect obvious logical inconsistencies; and finally, human experts focus on reviewing complex or low-certainty cases, with review results fed back into the system 33 to enable continuous improvement over time. During model updating, priority should be given to targeted optimization using high-quality data from specialized medical domains, and the model should continuously learn from feedback from clinicians and other end users. Through this approach, research can gradually shift from simply injecting knowledge toward more refined and intelligent organization and application of medical knowledge, ultimately enabling the system to develop reliable reasoning capability.
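A minimal sketch of the recommended three-step closed-loop workflow, under our own assumptions about the interfaces involved (the `model_predict` callable, its self-reported confidence score, and the rule-check table are all hypothetical):

```python
def triage(record, model_predict, rule_checks, threshold=0.9):
    # Step 1: the model screens the case and reports a confidence score
    # (hypothetical interface; a deployed system would calibrate this).
    codes, confidence = model_predict(record)

    # Step 2: predefined rules flag obvious logical inconsistencies,
    # e.g. a procedure code without any compatible diagnosis code.
    flags = [name for name, check in rule_checks.items()
             if not check(record, codes)]

    # Step 3: low-confidence or flagged cases are routed to human experts;
    # reviewer corrections feed back into prompt and knowledge updates.
    if confidence < threshold or flags:
        return {"route": "human_review", "codes": codes, "flags": flags}
    return {"route": "auto_accept", "codes": codes, "flags": []}
```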
From potential to practice: application pathways and deployment strategies of LLMs for assisting medical record coding
In terms of clinical feasibility, under the optimal prompting strategy and model configuration, LLMs achieved approximately 90% or higher accuracy in primary diagnosis, secondary diagnosis, and surgical procedure coding tasks. This performance is consistent with recent studies and further confirms the substantial potential of LLMs in automated ICD coding. 34 Potential application scenarios include: (1) integration into existing coding systems as an assistive tool that provides recommendations to professional coders and improves efficiency; and (2) support for preliminary screening and quality control through batch automated coding of discharge records and identification of uncertain cases for manual review. Nevertheless, several critical challenges remain. First, even at 90% accuracy, the remaining 10% error margin may significantly affect medical insurance reimbursement and healthcare statistics in clinical settings. For the foreseeable future, therefore, LLMs should be positioned as assistive tools rather than autonomous systems; human-AI collaboration represents the more reliable approach. This is consistent with Johnson et al., who emphasize that human-in-the-loop mechanisms should be maintained in high-risk domains. 35 Second, the current study has not undergone end-to-end validation within hospital information systems (HIS), and real-world performance may be affected by practical factors such as the quality of EMR documentation. Accordingly, we recommend localized deployment of models to ensure data privacy and security, as well as integration of the optimal solution into standardized plugins embedded in HIS for evaluation in clinical practice. In addition, substantial differences in coding logic across medical specialties impose inherent limitations on general models. We therefore advise against a one-size-fits-all approach and advocate developing lightweight models tailored to specific medical specialties and domain characteristics. This strategy builds on open-source models and enables efficient, low-cost adaptation through prompt optimization and fine-tuning, providing a practical pathway to the domain-adaptation challenges of medical artificial intelligence. 36
Conclusion
This study proposes an integrated framework combining a contextual coding prompt strategy with DeepSeek-V3. Experiments on real-world medical record datasets show that, under the current deployment conditions, the framework achieves diagnostic coding accuracy approaching 90% and surgical procedure coding accuracy of 93.3%, reflecting outstanding automated coding performance. The approach achieves strong predictive performance without complex fine-tuning, is readily integrable into clinical workflows, and offers significant advantages in operational simplicity, reproducibility, and implementation cost. It provides a novel pathway for improving the efficiency and accuracy of medical record coding and has the potential to substantially enhance the quality and effectiveness of healthcare information management.
Limitations of the study
This study has several limitations. First, the experimental data were derived from a single dataset and did not include data from multiple sources, institutions of different sizes, or healthcare facilities at varying levels. Future studies should incorporate multicenter and heterogeneous datasets for external validation to further evaluate the model's generalizability. Second, the model was implemented in a web-based environment, which differs from the hardware requirements of a localized deployment. Therefore, whether localized deployment can achieve performance comparable to that observed in this study remains to be determined through comprehensive evaluation within HIS environments. Third, this study included five globally dominant LLMs based on experimental requirements. This selection does not negate the performance potential of other models, whose advantages remain to be further validated in subsequent research.
Ethics and patient consent
The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the First Affiliated Hospital of Soochow University (no. 2025550) on 27 June 2025, with the need for written informed consent waived.
Author contributions
XJ, CW, GS, ZW, and PL jointly contributed to the conceptualization and design of the study. YY and SZ were responsible for data collection, and XJ conducted the formal analysis. XJ and CW prepared the original draft and visualizations. LL and TS supervised the project and critically reviewed and edited the manuscript. All authors read and approved the final version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Nanjing Health Science and Technology Development Special Fund Project (grant number: YKK22074); the University-Industry Collaborative Education Program (grant number: 230905329045253); the Aid Project of Jiangsu Ningai Medical Development & Medical Aid Foundation; and the General Program of the China Hospital Reform and Development Research Institute of Nanjing University, Nanjing Drum Tower Hospital (grant number: NDYGN2025029).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
All data generated or analyzed in this study are included in this article. Further enquiries can be directed to the corresponding author.
Supplemental material
Supplemental material for this article is available online.