
Abstract

Objective

This study aims to investigate the development of automated International Classification of Diseases (ICD) coding models using the Medical Information Mart for Intensive Care (MIMIC) dataset. This work integrates computer science and clinical perspectives to evaluate progress, identify challenges, and provide insights for future ICD coding automation.

Methods

We conducted a systematic review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. We selected 73 studies between 2014 and 2024 and extracted key information about data preprocessing, knowledge integration, model architectures, evaluation strategies, and explainability.

Results

Of the reviewed papers, 69.57% (48 papers) focused on utilizing medical knowledge, primarily through knowledge graphs. The methods have evolved from traditional machine learning techniques to more advanced approaches, such as deep learning, knowledge reasoning, information retrieval, and generative models. Since 2019, F1-micro scores have consistently improved: studies using the MIMIC-III full dataset have shown a 6.4% increase, while those using the MIMIC-III top-50 dataset have shown a 10.2% increase. Furthermore, 60.27% (44 papers) implemented strategies to enhance explainability, including attention visualization and analysis.

Conclusion

Performance on automated ICD coding tasks has improved, but challenges remain: a lack of diverse data, inadequate use of medical knowledge, complex algorithms, and insufficient validation by clinical coders. These issues hinder the implementation of models in practice. Future research should focus on integrating a wider range of multimodal data, enhancing the application of medical knowledge, and improving the explainability of models.

Keywords

Automated International Classification of Diseases coding, Medical Information Mart for Intensive Care dataset, healthcare applicability, multi-label classification, model explainability

Introduction

The International Classification of Diseases (ICD) is crucial for efficiently storing, retrieving, and analyzing health data. It supports payment systems, service planning, and the administration of quality and safety in healthcare settings.¹ The critical role of ICD in optimizing healthcare delivery underscores the commitment of hospital administrators to enhancing the quality of ICD data.

However, manual ICD coding presents significant challenges. Coders must meticulously review medical records, verify contexts, consult with physicians, and follow frequently updated coding guidelines.^2,3 This complex task requires extensive skills in clinical medicine, health statistics, and disease classification. Moreover, manual coding is prone to errors, with reported accuracy rates ranging from 50% to 98% and a median accuracy of 80%.^4,5 It is also time-consuming; a study by the American Health Information Management Association (AHIMA) on coding productivity⁶ showed that coders processed about 24 inpatient records per day, spending roughly 20 min per record using the ICD-9 Clinical Modification (ICD-9-CM) system. In NHS Scotland, clinical coders typically handle around 60 cases per day.⁷ With the shift to the more detailed ICD-10 Clinical Modification/Procedure Coding System (ICD-10-CM/PCS), the average time per inpatient record increased to approximately 38 min.⁶

Given these challenges, the healthcare industry needs tools that enhance and streamline the coding process. Automated clinical coding, a subset of Computer-Assisted Coding, utilizes artificial intelligence techniques, such as natural language processing and machine learning, to improve efficiency and data quality.⁸

Previous reviews⁹ on automated clinical coding have primarily analyzed this task from a computer science perspective. However, ICD coding intersects multiple disciplines: understanding the task and effectively improving modeling methods requires solid medical knowledge and interdisciplinary teamwork. This review aims to provide a thorough, clinically based evaluation of automated ICD coding models built using the Medical Information Mart for Intensive Care (MIMIC) dataset. This review advances the field in several key areas:

It highlights the clinical perspective of ICD coding by emphasizing practical classification needs, data characteristics, and the importance of domain-specific medical knowledge, as discussed in the “Introduction” section.

It thoroughly examines the development of ICD coding models across essential aspects such as data handling, knowledge integration, model frameworks, paradigms, evaluation metrics, and explainability, as detailed in the “Results” section.

It evaluates model performance across studies using the MIMIC dataset and a consistent data-splitting strategy, enabling fair comparison of evaluation metrics over publication years and paradigms, as shown in the “Evaluation metrics and results” section.

From a practical standpoint, it evaluates whether current models fulfill real-world clinical needs and provides clear, actionable recommendations for their improvement and implementation in healthcare, as discussed in the “Discussion” section.

Characteristics of the ICD system

Continuous updating

The ICD, established by the World Health Organization (WHO), is a classification of diseases that can be defined as a system of categories to which diagnoses of diseases and other health issues are assigned according to established criteria.¹⁰ Introduced in 1976, ICD-9 included nearly 5000 categories, while ICD-10, introduced in 1995, expanded to about 8000. The ICD-11 further increased granularity to over 50,000 categories to meet the growing demands of morbidity data analysis. In practice, coding often requires greater specificity, extending to detailed levels in Clinical Modifications for billing and reimbursement purposes.^11,12

The latest update, ICD-11 for Mortality and Morbidity Statistics (ICD-11-MMS), was rolled out in early 2022. This version introduced significant changes, such as a new chapter structure, new diagnostic categories, and revised diagnostic criteria. A key innovation in ICD-11 is the “Foundation Component,” a semantic network that builds a comprehensive polyhierarchy of medical concepts.¹³

Structure and principles

The ICD taxonomy is organized into a hierarchical structure with parent–child nodes, showing clear levels of inheritance and distinct layers. The ICD system does not aim to mirror the real world directly but instead categorizes diseases based on necessary and sufficient conditions to establish mutually exclusive disease category classifications.¹⁴

The principles of ICD classification are manually defined. For example, pregnancy and perinatal conditions are prioritized and categorized into the “special groups” chapters. Sibling nodes at the same level in the ICD taxonomy are organized by axes, which include etiology, pathology, anatomical location, and clinical manifestations. Conditions that cannot be classified according to these axes are categorized as “other” conditions, including rare conditions and “unspecified” cases. Additionally, the system supports co-occurrence codes, which indicate combinations of conditions such as tumor morphology, pathogens, or causes of injury.¹⁰

Insufficient definitions

The definitions provided in Volume 1 of the ICD Regulations regarding nomenclature offer those working with statistics a clear explanation of what is included and excluded in the categories, subcategories, and tabulation list items in statistical tables. However, these definitions often lack enough semantic depth and context, making classification difficult and confusing. Therefore, there is a need for additional external knowledge and guidelines.

To accommodate local practices and meet country-specific reporting requirements, many nations and regions have developed modified versions of the ICD system.¹⁵ For example, the United States uses ICD-10-CM, Canada employs ICD-10 Canadian Modification (ICD-10-CA), Germany relies on ICD-10 German Modification (ICD-10-GM), and Australia adopts ICD-10 Australian Modification (ICD-10-AM). In China, the ICD-10-CM has been under development and in use since 2017 to meet local categorization needs.

Characteristics of the automated ICD coding task

Input: Clinical notes

Early research in automated coding mainly concentrated on classifying term excerpts by extracting “diagnosis descriptions.”^16,17 However, the brevity of these descriptions often resulted in a lack of comprehensive contextual information, leading to suboptimal outcomes. With advancements in NLP technologies, models have been trained on large public Electronic Health Records (EHRs) datasets. These include the MIMIC,¹⁸ UKLarge and UKSmall,¹⁹ CLEF dataset,²⁰ and CodiEsp dataset,²¹ among others. The MIMIC dataset is the most widely used in published research, supporting numerous state-of-the-art (SOTA) advancements in automated ICD coding. Consequently, studies utilizing this dataset are particularly significant for comparing and furthering field research.

Clinical notes consist of free text with a large amount of professional medical terminology and noisy elements such as non-standard synonyms and misspellings. These documents are usually extensive, covering a wide range of clinical information, including health profiles, laboratory test results, radiology reports, operative notes, and medication records. As a result, they tend to be lengthy, with an average of 1609 words in MIMIC-III and 1151 words in MIMIC-II.

Output: ICD codes

All public datasets exhibit the following common characteristics in the distribution of ICD data. First, the label and feature spaces in ICD classification are extremely large. For instance, the ICD-9-CM and ICD-10-CM coding systems contain over 14,000 and 70,000 codes, respectively. This extensive label space complicates prediction, leading many models to focus mainly on classifying the top 50 or 100 codes. Second, the distribution of ICD codes shows a long-tail pattern: a few codes, like those related to respiratory infections and coughs, appear very frequently in Electronic Medical Records (EMRs), while most codes are rarely used.²²

We analyzed the distribution of diagnosis and procedure codes in the MIMIC-III dataset. The results are shown in Figures 1 and 2, respectively. In the MIMIC-III dataset, 21.8% of the diagnostic codes appear only once, and about 4201 labels appear between one and ten times. Similarly, 20.1% of the procedure codes appear only once, and about 1200 labels appear between one and ten times. More seriously, over 50% of the diagnostic and procedure codes, approximately 17,000, never appear in the dataset. These characteristics make Extreme Multi-label Text Classification (XMTC) techniques highly suitable for managing the large-scale label spaces in ICD datasets, as research has demonstrated their effectiveness.^23,24

Figure 1.

Distribution of diagnosis labels in the Medical Information Mart for Intensive Care III (MIMIC-III) dataset.

Figure 2.

Distribution of procedure codes in the Medical Information Mart for Intensive Care III (MIMIC-III) dataset.
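The long-tail statistics reported above for MIMIC-III (codes appearing once, codes appearing one to ten times, codes never appearing) can be derived from simple frequency counts. The following is a minimal sketch on toy data, not the procedure used in this review; the function name, the toy admissions, and the choice to compute the "appearing once" share over observed codes are all assumptions made for illustration.

```python
from collections import Counter


def label_frequency_summary(admission_codes, code_vocabulary):
    """Summarise how often each code in a vocabulary is assigned.

    admission_codes: list of code lists, one list per admission.
    code_vocabulary: every code defined by the coding system.
    """
    # Count each code at most once per admission.
    counts = Counter(code for codes in admission_codes for code in set(codes))
    observed = [c for c in code_vocabulary if counts[c] > 0]
    once = [c for c in observed if counts[c] == 1]
    one_to_ten = [c for c in observed if counts[c] <= 10]
    unseen = [c for c in code_vocabulary if counts[c] == 0]
    return {
        # Assumption: this share is taken over codes that occur at least once.
        "share_appearing_once": len(once) / len(observed),
        "n_appearing_1_to_10": len(one_to_ten),
        "share_never_appearing": len(unseen) / len(code_vocabulary),
    }


# Toy data for illustration only; these are not MIMIC-III records.
toy_admissions = [["401.9", "428.0"], ["401.9"], ["401.9", "250.00"], ["428.0", "584.9"]]
toy_vocabulary = ["401.9", "428.0", "250.00", "584.9", "V27.0", "038.9"]
summary = label_frequency_summary(toy_admissions, toy_vocabulary)
```

In this toy example, half of the observed codes occur only once and a third of the vocabulary never occurs, which mirrors on a small scale the sparsity that motivates XMTC techniques.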

Knowledge-based model

Task models for ICD classification require not only the ICD standards and disease information but also patient characteristics, anatomical sites, etiology, pathology, diagnostics, prognosis, and the classification rules based on these factors. Although biomedical knowledge sources like ICD ontologies, Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), and coding standards offer expert-curated insights, the lack of reliable methods to represent and integrate this knowledge has limited their practical usefulness. Developing computational techniques to represent and interpret these standards is essential for accurately guiding models to assign ICD codes in real-world applications.²⁵

Methods

Eligibility criteria

This study presents a systematic review of automated ICD coding models developed using the MIMIC dataset. This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).²⁶ The inclusion and exclusion criteria in this review are detailed in Appendix 1.

Information source

The studies were sourced from a range of high-quality academic platforms, including PubMed, ScienceDirect, IEEE Xplore, arXiv, and SpringerLink. Additional sources included the Association for Computing Machinery (ACM) Digital Library, as well as conference proceedings from the Association for Computational Linguistics (ACL) Anthology, the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, and the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). The review protocol was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO), hosted by the National Institute for Health Research (NIHR), to avoid duplication and enhance credibility. The registration was completed before the formal literature search, with the registration number: [CRD42023457388]. The protocol specifies the review objectives, inclusion and exclusion criteria, data sources, and analytical tools to guarantee that the study selection and synthesis follow a planned, systematic methodology.

Search strategy

To construct the search query, keywords within each conceptual group were combined using the OR operator, while different conceptual groups were linked using the AND operator. The specific keywords used for the search are listed in Appendix 2.
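The OR-within-group, AND-between-groups rule described above can be sketched as follows. The keyword groups shown are hypothetical placeholders for illustration; the actual search terms are those listed in Appendix 2, and the exact query syntax differs between databases.

```python
def build_query(concept_groups):
    """Join synonyms with OR inside a group, and join groups with AND."""
    clauses = []
    for terms in concept_groups:
        # Quote multi-word terms so they are searched as phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical keyword groups, not the review's actual search terms.
example_groups = [
    ["ICD coding", "clinical coding"],
    ["MIMIC"],
    ["deep learning", "machine learning"],
]
query = build_query(example_groups)
```

Grouping synonyms with OR broadens recall within a concept, while AND across groups restricts results to records that touch every concept.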

Selection process

The retrieved publications were stored in Zotero 6 (Corporation for Digital Scholarship) reference management software. Duplicates were identified and removed using the software's tools, supplemented by manual deletions. After deduplication, 3098 records remained. These were screened by title and abstract to assess their relevance, narrowing the selection to 262 publications. In the second stage, full texts were retrieved automatically or via library access. The third stage involved a detailed review of each full text, focusing on the methodology and experimental sections to ensure the inclusion criteria were met. As a result, 162 papers were excluded, and 73 papers published between 2014 and 2024 were included in the final review. All 73 papers explicitly used the MIMIC dataset to develop and evaluate automated ICD coding models. The detailed study selection pathway is illustrated in Figure 3 (PRISMA flow diagram).

Figure 3.

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram for the review.

Data collection, extraction, and synthesis methods

The key variables were collected on the following aspects: Data handling and preprocessing (specific methods and categories), knowledge integration (detailed sources and categories), modeling paradigms (specific approaches and categories), evaluation metrics (e.g., F1 score, Area Under the Curve (AUC), and Precision), and model explainability (specific interpretability mechanisms).
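For readers less familiar with the extracted metrics, the sketch below illustrates micro-averaged F1, macro-averaged F1, and precision at k (P@k) on toy predictions. It is an illustrative assumption rather than the evaluation code of any included study: macro-F1 is computed here as the mean of per-label F1 scores, which is one common convention, and individual papers may define it differently.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """y_true and y_pred are lists of code sets, one set per document."""
    per_label = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
        fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
        fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
        per_label.append(f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    micro = f1(total_tp, total_fp, total_fn)   # pools counts over all labels
    macro = sum(per_label) / len(per_label)    # every label weighs the same
    return micro, macro


def precision_at_k(true_codes, ranked_codes, k):
    """Fraction of the k highest-ranked codes that are correct."""
    return sum(code in true_codes for code in ranked_codes[:k]) / k


# Toy example: label "A" is frequent and easy, "B" and "C" are rare and missed.
toy_true = [{"A", "B"}, {"A"}, {"A", "C"}]
toy_pred = [{"A"}, {"A"}, {"A", "B"}]
micro, macro = micro_macro_f1(toy_true, toy_pred, ["A", "B", "C"])
p_at_2 = precision_at_k({"A", "C"}, ["A", "B", "C"], 2)
```

In this toy case the micro score (about 0.67) is twice the macro score (about 0.33), because micro-averaging is dominated by the frequent label while macro-averaging exposes the failures on rare labels.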

After data collection, organization, and verification, this review classifies and summarizes the included studies by categories and conducts descriptive statistical analyses. For model evaluation, it performs comparative analyses based on publication year and modeling paradigm. Studies that employ the same dataset division method as Mullenbach et al.²⁷ are included in the comparative analysis of evaluation metrics. If a study reports multiple outcomes, the highest-performing result is selected. The analysis examines two perspectives: Micro-F1 and a composite metric. The composite metric for each model is calculated using a weighted sum of several evaluation metrics: AUC-macro (0.20), AUC-micro (0.15), F1-macro (0.20), F1-micro (0.15), P@5 (0.10), P@8 (0.10), and P@15 (0.10). For missing metric values, a penalty is applied by replacing the missing value with the lowest observed value for that metric across all records. All statistical analyses and visualizations are performed using Python.
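The composite metric described above can be expressed as a short computation. The sketch below uses the stated weights and the stated rule for missing values (replace with the lowest observed value for that metric); the function name, dictionary keys, and toy records are assumptions made for illustration and are not taken from the review's own analysis scripts.

```python
# Weights as stated in the text; they sum to 1.00.
WEIGHTS = {
    "auc_macro": 0.20, "auc_micro": 0.15,
    "f1_macro": 0.20, "f1_micro": 0.15,
    "p_at_5": 0.10, "p_at_8": 0.10, "p_at_15": 0.10,
}


def composite_scores(records):
    """records maps a model name to a dict of metric values (None if missing)."""
    # Lowest observed value per metric, used as the penalty for missing values.
    floors = {}
    for metric in WEIGHTS:
        observed = [r[metric] for r in records.values() if r.get(metric) is not None]
        # Assumption: fall back to 0.0 if no study reports the metric at all.
        floors[metric] = min(observed) if observed else 0.0

    scores = {}
    for model, metrics in records.items():
        total = 0.0
        for metric, weight in WEIGHTS.items():
            value = metrics.get(metric)
            if value is None:
                value = floors[metric]
            total += weight * value
        scores[model] = total
    return scores


# Made-up values for illustration only; not results from any included study.
toy_records = {
    "model_a": {"auc_macro": 0.5, "auc_micro": 0.5, "f1_macro": 0.5, "f1_micro": 0.5,
                "p_at_5": None, "p_at_8": 0.5, "p_at_15": 0.5},
    "model_b": {"auc_macro": 0.8, "auc_micro": 0.8, "f1_macro": 0.8, "f1_micro": 0.8,
                "p_at_5": 0.6, "p_at_8": 0.8, "p_at_15": 0.8},
}
scores = composite_scores(toy_records)
```

Under this rule, a model that omits a metric is treated as no better than the weakest model that did report it, so incomplete reporting cannot inflate a composite score.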

Quality assessment and risk of bias

Two independent reviewers meticulously collected data for this review to ensure accuracy and reliability. The reviewers directly contacted study investigators to clarify data points and verify methodological details. Regular consensus meetings were held to resolve discrepancies and ensure consistency in data interpretation.

A quality assessment was conducted to verify that the selected studies aligned with the review's objectives. A checklist of ten closed-ended questions was created, and each publication was required to score at least 7 out of 10 to be included in the final analysis. The checklist items are listed below.

Q1. Is the research objective clearly stated and well-defined?

Q2. Is the methodology clearly described?

Q3. Is the MIMIC dataset identified and appropriately used?

Q4. Are the data preprocessing or input preparation steps clearly described and justified?

Q5. Are the input representations (e.g., features, embeddings, or encodings) clearly described?

Q6. Are the feature extraction and engineering methods clearly described?

Q7. Are the classifiers used in the study clearly described?

Q8. Does the study provide a clear and structured comparison with existing baseline models?

Q9. Is the system's performance evaluated, and are the results properly interpreted and discussed?

Q10. Does the conclusion reflect the research findings?

Results

Overview of included papers

Since their introduction in the 1990s,²⁸ automated ICD coding methods have significantly improved. The 73 papers included in the review range from 2014 to 2024. Table 1 shows the publication year, paper title, and model name for each study. Of these, 43 achieved SOTA results. Studies marked with an asterisk (*) are gray literature, such as preprints. Risk of bias was evaluated with a 10-item checklist, and all included studies scored at least 7 points. Detailed scores are available in Appendix 3.

Table 1.

Summary of the included papers.

Paper ID	Authors, year	Paper title	Model	SOTA
2014-001	Perotte, et al., (2014)¹⁶	Diagnosis code assignment: Models and evaluation metrics	Hierarchy-based SVM
2016-001	Ayyar, et al., (2016)²⁹	Tagging patient notes with ICD-9 codes	ICD-9 Tagger model
2017-001	Prakash, et al., (2017)³⁰	Condensed memory networks for clinical diagnostic inferencing	C-MemNNs
2017-002	Baumel, et al., (2017)³¹*	Multi-label classification of patient notes a case study on ICD code assignment	HA-GRU	√
2017-003	Berndorfer and Henriksson, (2017)³²	Automated diagnosis coding with combined text representations	—
2018-001	Mullenbach, et al., (2018)²⁷	Explainable prediction of medical codes from clinical text	DR-CAML	√
2018-002	Xie, et al., (2018)³³	A neural architecture for automated ICD coding	—
2018-003	Rios, et al., (2018)³⁴	Few-shot and zero-shot multi-label learning for structured label spaces	ZAGCNN
2018-004	Samonte, et al., (2018)³⁵	ICD-9 tagging of clinical notes using topical word embedding	EnHANs
2018-005	Catling, et al., (2018)³⁶	Towards automated clinical coding	GRU(X)––GRU(Z)
2019-001	Huang, et al., (2019)³⁷	An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes	—
2019-002	Xie, et al., (2019)³⁸	EHR coding with multi-scale feature attention and structured knowledge graph propagation	MSATT-KG	√
2019-003	Bai and Vucetic, (2019)³⁹	Improving medical code prediction from clinical text via incorporating online knowledge sources	KSI	√
2019-004	Falis, et al., (2019)⁴⁰	Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text	—	√
2019-005	Li, et al., (2019)⁴¹	Automated ICD-9 coding via A deep learning approach	DeepLabeler	√
2019-006	Zeng, et al., (2019)⁴²	Automatic ICD-9 coding via deep transfer learning	—	√
2019-007	Schäfer and Friedrich, (2019)⁴³	UMLS mapping and word embeddings for ICD code assignment using the MIMIC-III intensive care database	eFastText-UMLS
2019-008	Xu, et al., (2019)⁴⁴	Multimodal machine learning for automated ICD coding	—	√
2019-009	Du, et al., (2019)⁴⁵	ML-Net: multi-label classification of biomedical texts with deep neural networks	ML-Net
2020-001	Vu, et al., (2020)⁴⁶	A label attention model for ICD coding from clinical text	JointLAAT	√
2020-003	Sonabend, et al., (2020)⁴⁷	Automated ICD coding via unsupervised knowledge integration (UNITE)	UNITE
2020-004	Cao, et al., (2020)⁴⁸	HyperCore: Hyperbolic and co-graph representation for automatic ICD coding	HyperCore	√
2020-005	Wang, et al., (2020)⁴⁹	Coding electronic health records with adversarial reinforcement path generation	RPGNet	√
2020-006	Li and Yu, (2020)⁵⁰	ICD coding from clinical text using multi-filter residual convolutional neural network	MultiResCNN	√
2020-008	Guo, et al., (2020)⁵¹	A disease inference method based on symptom extraction and bidirectional long short term memory networks	—
2020-009	Mascio, et al., (2020)⁵²	Comparative analysis of text classification approaches in electronic health records	—
2020-010	Teng, et al., (2020)⁵³	Explainable prediction of medical codes with knowledge graphs	G_Code	√
2020-011	Ji, et al., (2020)⁵⁴	Dilated convolutional attention network for medical code assignment from clinical text	DCAN	√
2020-012	Hsu, et al., (2020)⁵⁵	Multi-label classification of ICD coding using deep learning	—
2021-001	Feucht, et al., (2021)⁵⁶	Description-based label attention classifier for explainable ICD-9 classification	DLAC
2021-002	Zhou, et al., (2021)⁵⁷	Automatic ICD coding via interactive shared representation networks with self-distillation mechanism	ISD	√
2021-003	Rajendran, et al., (2021)⁵⁸	Embed wisely: An ensemble approach to predict ICD coding	—	√
2021-004	Song, et al., (2021)⁵⁹	Generalized zero-shot text classification for ICD coding	AGM-HT	√
2021-005	Wang, et al., (2021)⁶⁰*	Few-shot electronic health record coding through graph contrastive learning	CoGraph model	√
2021-006	Tsai, et al., (2021)⁶¹	Modeling diagnostic label correlation for automatic ICD coding	—
2021-007	Pascual, et al., (2021)⁶²	Towards BERT-based automatic ICD coding: Limitations and opportunities	BERT-ICD
2021-008	Liu, et al., (2021)⁶³	Effective convolutional attention network for multi-label clinical document classification	EffectiveCAN	√
2021-009	Luo, et al., (2021)⁶⁴	Fusion: towards automated ICD coding via feature compression	Fusion	√
2021-010	Heo, et al., (2021)⁶⁵	Medical code prediction from discharge summary: document to sequence BERT using sequence attention	—	√
2021-011	Kim and Ganapathi, (2021)⁶⁶	Read, attend, and code: Pushing the limits of medical codes prediction from clinical notes by machines	RAC	√
2021-012	Bao, et al., (2021)⁶⁷	Medical code prediction via capsule networks and ICD knowledge	BiCapsNetLE	√
2021-013	Biswas, et al., (2021)⁶⁸	TransICD: Transformer based code-wise attention model for explainable icd coding	TransICD
2021-014	Dong, et al., (2021)⁶⁹	Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation	HLAN	√
2021-015	Ji, et al., (2021)⁷⁰	Does the magic of BERT apply to medical code assignment? A quantitative study	BERT-hier + LAN
2021-016	Mayya, et al., (2021)⁷¹	Multi-channel, convolutional attention based neural model for automated diagnostic coding of unstructured patient discharge summaries	Multi-channel CAML	√
2021-018	Li, et al., (2021)⁷²	JLAN: medical code prediction via joint learning attention networks and denoising mechanism	JLAN	√
2022-001	Yuan, et al., (2022)⁷³	Code synonyms do matter: multiple synonyms matching network for automatic ICD coding	MSMN	√
2022-002	Michalopoulos, et al., (2022)⁷⁴	ICDBigBird: A contextual embedding model for ICD code classification	ICDBigBird model	√
2022-003	DeYoung, et al., (2022)⁷⁵*	Entity anchored ICD coding	—
2022-004	Huang, et al., (2022)⁷⁶	PLM-ICD: Automatic ICD coding with pretrained language models	PLM-ICD	√
2022-005	Wang, et al., (2022)⁷⁷	A novel framework based on medical concept driven attention for explainable medical code prediction via external knowledge	MCDA	√
2022-006	Yang, et al., (2022)⁷⁸	Knowledge injected prompt based fine-tuning for multi-label few-shot ICD coding	KEPTLongformer	√
2022-007	Liu, et al., (2022)⁷⁹	Hierarchical label-wise attention transformer model for explainable ICD coding	HiLAT + ClinicalplusXLNet	√
2022-008	Falis, et al., (2022)⁸⁰	Horses to Zebras: ontology-guided data augmentation and synthesis for ICD-9 coding	CAML
2022-009	Liu, et al., (2022)⁸¹	TreeMAN: Tree-enhanced multimodal attention network for ICD coding	TreeMAN	√
2023-001	Chen, et al., (2023)⁸²	Rare codes count: mining inter-code relations for long-tail clinical text classification	—	√
2023-002	Nguyen, et al., (2023)⁸³	A two-stage decoder for efficient ICD coding	—	√
2023-003	Yang, et al., (2023)⁸⁴	Multi-label few-shot ICD coding as autoregressive generation with prompt	Gpsoap	√
2023-004	Ng, et al., (2023)⁸⁵	Modelling temporal document sequences for clinical ICD coding	HTDS	√
2023-005	Niu, et al., (2023)⁸⁶	Retrieve and rerank for automated ICD coding via contrastive learning	FLASH-Framework	√
2023-006	Yang, et al., (2023)⁸⁷	Intriguing effect of the correlation prior on ICD-9 code assignment	—
2023-007	Liu, et al., (2023)⁸⁸	Automated ICD coding using extreme multi-label long text transformer-based models	XR-LAT	√
2023-008	Kang, et al., (2023)⁸⁹	Automatic ICD coding based on segmented ClinicalBERT with hierarchical tree structure learning	SCB-T	√
2023-009	Jin, et al., (2023)⁹⁰	Learning from undercoded clinical records for automated ICD coding	—
2023-010	Mou, et al., (2023)⁹¹	Automated ICD coding based on neural machine translation	RAANMT
2023-011	Li, et al., (2023)⁹²	Towards automatic ICD coding via knowledge enhanced multi-task learning	KEMTL	√
2024-001	Luo, et al., (2024)⁹³	CoRelation: Boosting automatic ICD coding through contextualized code relation learning	CoRelation	√
2024-002	Lu, et al., (2024)⁹⁴	Towards semi-structured automatic ICD coding via tree-based contrastive learning	CM
2024-003	Williamson, et al., (2024)⁹⁵	Low resource ICD coding of hospital discharge summaries	—
2024-004	Caralt, et al., (2024)⁹⁶	Continuous predictive modeling of clinical notes and ICD codes in patient health records	LAHST
2024-005	Wang, et al., (2024)⁹⁷	ICDXML: enhancing ICD coding with probabilistic label trees and dynamic semantic representations	ICDXML
2024-006	Wang, et al., (2024)⁹⁸	Multi-stage retrieve and re-rank model for automatic medical coding recommendation	—	√
2024-007	Goldstein, et al., (2024)⁹⁹	Towards understanding attention-based reasoning through graph structures in medical codes classification	GCN_EHR

ICD: International Classification of Diseases; GCN: graph convolutional network; EHR: electronic health records; SOTA: state-of-the-art; SVM: support vector machine; CNN: convolutional neural network; UMLS: Unified Medical Language System; BERT: bidirectional encoder representations from transformers.

— indicates that the study used a model but did not specify its name.

* indicates that the study is a preprint paper.

Data handling and preprocessing

Dataset division strategies

Data division is an essential methodological element in automated clinical coding research. Among the 73 reviewed studies, 91.78% (67 papers) explicitly detail how the MIMIC dataset was divided into training, validation, and test sets. Of these, 58.21% (39 papers) follow the split strategy proposed by Mullenbach et al.²⁷ Under this strategy, the MIMIC-III full-code dataset is divided into 47,719/1632/3372 (train/validation/test), while the top-50 codes dataset is divided into 8067/1574/1730. Papers that used the MIMIC-II full dataset employed a 20,533/2282 (train/test) split.

However, some studies only report proportional splits or omit data partitioning details. As the field matures, clear and consistent reporting of dataset division strategies remains crucial for ensuring reproducibility and enabling reliable comparisons across studies. Appendix 4 provides detailed data split information for each included publication.

Text preprocessing techniques

Clinical narratives are often noisy and sparse, and they contain many misspellings, non-standard synonyms, and grammatical errors. Various pre-processing methods have been developed to tackle these challenges,¹⁰⁰ including tokenization, converting to lowercase, removing stop words, sentence segmentation, expanding abbreviations, spelling correction, and lemmatization. Most research on text representation has primarily used models like Word to Vector (Word2Vec), Convolutional Neural Network (CNN), and Bidirectional Long Short-Term Memory (Bi-LSTM). However, recent developments in graph models and large language models have led to more advanced representations being used in this task, such as Clinical Bidirectional Encoder Representations from Transformers (ClinicalBERT),^87,89 PubMedBERT,⁶² BigBird,⁷⁴ and ClinicalLongformer.⁹⁷
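Several of the pre-processing steps listed above (lowercasing, tokenization, abbreviation expansion, and stop-word removal) can be chained in a few lines. The sketch below is illustrative only: the stop-word list and abbreviation dictionary are small hypothetical examples rather than resources used by any included study, and spelling correction and lemmatization are omitted because they typically rely on external tools.

```python
import re

# Illustrative stop words; real pipelines use much larger curated lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "was", "with", "for", "in", "on"}
# Hypothetical abbreviation dictionary for demonstration purposes.
ABBREVIATIONS = {"htn": "hypertension", "chf": "congestive heart failure"}


def preprocess(note):
    """Lowercase, tokenize, expand abbreviations, and drop stop words."""
    text = note.lower()
    # Keep alphanumeric runs as tokens; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9]+", text)
    expanded = []
    for token in tokens:
        # An expansion may span several words, so split it into tokens.
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return [token for token in expanded if token not in STOP_WORDS]


cleaned = preprocess("Pt admitted with CHF and HTN.")
```

Here the note is reduced to `['pt', 'admitted', 'congestive', 'heart', 'failure', 'hypertension']`. Note that the order of steps matters: expanding abbreviations before removing stop words ensures that any stop words introduced by an expansion are also filtered.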

Besides basic text representation, many studies have highlighted the importance of including hierarchical structures in clinical text representations. The HA-GRU model³¹ used paragraphs as an additional representation layer for input text, considering the strong hierarchical structure of discharge summaries. Similarly, the EnHANs³⁵ and DeepLabeler⁴¹ models established word, sentence, or document-level representation layers, creating a hierarchical representation of clinical narratives.

Given the length and complexity of clinical notes, several approaches have been proposed to divide the text into chunks, thus improving model training and prediction efficiency. Models like Hierarchical BERT,⁷⁰ HiLAT,⁷⁹ and SCB-T⁸⁹ create text chunks based on token length, while more advanced methods segment clinical notes based on semantic content or semi-structured medical information. For instance, the CM model⁹⁴ introduces the DF-IAPF algorithm, which automatically segments clinical notes based on their semi-structured characteristics, thereby reducing data variability. Additionally, the LAHST model⁹⁶ organizes and segments input data using clinical note timestamps to preserve temporal order in the data.
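Token-length chunking, the simplest of the segmentation strategies mentioned above, can be sketched as follows. This is a generic illustration, not the implementation of Hierarchical BERT, HiLAT, or SCB-T; the function name and the optional `overlap` parameter are assumptions introduced here.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into consecutive chunks of at most chunk_size tokens.

    overlap is the number of tokens shared between neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for the tokens of a long clinical note.
toy_tokens = list(range(10))
plain_chunks = chunk_tokens(toy_tokens, chunk_size=4)
overlapping_chunks = chunk_tokens(toy_tokens, chunk_size=4, overlap=1)
```

Fixed-length chunks keep each input within an encoder's length limit but can cut through sentences or sections, which is the limitation that semantic and semi-structured segmentation methods aim to address.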

To further address the challenges posed by complex medical texts, more advanced data augmentation techniques have been introduced to improve model performance. The CAML model⁸⁰ employs tools like SemEHR¹⁰¹ and MedCAT¹⁰² for data augmentation and synthesis, while the LAHST model⁹⁶ utilizes the Extended Context Algorithm (ECA) to enrich datasets by providing more context, thereby improving model predictions.

Knowledge integration sources

The task of automated clinical coding is inherently knowledge-guided; in the analyzed articles, 69.57% (48 papers) incorporated knowledge in automated clinical coding. Following Hu et al.,¹⁰³ we divide the knowledge designed for ICD coding tasks into three types: Text knowledge, knowledge graph, and rule knowledge. The sources, classification, and application methods of each type are detailed in Appendix 5. Table 2 provides statistics on the types of knowledge integrated into different studies. The statistical results clearly show that research on knowledge integration has steadily increased since 2019. The specific applications of each type of knowledge are described as follows.

Table 2.

Statistics on types of knowledge integrated in the included papers.

Publication year	Knowledge graph: Entity knowledge	Knowledge graph: Triplet knowledge	Text knowledge	Rule knowledge	Total
2014	0	1	0	0	1 (1.49%)
2017	0	1	1	0	2 (2.98%)
2018	3	2	0	0	5 (7.46%)
2019	5	2	1	0	8 (11.94%)
2020	2	3	1	0	6 (8.96%)
2021	5	3	3	0	11 (16.42%)
2022	5	1	3	1	10 (14.93%)
2023	2	6	5	0	13 (19.40%)
2024	4	5	2	0	11 (16.42%)
Total	26 (35.82%)	24 (38.81%)	16 (23.88%)	1 (1.49%)	67

Knowledge graphs

Entity Knowledge

The literature focusing solely on Entity Knowledge resources accounts for 35.82% and remains a consistently popular approach. This method depends entirely on the textual descriptive information of ICD codes, which includes labels, definitions, synonyms, and terminology. Common practices include using the ICD label information released by the WHO^{27,34,56,66,67,87–89,94} and leveraging other terminology libraries, such as the Unified Medical Language System (UMLS),^{43,51,73,92,93,95,99} Medical Subject Headings,^{42,97} and PyMedTermino.⁸⁰ Some studies employ entity linking techniques^{75,80} to enrich the descriptions of ICD codes. Early research used models such as Word2Vec, CNN, and Bi-LSTM to generate ICD description representations, while recent studies have utilized pre-trained models such as ClinicalBERT,^{87,89} PubMedBERT,⁸⁵ and RoBERTa^{88,96} to obtain ICD description embeddings.

Triplet Knowledge

The use of Triplet Knowledge resources accounts for 38.81% in this review. Early research primarily focused on hierarchical relationships, capturing only “parent-child” connections and failing to represent more complex interactions, such as mutual exclusion or weak links between different code families. The SCB-T model⁸⁹ introduces the Hierarchical Information Transmission module, which uses Gated Recurrent Unit (GRU) cells to capture and utilize hierarchical relationships between ICD codes, enhancing prediction accuracy, particularly for rare codes. However, this performance improvement comes at the cost of higher model complexity and more computation.

With technological advancements and a deeper understanding of the task, recent studies have expanded to incorporate co-occurrence relations derived from datasets such as MIMIC,^{57,61,93} ICD ontologies,^{48,49} and UMLS.^{92,99} Wang et al.⁹⁸ investigated external auxiliary knowledge from EHR data, including Diagnosis-Related Group (DRG) and Current Procedural Terminology (CPT) codes, by combining it with co-occurrence relations of ICD labels. While these relations show whether two codes appear together in the training data, they do not specify the type or nature of the relationships. The CGN_EHR study⁹⁹ indicates that external knowledge graphs such as UMLS do not align with the specific coding rules of ICD systems, so integrating them decreased performance compared to baseline models.
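To make the notion of a co-occurrence relation concrete, the sketch below counts how often each pair of codes is assigned to the same record. It is an illustration under simple assumptions rather than the procedure of any cited study; the function name `cooccurrence_counts` and the placeholder labels `A`, `B`, and `C` are hypothetical and do not correspond to real ICD codes.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def cooccurrence_counts(label_sets: Iterable[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count, for every unordered code pair, the records in which both appear."""
    counts: Counter = Counter()
    for labels in label_sets:
        # Sort and deduplicate so each pair has one canonical key per record.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Placeholder label sets for three records (hypothetical, not real ICD codes).
records = [["A", "B"], ["A", "B", "C"], ["C"]]
pair_counts = cooccurrence_counts(records)
print(pair_counts)
```

The resulting counts only record that two codes were assigned together; as noted above, they carry no information about why, which is the limitation of purely statistical relations.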

Text-based knowledge

This category of methods focuses on enhancing the representation learning of ICD-related knowledge by leveraging text data sources such as medical domain texts (e.g., Wikipedia documents about diagnoses),^{30,39,47,60,77} and de-identified medical records from datasets like the US Veterans Health Administration Corporate Data Warehouse.⁸⁴ Other sources include the Partners HealthCare Biobank,⁴⁷ Biomedical Semantic Indexing and Question Answering, Hallmarks of Cancers biomedical literature,⁴⁵ and MIMIC data.^58,77,79,97 With advancements in large model technology, these approaches have become increasingly popular since 2021, accounting for 23.88% of the methods used. Most studies integrate medical text knowledge by directly employing pre-trained language models (PLMs), conducting separate pre-training stages, or incorporating such knowledge into encoder architectures.

Rule-based knowledge

ICD coding rules are mainly based on standardized processes and guidelines, which are often formalized as sets of rules and terminologies used within the healthcare system. For example, the SNOMED CT to ICD-10 mapping rules primarily involve gender, patient age, acquired versus congenital conditions, poisonings, external causes, and dagger and asterisk codes.¹⁰⁴ In the research literature, the TreeMAN model⁸¹ extracts information such as patient physiological indicators, gender, admission type, and treatment events from the MIMIC dataset and trains decision trees on these structured features.

Gaps in medical knowledge utilization

Based on extensive experience and intuition in the medical field, we believe the following knowledge sources can be highly effective for this task: SNOMED CT contains a vast collection of concepts, relationships, and descriptions. By leveraging the sufficiency and necessity conditions defined in this terminology system and utilizing its existing OWL representation, the accuracy of ICD coding can be significantly enhanced.¹⁰⁵ Another valuable resource is the SNOMED CT to ICD-10-CM mapping,¹⁰⁴ which includes manually edited logical rules and mapping examples. These mappings capture the expertise of medical coding professionals, enabling the model to learn from human expertise and enhance coding accuracy.¹⁰⁶ Furthermore, ontology representations of ICD-10 or ICD-11 can provide valuable semantic knowledge for the coding task. The principles, such as inclusion, exclusion, and “code also,” encapsulate complex relationships and dependencies between ICD codes.¹³

Model frameworks

The typical architecture of ICD coding models includes four core layers: Input layer, representation layer, feature combination layer, and output layer. Larger models often have parameter counts over 100 million, whereas smaller models range from 10 to 30 million parameters. Further details about each study's model architecture, representation, feature layers, output strategies, training techniques, and parameter sizes can be found in Appendix 6.

i. Input layer

The input layer handles diverse data sources, including clinical narratives, multimodal clinical data, and structured domain knowledge (e.g., ICD ontologies). Input information is tokenized using various techniques tailored to the specific language expressions.

ii. Representation layer

This layer converts tokens into vectorized representations. Techniques include traditional embeddings such as Word2Vec¹⁰⁷ and FastText,¹⁰⁸ statistical methods such as Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words, and deep learning-based dynamic representations such as RNNs (LSTM, GRU) and CNNs. Recent studies increasingly incorporate contextualized embeddings from pre-trained models like BERT and its variants to better capture semantic nuances.
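As a concrete instance of the statistical representations mentioned above, the following sketch computes TF-IDF weights for a small tokenized corpus. It assumes the plain variant, with term frequency normalized by document length and an unsmoothed inverse document frequency of log(N / df); library implementations often use smoothed variants, so their values will differ. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


# Toy corpus (hypothetical note fragments).
corpus = [["fever", "cough"], ["fever", "rash"]]
vectors = tf_idf(corpus)
print(vectors)
```

A term that occurs in every document, such as "fever" here, receives a weight of zero, which is why TF-IDF emphasizes the terms that distinguish one note from another.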

iii. Feature combination layer

The feature combination layer creates more advanced architectures to capture multi-level and richer features. It uses methods like attention mechanisms, multi-scale feature extraction, and complex structures such as RNNs, CNNs, or Transformers. The aim is to ensure that the extracted features match the structural characteristics of clinical data and support the decision-making processes of coders.

iv. Output layer

The output layer converts learned features into ICD codes using classification, retrieval, and generation. Classification remains the most common method. Retrieval techniques offer greater flexibility by finding the most relevant codes. Generation-based methods dynamically create ICD codes.

While not inherently part of the model architecture, training strategies are crucial for the model's performance. Typically, models are optimized using objective functions like binary cross-entropy loss or ranking loss. Advanced techniques have been developed to tackle specific challenges. For example, the JLAN model⁷² includes a Truncation Loss function and a Dynamic Threshold Function to minimize noise impact in ICD coding. The CoRelation model⁹³ combines cross-entropy loss with a complexity penalty loss to simplify relationship reasoning and enhance the model's efficiency in handling complex ICD code relationships.
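The binary cross-entropy objective mentioned above treats each candidate code as an independent yes/no decision. The sketch below shows this standard loss for one document, together with the usual thresholding step that turns scores into a predicted code set. It is a generic illustration rather than the loss of any specific reviewed model; the function names and the default threshold of 0.5 are assumptions.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(logits: List[float], targets: List[int], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over all candidate codes of one document."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def predict(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Assign every code whose predicted probability reaches the threshold."""
    return [int(sigmoid(z) >= threshold) for z in logits]


# Two candidate codes with uninformative scores: the loss equals log(2).
print(binary_cross_entropy([0.0, 0.0], [1, 0]))
print(predict([2.0, -2.0]))
```

Because the loss averages over all codes, the many negative labels in a large label space dominate it, which is one reason the literature explores ranking losses and modified objectives.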

Modeling paradigms

With technological advancements, the modeling paradigms for ICD coding tasks have become increasingly diverse. Each study may adopt one or more paradigms. We recorded and organized those modeling paradigms explicitly stated in the reviewed studies (excluding inferred or implied paradigms) and classified them into five main categories: Deep Learning, Knowledge Representation and Reasoning, Information Retrieval, Machine Learning, and Generation. Details of each study's adopted paradigm(s), model architecture, representation and feature layers, output strategies, training methods, and parameter sizes are provided in Appendix 7, while the frequency of paradigms for the included studies is statistically summarized in Table 3.

Table 3.

Statistical summary of modeling paradigms for the included papers.

Modeling paradigm	Frequency	Subcategory	Frequency
Machine learning	12	Adversarial learning	3
		Decision trees	2
		Reinforcement learning	1
		Supervised learning	1
		SVM	3
		Unsupervised learning	2
Deep learning	131	Contrastive learning	4
		Few-shot learning	4
		Multitasking learning	2
		Self-distillation learning	1
		Self-supervised learning	1
		Transfer learning	2
		Attention mechanisms	43
		Capsule networks	1
		CNNs	21
		Recurrent neural networks (RNN, LSTM, GRU, etc.)	27
		Transformer models	6
		Pretrained language models (BERT, GPT, PubMedBERT, etc.)	19
Knowledge representation and reasoning	28	Knowledge graphs (GCN, GRN, etc.)	20
		Multimodal models	3
		Other knowledge representation	5
Generative models	5	Autoregressive models	1
		GAN	1
		Prompt engineering	2
		Sequence-to-sequence models (Seq2Seq)	1
Information retrieval	14	Matching and mapping	7
		Ranking algorithms	4
		Vector space models	3

CNN: convolutional neural network; GCN: graph convolutional network; SVM: support vector machine; LSTM: long short-term memory; BERT: bidirectional encoder representations from transformers; GAN: generative adversarial networks.

We observed significant differences in how the ICD coding task is defined across various studies. Although it is commonly described as a multi-label classification problem, some studies provide different interpretations of the task. For instance, Zeng et al.⁴² define it as an indexing task, mapping medical text to predefined ICD code indices. DeYoung et al.⁷⁵ and Ziletti et al.¹⁰⁹ treat it as ontology linking or entity normalization, linking text to concepts within the ICD ontology or normalizing it to ICD codes. Prakash et al.³⁰ and Guo et al.⁵¹ treat it as a disease inference task, identifying the patient's disease based on the provided information text. NMT models⁹¹ treat the task as a translation problem, converting diagnostic descriptions into the corresponding ICD codes. These distinctions indicate that researchers frequently lack a clear understanding of the difference between “disease classification” and “disease diagnosis,” which could undermine the validity and usefulness of the models.

Deep learning

Deep learning is the most commonly used approach in ICD coding models among the studies surveyed, appearing 131 times. It acts as a flexible computational framework that can be integrated into different methodologies. When developing deep learning models for ICD coding, several important challenges must be considered.

Firstly, considerable research has been done on model architecture. Some studies have developed hierarchical deep learning architectures, such as JointLAAT,⁴⁶ ZAGCNN,³⁴ HLAN,⁶⁹ the two-stage decoding model,⁸³ Hierarchical BERT,⁷⁰ and XR-LAT.⁸⁸ These models utilize attention mechanisms, label embeddings, or various trained structures like BERT, taking advantage of the ICD coding system's natural hierarchy to improve overall performance. While hierarchical models reflect human disease classification logic, they pose risks, such as higher-level errors propagating to later predictions. Other studies have constructed complex network architectures, such as capsule networks,⁶⁷ Multi-CNN,⁵³ Multi-Scale,^{38,73} and Transformers,^{57,68,85} to extract more features. However, these models often have too many parameters, require high computational resources, and are challenging to optimize and debug. Meanwhile, LSTM models^{73,83} have also achieved state-of-the-art (SOTA) results.

Secondly, attention mechanisms have evolved considerably over time. Initially, single-layer attention was used to pinpoint relevant keywords. This approach later advanced to hierarchical attention for a more organized text representation. More sophisticated mechanisms, including label-wise attention, parent-child label attention, and multi-head self-attention, have enhanced feature extraction by integrating ICD-specific traits and hierarchical structures. For example, the JLAN model⁷² employs a joint learning mechanism to combine self-attention and label attention, creating specialized representations for both high- and low-frequency labels. This reflects a shift from simple stacking to interactive designs, integrating task-specific focus with medical domain knowledge.
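The idea behind label-wise attention can be sketched in a few lines: each label has its own query vector, which scores every token representation, and the softmax-weighted sum of token vectors gives a document representation specific to that label. The code below is a generic, dependency-free illustration of this mechanism, not the architecture of JLAN or any other cited model; the names `label_wise_attention`, `hidden`, and `label_queries` are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Softmax with max-subtraction for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_wise_attention(hidden: List[Vector], label_queries: List[Vector]) -> List[Vector]:
    """Return one attention-pooled document vector per label.

    hidden: one vector per token; label_queries: one query vector per label.
    """
    dim = len(hidden[0])
    doc_vectors = []
    for query in label_queries:
        # Dot-product relevance of every token to this label.
        scores = [sum(q * h for q, h in zip(query, token)) for token in hidden]
        weights = softmax(scores)
        doc_vectors.append([
            sum(w * token[d] for w, token in zip(weights, hidden))
            for d in range(dim)
        ])
    return doc_vectors


# Two tokens with 2-dimensional states; the second label attends to token 0.
token_states = [[1.0, 0.0], [0.0, 1.0]]
queries = [[0.0, 0.0], [10.0, 0.0]]
print(label_wise_attention(token_states, queries))
```

The per-label attention weights are also what attention-based visualizations display when they highlight the words associated with a given code.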

Thirdly, recent studies have commonly employed PLMs, such as BioBERT,¹¹⁰ ClinicalBERT,¹¹¹ PubMedBERT,¹¹² and RoBERTa-PM,¹¹³ which utilize extensive scientific domain data to enhance semantic understanding and ICD coding accuracy. Ji et al.⁷⁰ suggest that pre-trained models, like BERT, do not necessarily improve the ICD coding results. The Gpsoap model⁸⁴ uses the SOAP structure to create pre-training tasks. While these models possess strong semantic capabilities and benefit from large corpora, they are complex and may exhibit limited gains. The XR-LAT model⁸⁸ emphasizes the importance of domain-specific context, as it was pre-trained on biomedical data using the BIGBIRD model.

Lastly, various deep learning techniques have been developed to address specific challenges in ICD coding. Few-shot learning methods, like AGM-HT⁵⁹ and Mining Inter-Code Relations,⁸² target issues related to sparse label prediction. Transfer learning, discussed in Li et al.'s study,⁴¹ utilizes knowledge from related areas, though its effectiveness depends on the correlation between source and target tasks. The ISD⁵⁷ model employs a self-distillation mechanism to minimize noise in input texts. Contrastive learning approaches have also been used to improve feature representations, including text-label contrastive learning in the FLASH framework,⁸⁶ graph contrastive learning in the CoGraph model,⁶⁰ and tree-based contrastive learning in the CM framework.⁹⁴ Multi-task learning methods, such as Yang et al.,⁸⁷ combine ICD with CPT and DRG code classification, highlighting the benefit of handling diagnostic and procedural classifications separately. The KEMTL⁹² model views these tasks as a multi-task learning problem, encompassing ICD coding, treatment recommendations, and mortality prediction.

Knowledge representation and reasoning

The knowledge representation and reasoning paradigm was mentioned 28 times in the surveyed studies. Early research focused on utilizing deep learning techniques such as Bi-LSTM, CNN, and Embeddings from Language Models for knowledge representation. With advances in knowledge graph representation and application technologies, many studies have aimed to develop structured and semantic ICD knowledge bases. For instance, GRU^{36,60,89} and Graph Convolutional Network (GCN)^{38,48,74,82,99} models have been applied to learn code representations on the ICD graph.
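For readers unfamiliar with GCNs, the sketch below implements a single propagation step of the widely used formulation, in which each node's features are averaged with those of its neighbours through the symmetrically normalized adjacency matrix with self-loops, then passed through a linear map and a ReLU. It is a toy, dependency-free illustration; the two-node graph, feature values, and function names are assumptions, and the reviewed models differ in their graph construction and layer design.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Compute D^(-1/2) (A + I) D^(-1/2), where D is the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One GCN step: ReLU(normalized adjacency @ features @ weights)."""
    propagated = matmul(normalized_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weights)]


# Toy graph: a parent code (node 0) linked to one child code (node 1).
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_features = [[1.0], [3.0]]
layer_weights = [[1.0]]
print(gcn_layer(adjacency, node_features, layer_weights))
```

After one step both nodes share the averaged feature, which illustrates how a rare code can borrow information from related codes in the hierarchy.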

The KEMTL model⁹² employs a Graph Attention Network that integrates the UMLS medical knowledge base to construct a heterogeneous textual graph. This approach effectively captures essential information within clinical texts and elucidates the semantic relationships between concepts. In comparison, the GCN_EHR model⁹⁹ established GCNs based on UMLS for this purpose, but it did not yield significant performance improvements. Furthermore, many studies underutilize ICD knowledge and neglect the broader semantic relationships among medical concepts. The complexity of incorporating knowledge graphs and the necessity for extensive hyperparameter tuning pose additional challenges for practical model training.

Several studies have investigated multimodal methods^58,81 that combine structured data with textual information. These approaches have shown that integrating structured data can significantly improve the performance of ICD coding. For instance, the ICDXML model⁹⁷ addresses the diverse nature of PLMs and domain knowledge using a Multi-modal Factorized Bilinear operation. This technique effectively incorporates domain knowledge while maintaining the original semantic information.

Other specific representation techniques have been developed to capture ICD code knowledge. For instance, Hyperbolic Geometry⁴⁸ learns continuous vector representations reflecting the ICD hierarchy, while the Path Generator⁴⁹ models ICD coding as a process of generating paths along the ICD tree. Wang et al.⁹⁸ utilize the BM25 algorithm to extract co-occurrence information of ICD labels and employ the Graphormer model to encode the semantic relationships between these labels. The CoRelation model⁹³ utilizes a Contextualized Code Relation Learning mechanism that dynamically captures the intricate relationships between ICD codes in the processed case context. This model performs exceptionally well on both the top-50 and full datasets, making it one of the best models in the Knowledge Representation and Reasoning paradigm.

Information retrieval

Information retrieval techniques for ICD coding have been utilized 14 times in the surveyed studies. These studies focus on transforming clinical texts and terminologies into vector spaces to improve the effectiveness of matching and mapping. In earlier research, Perotte et al.¹⁶ combined TF-IDF features with Support Vector Machine (SVM) classifiers for ICD coding. The C-MemNNs model³⁰ utilized stored information to assist in diagnostic retrieval, while Shi et al.³³ and Guo et al.⁵¹ implemented attention mechanisms to align diagnostic descriptions with ICD knowledge. The MSMN model⁵⁶ enhanced ICD code representation by leveraging synonyms through a multiple-synonym matching network. The KSI model³⁹ computed matching scores between clinical notes and external knowledge sources like Wikipedia. The UNITE model⁴⁷ utilized word vector techniques to create representations from both EMRs and online knowledge sources.

Ranking methods aim to optimize label ranking in multi-label classification problems. The MADE Reranker⁶¹ was the first to apply re-ranking methods to prioritize the selection of primary diagnoses and procedures. It adjusted coding sequences by estimating probabilities and leveraging label correlations. To tackle challenges such as a large label space and long-tail label distribution, Wang et al.⁹⁸ introduced a two-stage retrieval (utilizing auxiliary knowledge and BM25) and re-ranking phase (incorporating contrastive learning and label co-occurrence relationships), achieving a 6.40% improvement in F1-micro scores on the full dataset. Similarly, the FLASH framework⁸⁶ presents an innovative retrieval and re-ranking method, resulting in a 5.10% improvement in F1-micro scores on the full dataset. The two-stage retrieval and re-ranking model⁹⁸ and the FLASH framework⁸⁶ represent significant advancements in the category of retrieval methods.
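The retrieval stage of such retrieve-and-re-rank pipelines can be illustrated with a small BM25 scorer that ranks code descriptions against the tokens of a note and keeps only the top candidates for a later re-ranking model. This sketch uses the standard Okapi BM25 formula with common default parameters (`k1 = 1.5`, `b = 0.75`); it is not the pipeline of Wang et al.⁹⁸ or FLASH,⁸⁶ and the identifiers `CODE_A`, `CODE_B`, and `CODE_C` with their descriptions are hypothetical placeholders.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(query):
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = term_freq[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1.0) / norm
        scores[doc_id] = score
    return scores


def retrieve(query: List[str], docs: Dict[str, List[str]], top_k: int) -> List[str]:
    """Return the top_k document ids, i.e. the candidate set passed to a re-ranker."""
    scores = bm25_scores(query, docs)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_k]


# Hypothetical code descriptions; identifiers are placeholders, not ICD codes.
descriptions = {
    "CODE_A": "acute chest pain".split(),
    "CODE_B": "chronic kidney disease".split(),
    "CODE_C": "chest wall injury".split(),
}
print(retrieve("chest pain".split(), descriptions, top_k=2))
```

Restricting the expensive re-ranking model to a short candidate list is what lets this family of methods cope with a label space of thousands of codes.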

Machine learning

Machine learning methods have been applied in ICD coding in 12 surveyed studies. Early approaches primarily utilized SVM.^16,32,43 The G_Code model⁵³ generates adversarial examples to enhance sample diversity and improve the robustness of ICD code assignments. The RPGNet model⁴⁹ uses adversarial and reinforcement learning to frame ICD coding as a path generation task. The MCDA model⁷⁷ employs Latent Dirichlet Allocation (LDA), an unsupervised learning method, to extract medical concepts from clinical notes and Wikipedia. Additionally, some studies combine structured data with Decision Trees^44,81 and deep learning architectures to enhance model performance.

Generation

Research on generative paradigms is limited; it has been applied 5 times in studies. The prompt-based paradigm has recently gained attention. The AGM-HT model⁵⁹ uses a Wasserstein Generative Adversarial Network with Gradient Penalty to leverage the hierarchical structure of ICD codes, improving zero-shot classification. Yang et al.⁷⁸ adopt a prompt-based fine-tuning approach, framing the task as filling in prompts, while another study⁸⁴ employs an autoregressive encoder-decoder for generating ICD codes using a cloze-style prompting method.

Evaluation metrics and results

Evaluation metrics

In the study by Mullenbach et al.,²⁷ the model was evaluated using three datasets: MIMIC-III full codes, MIMIC-III top 50 codes, and MIMIC-II full codes. The reported metrics included AUC-macro, AUC-micro, F1-macro, F1-micro, Precision@5, Precision@8, and Precision@15. These benchmarks have served as important reference points for later research, with many subsequent models comparing themselves against them and striving to exceed them. Detailed metric results from the literature are included in Appendix 8.
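The difference between the micro and macro variants, and the meaning of Precision@k, can be made concrete with a short sketch. Micro-averaging pools true and false positives over all document-label decisions, so frequent codes dominate, whereas macro-averaging gives every label equal weight. The code assumes macro F1 is the mean of per-label F1 scores; some implementations instead combine macro-averaged precision and recall, so reported values may differ slightly. Function names and the toy labels are illustrative.

```python
from typing import List, Set


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Pool counts over every document-label decision, then compute F1 once."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return f1_from_counts(tp, fp, fn)


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Compute F1 per label and average, so rare labels count as much as common ones."""
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label.append(f1_from_counts(tp, fp, fn))
    return sum(per_label) / len(per_label)


def precision_at_k(gold: Set[str], ranked: List[str], k: int) -> float:
    """Fraction of the k highest-ranked predicted codes that are correct."""
    return sum(1 for code in ranked[:k] if code in gold) / k


# Toy example with placeholder labels (not real ICD codes).
gold_sets = [{"A", "B"}, {"C"}]
pred_sets = [{"A"}, {"B", "C"}]
print(micro_f1(gold_sets, pred_sets))
print(macro_f1(gold_sets, pred_sets, ["A", "B", "C"]))
print(precision_at_k({"A", "B"}, ["A", "C", "B"], k=2))
```

Precision@k is often the most practically relevant of these metrics for decision support, since a coder typically reviews only a short list of suggested codes.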

Performance outcomes

In ICD coding research, most studies focus on ICD-9 coding tasks because the MIMIC-II and MIMIC-III datasets contain only ICD-9 codes, while only a few have extended to manual annotation and handling of ICD-10 classification. For example, Xu et al.⁴⁴ and Rajendran et al.⁵⁸ manually mapped 32 ICD-9 codes to ICD-10 codes. The HAN GRU model¹¹⁴ used mapping tables to convert 5935 unique ICD-9-CM codes to ICD-10, which could introduce potential noise or inaccuracies. Additionally, the study by DeYoung et al.⁷⁵ involved professionals undertaking tasks like ICD-10-CM coding, ordering, and entity annotation. Given that most studies focus on ICD-9 classification, this review exclusively compares and analyzes the classification outcomes of ICD-9 coding.

We used Mullenbach et al.²⁷ as the baseline and organized the literature by year to analyze differences in F1-micro, which is the most frequently reported metric, and in composite metrics, as shown in Figures 4 and 5. The analysis shows an upward trend in the F1-micro metric since 2019, with the most notable improvements observed in the MIMIC-III top-50 dataset. Composite metrics have shown consistent improvement since 2022. In the full dataset, the results from the HLAN⁶⁹ and BERT-hier + LAN⁷⁰ models fall below the baseline, and in the top-50 dataset, the results from Jin et al.⁹⁰ fall below the baseline.

Figure 4.

Medical Information Mart for Intensive Care (MIMIC) F1-micro scores difference comparison by publication year.

Figure 5.

Medical Information Mart for Intensive Care (MIMIC) composite scores difference comparison by publication year.

The literature was categorized by paradigm, and statistics for F1-micro and composite metrics were summarized for each paradigm to compare model performance. The results indicate that knowledge-based and generation methods significantly improve F1-micro scores, especially when using the full dataset. In contrast, deep learning, knowledge-driven, and information retrieval methods perform better in composite metrics. Violin plot comparisons for each paradigm are presented in Figures 6 and 7.

Figure 6.

Violin plot of Medical Information Mart for Intensive Care (MIMIC) F1-micro scores difference by paradigm.

Figure 7.

Violin plot of Medical Information Mart for Intensive Care (MIMIC) composite scores difference by paradigm.

Metrics improvement in ICD coding

Currently, few studies have explored the classification tasks of label quantity and order. The ML-Net model⁴⁵ develops a label count prediction network, treating label quantity prediction as an N-way classification task and using a multi-task learning approach to simultaneously predict label quantity and ICD codes. Additionally, Tsai et al.⁶¹ were the first to use a reranking method to adjust the order of automatic ICD coding. However, these studies lack evaluations or analyses of label quantity and order predictions.

Perotte et al.¹⁶ introduced novel evaluation metrics, including shared path and depth metrics. These metrics utilize the hierarchical structure of ICD codes to assess the relationship between predicted and gold standard codes, thereby facilitating error analysis. Amigo and Delgado¹¹⁵ proposed the Information Contrast Model for multi-label hierarchical extreme classification. This model compares the informational content of predicted and actual label sets, effectively addressing challenges such as hierarchical similarity and class imbalance.

Model explainability

Case analysis and visualization techniques

Early research on automated ICD coding largely overlooked the importance of interpretability. Among the literature reviewed, 60.27% (44 papers) tried to enhance interpretability, primarily through qualitative visualization and case analysis methods.

In case analysis, Luo et al.⁹³ examined various coding systems and their interrelationships, while Williamson et al.⁹⁵ concentrated on rare codes. Falis et al.⁸⁰ introduced the Weak Hierarchical Confusion Matrix method, which allows for a more nuanced evaluation of errors and links algorithmic outcomes to professional expertise. Despite these advancements, many case analyses still fail to provide comprehensive assessments of classification results from the perspective of ICD coding professionals.

Regarding visualization, the HA-GRU model³¹ made a groundbreaking contribution in 2017 by employing attention mechanism visualization to tackle the interpretability challenges in ICD classification. Following this innovation, many studies^{31,40,48,68,69} have adopted similar methods to clarify model predictions, focusing on keywords and sentences related to specific ICD codes. The HiLAT model⁷⁹ further improved this understanding by comparing these keywords with established ICD knowledge bases like SNOMED CT and UMLS. Additionally, the HLAN model⁶⁹ quantitatively identified the most significant words and sentences for each label and compared its findings across multiple models. However, discussions of the professional aspects and mechanisms of visual interpretability are often overlooked, and most existing research inadequately examines the logical relationships between visualization keywords.

Explainability improvements in ICD coding

Automated ICD coding is a specialized task where interpretability is crucial. However, deep learning methods are often considered “black boxes,” making it difficult to understand their decision-making processes. Recent studies¹¹⁶ have shown that attention weights do not always effectively explain model decisions, as they can be influenced by various factors like data and model parameters.

To enhance the explainability of ICD coding models, clinical experts should evaluate the outputs of algorithms. Research should establish reasoning pathways similar to those employed by human coders. Knowledge-enhanced PLMs integrating explicit reasoning and dynamic relationship modeling from various knowledge sources can significantly improve performance and interpretability.⁹⁶ High-quality medical datasets, such as the MDACE dataset,¹¹⁷ along with multimodal sources like laboratory test data, medical imaging, and associated reports, can increase the reliability of ICD coding and improve the clarity of coding explanations. Furthermore, building on the studies conducted by Balkir et al.¹¹⁸ and Darwiche and Ji,¹¹⁹ the focus on generating “sufficient and necessary explanations” can provide more reliable justifications for predictions.

Discussion

Common errors in medical data, including diagnostic and ICD codes, pose serious challenges for healthcare systems worldwide.^120,121 These errors often necessitate extensive human resources for quality management. It is essential to develop automated decision-support tools for ICD coding to address these issues. These tools need to process medical documents that can be structured in various formats, may be lengthy, noisy, and often incomplete. The outputs of these tasks tend to be imbalanced and involve a wide range of labels. Additionally, the classification systems and coding rules are complex and subject to change, which makes effective implementation in real-world medical settings quite challenging.⁸

Our study explored the development of automated ICD coding models based on the MIMIC dataset from 2014 to 2024, focusing on both computer science and clinical perspectives. To address the critical dimensions of this evolving research field, we analyzed and compared the results using quantitative classification methods. The key findings and synthesis of the reviewed studies are as follows:

I. Input data

To ensure reproducibility and comparability, algorithms must use consistent and well-documented data division strategies. Paragraph-level representation, text chunking, and structured data extraction can further enhance performance. Currently, studies in automated clinical coding primarily rely on specialized datasets, including the MIMIC, Centers for Disease Control, and CodiEsp datasets.¹⁰⁰ This review focuses on widely used published research datasets, specifically the MIMIC datasets. However, these datasets have inherent limitations. The data primarily comes from intensive care unit (ICU) patients, resulting in a biased distribution of diseases, with severe conditions potentially being overrepresented. Furthermore, the dataset lacks diversity, including only a small subset of all possible ICD-9 codes.¹²² Williamson et al.⁹⁵ noted that the limited data extraction points and inadequate labeling were primary constraints in their research.

II. Knowledge integration

Of the papers reviewed, 69.57% (48 papers) utilized various types of knowledge, including text-based information, knowledge graphs, and rule-based systems, with knowledge graphs being the most commonly used. The primary sources of knowledge included the ICD ontology, the UMLS ontology, and medical information sourced from platforms like Wikipedia. The CGN_EHR system faced performance issues due to the limited alignment between the UMLS knowledge graph and coding-specific requirements. The integration methods employed included graph algorithms, PLMs, and hierarchical algorithms, highlighting the importance of incorporating medical knowledge to enhance model performance. However, effectively integrating and utilizing this knowledge remains a challenge, as standards, structures, and guidelines related to ICD coding have not been fully utilized.

III. Model frameworks and paradigms

Algorithm development has evolved from traditional machine learning to more advanced paradigms, including deep learning, knowledge representation and reasoning, information retrieval, and generative models. With the development of PLMs, the application of knowledge representation and reasoning has increased significantly. Deep learning techniques have been extensively explored, including few-shot learning, transfer learning, self-distillation, and contrastive learning. These methods aim to replicate the reasoning processes used by clinical coders, which involves the innovative construction of complex model architectures, such as multi-scale and multi-head structures. However, these advancements often introduce increased complexity, which may lead to substantial computational costs and pose challenges in real-world applications.

IV. Performance outcomes

Studies that employ the same dataset division method as Mullenbach et al.²⁷ are included in the comparative analysis of evaluation metrics. Literature reviews indicate a consistent improvement in F1-micro scores since 2019, particularly for the top-50 dataset from MIMIC-III. Additionally, composite metrics have shown enhancements since 2022. For the full MIMIC-III dataset, the Multi-Stage Retrieve and Re-Rank Model⁹⁸ achieved a 6.4% increase in F1-micro scores. For the top-50 MIMIC-III dataset, the HiLAT + ClinicalPlusXLNet Model⁷⁹ recorded a 10.2% improvement in F1-micro scores. It is important to note that larger models do not always ensure better classification performance, as several smaller models^73,77 have also achieved SOTA results.

Methods that employ Knowledge Representation and Reasoning and information retrieval generally achieve better F1-micro scores. In contrast, deep learning and Knowledge Representation and Reasoning perform better on composite metrics. Models based on Information Retrieval paradigms provide significant advantages; this improvement is mainly due to the models’ ability to tackle challenges inherent in ICD coding tasks, such as the extremely large label space and long-tail label distribution. By treating coding as a retrieval task, these models effectively narrow the set of candidate codes, eliminating the need for extensive computations across numerous classes. The integration of domain-specific knowledge not only improves the model's understanding of the semantic alignment between clinical narratives and target codes but also enhances interpretability. Research into Knowledge Representation and Reasoning is particularly promising, especially for addressing low-resource conditions, such as rare-50 coding.⁹⁵

V. Model explainability

Of the reviewed papers, 60.27% (44 papers) sought to enhance explainability through qualitative visualizations, such as attention mechanism visualizations, and case-based analysis methods. However, these approaches often lack thorough review or validation by medical professionals.
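An attention visualization of this kind typically reduces to ranking the tokens of a note by the weight a label-specific attention head assigns to them. The following is a minimal sketch under that assumption; the raw scores are made-up numbers rather than the output of any model, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax that turns raw scores into weights."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attended_tokens(tokens: List[str], scores: List[float],
                        k: int) -> List[Tuple[str, float]]:
    """Return the k tokens with the highest attention weight for one label."""
    weights = softmax(scores)
    ranked = sorted(zip(tokens, weights), key=lambda pair: -pair[1])
    return ranked[:k]


# Hypothetical attention scores for a single code over a five-token note.
tokens = ["patient", "has", "congestive", "heart", "failure"]
raw_scores = [0.1, 0.0, 2.0, 1.5, 1.8]

highlighted = top_attended_tokens(tokens, raw_scores, k=3)
for token, weight in highlighted:
    print(f"{token}: {weight:.2f}")
```

A ranking like this shows where the model looked, not whether that evidence satisfies coding rules, which is why review by clinical coders remains necessary.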

Limitations of this review

This review lacks a comprehensive analysis of deep learning algorithms, particularly regarding parameter settings, algorithm transparency, ablation studies, and computational efficiency. A thorough examination of these elements is crucial for understanding the underlying mechanisms of the models, optimizing their performance, and promoting further algorithm development. Future review studies should strive to evaluate these dimensions more systematically and comprehensively to provide deeper insights and practical guidance. As an interdisciplinary field, automated clinical coding requires integrated medical, computer science, and informatics expertise. Therefore, future research should place a greater emphasis on cross-disciplinary collaboration.

Implications for practice, policy, and future research

Future models should be trained on more comprehensive datasets that accurately reflect real-world clinical scenarios across various medical specialties, institutions, and countries. The ICD coding systems are continuously updated to keep pace with advancements in medical technology and changes in healthcare policies. The MIMIC-II and MIMIC-III datasets are limited to ICD-9 coding and primarily contain data from ICUs, which restricts their broader applicability. As ICD-9 gives way to ICD-10 and ICD-11, it is crucial to address the differences in label structure, semantic density, and granularity when adapting models to these newer coding systems.¹²³ The recently released MIMIC-IV dataset, which supports multiple ICD versions (excluding ICD-11), represents a significant step toward bridging this gap.¹²⁴ Mapping and converting between different ICD systems will be essential for facilitating knowledge transfer and enabling model reuse across standards, ultimately improving coding accuracy and applicability. Furthermore, integrating multimodal real-world clinical data, combining clinical narratives with structured information (such as lab results and vital signs) and diagnostic imaging (such as chest X-rays and reports), can greatly enhance contextual understanding and diagnostic accuracy.⁷ This integration is particularly valuable in complex cases involving complications, comorbidities, or unclear documentation. It also establishes a robust chain of evidence that supports model interpretability.
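Converting labels between ICD versions can be sketched as a lookup through a crosswalk table that allows several target codes per source code and records the codes it cannot map. The three-entry table below is a toy assumption for illustration, not an authoritative crosswalk, and `V99.99` is a made-up placeholder standing in for a code with no mapping.

```python
from typing import Dict, List, Tuple

# Toy ICD-9 to ICD-10 crosswalk (assumed for illustration only). Targets are
# lists because real crosswalks can be one-to-many when granularity differs.
TOY_CROSSWALK: Dict[str, List[str]] = {
    "401.9": ["I10"],
    "250.00": ["E11.9"],
    "428.0": ["I50.9"],
}


def convert_labels(icd9_codes: List[str],
                   crosswalk: Dict[str, List[str]]
                   ) -> Tuple[Dict[str, List[str]], List[str]]:
    """Map each source code to its candidate targets; collect unmapped codes."""
    mapped: Dict[str, List[str]] = {}
    unmapped: List[str] = []
    for code in icd9_codes:
        targets = crosswalk.get(code)
        if targets:
            mapped[code] = list(targets)
        else:
            # Surface gaps explicitly instead of silently dropping the label.
            unmapped.append(code)
    return mapped, unmapped


mapped, unmapped = convert_labels(["401.9", "250.00", "V99.99"], TOY_CROSSWALK)
print(mapped)
print(unmapped)
```

Returning the unmapped codes separately keeps the loss of label information visible, which matters when a model trained on one ICD version is evaluated on another.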

The review highlighted the significant advantages of models based on knowledge representation and reasoning and on information retrieval, emphasizing the importance of incorporating broad and deep domain knowledge for future advancements. ICD coding is a tightly regulated medical task that demands precise inputs, accurate outputs, and clear interpretability. It is essential to integrate various forms of medical knowledge, including ICD coding standards and guidelines, clinical terminology, and treatment pathways. EHR data representation should be grounded in domain-specific medical knowledge to ensure its efficiency, accuracy, and consistency. For instance, adopting standardized frameworks such as Fast Healthcare Interoperability Resources (FHIR) can provide a practical solution for more effectively representing medical information.¹²⁵ With respect to medical knowledge, efforts should focus on verifying the validity of knowledge sources, enhancing the density and richness of information embedded in models, and developing efficient organizational structures for managing medical knowledge.¹⁰³,¹²⁶

The primary goal of automated clinical coding is to improve efficiency and alleviate the burden of manual coding. However, several practical challenges need to be addressed carefully. These challenges include aligning with regional healthcare administration and insurance regulatory requirements, designing effective workflows for EHR integration, and ensuring the quality of ICD coding assessments, including diagnostic ordering, accuracy, and completeness. Establishing trust among clinicians is crucial, as it allows physicians to validate and depend on the outputs of models more effectively. Visualizing “reasoning chains,” which illustrate the sources of text evidence and coding rules and guidelines, can significantly enhance this trust. According to Dong et al.,⁸ automated clinical coding that is human-centered, explainable, intelligent, and robust enough to handle complex real-world scenarios still faces significant challenges. To effectively tackle these challenges, promoting interdisciplinary collaboration between medical informatics and computer science is essential.

Conclusion

This review systematically evaluates automated ICD coding models developed using the MIMIC dataset from 2014 to 2024. The analysis shows that 69.57% (48 papers) of the reviewed studies incorporate various forms of medical knowledge, with knowledge graphs being the most commonly used. Algorithm development has advanced from traditional machine learning to more sophisticated paradigms, such as deep learning, knowledge reasoning, information retrieval, and generative models. Approaches based on knowledge representation and reasoning and on information retrieval have shown significant improvement. F1-micro scores have improved consistently since 2019, and composite metrics have improved since 2022. There was a 6.4% increase in F1-micro scores for the full MIMIC-III dataset, while the top-50 MIMIC-III dataset recorded a 10.2% improvement. Additionally, 60.27% (44 papers) of the studies include efforts to enhance model explainability, primarily through attention visualization and case-based analysis.

This review highlights several critical limitations in the development of models. For instance, the MIMIC data used by these models lacks diversity, and the application of medical domain knowledge has not been fully realized. Additionally, the models involve high algorithmic complexity and have received insufficient validation by clinical coders. These factors reduce their reliability in clinical settings. Future research should prioritize incorporating a wider variety of multimodal data sources, integrating medical knowledge more effectively, and enhancing model explainability. A significant gap remains between algorithm development and practical application, and closing it requires collaboration among multidisciplinary experts.

Supplemental Material

Supplemental material for A systematic review of automated International Classification of Diseases coding models using the Medical Information Mart for Intensive Care dataset by Ying Zhang, Chen Lyu, Lu Chang, Hong Yang, Bin Ji and Ling-yun Wei in DIGITAL HEALTH:

sj-docx-1-dhj-10.1177_20552076251404518

sj-docx-2-dhj-10.1177_20552076251404518

sj-docx-3-dhj-10.1177_20552076251404518

sj-docx-4-dhj-10.1177_20552076251404518

sj-docx-5-dhj-10.1177_20552076251404518

sj-docx-6-dhj-10.1177_20552076251404518

sj-docx-7-dhj-10.1177_20552076251404518

sj-docx-8-dhj-10.1177_20552076251404518

Footnotes

Acknowledgements

This work was carried out in the Information Department of Guangdong Women and Children Hospital, in collaboration with the research team from the School of Computer Science, Sun Yat-sen University. The authors would like to express their gratitude to all colleagues and collaborators for their valuable contributions to the technical discussions and manuscript preparation.

ORCID iD

Ling-yun Wei

Author contributions

Ling-Yun Wei: Conceptualization, project administration, resources, supervision, funding acquisition, writing—review & editing. Ying Zhang: Data curation, formal analysis, investigation, methodology, software, funding acquisition. Chen Lyu: Methodology, validation, writing—review & editing. Lu Chang: Investigation. Hong Yang: Visualization, writing—original draft. Bin Ji: Writing—review & editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Natural Science Foundation of Guangdong Province (Grant number: 2021A1515110721).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and materials

The data that support the findings of this study are available from Guangzhou Healthcare Security Administration, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the corresponding author upon reasonable request and with permission of Guangzhou Healthcare Security Administration.

Supplemental material

Supplemental material for this article is available online.

References

1. What is Medical Coding? AAPC, https://www.aapc.com/resources/what-is-medical-coding (accessed 18 April 2024).

2. ICD-10-CM Official Guidelines for Coding and Reporting. Centers for Medicare & Medicaid Services (CMS), https://www.cms.gov/files/document/fy-2023-icd-10-cm-coding-guidelines-updated-01/11/2023.pdf (accessed 18 April 2024).

3. Canadian Coding Standards for Version 2022 ICD-10-CA and CCI. Canadian Institute for Health Information (CIHI), https://secure.cihi.ca/free_products/canadian-coding-standards-2022-en.pdf (accessed 18 April 2024).

4. Burns, Rigby, Mamidanna, et al. Systematic review of discharge coding accuracy. J Public Health 2012; 34: 138–148.

5. Venkatesh, Raza, Kvedar. Automating the overburdened clinical coding system: challenges and next steps. NPJ Digit Med 2023; 6: 16.

6. Smith, Bowman, Dooling. Measuring and Benchmarking Coding Productivity: A Decade of AHIMA Leadership. Measuring and Benchmarking Coding Productivity: A Decade of AHIMA Leadership/AHIMA, American Health Information Management Association, https://library.ahima.org/doc?oid=302649 (2019, accessed 17 January 2024).

7. Alonso, Santos, Pinto, et al. Problems and barriers during the process of clinical coding: a focus group study of coders’ perceptions. J Med Syst 2020; 44: 62.

8. Dong, Falis, Whiteley, et al. Automated clinical coding: what, why, and where we are? NPJ Digit Med 2022; 5: 159.

9. Kaur, Ginige, Obst. AI-based ICD coding and classification approaches using discharge summaries: a systematic literature review. Expert Syst Appl 2023; 213: 118997.

10. International statistical classification of diseases and related health problems - 10th revision v. 2. Instruction manual. Fifth edition. World Health Organization, https://icd.who.int/browse10/Content/statichtml/ICD10Volume2_en_2019.pdf (2016).

11. Moriyama, Loy, Robb-Smith AHT, et al. History of the statistical classification of diseases and causes of death (2011), https://www.cdc.gov/nchs/data/misc/classification_diseases2011.pdf.

12. Kaur, Ginige, Obst. A systematic literature review of automated ICD coding and classification systems using discharge summaries. arXiv preprint arXiv:210710652.

13. ICD-11 Reference Guide, https://icdcdn.who.int/icd11referenceguide/en/html/index.html#icd11-reference-guide (accessed 22 April 2024).

14. Rodrigues J-M, Schulz, Rector, et al. Sharing ontology between ICD 11 and SNOMED CT will enable seamless Re-use and semantic interoperability. In: MEDINFO 2013. Amsterdam, Netherlands: IOS Press, 2013, pp.343–346.

15. Kaur, Ginige. Comparative analysis of algorithmic approaches for auto-coding with ICD-10-AM and ACHI. Stud Health Technol Inform 2018; 252: 73–79.

16. Perotte, Pivovarov, Natarajan, et al. Diagnosis code assignment: models and evaluation metrics. J Am Med Inform Assoc 2014; 21: 231–237.

17. Shi, Xie, et al. Towards Automated ICD Coding Using Deep Learning. arXiv e-prints. Epub ahead of print 1 November 2017. DOI: 10.48550/arXiv.1711.04075.

18. Johnson AEW, Pollard, Shen, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035.

19. Kavuluru, Rios. An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artif Intell Med 2015; 65: 155–166.

20. Bellot, Trabelsi, Mothe, et al. (eds). Experimental IR Meets Multilinguality, Multimodality, and Interaction. In: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings, Cham: Springer International Publishing. Epub ahead of print 2018. DOI: 10.1007/978-3-319-98932-7.

21. Arampatzis, Kanoulas, Tsikrika, et al. (eds). Experimental IR Meets Multilinguality, Multimodality, and Interaction. In: 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22–25, 2020, Proceedings. Cham: Springer International Publishing. Epub ahead of print 2020. DOI: 10.1007/978-3-030-58219-7.

22. Zhang, Zhao, et al. Enhancing automatic ICD-9-CM code assignment for medical texts with PubMed. In: Cohen, Demner-Fushman, Ananiadou, et al. (eds) BioNLP 2017. Vancouver, Canada: Association for Computational Linguistics, 2017, pp.263–271.

23. Cao, Zhang. OTSeq2set: an optimal transport enhanced sequence-to-set model for extreme multi-label text classification. In: Goldberg, Kozareva, Zhang (eds) Proceedings of the 2022 conference on empirical methods in natural language processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp.5588–5597.

24. Sen, Aslam, et al. From Extreme Multi-label to Multi-class: A Hierarchical Approach for Automated ICD-10 Coding Using Phrase-level Attention, http://arxiv.org/abs/2102.09136 (2022, accessed 27 September 2023).

25. Kaur. Distributed knowledge based clinical auto-coding system. In: Alva-Manchego, Choi, Khashabi (eds) Proceedings of the 57th annual meeting of the association for computational linguistics: student research workshop. Florence, Italy: Association for Computational Linguistics, 2019, pp.1–9.

26. Moher, Liberati, Tetzlaff, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med 2009; 6: e1000097.

27. Mullenbach, Wiegreffe, Duke, et al. Explainable prediction of medical codes from clinical text. In: Walker, Stent (eds) Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: human language technologies, volume 1 (long papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp.1101–1111.

28. Larkey, Croft. Combining classifiers in text categorization. In: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pp.289–297. New York, NY, USA: Association for Computing Machinery.

29. Ayyar, Don. Tagging patient notes with ICD-9 codes. In: Proceedings of the 29th conference on Neural Information Processing Systems, 2016, pp.1–8.

30. Prakash, Zhao, Hasan, et al. Condensed memory networks for clinical diagnostic inferencing. In: Proceedings of the AAAI conference on Artificial Intelligence 2017; 31, Epub ahead of print 12 February 2017. DOI: 10.1609/aaai.v31i1.10964.

31. Baumel, Nassour-Kassis, Cohen, et al. Multi-label classification of patient notes a case study on ICD code assignment. arXiv e-prints. Epub ahead of print 1 September 2017. DOI: 10.48550/arXiv.1709.09587.

32. Berndorfer, Henriksson. Automated diagnosis coding with combined text representations. Stud Health Technol Inform 2017; 235: 201–205.

33. Xie, Xing. A neural architecture for automated ICD coding. In: Gurevych, Miyao (eds) Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: long papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp.1066–1076.

34. Rios, Kavuluru. Few-Shot and zero-shot multi-label learning for structured label spaces. In: Riloff, Chiang, Hockenmaier, et al. (eds) Proceedings of the 2018 conference on empirical methods in natural language processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp.3132–3142.

35. Samonte MJC, Gerardo, Fajardo, et al. ICD-9 Tagging of clinical notes using topical word embedding. In: Proceedings of the 2018 1st international conference on internet and e-business. New York, NY, USA: Association for Computing Machinery, 2018, pp.118–123.

36. Catling, Spithourakis, Riedel. Towards automated clinical coding. Int J Med Inf 2018; 120: 50–61.

37. Huang, Osorio. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Comput Methods Programs Biomed 2019; 177: 141–153.

38. Xie, Xiong, et al. EHR Coding with multi-scale feature attention and structured knowledge graph propagation. In: Proceedings of the 28th ACM international conference on information and knowledge management. New York, NY, USA: Association for Computing Machinery, 2019, pp.649–658.

39. Bai, Vucetic. Improving medical code prediction from clinical text via incorporating online knowledge sources. In: The world wide web conference. San Francisco, CA: ACM, 2019, pp.72–82.

40. Falis, Pajak, Lisowska, et al. Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text. In: Holderness, Jimeno Yepes, Lavelli, et al. (eds) Proceedings of the tenth international workshop on health text mining and information analysis (LOUHI 2019). Hong Kong: Association for Computational Linguistics, 2019, pp.168–177.

41. Fei, Zeng, et al. Automated ICD-9 coding via A deep learning approach. IEEE/ACM Trans Comput Biol Bioinf 2019; 16: 1193–1202.

42. Zeng, Fei, et al. Automatic ICD-9 coding via deep transfer learning. Neurocomputing 2019; 324: 43–50.

43. Schäfer, Friedrich. UMLS Mapping and word embeddings for ICD code assignment using the MIMIC-III intensive care database. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019, pp.6089–6092.

44. Lam, Pang, et al. Multimodal machine learning for automated ICD coding. In: Proceedings of the 4th Machine Learning for Healthcare Conference, pp.197–215: PMLR.

45. Chen, Peng, et al. ML-Net: multi-label classification of biomedical texts with deep neural networks. J Am Med Inform Assoc 2019; 26: 1279–1285.

46. Nguyen. A label attention model for ICD coding from clinical text. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Yokohama, Japan, 2021, pp.3335–3341.

47. Sonabend W, Cai, Ahuja, et al. Automated ICD coding via unsupervised knowledge integration (UNITE). Int J Med Inf 2020; 139: 104135.

48. Cao, Chen, Liu, et al. Hypercore: hyperbolic and co-graph representation for automatic ICD coding. In: Jurafsky, Chai, Schluter, et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Online: Association for Computational Linguistics, 2020, pp.3105–3114.

49. Wang, Ren, Chen, et al. Coding electronic health records with adversarial reinforcement path generation. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. New York, NY, USA: Association for Computing Machinery, 2020, pp.801–810.

50. ICD Coding from clinical text using multi-filter residual convolutional neural network. Proc AAAI Conf Artif Intell 2020; 34: 8180–8187.

51. Guo, Duan, et al. A disease inference method based on symptom extraction and bidirectional long short term memory networks. Methods 2020; 173: 75–82.

52. Mascio, Kraljevic, Bean, et al. Comparative analysis of text classification approaches in electronic health records. In: Demner-Fushman, Cohen, Ananiadou, et al. (eds) Proceedings of the 19th SIGBioMed workshop on biomedical language processing. Online: Association for Computational Linguistics, 2020, pp.86–94.

53. Teng, Yang, Chen, et al. Explainable prediction of medical codes with knowledge graphs. Front Bioeng Biotechnol 2020; 8: 867.

54. Cambria, Marttinen. Dilated convolutional attention network for medical code assignment from clinical text. In: Rumshisky, Roberts, Bethard, et al. (eds) Proceedings of the 3rd clinical natural language processing workshop. Online: Association for Computational Linguistics, 2020, pp.73–78.

55. Hsu C-C, Chang P-C, Chang. Multi-Label classification of ICD coding using deep learning. In: 2020 International Symposium on Community-centric Systems (CcS), pp.1–6.

56. Feucht, Althammer, et al. Description-based label attention classifier for explainable ICD-9 classification. In: Xu, Ritter, Baldwin, et al. (eds) Proceedings of the seventh workshop on noisy user-generated text (W-NUT 2021). Online: Association for Computational Linguistics, 2021, pp.62–66.

57. Zhou, Cao, Chen, et al. Automatic ICD coding via interactive shared representation networks with self-distillation mechanism. In: Zong, Xia, et al. (eds) Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers). Online: Association for Computational Linguistics, 2021, pp.5948–5957.

58. Rajendran, Zenonos, Spear, et al. Embed wisely: an ensemble approach to predict ICD coding. In: Kamp, Koprinska, Bibal, et al. (eds) Machine learning and principles and practice of knowledge discovery in databases. Cham: Springer International Publishing, 2021, pp.371–389.

59. Song, Zhang, Sadoughi, et al. Generalized zero-shot text classification for ICD coding. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Yokohama, Japan, 2021, pp.4018–4024.

60. Wang, Ren, Chen, et al. Few-Shot Electronic Health Record Coding through Graph Contrastive Learning. arXiv e-prints. Epub ahead of print 1 June 2021. DOI: 10.48550/arXiv.2106.15467.

61. Tsai S-C, Huang C-W, Chen Y-N. Modeling diagnostic label correlation for automatic ICD coding. In: Toutanova, Rumshisky, Zettlemoyer, et al. (eds) Proceedings of the 2021 conference of the north American chapter of the association for computational linguistics: human language technologies. Online: Association for Computational Linguistics, 2021, pp.4043–4052.

62. Pascual, Luck, Wattenhofer. Towards BERT-based automatic ICD coding: limitations and opportunities. In: Demner-Fushman, Cohen, Ananiadou, et al. (eds) Proceedings of the 20th workshop on biomedical language processing. Online: Association for Computational Linguistics, 2021, pp.54–63.

63. Liu, Cheng, Klopfer, et al. Effective convolutional attention network for multi-label clinical document classification. In: Moens M-F, Huang, Specia, et al. (eds) Proceedings of the 2021 conference on empirical methods in natural language processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp.5941–5953.

64. Luo, Xiao, Glass, et al. Fusion: towards automated ICD coding via feature compression. In: Zong, Xia, et al. (eds) Findings of the association for computational linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, 2021, pp.2096–2101.

65. Heo T-S, Yoo, Park, et al. Medical code prediction from discharge summary: document to sequence BERT using sequence attention. In: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp.1239–1244.

66. Kim B-H, Ganapathi. Read, attend, and code: pushing the limits of medical codes prediction from clinical notes by machines. In: Proceedings of the 6th Machine Learning for Healthcare Conference. PMLR, pp.196–208.

67. Bao, Lin, Zhang, et al. Medical code prediction via capsule networks and ICD knowledge. BMC Med Inform Decis Mak 2021; 21: 55.

68. Biswas, Pham T-H, Zhang. TransICD: transformer based code-wise attention model for explainable ICD coding. In: Tucker, Henriques Abreu, Cardoso, et al. (eds) Artificial intelligence in medicine. Cham: Springer International Publishing, 2021, pp.469–478.

69. Dong, Suárez-Paniagua, Whiteley, et al. Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation. J Biomed Inform 2021; 116: 103728.

70. Hölttä, Marttinen. Does the magic of BERT apply to medical code assignment? A quantitative study. Comput Biol Med 2021; 139: 104998.

71. Mayya V, SSK, Krishnan, et al. Multi-channel, convolutional attention based neural model for automated diagnostic coding of unstructured patient discharge summaries. Future Gener Comput Syst 2021; 118: 374–391.

72. Zhang, Islam, et al. JLAN: medical code prediction via joint learning attention networks and denoising mechanism. BMC Bioinformatics 2021; 22: 590.

73. Yuan, Tan, Huang. Code synonyms do matter: multiple synonyms matching network for automatic ICD coding. In: Muresan, Nakov, Villavicencio (eds) Proceedings of the 60th annual meeting of the association for computational linguistics (volume 2: short papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp.808–814.

74. Michalopoulos, Malyska, Sahar, et al. ICDBigbird: A contextual embedding model for ICD code classification. In: Demner-Fushman, Cohen, Ananiadou, et al. (eds) Proceedings of the 21st workshop on biomedical language processing. Dublin, Ireland: Association for Computational Linguistics, 2022, pp.330–336.

75. DeYoung, Shing H-C, Kong, et al. Entity Anchored ICD Coding. Epub ahead of print 15 August 2022. DOI: 10.48550/arXiv.2208.07444.

76. Huang C-W, Tsai S-C, Chen Y-N. PLM-ICD: automatic ICD coding with pretrained language models. In: Naumann, Bethard, Roberts, et al. (eds) Proceedings of the 4th clinical natural language processing workshop. Seattle, WA: Association for Computational Linguistics, 2022, pp.10–20.

77. Wang, Zhang, et al. A novel framework based on medical concept driven attention for explainable medical code prediction via external knowledge. In: Muresan, Nakov, Villavicencio (eds) Findings of the association for computational linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics, 2022, pp.1407–1416.

78. Yang, Wang, Rawat BPS, et al. Knowledge injected prompt based fine-tuning for multi-label few-shot ICD coding. In: Goldberg, Kozareva, Zhang (eds) Findings of the association for computational linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp.1767–1781.

79. Liu, Perez-Concha, Nguyen, et al. Hierarchical label-wise attention transformer model for explainable ICD coding. J Biomed Inform 2022; 133: 104161.

80. Falis, Dong, Birch, et al. Horses to zebras: ontology-guided data augmentation and synthesis for ICD-9 coding. In: Demner-Fushman, Cohen, Ananiadou, et al. (eds) Proceedings of the 21st workshop on biomedical language processing. Dublin, Ireland: Association for Computational Linguistics, 2022, pp.389–401.

81. Liu, Wen, et al. TreeMAN: tree-enhanced multimodal attention network for ICD coding. In: Calzolari, Huang C-R, Kim, et al. (eds) Proceedings of the 29th international conference on computational linguistics. Gyeongju, Republic of Korea: International Committee on Computational Linguistics, 2022, pp.3054–3063.

82. Chen, et al. Rare codes count: mining inter-code relations for long-tail clinical text classification. In: Naumann, Ben Abacha, Bethard, et al. (eds) Proceedings of the 5th clinical natural language processing workshop. Toronto, Canada: Association for Computational Linguistics, 2023, pp.403–413.

83. Nguyen T-T, Schlegel, Ramesh Kashyap, et al. A two-stage decoder for efficient ICD coding. In: Rogers, Boyd-Graber, Okazaki (eds) Findings of the association for computational linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, 2023, pp.4658–4665.

84. Yang, Kwon, Yao, et al. Multi-Label few-shot ICD coding as autoregressive generation with prompt. Proc AAAI Conf Artif Intell 2023; 37: 5366–5374.

85. BLC, Santos, Rei. Modelling temporal document sequences for clinical ICD coding. In: Vlachos, Augenstein (eds) Proceedings of the 17th conference of the European chapter of the association for computational linguistics. Dubrovnik, Croatia: Association for Computational Linguistics, 2023, pp.1640–1649.

86. Niu, et al. Retrieve and rerank for automated ICD coding via contrastive learning. J Biomed Inform 2023; 143: 104396.

87.

Yang

Zhang

, et al. Intriguing effect of the correlation prior on ICD-9 code assignment. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), 2023, pp.109–118.

88.

Liu

Perez-Concha

Nguyen

, et al. Automated ICD coding using extreme multi-label long text transformer-based models. Artif Intell Med 2023; 144: 102662.

89.

Kang

Wang

Xiong

, et al. Automatic ICD coding based on segmented ClinicalBERT with hierarchical tree structure learning. In: Database systems for advanced applications: 28th international conference, DASFAA 2023, Tianjin, China, April 17–20, 2023, proceedings, part IV. Berlin, Heidelberg: Springer-Verlag, 2023, pp.250–265.

90.

Jin

Xiong

Shi

, et al. Learning from undercoded clinical records for automated international classification of diseases (ICD) coding. J Am Med Inform Assoc 2023; 30: 438–446.

91.

Mou

, et al. Automated ICD coding based on neural machine translation. In: 2023 8th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), 2023, pp.495–500.

92.

Zhao

Zhang

, et al. Towards automatic ICD coding via knowledge enhanced multi-task learning. In: Proceedings of the 32nd ACM international conference on information and knowledge management. New York, NY: Association for Computing Machinery, 2023, pp.1238–1248.

93.

Luo

Wang

, et al. Corelation: boosting automatic ICD coding through contextualized code relation learning. In: Calzolari

Kan

M-Y

Hoste

, et al. (eds) Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024). Torino, Italia: ELRA and ICCL, 2024, pp.3997–4007.

94.

Reddy

Wang

, et al. Towards semi-structured automatic ICD coding via tree-based contrastive learning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2024, pp.68300–68315.

95.

Williamson

de Hilster

Meyers

, et al. Low resource ICD coding of hospital discharge summaries. In: Demner-Fushman

Ananiadou

Miwa

, et al. (eds) Proceedings of the 23rd workshop on biomedical natural language processing. Bangkok, Thailand: Association for Computational Linguistics, 2024, pp.548–558.

96.

Caralt

CBL

Rei

. Continuous predictive modeling of clinical notes and ICD codes in patient health records. In: Demner-Fushman

Ananiadou

Miwa

, et al. (eds) Proceedings of the 23rd workshop on biomedical natural language processing. Bangkok, Thailand: Association for Computational Linguistics, 2024, pp.243–255.

97.

Wang

Zhang

, et al. ICDXML: enhancing ICD coding with probabilistic label trees and dynamic semantic representations. Sci Rep 2024; 14: 18319.

98.

Wang

Mercer

Rudzicz

. Multi-stage retrieve and Re-rank model for automatic medical coding recommendation. In: Duh

Gomez

Bethard

(eds) Proceedings of the 2024 conference of the north American chapter of the association for computational linguistics: human language technologies (volume 1: long papers). Mexico City, Mexico: Association for Computational Linguistics, 2024, pp.4881–4891.

99.

Goldstein

Amin

Neumann

, et al. Towards understanding attention-based reasoning through graph structures in medical codes classification. In: Ustalov

Gao

Panchenko

(eds) Proceedings of TextGraphs-17: graph-based methods for natural language processing. Bangkok, Thailand: Association for Computational Linguistics, 2024, pp.78–92.

100.

Teng

Liu

, et al. A review on deep neural networks for ICD coding. IEEE Trans on Knowl and Data Eng 2023; 35: 4357–4375.

101.

Toti

Morley

, et al. SemEHR: a general-purpose semantic search system to surface semantic data from clinical notes for tailored care, trial recruitment, and clinical research. J Am Med Inform Assoc 2018; 25: 530–537.

102.

Kraljevic

Bean

Mascio

, et al. MedCAT – Medical Concept Annotation Tool. Epub ahead of print 1 December 2019. DOI: 10.48550/arXiv.1912.10166.

103.

Liu

Zhao

, et al. A survey of knowledge enhanced Pre-trained language models. IEEE Trans Knowl Data Eng 2024; 36: 1413–1430.

104.

ICD-10 Mapping Technical Guide - ICD-10 Mapping Technical Guide - SNOMED Confluence, https://confluence.ihtsdotools.org/display/DOCICD10/ICD-10+Mapping+Technical+Guide (accessed 29 October 2024).

105.

SNOMED CT Implementation Support Portal - SNOMED Implementation Support - SNOMED Confluence, https://confluence.ihtsdotools.org/display/IMP/SNOMED+CT+Implementation+Support+Portal (accessed 3 June 2024).

106.

Mapping Tools - SNOMED Implementation Support - SNOMED Confluence, https://confluence.ihtsdotools.org/display/IMP/Mapping+Tools (accessed 3 June 2024).

107.

Mikolov

, Corrado Gs, et al. Efficient Estimation of Word Representations in Vector Space. 2013, pp.1–12.

108.

Joulin

Grave

Bojanowski

, et al. Bag of tricks for efficient text classification. In: Lapata

Blunsom

Koller

(eds) Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 2, short papers. Valencia, Spain: Association for Computational Linguistics, 2017, pp.427–431.

109.

Ziletti

Akbik

Berns

, et al. Medical coding with biomedical transformer ensembles and zero/few-shot learning. arXiv preprint arXiv:220602662.

110.

Lee

Yoon

Kim

, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36: 1234–1240.

111.

Alsentzer

Murphy

Boag

, et al. Publicly available clinical BERT embeddings. In: Rumshisky

Roberts

Bethard

, et al. (eds) Proceedings of the 2nd clinical natural language processing workshop. Minneapolis, MN: Association for Computational Linguistics, 2019, pp.72–78.

112.

Tinn

Cheng

, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH) 2021; 3: 1–23.

113.

Lewis

Ott

, et al. Pretrained language models for biomedical and clinical tasks: understanding and extending the state-of-the-art. In: Rumshisky

Roberts

Bethard

, et al. (eds) Proceedings of the 3rd clinical natural language processing workshop. Online: Association for Computational Linguistics, 2020, pp.146–157.

114.

Makohon

. Multi-Label classification of ICD-10 coding & clinical notes using MIMIC & CodiEsp. In: 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), 2021, pp.1–4.

115.

Amigo

Delgado

. Evaluating extreme hierarchical multi-label classification. In: Muresan

Nakov

Villavicencio

(eds) Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp.5809–5819.

116.

Wiegreffe

Pinter

. Attention is not not explanation. In: Inui

Jiang

, et al. (eds) Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp.11–20.

117.

Cheng

Jafari

Russell

, et al. MDACE: MIMIC documents annotated with code evidence. In: Rogers

Boyd-Graber

Okazaki

(eds) Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers). Toronto, Canada: Association for Computational Linguistics, 2023, pp.7534–7550.

118.

Balkir

Nejadgholi

Fraser

, et al. Necessity and sufficiency for explaining text classifiers: a case study in hate speech detection. In: Carpuat

de Marneffe

M-C

Meza Ruiz

(eds) Proceedings of the 2022 conference of the north American chapter of the association for computational linguistics: human language technologies. Seattle, United States: Association for Computational Linguistics, 2022, pp.2672–2686.

119.

Darwiche

. On the computation of necessary and sufficient explanations. Proc AAAI Conf Artif Intell 2022; 36: 5582–5591.

120.

Ortega

Hidrue

Lehrhoff

, et al. Patterns in physician burnout in a stable-linked cohort. JAMA Network Open 2023; 6: e2336745.

121.

Moy

Hobensack

Marshall

, et al. Understanding the perceived role of electronic health records and workflow fragmentation on clinician documentation burden in emergency departments. J Am Med Inform Assoc 2023; 30: 797–808.

122.

Khope

Elias

. Strategies of predictive schemes and clinical diagnosis for prognosis using MIMIC-III: a systematic review. Healthcare (Basel) 2023; 11: 710.

123.

Krawczyk

Święcicki

. ICD-11 vs. ICD-10 - a review of updates and novelties introduced in the latest version of the WHO international classification of diseases. Psychiatr Pol 2020; 54: 7–20.

124.

Johnson

AEW

Bulgarelli

Shen

, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10: 1.

125.

Mandel

Kreda

Mandl

, et al. SMART On FHIR: a standards-based, interoperable apps platform for electronic health records. J Am Med Inform Assoc 2016; 23: 899–908.

126.

Cui

Gao

Talamadupula

, et al. Knowledge-Augmented deep learning and its applications: a survey. IEEE Trans Neural Networks Learn Syst 2023; 36: 2133–2153.
