Abstract
Introduction
Real World Evidence (RWE) studies are used by policy makers and health technology assessment (HTA) bodies1–3 as a critical foundation in their decision processes. While the scope of RWE studies was initially limited to market access, feasibility and representativeness studies, they are now also considered for effectiveness and post-authorization safety studies.4 Consequently, a new methodological framework was needed to cover the increased heterogeneity of RWE study objectives and to enhance the understanding of the underlying data sources and models.5,6 In this context of evidence generation, there is a clear need for study reproducibility,7 data pipeline automation and scalability of the tools used. Common data models partially answer this need.8–11 However, they all come with implementation and maintenance costs, and with design choices that fit their respective goals, such as interoperability and standardization of concepts, to the detriment of user experience and flexibility when it comes to non-standard study designs.
The French national healthcare system claims database12 is a widely used data source for pharmaco-epidemiology and, more broadly, observational studies.13 It contains hospitalizations, medication dispensations, and claims data. Its relational model is snowflake-like: while this architecture allows for better memory space optimization, it implies designing complex and compute-intensive queries. Furthermore, this type of model is not well suited for distributed computing.
One attempt to address this last issue was led by Bacry et al. with SCALPEL3,14 an Apache Spark based open-source data pipeline framework designed specifically for the French national claims database, the “Système National des Données de Santé” (SNDS). While successfully implemented and used for evidence generation,15 according to its authors some parts of the initial data processing could be improved and require substantial resources, especially memory.
In this paper we describe the design and implementation of the Victoria pipeline, an empirically built data pipeline for French national healthcare system claims database (SNDS) extracts. Victoria was developed in order to:
- Provide a tool based on previously published work such as SCALPEL3.
- Build an automated, maintainable and scalable core pipeline that can support project-specific changes to the data model inherent to the use of large observational databases and their related studies.
- Deliver a documented pipeline with clear start-to-finish ETL (Extract-Transform-Load) processes, thus enabling scientific, epidemiological and quality reviews of those data manipulation processes.
- Be compliant with regulatory requirements regarding health data warehouses.
- Ease access to and comprehension of SNDS data via data visualization tools built on top of the pipeline (e.g., for feasibility studies).
Material and methods
This paper describes the SNDS data input, then the 2-step process of the Victoria pipeline and its final output.
SNDS data input
The SNDS data structure and vocabularies are described by the Health Data Hub.16 There are 3 main sources of information in the SNDS: (1) the French PMSI (“Programme de Médicalisation des Systèmes d'Information”), containing all the activity from public and private hospitals; (2) the DCIR (“Datamart de Consommation Inter Régime”), identifying all outpatient reimbursement records outside the hospital; and (3) the CépiDC (“Centre d'épidémiologie sur les causes médicales de Décès”), which holds records of deaths and their causes. In the SNDS model, each source is structured as a snowflake schema centered around a central beneficiary table, containing a unique ID for each individual and their demographic information. This model is referenced in the online documentation of the Health Data Hub.
As is, this schema is tedious to use for epidemiological purposes: the number of joins needed to query basic information linking the DCIR and PMSI sources can exceed 10 for a simple drug consumption question.17 Moreover, because it was designed from an accounting and medico-administrative standpoint, the expertise required to correctly identify the relevant tables, variables and vocabularies to answer a study question is another barrier to using it. Join queries on large non-indexed tables are known to be compute-intensive and therefore represent a non-negligible cost.
General design and requirements
The Victoria pipeline is written in Scala using the Spark framework. The input is raw SNDS data provided by the national health insurance (Caisse Nationale d’Assurance Maladie – CNAM) in CSV format, stored on an encrypted hard drive.
Victoria’s target output is a normalised, linearised and minimised data lake split into 29 different tables, as shown in Figure 1, where each line represents an event or a dispensation. It differs greatly from the original model, as tables are designed to be queried directly and to answer specific questions more easily and quickly.
Figure 1. General representation of the Victoria pipeline.
Step 1 - clean: from SNDS raw data to cleaned tables
Main functions in cleaning step.
At this step, every raw ID is hashed using the argon2 method. The resulting hashes are encoded in hexadecimal format. A mapping table linking raw IDs to the newly created ones is encrypted with the AES algorithm and stored in a different S3 object than the database. The initial raw IDs are then deleted from the database.
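The pseudonymisation step can be sketched as follows. This is a minimal Python illustration of the logic only (the pipeline itself is Scala/Spark and uses argon2; the standard library's memory-hard scrypt KDF stands in for argon2 here, the parameters are untuned, and the function and variable names are hypothetical):

```python
import hashlib
import secrets

def pseudonymise(raw_id: str, salt: bytes) -> str:
    # Victoria hashes raw IDs with argon2 and stores them hex-encoded;
    # scrypt (another memory-hard KDF) stands in for argon2 in this sketch.
    digest = hashlib.scrypt(raw_id.encode(), salt=salt, n=2**12, r=8, p=1)
    return digest.hex()

salt = secrets.token_bytes(16)
raw_ids = ["patient-001", "patient-002"]  # hypothetical raw IDs
# Mapping table linking raw IDs to the new ones: in Victoria this table is
# AES-encrypted and stored in a separate S3 object, then raw IDs are deleted.
mapping = {raw: pseudonymise(raw, salt) for raw in raw_ids}
```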
DCIR (ER tables)
Outpatient reimbursement data contain information about hospitalizations recorded “for information purposes”: those records are identified in the ER tables and deleted. Records flagged as “errors” are also deleted, and costs are formatted for later use.
PMSI (T_MCO, T_SSR, T_HAD, T_RIP tables)
In the PMSI hospital discharge data, some stays are not linked to the central patient table because the national ID (NIR) used as a key is missing. Those cases are deleted. For some major city hospitals, discharge data are sent twice: once by the individual sub-entity hospitals and once by the larger legal and administrative organisation regrouping them (such as the APHP for Paris hospitals). These duplicates are identified and removed according to the official documentation.
Discharge costs with an error code are deleted and only reimbursed stays are kept in the final dataset.
At the end of this stage, the schema stays intact while some tables and records are dropped or renamed. This output serves as input for the next step: merging the data into a usable model.
Step 2 - merge: from cleaned tables to data lake
The next stage of the pipeline consists of creating 2 linearised data models: every line of each table is an event, and each table is indexed with a unique patient identifier, without the need for a central patient or identifier table.
The first is the epidemiological model, used for answering most research questions requiring population phenotyping (demography, diagnoses, and procedure characteristics). The second is the medico-economic model, used for specific cost and healthcare consumption analyses. For example, it contains more detailed information about reimbursement rates, and its data quality assessment is focused on costs rather than medico-administrative information.
Epidemiologic model
A unique primary key is created for all sources. The PMSI and DCIR tables use their own specific keys and IDs to link the dimensions of records, ultimately joined with the central beneficiary table; a single shared key therefore avoids multiple joins and simplifies the merge. This key is created from the initial patient IDs, claim IDs and location IDs.
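The composite key described above can be sketched as a deterministic hash of the concatenated source identifiers. The following Python fragment is illustrative only (the pipeline is Scala/Spark); the parameter names and the use of SHA-256 are assumptions, not Victoria's actual implementation:

```python
import hashlib

def composite_key(patient_id: str, claim_id: str, location_id: str) -> str:
    # Deterministic surrogate key built from the initial patient, claim and
    # location IDs; hashing keeps the key a fixed width regardless of the
    # length of the underlying IDs. Field names are hypothetical.
    raw = f"{patient_id}|{claim_id}|{location_id}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Because the key is deterministic, the same record always maps to the same key across sources, which is what allows the merge without a central identifier table.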
The SNDS data model is largely redundant and follows fixed naming conventions, which allows records to be merged into common tables. The hospitalization tables, while coming from different sources (medicine, surgery, obstetrics, rehabilitative care, home hospitalization, psychiatry), share common information with the DCIR.
The native variable names can either be totally different (e.g. procedure codes are “CCAM_COD,” “CCAM_ACT” and “CDC_ACT” in the PMSI and “CAM_PRS_IDE” in the DCIR) or hold minor discrepancies such as uppercase/lowercase differences, underscore displacement or shorter naming conventions. Those differences exist between the DCIR and PMSI sources, and between the individual PMSI sources. This sub-step is in itself a strong justification for a pipeline, because different variable names for the same information make the data unusable for analyses that cross sources.
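A minimal sketch of this harmonisation sub-step, in Python for readability (the pipeline itself is Scala/Spark). The mapping is illustrative and far from exhaustive: only the procedure-code names come from the text above, and the harmonised name `procedure_code` is an assumption:

```python
# Source-specific names mapped to one harmonised name. Only the
# procedure-code entries are taken from the SNDS documentation quoted
# above; the harmonised target name is hypothetical.
HARMONISED = {
    "CCAM_COD": "procedure_code",     # PMSI
    "CCAM_ACT": "procedure_code",     # PMSI
    "CDC_ACT": "procedure_code",      # PMSI
    "CAM_PRS_IDE": "procedure_code",  # DCIR
}

def harmonise(record: dict) -> dict:
    # Each source is processed separately, so at most one of the synonymous
    # columns appears in a given record; unmapped columns are lower-cased
    # to smooth out the minor case discrepancies mentioned above.
    return {HARMONISED.get(col, col.lower()): val for col, val in record.items()}
```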
Intermediate merged tables: Data lake
Input data is merged by year and source, thus resulting in 2 groups of tables (“MergeDCIR” and “MergePMSI”) divided into years. This intermediate step ensures that data can be reviewed before being processed in the next steps. The final tables and their content are described below.
Table specific concatenation
The last sub-step consists of regrouping all data from the DCIR and PMSI into unique entity tables suitable for analysis.
Hospital diagnosis
In the PMSI, ICD-10 diagnoses are recorded for each patient stay. Their initial purpose is to generate a discharge code (GHM in French, similar to the Diagnosis Related Groups, DRGs, in the USA) based on a medico-economic algorithm: each part of a stay has an associated cost based on ICD-10 diagnoses, CCAM procedure codes, length of stay, and type of location. This algorithm differs between MCO (acute ward), HAD (home care) and SSR (rehabilitation). All ICD-10 diagnoses from the PMSI (all years and sources) are gathered into the final Diagnosis table. The year metadata and type of diagnosis (e.g. principal or associated diagnosis for MCO) are stored along with the codes. As hospital financing depends on those variables, coding rules are constantly changing and differ between sources (e.g. PMSI MCO and HAD); holding this information is paramount for analysis and interpretation.
Procedures
Procedure codes are stored as CCAM codes, a French-specific procedure classification [source]. The sources for those codes are the DCIR, MCO and SSR tables. Their stay (if available) and date of execution are computed and stored along with them.
Consults
The Consults table gathers both outpatient and inpatient consults, as well as procedures. It regroups NGAP codes and French national insurance internal codes. CCAM (complex procedure) codes recorded during consults are stored in the Procedures table.
The sources for consults are DCIR, MCO and SSR.
Death
A death date exists in multiple sources in the SNDS. Firstly, the CépiDC death registry stores the date of death and its cause. However, this source suffers from an important delay (4 years) and some linking issues: it is not considered a viable source for death status as per the official documentation [source]. In the PMSI, the death date is only known if death occurred during the beneficiary’s hospitalization. The DCIR contains the date of death in two cases: (1) when a national life insurance benefit (“capital décès”) is paid, and (2) when a reimbursement occurred after the beneficiary’s death.
As death dates can be incoherent between sources, the retained death date is the latest one, but only if the earliest and latest dates differ by less than 7 days. Death dates with major incoherences (>7 days) result in flagging the record as an “incoherent death” for later use.
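The reconciliation rule can be sketched as a small function (Python for readability; the behaviour at exactly 7 days is an assumption, since the text specifies "less than 7 days" for agreement and ">7 days" for incoherence):

```python
from datetime import date

def reconcile_death_dates(candidates):
    """Return (retained_date, incoherent_flag).

    Keep the latest death date when the earliest and latest candidate
    dates differ by less than 7 days; otherwise retain nothing and flag
    the record as an "incoherent death" for later use.
    """
    earliest, latest = min(candidates), max(candidates)
    if (latest - earliest).days < 7:
        return latest, False
    return None, True
```

For example, dates 2 days apart yield the latest date unflagged, while dates 19 days apart yield no retained date and the incoherence flag.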
Last contact date
This information does not exist natively in the SNDS and is created from the DCIR and PMSI tables. A job compares the latest dates of reimbursement (DCIR) or hospitalization (PMSI) and keeps the most recent one as the last contact date for each patient.
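A sketch of the job's per-patient logic (Python for readability; function and parameter names are hypothetical):

```python
from datetime import date

def last_contact_date(dcir_dates, pmsi_dates):
    # Last contact = most recent of the reimbursement dates (DCIR) and the
    # hospitalization dates (PMSI) for a given patient; None if the patient
    # has no recorded event in either source.
    candidates = list(dcir_dates) + list(pmsi_dates)
    return max(candidates) if candidates else None
```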
CMUc
The CMUc, replaced by the CSS since November 2019, is a French universal health coverage system aimed at disadvantaged individuals that covers all upfront health-related costs. As most healthcare expenses require an upfront payment that is later reimbursed by mutualized health insurance, CMUc status saves the patient from potentially unmanageable upfront costs. CMUc status is thus a proxy for social status.
The PMSI tables and DCIR are the 2 sources for this information.
Drug classification table and DDD reference
DDDs (defined daily doses) are a widely used tool in real world studies, defined by the WHO as the assumed average maintenance dose per day for a drug used for its main indication in adults. In order to use them natively in subsequent analyses, the Victoria pipeline integrates the official DDD values, linked to drugs by their ATC codes, and finally to each delivered drug via the CIP-ATC link in the IR_PHA table.
This information relies on the IR_PHA classification table and the updated DDDs provided by the WHO, previously integrated into an internal classification.
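The CIP → ATC → DDD linkage amounts to two lookups, sketched below in Python for readability. All code values are made up for illustration (they are not real CIP, ATC or DDD entries), and the table and function names are hypothetical:

```python
# Hypothetical extracts: IR_PHA links CIP codes (drug packaging IDs) to ATC
# codes, and the WHO reference links ATC codes to a (DDD, unit) pair.
# All values below are made up for illustration.
IR_PHA_CIP_TO_ATC = {"1111111111111": "X00XX00"}
WHO_ATC_TO_DDD = {"X00XX00": (10.0, "mg")}

def ddd_for_dispensation(cip_code):
    # Returns the (DDD, unit) pair for a delivered drug, or None when the
    # CIP code or its ATC class is unknown.
    atc = IR_PHA_CIP_TO_ATC.get(cip_code)
    return WHO_ATC_TO_DDD.get(atc) if atc else None
```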
Medico-economic model
Medico-economic tables
These are specific tables derived from the main tables through parallel processing and stored as transformed tables. Firstly, the PMSI and DCIR are matched through individual IDs, locations, dates, and discharge codes. Then PMSI and DCIR procedures, consults, hospitalizations, expensive medical devices, and their associated costs are processed into refined tables. A last step processes the cost of medicalised transport, which is joined with the other medico-economic tables. The methodologies used are the same as those used for the RAC database created by the DREES.18
Output model
The final data model is a linearised data lake containing one row per event. Four tables are common between the DCIR (outpatients) and the PMSI (inpatients): expensive medical devices, consults, biology procedures, and surgery/medical/imaging procedures. ICD-10 diagnoses and drug dispensations are 2 further tables, belonging to the inpatient and outpatient tables respectively.
Development and execution context
To ensure reviewability and maintainability, the entire Victoria pipeline is integrated into a Continuous Integration/Continuous Delivery (CI/CD) environment. GitLab version 15.8 is the software on which the CI/CD relies.
The entire pipeline follows the CD philosophy, with a master branch in production and feature/development branches merged into the main branch once they are fully reviewed and thoroughly tested.
The pipeline is executed once and all analyses are performed on the final resulting model. All derived variables are generated through another Scala/Python pipeline (ATLAS), not described in this article, and are then manipulated with commands calling Scala/Python jobs and their associated configuration files directly from the CI interface.
An index population table is used to specify sub-population characteristics for specific studies: sex, age, CMUc status, and an index event date.
Unit test and quality assessments
Standard unit tests are automatically executed at this step and are designed to ensure that functions and classes behave correctly at a low level. In contrast to unit tests, integrative quality assessments are manually executed on multiple generated synthetic datasets19 to check the output of each sub-step.
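As an illustration of the first kind of check, a low-level unit test might look like the following (Python for readability; the `format_cost` helper and the comma-decimal input format are assumptions for illustration, not Victoria's actual code):

```python
def format_cost(raw: str) -> float:
    # Hypothetical cleaning helper: French CSV exports commonly use a comma
    # as the decimal separator; normalise it before casting to float.
    return float(raw.replace(",", "."))

def test_format_cost():
    # Low-level check that the function behaves correctly on typical inputs
    assert format_cost("12,50") == 12.50
    assert format_cost("0,00") == 0.0

test_format_cost()
```

Integrative quality assessments, by contrast, run the full sub-step on synthetic data and inspect the resulting tables rather than a single function.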
Security and regulatory considerations
Victoria was developed in compliance with the French regulatory requirements regarding health data warehouses.20
The pipeline is executed inside a secured health data hosting service. Complex analyses are conducted inside containerised “isolated” environments distinct for each study with Jupyter notebooks.
Results
The pipeline was executed on 2 different datasets representing ∼85 000 and ∼870 000 beneficiaries respectively, with the following configuration: one master with 4 cores and 16 GB of RAM and, respectively, 4 and 6 workers with 4 cores and 16 GB of RAM. The total execution time was 25 h for the smaller dataset and 96 h for the larger one. The longest part of those times was the initial format conversion to Parquet (6 h 15 min and 11 h 30 min respectively). The clean step took only 4 h in both cases. The epidemiological model took 344 min for the smaller dataset and 1934 min for the larger one. The medico-economic model took the longest, with 704 min and 2145 min respectively.
Epidemiological model output.
Victoria’s use cases
As of March 2023, 2 studies using this pipeline have been published: Deharo et al.21 and Didier et al.,22 with respectively 47,000 and 30,000 patients included. Another use case for the Victoria pipeline was the matching of a heart failure registry.23
Conclusion
The Victoria pipeline is a successfully implemented attempt at developing an SNDS pipeline. Easy to use and to deploy, it was designed to be integrated into a modern analysis platform. Compared to previous pipelines, reviewability is natively part of its design, as unit tests and quality assessments can be developed to ensure data and analysis quality. The pipeline has been used for 3 published studies and more are under review. The model used is not standard, but recent work toward OMOP conversion will be integrated in upcoming versions. The impact of SNDS format and model changes on the output has not been evaluated yet, but as the CNAM actively documents any changes and Victoria is set to run on a CD platform, any evolution can be accommodated.
Discussion
Recent regulatory changes in France have both accelerated and strengthened access to real world data through the SNDS. The published and updated Health Data Warehouse legal framework opened a wider path for matching clinical data to the SNDS’s large observational database in a standardised fashion. The use cases are not limited to real world evidence generation: automatic ICD-10 coding24,25 and event detection26,27 are other apparent applications for such pipelines. Developing an easy-to-audit and reviewable pipeline such as Victoria is one of the very first steps to answering those challenges. Further development implies evaluating the pipeline against raw (SNDS) or standard model (OMOP) queries and against international data processing requirements, such as CDISC. Sensitivity analyses have yet to be performed to ensure that the pipeline can withstand model changes.
Footnotes
Author contributions
Kevin Ouazzani wrote the paper, Xavier Ansolabehere wrote the paper, Florence Journeau wrote the paper, Alexandre Vidal conceived and designed the pipeline, Nicolas Jaubourg contributed to the projects using the pipeline, Maxime Doublet contributed to the projects using the pipeline, Raphael Thollot conceived and designed the pipeline, Arnaud Fabre contributed to the projects using the pipeline, Nicolas Glatt conceived and designed the pipeline.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
