Abstract
Introduction
Real World Evidence (RWE) studies are used by policy makers and health technology assessment (HTA) bodies1–3 as a critical foundation in their decision processes. While the scope of RWE studies was initially limited to market access, feasibility and representativeness studies, they are now also considered for effectiveness and post-authorization safety studies.4 Consequently, a new methodological framework was needed to cover the increased heterogeneity of RWE study objectives and to enhance the understanding of the underlying data sources and models.5,6 In this context of evidence generation, there is a clear need for study reproducibility,7 data pipeline automation and scalability of the tools used. Common data models partially answer this need.8–11 However, they all come with implementation and maintenance costs, and with design choices that fit their respective goals, such as interoperability and standardization of concepts, to the detriment of user experience and flexibility when it comes to non-standard study designs.
The French national healthcare system claims database12 is a widely used data source for pharmaco-epidemiology and, more broadly, observational studies.13 It contains hospitalizations, medication dispensations, and claims data. Its relational model is snowflake-like: while this architecture allows for better memory space optimization, it implies designing complex and compute-intensive queries. Furthermore, this type of model is not well suited for distributed computing.
One attempt to address this last issue was led by Bacry et al. with SCALPEL3,14 an Apache Spark based open-source data pipeline framework designed specifically for the French national claims database, the “Système National des Données de Santé” (SNDS). While successfully implemented and used for evidence generation,15 according to its authors some parts of the initial data processing could be improved and require substantial resources, especially memory.
In this paper we describe the design and implementation of the Victoria pipeline, an empirically built data pipeline for French national healthcare system claims database (SNDS) extracts. Victoria was developed in order to:
- Provide a tool based on previously published work such as SCALPEL3.
- Build an automated, maintainable and scalable core pipeline that can support project-specific changes to the data model inherent to the use of large observational databases and their related studies.
- Deliver a documented pipeline with clear start-to-finish ETL (Extract-Transform-Load) processes, thus enabling scientific, epidemiological and quality reviews of those data manipulation processes.
- Be compliant with regulatory requirements regarding health data warehouses.
- Ease access to and comprehension of SNDS data via data visualization tools built on top of the pipeline (e.g., for feasibility studies).
Material and methods
This paper describes the SNDS data input, then the 2-step process of the Victoria pipeline and its final output.
SNDS data input
The SNDS data structure and vocabularies are described by the Health Data Hub.16 There are 3 main sources of information in the SNDS: (1) the French PMSI (“Programme de Médicalisation des Systèmes d'Information”), containing all the activity from public and private hospitals; (2) the DCIR (“Datamart de Consommation Inter Régime”), identifying all outpatient reimbursement records outside the hospital; and (3) the CépiDC (“Centre d'épidémiologie sur les causes médicales de Décès”), which holds records of deaths and their causes. In the SNDS model, each source is structured as a snowflake schema centered around a central beneficiary table, containing a unique ID for each individual and their demographic information. This model is referenced in the online documentation of the Health Data Hub.
As is, this schema is tedious to use for epidemiological purposes: the number of joins needed to query basic information linking the DCIR and PMSI sources can exceed 10 for a simple drug consumption question.17 Moreover, because it was designed from an accounting and medico-administrative standpoint, the expertise required to correctly identify the relevant tables, variables and vocabularies to answer a study question is another barrier to using it. Join queries on large non-indexed tables are known to be compute-intensive and therefore represent a non-negligible cost.
General design and requirements
The Victoria pipeline is written in Scala using the Spark framework. The input is raw SNDS data provided by the national health insurance (Caisse Nationale d’Assurance Maladie – CNAM) in CSV format, stored on an encrypted hard drive.
Victoria’s target output is a normalised, linearised and minimised data lake split into 29 different tables, as shown in Figure 1, where each line represents an event or a dispensation. It differs greatly from the original model, as tables are designed to be queried directly and to answer specific questions more easily and quickly.
Figure 1. General representation of the Victoria pipeline.
Step 1 - clean: from SNDS raw data to cleaned tables
Main functions in cleaning step.
At this step, every raw ID is hashed using the argon2 method. The resulting hashes are encoded in hexadecimal format. A mapping table linking raw IDs to the newly created ones is encrypted with the AES algorithm and stored in a different S3 object than the database. The initial raw IDs are then deleted from the database.
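The pseudonymisation step can be sketched as follows. This is a minimal Python illustration of the logic only (the pipeline itself is Scala/Spark and uses argon2; the standard library's memory-hard scrypt KDF stands in for argon2 here, the parameters are untuned, and the function and variable names are hypothetical):

```python
import hashlib
import secrets

def pseudonymise(raw_id: str, salt: bytes) -> str:
    # Victoria hashes raw IDs with argon2 and stores them hex-encoded;
    # scrypt (another memory-hard KDF) stands in for argon2 in this sketch.
    digest = hashlib.scrypt(raw_id.encode(), salt=salt, n=2**12, r=8, p=1)
    return digest.hex()

salt = secrets.token_bytes(16)
raw_ids = ["patient-001", "patient-002"]  # hypothetical raw IDs
# Mapping table linking raw IDs to the new ones: in Victoria this table is
# AES-encrypted and stored in a separate S3 object, then raw IDs are deleted.
mapping = {raw: pseudonymise(raw, salt) for raw in raw_ids}
```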
DCIR (ER tables)
Outpatient reimbursement data contain information about hospitalizations recorded “for information purposes”: those records are identified in the ER tables and deleted. Records flagged as “errors” are also deleted, and costs are formatted for later use.
PMSI (T_MCO, T_SSR, T_HAD, T_RIP tables)
In the PMSI hospital discharge data, some stays are not linked to the central patient table because the national ID (NIR) used as a key is missing. Those cases are deleted. For some major city hospitals, discharge data are sent twice: once by the individual sub-entity hospitals and once by the larger legal and administrative organisation regrouping them (such as the APHP for Paris hospitals). These duplicates are identified and removed according to the official documentation.
Discharge costs with an error code are deleted and only reimbursed stays are kept in the final dataset.
At the end of this stage, the schema stays intact while some tables and records are dropped or renamed. This output serves as input for the next step: merging the data into a usable model.
Step 2 - merge: from cleaned tables to data lake
The next stage of the pipeline consists of creating 2 linearised data models: every line of each table is an event, and each table is indexed with a unique patient identifier, without the need for a central patient or identifier table.
The first is the epidemiological model, used for answering most research questions requiring population phenotyping (demography, diagnoses, and procedure characteristics). The second is the medico-economic model, used for specific cost and healthcare consumption analyses. For example, it contains more detailed information about reimbursement rates, and its data quality assessment is focused on costs rather than medico-administrative information.
Epidemiologic model
A unique primary key is created for all sources. The PMSI and DCIR tables use their own specific keys and IDs to link the dimensions of records, ultimately joined with the central beneficiary table; a single shared key therefore avoids multiple joins and simplifies the merge. This key is created from the initial patient IDs, claim IDs and location IDs.
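The composite key described above can be sketched as a deterministic hash of the concatenated source identifiers. The following Python fragment is illustrative only (the pipeline is Scala/Spark); the parameter names and the use of SHA-256 are assumptions, not Victoria's actual implementation:

```python
import hashlib

def composite_key(patient_id: str, claim_id: str, location_id: str) -> str:
    # Deterministic surrogate key built from the initial patient, claim and
    # location IDs; hashing keeps the key a fixed width regardless of the
    # length of the underlying IDs. Field names are hypothetical.
    raw = f"{patient_id}|{claim_id}|{location_id}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Because the key is deterministic, the same record always maps to the same key across sources, which is what allows the merge without a central identifier table.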
The SNDS data model is largely redundant and follows fixed naming conventions, which allows records to be merged into common tables. The hospitalization tables, while coming from different sources (medicine, surgery, obstetrics, rehabilitative care, home hospitalization, psychiatry), share common information with the DCIR.
The native variable names can either be totally different (e.g. procedure codes are “CCAM_COD,” “CCAM_ACT” and “CDC_ACT” in the PMSI and “CAM_PRS_IDE” in the DCIR) or hold minor discrepancies such as uppercase/lowercase differences, underscore displacement or shorter naming conventions. Those differences exist between the DCIR and PMSI sources, and between the individual PMSI sources. This sub-step is in itself a strong justification for a pipeline, because different variable names for the same information make the data unusable for analyses that cross sources.
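A minimal sketch of this harmonisation sub-step, in Python for readability (the pipeline itself is Scala/Spark). The mapping is illustrative and far from exhaustive: only the procedure-code names come from the text above, and the harmonised name `procedure_code` is an assumption:

```python
# Source-specific names mapped to one harmonised name. Only the
# procedure-code entries are taken from the SNDS documentation quoted
# above; the harmonised target name is hypothetical.
HARMONISED = {
    "CCAM_COD": "procedure_code",     # PMSI
    "CCAM_ACT": "procedure_code",     # PMSI
    "CDC_ACT": "procedure_code",      # PMSI
    "CAM_PRS_IDE": "procedure_code",  # DCIR
}

def harmonise(record: dict) -> dict:
    # Each source is processed separately, so at most one of the synonymous
    # columns appears in a given record; unmapped columns are lower-cased
    # to smooth out the minor case discrepancies mentioned above.
    return {HARMONISED.get(col, col.lower()): val for col, val in record.items()}
```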
Intermediate merged tables: Data lake
Input data is merged by year and source, thus resulting in 2 groups of tables (“MergeDCIR” and “MergePMSI”) divided into years. This intermediate step ensures that data can be reviewed before being processed in the next steps. The final tables and their content are described below.
Table specific concatenation
The last sub-step consists of regrouping all data from the DCIR and PMSI into unique entity tables suitable for analysis.
Hospital diagnosis
In the PMSI, ICD-10 diagnoses are recorded for each patient stay. Their initial purpose is to generate a discharge code (GHM in French, similar to the Diagnosis Related Groups, DRGs, in the USA) based on a medico-economic algorithm: each part of a stay has an associated cost based on ICD-10 diagnoses, CCAM procedure codes, length of stay, and type of location. This algorithm differs between MCO (acute ward), HAD (home care) and SSR (rehabilitation). All ICD-10 diagnoses from the PMSI (all years and sources) are gathered into the final Diagnosis table. The year metadata and type of diagnosis (e.g. principal or associated diagnosis for MCO) are stored along with the codes. As hospital financing depends on those variables, coding rules are constantly changing and differ between sources (e.g. PMSI MCO and HAD); holding this information is paramount for analysis and interpretation.
Procedures
Procedure codes are stored as CCAM codes, a French-specific procedure classification [source]. The sources for those codes are the DCIR, MCO and SSR tables. Their stay (if available) and date of execution are computed and stored along with them.
Consults
The Consults table gathers both outpatient and inpatient consults, as well as procedures. It regroups NGAP codes and French national insurance internal codes. CCAM (complex procedure) codes recorded during consults are stored in the Procedures table.
The sources for consults are DCIR, MCO and SSR.
Death
A death date exists in multiple sources in the SNDS. Firstly, the CépiDC death registry stores the date of death and its cause. However, this source suffers from an important delay (4 years) and some linking issues: it is not considered a viable source for death status as per the official documentation [source]. In the PMSI, the death date is only known if death occurred during the beneficiary’s hospitalization. The DCIR contains the date of death in two cases: (1) when a national life insurance benefit (“capital décès”) is paid, and (2) when a reimbursement occurred after the beneficiary’s death.
As death dates can be incoherent between sources, the retained death date is the latest one, but only if the earliest and latest dates differ by less than 7 days. Death dates with major incoherences (>7 days) result in flagging the record as an “incoherent death” for later use.
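The reconciliation rule can be sketched as a small function (Python for readability; the behaviour at exactly 7 days is an assumption, since the text specifies "less than 7 days" for agreement and ">7 days" for incoherence):

```python
from datetime import date

def reconcile_death_dates(candidates):
    """Return (retained_date, incoherent_flag).

    Keep the latest death date when the earliest and latest candidate
    dates differ by less than 7 days; otherwise retain nothing and flag
    the record as an "incoherent death" for later use.
    """
    earliest, latest = min(candidates), max(candidates)
    if (latest - earliest).days < 7:
        return latest, False
    return None, True
```

For example, dates 2 days apart yield the latest date unflagged, while dates 19 days apart yield no retained date and the incoherence flag.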
Last contact date
This information does not exist natively in the SNDS and is created from the DCIR and PMSI tables. A job compares the latest dates of reimbursement (DCIR) or hospitalization (PMSI) and keeps the most recent one as the last contact date for each patient.
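A sketch of the job's per-patient logic (Python for readability; function and parameter names are hypothetical):

```python
from datetime import date

def last_contact_date(dcir_dates, pmsi_dates):
    # Last contact = most recent of the reimbursement dates (DCIR) and the
    # hospitalization dates (PMSI) for a given patient; None if the patient
    # has no recorded event in either source.
    candidates = list(dcir_dates) + list(pmsi_dates)
    return max(candidates) if candidates else None
```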
CMUc
The CMUc, replaced by the CSS since November 2019, is a French universal health coverage system aimed at disadvantaged individuals that covers all upfront health-related costs. As most healthcare expenses require an upfront payment that is later reimbursed by mutualized health insurance, CMUc status saves the patient from potentially unmanageable upfront costs. CMUc status is thus a proxy for social status.
The PMSI tables and DCIR are the 2 sources for this information.
Drug classification table and DDD reference
DDDs (defined daily doses) are a widely used tool in real world studies, defined by the WHO as the assumed average maintenance dose per day for a drug used for its main indication in adults. In order to use them natively in subsequent analyses, the Victoria pipeline integrates the official DDD values, linked to drugs by their ATC codes, and finally to each delivered drug via the CIP-ATC link in the IR_PHA table.
This information relies on the IR_PHA classification table and the updated DDDs provided by the WHO, previously integrated into an internal classification.
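The CIP → ATC → DDD linkage amounts to two lookups, sketched below in Python for readability. All code values are made up for illustration (they are not real CIP, ATC or DDD entries), and the table and function names are hypothetical:

```python
# Hypothetical extracts: IR_PHA links CIP codes (drug packaging IDs) to ATC
# codes, and the WHO reference links ATC codes to a (DDD, unit) pair.
# All values below are made up for illustration.
IR_PHA_CIP_TO_ATC = {"1111111111111": "X00XX00"}
WHO_ATC_TO_DDD = {"X00XX00": (10.0, "mg")}

def ddd_for_dispensation(cip_code):
    # Returns the (DDD, unit) pair for a delivered drug, or None when the
    # CIP code or its ATC class is unknown.
    atc = IR_PHA_CIP_TO_ATC.get(cip_code)
    return WHO_ATC_TO_DDD.get(atc) if atc else None
```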
Medico-economic model
Medico-economic tables
These are specific tables derived from the main tables through parallel processing and stored as transformed tables. Firstly, the PMSI and DCIR are matched through individual IDs, locations, dates, and discharge codes. Then PMSI and DCIR procedures, consults, hospitalizations, expensive medical devices, and their associated costs are processed into refined tables. A last step processes the cost of medicalised transport, which is joined with the other medico-economic tables. The methodologies used are the same as those used for the RAC database created by the DREES.18
Output model
The final data model is a linearised data lake containing one row per event. Four tables are common between the DCIR (outpatients) and the PMSI (inpatients): expensive medical devices, consults, biology procedures, and surgery/medical/imaging procedures. ICD-10 diagnoses and drug dispensations are 2 further tables, belonging to the inpatient and outpatient tables respectively.
Development and execution context
To ensure reviewability and maintainability, the entire Victoria pipeline is integrated into a Continuous Integration/Continuous Delivery (CI/CD) environment. GitLab version 15.8 is the software on which the CI/CD relies.
The entire pipeline follows the CD philosophy, with a master branch in production and feature/development branches merged into the main branch once they are fully reviewed and thoroughly tested.
The pipeline is executed once and all analyses are performed on the final resulting model. All derived variables are generated through another Scala/Python pipeline (ATLAS), not described in this article, and are then manipulated with commands calling Scala/Python jobs and their associated configuration files directly from the CI interface.
An index population table is used to specify sub-population characteristics for specific studies: sex, age, CMUc status, and an index event date.
Unit test and quality assessments
Standard unit tests are automatically executed at this step and are designed to ensure that functions and classes behave correctly at a low level. In contrast to unit tests, integrative quality assessments are manually executed on multiple generated synthetic datasets19 to check the output of each sub-step.
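As an illustration of the first kind of check, a low-level unit test might look like the following (Python for readability; the `format_cost` helper and the comma-decimal input format are assumptions for illustration, not Victoria's actual code):

```python
def format_cost(raw: str) -> float:
    # Hypothetical cleaning helper: French CSV exports commonly use a comma
    # as the decimal separator; normalise it before casting to float.
    return float(raw.replace(",", "."))

def test_format_cost():
    # Low-level check that the function behaves correctly on typical inputs
    assert format_cost("12,50") == 12.50
    assert format_cost("0,00") == 0.0

test_format_cost()
```

Integrative quality assessments, by contrast, run the full sub-step on synthetic data and inspect the resulting tables rather than a single function.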
Security and regulatory considerations
Victoria was developed in compliance with the French regulatory requirements regarding health data warehouses.20
The pipeline is executed inside a secured health data hosting service. Complex analyses are conducted inside containerised “isolated” environments distinct for each study with Jupyter notebooks.
Results
The pipeline was executed on 2 different datasets representing ∼85 000 and ∼870 000 beneficiaries respectively, with the following configuration: one master with 4 cores and 16 GB of RAM and, respectively, 4 and 6 workers with 4 cores and 16 GB of RAM. The total execution time was 25 h for the smaller dataset and 96 h for the larger one. The longest part of those times was the initial format conversion to Parquet (6 h 15 min and 11 h 30 min respectively). The clean step took only 4 h in both cases. The epidemiological model took 344 min for the smaller dataset and 1934 min for the larger one. The medico-economic model took the longest, with 704 min and 2145 min respectively.
Epidemiological model output.
Victoria’s use cases
As of March 2023, 2 studies using this pipeline have been published: Deharo et al.21 and Didier et al.,22 with respectively 47,000 and 30,000 patients included. Another use case for the Victoria pipeline was the matching of a heart failure registry.23
Conclusion
The Victoria pipeline is a successfully implemented attempt at developing an SNDS pipeline. Easy to use and to deploy, it was designed to be integrated into a modern analysis platform. Compared to previous pipelines, reviewability is natively part of its design, as unit tests and quality assessments can be developed to ensure data and analysis quality. The pipeline has been used for 3 published studies and more are under review. The model used is not standard, but recent work toward OMOP conversion will be integrated in upcoming versions. The impact of SNDS format and model changes on the output has not been evaluated yet, but as the CNAM actively documents any changes and Victoria is set to run on a CD platform, any evolution can be accommodated.
Discussion
Recent regulatory changes in France have both accelerated and strengthened access to real world data through the SNDS. The published and updated Health Data Warehouse legal framework opened a wider path for matching clinical data to the SNDS’s large observational database in a standardised fashion. The use cases are not limited to real world evidence generation: automatic ICD-10 coding24,25 and event detection26,27 are other apparent applications for such pipelines. Developing an easy-to-audit and reviewable pipeline such as Victoria is one of the very first steps to answering those challenges. Further development implies evaluating the pipeline against raw (SNDS) or standard model (OMOP) queries and against international data processing requirements, such as CDISC. Sensitivity analyses have yet to be performed to ensure that the pipeline can withstand model changes.
Footnotes
Author contributions
Kevin Ouazzani wrote the paper, Xavier Ansolabehere wrote the paper, Florence Journeau wrote the paper, Alexandre Vidal conceived and designed the pipeline, Nicolas Jaubourg contributed to the projects using the pipeline, Maxime Doublet contributed to the projects using the pipeline, Raphael Thollot conceived and designed the pipeline, Arnaud Fabre contributed to the projects using the pipeline, Nicolas Glatt conceived and designed the pipeline.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
