Abstract
Objective
Unlocking the potential of routine medical data for clinical research requires the analysis of data from multiple healthcare institutions. However, under German data protection regulations, data often cannot leave the individual institutions, so decentralized approaches are needed. Decentralized studies face challenges regarding coordination, technical infrastructure, interoperability and regulatory compliance. Rare diseases are an important prototype research focus for decentralized data analyses, as patients are rare by definition and adequate cohort sizes can only be reached if data from multiple sites is combined.
Methods
Within the project “Collaboration on Rare Diseases”, decentralized studies focusing on four rare diseases (cystic fibrosis, phenylketonuria, Kawasaki disease, multisystem inflammatory syndrome in children) were conducted at 17 German university hospitals. For this purpose, a data management process for decentralized studies was developed by an interdisciplinary team of experts from medicine, public health and data science. Along the process, lessons learned were formulated and discussed.
Results
The process consists of eight steps and includes sub-processes for the definition of medical use cases, script development and data management. The lessons learned include on the one hand the organization and administration of the studies (collaboration of experts, use of standardized forms and publication of project information), and on the other hand the development of scripts and analysis (dependency on the database, use of standards and open source tools, feedback loops, anonymization).
Conclusions
This work captures central challenges and describes possible solutions. It can hence serve as a solid basis for the implementation and conduct of similar decentralized studies.
Keywords
Introduction
The continuous development and improvement of technologies and data analysis methods have had a significant impact on healthcare in recent years.
These developments include the secondary use of data for medical studies, i.e. the use of data that was primarily collected for another purpose 1 (e.g. documentation of care, billing in the hospital information system). Secondary use of data opens up the possibility of generating new data-driven medical insights and is a viable alternative to time-consuming and cost-intensive clinical studies. For example, data collected for documentation and billing in the hospital information system can be reused for analyses in the context of a medical study. 2
Within the framework of secondary use of patient data, centralized or decentralized studies can be conducted (see Figure 1). They refer to research models in which data collection and decision-making are conducted across multiple locations. In the centralized approach, raw or pseudonymized data from different sites is transferred to a trusted location. This location (see Figure 1, site A) aggregates the data and executes the analysis scripts to generate results. In contrast, with the decentralized approach, the data is analyzed locally by running centrally designed analysis scripts at each participating institution and returning only the (anonymous) statistical results. As a consequence, sensitive patient data remains at the sites but can still be used for research in a privacy-preserving manner. 3

Figure 1. Centralized vs. decentralized analysis.
Decentralized studies based on secondary use of patient data offer great potential, especially for rare diseases. As the patients are rare by definition, it is necessary to aggregate the data from as many study participants as possible in compliance with data protection principles. The principles are upheld if the patients cannot be identified, i.e. remain anonymous. By merging only aggregated data, which contains the results of the analysis scripts and not the actual patient data, it is not possible to identify individuals. Data science and privacy-preserving methods thus offer as yet untapped potential to improve medical decision-making and to minimize the need for expensive clinical trials.
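The decentralized pattern described above can be illustrated with a minimal Python sketch. It is not the project's code (the CORD-MI scripts were written in R); the function names, the `age_group` field and the toy records are assumptions made purely for illustration. Each site runs the same centrally designed function on its local records and returns only a table of counts, which the coordinating site then sums.

```python
from collections import Counter

def local_analysis(records):
    """Runs inside one site: returns only counts per age group, never the records."""
    return Counter(r["age_group"] for r in records)

def aggregate(site_results):
    """Runs at the coordinating site: sums the count tables returned by the sites."""
    total = Counter()
    for counts in site_results:
        total.update(counts)
    return total

# Hypothetical toy records; in a real study they would stay inside each hospital.
site_a = [{"age_group": "0-17"}, {"age_group": "0-17"}, {"age_group": "18+"}]
site_b = [{"age_group": "18+"}, {"age_group": "18+"}]

combined = aggregate([local_analysis(site_a), local_analysis(site_b)])
print(dict(combined))
```

Only the two small count tables cross institutional boundaries in this sketch. Whether such tables are actually anonymous still depends on the cell sizes, which is why disclosure control has to be applied before results leave a site.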
While decentralized studies based on secondary use of patient data offer several benefits, such as an enlarged pool of potential participants, increased diversity in participation due to widespread geographical involvement and the absence of direct patient engagement, 4 they also face certain challenges.
To meet these challenges, decentralized studies need to be designed and conducted with a strong focus on quality. The aim of this paper is to describe the process ensuring high-quality decentralized studies, by pinpointing the lessons learned while conducting research on rare diseases in Germany.
State of the art
Medical Informatics Initiative: German research infrastructure
The German Medical Informatics Initiative (MII) aimed to bring significant enhancements in medical research and healthcare in Germany by harnessing digital technologies and data, and by devising innovative solutions to health-related challenges. Supported by the German Federal Ministry of Education and Research, all university hospitals in Germany have received funding to advance the objectives of the MII.6,7 These objectives encompass the preparation of data for medical decision support and scientific research, and the reinforcement of Germany's position as a scientific hub in the domain of Clinical Data Science. Although the four consortia – DIFUTURE, 8 HiGHmed, 9 MIRACUM 10 and SMITH 11 – adopted distinct approaches, they form a common infrastructure based on the following key components:
Each university hospital established a All consortia pledged their commitment to the The To request research data through the portal, a nine-step process must be followed: 17 (1) registration in FDPG, (2) feasibility request, (3) application for data provision or distributed analysis, (4) decision of the Use and Access Committee (UAC) at each site, (5) data use contract, (6) publication in the FDPG project registry, (7) use of the requested data or analysis results, (8) publication of the results in FDPG, (9) listing of the scientific publications in FDPG. At each local site, the A The provision of data or analysis results is done by the Analysis scripts are required to conduct distributed analyses. These can be developed by either the researcher or the data management center and must be based on the MII-CDS. The data use request and contract based on the standardized form of the MII are concluded and the developed scripts are distributed.
To test the established infrastructure and the developed methods, cross-consortium use cases were initiated. One of these use cases was “Collaboration on Rare Diseases” (CORD-MI), which focuses explicitly on the field of rare diseases.
CORD-MI: Retrospective studies on the example of four rare diseases
In the European Union, a disease is considered rare if it affects fewer than 1 in 2000 people;18–20 in the United States, if fewer than 200,000 people are affected. It is assumed that there are more than 7000 distinct clinical entities18,21,22 and that approximately 3.5–5.9% of the world population, i.e. approximately 263–446 million people worldwide, suffer from a rare disease. 23 Rare diseases have in common that they are often life-threatening or have a chronic course with negative effects on the quality of life and life expectancy.24,25 They represent major challenges for patients and their families as well as for the health care system20,21 due to the accompanying high morbidity. Consequently, the healthcare utilization of children and adults with rare diseases causes enormous costs. 26
CORD-MI, which involved a total of 24 German university hospitals, pursued four goals: (1) improving the visibility of rare diseases, (2) providing insights into the reality of care, (3) improving the quality of patient care and (4) strengthening research in the field of rare diseases. 27 As part of this, decentralized studies were defined and conducted in the project network.
Different rare diseases were selected as prototypes for those decentralized studies in order to take into account both the spectrum of different disease patterns and the spectrum of different data (also with regard to availability, quality and modeling). The following diseases served as examples: (1) cystic fibrosis (CF), (2) phenylketonuria (PKU) and (3) Kawasaki disease and multisystem inflammatory syndrome in children (MIS-C).
Method
As part of the CORD-MI project, the process of decentralized studies had to adhere to the requirements of the MII and the aforementioned infrastructure. These requirements encompass both administrative and technical aspects. For example, privacy concepts and study protocols for legal and ethical review had to be described in detail, and administrative steps had to be followed, especially in the context of the FDPG. In addition, analysis scripts had to be developed based on FHIR profiles compliant with the MII-CDS, and the data management office had to ensure that the aggregation of the resulting data followed the previously defined data protection-compliant procedure and contributed to answering the research questions. Based on these requirements and a thematic grouping, specific procedural steps were formulated within the expert group.
For conducting the decentralized studies, an interdisciplinary team of experts in medicine, public health and data science developed and went through a process, which is shown in Figure 2. The eight steps, thematically bundled into three categories, are described below.

Figure 2. Coherent process (steps 1–7) for decentralized studies on the example of rare diseases in Germany.
We conducted retrospective, observational, descriptive studies. These decentralized studies involved multiple sites (cross-sectional). Data from the clinical information systems (prepared by the data integration centers) was used. This data was originally gathered to document patient conditions and medical interventions and to facilitate the billing of healthcare services. The dataset encompasses various elements such as demographic details, diagnoses, procedures and laboratory results. The studies were part of the three-year CORD-MI project and were conducted within 13 months (18 November 2022 to 15 December 2023).
Medical use cases
The decentralized studies served to answer medical questions (see Table 1).
Table 1. Research questions of the medical use cases.
CF: cystic fibrosis; PKU: phenylketonuria; MIS-C: multisystem inflammatory syndrome in children; COVID-19: coronavirus disease 2019.
Medical experts, together with experts in public health, wrote study protocols.
Before conducting a study, it is beneficial to review the quality of the underlying data.
Scripts
Data scientists formalized the information from the study protocols in order to define the target set of scripts to be developed and to determine the data elements needed.
An interdisciplinary team developed the scripts, discussed their development, collected feedback and made adjustments and improvements. The scripts were developed to process the MII-CDS using R.
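To make the script-development step more concrete, the following Python sketch shows the kind of flattening that analysis scripts typically perform on FHIR data before any statistics are computed. It is an illustration only: the study scripts were written in R, and the function name, the sample bundle and the code-system URL used here are assumptions, not taken from the project. The sketch relies only on generic FHIR structure (a Bundle with `entry[].resource`, and a Condition with `subject.reference` and `code.coding[]`).

```python
# Assumed code-system URL for ICD-10-GM; verify against the profiles in use.
ICD10GM = "http://fhir.de/CodeSystem/bfarm/icd-10-gm"

def flatten_conditions(bundle):
    """Turn the Condition resources of a FHIR Bundle into flat rows."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Condition":
            continue  # skip Patient, Encounter and other resource types
        patient = resource.get("subject", {}).get("reference")
        for coding in resource.get("code", {}).get("coding", []):
            rows.append({
                "patient": patient,
                "system": coding.get("system"),
                "code": coding.get("code"),
            })
    return rows

# Hypothetical minimal bundle with one Patient and one Condition.
example_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {
            "resourceType": "Condition",
            "subject": {"reference": "Patient/p1"},
            "code": {"coding": [{"system": ICD10GM, "code": "E84.0"}]},
        }},
    ],
}

rows = flatten_conditions(example_bundle)
```

A flat table of this kind is what the later counting and masking steps operate on. Local differences in how sites populate optional FHIR elements are one reason such scripts need the defensive `.get(...)` style shown here.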
To protect privacy in the resulting statistical tables, cell suppression based on a frequency threshold rule with a threshold of 5 was implemented. 41 Thus, (absolute) results larger than zero and smaller than five were masked with the category “<5”. Given the implementation of the frequency threshold rule and the exclusive transmission of anonymous results data from the sites for analysis, patient consent was not required.
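The frequency threshold rule described above can be sketched in a few lines of Python. This is a hedged illustration rather than the project's R implementation; the function names and the example table are hypothetical. The second function is an addition for illustration: it shows which bounds on a total can still be derived once some cells have been masked.

```python
def mask_small_counts(table, threshold=5):
    """Replace counts larger than zero and smaller than the threshold with '<threshold'."""
    label = f"<{threshold}"
    return {key: (label if 0 < n < threshold else n) for key, n in table.items()}

def total_bounds(cells, threshold=5):
    """Lower and upper bound of a sum over cells that may be masked."""
    low = high = 0
    for cell in cells:
        if isinstance(cell, str):
            # A masked cell hides a true value between 1 and threshold - 1.
            low += 1
            high += threshold - 1
        else:
            low += cell
            high += cell
    return low, high

# Hypothetical result table of one site.
site_table = {"female": 12, "male": 3, "unknown": 0}
masked = mask_small_counts(site_table)
bounds = total_bounds(masked.values())
```

In this example the site total can only be stated as lying between 13 and 16, which illustrates how masking trades precision for privacy when case numbers are small.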
The scripts are available in a separate Git repository (see Availability of data and materials).
Data management
The medical use cases were defined and submitted individually as projects to the FDPG and published on the FDPG website.
The information (including project goal, project description and required data) was forwarded to the transfer offices of the data integration centers.
The individual study sites executed the scripts and provided the aggregated data to the data management center.
The data from the three medical use cases, both the results of the individual sites and the data aggregated across sites, were forwarded to the data recipients / medical experts and archived.
Formulation of lessons learned
Looking back on the project, the interdisciplinary team of experts in medicine, public health and data science formulated the lessons learned that they considered most important.
Results
By following the process described in the Methods section, we were able to extract, analyze and aggregate data for the three use cases. Seventeen sites participated in the first two studies of (1) CF and pregnancy or delivery and (2) PKU, comorbidities and pregnancy or delivery; 14 sites participated in the third study of (3) Kawasaki disease and MIS-C.
Lessons learned
Along the process, the following lessons were learned:
(LL1) Collaboration of experts in medicine, public health and data science; (LL2) Attention to the database (existence of data); (LL3) Use of standards; (LL4) Use of open source tools; (LL5) Iterative process with feedback loops; (LL6) Implementation of frequency threshold rule; (LL7) Use of standardized forms; (LL8) Publicly available information on the project.
(LL1): For the smooth conduct of decentralized studies, collaboration between experts is crucial. This requires a high level of understanding of each other's professions (e.g. regarding the use of technical terms or the level of detail of the information provided). By sharing a common vision and goal, namely to improve research and care for patients with rare diseases, all participants were able to contribute their experience to expand the common knowledge.
(LL2): Understanding the origin of the data is equally crucial. Secondary data research involves the use of data that was primarily collected for a different purpose. For example, billing-related data from hospital information systems, prepared by data integration centers, was used for the studies described. The “Health Care Process Bias”, 46 which refers to the potential distortion of medical reality within data resulting from billing guidelines, poses several challenges and should be considered when structuring decentralized studies that rely on secondary use of patient data. As a consequence of this bias, we could not use the codes of the International Statistical Classification of Diseases and Related Health Problems, 10th Revision, German Modification (ICD-10-GM) on a case basis, but needed to define a patient with a rare disease on a patient basis, integrating information from several inpatient stays and outpatient visits. Also, not all study sites were able to answer all research questions. For example, only 9 of 14 sites were able to provide information on the research question “Does the frequency of Kawasaki Disease or MIS-C differ regionally in Germany?” (see Table 1, research question 3c). This is because not all data integration centers were able or allowed to integrate the zip codes of their patients' addresses, although this information is part of the MII-CDS. Thus, the associated data protection concepts should be taken into account when formulating the study protocol or developing the analysis scripts. On the one hand, it should be considered which data is available at all; on the other hand, the required data elements should be described as precisely as possible in order to facilitate the later data provision. Likewise, the analysis methods, both in terms of data retrieval and statistical data evaluation, should be described in advance in order to facilitate the later script development.
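The difference between a case-based and a patient-based definition can be made concrete with a short Python sketch. It is illustrative only: the record layout, the function names and the use of the code prefix `E84` (cystic fibrosis in ICD-10) are assumptions and are not taken from the study scripts.

```python
def cases_with_code(cases, prefix):
    """Case-based view: every stay or visit carrying a matching code is counted."""
    return [c for c in cases if any(code.startswith(prefix) for code in c["icd_codes"])]

def patients_with_code(cases, prefix):
    """Patient-based view: a patient counts once if any stay or visit matches."""
    return {c["patient_id"] for c in cases_with_code(cases, prefix)}

# Hypothetical case-level records: patient p1 has one inpatient stay and one
# outpatient visit with a cystic fibrosis code, patient p2 has neither.
cases = [
    {"patient_id": "p1", "case_id": "c1", "setting": "inpatient", "icd_codes": ["E84.0", "J15.1"]},
    {"patient_id": "p1", "case_id": "c2", "setting": "outpatient", "icd_codes": ["E84.1"]},
    {"patient_id": "p2", "case_id": "c3", "setting": "inpatient", "icd_codes": ["J45.0"]},
]

n_cases = len(cases_with_code(cases, "E84"))
n_patients = len(patients_with_code(cases, "E84"))
```

Here two matching cases collapse into one patient. Counting cases would therefore overstate the cohort, which is particularly relevant for chronic rare diseases with repeated hospital contacts.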
(LL3): The use of standards – terminologies and classifications as well as dataset descriptions and exchange formats – facilitates both the unique naming of data elements and the development of scripts. Although standards (especially FHIR) were used, the individual interpretation of each study site may vary. Thus, the development team discussed individual local peculiarities after trying to execute the scripts in many feedback loops. As a result, the scripts now take into account some exceptions. This was labor-intensive and time-consuming. In addition, the scripts had to be limited to ICD-10-GM, because the study sites have different levels of availability of ORPHAcodes. However, these are essential for unambiguous naming of the individual rare diseases.
(LL4): The use of open source tools, here R and fhircrackr, facilitates the collaborative development and delivery of the scripts. It should also facilitate the execution of the decentralized study, as the study sites have to run the scripts in their own infrastructure. However, there are concerns and challenges with running third-party software. Some study sites performed all data management and analysis in an environment classified as critical infrastructure. Therefore, time-consuming IT security approval processes were required if novel open source software had to be used. For an individual study, it is challenging and not always feasible to invest the time and resources for these strict approval processes. As a result, individual study sites might require custom solutions or be unable to participate in an analysis.
(LL5): The iterative process allowed the scripts to be constantly improved. The study sites reported potential improvements or problems after attempting to execute the scripts. These were then discussed and resolved together. In this way, it was ensured that all study sites that wanted to participate in the studies were able to do so. Due to the large number of study sites, the feedback loops were very labor-intensive and time-consuming.
(LL6): The masking of small results (with “<5”) protects privacy but limits the interpretation of findings. Especially in the context of rare diseases, even small numbers of cases are important and can make a big difference. Therefore, more complex privacy-preserving methods, 47 such as Secure Multi-Party Computation (SMPC), can be used. SMPC refers to a set of cryptographic methods that allow several parties to calculate a joint result without sharing their respective input data. One group of SMPC methods are so-called “secure sum” protocols, in which data from three or more parties is aggregated without revealing the inputs of the individual parties. Secure sum protocols are of particular use when researching rare diseases, since they can help to increase sample sizes. Several tools support the practical execution of secure sum protocols (see e.g. EasySMPC 48 ). However, SMPC methods only protect the input data, while the output of an SMPC protocol can still contain personal data.
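The idea behind a secure sum protocol can be sketched with additive secret sharing. The Python example below is a single-process simulation under simplifying assumptions (honest parties, no network, a hypothetical modulus); it is not the protocol implemented by EasySMPC or any other specific tool, and all names are illustrative.

```python
import secrets

MODULUS = 2**61 - 1  # assumed to be larger than any possible total

def make_shares(value, n_parties):
    """Split a value into n random shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(inputs):
    """Simulate a secure sum among len(inputs) parties."""
    n = len(inputs)
    # shares[i][j] is the share that party i sends to party j.
    shares = [make_shares(value, n) for value in inputs]
    # Each party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # The published partial sums reveal the total, but no individual input.
    return sum(partial_sums) % MODULUS

# Hypothetical case counts at three sites, each below the masking threshold.
site_counts = [3, 0, 4]
total = secure_sum(site_counts)
```

Each single share is a uniformly random number and therefore reveals nothing about the count it belongs to, so counts below the masking threshold can contribute to the total without being disclosed. As noted above, the resulting total itself may still need disclosure control.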
(LL7): Standardized forms, mutually agreed upon in the MII and accepted by each study site, were used to request and release data. They provide a secure legal framework for data use. Nevertheless, the bureaucratic effort was very time-consuming, which means that the contract process should be initiated early so that it does not become a showstopper. At every university hospital, individual ethics votes and votes of the UAC had to be obtained due to the individual data protection laws of the federal states and different regulations of the university hospitals. In future, a lead vote by a local ethics committee will suffice to speed up the process.
(LL8): A project has been created for each study on the FDPG website. There, all interested parties can read about the details and progress of the project. The publication of project information and protocols creates transparency.
Discussion
We were able to successfully conduct decentralized studies on three medical use cases in the context of rare diseases on the basis of secondary use of patient data. Thus, the three studies are the first successfully performed studies in the context of the productive start-up of the FDPG in the German MII. For this purpose, we went through a process consisting of eight steps, ranging from the definition of medical use cases and the development of scripts to data management, analysis and archiving. It led to results for all three use cases, which demonstrates the usability of the process in practice.
Furthermore, the successful implementation shows that secondary use of patient data can be used for investigating medical issues as part of decentralized studies. It successfully enabled the expansion of the data pool. However, limitations may be noted due to the divergent purposes of data collection and to the lack of data.
The process presented is aligned with the individual steps of an Observational Health Data Science and Informatics (OHDSI) study (study definition and design, review of data availability and quality, standardized analysis, study packages/script, execution, interpretation and write-up 49 ). The research community successfully conducts a variety of different studies and has already gained a lot of experience in these regards. The sub-steps were bundled thematically. So far, the “Review of data availability and quality” was left out due to the project specifics. A query portal can be used to conduct feasibility studies on the data of German university hospitals. 16
The presented process consists of eight steps, contributing to its perceived complexity. This complexity stems from bundling individual tasks according to the diverse responsibilities of experts from various domains, including medicine, public health and data science. The comprehensive description aims to facilitate the implementation of future studies. However, it is essential to consider automating or semi-automating specific steps to enhance efficiency. For instance, in data retrieval and evaluation (refer to
The lessons learned include both positive experiences and their successful implementation as well as potential for improvement. Numerous aspects extend beyond the scope of decentralized studies on secondary use of patient data in general and can be regarded as prerequisites for structured research endeavors. Thus, the collaboration between the experts (see LL1) should be continued and also raised to an international level. In this way, the data and knowledge pool can be continuously expanded. In order to utilize secondary use of patient data profitably and appropriately for research (see LL2), methods for harmonizing and structuring the data must be evaluated. With regard to using interoperability standards (see LL3), it has to be checked whether a combination with further tools, which especially increase the syntactic interoperability of the data, can be used to facilitate and speed up the process. For example, the complementary use of the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) of OHDSI would be conceivable.50,51 The use of other international terminologies, such as Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), would be interesting, too, so that the data can also be compared and used internationally. Also, tools for the structured definition of data elements and their metadata, such as Portal of Medical Data Models (MDM) 52 or ART-DECOR, 53 can be beneficial here. Approaches need to be created to support coding in both medical practice and research as well as to assure the quality of data.
In addition, so far only structured data have been used as a basis; the use of unstructured data of text-based documentation, e.g. from physicians’ letters or free text information in a hospital information system, remains a challenge. Individual efforts (e.g. 54 ) cannot yet be extended to the overall consortium. The data integration centers systematically investigate unstructured data by utilizing diverse tools and methodologies, such as by automating the indexing of medical texts. To minimize the concerns and challenges by using third-party software (see LL4), predetermined guidelines for the design and review of data analysis pipelines could help to streamline this process; as well as the development of standard solutions for certain analysis steps, which can then be pre-approved and used as components in various data analysis pipelines. For shortening the overall process and speeding up the feedback loops (see LL5) it needs to be examined whether standardized methods have potential to streamline the process. Tools, such as ATLAS 55 from OHDSI, for the uniform definition of cohorts are conceivable. In addition, it should be checked to what extent methods can be applied to secure the output data as well (see LL6).
The use of standardized forms allowed for a legal framework to which each study site was committed (see LL7). Nevertheless, bureaucratic hurdles (e.g. customization of study protocols for voting by the local ethics committee) still had to be overcome at the sites themselves. Here, a nationwide German solution would save a lot of time and resources. By publishing information about the studies (see LL8), the usefulness of the established structures of the MII and of the data analysis can be demonstrated to society. This should be continued by also making the results of the studies publicly available.
Conclusion
Despite the potential of decentralized studies, few papers have been published which focus on the implementation of this research model. Detailing the process and discussing lessons learned will help to better understand the characteristics of those kinds of studies. This will enable clinical and scientific researchers to design and conduct decentralized studies themselves.
We successfully conducted decentralized studies. The eight-step process shows the flow for research based on secondary use of patient data. In this way, medical knowledge could be acquired by using data already collected (i.e. within the clinic information systems), thus avoiding additional data collection.
The experience gained from the example of rare diseases can be applied to other disease entities. Above all, the collaboration between the different experts, the use of open source tools and standards and the establishment of feedback loops were beneficial.
In the future, it will be necessary to review how the individual process steps can be simplified and accelerated. For example, further exploration is needed to determine how cutting-edge approaches, particularly in dataset description and privacy preservation, can be integrated into the process.
Footnotes
List of abbreviations
Acknowledgements
The authors would like to thank all those involved in the successful implementation of the three decentralized studies, in particular the data integration centers of the following university hospitals: Universitätsklinikum Aachen, Charité – Universitätsmedizin Berlin, Universitätsklinikum Carl Gustav Carus Dresden, Universitätsklinikum Erlangen, Universitätsklinikum Frankfurt, Universitätsklinikum Freiburg, Universitätsklinikum Heidelberg, Universitätsklinikum Köln, Universitätsklinikum Magdeburg, Philipps-Universität Marburg, Klinikum der Universität München, Klinikum rechts der Isar der Technischen Universität München, Universitätsklinikum Münster, Universitätsklinikum Regensburg, Universitätsklinikum Tübingen, Universitätsklinikum Ulm, Universitätsklinikum Würzburg. They would also like to thank their colleagues at the German Portal for Medical Research Data (German: “Forschungsdatenportal für Gesundheit” (FDPG)) for their administrative support.
Availability of data and materials
Study protocols: 10.5281/zenodo.10656436
Formalized information: 10.5281/zenodo.10213532
R scripts:
Cystic fibrosis and pregnancy or delivery: https://github.com/medizininformatik-initiative/usecase-cord-support/tree/master/Studienprotokolle/CF
Phenylketonuria, comorbidities and pregnancy or delivery: https://github.com/medizininformatik-initiative/usecase-cord-support/tree/master/Studienprotokolle/PKU
Kawasaki disease and MIS-C: https://github.com/medizininformatik-initiative/usecase-cord-support/tree/master/Studienprotokolle/Kawasaki_MIS-C
Details of the FDPG projects: 10.5281/zenodo.11034585
Contributorship
MZ drafted and revised the manuscript and was responsible for the coordination of the process. CG developed the scripts. GM and HH wrote the study protocols. JS wrote the data protection concept. HH, DC, GFH, NT and RB described the medical background and provided support with their medical expertise. MZ, CG, A-KA and JT were involved in the data management. MZ and FP were responsible for the management of the project work package “Distributed Analysis”. JF and JS were responsible for the overall project management and acquisition of funding. All authors reviewed and edited the manuscript and approved the final version of the manuscript.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
Dresden: BO-EK-420092020
Würzburg: EK 50/20
Heidelberg: S-292/2020
Tübingen: 514-2020BO
Frankfurt: 20744, 20744-1, 20744-2
Aachen: EK 378/20
Berlin: Charité adopted the ethics vote of University Hospital Würzburg on the basis of the regulations in the professional code of the Berlin Chamber of Physicians
Funding
This work was part of the project “Collaboration on Rare Diseases” (CORD-MI) of the German Medical Informatics Initiative funded by the German Ministry of Education and Research (Dresden: 01ZZ1911I, Würzburg: 01ZZ1911C, Heidelberg: 01ZZ1911B, Tübingen: 01ZZ1911O, Frankfurt: 01ZZ1911H, Aachen: 01ZZ1911 K, Berlin: 01ZZ1911A).
Guarantor
MZ.
