Abstract
Cross-institution collaborations are constrained by data-sharing challenges. These challenges hamper innovation, particularly in artificial intelligence, where models require diverse data to ensure strong performance. Federated learning (FL) addresses these data-sharing challenges. In typical collaborations, data is sent to a central repository where models are trained. With FL, models are instead sent to participating sites, trained locally, and the model weights are aggregated to create a master model with improved performance. At the 2021 Radiological Society of North America (RSNA) conference, a panel was conducted titled “Accelerating AI: How Federated Learning Can Protect Privacy, Facilitate Collaboration and Improve Outcomes.” Two groups shared insights: researchers from the EXAM study (EMR CXR AI Model) and members of the National Cancer Institute’s Early Detection Research Network (EDRN) pancreatic cancer working group. EXAM brought together 20 institutions to create a model that predicts the oxygen requirements of patients presenting to the emergency department with COVID-19 symptoms. The EDRN collaboration focuses on improving outcomes for pancreatic cancer patients through earlier detection. This paper describes the major insights from the panel, including direct quotes. The panelists described the impetus for FL, its long-term potential, the challenges it faces, and the immediate path forward.
The impetus for federated learning
COVID-19 increased the need for data sharing
While research collaborations involving the sharing of sensitive patient data have succeeded in the past, they required complex approval processes and regulatory compliance that were difficult to achieve. In a pandemic, these already onerous processes became untenable, forcing researchers to devise fast, effective new ways of working.
Federated learning, a machine learning technique in which data remains local while the AI model training process is distributed to the data behind hospital firewalls, emerged as a solution.1 Because no underlying data had to move, researchers around the globe were able to collaborate on training models that were more generalizable, thanks to access to larger, more diverse datasets. Collaborating in a truly federated fashion allowed the rapid development of a COVID-19-specific risk model from clinical and chest X-ray data that would previously have been unachievable in that time frame.2
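A minimal sketch of the training loop described above, in the style of federated averaging (FedAvg): each site trains locally on its private data, and only the resulting model weights travel to a central server for aggregation. This is a generic illustration, not the EXAM pipeline; the logistic-regression model and the `site_data` structure are placeholders standing in for any model and any site's local dataset.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's local training: a few epochs of gradient descent
    on a simple logistic-regression model (stand-in for any model)."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)      # logistic-loss gradient
        w -= lr * grad
    return w

def federated_averaging(global_w, site_data, rounds=10):
    """FedAvg-style rounds: every site trains locally on data that
    never leaves its firewall; the server averages the returned
    weights, weighted by each site's sample count."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in site_data:                 # raw data stays at the site
            updates.append(local_update(global_w, X, y))
            sizes.append(len(y))
        total = sum(sizes)
        global_w = sum(w * (n / total) for w, n in zip(updates, sizes))
    return global_w
```

Only the weight vectors cross institutional boundaries here; in production systems, additional protections (e.g., secure aggregation, differential privacy) are typically layered on top, since weights themselves can leak information.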
Using AI to improve disease management requires high volumes of data across institutions
For effective AI training, in both the federated and the traditional data lake approach, and to change outcomes in high-morbidity, high-mortality diseases like COVID-19 or cancer, high volumes of diverse data are required. No single institution will succeed alone, as data diversity (e.g., different protocols, different scanners) and sheer patient volume are often required to elucidate patient impact.
Every institution has different rules and complexities around data sharing, and sharing is often the exception rather than the rule. The traditional approach of building a data lake to centralize all training data therefore creates enormous administrative, cost, and regulatory hurdles, especially beyond a few thousand cases. Federation is the next generation: rather than centralizing everything and going through the process of signing material transfer agreements and manually de-identifying records, every case can keep all of its protected health information in place, with federated linkage connecting the key analytic components and no protected health information leaking. This approach will change cross-institutional research collaborations, where currently the first few years are spent working out administrative details before any science is done, into projects where the administrative barriers take only a few months.
The potential of federated learning
Data diversity is key for clinical translation of algorithms
AI technology-driven therapeutic approaches have mostly focused on adults and common diseases. Patient care must be advanced across the board, including for children and rare diseases. Consider the EXAM study, which brought together over 16,000 patients from more than 20 international hospitals, including one pediatric center, Children’s National Hospital. At the time, less than 2% of COVID-19 studies focused on children.3 The EXAM model was able to predict the oxygen needs of children with an area under the curve greater than 0.95; trained on the pediatric data alone, without the federated data, the model achieved an area under the curve below 0.75.2 This demonstrated that transferring knowledge from adult models can accurately predict outcomes for children with COVID-19. It also underscored the need for hospitals to be technologically equipped and computationally ready to use AI models.
Reduce bias in AI model development
No single institution is fully representative of our diverse population and practice patterns, so AI models that are trained on single institutional data may replicate biases that are present in that source data.4 Access to diverse and representative data is a foundational step in efforts to reduce bias in AI, and federated learning has the capacity to reduce the barriers to building broad, diverse datasets across institutions while preserving privacy and security.
Federated learning offers an opportunity to expose an algorithm to as diverse a population as possible, which means it is more likely to work in the many different settings around the globe. We do not want to have to repeatedly retrain algorithms on local data – rather, we should create algorithms that encompass diversity and are generalizable on a global scale.
It should be noted that while federated learning is a potential remedy for bias, it can also become a source of bias if used improperly. When leveraging global cohorts, it is vital to capture meaningful population differences (e.g., age, gender, other demographics) while avoiding noise (e.g., different data collection procedures, different labeling practices) that might artificially skew data distributions and introduce new bias into the resulting models.
The challenges of federated learning
Lack of access to raw data increases data harmonization challenges
While federated learning can be immensely helpful for sharing knowledge across participating sites without sharing any raw data, it also brings new challenges for AI model developers and researchers. Without the ability to inspect the raw data directly, researchers must find new ways to ensure algorithmic performance across a spectrum of inputs they can no longer directly observe. Because the data is not co-located, it may be highly heterogeneous (e.g., different patient populations, different scanners). This challenge is also an opportunity, as there is whitespace for research into areas such as smarter aggregation algorithms and regularization. Techniques like meta-learning and domain adaptation will also become very important for enabling efficient federated training.
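As one illustration of a "smarter aggregation algorithm," the plain weighted mean of site weights can be replaced by a coordinate-wise median, which limits the influence of an outlier site (for instance, one with a sharply skewed population or corrupted labels). This is a generic robustness sketch, not a technique attributed to the panelists, and the array layout is an assumption for illustration.

```python
import numpy as np

def median_aggregate(site_weights):
    """Coordinate-wise median of per-site model weight vectors: a
    simple robust alternative to weighted averaging, less sensitive
    to a few anomalous sites in a heterogeneous federation."""
    stacked = np.stack(site_weights)   # shape: (n_sites, n_params)
    return np.median(stacked, axis=0)
```

For example, if two sites return similar weights and a third returns wildly different ones, the median tracks the majority, whereas a mean would be dragged toward the outlier.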
Lack of streamlined and standardized platforms
Participation in cooperative multicenter studies has traditionally been the domain of NIH-funded researchers with dedicated research teams. To expand the availability and diversity of research data via mechanisms like federated learning, it will be necessary to convince institutions to invest in the computational platforms and technical expertise necessary to support these types of data platforms. Funding streams will need to be identified to support these investments, most likely from enhanced clinical revenues or from national research investments.
Additionally, establishing data standards that facilitate federated learning will be crucial; examples include the Observational Medical Outcomes Partnership (OMOP) Common Data Model and the Unified Medical Language System (UMLS).5,6 These efforts seek to standardize the way medical data are stored and encoded, ensuring cross-compatibility across institutions and methods. Successful adoption of these standards will require sustained commitments from medical institutions, researchers, and industry.
The path forward for federated learning
Clear messaging and education on the concept of federated learning is key
David Jaffray espouses four basic principles for team data science that help move efforts like federated learning forward.7 First, “observations must be in context”: each measurement has an uncertainty reflected by the context surrounding the observation. Second, “quality of an observation is captured in the context”: capture the date, time, setting, and description of the observed quantity to ensure appropriate use and ontology mapping. Third, “provenance links insights to observations”: data lineage is encoded within that metadata and makes it possible to trust an insight, integrate diverse observations, verify rights of use, and, when appropriate, attribute credit to the investigators who generated the insights and provided the data. Fourth, “data governance must be granular and consistent with the needs of the demand”: governance must be foundational, integrated, and precise. The framework is matured iteratively, with integrated technical support that filters data access while preserving institutional control of the data, so that patient privacy as well as any intellectual property is respected. Adherence to, and clarity on, these principles helps move collaborations forward.
Federated learning will become easier and more powerful over time
The EXAM study is a good initial example of federated learning: a collaboration among 20 different institutions examining patients who came to the emergency department with COVID-19 symptoms, using selected lab data, EHR data, and the chest X-ray.2 The model was created to predict patients’ oxygen requirements, making it, in effect, a triage algorithm for patients with suspected COVID-19 infection. Education was key, including highlighting the data security features. One of the biggest lessons was that when people understand the underlying concepts transparently, the work becomes easier.
However, another lesson from EXAM is that the actual pipeline to train a model is only a small part of a solution. There is also data orchestration, federated data analytics, a federated workflow, and much more whitespace that needs to be addressed. For example, the ability to log data, log access, and later ascertain the lineage of data will all be instrumental. The EXAM study did an amazing job of demonstrating the power of federated learning technology, and the EDRN pancreatic cancer group is now building on those foundational learnings. Our hope is that, in time, federated learning becomes feasible with minimal bureaucratic hurdles. With a strong foundational platform, a gold-standard dataset compiled by various researchers and accessible to multiple institutions can be built to expand AI research further, allow side-by-side comparisons of algorithms, enable continuous learning to mitigate model drift, and ultimately improve patient impact on a wide-ranging, equitable scale.
Disclosure
Malhar Patel is an employee and shareholder of Rhino HealthTech, Inc., which provides systems for distributed computation that can, among other things, be used to complete FL tasks. Ittai Dayan is an officer and cardinal shareholder of Rhino HealthTech, Inc. Mona Flores is an employee of NVIDIA and owns stock as part of the standard compensation package. Holger Roth is an employee of NVIDIA and owns stock as part of the standard compensation package. Marius George Linguraru is an officer and shareholder of PediaMetrix, Inc. The panel at RSNA was organized by Rhino HealthTech, Inc.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
