Reproducibility and explainability in digital pathology: The need to make black-box artificial intelligence systems more transparent

Abstract

Artificial intelligence (AI), and more specifically machine learning (ML) and deep learning (DL), has permeated the digital pathology field in recent years, with many algorithms successfully applied as advanced tools for analyzing pathological tissues. The introduction of high-resolution scanners in histopathology services has been a real revolution for pathologists, allowing digital whole-slide images (WSI) to be analyzed on a screen without a microscope at hand. However, it also means a transition from microscope to algorithms in the absence of specific training for most pathologists involved in clinical practice. The WSI approach is a major transformation from a computational point of view as well. The many ML and DL tools specifically developed for WSI analysis may enhance the diagnostic process in many fields of human pathology. AI-driven models can deliver more consistent results and provide valid support for detecting, from hematoxylin and eosin (H&E)-stained sections, multiple biomarkers, including microsatellite instability, that are missed by expert pathologists.

Keywords

Artificial intelligence, machine learning, deep learning, digital pathology, whole-slide images, AI-driven models

The introduction of high-resolution scanners in histopathology services has been a real revolution for pathologists, allowing digital whole-slide images (WSI) to be analyzed on a screen without a microscope at hand. However, it also means a transition from microscope to algorithms in the absence of specific training for most pathologists involved in clinical practice. The WSI approach is a major transformation from a computational point of view as well. The many ML and DL tools specifically developed for WSI analysis may enhance the diagnostic process in many fields of human pathology. AI-driven models can deliver more consistent results and provide valid support for detecting, from hematoxylin and eosin (H&E)-stained sections, multiple biomarkers, including microsatellite instability, that are missed by expert pathologists.²

Despite these potential advantages and promising results, the introduction of AI-driven tools into clinical practice needs to be reconsidered, for several reasons. The reproducibility of DL models applied to WSI analysis is a crucial point, and often a barrier, for the transition of these models from research to clinical workflows. For a method to be widely adopted in clinical practice, it must be explainable and reproducible so that pathologists can have confidence in its use.³ Unfortunately, ML and DL models face crucial challenges regarding reusability and reproducibility.⁴ It is time to rethink the approach of AI to pathology, aiming to help algorithms reach the levels of reproducibility and availability necessary for approval by national and international authorities, such as the European Medicines Agency (EMA) and the Food and Drug Administration (FDA).

Here, we start from the etymology of the term “algorithm.” The word derives from the name of the Muslim mathematician Muhammad ibn Musa al-Khwarizmi, born around 780 CE in what is now Uzbekistan, who is credited with inventing algebra and developing the concept of algorithms: systematic methods consisting of a sequence of steps and rules that end with the solution of a mathematical problem. In their original definition, algorithms were characterized by their explainability and reproducibility. Unfortunately, the vast majority of algorithms currently applied to computational pathology show low levels of reusability and reproducibility. Models with high specificity and sensitivity on the local dataset often show lower performance when applied to external datasets, evidencing the inability of these models to generalize beyond the data on which they were developed.
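
The gap between local and external performance described above can be made concrete with a minimal sketch. The labels below are hypothetical toy values, not data from any study, and the function name is an assumption introduced only for illustration; the point is simply that sensitivity and specificity should be reported separately for each cohort.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy labels: the same model evaluated on two cohorts.
internal_true = [1, 1, 1, 1, 0, 0, 0, 0]
internal_pred = [1, 1, 1, 1, 0, 0, 0, 1]
external_true = [1, 1, 1, 1, 0, 0, 0, 0]
external_pred = [1, 1, 0, 0, 0, 0, 1, 1]

for name, truth, pred in [
    ("internal", internal_true, internal_pred),
    ("external", external_true, external_pred),
]:
    sens, spec = sensitivity_specificity(truth, pred)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both cohorts side by side, rather than a single pooled figure, is what exposes the kind of performance drop discussed here.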

An analysis of the steps used by AI models in WSI analysis reveals multiple critical points that characterize the application of a DL model to histopathology in clinical workflows, among them stain normalization of tissue sections, tissue type segmentation, the type of patch extraction, whole-slide image-based classification versus patch-based analysis and mixed methods, hard negative mining, and heatmap generation.⁴ The failure to maintain high standards of data processing, an essential requisite for reproducibility, characterizes most studies on DL models in digital pathology.³
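
One way to make data-processing choices such as those listed above auditable is to record them in a single configuration object and publish a fingerprint of it alongside the results. The following is a minimal sketch under stated assumptions: the class name, field names, and example values are all hypothetical and are not taken from any cited study or existing library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessingConfig:
    """Hypothetical record of the processing choices that affect reproducibility."""
    stain_normalization: str  # illustrative value, e.g. "macenko"
    tissue_segmentation: str  # illustrative value, e.g. "otsu"
    patch_size_px: int
    patch_overlap_px: int
    magnification: float
    random_seed: int


def fingerprint(config: PreprocessingConfig) -> str:
    """Return a SHA-256 hash of the configuration in canonical JSON form."""
    canonical = json.dumps(asdict(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


config = PreprocessingConfig(
    stain_normalization="macenko",
    tissue_segmentation="otsu",
    patch_size_px=256,
    patch_overlap_px=0,
    magnification=20.0,
    random_seed=42,
)
print(fingerprint(config))
```

Two groups that report the same fingerprint have, by construction, used the same recorded settings; any undocumented change to a parameter yields a different hash.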

Another critical point is the lack of explainability and interpretability of these models, which appear as “black boxes.”⁵ Although convolutional neural networks (CNNs) have achieved impressive performance, the more intriguing question is how the models make decisions and how they learn to solve a given task.⁶ The absence of algorithm elucidation hinders the medical acceptance of AI models.⁷ From a practical point of view, many countries require a clarification of how AI models work before granting governmental approval for use in clinical settings.⁸
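
As an illustration of what opening the black box can look like in practice, the sketch below implements occlusion sensitivity, one of several post hoc explanation techniques: each region of an image is masked in turn and the resulting drop in the model score is recorded as a heatmap. This is not the method of any cited study. The scoring function is a hypothetical stand-in for a trained classifier, and all names are assumptions.

```python
def occlusion_map(image, score_fn, window=2, fill=0.0):
    """Return a heatmap of score drops when each window x window block is masked."""
    height, width = len(image), len(image[0])
    baseline = score_fn(image)
    heatmap = [[0.0] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            rows = range(top, min(top + window, height))
            cols = range(left, min(left + window, width))
            occluded = [row[:] for row in image]  # copy so the input is untouched
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - score_fn(occluded)  # large drop = influential region
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop
    return heatmap


def toy_score(image):
    """Hypothetical stand-in for a classifier: mean intensity of the top-left 2x2 block."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_map(image, toy_score, window=2)
for row in heatmap:
    print(row)
```

Because the toy score depends only on the top-left block, the heatmap is nonzero only there, which is the behavior a pathologist would want to verify against the tissue region a real model claims to rely on.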

In the medical community, understanding the decision-making process can be as important as the decision itself.⁹ When dealing with a patient, a disease, or a complex diagnosis, which can lead to decisions that directly affect a person’s health status and survival, a better understanding is necessary to avoid damage, adverse effects, and mistakes. For this reason, a better understanding of “algorithmic decisions” appears mandatory for pathologists using AI models, which underlines the need for explainability in digital pathology.¹⁰

Taken together, these considerations call for a high-quality, robust, easy-to-use, and transparent processing pipeline that can help ensure the validity and explainability of AI models applied to histopathology in clinical workflows. The main goal of such a pipeline is to overcome the reproducibility crisis of AI models,¹¹ eventually allowing their faster application in medicine and their acceptance by pathologists for clinical purposes.¹² To this end, novel pathologist-AI interfaces designed around the human user can enable contextual understanding and allow pathologists to ask interactive questions, overcoming the disadvantages of current AI models, which do not refer to a human model.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Luigi Barberini

References

1. Faa, Castagnola, Didaci, et al. The quest for the application of artificial intelligence to whole slide imaging: unique prospective from new advanced tools. Algorithms 2024; 17(6): 254.

2. Reis-Filho, Kather JN. Overcoming the challenges to implementation of artificial intelligence in pathology. J Natl Cancer Inst 2023; 115(6): 608–612.

3. Fell, Mohammadi, Morrison, et al. Reproducibility of deep learning in digital pathology whole slide image analysis. PLOS Digit Health 2022; 1(12): 1–21.

4. Wagner, Matek, Shetab Boushehri, et al. Built to last? Reproducibility and reusability of deep learning algorithms in computational pathology. Mod Pathol 2024; 37(1): 1–11. DOI: 10.1016/j.modpat.2023.100350

5. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019; 1(5): 206–215.

6. Chen C-L, Chen C-C, W-H, et al. An annotation-free whole-slide training approach to pathological classification of lung cancer types using deep learning. Nat Commun 2021; 12(1): 1193.

7. Ahmed, Abouzid, Kaczmarek. Deep learning approaches in histopathology. Cancers 2022; 14(21): 5264.

8. Ching, Himmelstein, Beaulieu-Jones, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15(141): 1–47.

9. Lee. Recent advancements in deep learning using whole slide imaging for cancer prognosis. Bioengineering 2023; 10(8): 897.

10. Plass, Kargl, Kiehl T-R, et al. Explainability and causability in digital pathology. J Pathol Clin Res 2023; 9(4): 251–260.

11. Hutson. Artificial intelligence faces reproducibility crisis. Science 2018; 359(6377): 725–726.

12. Stodden, McNutt, Bailey, et al. Enhancing reproducibility for computational methods. Science 2016; 354(6317): 1240–1241.