Abstract

Artificial Intelligence (AI)-based clinical decision support systems provide clinicians with insights extending beyond conventional medical tools. The aim is to improve diagnostic and prognostic accuracy by capitalizing on the granularity of available data, allowing a larger population to benefit from tailored care. AI-driven reconstruction of stroke imaging has been shown to be non-inferior to a neuroradiologist in ischemic lesion scoring according to ASPECTS,1 while also providing consistent scoring of collaterals in the hyperacute setting.2 Machine learning (ML) methods to aid detection of large vessel occlusion and salvageable tissue represent a pivotal stride toward optimizing the onset-to-treatment window, potentially improving cost effectiveness3 and providing reasonably accurate prediction of response to treatment.4 Beyond imaging, AI has also been tested in the prediction of functional outcome.5–7 In such settings, using data from baseline assessment and further details on patient status 24 h after stroke, ML models have predicted good functional outcome at 3 months with up to 80% accuracy.5–7
Taken together, these preliminary studies underscore the potential of ML to leverage real-world data for outcome prediction. At the same time, they draw attention to a reproducibility crisis in AI-based research that must be addressed.
ML involves the development of models that enable computers to learn from data autonomously, improving performance without explicit programming. The proliferation of ML has spurred an exponential increase in clinical AI model development, with over 75,000 reported studies (https://aiforhealth.app).
Amid such progress, a critical challenge looms large: the reproducibility of AI-based studies. This issue came to the fore in studies on AI-aided interpretation of medical images for COVID-19 diagnosis, where several models faced implementation hurdles due to implicit biases and reproducibility issues.8 Consequently, as AI continues to integrate into stroke care, prioritizing the three pillars of scientific rigor – reliability, replicability, and reproducibility – becomes imperative.
Reliability concerns the consistency and accuracy with which a method discriminates outcomes. Replicability concerns whether findings can be confirmed by applying the same procedures in new studies (scientific replicability). This entails the same algorithm producing similar results under stress conditions, such as increased sampling variability, greater system uncertainty, or varying data quality. Far from implying exact replication of point estimates, scientific replicability rather represents the need to reach consistent results within probabilistic and non-probabilistic variation tolerances.
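To make replicability under sampling variability concrete, the sketch below retrains one fixed pipeline across many random train/test partitions and reports the spread of its performance rather than a single point estimate. The synthetic data, pipeline, and tolerance reporting are illustrative assumptions, not a method drawn from the cited studies.

```python
# Minimal sketch of a scientific-replicability check: the same algorithm,
# re-run under resampling variability, should yield consistent results
# within a stated tolerance. Data and pipeline are synthetic/illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

aucs = []
for seed in range(20):  # vary the train/test partition, keep the pipeline fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Report the spread, not a single (possibly fortunate) point estimate.
print(f"AUC mean {np.mean(aucs):.3f}, 95% range "
      f"[{np.percentile(aucs, 2.5):.3f}, {np.percentile(aucs, 97.5):.3f}]")
```

A study reporting such an interval, rather than one favorable split, gives reviewers a direct handle on whether results remain consistent within the stated variation tolerances.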
Lastly, reproducibility relates to the transparency of data, methods, and analysis, and the steps put in place to allow re-testing and refinement (computational reproducibility). The latter implies that any computed result must be obtainable by any investigator using the same data and algorithms. This principle extends beyond result validity, constituting a fundamental requirement at the computational level: every algorithm must be reproducible to justify its existence.9
For broad applicability, “an AI model needs to be reproducible, which means the code and data should be available and error-free.”9 Several competing factors impede this principle, including ethical concerns, privacy issues, and legal obstacles at institutional and national levels, particularly in experiments involving statistical and sub-symbolic domains with deep learning models.9 The lack of standardized reporting and the limited availability of source code and datasets undermine AI ethics principles.10 In the stroke research field in particular, limited data sharing and source code reporting consistently restrict the applicability and external validation of research findings and models.11 A scoping search shows that, over the last year, more than 30 ML algorithms were reported for stroke patients, but only one study fully disclosed its source code,4 one referred to open-source libraries for data mining and algorithm development,7,9 and only one provided external validation of the ML model.12 Therefore, implementing technical and ethical measures is crucial to ensure fairness, accountability, and transparency in AI studies.
From a technical perspective, authors may be encouraged to report the architecture of the ML algorithm (unless patented). This also applies to the methods for detecting and preventing data leakage, ensuring explicit criteria for dataset splitting during training, tuning, testing, and validation; a sketch of such a split follows below. Even when building on community-based frameworks (e.g. MONAI.io), providing source code would allow external validation and reproducibility. Standardized reporting of data curation and annotation can enhance understanding of the platform on which the ML model was developed. From a reviewer perspective, grasping the functioning of ML models is critical. As neutral or negative results often remain unpublished, the main risk for the scientific community is to be overwhelmed by ML studies with unrecognized overfitting – namely those with the highest apparent predictive accuracy. Without source code and data available, reviewers will likely have little, if any, ability to discriminate between good use and misuse of AI.10 Although data and source code sharing is common in non-medical ML literature (e.g. Open Neural Network Exchange, http://onnx.ai), it remains underused in clinical studies.13 Indeed, the sharing of sensitive personal data among different institutions raises privacy concerns. Federated platforms may partially overcome data sharing issues, allowing algorithms to be trained without sharing sensitive patient data while providing external validation of AI-based models. Such approaches would also mitigate the intrinsic limitations of studies adopting AI-driven text mining for data collection.7,14–16 To this extent, readers and clinicians, as well as patients as final users, would likely be more confident trusting a validated and transparent ML algorithm than an unexplainable model with undefined internal or external validity. Implementing safeguard measures for reporting ML studies can help identify potential undisclosed issues concerning treatment bias, ethnic group bias, subgroup and sensitivity analyses, and generalizability.
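As a concrete illustration of explicit dataset-splitting criteria, the sketch below shows one way to enforce a patient-level split with scikit-learn’s GroupShuffleSplit, so that repeated observations from the same patient never leak across the train/test boundary. The synthetic cohort and column names (patient_id, nihss, outcome) are our own assumptions, not taken from any cited study.

```python
# Minimal sketch: patient-level dataset splitting to prevent data leakage.
# The cohort is synthetic; column names are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
n_rows = 200  # e.g. repeated imaging studies, several per patient
df = pd.DataFrame({
    "patient_id": rng.integers(0, 60, size=n_rows),  # 60 distinct patients
    "nihss": rng.integers(0, 25, size=n_rows),       # toy predictor
    "outcome": rng.integers(0, 2, size=n_rows),      # toy binary label
})

# Hold out ~20% of *patients* (not rows), so records from the same patient
# never appear in both the training and the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# Explicit leakage check: no patient may straddle the split.
assert set(train_df["patient_id"]).isdisjoint(test_df["patient_id"])
```

Reporting the grouping variable and an explicit disjointness check of this kind would let reviewers verify the splitting criteria directly, rather than inferring them from reported accuracy alone.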
From an ethical standpoint, editorial bodies should adopt a framework that promotes transparency and fairness, including checklist-based reporting and repositories for source code storage. This approach would streamline the peer-review process, foster reproducibility, and help ensure proper citation of original work. The stroke field may also draw on common frameworks developed for cancer imaging, where transparency and explainability are promoted and real-world testing of AI solutions is expected.17 Besides requiring checklists (e.g. TRIPOD, CLAIM, MAIC-10) for ML study submission, reviewers should use a dedicated lexicon and glossary for reviews and test ML model reproducibility and reliability. Reviewers should also categorize ML reproducibility in tiers, which could be prominently displayed alongside article titles to inform journal audiences.
Historically, we transitioned from AI as a branch of mathematics, grounded in deduction, to ML as a branch of statistics, grounded in probability and correlation. But probabilities and correlations are by definition estimates of events rather than deterministic statements. And although working with probabilities is commonplace in the clinical setting, with AI we may be able to explain general outcomes through infinitesimal – potentially microscopic – parameters and patterns. If ML is to help in stroke care, algorithms must evolve from inscrutable to informative and explainable, for the sake of clinicians and patients.13 A model’s accuracy depends on the quality of the underlying data, but it can improve and gain trust over time with transparency. As AI develops, the ultimate goal is broad generalizability and a positive impact on many lives. Consequently, the community of authors, reviewers, and editors shares a common interest in pooling efforts toward this goal.
Footnotes
Acknowledgements
None
Declaration of conflicting interest
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: MR is supported by Young Investigator Grants from the Italian Stroke Association (ISA-AII), and declares support for educational activities from CLS-Behring and PRESTIGE-AF trial.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical approval
This is an editorial; no ethical approval was required.
Informed consent
This is an editorial; no patients were included.
Guarantor
MR.
Contributorship
MR, PC: concept, design, writing.
