Abstract
Objective
This study aimed to assess the practicality and trustworthiness of explainable artificial intelligence (XAI) methods used for explaining clinical predictive models.
Methods
Two popular XAI methods used for explaining clinical predictive models were evaluated based on their ability to generate domain-appropriate representations, their impact on clinical workflow, and their consistency. Explanations were benchmarked against true clinical deterioration triggers recorded in the data system, and the agreement was quantified. The evaluation was conducted using two electronic medical record (EMR) datasets from major hospitals in Australia. Results were examined and commented on by a senior clinician.
Results
Findings demonstrate a violation of the consistency criteria and moderate concordance (0.47-0.8) with true triggers, undermining reliability and actionability, which are criteria for clinicians’ trust in XAI.
Conclusion
Explanations are not trustworthy enough to guide clinical interventions, though they may offer useful insights and help with model troubleshooting. Clinician-informed XAI development and presentation, clear disclaimers on limitations, and critical clinical judgment can promote informed decisions and prevent over-reliance.
Keywords
Introduction
ML-based tools have the potential to significantly improve health and healthcare delivery, 1 yet these methods are often "black-box" in nature. ML models often fail to elucidate which factors influence individual predictions, or how changes to these observable inputs modulate the predicted outcome. This is an important deficiency because clinicians’ understanding of, and confidence in, predictions is the key to trusting them to guide interventions in the complex process being modelled. Knowing the main contributing factors allows clinicians to evaluate their coherence with respect to the application task and their potential actionability, which could help build trust in clinical settings. 2 If an explanation cannot indicate why a future problem will evolve, and thus justify treatment interventions, then clinicians can rightly be sceptical. The lack of transparency reduces trustworthiness3,4 and is a barrier to adoption for clinical decision-making. Understanding of the methods should also be sufficient for clinicians to suspect when the tools are not working, or are being used outside the purpose for which they were developed.
To understand complex ML models, different eXplainable AI (XAI) methods have been proposed in the literature. 3 These methods can be categorised into (i) gradient-based methods, e.g. SmoothGrad, 5 Integrated Gradients, 6 Deep Taylor Decomposition (DTD) 7 and Layer-wise Relevance Propagation, 8 and (ii) perturbation-based methods, e.g. LIME 9 and Shap. 10 Still, there is little understanding of how applicable or useful they are in clinical settings, or whether they should be significantly re-tailored or novel XAI methods developed.
From the clinicians’ perspective, knowing the subset of features driving the model outputs is crucial as it allows them to compare data-driven model decisions to their clinical judgment, which is especially important in case of disagreement. 2 Findings suggest that rigorous evaluation of explanations against the following criteria could contribute to building trust in clinical settings: (i) domain-appropriate representation, (ii) potential actionability and (iii) consistency. 2 Recent studies11,12 report inconsistency between the explanations generated by various popular explanation methods. This variability implies that at least some generated explanations are incorrect. Incorrect explanations at the patient level could be misleading and could lead to wrong decisions with dire consequences in applications such as healthcare. However, there is no literature focused on evaluating explanations, their usability and their effect on trust in clinical settings.
The objectives of this paper are to: i) examine the (dis)agreement among the most commonly used XAI methods in clinical settings, ii) assess their utility and trustworthiness, iii) identify key requirements for building trust and iv) provide actionable insights to mitigate identified limitations. To our knowledge, this is the first study of its kind.
To that end, we present the results of a quantitative analysis of explanations at both the patient (i.e. local explanations) and the cohort (i.e. global explanations) level. We discuss them in terms of their coherence with respect to the application task, their impact on the workflow and their consistency. The analysis focuses on the two most used XAI methods for explaining clinical predictive models: Shap 10 and DTD. 13 It is performed on two EMR datasets sourced from two major Australian hospitals, examining data-driven models predicting unexpected patient deterioration 14 and hospital readmission after discharge. 15 These results and their implications from clinicians’ perspectives are discussed, along with the criteria necessary for trustworthy explanations and how these guide the choice of intervention.
The main contributions of this paper are:
• A comprehensive benchmarking and analysis of two popular XAI methods to evaluate their effectiveness in explaining ML-powered clinical prediction models.
• A contribution to the debate on what degree of explanation could be sufficient, by discussing the implications of the results in the context of applicability and usefulness in clinical settings.
• A discussion of the limitations and potential benefits of state-of-the-art XAI methods in clinical contexts, contributing to the development of trustworthy XAI solutions.
• Suggested necessary criteria for trustworthy explanations.
• Actionable insights to mitigate the risks of identified shortcomings, and concrete suggestions and directions for future research.
Methods
Datasets
This is a cross-sectional study. To evaluate the XAI methods, two datasets from two published studies were used. The Vital Signs (VS) dataset comprises vital signs and administrative data with N = 3,237,601 records; these were used to predict patient deterioration in acute settings. 14 The second dataset comprised administrative, pathology and medication data, used for predicting patient readmission within 30 days (RA30). 15 In this study, we considered the Children’s cohort and an All-inclusive scenario comprising N = 139,941 records. Dataset characteristics are provided in the Supplemental Information, Tables S1 and S3.
Predictive models
To explore the agreement between the features obtained by different methods, we selected three modelling approaches, each representative of one modelling paradigm: regression, conventional ML, and deep neural networks (DNN) (Figure 1(a)). We chose logistic regression (LR) with l1 regularization (L1), 16 aka LASSO, as the most accepted predictive model for clinical applications. Unlike other ML models, which are generally more complex, it allows insight into the model structure (interpretability) and is a trusted and accepted model among clinicians. Its main disadvantages are the inability to capture complex patterns and hence poorer performance when dealing with high-dimensional or non-linear data; choosing between interpretable and explainable models thus depends on the specific requirements of the application. We considered XGB 17 as representative of conventional ML, which, along with random forests, has been demonstrated to be a powerful model for large and complex data. A dense NN model 16 was deployed to explore the explainability of NN-based models. Details regarding model development, i.e. inputs, outputs, hyperparameters and training, are provided in the Supplement, Section 1.
Figure 1. Visual summary of the experiment design: models and XAIs (a), agreement metrics (b) and evaluation criteria (c) used to assess XAIs. PL and CL stand for evaluation performed at patient and cohort levels, respectively.
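As a minimal, illustrative sketch only, the three modelling paradigms can be set up as below. The data, hyperparameters and layer sizes are placeholders, not the published configurations from the underlying studies.

```python
# Sketch of the three modelling paradigms (Figure 1(a)); synthetic data stands in
# for the EMR feature matrices, and hyperparameters are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
import tensorflow as tf

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))                       # stand-in for an EMR feature matrix
y_train = (X_train[:, 0] + rng.normal(size=1000) > 0).astype(int)

# (i) Regression paradigm: L1-regularised logistic regression (LASSO)
lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X_train, y_train)

# (ii) Conventional ML paradigm: gradient-boosted trees (XGB)
xgb = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss").fit(X_train, y_train)

# (iii) Deep learning paradigm: dense feed-forward neural network
dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
dnn.compile(optimizer="adam", loss="binary_crossentropy")
dnn.fit(X_train, y_train, epochs=5, batch_size=128, verbose=0)
```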
Explanations
Two XAI methods, one gradient-based, Deep Taylor Decomposition (DTD), and one perturbation-based, Shap, were employed to explain the models’ outputs (Figure 1(a)). Shap 10 calculates the contribution of each feature to a model’s prediction by averaging the marginal effects of the feature across all possible feature combinations, a principle adopted from cooperative game theory. DTD 7 decomposes a model’s output back to its input features, attributing the prediction to individual input features through a process of Taylor expansion that assigns relevance scores highlighting the contribution of each feature. To measure agreement, we compared the generated explanations at the patient and the cohort levels using the metrics introduced by Krishna et al. 11 Patient-wise explanations (i.e. local explanations) are those computed for each patient individually for each predicted outcome. To gain insight into the features that underpin the decisions of a predictive model overall, we analysed explanations at the cohort level (i.e. global explanations), obtained by summing the absolute values of the explanations for individual patients and averaging them over the total number of considered patients.
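A minimal sketch of this local-to-global aggregation is given below, reusing the `xgb` model and synthetic data from the previous snippet; it illustrates the averaging described above rather than the study's exact pipeline.

```python
# Cohort-level (global) explanations as the mean absolute value of the
# patient-level (local) Shap attributions; `xgb` and `X_train` come from the
# previous sketch, standing in for the trained model and EMR data.
import numpy as np
import shap

explainer = shap.TreeExplainer(xgb)
local_shap = explainer.shap_values(X_train)            # one attribution vector per patient

global_importance = np.abs(local_shap).mean(axis=0)    # average of absolute local attributions
ranking = np.argsort(global_importance)[::-1]          # features ordered by overall contribution
```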
Agreement metrics
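The agreement metrics used (Figure 1(b)) are those introduced by Krishna et al. 11 As an illustration only, the sketch below implements two such measures under the assumption that FA denotes top-k feature agreement and CR denotes rank correlation; the exact definitions applied in this study follow Krishna et al. 11

```python
# Hedged sketch of two agreement measures in the spirit of Krishna et al.:
# top-k feature agreement and rank correlation between two attribution vectors.
import numpy as np
from scipy.stats import spearmanr

def feature_agreement(attr_a, attr_b, k=5):
    """Fraction of the top-k features (by absolute attribution) shared by two explanations."""
    top_a = set(np.argsort(np.abs(attr_a))[::-1][:k])
    top_b = set(np.argsort(np.abs(attr_b))[::-1][:k])
    return len(top_a & top_b) / k

def rank_correlation(attr_a, attr_b):
    """Spearman correlation between the two attribution rankings."""
    rho, _ = spearmanr(np.abs(attr_a), np.abs(attr_b))
    return rho
```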
Experiment design
To evaluate explanations against the criteria suggested by Tonekaboni et al., 2 we used explanations obtained for the EMR-based predictive models reported in published studies.14,15 To that end, an extensive analysis was performed to investigate whether explanations (i) have a domain-appropriate representation, i.e. whether their representation is coherent with respect to the application task, (ii) may impact the clinical workflow and (iii) are consistent (Figure 1(c)).
To assess whether the explanations are coherent, i.e. domain-appropriate for the application task, we analysed explanations at the cohort level. Redundant explanations are not desirable unless critical to potential clinical workflows, i.e. they should not further obfuscate model behaviour to a clinician. 2 Therefore, the top contributors at the cohort level obtained by the different methods were compared, and their representation in the context of the application task was assessed based on clinical expert knowledge. To account for the inherent randomness in the optimisation procedure, the analysis was performed on results obtained by running each algorithm five times.
To evaluate the potential actionability of the explanations, i.e. whether they are informative and may impact the workflow by informing follow-up clinical workflow while being parsimonious and timely, explanations at the patient level were analysed, assessed and discussed. In this scenario, only the samples that correctly predicted patient deterioration/readmission were considered. To understand how explanations for incorrect predictions differ from correct ones, we analysed the percentage occurrence of the most frequent top contributors for each case (TP, TN, FN and FP), together with their means and standard deviations (a sketch of this tallying step is given below). Patient-level analysis allowed the explanations generated for each patient to be evaluated individually and their informativeness and potential impact on the workflow to be assessed. For example, if the explanation of the main contributing factor can guide the choice of intervention, the risk assessment tool will help to allocate resources, and explanations should facilitate decision-making. 2
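The following is a minimal sketch of that tallying step, using hypothetical feature names and attribution values rather than the study data.

```python
# Sketch of the correct-vs-incorrect comparison: how often each feature is the
# top contributor within each prediction outcome group (TP, FP, TN, FN), and
# the mean/standard deviation of its attribution. Values are illustrative only.
import pandas as pd

df = pd.DataFrame({                       # one row per sample
    "outcome_group": ["TP", "TP", "FP", "TN", "FN", "TP"],
    "top_feature":   ["min_SpO2", "LOS", "min_SpO2", "max_HR", "LOS", "min_SpO2"],
    "attribution":   [0.41, 0.22, 0.35, 0.18, 0.27, 0.44],
})

# Percentage occurrence of each top contributor within each outcome group
freq = df.groupby("outcome_group")["top_feature"].value_counts(normalize=True) * 100

# Mean and standard deviation of the attribution per group and feature
stats = df.groupby(["outcome_group", "top_feature"])["attribution"].agg(["mean", "std"])
print(freq, stats, sep="\n\n")
```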
Explanations intended to build trust in clinical settings could benefit from being rigorously evaluated against the ground truth. With that objective in mind, local explanations obtained for the VS dataset were benchmarked against the red flag triggers recorded by the data collection system deployed in the hospital. We checked whether the top feature (either the original feature (e.g. SpO2) or one of its derivatives, e.g. its min, max, std, average, count or slope) matched the cause of the correctly predicted future red flag event recorded by the deployed system (one of the following measurements: blood pressure, pulse, oxygen saturation (SpO2), GCS, sedation score or respiratory rate). Only correct predictions were considered in the analysis. Multiple red flags could potentially arise within the prediction window, but only the first one was considered (see Figure 2). Concordance was computed as the fraction of samples for which the explanation matched the trigger recorded by the deployed data collection system, over the total number of correctly predicted red flags. Given that none of the predictors in the RA30 dataset offers a cause for patient readmission that could be considered ground truth, the RA30 dataset was excluded from the benchmarking analysis. Considering that different methods may produce different results 12 and that this discordance could impact clinicians’ trust, the agreement between the methods’ explanations at the patient level was also investigated and discussed.
Figure 2. Visualisation of the benchmarking process.
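A minimal sketch of this concordance calculation is shown below, using hypothetical data structures: a mapping from each correctly predicted sample to its top explanatory feature, and the first EWS trigger recorded for that sample. As in the analysis that retains administrative features, a top feature such as LOS is simply counted as a non-match.

```python
# Sketch of the concordance computation; feature and trigger names are illustrative.
def parent_signal(feature_name):
    """Map a derived feature (e.g. 'min_SpO2') to its source vital sign, if any."""
    for signal in ["SpO2", "blood_pressure", "pulse", "GCS", "sedation_score", "respiratory_rate"]:
        if signal.lower() in feature_name.lower():
            return signal
    return None  # administrative features (e.g. LOS) cannot be benchmarked

def concordance(explanations, recorded_triggers):
    """Fraction of correctly predicted red flags whose top feature matches the recorded trigger."""
    matches = sum(
        parent_signal(top_feature) == recorded_triggers[sample_id]
        for sample_id, top_feature in explanations.items()
    )
    return matches / len(explanations)

explanations = {"p1": "min_SpO2", "p2": "slope_respiratory_rate", "p3": "LOS"}
recorded_triggers = {"p1": "SpO2", "p2": "respiratory_rate", "p3": "SpO2"}
print(concordance(explanations, recorded_triggers))   # 2/3 in this toy example
```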
Explanations should be consistent, i.e. they should (i) yield observable changes for any changes in predictions due to changes in inputs and (ii) be invariant to underlying design variations, i.e. they should only reflect relevant clinical variability. Violating either of these elements results in inconsistent explanations, which undermines reliable actionability and in turn has a negative impact on clinicians’ trust. 2 The first consistency factor suggests causality. However, although the considered XAIs make correlations transparent, these are patterns, not causal relationships. On occasion the patterns may be causal, but this is not to be expected. Aas et al. 18 showed that marginal Shapley values may lead to incorrect explanations when features are highly correlated. While DTD can provide insights into how a neural network processes information, by decomposing the NN into a sum of simple interpretable functions and associating them with specific input features to score each feature, it is not specifically designed for causal inference.
To examine the consistency of explanations in relation to variations in the design of the underlying models (DNN and XGB), explanations calculated at the cohort level were analysed for each of the five independent runs. The obtained explanations were compared across the runs and their agreement was quantified with the FA and CR metrics. Unlike DTD, which is applicable only to NNs, the Shap explainer can be applied to any model; therefore, the consistency of Shap explanations with respect to the deployed model (XGB versus NN) was analysed and quantified with the FA and CR metrics. To explore the consistency of explanations regarding the direction of contribution (i.e. sign), explanations obtained at the patient level for both the XGB and DNN models using the Shap method were analysed across five independent runs.
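A minimal sketch of the cross-model comparison is given below, reusing `xgb`, `dnn` and `X_train` from the modelling snippet and `feature_agreement` from the agreement-metric snippet. Model-agnostic Kernel Shap is used here for the network purely as an illustration; the study's choice of Shap variant and its repetition over five independent runs are described in the text.

```python
# Sketch: cohort-level Shap attributions for XGB and the DNN, compared with the
# top-k feature agreement metric. Sample sizes and background sets are arbitrary.
import numpy as np
import shap

def cohort_attribution_tree(model, X):
    return np.abs(shap.TreeExplainer(model).shap_values(X)).mean(axis=0)

def cohort_attribution_any(predict_fn, X):
    explainer = shap.KernelExplainer(predict_fn, X[:50])       # small background sample
    values = np.asarray(explainer.shap_values(X[:200])).squeeze()
    return np.abs(values).mean(axis=0)

attr_xgb = cohort_attribution_tree(xgb, X_train[:200])
attr_dnn = cohort_attribution_any(lambda d: dnn.predict(d, verbose=0).ravel(), X_train)
print("Top-5 feature agreement (XGB vs DNN, Shap):", feature_agreement(attr_xgb, attr_dnn, k=5))
```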
Results
Domain appropriate representation
Domain appropriate representation criteria: Top contributors for VS and RA30 datasets (Run 1).
*LOS: Length of stay; SpO2: Oxygen saturation; SBP: Systolic Blood Pressure; DBP: Diastolic Blood Pressure; NA: Not Assigned.
Impact on the clinical workflow
Match benchmarking
The fraction of samples for which the explanations and true flag triggers matched is shown in Figure 3(a). The top row shows the fraction of samples that matched the ground truth when correctly predicted samples with an administrative feature (e.g. LOS) as the main contributor were excluded from the comparison, as these cannot be directly benchmarked against the causes recorded by the Early Warning System (EWS). The percentage of excluded samples was 2.7% for DNN-DTD, 13% for DNN-Shap and 37% for XGB-Shap. The bottom row reports the match when samples with an administrative feature as the main contributor were included.
Figure 3. Match benchmarking: fraction for which explanations of the most contributing predictors match recorded red flag events (a) and patient-wise agreement on the top contributor across all methods over 5 independent runs (b).
Overall agreement varied from moderate (47%) to strong (80%). Excluding samples where LOS was the top feature (37% of correct predictions in XGB) resulted in a remarkable increase in agreement (75-80%, Figure 3(a)), a level that is clinically relevant, i.e. potentially reliable and actionable explanations. Agreement on the top contributor across the methods for the VS dataset was high (84%, Figure 3(b)), and its sign agreement was also high (99%, Figure 5(c)), contributing to building trust.
Patient-wise concordance of top contributors
For the VS dataset, the average agreement across all methods over 5 independent runs (Figure 3(b)) was 84% when we compared whether the top features were derivatives of the same vital sign (e.g. SpO2), and somewhat less (82%) when we compared the exact feature names (e.g. min SpO2). The concordance was 20% for the top two features (regardless of their order), and less than 3% when the top three features were considered. For the RA30 dataset, the concordance was remarkably poorer overall (Figure 3(b)).
Detailed results of agreement analysis are reported in Supplemental Information Table S27–28.
Explanations for correct and incorrect predictions
Features identified as main contributors in correctly classified samples were also the main contributors for the incorrect ones (Figure 4). Though the means differ (Supplemental Tables S29 and S30), the variability in the two subgroups is such that their ranges are not distinctly separate.
Figure 4. Impact on the clinical workflow: main contributors for correct and incorrect predictions. (a) VS dataset; LOS: length of stay; HR: heart rate; AP: level of consciousness; BP: blood pressure; RR: respiratory rate; O2 FR: oxygen flow rate; NRVS: num.rec.VS. (b) RA30 dataset; PISC: Prev. inpat. stay count; PNUT: Patho.: No unique tests; CTOC: care type: Other care; ESNA: Elec. status: Not Assigned; ESEA: Elec. status: Emerg. Care.
Consistency
Shap demonstrated insensitivity to underlying design variations, moderate consistency with respect to the deployed model and strong sign agreement for the main contributor. Explanations obtained with DTD had the poorest consistency (see Figure 5(a)–(c)).
Figure 5. Consistency results represented by the average agreement across 5 independent runs.
Discussion
Several studies2,19–21 have explored the role of explainability in AI-based clinical decision-support tools.
Physicians viewed XAI as crucial for their trust in technology. 19 Given a hypothetical scenario, physicians trusted outputs from model-agnostic explainability methods more than those from models without explanations. 20 These studies assumed explanation consistency, but a recent study showed that state-of-the-art explanation methods often disagree in their outputs. 11 This study explores XAI agreement and disagreement in a clinical context, examining its potential implications to address a key challenge in the clinical adoption of ML models: the need for transparency and explainability.
Regarding domain-appropriate representation, except for the regression model (L1), all other methods included two predictors suggested by experts. One of the reasons for the unexpected result with L1 on the RA30 outcome metric could be the particularity of the paediatric cohort. Some features recognised as relevant for these models, however, might be partially or completely non-representative for supporting clinical decision-making. For example, a patient’s admission source or elective status may not be as clinically relevant as recent pathology results.
In the context of predicting patient deterioration and risk of readmission, explanations obtained at the patient level were recognised by clinical collaborators as actionable, i.e. they can guide the choice of intervention and can help busy clinicians prioritise their efforts while evaluating patients. 14 However, while the studies14,15 have assessed model performance, there is no evidence supporting the correctness of the explanations. To that end, one of the major contributions of this paper is the examination of the correctness of the generated explanations against the true causes recorded by the data collection system deployed in the study hospital (i.e. benchmarking), albeit only on one dataset (VS).
Benchmarking the explanations obtained for patient readmission was impossible as there was no ground truth. Still, the identified top predictors benefit from being already recognized as informative in predicting readmission. However, the agreement on the most contributing factor across the methods was 55%, which may affect clinicians’ trust adversely. The overlap in top features between correct and incorrect predictions suggests that, although the model is identifying relevant features, it may not be using them effectively. This could be due to highly skewed distributions and overlapping feature values between correct and incorrect predictions, leading to inconsistent performance. For clinicians and practitioners, this emphasizes the importance of carefully interpreting model outputs and potentially refining the model to reduce incorrect predictions.
When considering timeliness, both algorithms can be leveraged to provide real-time predictions and explanations offering relevant complementary information that is well aligned with current clinical workflows, allowing for early intervention and the prioritisation of clinical efforts for care planning.
Results demonstrate a violation of the consistency criteria, undermining reliability and actionability, which in turn may impact clinicians’ trust adversely. From the clinical perspective, there are three sources of imperfection in the input features and the available data that could cause discrepancy and/or incorrectness of the results 12 : (i) incomplete information to exactly “resolve” the question, i.e. missing features which, if available, would explain more of the variation and causation of the outputs; (ii) dependence between factors: several or most of the inputs have interrelationships with other input features and observations, e.g. in vital signs, pulse and blood pressure, or respiratory rate, oxygen flow and oxygen saturation. While these are each “independently related to the outcome” in the model, they are not independent of each other. This means that different groups of features could have the same implications for clinicians and hence for decision-making. Aas et al. 18 showed that Shap 10 may lead to incorrect explanations when features are highly correlated; (iii) errors, missing data that could have been observed, or contradictory information. From the modelling perspective, the main reasons for discrepancy can be attributed to the optimisation objective, which is simply the minimisation of the prediction error. Consequently, (i) causal and/or clinically relevant associations might not be discovered and hence cannot be explained post hoc, and (ii) since different ML methods operate on different principles, different features may be identified as the most relevant.
Findings align with the literature 11 and support the opinion that current explainability approaches are not reliable and cannot be trusted at the patient level. 22 However, while the initial observation of inconsistency and misalignment with the true triggers might give the impression that XAI is useless to clinicians because it cannot guide interventions, it is important to consider other factors before drawing conclusions. If we view explanations as a means to gain deeper insights while acknowledging their imperfections, 23 they can still provide valuable information to clinicians, especially when minimal trust is established. Moreover, when features are highly correlated, different groups of features may point in the same direction. In situations where a clinician knows little about the patient’s condition, receiving an explanation could be a helpful suggestion, even if it is correct only half of the time, because it provides some additional information. However, if a clinician is already highly confident that a patient is deteriorating in a recognisable clinical pattern, an alternative explanation would require consideration in the same way alternative diagnoses or treatments would be considered. Additionally, explanations can aid model troubleshooting and system audits, helping to improve performance and identify biases in the model development phase. 22
XAI explanations should be rigorously assessed for domain-appropriate representation, potential actionability and consistency, using the approach deployed in this study. At the practitioners’ end, the identified limitations can be mitigated by accounting for feature interdependencies; defining explanation uncertainty and confidence intervals based on the data distributions for correct/incorrect predictions and patient risk groups; implementing robust data validation and imputation strategies; aligning optimisation objectives in model development with clinical goals; enhancing data collection; and considering and presenting only actionable explanations to inform clinicians. XAI can be misleading and considered harmful.24,25 Thus, using clinical judgment to critically evaluate AI explanations, being vigilant about data quality, being aware of XAI limitations, and treating AI outputs as just additional information can contribute to more informed decisions while preventing over-reliance.
Limitations
This study considered only the most popular and readily available state-of-the-art XAI methods. Exploring recent methods that account for causality or multicollinearity, e.g.,26–30 would be valuable in future work. The clinical implications of the results were discussed with one senior clinician; a qualitative study on this topic could help generalise the insights.
Conclusion
Evaluation of two popular XAIs for explaining clinical predictive models showed good domain-appropriate representation, albeit moderate consistency and agreement with reality. This paper suggests that 1) if a sufficient number of disparate ML methods agree on influential relationships, 2) if observable input factors can be modified and doing so changes the model output, and 3) if the adjusted ML model outcomes concur with real-world results, then we are one step closer to trustworthy explanations. Unless these conditions are met, no explanation method should be considered trustworthy or used to guide the choice of clinical intervention. However, explanations might still be useful in helping clinicians in cases where they know little about a patient although the information is present, or in identifying “at risk” patients who look deceptively stable. In the latter case there is a danger of over-relying on AI, which could lead to incorrect clinical decisions.
The process of selecting the best model involves well-established assessment criteria. Although a large body of literature uses XAI to meet regulatory requirements or increase trust, there are currently no standardised criteria or recommendations for the application-specific development, evaluation and/or selection of XAI methods. Moreover, studies involving an XAI component are already registered at ClinicalTrials.gov. Thus, the development of comprehensive, clinically informed, actionable XAI recommendations is critical. Their integration into the development process can assist clinicians in optimising trust by avoiding over-reliance while not dismissing potential benefits.
Footnotes
Acknowledgement
We acknowledge the Responsible Innovation Future Science Program for recognizing the value of this study, supporting its extension, and co-funding it alongside the Australian e-Health Research Centre.
Author contributions
A.B. initiated the project and was a project lead, provided project conception, and performed the analysis. D.C. was the clinical lead and provided clinical guidance in interpreting the results and discussing their implications. J.R. contributed to the literature review and the creation of a Supplementary information file. S.K. contributed results discussion. W.H. created a pipeline for the NN model and DTD explainer. A.B made the first article draft. All authors contributed to revising the first article draft and approving the final version of the manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
Research funding provided to A.B. from the University of Queensland, with in-kind contributions from AEHRC.
Ethical statement
Preprint
Brankovic, Aida, et al. “Evaluation of Popular XAI Applied to Clinical Prediction Models: Can They be Trusted?.” arXiv preprint arXiv:2306.11985, 2023, https://arxiv.org/abs/2306.11985.
Supplemental Material
Supplemental material for this article is available online.
References
