Abstract
The current approaches to explaining black box machine learning models have primarily been based on the intuition of model developers, rather than being informed by end-user needs or existing literature. Our goal is to utilize existing cognitive science and human factors research to design explanation displays. To achieve this, we used the Cleveland Heart Disease Data Set to create an eXtreme Gradient Boosting heart disease prediction model. We established an initial context of use to inform the design of a prototype explanation display. Our design choices were based on cognitive chunk organization, and we used SHapley Additive exPlanations (SHAP) to generate instance-level explanations for our model. Model evaluation showed good performance, and we developed four prototype explanation displays. Our work demonstrates that it is feasible to design multiple prototype explanation displays for complex machine learning models by organizing features in a structured manner. We also provide a set of steps that can be followed for designing and evaluating user-centered explanations in healthcare.
Introduction
Machine learning (ML), a subset of artificial intelligence (AI), continues to gain traction in various fields of medicine (Beam & Kohane, 2018). However, advances in ML have involved black box models whose algorithms do not provide human-understandable explanations in support of their decisions (Panigutti et al., 2020). The field of eXplainable AI (XAI) aims to address this problem by providing visibility into how black-box ML models make predictions (Rai, 2020).
Context of Use of Explainable ML
The context of use for a product or system, which consists of the users, their needs, and the environment in which it is used (Maguire, 2001), determines its usability (Harvey et al., 2011). For explainable ML to be effective, it is crucial to identify the intended context of use, that is, the intended users of the explanations, their explanation needs, and the environment in which the explanations will be used.
Explainability Techniques for Complex ML Models
Techniques to develop explanations may be categorized as either intrinsic or post-hoc (Arrieta et al., 2019; Rudin, 2019). Intrinsic models (e.g., decision trees, linear and logistic regression models) are inherently interpretable (Rudin, 2019). Post-hoc methods use outputs of trained complex ML models (e.g., tree ensembles, support vector machines, neural networks) to explain why certain predictions are made (Arrieta et al., 2019). Post-hoc methods may be applicable to all types of models (i.e., model-agnostic) or certain model types (i.e., model-specific) (Arrieta et al., 2019). Further, these methods may aim to explain a single prediction (i.e., local) or the entire model (i.e., global) (Belle & Papantonis, 2021).
Various methods, including feature relevance explanation, visual explanation, and local explanation, have been proposed to produce post-hoc explanations from black box ML models (Arrieta et al., 2019; Belle & Papantonis, 2021). Feature relevance methods explain a model’s decision by ranking the relevance of each input feature (Arrieta et al., 2019). Local Interpretable Model-agnostic Explanations (LIME; Ribeiro et al., 2016) and SHapley Additive exPlanations (SHAP; Lundberg & Lee, 2017) are two popular feature relevance explanation techniques. Whereas LIME trains local surrogate models to explain individual predictions, SHAP unifies several local explanation methods into a single approach to explain model predictions (Molnar, 2018).
Related Studies
Prior studies have utilized several approaches to explain predictions of complex ML models in the health care domain (Lundberg et al., 2018; Pan et al., 2019). For example, Lundberg et al. (2018) developed an explainable ML system using SHAP to enable anesthesiologists to predict hypoxemia risk during general anesthesia. Pan et al. (2019) used LIME to develop a predictive model to diagnose central precocious puberty.
While previous studies have explored alternative approaches to model explanation, most have not been informed by end-user needs and insights from the literature (Abdul et al., 2018; T. Miller, 2019). XAI developers can build on existing research in cognitive science and human-computer interaction to design effective explanations that are interpretable for users (T. Miller, 2019).
Cognitive Chunks
Of particular interest in the present study is how cognitive chunks are organized. Chunking is a process by which humans break down and group information into a meaningful whole (G. A. Miller, 1956). A chunk represents an organizational unit in the short-term memory of humans. Factors that correspond to explanation needs and thus may influence the presentation of an explanation include (1) the form of cognitive chunks, (2) the number of cognitive chunks, (3) how the cognitive chunks are organized, and (4) interaction among cognitive chunks (Doshi-Velez & Kim, 2017). Organizing chunks into meaningful groups can potentially reduce cognitive load and processing time.
Objective
Humans typically seek explanations for events that they find abnormal, and prefer explanations that reveal causes that are abnormal or controllable (T. Miller, 2019). The term abnormality concerns the ability to identify abnormal events (T. Miller, 2019), while controllability concerns the extent to which an event can be altered (Girotto et al., 1991). Grouping factors by abnormality and controllability can help provide a causal explanation of an event (T. Miller, 2019).
Our objective is to leverage the organization of cognitive chunks to design prototype explanation displays for a heart disease prediction model. In this study, an explanation display refers to the graphical interface designed to help users understand the contribution of each feature to the prediction of an instance made by the model.
Methodology
The Cleveland Heart Disease Data Set
We utilized the Cleveland Heart Disease Data Set (Aha & Kibler, 1988) available on the Kaggle website (Heart Disease UCI, n.d.). This data set contains 303 instances. Although the data set has 76 features, the majority of published studies use only 14 of them. We used these 14 features as described in Table 1. We predicted whether a person has heart disease based on the first 13 features. Since our main aim was to design user-centered explanation displays, we made no attempt to learn a best-performing model.
Description of Cleveland Heart Disease Data Set.
Extreme Gradient Boosting
We used extreme gradient boosting (XGBoost), a decision-tree-based ensemble ML algorithm, implemented in Python in the XGBoost open-source library (Chen et al., 2015). XGBoost was chosen because of its ability to handle multivariate attributes and its support for explainability (Ravaut et al., 2021). Also, XGBoost has been used in prior health care studies, including prediction of diabetes (Ravaut et al., 2021; L. Wang et al., 2020), chronic kidney disease (Ogunleye & Wang, 2019), emergency department disposition (Barak-Corren et al., 2021) and disease course in COVID-19 patients (Montomoli et al., 2021).
Data Preprocessing
There were neither outliers nor missing values in the dataset. Further, there was no data imbalance problem to handle since the dataset is relatively balanced (54.46% positive and 45.54% negative heart disease instances). We did not encode the categorical variables (i.e., sex, cp, fbs, restecg, exang, slope, ca, thal) because they had already been encoded by the data provider (see Table 1). We used min-max transformation to rescale each continuous feature (i.e., age, trestbps, chol, thalach, oldpeak) to the [0, 1] interval.
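A minimal Python sketch of this preprocessing step is shown below. The file name heart.csv, the label column name "target," and the use of pandas and scikit-learn are assumptions for illustration, not the authors' exact pipeline.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Cleveland Heart Disease Data Set (Kaggle version, 14 features).
df = pd.read_csv("heart.csv")

# Min-max transformation of the continuous features to the [0, 1] interval;
# categorical features are kept as provided (already integer-encoded).
continuous = ["age", "trestbps", "chol", "thalach", "oldpeak"]
df[continuous] = MinMaxScaler().fit_transform(df[continuous])

X = df.drop(columns="target")  # first 13 features used as predictors
y = df["target"]               # heart disease label
```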
Model Development
We split the dataset into 80% for training and 20% for performance evaluation. We applied 10-fold cross-validation on the training set to tune the hyperparameters of the XGBoost classifier. We employed the grid search method to find the hyperparameter values that yielded optimal model performance. Hyperparameters specified to guide the learning process included the minimum sum of instance weight needed in a child node (“min_child_weight” = [1, 5, 10]), the threshold of gain improvement to keep a split tree (“gamma” = [1, 2, 5]), the subsample proportion (“subsample” = [0.6, 0.8, 1.0]), the subsample ratio of columns when constructing each tree (“colsample_bytree” = [0.6, 0.8, 1.0]), and the maximum tree depth (“max_depth” = [3, 4, 5]).
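A minimal sketch of this tuning step, assuming scikit-learn's GridSearchCV with the xgboost Python package, is shown below. The random seed, stratified split, and accuracy scoring are assumptions rather than reported settings.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier

# 80/20 split for training and performance evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Hyperparameter grid described above.
param_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [1, 2, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 4, 5],
}

# 10-fold cross-validated grid search on the training set.
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    cv=10,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)
model = search.best_estimator_  # tuned XGBoost classifier
```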
Model Evaluation
We applied the optimal model to the testing dataset to evaluate its performance. Based on Figure 1, we evaluated model performance using the following metrics: accuracy ((TP + TN)/(TP + FP + FN + TN)), specificity (TN/(FP + TN)), sensitivity (TP/(TP + FN)), positive predictive value (PPV; TP/(TP + FP)), negative predictive value (NPV; TN/(FN + TN)), and F1 score (2TP/(2TP + FP + FN)). Further, we evaluated the model’s area under the receiver operating characteristic curve (AUROC).

Basis for deriving accuracy, specificity, sensitivity, positive predictive value, negative predictive value and F1 score.
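These metrics can be computed directly from the confusion matrix of the held-out test set. A minimal sketch, continuing from the tuned model and split in the previous step, is:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = (tp + tn) / (tp + fp + fn + tn)
specificity = tn / (fp + tn)
sensitivity = tp / (tp + fn)
ppv         = tp / (tp + fp)
npv         = tn / (fn + tn)
f1          = 2 * tp / (2 * tp + fp + fn)

# AUROC uses the predicted probability of the positive class.
auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```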
Framing the Context of Use
We defined an initial context of use for explanations for our heart disease prediction model by identifying the intended users, their explanation needs, and the environment in which the explanations would be used.
Design of Prototype Explanation Displays
Our defined context of use informed the design of our prototype explanation displays. We envisaged that our intended users, cardiology-care providers, would have limited ML knowledge and might use explanation displays in cognitively demanding environments. Thus, we designed the explanations to help our intended users assess model credibility and to minimize the cognitive effort required to process the content they present. Further, for specific predictions, we envisaged that cardiology-care providers would likely seek explanations that highlight the features pushing a prediction toward one outcome rather than another. A model-agnostic feature relevance explanation technique would allow for the design of explanation displays that meet our initial context of use requirements.
We utilized SHAP to generate instance-level explanations of feature influence for our heart disease prediction model. We chose SHAP because it satisfies the local accuracy and consistency properties (Lundberg et al., 2020). SHAP, a game theoretic approach, computes Shapley values and uses them as the basis for explaining the prediction of an instance. The Shapley value of a feature is the average marginal contribution of that feature, considering all possible combinations of features.
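A minimal sketch of generating such an instance-level explanation with the shap Python package is shown below; TreeExplainer supports XGBoost models, and the choice of test instance here is arbitrary.

```python
import shap

# TreeExplainer computes Shapley values efficiently for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)  # one row of feature contributions per instance

# Contribution of each feature to the prediction for a single (arbitrary) test instance.
instance_shap = shap_values[0]
```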
Recall that cognitive chunks are the basic units of explanation. By grouping features in a structured way rather than presenting them individually, larger cognitive chunks can be obtained for instance-level explanations of feature influence (Barda et al., 2020). We utilized design options based on cognitive chunk organization for the design of four prototypes. The design options, summarized in Table 3, concerned whether features were presented individually or in groups, and how those groups were formed (by influence, by assessment category, or by both).
Categorization of Features.
Table 3 describes the options considered for the design of the four explanation display prototypes in this study. Table 4 shows the specific design option chosen for each prototype. Prototype 1 was designed to display individual features, with features not organized in a structured way. Prototype 2 was designed to display features grouped by influence into positive contributing factors and negative contributing factors. Prototype 3 was designed to display features grouped by assessment into demographics, diagnoses and physical assessment, and laboratory test results. Prototype 4 was designed to display features first grouped by assessment, with each subgroup then grouped by influence.
Design Options Considered for Explanation Display Prototypes.
Design Options Used for Explanation Display Prototypes.
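The influence-based grouping used in Prototype 2 can be sketched in a few lines of Python, continuing from the instance-level SHAP values computed earlier; features are split into positive and negative contributing factors and ranked within each group.

```python
import pandas as pd

# Pair each feature name with its SHAP contribution for this instance.
contrib = pd.Series(instance_shap, index=X_test.columns)

# Group by influence: positive vs. negative contributing factors,
# each ordered from most to least influential.
positive_factors = contrib[contrib > 0].sort_values(ascending=False)
negative_factors = contrib[contrib < 0].sort_values()
```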
The display format of each prototype is a tornado plot, an intuitive technique for depicting the influence of each feature on the model’s prediction of an instance (D. Wang et al., 2021). Also, as suggested by D. Wang et al. (2019), we supplemented each explanation plot with additional information and raw feature values.
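As an illustration of this display format (not the authors' exact layout), a tornado-style plot of one instance's SHAP values could be produced with matplotlib roughly as follows; the ungrouped ordering shown here corresponds to Prototype 1.

```python
import numpy as np
import matplotlib.pyplot as plt

# Order features by absolute influence so the largest bars appear at the top.
order = np.argsort(np.abs(instance_shap))
features = np.array(X_test.columns)[order]
values = instance_shap[order]

plt.barh(features, values,
         color=["tab:red" if v > 0 else "tab:blue" for v in values])
plt.xlabel("SHAP value (impact on prediction)")
plt.title("Instance-level explanation (individual features)")
plt.tight_layout()
plt.show()
```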
Results
Modeling and Parameter Tuning
Our optimal model had the following hyperparameters: colsample_bytree = 0.6; max_depth = 3; gamma = 2; min_child_weight = 5; subsample = 0.6. Model performance on the testing set is shown in Table 5.
Performance of XGBoost Model on Testing Set.
Prototype Explanation Displays
Figures 2 to 5 describe and provide two instance-level explanation display examples for each of the four prototypes. For each explanation plot, the abscissa shows SHAP values (impact on prediction), and the ordinate lists the features. Each explanation plot is supplemented with additional information and raw feature values.

Prototype 1. Explanation display made up of explanation plot and table of raw values of features used in our XGBoost model. (A) Predicted outcome is “Patient does not have heart disease,” (B) Predicted outcome is “Patient has heart disease.”

Prototype 2. Explanation display made up of explanation plot and table of raw values of features used in our XGBoost model. (A) Predicted outcome is “Patient does not have heart disease,” (B) Predicted outcome is “Patient has heart disease.”

Prototype 3. Explanation display made up of explanation plot and table of raw values of features used in our XGBoost model. (A) Predicted outcome is “Patient does not have heart disease,” (B) Predicted outcome is “Patient has heart disease.”

Prototype 4. Explanation display made up of explanation plot and table of raw values of features used in our XGBoost model. (A) Predicted outcome is “Patient does not have heart disease,” (B) Predicted outcome is “Patient has heart disease.”
Discussion
We developed an XGBoost heart disease prediction model, after which we framed an initial context of use that informed the design of four prototype explanation displays. Accuracy of 0.82 (0.76–0.87), specificity of 0.85 (0.75–0.94), sensitivity of 0.78 (0.66–0.88), PPV of 0.82 (0.70–0.91), NPV of 0.82 (0.73–0.90), AUROC of 0.81 (0.76–0.86), and F1-score of 0.79 (0.73–0.86) indicate good performance of our model in predicting heart disease. We utilized SHAP to generate instance-level explanations of our model and employed design options based on cognitive chunk organization to design our prototypes. We demonstrate that it is possible to design multiple prototype explanation displays for a black box ML model based on structured organization of features. Our prototype explanation displays may be presented to a representative group of cardiology-care providers to solicit their feedback on which is best at enabling them to assess model credibility with the least cognitive effort.
Toward a User-centered Design of Prototype Explanation Displays
What constitutes a “good” explanation is often based on the intuition of model developers, whose knowledge and background are mostly not representative of end-user expertise (T. Miller, 2019; T. Miller et al., 2017). This is a suboptimal approach, as failing to design user-centered explanations may translate into reduced usability and user acceptance (Abdul et al., 2018). User-centered design (UCD), a design approach that focuses on users and their needs, holds considerable potential for improving the design and implementation of XAI within health care (Dopp et al., 2019).
The first logical step toward the design of user-centered explanation displays is to refine our initial context of use. We will conduct focus groups with a convenience sample of cardiology-care providers to solicit their feedback on our prototype explanation displays. We will seek their perceptions of the explanation displays for explaining predictions from our heart disease prediction model and explore their preferences among the design options. Insights from the focus groups will inform the design of a final user-centered explanation display.
Next, we will conduct a mixed-methods within-subjects laboratory study to evaluate our user-centered design. This will be a human-grounded evaluation (Doshi-Velez & Kim, 2017) in which cardiology-care providers will use the explanations to perform simplified tasks. Experimental conditions will be: (1) feature data only, (2) feature data and model prediction of the data, and (3) feature data, model prediction of the data, and user-centered explanations from the model. We aim to utilize hypothesis testing to evaluate the utility of our user-centered explanations and how they impact the speed and accuracy with which cardiology-care providers assess patient condition.
Conclusion
This study drew on existing cognitive science and human-computer interaction research to design prototype explanation displays for a heart disease prediction model. We employed design options based on cognitive chunk organization and utilized SHAP, a post-hoc explainer, to generate instance-level explanations for model predictions. Finally, we provided steps that can be followed toward the design and evaluation of user-centered explanations in health care.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the U.S. National Science Foundation under Award Number 2232869.
