Abstract
Background
Traditional risk scores for ST-segment elevation myocardial infarction (STEMI), such as the Thrombolysis in Myocardial Infarction (TIMI) score, were developed predominantly in Western populations and may exhibit suboptimal performance in Asian patients due to differing clinical and genetic profiles. Accurate in-hospital mortality prediction is essential for optimizing clinical management in this high-risk group. However, key aspects such as model explainability (e.g., SHapley Additive exPlanations (SHAP) analysis) and probability calibration are often neglected, limiting the clinical utility and trustworthiness of predictive models.
Objectives
This study aimed to develop and validate explainable, well-calibrated machine learning models for predicting in-hospital mortality among Asian STEMI patients, benchmarking their performance against the TIMI risk score. Interpretability was enhanced using SHAP for both global and local explanations, and model calibration was systematically addressed.
Methods
We conducted a retrospective cohort study using data from 49,574 Asian STEMI patients in the Malaysian National Cardiovascular Disease registry (2006–2021). A temporal split was applied to simulate prospective deployment: data from 2006–2018 were used for model training, 2019 data for calibration, and 2020–2021 data as an independent test set. Multiple ML algorithms including logistic regression (LR), support vector machine, random forest, gradient boosting machine, and XGBoost were developed and compared. Stacked ensemble models were constructed using combinations of these base learners. Performance was evaluated on the independent test set using area under the receiver operating characteristic curve (AUC-ROC), accuracy, recall, specificity, Brier score (for calibration), and Net Reclassification Index (NRI), with benchmarking against the TIMI score. Isotonic regression was applied for probability calibration, and SHAP was used for interpretability.
Results
The calibrated LR model achieved the best overall performance (AUC = 0.8884, 95% CI (confidence interval): 0.8756–0.9011; accuracy = 0.8538; recall = 0.7746; specificity = 0.8617; Brier score = 0.0598; NRI = 0.5828 vs. TIMI). SHAP analysis confirmed that model predictions were aligned with established clinical reasoning, enhancing interpretability. Probability calibration further improved model reliability, as evidenced by a reduced Brier score.
Conclusions
Calibrated LR, supported by SHAP-based explainability and robust probability calibration, significantly outperformed the TIMI score for in-hospital mortality prediction in Asian STEMI patients. This approach improves predictive accuracy, reliability, and interpretability, supporting more personalized and clinically trustworthy risk stratification. These findings highlight strong potential for real-world clinical integration and improved patient outcomes in diverse Asian populations.
Keywords
Introduction
Cardiovascular disease (CVD) is the leading cause of death worldwide, with Asia accounting for 58% (10.8 million) of global CVD deaths in 2019. In Asia, 35% of all deaths are due to CVD, and nearly 39% of these are premature, substantially higher than in the US (23%) or Europe (22%). 1 Among CVD subtypes, STEMI shows significant regional disparities across Asia, with incidence rates ranging from 33 to 138 per 100,000 person-years. 2
Given the substantial burden of CVD in Asia, accurate prediction of in-hospital mortality and early detection of STEMI are essential for improving outcomes.3,4 TIMI and Global Registry of Acute Coronary Events (GRACE) scores are widely used for risk stratification in acute coronary syndromes (ACS).5,6 Among these, the TIMI risk score is particularly prevalent in Asian hospitals, owing to its reliance on easily obtainable clinical parameters.5,6 However, these models were developed in predominantly Western and Caucasian populations and often perform suboptimally in Asian cohorts due to differences in genetic backgrounds, risk factor profiles, and healthcare systems.7,8 For instance, studies in Singapore have shown that GRACE underestimates in-hospital mortality after acute myocardial infarction (AMI) in multiracial Asian populations by approximately 4-fold (predicted: 1.6–2.4% vs. observed: 6.4–9.8%). 9 This highlights the need for recalibrated or novel approaches to risk prediction tailored to Asia, as reliance on Western models may lead to inaccurate risk estimation and suboptimal care.10,11
ML algorithms offer the potential to overcome limitations of traditional regression-based risk scores by capturing complex, non-linear interactions among numerous clinical variables.12–15 ML models including LR, random forest (RF), support vector machine (SVM), and XGBoost (XGB) have shown improved predictive performance for ACS risk stratification compared to conventional scoring systems.15–18 Ensemble approaches, particularly stacking, further enhance prediction by integrating multiple ML algorithms to generate robust and accurate outcomes. 19 While ML, including stacking, has been applied to ACS risk prediction in some Asian settings, 20 rigorous application of stacked ensembles for STEMI mortality prediction in large, multi-ethnic Asian populations remains limited. 21
To enhance generalizability and address methodological challenges including temporal concept drift and class imbalance, we employed temporal validation and cost-sensitive learning strategies throughout model development.22,23 Moreover, despite the superior predictive accuracy of advanced ML models, particularly stacked ensembles, their “black box” nature often hinders clinical adoption. 24 Enhancing the transparency and interpretability of these models is crucial for clinician trust and integration into routine practice. 25 To address this, explainable AI (XAI) methods such as SHAP have emerged, providing both global and local interpretability by quantifying each feature's contribution to model predictions. 26 Integrating SHAP into ML workflows can enhance clinician trust, support personalized interventions, and facilitate adoption especially in diverse Asian healthcare settings where interpretability is essential for clinical acceptance. 27
Model calibration also ensures that predicted probabilities align closely with actual event rates, which is critical for threshold-based decision-making and clinical integration. 28 Although ML models such as RF, XGB, and stacked ensembles often demonstrate high discrimination, their probability estimates can be poorly calibrated by default. 29 Despite its importance, calibration remains underreported in ACS risk prediction. Fewer than 10% of studies include calibration plots or Brier scores. 30 Calibration methods like Platt scaling and isotonic regression are therefore vital to improve risk communication and clinical applicability. 31
This study addresses key gaps in STEMI risk prediction for Asian populations by developing and validating explainable and calibrated ML models tailored to regional needs. By benchmarking these models, including stacked ensembles, against conventional risk scores, integrating SHAP for interpretability, and rigorously evaluating calibration, we aim to provide more accurate, transparent, and clinically applicable tools for in-hospital mortality prediction in diverse Asian cohorts. To enhance generalizability and simulate real-world deployment, we employed a temporal split of the dataset, and incorporated cost-sensitive learning to address class imbalance during model training.
While machine learning has been previously applied to the NCVD registry,20,32 this study provides three distinct and crucial contributions. First, our model is developed on contemporary data, reflecting modern patient profiles and treatment strategies, which is critical as risk models can become outdated. Second, we address the limitations of standard random-split validation by implementing a rigorous temporal validation strategy that simulates prospective deployment and assesses robustness to concept drift. Finally, responding to the need for models to demonstrate clear clinical relevance beyond algorithmic performance, our primary objective is to develop a calibrated and explainable model. We systematically apply and evaluate probability calibration using a dedicated dataset to ensure that the model's risk predictions are reliable and trustworthy for real-world clinical decision-making.
Methods
We followed the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) checklist to ensure the transparency and reproducibility of our predictive modeling process.
The overall workflow of this study is depicted in Figure 1, which outlines the ML development process. We followed a structured ML pipeline to develop, evaluate, and interpret calibrated models for predicting in-hospital mortality among STEMI patients. The methodology encompassed data preprocessing, feature selection, model development, evaluation, calibration, comparative analysis, and application of SHAP for model interpretability.

Flow diagram.
Study design and data source
This retrospective cohort study utilized data from the NCVD registry, a nationwide clinical database that systematically captures standardized information on patient demographics, clinical presentation, management, and outcomes for individuals admitted with ACS at participating hospitals across Malaysia. Although ACS encompasses a range of clinical presentations, this study specifically focused on patients diagnosed with STEMI.
The Medical Review & Ethics Committee (MREC) of the Ministry of Health (MOH) of Malaysia approved the NCVD registry study (Approval Code: NMRR-07–38–164). The MREC waived patient informed consent for NCVD. This study was also authorized by the UiTM ethics committee (reference number: 600-TNCPI (5/1/6)) and the National Heart Association of Malaysia (NHAM). The data were anonymized before use; the analysis relied only on variable values and features, and the investigators had no access to patients' personal information.
Data preprocessing
Prior to model development, the dataset underwent several preprocessing steps (data splitting, imputation, and normalization) to ensure data quality and suitability for ML algorithms. These procedures form the foundation for building reliable, robust predictive models.
Baseline characteristics were compared using tests appropriate to data type and distribution: independent t-tests for continuous variables and chi-square tests for categorical variables. A two-sided p-value <0.001 was considered statistically significant for all baseline characteristic comparisons. 33
Data splitting
To evaluate generalizability and simulate prospective deployment, a temporal split was applied. 34 Records from 2006–2018 were used for model training and development, 2019 data served as a calibration set, and 2020–2021 data constituted an independent temporal test set. Temporal validation is preferable to random splitting for clinical prediction because it tests performance against potential “concept drift” arising from shifts in patient mix, treatment patterns, or data-capture practices over time. 22
The distribution of patients across these datasets is shown in Table 1: the training set (2006–2018) comprised 37,519 patients, the calibration set (2019) 4429 patients, and the independent temporal test set (2020–2021) 7626 patients.
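The temporal split can be illustrated with a minimal sketch. The function name `assign_split` and the toy records are hypothetical and are not taken from the study's code; only the year boundaries come from the text.

```python
def assign_split(admission_year: int) -> str:
    """Map an admission year to the temporal split described in the text."""
    if 2006 <= admission_year <= 2018:
        return "train"
    if admission_year == 2019:
        return "calibration"
    if 2020 <= admission_year <= 2021:
        return "test"
    raise ValueError(f"year {admission_year} is outside the 2006-2021 registry window")


# Hypothetical records: (patient_id, admission_year)
records = [(1, 2007), (2, 2018), (3, 2019), (4, 2020), (5, 2021)]

splits = {}
for patient_id, year in records:
    splits.setdefault(assign_split(year), []).append(patient_id)

print(splits)
```

Because the assignment depends only on the admission year, no record from a later period can leak into training, which is the property that distinguishes this design from a random split.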
Total number of patients for each dataset.
Data imputation
Missing data were assessed and found predominantly missing completely at random (MCAR). 35 Continuous variables were imputed with the median, which is less sensitive to outliers in clinical measurements. Categorical variables were imputed with the mode, preserving the original distribution when missingness is limited or one category predominates. 36 The extent of missing data for each variable in the training dataset is detailed in Table 2.
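A minimal sketch of median and mode imputation is given below. It assumes the fill values are learned on the training set and then reused for the calibration and test sets; the text does not state this explicitly. The column names and values are hypothetical.

```python
from statistics import median, mode


def fit_imputer(rows, continuous, categorical):
    """Learn one fill value per column from the non-missing training values."""
    fill = {}
    for col in continuous:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return fill


def apply_imputer(rows, fill):
    """Replace missing entries (None) with the learned fill values."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical training rows with missing entries marked as None
train_rows = [
    {"heart_rate": 80, "killip": "I"},
    {"heart_rate": None, "killip": "I"},
    {"heart_rate": 100, "killip": None},
    {"heart_rate": 120, "killip": "II"},
]
fill_values = fit_imputer(train_rows, continuous=["heart_rate"], categorical=["killip"])
test_rows = apply_imputer([{"heart_rate": None, "killip": None}], fill_values)
print(fill_values, test_rows)
```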
Missing counts for training dataset.
Class balancing via cost-sensitive learning
In this dataset, the in-hospital mortality was rare, with a class ratio of approximately 1 : 16.5 (non-survivors : survivors), creating a pronounced imbalance. 23 Cost-sensitive learning was adopted instead of over- or under-sampling because it embeds misclassification costs directly into the loss function, avoiding data duplication or deletion and reducing over-fitting while preserving feature distributions. 37 Higher weights were assigned to minority-class instances and lower weights to majority-class instances, 38 imposing greater penalties for misclassifying deaths and improving sensitivity to these clinically critical cases. 39
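The idea of embedding misclassification costs in the loss can be sketched as follows. The exact weighting scheme used in the study is not stated here, so the inverse-frequency ("balanced") heuristic below is an assumption, and the toy labels only mimic the reported 1 : 16.5 ratio.

```python
import math


def balanced_class_weights(labels):
    """Inverse-frequency weights: weight_c = n_samples / (n_classes * n_c)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_log_loss(y_true, p_pred, weights):
    """Log loss in which each sample is scaled by the weight of its class."""
    total = sum(weights[y] for y in y_true)
    loss = -sum(
        weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    )
    return loss / total


# Toy labels reproducing a 1 : 16.5 death-to-survivor ratio (2 deaths, 33 survivors)
labels = [1] * 2 + [0] * 33
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each death contributes 16.5 times as much to the loss as each survivor, so both classes carry equal total weight during training.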
Normalization
Z-score normalization was applied to the continuous predictor variables. This method transforms each feature by subtracting its mean and dividing by its standard deviation, both calculated from the training dataset, so that each normalized feature has a mean of 0 and a standard deviation of 1 in the training data. 40
The normalization can be calculated using the following:
$$z = \frac{x - \mu}{\sigma}$$
where x is the original feature value, and μ and σ are the mean and standard deviation of that feature in the training dataset.
Standardization prevents features with larger absolute values or wider ranges from disproportionately influencing model training. It also contributes to making the model less sensitive to the original units of measurement of the input features and can be more robust to the presence of outliers compared to other scaling techniques like min–max scaling, particularly if the feature distributions are approximately normal or do not contain extreme, influential outliers.40,41
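As a small worked example, the sketch below standardizes a hypothetical heart-rate feature using statistics from the training values only. The use of the population standard deviation is an assumption; the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Return the training mean and (population) standard deviation."""
    return mean(train_values), pstdev(train_values)


def transform(values, mu, sigma):
    """Apply z = (x - mu) / sigma using the training statistics."""
    return [(v - mu) / sigma for v in values]


train_hr = [60, 80, 100]            # hypothetical training heart rates (bpm)
mu, sigma = fit_zscore(train_hr)
z_train = transform(train_hr, mu, sigma)
z_test = transform([90], mu, sigma)  # test data reuse the training mu and sigma
print(z_train, z_test)
```

Reusing the training mean and standard deviation for the calibration and test sets keeps information from later periods out of the preprocessing step.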
Feature selection
To identify the most relevant features for in-hospital mortality, reduce model complexity, enhance interpretability, and mitigate overfitting, a structured feature selection process was employed. Specifically, backward feature elimination (BFE), a wrapper method, was utilized. 42 Models were first trained using all features, and then iteratively re-trained while removing one feature at a time. At each iteration, features were ranked based on their impact on AUC, and the feature contributing the least to model performance was eliminated. This process continued until an optimal subset was identified that achieved the best trade-off between feature count and AUC performance. The final model was selected based on the subset that provided the highest AUC with the fewest features, balancing predictive power and model simplicity. 43
BFE was selected over alternative methods based on prior findings indicating its superior performance in predictive modeling within this context.32,44 BFE was applied independently to LR, RF, SVM, and XGB models to identify feature subsets achieving optimal balance between AUC and dimensionality reduction.
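The elimination loop can be sketched generically. In the study the score would be the AUC of a re-trained model; here `toy_auc` is a hypothetical stand-in with made-up per-feature gains, so that the example runs without any data or ML library.

```python
def backward_elimination(features, score_fn):
    """Drop one feature per round; return (best_subset, best_score, history)."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    history = [(list(current), best_score)]
    while len(current) > 1:
        # Score every subset obtained by removing exactly one remaining feature.
        scored = [
            (score_fn([f for f in current if f != drop]),
             [f for f in current if f != drop])
            for drop in current
        ]
        # Keep the removal that hurts (or helps) the score the most favourably.
        score, current = max(scored, key=lambda item: item[0])
        history.append((list(current), score))
        # On ties, prefer the smaller subset (highest score with fewest features).
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score, history


# Hypothetical per-feature contributions; negative values mimic noisy features.
GAIN = {"killip": 0.20, "age": 0.10, "heart_rate": 0.05,
        "noise_a": -0.01, "noise_b": -0.02}


def toy_auc(subset):
    return 0.5 + sum(GAIN[f] for f in subset)


best_subset, best_score, history = backward_elimination(list(GAIN), toy_auc)
print(best_subset, round(best_score, 3))
```

A real run would replace `toy_auc` with a function that re-trains the model on the given subset and returns its cross-validated AUC.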
Model development
Our approach comprised two stages: first, to develop and evaluate multiple base models, and second, to construct stacked ensemble models. All models’ performance metrics were recorded and compared in this study.
Base model
To predict in-hospital mortality among STEMI patients, we developed and evaluated several ML algorithms, including LR, RF, SVM, GBM, XGB, and a stacked ensemble model.45–49 All models used the feature set selected by XGB through backward elimination (14 features, as detailed in the Feature Selection section). Model hyperparameters are detailed in Table 3.
Models’ parameters.
To optimize model hyperparameters and prevent overfitting during the development phase, a 5-fold cross-validation (CV) strategy was employed. This CV was performed exclusively within the 2006–2018 training dataset. For each model, this five-fold CV process was used to guide the hyperparameter tuning (i.e., grid search optimization) and to generate the validation scores for model training.
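The combination of grid search and five-fold CV can be sketched as below. This is an illustration under stated assumptions: the folds are contiguous and unshuffled, the grid over a parameter named `C` is hypothetical, and `toy_fold_score` stands in for training a model on the training indices and scoring it on the validation indices.

```python
import math
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds


def grid_search(param_grid, n_samples, fold_score, k=5):
    """Return the parameter combination with the best mean validation score."""
    folds = k_fold_indices(n_samples, k)
    names = sorted(param_grid)
    best_params, best_mean = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        mean_score = sum(fold_score(params, tr, va) for tr, va in folds) / k
        if mean_score > best_mean:
            best_params, best_mean = params, mean_score
    return best_params, best_mean


def toy_fold_score(params, train_idx, val_idx):
    """Hypothetical validation AUC that peaks at C = 1.0."""
    return 0.89 - 0.01 * abs(math.log10(params["C"]))


best_params, best_mean = grid_search(
    {"C": [0.01, 0.1, 1.0, 10.0]}, n_samples=12, fold_score=toy_fold_score
)
print(best_params)
```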
Ensemble stacking model
In this study, we propose two stacking models for in-hospital mortality prediction in STEMI patients. The first stacks all base learners (LR, RF, SVM, GBM, and XGB) with LR as the meta-learner. The second comprises only RF, GBM, and XGB as base learners, again with LR serving as the meta-learner. The performance of all base and stacked models was compared.
Recent stacking studies reaffirm that a simple, regularized LR meta-learner can give the ensemble both stability and transparency. In a large multi-dataset heart-disease project, Ganie et al. benchmarked LR and found that it yielded the highest accuracy and AUC while showing the lowest fold-to-fold variance, crediting its superior “calibration and resistance to over-fitting” for the gain. 50
Model evaluation
To assess predictive performance, all developed models were evaluated on an independent temporal test set. This evaluation provides the most realistic estimate of model generalizability to future, unseen patient data.
Threshold selection
To convert the continuous outputs into discrete binary classifications (dead or alive), an optimal decision threshold is selected. In this study, the optimal threshold was determined using the Youden index (J index). 51
Youden's J statistic is
$$J = \text{sensitivity} + \text{specificity} - 1 = \text{recall}_1 + \text{recall}_0 - 1.$$
This threshold maximizes the vertical distance between the ROC curve and the diagonal line representing random chance, balancing a model's sensitivity (true positive rate) and specificity (true negative rate) by assigning equal importance to both. 52 The Youden index is widely used in medical diagnostics and prognostics for selecting cut-off points in continuous tests; it gives equal clinical weight to sensitivity and specificity. 53
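Threshold selection by the Youden index can be sketched as a scan over candidate cut-offs. The labels and predicted probabilities below are hypothetical, and the convention that a patient is classified as a death when the probability is at or above the threshold is an assumption.

```python
def youden_threshold(y_true, p_pred):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(p_pred)):
        # Predict death when the probability is at or above the threshold.
        tp = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, p_pred) if y == 0 and p < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical outcomes (1 = died) and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
p_pred = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]

threshold, j_stat = youden_threshold(y_true, p_pred)
print(threshold, round(j_stat, 3))
```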
Performance metrics
After selecting the optimal classification threshold for each model, performance was assessed using established metrics. These metrics offer complementary perspectives on the predictive capabilities of each model. Binary classification performance is evaluated using a confusion matrix with four outcomes: True Positives (TP, correctly identified deaths), True Negatives (TN, correctly identified survivors), False Positives (FP, survivors incorrectly predicted as deaths), and False Negatives (FN, deaths incorrectly predicted as survivors). From these values, the following metrics are calculated:
AUC-ROC measures the model's ability to discriminate between patients who experienced in-hospital mortality and those who did not, across all possible thresholds. An AUC-ROC value of 1.0 indicates perfect discrimination, whereas a value of 0.5 reflects performance equivalent to random chance. AUC-ROC is widely regarded as a robust metric for evaluating classifier performance, particularly in medical applications. 54
Accuracy represents the proportion of all instances in the test set that were correctly classified by the model using the selected threshold (based on the Youden index). This metric provides an overall assessment of the model's correctness in predicting both positive and negative cases.
Specificity (true negative rate) measures the proportion of actual negatives (i.e., survivors) that were correctly identified as such by the model. This metric reflects the model's ability to accurately recognize patients who did not experience in-hospital mortality.
Recall (sensitivity, true positive rate) quantifies the proportion of actual positives (i.e., non-survivors) that were correctly identified as non-survivors by the model. In the context of mortality prediction, where the positive class (deaths) is often underrepresented, recall is a particularly important metric. High recall indicates the model's effectiveness in detecting true cases of in-hospital mortality. Since mortality datasets are frequently imbalanced, relying on accuracy alone can be misleading, as a model may achieve high accuracy simply by favoring the majority class.
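Brier score measures the mean squared difference between predicted probabilities and actual outcomes, providing an assessment of the model's calibration: 55
$$\text{Brier score} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2$$
where p_i is the predicted probability and o_i the observed outcome (1 = death, 0 = survival) for patient i, and N is the number of patients. Lower Brier scores indicate better alignment between predicted risks and observed outcomes.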
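The metrics described above can be computed directly from the confusion-matrix counts and the predicted probabilities. The following is a minimal sketch on hypothetical data; the pairwise AUC computation is chosen for clarity and is not intended for large datasets.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with death coded as 1."""
    tp = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 1)
    tn = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 0)
    fp = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),          # sensitivity
        "specificity": tn / (tn + fp),
    }


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def auc_roc(y_true, p_pred):
    """Probability that a random death is ranked above a random survivor."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and predicted probabilities, classified at threshold 0.5
y_true = [1, 1, 0, 0, 0]
p_pred = [0.9, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in p_pred]

metrics = classification_metrics(y_true, y_pred)
print(metrics, brier_score(y_true, p_pred), auc_roc(y_true, p_pred))
```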
Model calibration
Model calibration refers to the agreement between predicted probabilities and observed outcome frequencies. A well-calibrated model generates probability estimates that accurately reflect true event likelihood, which is essential for clinical decision-making and effective risk communication. 56 To enhance calibration, isotonic regression, a non-parametric technique that fits a monotonically increasing function to predicted probabilities, was applied, as it has demonstrated effectiveness in clinical ML settings.57,58 This approach allows for flexible adjustment without assuming a specific functional form, particularly when the relationship between predicted and actual probabilities deviates from logistic scaling, thereby improving interpretability and trustworthiness of risk estimates. 59 Following calibration, model performance was reassessed using the optimal threshold and compared with pre-calibration results. Calibration curves were plotted to visualize improvements in model calibration.
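To make the mechanics of isotonic calibration concrete, the sketch below implements the pool-adjacent-violators algorithm on hypothetical calibration-set scores. It is a simplification: it returns a step function, whereas library implementations typically interpolate between knots, and it is not the code used in the study.

```python
def fit_isotonic(scores, outcomes):
    """Pool adjacent violators; return [(upper_score, calibrated_probability), ...]."""
    blocks = []  # each block: [sum_of_outcomes, count, largest_score_in_block]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while a block mean exceeds the mean of the block after it.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def predict_isotonic(model, score):
    """Look up the calibrated probability of the first block covering the score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]


# Hypothetical calibration-set scores and observed outcomes (1 = died)
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
observed = [0, 0, 1, 0, 0, 1, 1, 1]

calibrator = fit_isotonic(raw_scores, observed)
print(calibrator)
print(predict_isotonic(calibrator, 0.35))
```

In the workflow described in the text, the mapping would be fitted on the 2019 calibration set and then applied unchanged to the 2020–2021 test-set predictions.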
Explainable AI (XAI)
To enhance the transparency, interpretability, and clinical trustworthiness of the best-performing ML model, SHAP was employed. 60 SHAP quantifies each feature's contribution to individual predictions by assigning SHAP values that reflect their impact on shifting the model's output from a baseline. For global interpretability, SHAP summary plots rank features by mean absolute SHAP values, demonstrating overall feature importance and the direction of their effects. For local interpretability, SHAP force or waterfall plots decompose individual predictions to show how patient-specific feature values influence risk estimates relative to the baseline, thus providing case-level explanation.
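For a linear model such as LR, SHAP values on the log-odds scale have a closed form when features are treated as independent: each contribution is the coefficient times the deviation of the feature from its background mean. The sketch below illustrates this additive decomposition without the `shap` library. The intercept, background means, and patient values are hypothetical; only the two coefficients are the values quoted for the LR model in the Results.

```python
def linear_output(intercept, coefs, x):
    """Model output on the log-odds scale: intercept + sum(beta_i * x_i)."""
    return intercept + sum(coefs[name] * x[name] for name in coefs)


def linear_shap(coefs, x, background_means):
    """phi_i = beta_i * (x_i - mean_i), assuming independent features."""
    return {name: coefs[name] * (x[name] - background_means[name]) for name in coefs}


coefs = {"killipclass": 0.56, "oralhypogly": -1.45}   # coefficients reported in the text
intercept = -1.0                                      # hypothetical
background = {"killipclass": 1.5, "oralhypogly": 0.3}  # hypothetical cohort means
patient = {"killipclass": 3, "oralhypogly": 0}        # hypothetical patient

base_value = linear_output(intercept, coefs, background)  # plays the role of E[f(x)]
f_x = linear_output(intercept, coefs, patient)
phi = linear_shap(coefs, patient, background)

# Additivity: the baseline plus all contributions reproduces the model output.
print(phi, base_value, f_x)
```

Sorting features by the mean absolute value of these contributions across patients gives the global ranking, while the per-patient dictionary corresponds to a local force or waterfall plot.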
Comparative analysis
Model performance was further evaluated using the NRI. 61 NRI quantifies the degree to which a new model improves patient risk classification compared to an established standard. In this study, comparisons were made against the TIMI score. This metric is particularly relevant in clinical contexts, where improved risk stratification can influence treatment decisions and resource allocation. An NRI value of zero indicates equivalent risk classification between the ML and TIMI models, while a positive NRI denotes superior reclassification by the ML model. Conversely, a negative NRI suggests poorer performance in distinguishing between risk categories. 61
For the calculation of NRI, we dichotomized patients using a TIMI risk score threshold of ≥4 to define high-risk status. While clinical practice employs various thresholds for risk stratification (TIMI ≥4 or ≥5), we selected TIMI ≥4 for several reasons. First, the NRI methodology requires binary classification to quantify the proportion of patients correctly reclassified into appropriate risk categories. Second, TIMI ≥4 represents a more inclusive definition that captures patients at the upper end of the intermediate-risk category and all patients in the high-risk category, who derive the greatest benefit from aggressive interventions, as demonstrated in the TACTICS-TIMI 18 trial. 62 Third, this threshold optimizes sensitivity for identifying patients who require intensive monitoring, early invasive strategies, or transfer to higher-level care facilities, which is clinically preferable to false-negative misclassification of high-risk patients. The use of TIMI ≥4 has been adopted in clinical pathways at several institutions for triaging acute coronary syndrome patients. 63
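With two risk categories, the NRI reduces to the net proportion of deaths moved up to high risk plus the net proportion of survivors moved down to low risk. The sketch below uses hypothetical classifications; `old_high` stands for TIMI ≥4 and `new_high` for the ML model's thresholded prediction.

```python
def categorical_nri(y_true, old_high, new_high):
    """Two-category NRI of a new classifier relative to an old one."""
    up_event = down_event = up_nonevent = down_nonevent = 0
    events = nonevents = 0
    for y, old, new in zip(y_true, old_high, new_high):
        moved_up = new == 1 and old == 0
        moved_down = new == 0 and old == 1
        if y == 1:
            events += 1
            up_event += moved_up
            down_event += moved_down
        else:
            nonevents += 1
            up_nonevent += moved_up
            down_nonevent += moved_down
    event_part = (up_event - down_event) / events
    nonevent_part = (down_nonevent - up_nonevent) / nonevents
    return event_part + nonevent_part


# Hypothetical data: 4 deaths followed by 6 survivors
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
old_high = [1, 0, 0, 1, 1, 1, 0, 0, 0, 1]   # e.g. TIMI >= 4
new_high = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]   # e.g. ML prediction at its threshold

nri = categorical_nri(y_true, old_high, new_high)
print(round(nri, 4))
```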
Prototype development
The best-performing model from this study will be integrated into a publicly accessible website. This interface will allow users, including clinicians and researchers, to input relevant patient data and obtain real-time risk predictions based on the validated model. The web-based tool is designed to enhance accessibility, promote transparency, and facilitate clinical decision-making by providing an easy-to-use platform for exploring model outputs in practice.
Results
Patient characteristics
Table 4 summarizes the baseline characteristics of the study population. A total of 49,574 patients were included in the final analysis. Of these, 46,740 patients survived to hospital discharge, while 2834 patients (5.7%) experienced in-hospital mortality. The mean age was 26.21 years, and the cohort was predominantly male, accounting for approximately 86% of the population. Most patients were of Malay ethnicity (56.3%), followed by Chinese (17.0%), Indian (15.6%), and other ethnic groups (11.1%).
Patient characteristics.
With respect to smoking status, 49.8% were current smokers, 20.0% were former smokers, and 30.2% had never smoked. The high proportion of current smokers underscores a significant modifiable risk factor within this population, particularly in the context of cardiovascular risk management.
Regarding comorbidities, 36.4% had diabetes mellitus, 62.2% had hypertension, and 10.6% had a prior history of cerebrovascular disease. Approximately 46.7% of patients underwent cardiac catheterization, and 36.9% received percutaneous coronary intervention (PCI).
Feature selection
Following the backward elimination process described in the Methods section, the performance of each model was recorded (Figure 2).

All models’ BFE scores.
The XGB model achieved its highest AUC of 0.8916 when retaining 14 features as shown in Table 5, compared to the best AUC of 0.8905 with 45 features for RF, 0.8909 with 30 features for LR, and 0.8739 with 31 features for SVM. These results indicate that the combination of features selected by the XGB model provided the best balance between model complexity and predictive accuracy. This feature subset was subsequently used for all downstream modeling and evaluation.
Final 14 selected features.
Figure 3 shows the 14 selected features ranked from most to least important. The 14 features selected by the XGB model are age, heart rate, low density lipoprotein cholesterol, fasting blood glucose, gender, Killip class, diabetes, chronic renal disease, cardiac catheterization, PCI, beta blocker usage, ACEI usage, statin usage, and oral hypoglycemic agent usage. All 14 selected predictor variables demonstrated statistically significant associations with in-hospital mortality (p < 0.0001), confirming their relevance for inclusion in the prediction model.

Feature importance.
Model performance
Table 6 presents the performance metrics of the ML models, including NRI, prior to calibration. Several models demonstrated comparable AUC values around 0.89, such as LR (0.8905; 95% CI: 0.8776–0.9029), SVM (0.8898; 95% CI: 0.8772–0.9022), XGB (0.8904; 95% CI: 0.8773–0.9016), GBM (0.8899; 95% CI: 0.8772–0.9016), Stacked All Base Model (0.8886; 95% CI: 0.8758–0.9002), and Stacked Ensemble Base Model (0.8851; 95% CI: 0.8719–0.8966). Notably, LR and GBM achieved the highest NRI, with approximately 21% improvement over the TIMI risk score. The traditional TIMI score demonstrated limited discriminative ability for predicting in-hospital mortality among STEMI patients in the test set, with an AUC of 0.746. In contrast, all ML models evaluated prior to calibration outperformed the TIMI score, with AUC values ranging from approximately 0.87 to 0.89.
Performance metrics prior to calibration.
Logistic regression model coefficients and odds ratios
Table 7 presents the coefficients and odds ratios (OR) from the LR model. Disease severity indicators demonstrated increased risk, with Killip class (OR = 1.76, β = 0.56) and creatinine levels (OR = 1.71, β = 0.54) showing the strongest associations. Patient characteristics including age at notification (OR = 1.34), sex (OR = 1.32), and comorbid diabetes mellitus (OR = 1.30) were associated with increased risk. Elevated fasting blood glucose (OR = 1.26) and heart rate (OR = 1.23) also contributed to higher risk.
Logistic regression model coefficients and odds ratios.
Pharmacological interventions showed protective effects. Oral hypoglycemic agents demonstrated the strongest protective association (OR = 0.23, β = −1.45), followed by statin therapy (OR = 0.57), ACEI (OR = 0.59), and beta-blockers (OR = 0.61). Cardiac interventions including PCI (OR = 0.79) and cardiac catheterization (OR = 0.80) were associated with reduced risk. LDL cholesterol showed a near-neutral effect (OR = 0.92). These findings align with established clinical evidence supporting the beneficial effects of guideline-directed medical therapy and revascularization in post-myocardial infarction patients.
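The odds ratios follow from the LR coefficients through OR = exp(β). The short check below uses only the β and OR pairs quoted in the text; small discrepancies are expected because the coefficients are reported to two decimal places.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with coefficient beta."""
    return math.exp(beta)


# (beta, reported OR) pairs as quoted in the text
reported = {
    "Killip class": (0.56, 1.76),
    "creatinine": (0.54, 1.71),
    "oral hypoglycemic agents": (-1.45, 0.23),
}

for name, (beta, reported_or) in reported.items():
    print(f"{name}: exp({beta}) = {odds_ratio(beta):.2f} (reported {reported_or})")
```

A coefficient below zero gives an odds ratio below 1, which is why the medication and intervention variables appear as protective associations.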
SHAP
The global SHAP summary plot (Figure 4) illustrates the relative contribution of each feature to model predictions across the cohort. The most influential variables were Killip class, beta-blocker use, angiotensin-converting enzyme inhibitor use, oral hypoglycemic agent use, and age. In the plot, higher feature values are indicated in red, while lower values are in blue. SHAP values represent the impact of each feature on the model's output. In this study, positive SHAP values indicate higher predicted probability of in-hospital death, and negative values correspond to lower risk.

Global SHAP summary plot.
The directionality of feature effects is consistent with clinical understanding: higher Killip class, absence of beta-blocker or ACEI therapy, and older age are associated with increased mortality risk. This alignment between model-derived importance and known clinical predictors enhances the model's face validity and interpretability.
Figure 5 displays a local SHAP force plot for a single patient, illustrating how each feature contributes to the predicted probability of in-hospital death (class = 1). The model output for this patient is f(x) = −0.533, which is higher (i.e., less negative) than the cohort mean prediction E[f(x)] = −0.804. This indicates that the patient's estimated risk is somewhat higher than the cohort average, although the model output remains negative.

Local SHAP explanation.
In the plot, bars extending to the right (red in color) increase the predicted risk of death, while those extending to the left (blue in color) reduce it. For this patient, several features shifted the prediction leftward (lower risk), including low Killip class (killipclass), heart rate, fasting blood glucose (fbg), absence of diabetes (cdm), cardiac catheterization (cardiaccath), sex (ptsex), use of statin, and absence of chronic renal diseases (crenal).
Conversely, the absence of beta blocker (bb), angiotensin-converting enzyme inhibitor (acei), and oral hypoglycemic therapy (oralhypogly), together with age (ptageatnotification), percutaneous coronary intervention (pci), and low-density lipoprotein cholesterol (ldlc), increased the estimated risk. Because SHAP contributions sum to f(x) − E[f(x)], which is positive for this patient, the risk-increasing contributions slightly outweighed the protective ones; the model output nevertheless remained negative, consistent with predicted survival.
Model calibration
The models’ performance following calibration is summarized in Table 8. Notably, the Brier score decreased for all models, indicating improved agreement between predicted and observed risks and underscoring the importance of calibration for translating raw model outputs into clinically meaningful probabilities. The most substantial reduction in Brier score was observed in the calibrated LR model, which decreased from 0.1501 to 0.0598, the lowest among all models.
Model performance metrics after calibration.
Additionally, NRI showed marked improvement relative to the TIMI risk score, with calibrated models such as LR, RF, SVM, and XGB demonstrating up to a 58% increase in NRI. The calibrated LR model achieved the best overall performance in this study, with an AUC of 0.8884 (95% CI: 0.8756–0.9011), overall accuracy of 0.8538, recall of 0.7746, specificity of 0.8617, Brier score of 0.0598, and NRI of 0.5828, outperforming all other ML models.
As shown in Figure 6, the calibration curve for the LR model demonstrates the impact of isotonic calibration on model probability estimates. The original (uncalibrated) model exhibited notable deviation from the ideal diagonal, indicating poor agreement between predicted probabilities and actual outcomes. Following isotonic calibration, the curve aligns much more closely with the ideal diagonal, reflecting substantial improvement in the agreement between predicted risks and observed event rates across the full range of probabilities. This result underscores the effectiveness of calibration in generating clinically meaningful and reliable probability estimates, which is essential for risk stratification in practice.

Logistic regression calibration curve.
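The two quantities behind this comparison can be sketched in a few lines. The following is a minimal illustration, not the study's code: it computes the Brier score and the binned points of a calibration curve. The function names, the ten equal-width bins, and the toy data are assumptions made for illustration only.

```python
from typing import List, Tuple


def brier_score(y_true: List[int], y_prob: List[float]) -> float:
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reliability_curve(
    y_true: List[int], y_prob: List[float], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted risk, observed event rate, count) per non-empty bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            obs_rate = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, obs_rate, len(members)))
    return curve


# Toy data (hypothetical): 2 deaths among 10 patients.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
raw = [0.30, 0.35, 0.40, 0.30, 0.35, 0.40, 0.30, 0.35, 0.80, 0.90]
recalibrated = [0.05, 0.05, 0.10, 0.05, 0.05, 0.10, 0.05, 0.05, 0.70, 0.80]

print("Brier (raw):", brier_score(y, raw))
print("Brier (recalibrated):", brier_score(y, recalibrated))
for mean_pred, obs_rate, n in reliability_curve(y, recalibrated):
    print(f"predicted {mean_pred:.2f}  observed {obs_rate:.2f}  n={n}")
```

A well-calibrated model produces bin points close to the diagonal, where mean predicted risk equals the observed event rate, and a correspondingly lower Brier score.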
Prototype
As a proof of concept for real-world application, the best-performing calibrated logistic regression model was deployed in a prototype web-based interface named MyHeart STEMI Risk Calculator, publicly accessible at https://myheart-rho.vercel.app/calculators/myheart_stemi. This website is a research prototype, designed to demonstrate feasibility and gather usability feedback from the research community. It is not intended for clinical use, as the next steps in our roadmap are the rigorous external and prospective validation required for clinical deployment.
The web application accepts the 14 validated predictors as input: age, heart rate, low-density lipoprotein cholesterol, fasting blood glucose, sex, Killip class, diabetes, chronic renal disease, cardiac catheterization status, PCI status, and use of beta blockers, ACE inhibitors, statins, and oral hypoglycemic agents. Upon data entry, the system generates a real-time probability estimate of in-hospital mortality along with SHAP-based explanations showing which factors contribute most to the individual patient's risk.
The interface features an intuitive design suitable for both clinicians and researchers, displaying the predicted mortality risk as a percentage along with visual risk stratification (low, moderate and high). The SHAP values accompany each prediction, enhancing transparency by showing how each clinical variable influenced the risk estimate. This prototype demonstrates the feasibility of translating our validated ML model into a practical clinical decision-support tool and provides a platform for prospective validation and user feedback collection. Figure 7 shows a screenshot of the prototype interface with example patient data.

MyHeart STEMI prototype.
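For a logistic regression model, per-patient attributions of the kind displayed in the interface can be illustrated compactly. The sketch below assumes independent features, in which case the SHAP value of a feature in a linear model is its coefficient multiplied by the feature's deviation from the background mean, expressed on the log-odds scale. The coefficients, background means, and patient values are hypothetical and are not the fitted model. The isotonic calibration step is ignored here.

```python
import math
from typing import Dict


def linear_shap(
    coefs: Dict[str, float],
    background_means: Dict[str, float],
    patient: Dict[str, float],
) -> Dict[str, float]:
    """Log-odds attribution per feature: coefficient * (value - background mean)."""
    return {n: coefs[n] * (patient[n] - background_means[n]) for n in coefs}


def base_value(
    intercept: float, coefs: Dict[str, float], background_means: Dict[str, float]
) -> float:
    """Log-odds predicted for an 'average' patient (the SHAP base value)."""
    return intercept + sum(coefs[n] * background_means[n] for n in coefs)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model and patient, for illustration only.
INTERCEPT = -6.0
COEFS = {"age": 0.04, "heart_rate": 0.015, "killip_class": 0.9, "beta_blocker": -0.8}
MEANS = {"age": 58.0, "heart_rate": 82.0, "killip_class": 1.3, "beta_blocker": 0.6}
patient = {"age": 72.0, "heart_rate": 110.0, "killip_class": 3.0, "beta_blocker": 0.0}

phi = linear_shap(COEFS, MEANS, patient)
log_odds = base_value(INTERCEPT, COEFS, MEANS) + sum(phi.values())
risk = sigmoid(log_odds)

# Rank features by absolute contribution, as a force plot would.
for name, value in sorted(phi.items(), key=lambda kv: abs(kv[1]), reverse=True):
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: {value:+.3f} log-odds ({direction} risk)")
print(f"uncalibrated predicted risk: {risk:.3f}")
```

Because the attributions are additive, the base value plus the sum of the contributions reproduces the model's log-odds exactly, which is what makes the explanation faithful to the prediction.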
Discussion
Robust feature selection and transparent model explainability are critical to ensuring clinical interpretability and building physician confidence in AI-driven prediction systems. Our final model identified 14 predictors of in-hospital mortality in STEMI: demographics and comorbidities (age, sex, diabetes, and chronic kidney disease), vital signs and laboratory values (heart rate, fasting blood glucose, and LDLC), clinical presentation (Killip class), in-hospital management (cardiac catheterization and PCI), and medications at admission or during hospitalization (beta blockers, ACEI, statins, and oral hypoglycemic agents).
All variables were incorporated into the ML models and were significantly associated with mortality on univariate analysis (each p < 0.001), confirming clinical relevance. Many overlap with established scores: age, heart rate, Killip class, and diabetes appear in TIMI, while age and Killip class also feature in GRACE. 6 A 2023 systematic review of 50 ML models for acute coronary syndromes reported that age, sex, heart rate, Killip class, renal function, blood glucose, and hemoglobin were the most frequently selected predictors, most of which are included in our feature set. 64 This concordance with existing literature supports the face validity of our model.
In this study, ML models including individual algorithms and a stacked ensemble demonstrated substantially better performance than the conventional TIMI risk score for predicting in-hospital mortality in a Southeast Asian STEMI cohort. The calibrated LR model emerged as the best-balanced predictor across all performance metrics, achieving an AUC of 0.8884 (95% CI: 0.8756–0.9011), overall accuracy of 0.8538, recall (sensitivity) of 0.7746, and specificity of 0.8617. After probability calibration, this model attained a Brier score of 0.0598, indicating excellent calibration, and it yielded the highest net reclassification improvement (NRI = 0.5828) compared to TIMI. These results underscore the promise of ML-based approaches in advancing risk stratification for diverse patient populations.
Notably, the relatively simple calibrated logistic regression model slightly outperformed more complex models, including our stacked ensemble models, suggesting that added model complexity did not necessarily translate to better discrimination in this dataset. This finding reinforces the importance of careful model selection and calibration in achieving optimal performance.
Our results also represent a methodological advancement over prior ML models developed on this registry.20,32 While earlier work demonstrated high discrimination (AUC), this study builds on those findings by: (a) using a more contemporary patient cohort (up to 2021) that reflects modern treatment patterns; (b) testing model robustness with a strict temporal validation rather than a conventional random split; and (c) systematically addressing probability calibration as a primary endpoint. By demonstrating that an uncalibrated LR model (Brier score: 0.1501) can be significantly improved (Brier score: 0.0598), we answer the call from recent JACC perspectives to move beyond simple performance metrics and focus on developing tools that are clinically reliable and trustworthy. 65
Our findings align with a growing body of literature showing that ML algorithms can capture complex, non-linear relationships between clinical features and outcomes that traditional risk scores often miss. Conventional risk scores such as TIMI were developed over two decades ago using predominantly Western cohorts and a limited set of variables. As a result, their performance tends to be suboptimal in contemporary Asian populations. They often underperform in high-risk patients and fail to reflect the impact of modern interventions and treatment strategies. 66 The performance of our ML models, developed using a large Asian registry, highlights the need for updated risk stratification tools that reflect current populations and treatment practices.
When compared to the TIMI score, all ML models in this study demonstrated improvements in risk reclassification. This was quantified using the NRI, which evaluates how effectively a new model reassigns patients into more appropriate risk categories relative to an existing model. In our analysis, the ML models yielded substantial NRI gains over TIMI, indicating that a meaningful proportion of patients were reclassified into more accurate risk strata. These findings are consistent with previous studies on ACS that have shown improved reclassification performance with ML-based models compared to traditional risk scores.67–69 Collectively, these comparisons reinforce that data-driven ML models can enhance both the discrimination and clinical usefulness of risk predictions relative to conventional risk scores.
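To make the reclassification logic concrete, the sketch below computes a categorical NRI: the net proportion of events moved to a higher risk category, plus the net proportion of non-events moved to a lower one. It is an illustration under assumptions. The text does not state which NRI variant or risk cut-offs were used, so the single 0.5 cut-off, the function names, and the toy data are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def risk_category(p: float, cutoffs: Sequence[float]) -> int:
    """Index of the risk category for probability p, given ascending cut-offs."""
    return bisect_right(cutoffs, p)


def categorical_nri(
    y_true: List[int],
    p_old: List[float],
    p_new: List[float],
    cutoffs: Sequence[float],
) -> float:
    """NRI = (up - down among events)/n_events + (down - up among non-events)/n_nonevents."""
    up_e = down_e = up_ne = down_ne = 0
    n_events = sum(y_true)
    n_nonevents = len(y_true) - n_events
    if n_events == 0 or n_nonevents == 0:
        raise ValueError("NRI needs at least one event and one non-event")
    for y, po, pn in zip(y_true, p_old, p_new):
        move = risk_category(pn, cutoffs) - risk_category(po, cutoffs)
        if y == 1:
            if move > 0:
                up_e += 1
            elif move < 0:
                down_e += 1
        else:
            if move > 0:
                up_ne += 1
            elif move < 0:
                down_ne += 1
    return (up_e - down_e) / n_events + (down_ne - up_ne) / n_nonevents


# Toy example (hypothetical): old score vs. new model, one cut-off at 0.5.
y = [1, 1, 0, 0]
old_risk = [0.1, 0.6, 0.6, 0.1]
new_risk = [0.6, 0.6, 0.1, 0.1]
print(categorical_nri(y, old_risk, new_risk, cutoffs=[0.5]))
```

The value ranges from -2 to 2. Positive values indicate that the new model moves patients who died upward and survivors downward more often than the reverse.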
In this study, ensemble learning methods, specifically RF, XGB, and stacking, yielded only modest improvements in AUC compared to the calibrated LR model, which demonstrated the best overall performance across all metrics. This observation is consistent with recent reviews showing that complex ML algorithms do not consistently outperform well-specified logistic regression models in clinical prediction contexts. 70 Our findings contribute to this area by evaluating ensemble methods and underscore that model interpretability, appropriate feature selection, and probability calibration remain essential for clinical implementation.
An often-underreported component of predictive modeling is calibration, which assesses how closely a model's predicted probabilities align with actual event rates. Calibration is critical for real-world clinical application, as well-calibrated risk estimates support reliable decision-making. Van Calster et al. reported that fewer than 10% of ML-based clinical prediction studies include calibration metrics such as Brier scores or calibration plots, 30 underscoring a gap in reporting practices. In this study, calibration was addressed by applying isotonic regression to each model's predicted probabilities. This step improved probabilistic accuracy and consistency. For the calibrated LR model, which performed best overall, the Brier score was 0.0598, indicating strong alignment between predicted and observed mortality rates. Accurate calibration reduces the risk of both overtreatment and undertreatment, which is particularly relevant in settings with limited resources. These findings highlight the importance of incorporating calibration into future cardiovascular risk models developed for Asian populations.
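The isotonic step can be illustrated with the pool-adjacent-violators algorithm, which fits a non-decreasing mapping from raw model scores to observed event rates. The following is a simplified sketch, not the study's implementation. It returns a step function, whereas library implementations typically interpolate between fitted points. It assumes the mapping is fitted on a separate calibration set, here the 2019 data, and then applied unchanged to test-set scores. Function names and toy values are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

IsotonicModel = Tuple[List[float], List[float]]


def fit_isotonic(scores: List[float], outcomes: List[int]) -> IsotonicModel:
    """Fit a non-decreasing step function from scores to event rates (PAVA)."""
    # Aggregate tied scores so each unique score contributes one block.
    agg: Dict[float, Tuple[float, int]] = {}
    for s, y in zip(scores, outcomes):
        total, count = agg.get(s, (0.0, 0))
        agg[s] = (total + y, count + 1)

    # Each block holds [sum of outcomes, count, largest score in the block].
    blocks: List[list] = []
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Merge while the previous block's mean exceeds the current block's mean.
        while (
            len(blocks) > 1
            and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            t, c, upper = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = upper

    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def apply_isotonic(model: IsotonicModel, score: float) -> float:
    """Map a raw score to its calibrated probability (clamped at the extremes)."""
    uppers, values = model
    idx = min(bisect_left(uppers, score), len(values) - 1)
    return values[idx]


# Toy calibration set (hypothetical raw scores and outcomes).
cal_scores = [0.1, 0.2, 0.3, 0.4]
cal_outcomes = [0, 1, 0, 1]
model = fit_isotonic(cal_scores, cal_outcomes)
print(model)
print([apply_isotonic(model, s) for s in (0.05, 0.25, 0.9)])
```

Because the fitted mapping is monotone, it preserves the ordering of patients by risk up to ties. Discrimination therefore changes little, while the predicted probabilities move toward the observed event rates.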
Another key methodological consideration in this study was the use of cost-sensitive learning to address the substantial class imbalance between survivors and non-survivors. In-hospital mortality occurred in approximately 5.7% of cases, creating a ratio of about 1:16.5, which could bias standard ML models toward the majority class. Unlike oversampling or undersampling techniques that risk data duplication or loss, cost-sensitive learning embeds class weights directly into the loss function, enabling the algorithm to prioritize correct classification of the minority class without altering the data distribution. This approach has been shown to improve model sensitivity to rare but clinically important events and is particularly well suited to healthcare applications where misclassifying a high-risk patient may have serious consequences. By incorporating this strategy during model training, we aimed to enhance the model's ability to detect high-risk patients while preserving computational efficiency and generalizability.
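As an illustration of how class weights enter the loss, the sketch below derives inverse-frequency ("balanced") weights for a cohort with 5.7% mortality and uses them in a weighted log loss. The weighting scheme actually used in the study is not stated, so the balanced heuristic, the function names, and the example numbers are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[int]) -> Dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_log_loss(
    y_true: List[int], y_prob: List[float], weights: Dict[int, float]
) -> float:
    """Log loss in which each patient's error is scaled by the weight of their class."""
    eps = 1e-15
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical cohort of 1,000 patients with 5.7% in-hospital mortality.
labels = [1] * 57 + [0] * 943
weights = balanced_class_weights(labels)
print(weights)  # deaths receive a much larger weight than survivors
print("weight ratio:", weights[1] / weights[0])

# A missed death is penalised more heavily than a false alarm of equal size.
print(weighted_log_loss([1, 0], [0.1, 0.1], weights))  # death predicted at 10%
print(weighted_log_loss([1, 0], [0.9, 0.9], weights))  # survivor predicted at 90%
```

With these weights, the ratio of the death weight to the survivor weight equals the class imbalance ratio of about 16.5, so both classes contribute equally to the total loss without duplicating or discarding any records.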
In addition, by training and testing on data spanning over a decade (with patients from 2006 through 2021), the model implicitly learned from temporal trends in patient characteristics and treatments. We further performed a temporal validation, using the most recent years of data as a hold-out test set to mimic prospective performance. This form of validation is stronger than random-split internal validation, as it assesses the model's robustness to changes over time (such as evolving clinical practices or population demographics). Our model's stable performance on the 2020–2021 temporal test set indicates resilience to such shifts, which is encouraging for its reliability in future use. This approach aligns with emerging best-practice recommendations that emphasize temporal and geographic validation to ensure generalizability.
Clinically, the improved performance of our calibrated ML model over the TIMI score holds significant implications. In practice, more high-risk patients could be correctly identified early (and potentially receive aggressive therapy or closer monitoring), while low-risk patients could be spared unnecessary interventions, ultimately leading to more efficient and effective care. The ease of interpretability through SHAP and the inclusion of readily available clinical variables also facilitate integration into routine workflows. In sum, the combination of a diverse development cohort, rigorous validation, and model explainability provides confidence that our ML model could be safely and effectively deployed to aid risk stratification in STEMI patients across similar healthcare settings.
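The temporal split used for validation (training on 2006–2018, calibration on 2019, testing on 2020–2021) can be expressed as a simple partition by admission year. The sketch below is illustrative only. The record layout and the `year` field name are hypothetical and do not reflect the registry's schema.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def temporal_split(
    records: List[Record], year_key: str = "year"
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Partition records by year so that no later data reach the training set."""
    train = [r for r in records if 2006 <= r[year_key] <= 2018]
    calibration = [r for r in records if r[year_key] == 2019]
    test = [r for r in records if 2020 <= r[year_key] <= 2021]
    return train, calibration, test


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "year": 2006, "died": 0},
    {"id": 2, "year": 2018, "died": 1},
    {"id": 3, "year": 2019, "died": 0},
    {"id": 4, "year": 2020, "died": 0},
    {"id": 5, "year": 2021, "died": 1},
]
train, calibration, test = temporal_split(records)
print(len(train), len(calibration), len(test))
```

Unlike a random split, this partition guarantees that every test patient was admitted after every training and calibration patient, which is what allows the test set to stand in for prospective use.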
Globally, SHAP values confirmed that Killip class, beta blocker use, fasting blood glucose, and age exerted the greatest influence on predicted risk, consistent with their known impact on STEMI outcomes.71–74 Locally, SHAP explanations clarified how individual features increased or decreased each patient's predicted risk, enabling clinicians to explore counterfactual scenarios (e.g., lowering heart rate or initiating specific therapies) and identify modifiable factors. Such explainability is essential for clinician trust and for the integration of AI tools into care, and SHAP's additive attributions align well with human reasoning. 26 By delivering both cohort-level and patient-specific explanations, our approach provides transparent, clinically interpretable predictions, an indispensable requirement for AI deployment in high-stakes medical decision-making. 75
Despite the promising results, this study has limitations. The Malaysian NCVD registry is comprehensive and nationally representative, but healthcare delivery and population characteristics can vary in different environments. Therefore, the generalizability of our model to settings outside Malaysia, for example other Asian nations or lower-tier hospitals, remains to be proven, and validation on external datasets is required to confirm its broad applicability. Furthermore, while the GRACE risk score is a well-validated tool for ACS risk stratification, we chose the TIMI score as our primary comparator. This decision was based on several factors, including its availability within our dataset. This availability is a key methodological advantage, as it ensures our models are benchmarked against the actual score used in routine clinical practice, thus avoiding the potential biases of retrospective calculation and providing a more faithful comparison against the real-world standard of care. The TIMI score is a simpler bedside tool that has been locally validated in Malaysia's multi-ethnic STEMI patients, making it more routinely adopted in our hospitals. 76
By contrast, the GRACE score, though often more predictive in Western cohorts, has notable limitations in Asian settings. It was derived largely from Western populations with minimal Asian representation, and applying it without adjustment can be problematic. 77 For instance, a Singaporean study found that the original GRACE model significantly underestimated in-hospital mortality for Chinese, Malay, and Indian patients, necessitating recalibration to improve its accuracy. 78 GRACE also requires laboratory variables (e.g., serum creatinine), which can delay risk assessment at presentation. 77 These factors have limited GRACE's uptake in our setting, where TIMI's ease of use and proven utility make it the preferred score. Nevertheless, emerging evidence indicates that modern ML models can outperform both TIMI and GRACE. Recent meta-analyses and regional studies consistently report that ML algorithms achieve higher discrimination (AUC/C-statistic) for ACS mortality prediction compared to traditional scores. 78 Similarly, large Asian studies show that AI-based models markedly improve classification of high-risk patients over the TIMI score. Collectively, these findings underscore why data-driven ML approaches are poised to surpass conventional risk scores like TIMI and GRACE in diverse patient populations.
A key strength of this study is the use of a large, nationally representative registry of STEMI patients in Malaysia, encompassing a truly multi-ethnic Asian population. Our cohort includes patient demographics and risk factor profiles that are reflective of the broader Southeast Asian region (e.g., Malay, Chinese, Indian, and other ethnic groups common to Malaysia, Singapore, Indonesia, and beyond). This diversity in the development dataset enhances the external validity of our findings and suggests that the resulting model may generalize well to other Asian populations. To support real-world applicability, we have deployed a prototype web-based tool, the MyHeart STEMI Risk Calculator, which is accessible at https://myheart-rho.vercel.app/calculators/myheart_stemi. This interface enables risk estimation based on the final model and will be part of an ongoing effort to collect user feedback and perform continuous external validation in diverse clinical settings.
Future work
This work opens several avenues for future research and development. Rigorous external validation of the model is a priority. We plan to test the model in external cohorts, such as registry data from neighboring countries or distinct healthcare systems, to evaluate its calibration and discrimination in new populations. This will help identify any overfitting to the derivation cohort and ensure that the model's performance holds when confronted with different case-mix and practice patterns. Equally important is the need for comprehensive fairness and equity analyses. Future work will include stratified performance evaluations across key subgroups including hospital type (urban tertiary vs. suburban vs. rural), geographic regions, sex/gender, and ethnic groups to identify any disparities in model accuracy. We will also conduct formal fairness audits to ensure the model performs equitably across these diverse populations and assess whether systematic differences in data quality across sites affect predictions. These equity analyses are essential to ensure that the model provides reliable risk stratification for all patient populations before clinical deployment.
Furthermore, while our current model focuses on the critical endpoint of in-hospital mortality to guide acute care decisions, a key priority for future work is to expand this model to predict longer-term outcomes. This will involve moving beyond binary classification to include survival analysis, allowing for the prediction of 30-day, 180-day or 1-year mortality and the generation of survival curves that can estimate patient prognosis over several years. Such models are essential for informing post-discharge planning and secondary prevention strategies.
Another important direction is the integration of the model into electronic health record (EHR) systems for prospective evaluation. Embedding the risk calculator into hospital EHRs would allow automated, real-time risk scoring for incoming STEMI patients, enabling clinicians to act on the predictions within the clinical workflow. Such integration should be accompanied by user interface design that leverages the model's explainability (e.g., showing which factors contributed most to a patient's risk) to support clinician understanding and trust.
Limitations
A further methodological limitation is the lack of a formal sensitivity analysis for our data imputation. We employed median and mode imputation due to its simplicity and computational efficiency. However, we did not test whether more complex methods, such as multivariate imputation by chained equations (MICE) or k-nearest neighbors (k-NN) imputation, would have significantly altered the model's performance. This represents an avenue for future methodological validation.
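For reference, median and mode imputation amounts to the following. This is a minimal sketch rather than the study's pipeline. It assumes that fill values are estimated on the training data only and then reused for later data, which the text does not state explicitly. The column names and records are hypothetical.

```python
from statistics import median, mode
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def fit_imputer(
    rows: List[Row], numeric_cols: List[str], categorical_cols: List[str]
) -> Dict[str, object]:
    """Learn one fill value per column: median for numeric, mode for categorical."""
    fill: Dict[str, object] = {}
    for col in numeric_cols:
        fill[col] = median([r[col] for r in rows if r[col] is not None])
    for col in categorical_cols:
        fill[col] = mode([r[col] for r in rows if r[col] is not None])
    return fill


def apply_imputer(rows: List[Row], fill: Dict[str, object]) -> List[Row]:
    """Return copies of the rows with missing values replaced by the learned fills."""
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


# Hypothetical training rows with missing entries (None).
train_rows = [
    {"heart_rate": 70, "sex": "M"},
    {"heart_rate": None, "sex": "M"},
    {"heart_rate": 90, "sex": None},
    {"heart_rate": 80, "sex": "F"},
]
fill_values = fit_imputer(train_rows, ["heart_rate"], ["sex"])
imputed = apply_imputer(train_rows, fill_values)
print(fill_values)
print(imputed)
```

A sensitivity analysis of the kind described would refit the model after replacing this step with a multivariate method and compare discrimination and calibration on the same test set.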
Furthermore, while we noted that GRACE's reliance on laboratory values can delay immediate risk assessment, our model includes FBG and LDL-cholesterol with the understanding that it is intended for in-hospital risk stratification rather than hyperacute triage. These laboratory values are routinely obtained within the first 24 hours as part of standard STEMI care in our setting. For settings where immediate bedside risk assessment is needed, a simplified version of the model using only clinical variables (age, heart rate, Killip class, and vital signs) could be developed and validated in future work. This would create a tiered approach: a rapid triage score at presentation and an enhanced prediction model once laboratory results are available.
Finally, we advocate for prospective clinical trials or implementation studies to assess the impact of using ML-driven risk scores on patient outcomes. For example, a stepped-wedge trial could determine whether management guided by our ML model (versus usual care or TIMI score guidance) leads to differences in treatment decisions, resource allocation, or clinical outcomes. These future studies will be crucial to move from retrospective model development to real-world clinical adoption. By addressing these next steps on expanded outcomes, external validation, EHR integration, and prospective evaluation, we aim to advance the translation of our calibrated ML risk model into a reliable decision-support tool for improving STEMI care.
Conclusion
This study introduces a clinically applicable and well-calibrated ML model for predicting in-hospital mortality among STEMI patients in an Asian population, leveraging a large, nationally representative registry from Malaysia. The model outperformed conventional risk stratification tools, including the TIMI score, in terms of discrimination, calibration, and interpretability. A temporal split of the dataset was employed to simulate prospective deployment and assess model generalizability over time. By integrating contemporary ML techniques with explainability frameworks, the model provides a transparent and actionable decision-support tool tailored to regional clinical profiles. These findings underscore the importance of periodic model updates and local validation to enhance the clinical utility of risk prediction in acute cardiovascular care.
Footnotes
Acknowledgments
This work was supported by the Higher Institution Centre of Excellence (HICoE) research grant 600-RMC/MOHE HICoE CARE-I 5/3 (01/2025) awarded to the Cardiovascular Advancement and Research Excellence Institute (CARE Institute), Universiti Teknologi MARA.
Ethical approval
The Medical Review & Ethics Committee (MREC) of the Ministry of Health (MOH) of Malaysia approved the NCVD registry study (NMRR-07-38-164), and further authorization was granted by the UiTM ethics committee (600-TNCPI (5/1/6)).
Contributorship
Sazzli Kasim was involved in conceptualization, methodology, resources, funding acquisition, writing–review and editing, supervision, and project administration. Lim Bing Feng contributed to conceptualization, methodology, software, data curation, writing–original draft preparation, and writing–review and editing. Sorayya Malek was responsible for conceptualization, methodology, software, data curation, resources, funding acquisition, writing–original draft preparation, writing–review and editing, supervision, and project administration. Putri Nur Fatin Amir Rudin contributed to conceptualization, methodology, software, and data curation. Khairul Shafiq Ibrahim and Wan Azman Wan Ahmad both contributed to writing–review and editing, supervision, and project administration. Alan Yean Yip Fong contributed to software and data curation. All authors have read and approved the final version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Higher Institution Centre of Excellence (HICoE) (grant number: 600-RMC/MOHE HICoE CARE-I 5/3 (01/2025)).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data that support the findings of this research are held by the National Heart Association of Malaysia (NHAM), but their availability is restricted and they are therefore not publicly available. The data belong to the individual Ministry of Health, university, and private hospitals, and multiple institutional agreements are required for release to third parties; ethical approval is therefore required for analysis. Data are, however, available from NHAM upon request using
or email them at secretariat@malaysianheart.org. Any findings from the data need to be reported and permission needs to be obtained from the NHAM committee before publication.
