Advancing ST-elevated myocardial infarction mortality risk prediction in Asian populations through explainable and calibrated machine learning

Abstract

Background

Traditional risk scores for ST-segment elevation myocardial infarction (STEMI), such as the Thrombolysis in Myocardial Infarction (TIMI) score, were developed predominantly in Western populations and may exhibit suboptimal performance in Asian patients due to differing clinical and genetic profiles. Accurate in-hospital mortality prediction is essential for optimizing clinical management in this high-risk group. However, key aspects such as model explainability (e.g., SHapley Additive exPlanations (SHAP) analysis) and probability calibration are often neglected, limiting the clinical utility and trustworthiness of predictive models.

Objectives

This study aimed to develop and validate explainable, well-calibrated machine learning models for predicting in-hospital mortality among Asian STEMI patients, benchmarking their performance against the TIMI risk score. Interpretability was enhanced using SHAP for both global and local explanations, and model calibration was systematically addressed.

Methods

We conducted a retrospective cohort study using data from 49,574 Asian STEMI patients in the Malaysian National Cardiovascular Disease registry (2006–2021). A temporal split was applied to simulate prospective deployment: data from 2006–2018 were used for model training, 2019 data for calibration, and 2020–2021 data as an independent test set. Multiple machine learning (ML) algorithms, including logistic regression (LR), support vector machine, random forest, gradient boosting machine, and XGBoost, were developed and compared. Stacked ensemble models were constructed using combinations of these base learners. Performance was evaluated on the independent test set using area under the receiver operating characteristic curve (AUC-ROC), accuracy, recall, specificity, Brier score (for calibration), and Net Reclassification Index (NRI), with benchmarking against the TIMI score. Isotonic regression was applied for probability calibration, and SHAP was used for interpretability.

Results

The calibrated LR model achieved the best overall performance (AUC = 0.8884, 95% CI (confidence interval): 0.8756–0.9011; accuracy = 0.8538; recall = 0.7746; specificity = 0.8617; Brier score = 0.0598; NRI = 0.5828 vs. TIMI). SHAP analysis confirmed that model predictions were aligned with established clinical reasoning, enhancing interpretability. Probability calibration further improved model reliability, as evidenced by a reduced Brier score.

Conclusions

Calibrated LR, supported by SHAP-based explainability and robust probability calibration, significantly outperformed the TIMI score for in-hospital mortality prediction in Asian STEMI patients. This approach improves predictive accuracy, reliability, and interpretability, supporting more personalized and clinically trustworthy risk stratification. These findings highlight strong potential for real-world clinical integration and improved patient outcomes in diverse Asian populations.

Keywords

ST-elevation myocardial infarction, machine learning, in-hospital mortality, probability calibration, explainable AI

Introduction

Cardiovascular disease (CVD) is the leading cause of death worldwide, with Asia accounting for 58% (10.8 million) of global CVD deaths in 2019. In Asia, 35% of all deaths are due to CVD, and nearly 39% are premature, substantially higher than in the US (23%) or Europe (22%).¹ Among CVD subtypes, STEMI shows significant regional disparities across Asia, with incidence rates ranging from 33 to 138 per 100,000 person-years.²

Given the substantial burden of CVD in Asia, accurate prediction of in-hospital mortality and early detection of STEMI are essential for improving outcomes.^3,4 TIMI and Global Registry of Acute Coronary Events (GRACE) scores are widely used for risk stratification in acute coronary syndromes (ACS).^5,6 Among these, the TIMI risk score is particularly prevalent in Asian hospitals, owing to its reliance on easily obtainable clinical parameters.^5,6 However, these models were developed in predominantly Western and Caucasian populations and often perform suboptimally in Asian cohorts due to differences in genetic backgrounds, risk factor profiles, and healthcare systems.^7,8 For instance, studies in Singapore have shown that GRACE underestimates in-hospital mortality after acute myocardial infarction (AMI) in multiracial Asian populations by approximately 4-fold (predicted: 1.6–2.4% vs. observed: 6.4–9.8%).⁹ This highlights the need for recalibrated or novel approaches to risk prediction tailored to Asia, as reliance on Western models may lead to inaccurate risk estimation and suboptimal care.^10,11

ML algorithms offer the potential to overcome limitations of traditional regression-based risk scores by capturing complex, non-linear interactions among numerous clinical variables.^12–15 ML models including LR, random forest (RF), support vector machine (SVM), and XGBoost (XGB) have shown improved predictive performance for ACS risk stratification compared to conventional scoring systems.^15–18 Ensemble approaches, particularly stacking, further enhance prediction by integrating multiple ML algorithms to generate robust and accurate outcomes.¹⁹ While ML, including stacking, has been applied to ACS risk prediction in some Asian settings,²⁰ rigorous application of stacked ensembles for STEMI mortality prediction in large, multi-ethnic Asian populations remains limited.²¹

To enhance generalizability and address methodological challenges including temporal concept drift and class imbalance, we employed temporal validation and cost-sensitive learning strategies throughout model development.^22,23 Moreover, despite the superior predictive accuracy of advanced ML models, particularly stacked ensembles, their “black box” nature often hinders clinical adoption.²⁴ Enhancing the transparency and interpretability of these models is crucial for clinician trust and integration into routine practice.²⁵ To address this, explainable AI (XAI) methods such as SHAP have emerged, providing both global and local interpretability by quantifying each feature's contribution to model predictions.²⁶ Integrating SHAP into ML workflows can enhance clinician trust, support personalized interventions, and facilitate adoption, especially in diverse Asian healthcare settings where interpretability is essential for clinical acceptance.²⁷

Model calibration also ensures that predicted probabilities align closely with actual event rates, which is critical for threshold-based decision-making and clinical integration.²⁸ Although ML models such as RF, XGB, and stacked ensembles often demonstrate high discrimination, their probability estimates can be poorly calibrated by default.²⁹ Despite its importance, calibration remains underreported in ACS risk prediction. Fewer than 10% of studies include calibration plots or Brier scores.³⁰ Calibration methods like Platt scaling and isotonic regression are therefore vital to improve risk communication and clinical applicability.³¹

This study addresses key gaps in STEMI risk prediction for Asian populations by developing and validating explainable and calibrated ML models tailored to regional needs. By benchmarking these models, including stacked ensembles, against conventional risk scores, integrating SHAP for interpretability, and rigorously evaluating calibration, we aim to provide more accurate, transparent, and clinically applicable tools for in-hospital mortality prediction in diverse Asian cohorts. To enhance generalizability and simulate real-world deployment, we employed a temporal split of the dataset, and incorporated cost-sensitive learning to address class imbalance during model training.

While machine learning has previously been applied to the National Cardiovascular Disease (NCVD) registry,^20,32 this study makes three distinct contributions. First, our model is developed on contemporary data, reflecting modern patient profiles and treatment strategies, which is critical as risk models can become outdated. Second, we address the limitations of standard random-split validation by implementing a rigorous temporal validation strategy that simulates prospective deployment and assesses robustness to concept drift. Finally, responding to the need for models to demonstrate clear clinical relevance beyond algorithmic performance, our primary objective is to develop a calibrated and explainable model. We systematically apply and evaluate probability calibration using a dedicated dataset to ensure that the model's risk predictions are reliable and trustworthy for real-world clinical decision-making.

Methods

We followed the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) checklist to ensure the transparency and reproducibility of our predictive modeling process.

The overall workflow of this study is depicted in Figure 1, which outlines the ML development process. We followed a structured ML pipeline to develop, evaluate, and interpret calibrated models for predicting in-hospital mortality among STEMI patients. The methodology encompassed data preprocessing, feature selection, model development, evaluation, calibration, comparative analysis, and application of SHAP for model interpretability.

Figure 1.

Flow diagram.

Study design and data source

This retrospective cohort study utilized data from the NCVD registry, a nationwide clinical database that systematically captures standardized information on patient demographics, clinical presentation, management, and outcomes for individuals admitted with ACS at participating hospitals across Malaysia. Although ACS encompasses a range of clinical presentations, this study specifically focused on patients diagnosed with STEMI.

The Medical Review & Ethics Committee (MREC) of the Ministry of Health (MOH) of Malaysia approved the NCVD registry study (Approval Code: NMRR-07–38–164). The MREC waived patient informed consent for NCVD. This study has also been authorized by the UiTM ethics committee (reference number: 600-TNCPI (5/1/6)) and the National Heart Association of Malaysia (NHAM). The data were anonymized before use: the analysis relied only on variable values and features, with no access to patients' personal information.

Data preprocessing

Prior to model development, the dataset underwent several preprocessing steps (data splitting, imputation, and normalization) to ensure data quality and suitability for ML algorithms. These procedures form the foundation for building reliable, robust predictive models.

Baseline characteristics were compared using tests appropriate to data type and distribution: independent t-tests for continuous variables and chi-square tests for categorical variables. A two-sided p-value <0.001 was considered statistically significant for all baseline characteristic comparisons.³³

Data splitting

To evaluate generalizability and simulate prospective deployment, a temporal split was applied.³⁴ Records from 2006–2018 were used for model training and development, 2019 data served as a calibration set, and 2020–2021 data constituted an independent temporal test set. Temporal validation is preferable to random splitting for clinical prediction because it tests performance against potential “concept drift” arising from shifts in patient mix, treatment patterns, or data-capture practices over time.²²

The distribution of patients across these datasets is shown in Table 1: the training set (2006–2018) comprised 37,519 patients, the calibration set (2019) 4429 patients, and the independent temporal test set (2020–2021) 7626 patients.

Table 1.

Total number of patients for each dataset.

Dataset	Year	Number of patients
Training set	2006–2018	37,519
Calibration set	2019	4429
Testing set	2020–2021	7626
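
The year-based assignment described above can be sketched in plain Python. This is a minimal illustration, not the registry pipeline: the function name `assign_split` and the toy admission years are assumptions, while the year boundaries and patient counts are those reported in Table 1.

```python
from collections import Counter

def assign_split(year):
    """Map an admission year to its dataset under the temporal split."""
    if 2006 <= year <= 2018:
        return "train"
    if year == 2019:
        return "calibration"
    if 2020 <= year <= 2021:
        return "test"
    raise ValueError(f"year {year} lies outside the 2006-2021 study window")

# Toy admission years (hypothetical): only the year drives the split.
toy_years = [2006, 2012, 2018, 2019, 2020, 2021]
split_counts = Counter(assign_split(y) for y in toy_years)

# Patient counts reported in Table 1; they should add up to the full cohort.
table_1 = {"train": 37_519, "calibration": 4_429, "test": 7_626}
cohort_size = sum(table_1.values())
print(split_counts, cohort_size)
```

Because the split depends only on the admission year, no patient from a later period can leak into training, which is what allows the 2020–2021 set to stand in for prospective deployment.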

Data imputation

Missing data were assessed and found to be predominantly missing completely at random (MCAR).³⁵ Continuous variables were imputed with the median, which is less sensitive to outliers in clinical measurements. Categorical variables were imputed with the mode, preserving the original distribution when missingness is limited or one category predominates.³⁶ The extent of missing data for each variable in the training dataset is detailed in Table 2.

Table 2.

Missing counts for training dataset.

Variables	Missing count (n = 37,519)	Missing percentage
Age	0	0
Race	0	0
Smoking Status	1516	4.04
Killip Class	2925	7.8
Fasting Status	634	1.69
Sex	0	0
History of Dyslipidemia	9286	24.75
History of Diabetes	6150	16.39
History of Hypertension	5629	15
Family History of Premature Cardiovascular Disease	8328	22.2
History of Myocardial Infarction	5467	14.57
Documented CAD	6717	17.9
Chronic Angina (>2 Weeks)	5088	13.56
New Onset Angina (<2 Weeks)	4022	10.72
History of Heart Failure	4783	12.75
History of Chronic Lung Disease	4859	12.95
History of Renal Disease	4915	13.1
History of Cerebrovascular Disease	4814	12.83
History of Peripheral Vascular Disease	5044	13.44
ECG Abnormal ST Elevation >1 mm	0	0
ECG Abnormal ST Elevation >2 mm	0	0
ECG Abnormal ST Depression	0	0
ECG Abnormal T Wave Inversion	0	0
ECG Abnormal Bundle Branch Block	0	0
ECG Abnormal Inferior Leads	0	0
ECG Abnormal Anterior Leads	0	0
ECG Abnormal Lateral Leads	0	0
ECG Abnormal True Posterior	0	0
ECG Abnormal Right Ventricle	0	0
Cardiac Catheterization	1397	3.72
Percutaneous Coronary Intervention	3165	8.44
Coronary Artery Bypass Grafting	3491	9.3
Acetylsalicylic Acid (ASA)/Aspirin	843	2.25
GP Receptor Inhibitor	5386	14.36
Heparin	5120	13.65
Low Molecular Weight Heparin	4787	12.76
Beta Blocker	3740	9.97
Angiotensin-Converting Enzyme Inhibitors (ACEI)	3808	10.15
Angiotensin II Receptor Blocker	5380	14.34
Statin	1375	3.66
Lipid Lowering Agent	5261	14.02
Diuretic	4659	12.42
Calcium Antagonist	5371	14.32
Oral Hypoglycemic	13,165	35.09
Insulin	4527	12.07
Anti-Arrhythmic Agent	5399	14.39
Heart Rate	921	2.45
Systolic Blood Pressure	697	1.86
Diastolic Blood Pressure	923	2.46
Creatine Kinase	7641	20.37
Total Cholesterol	8023	21.38
High-Density Lipoprotein Cholesterol (HDLC)	8702	23.19
Low-Density Lipoprotein Cholesterol (LDLC)	8616	22.96
Triglycerides	9018	24.04
Fasting Blood Glucose	8689	23.16
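
The median and mode imputation described in this section can be sketched as follows. The helper names, column names, and toy rows are hypothetical. Learning the fill values from the training rows only is an assumption made here to avoid leakage, as the text does not state where they were computed.

```python
from statistics import median, mode

def fit_imputer(train_rows, continuous, categorical):
    """Learn one fill value per column from the training rows (None = missing)."""
    fill = {}
    for col in continuous:
        fill[col] = median(r[col] for r in train_rows if r[col] is not None)
    for col in categorical:
        fill[col] = mode(r[col] for r in train_rows if r[col] is not None)
    return fill

def apply_imputer(rows, fill):
    """Return copies of the rows with missing entries replaced by the fill values."""
    return [{c: fill[c] if v is None else v for c, v in r.items()} for r in rows]

# Toy training rows (hypothetical values, not registry data).
train = [
    {"heart_rate": 70, "killip": "I"},
    {"heart_rate": None, "killip": "I"},
    {"heart_rate": 90, "killip": None},
    {"heart_rate": 110, "killip": "IV"},
]
fill_values = fit_imputer(train, ["heart_rate"], ["killip"])
imputed = apply_imputer(train, fill_values)
print(fill_values)
```

The same `fill_values` would then be passed to `apply_imputer` for the calibration and test rows, so that later data never influence the imputation.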

Class balancing via cost-sensitive learning

In this dataset, the in-hospital mortality was rare, with a class ratio of approximately 1 : 16.5 (non-survivors : survivors), creating a pronounced imbalance.²³ Cost-sensitive learning was adopted instead of over- or under-sampling because it embeds misclassification costs directly into the loss function, avoiding data duplication or deletion and reducing over-fitting while preserving feature distributions.³⁷ Higher weights were assigned to minority-class instances and lower weights to majority-class instances,³⁸ imposing greater penalties for misclassifying deaths and improving sensitivity to these clinically critical cases.³⁹
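
As a worked illustration of the weighting idea, the sketch below computes inverse-frequency ("balanced") class weights, one common heuristic for cost-sensitive learning. The whole-cohort counts from Table 4 are used for the arithmetic only; the weights applied during training would be derived from the training set and may differ. The function name is an assumption.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Whole-cohort outcome counts from Table 4 (illustrative use only).
cohort = {"survivor": 46_740, "non_survivor": 2_834}
weights = balanced_class_weights(cohort)
imbalance_ratio = cohort["survivor"] / cohort["non_survivor"]

print(round(imbalance_ratio, 2))            # survivors per non-survivor
print({k: round(v, 3) for k, v in weights.items()})
```

Under this heuristic the ratio of the two weights equals the class ratio, so each death contributes roughly 16.5 times as much to the loss as each survivor.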

Normalization

Z-score normalization was applied on the continuous predictor variables. This method transforms the data for each feature by subtracting its mean and dividing by its standard deviation, calculated from the training dataset. The result is that each normalized feature has a mean of 0 and a standard deviation of 1.⁴⁰ The normalization can be calculated using the following:

x' = \frac{x - \mathrm{mean}}{\mathrm{SD}}

Standardization prevents features with larger absolute values or wider ranges from disproportionately influencing model training. It also makes the model less sensitive to the original units of measurement of the input features and is generally more robust to outliers than min–max scaling, particularly when feature distributions are approximately normal and do not contain extreme, influential outliers.^40,41
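
A minimal sketch of this normalization, with the mean and standard deviation estimated on the training data and then reused for later data. The variable names and toy ages are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Estimate the mean and SD from the training data only."""
    return mean(train_values), pstdev(train_values)

def z_score(values, mu, sd):
    """Apply x' = (x - mean) / SD with previously fitted statistics."""
    return [(v - mu) / sd for v in values]

train_age = [50.0, 60.0, 70.0]   # toy training values
test_age = [60.0, 80.0]          # toy test values

mu, sd = fit_scaler(train_age)
z_train = z_score(train_age, mu, sd)
z_test = z_score(test_age, mu, sd)   # reuses the training statistics
print(z_train, z_test)
```

Only the training features are guaranteed to end up with mean 0 and standard deviation 1; calibration and test features are shifted and scaled by the same constants.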

Feature selection

To identify the most relevant features for in-hospital mortality, reduce model complexity, enhance interpretability, and mitigate overfitting, a structured feature selection process was employed. Specifically, backward feature elimination (BFE), a wrapper method, was utilized.⁴² Models were first trained using all features, and then iteratively re-trained while removing one feature at a time. At each iteration, features were ranked based on their impact on AUC, and the feature contributing the least to model performance was eliminated. This process continued until an optimal subset was identified that achieved the best trade-off between feature count and AUC performance. The final model was selected based on the subset that provided the highest AUC with the fewest features, balancing predictive power and model simplicity.⁴³

BFE was selected over alternative methods based on prior findings indicating its superior performance in predictive modeling within this context.^32,44 BFE was applied independently to LR, RF, SVM, and XGB models to identify feature subsets achieving optimal balance between AUC and dimensionality reduction.
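
The elimination loop can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation: `score_fn` is a hypothetical stand-in for "train the model on this subset and return its AUC", and the toy gains are invented so the example runs without a dataset.

```python
def backward_feature_elimination(features, score_fn, min_features=1):
    """Drop one feature per round, always the one whose removal hurts the score least."""
    current = list(features)
    history = [(tuple(current), score_fn(current))]
    while len(current) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [(score_fn([f for f in current if f != drop]), drop)
                      for drop in current]
        best_score, drop = max(candidates)
        current.remove(drop)
        history.append((tuple(current), best_score))
    return history

def select_subset(history):
    """Pick the smallest subset that reaches the highest score seen."""
    best = max(score for _, score in history)
    return min((len(sub), sub) for sub, score in history if score == best)[1]

# Toy additive gains in thousandths of AUC (hypothetical, not study results).
GAIN = {"killip": 200, "age": 100, "noise_a": 0, "noise_b": -10}

def toy_score(subset):
    return (500 + sum(GAIN[f] for f in subset)) / 1000

history = backward_feature_elimination(list(GAIN), toy_score)
selected = select_subset(history)
print(history, selected)
```

In the toy run the two uninformative features are removed first, and the selection rule keeps the two-feature subset because it matches the best score with the fewest features.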

Model development

Our approach comprised two stages: first, to develop and evaluate multiple base models, and second, to construct stacked ensemble models. All models’ performance metrics were recorded and compared in this study.

Base model

To predict in-hospital mortality among STEMI patients, we developed and evaluated several ML algorithms, including LR, RF, SVM, GBM, XGB, and a stacked ensemble model.^45–49 All models used the feature set selected by XGB through backward elimination (14 features, as detailed in the Feature Selection section). Model hyperparameters are detailed in Table 3.

Table 3.

Models’ parameters.

Models	Parameters
LR	'C': 10, 'class_weight': 'balanced', 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'
RF	'class_weight': 'balanced', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 300
SVM	'kernel': 'linear', 'class_weight': 'balanced', 'C': 1, 'gamma': 0.01
XGB	'colsample_bytree': 0.8, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500, 'reg_lambda': 10, 'scale_pos_weight': 20.826, 'subsample': 0.8
GBM	'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 3, 'min_samples_split': 8, 'n_estimators': 400, 'subsample': 0.8
Stacked All Base Model (LR Meta Learner)	'C': 100, 'max_iter': 20, 'penalty': 'l1', 'solver': 'liblinear'
Stacked Ensemble Base Model (LR Meta Learner)	'C': 100, 'max_iter': 25, 'penalty': 'l1', 'solver': 'liblinear'

To optimize model hyperparameters and prevent overfitting during the development phase, a five-fold cross-validation (CV) strategy was employed. This CV was performed exclusively within the 2006–2018 training dataset. For each model, the five-fold CV process guided hyperparameter tuning (i.e., grid search optimization) and generated the validation scores for model training.
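
A minimal sketch of how five-fold CV and grid search fit together. The fold assignment here is a simple interleaving, which is an assumption: the text does not say whether folds were shuffled or stratified. The parameter name `depth` and the scoring rule `toy_fold_score` are hypothetical placeholders for fitting a model on the training indices and scoring it on the validation indices.

```python
from itertools import product
from statistics import mean

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx); every sample is validated exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

def grid_search(param_grid, fold_score, n_samples, k=5):
    """Return (best_params, best_mean_score) over every grid combination."""
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        avg = mean(fold_score(params, tr, va)
                   for tr, va in k_fold_indices(n_samples, k))
        if best is None or avg > best[1]:
            best = (params, avg)
    return best

# Toy scoring rule (hypothetical): replaces "fit on train_idx, return AUC on val_idx".
def toy_fold_score(params, train_idx, val_idx):
    return 0.80 - 0.01 * abs(params["depth"] - 4)

best_params, best_score = grid_search({"depth": [2, 4, 6]}, toy_fold_score, n_samples=20)
print(best_params, best_score)
```

Each hyperparameter combination is scored by its mean validation performance across the five folds, so no single train/validation partition decides the final configuration.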

Ensemble stacking model

In this study, we propose two stacking models for in-hospital mortality prediction in STEMI patients. The first model stacks all base learners (LR, RF, SVM, GBM, and XGB) with LR as the meta-learner. The second model comprises only RF, GBM, and XGB as base learners, again with LR serving as the meta-learner. The performance of all base and stacked models was compared.

Recent stacking studies reaffirm that a simple, regularized LR meta-learner can give the ensemble both stability and transparency. In a large multi-dataset heart-disease project, Ganie et al. benchmarked LR as a meta-learner and found that it yielded the highest accuracy and AUC while showing the lowest fold-to-fold variance, crediting its superior “calibration and resistance to over-fitting” for the gain.⁵⁰

Model evaluation

To assess predictive performance, all developed models were evaluated on an independent temporal test set. This evaluation provides the most realistic estimate of model generalizability to future, unseen patient data.

Threshold selection

To convert the continuous model outputs into binary classifications (dead or alive), a decision threshold must be chosen. In this study, the optimal threshold was determined using the Youden index (J index).⁵¹

Youden's J statistic is

J = sensitivity + specificity − 1 = recall₁ + recall₀ − 1.

This threshold maximizes the vertical distance between the ROC curve and the diagonal line representing random chance, balancing a model's sensitivity (true positive rate) and specificity (true negative rate) by weighting them equally.⁵² The Youden index is widely used in medical diagnostics and prognostics for selecting cut-off points in continuous tests.⁵³
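
The threshold search can be sketched as follows, assuming that every distinct predicted probability is tried as a candidate cut-off and that a patient is classified as a death when the probability is at or above the cut-off. The function name and toy data are hypothetical.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy outcomes (1 = death) and predicted probabilities (hypothetical).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold, j_stat = youden_threshold(y_true, y_prob)
print(threshold, round(j_stat, 3))
```

In this toy example the cut-off of 0.35 captures all three deaths while misclassifying one of five survivors, giving J = 1.0 + 0.8 − 1 = 0.8.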

Performance metrics

After selecting the optimal classification threshold for each model, performance was assessed using established metrics. These metrics offer complementary perspectives on the predictive capabilities of each model. Binary classification performance is evaluated using a confusion matrix with four outcomes: True Positives (TP, correctly identified deaths), True Negatives (TN, correctly identified survivors), False Positives (FP, survivors incorrectly predicted as deaths), and False Negatives (FN, deaths incorrectly predicted as survivors). From these values, the following metrics are calculated:

AUC-ROC measures the model's ability to discriminate between patients who experienced in-hospital mortality and those who did not, across all possible thresholds. An AUC-ROC value of 1.0 indicates perfect discrimination, whereas a value of 0.5 reflects performance equivalent to random chance. AUC-ROC is widely regarded as a robust metric for evaluating classifier performance, particularly in medical applications.⁵⁴

AUC = \int_{0}^{1} TPR(FPR^{-1}(x)) \, dx
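
The area under the ROC curve equals the probability that a randomly chosen non-survivor receives a higher predicted risk than a randomly chosen survivor, with ties counted as one half. The sketch below uses this pairwise form, which is easy to verify by hand but quadratic in the number of patients. The function name and toy data are hypothetical.

```python
def auc_pairwise(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy outcomes (1 = death) and predicted probabilities (hypothetical).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

auc = auc_pairwise(y_true, y_prob)
print(round(auc, 4))
```

Here 14 of the 15 death-survivor pairs are ordered correctly, so the AUC is 14/15.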

Accuracy represents the proportion of all instances in the test set that were correctly classified by the model using the selected threshold (based on the Youden index). This metric provides an overall assessment of the model's correctness in predicting both positive and negative cases.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Specificity (true negative rate) measures the proportion of actual negatives (i.e., survivors) that were correctly identified as such by the model. This metric reflects the model's ability to accurately recognize patients who did not experience in-hospital mortality.

Specificity = \frac{TN}{TN + FP}

Recall (sensitivity, true positive rate) quantifies the proportion of actual positives (i.e., non-survivors) that were correctly identified as non-survivors by the model. In the context of mortality prediction, where the positive class (deaths) is often underrepresented, recall is a particularly important metric. High recall indicates the model's effectiveness in detecting true cases of in-hospital mortality. Since mortality datasets are frequently imbalanced, relying on accuracy alone can be misleading, as a model may achieve high accuracy simply by favoring the majority class.

Recall = \frac{TP}{TP + FN}

Brier Score measures the mean squared difference between predicted probabilities and actual outcomes, providing an assessment of the model's calibration.⁵⁵ Lower Brier scores indicate better alignment between predicted risks and observed outcomes.

BS = \frac{1}{N} \sum_{t = 1}^{N} (f_{t} - o_{t})^{2}
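
The threshold-based metrics and the Brier score can be computed directly from their definitions, as in the sketch below. The function names, toy data, and the 0.35 cut-off are hypothetical; a probability at or above the cut-off is treated as a predicted death.

```python
def classification_metrics(y_true, y_prob, threshold):
    """Accuracy, recall, and specificity from the confusion matrix at a cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 1)
    tn = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 0)
    fp = sum(1 for y, h in zip(y_true, y_pred) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_pred) if y == 1 and h == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy outcomes (1 = death) and predicted probabilities (hypothetical).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

metrics = classification_metrics(y_true, y_prob, threshold=0.35)
brier = brier_score(y_true, y_prob)
print(metrics, round(brier, 6))
```

Unlike the other three metrics, the Brier score does not depend on the cut-off, which is why it is used to judge the probabilities themselves.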

Model calibration

Model calibration refers to the agreement between predicted probabilities and observed outcome frequencies. A well-calibrated model generates probability estimates that accurately reflect true event likelihood, which is essential for clinical decision-making and effective risk communication.⁵⁶ To enhance calibration, isotonic regression, a non-parametric technique that fits a monotonically increasing function to predicted probabilities, was applied, as it has demonstrated effectiveness in clinical ML settings.^57,58 This approach allows for flexible adjustment without assuming a specific functional form, particularly when the relationship between predicted and actual probabilities deviates from logistic scaling, thereby improving interpretability and trustworthiness of risk estimates.⁵⁹ Following calibration, model performance was reassessed using the optimal threshold and compared with pre-calibration results. Calibration curves were plotted to visualize improvements in model calibration.
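
Isotonic regression is commonly fitted with the pool-adjacent-violators algorithm, sketched below in plain Python. This is an illustration under stated assumptions, not the study's code: the calibration scores and outcomes are toy values standing in for the 2019 calibration set, and prediction uses a simple step function, whereas library implementations may interpolate between fitted points.

```python
from itertools import groupby

def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit; returns [(upper_score, calibrated_prob), ...]."""
    pairs = sorted(zip(scores, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, largest_score_in_block]
    for score, group in groupby(pairs, key=lambda p: p[0]):
        ys = [y for _, y in group]          # tied scores share one block
        blocks.append([float(sum(ys)), len(ys), score])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def predict_isotonic(model, score):
    """Step-function lookup: value of the first block whose upper bound covers the score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]

# Toy calibration data (hypothetical): raw model scores and observed outcomes.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
cal_outcomes = [0, 0, 1, 0, 1, 1]

iso_model = fit_isotonic(cal_scores, cal_outcomes)
print(iso_model, predict_isotonic(iso_model, 0.35))
```

The out-of-order pair at scores 0.3 and 0.4 (a death followed by a survivor) is pooled into one block with a calibrated probability of 0.5, which keeps the mapping non-decreasing.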

Explainable AI (XAI)

To enhance the transparency, interpretability, and clinical trustworthiness of the best-performing ML model, SHAP was employed.⁶⁰ SHAP quantifies each feature's contribution to individual predictions by assigning SHAP values that reflect their impact on shifting the model's output from a baseline. For global interpretability, SHAP summary plots rank features by mean absolute SHAP values, demonstrating overall feature importance and the direction of their effects. For local interpretability, SHAP force or waterfall plots decompose individual predictions to show how patient-specific feature values influence risk estimates relative to the baseline, thus providing case-level explanation.
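
For a linear model such as LR, and under the simplifying assumption that features are treated as independent, the SHAP value of feature i on the log-odds scale has a closed form: the coefficient multiplied by the feature's deviation from its background mean. The sketch below illustrates this local decomposition. The coefficients, intercept, background means, and patient values are hypothetical and are not taken from the fitted model, and the sketch explains the uncalibrated log-odds rather than the calibrated probability.

```python
def linear_output(weights, intercept, x):
    """Log-odds predicted by a linear model."""
    return intercept + sum(weights[f] * x[f] for f in weights)

def linear_shap(weights, x, background_mean):
    """SHAP values for a linear model with independent features:
    phi_i = w_i * (x_i - E[x_i])."""
    return {f: weights[f] * (x[f] - background_mean[f]) for f in weights}

# Hypothetical coefficients and background means (not from the study).
weights = {"age_z": 0.5, "killip_iv": 1.5}
intercept = -3.0
background_mean = {"age_z": 0.0, "killip_iv": 0.13}

patient = {"age_z": 1.2, "killip_iv": 1.0}
phi = linear_shap(weights, patient, background_mean)

baseline = linear_output(weights, intercept, background_mean)
prediction = linear_output(weights, intercept, patient)
print(phi, baseline, prediction)
```

The contributions add up exactly to the gap between the patient's prediction and the baseline, which is the property that waterfall and force plots visualize.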

Comparative analysis

Model performance was further evaluated using the NRI.⁶¹ The NRI quantifies the degree to which a new model improves patient risk classification compared to an established standard. In this study, comparisons were made against the TIMI score. This metric is particularly relevant in clinical contexts, where improved risk stratification can influence treatment decisions and resource allocation. An NRI value of zero indicates equivalent discriminative ability between the ML and TIMI models, while a positive NRI denotes superior reclassification by the ML model. Conversely, a negative NRI suggests poorer performance in distinguishing between risk categories.⁶¹

For the calculation of NRI, we dichotomized patients using a TIMI risk score threshold of ≥4 to define high-risk status. While clinical practice employs various thresholds for risk stratification (TIMI ≥4 or ≥5), we selected TIMI ≥4 for several reasons. First, the NRI methodology requires binary classification to quantify the proportion of patients correctly reclassified into appropriate risk categories. Second, TIMI ≥4 represents a more stringent definition that identifies patients at the upper end of the intermediate-risk and all high-risk categories who derive the greatest benefit from aggressive interventions, as demonstrated in the TACTICS-TIMI 18 trial.⁶² Third, this threshold optimizes sensitivity for identifying patients who require intensive monitoring, early invasive strategies, or transfer to higher-level care facilities, which is clinically preferable to false-negative misclassification of high-risk patients. The use of TIMI ≥4 has been adopted in clinical pathways at several institutions for triaging acute coronary syndrome patients.⁶³
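
With both classifiers dichotomized, the two-category NRI is the net proportion of non-survivors moved up to high risk plus the net proportion of survivors moved down to low risk. The sketch below follows that definition. The TIMI scores, ML probabilities, outcomes, and the 0.4 ML cut-off are toy values chosen for illustration; only the TIMI ≥4 rule comes from the text.

```python
def categorical_nri(y_true, old_high, new_high):
    """Two-category NRI of a new classifier against a reference classifier."""
    up_e = down_e = up_ne = down_ne = 0
    n_events = sum(y_true)
    n_nonevents = len(y_true) - n_events
    for y, old, new in zip(y_true, old_high, new_high):
        moved_up = new and not old      # low risk -> high risk
        moved_down = old and not new    # high risk -> low risk
        if y == 1:
            up_e += moved_up
            down_e += moved_down
        else:
            up_ne += moved_up
            down_ne += moved_down
    return (up_e - down_e) / n_events + (down_ne - up_ne) / n_nonevents

# Toy data (hypothetical): 4 deaths followed by 6 survivors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
timi_score = [5, 3, 2, 6, 4, 5, 1, 2, 3, 4]
ml_prob = [0.7, 0.6, 0.2, 0.9, 0.1, 0.5, 0.05, 0.1, 0.2, 0.6]

timi_high = [s >= 4 for s in timi_score]   # TIMI >= 4 defines high risk
ml_high = [p >= 0.4 for p in ml_prob]      # hypothetical ML cut-off

nri = categorical_nri(y_true, timi_high, ml_high)
print(round(nri, 4))
```

In the toy data one death is correctly moved to high risk (1/4) and one survivor is correctly moved to low risk (1/6), so the NRI is about 0.417.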

Prototype development

The best-performing model from this study will be integrated into a publicly accessible website. This interface will allow users, including clinicians and researchers, to input relevant patient data and obtain real-time risk predictions based on the validated model. The web-based tool is designed to enhance accessibility, promote transparency, and facilitate clinical decision-making by providing an easy-to-use platform for exploring model outputs in practice.

Results

Patient characteristics

Table 4 summarizes the baseline characteristics of the study population. A total of 49,574 patients were included in the final analysis. Of these, 46,740 patients survived to hospital discharge, while 2834 patients (5.7%) experienced in-hospital mortality. The mean age was 56.21 years, and the cohort was predominantly male, accounting for approximately 86% of the population. Most patients were of Malay ethnicity (56.3%), followed by Chinese (17.0%), Indian (15.6%), and other ethnic groups (11.1%).

Table 4.

Patient characteristics.

Variables	Category	Total	Alive	Dead	P-value
N	Continuous	49,574	46,740 (94.3%)	2834 (5.7%)
Age	Continuous	56.21 ± 12.01	55.87 ± 11.93	61.90 ± 11.94	<0.001
Sex	Male	42,665 (86.1%)	40,481 (86.6%)	2184 (77.1%)	<0.001
Sex	Female	6909 (13.9%)	6259 (13.4%)	650 (22.9%)
Race	Malay	27,901 (56.3%)	26,246 (56.2%)	1655 (58.4%)	<0.001
	Chinese	8423 (17.0%)	7879 (16.9%)	544 (19.2%)
	Indian	7734 (15.6%)	7351 (15.7%)	383 (13.5%)
	Others	5516 (11.1%)	5264 (11.3%)	252 (8.9%)
Smoking Status	Never	14,996 (30.2%)	13,888 (29.7%)	1108 (39.1%)	<0.001
	Former	9914 (20.0%)	9231 (19.7%)	683 (24.1%)
	Current	24,664 (49.8%)	23,621 (50.5%)	1043 (36.8%)
History of Dyslipidemia	No	37,132 (74.9%)	35,040 (75.0%)	2092 (73.8%)	0.1774
History of Dyslipidemia	Yes	12,442 (25.1%)	11,700 (25.0%)	742 (26.2%)
History of Hypertension	No	18,756 (37.8%)	17,874 (38.2%)	882 (31.1%)	<0.001
History of Hypertension	Yes	30,818 (62.2%)	28,866 (61.8%)	1952 (68.9%)
Family History of Premature Cardiovascular Disease	No	44,309 (89.4%)	41,665 (89.1%)	2644 (93.3%)	<0.001
Family History of Premature Cardiovascular Disease	Yes	5265 (10.6%)	5075 (10.9%)	190 (6.7%)
History of Myocardial Infarction	No	44,642 (90.1%)	42,125 (90.1%)	2517 (88.8%)	0.0255
History of Myocardial Infarction	Yes	4932 (9.9%)	4615 (9.9%)	317 (11.2%)
Documented CAD	No	44,842 (90.5%)	42,362 (90.6%)	2480 (87.5%)	<0.001
Documented CAD	Yes	4732 (9.5%)	4378 (9.4%)	354 (12.5%)
Chronic Angina (>2 Weeks)	No	47,520 (95.9%)	44,796 (95.8%)	2724 (96.1%)	0.5017
Chronic Angina (>2 Weeks)	Yes	2054 (4.1%)	1944 (4.2%)	110 (3.9%)
New Onset Angina (<2 Weeks)	No	13,411 (27.1%)	12,704 (27.2%)	707 (24.9%)	0.01
New Onset Angina (<2 Weeks)	Yes	36,163 (72.9%)	34,036 (72.8%)	2127 (75.1%)
History of Heart Failure	No	48,444 (97.7%)	45,725 (97.8%)	2719 (95.9%)	<0.001
History of Heart Failure	Yes	1130 (2.3%)	1015 (2.2%)	115 (4.1%)
Chronic Lung Disease	No	48,765 (98.4%)	46,005 (98.4%)	2760 (97.4%)	<0.001
Chronic Lung Disease	Yes	809 (1.6%)	735 (1.6%)	74 (2.6%)
Cerebrovascular Disease	No	48,367 (97.6%)	45,657 (97.7%)	2710 (95.6%)	<0.001
Cerebrovascular Disease	Yes	1207 (2.4%)	1083 (2.3%)	124 (4.4%)
Peripheral Vascular Disease	No	49,442 (99.7%)	46,632 (99.8%)	2810 (99.2%)	<0.001
Peripheral Vascular Disease	Yes	132 (0.3%)	108 (0.2%)	24 (0.8%)
ECG Abnormal ST Elevation >1 mm	No	27,129 (54.7%)	25,545 (54.7%)	1584 (55.9%)	0.2049
ECG Abnormal ST Elevation >1 mm	Yes	22,445 (45.3%)	21,195 (45.3%)	1250 (44.1%)
ECG Abnormal ST Elevation >2 mm	No	21,649 (43.7%)	20,465 (43.8%)	1184 (41.8%)	0.0383
ECG Abnormal ST Elevation >2 mm	Yes	27,925 (56.3%)	26,275 (56.2%)	1650 (58.2%)
ECG Abnormal ST Depression	No	44,158 (89.1%)	41,735 (89.3%)	2423 (85.5%)	<0.001
ECG Abnormal ST Depression	Yes	5416 (10.9%)	5005 (10.7%)	411 (14.5%)
ECG Abnormal T Wave Inversion	No	46,845 (94.5%)	44,147 (94.5%)	2698 (95.2%)	0.098
ECG Abnormal T Wave Inversion	Yes	2729 (5.5%)	2593 (5.5%)	136 (4.8%)
ECG Abnormal Bundle Branch Block	No	48,624 (98.1%)	45,893 (98.2%)	2731 (96.4%)	<0.001
ECG Abnormal Bundle Branch Block	Yes	950 (1.9%)	847 (1.8%)	103 (3.6%)
ECG Abnormal Inferior Leads	No	26,178 (52.8%)	24,618 (52.7%)	1560 (55.0%)	0.0147
ECG Abnormal Inferior Leads	Yes	23,396 (47.2%)	22,122 (47.3%)	1274 (45.0%)
ECG Abnormal Anterior Leads	No	23,278 (47.0%)	22,027 (47.1%)	1251 (44.1%)	0.0021
ECG Abnormal Anterior Leads	Yes	26,296 (53.0%)	24,713 (52.9%)	1583 (55.9%)
ECG Abnormal Lateral Leads	No	38,472 (77.6%)	36,469 (78.0%)	2003 (70.7%)	<0.001
ECG Abnormal Lateral Leads	Yes	11,102 (22.4%)	10,271 (22.0%)	831 (29.3%)
ECG Abnormal True Posterior	No	45,759 (92.3%)	43,131 (92.3%)	2628 (92.7%)	0.4001
ECG Abnormal True Posterior	Yes	3815 (7.7%)	3609 (7.7%)	206 (7.3%)
ECG Abnormal Right Ventricle	No	46,456 (93.7%)	43,793 (93.7%)	2663 (94.0%)	0.5908
ECG Abnormal Right Ventricle	Yes	3118 (6.3%)	2947 (6.3%)	171 (6.0%)
CABG	No	49,132 (99.1%)	46,332 (99.1%)	2800 (98.8%)	0.0902
CABG	Yes	442 (0.9%)	408 (0.9%)	34 (1.2%)
ASA	No	1722 (3.5%)	1543 (3.3%)	179 (6.3%)	<0.001
ASA	Yes	47,852 (96.5%)	45,197 (96.7%)	2655 (93.7%)
GP Receptor Inhibitor	No	48,398 (97.6%)	45,612 (97.6%)	2786 (98.3%)	0.0173
GP Receptor Inhibitor	Yes	1176 (2.4%)	1128 (2.4%)	48 (1.7%)
Heparin	No	43,138 (87.0%)	40,627 (86.9%)	2511 (88.6%)	0.0106
Heparin	Yes	6436 (13.0%)	6113 (13.1%)	323 (11.4%)
LMWH	No	39,818 (80.3%)	37,629 (80.5%)	2189 (77.2%)	<0.001
LMWH	Yes	9756 (19.7%)	9111 (19.5%)	645 (22.8%)
Angiotensin II Receptor Blocker	No	48,134 (97.1%)	45,338 (97.0%)	2796 (98.7%)	<0.001
Angiotensin II Receptor Blocker	Yes	1440 (2.9%)	1402 (3.0%)	38 (1.3%)
Lipid Lowering Agent	No	48,393 (97.6%)	45,583 (97.5%)	2810 (99.2%)	<0.001
Lipid Lowering Agent	Yes	1181 (2.4%)	1157 (2.5%)	24 (0.8%)
Diuretics	No	39,800 (80.3%)	37,808 (80.9%)	1992 (70.3%)	<0.001
Diuretics	Yes	9774 (19.7%)	8932 (19.1%)	842 (29.7%)
Calcium Antagonist	No	47,439 (95.7%)	44,679 (95.6%)	2760 (97.4%)	<0.001
Calcium Antagonist	Yes	2135 (4.3%)	2061 (4.4%)	74 (2.6%)
Insulin	No	38,611 (77.9%)	36,590 (78.3%)	2021 (71.3%)	<0.001
Insulin	Yes	10,963 (22.1%)	10,150 (21.7%)	813 (28.7%)
Anti-Arrhythmic Agent	No	47,167 (95.1%)	44,600 (95.4%)	2567 (90.6%)	<0.001
Anti-Arrhythmic Agent	Yes	2407 (4.9%)	2140 (4.6%)	267 (9.4%)
Killip Class	I	32,572 (65.7%)	31,809 (68.1%)	763 (26.9%)	<0.001
	II	8258 (16.7%)	7850 (16.8%)	408 (14.4%)
	III	2175 (4.4%)	1939 (4.1%)	236 (8.3%)
	IV	6569 (13.3%)	5142 (11.0%)	1427 (50.4%)
Diabetes	No	31,509 (63.6%)	30,073 (64.3%)	1436 (50.7%)	<0.001
Diabetes	Yes	18,065 (36.4%)	16,667 (35.7%)	1398 (49.3%)
Chronic Renal Disease	No	47,831 (96.5%)	45,269 (96.9%)	2562 (90.4%)	<0.001
Chronic Renal Disease	Yes	1743 (3.5%)	1471 (3.1%)	272 (9.6%)
Cardiac Catheterization	No	26,434 (53.3%)	24,448 (52.3%)	1986 (70.1%)	<0.001
Cardiac Catheterization	Yes	23,140 (46.7%)	22,292 (47.7%)	848 (29.9%)
PCI	No	31,273 (63.1%)	29,099 (62.3%)	2174 (76.7%)	<0.001
PCI	Yes	18,301 (36.9%)	17,641 (37.7%)	660 (23.3%)
Beta Blocker	No	18,158 (36.6%)	16,389 (35.1%)	1769 (62.4%)	<0.001
Beta Blocker	Yes	31,416 (63.4%)	30,351 (64.9%)	1065 (37.6%)
ACEI	No	23,116 (46.6%)	20,957 (44.8%)	2159 (76.2%)	<0.001
ACEI	Yes	26,458 (53.4%)	25,783 (55.2%)	675 (23.8%)
Statin	No	3514 (7.1%)	2960 (6.3%)	554 (19.5%)	<0.001
Statin	Yes	46,060 (92.9%)	43,780 (93.7%)	2280 (80.5%)
Oral Hypoglycemic Agent	No	42,867 (86.5%)	40,125 (85.8%)	2742 (96.8%)	<0.001
Oral Hypoglycemic Agent	Yes	6707 (13.5%)	6615 (14.2%)	92 (3.2%)
Heart Rate	Continuous	83.02 ± 20.98	82.48 ± 20.49	91.85 ± 26.44	<0.001
Systolic Blood Pressure	Continuous	133.88 ± 28.26	134.64 ± 27.92	121.40 ± 30.70	<0.001
Diastolic Blood Pressure	Continuous	81.06 ± 18.08	81.47 ± 17.87	74.31 ± 20.07	<0.001
Creatine Kinase	Continuous	1397.68 ± 1720.72	1400.67 ± 1709.30	1348.28 ± 1898.92	0.1155
Triglyceride	Continuous	1.72 ± 0.98	1.72 ± 0.99	1.60 ± 0.71	<0.001
Total Cholesterol	Continuous	5.26 ± 1.24	5.28 ± 1.24	4.93 ± 1.21	<0.001
HDLC	Continuous	1.08 ± 0.29	1.08 ± 0.29	1.06 ± 0.26	0.002
LDLC	Continuous	3.36 ± 1.13	3.37 ± 1.13	3.04 ± 1.05	<0.001
Fasting Blood Glucose	Continuous	8.10 ± 3.63	8.04 ± 3.52	9.18 ± 4.99	<0.001

With respect to smoking status, 49.8% were current smokers, 20.0% were former smokers, and 30.2% had never smoked. The high proportion of current smokers underscores a significant modifiable risk factor within this population, particularly in the context of cardiovascular risk management.

Regarding comorbidities, 36.4% had diabetes mellitus, 62.2% had hypertension, and 2.4% had a prior history of cerebrovascular disease. Approximately 46.7% of patients underwent cardiac catheterization, and 36.9% received percutaneous coronary intervention (PCI).

Feature selection

Following the backward elimination process described in the Methods section, the performance of each model was recorded (Figure 2).

Figure 2.

All models’ BFE scores.

The XGB model achieved its highest AUC of 0.8916 when retaining 14 features as shown in Table 5, compared to the best AUC of 0.8905 with 45 features for RF, 0.8909 with 30 features for LR, and 0.8739 with 31 features for SVM. These results indicate that the combination of features selected by the XGB model provided the best balance between model complexity and predictive accuracy. This feature subset was subsequently used for all downstream modeling and evaluation.

Table 5.

Final 14 selected features.

Types	Selected features
Patients Demographics	Age
	Sex
	Diabetes
	Heart Rate
	Fasting Blood Glucose
	Killip Class
	Chronic Renal Disease
	Low-Density Lipoprotein
Medications	Angiotensin-Converting Enzyme Inhibitors
	Beta Blockers
	Oral Hypoglycemic Agent
	Statin
Interventions	Percutaneous Coronary Intervention
	Cardiac Catheterization

Figure 3 shows the 14 selected features ranked by feature importance, from most to least important. The 14 features selected by the XGB model are age, heart rate, low-density lipoprotein cholesterol, fasting blood glucose, sex, Killip class, diabetes, chronic renal disease, cardiac catheterization, PCI, beta blocker usage, ACEI usage, statin usage, and oral hypoglycemic agent usage. All 14 selected predictor variables demonstrated statistically significant associations with in-hospital mortality (p < 0.0001), confirming their relevance for inclusion in the prediction model.

Figure 3.

Feature importance.
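
The backward elimination loop can be illustrated with a minimal Python sketch. It is not the study's implementation: the elimination criterion actually used is the one described in the Methods, and `backward_elimination`, `toy_score`, and the noise feature names are hypothetical. Here `score_fn` stands in for a validation AUC; the loop greedily removes whichever feature costs the least and returns the best-scoring subset seen along the way.

```python
def backward_elimination(features, score_fn, min_features=1):
    """Greedy backward elimination; returns (best_subset, best_score)."""
    current = list(features)
    history = [(list(current), score_fn(current))]
    while len(current) > min_features:
        # Try removing each remaining feature once and keep the best reduced set.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score_fn)
        history.append((list(current), score_fn(current)))
    # Pick the subset with the highest score over the whole elimination path.
    return max(history, key=lambda item: item[1])


# Toy stand-in for a validation AUC: informative features help,
# and every extra feature carries a small complexity penalty.
INFORMATIVE = {"age", "killipclass", "heartrate"}


def toy_score(subset):
    hits = len(INFORMATIVE.intersection(subset))
    return 0.5 + 0.1 * hits - 0.01 * len(subset)


best_subset, best_score = backward_elimination(
    ["age", "killipclass", "heartrate", "noise_a", "noise_b"], toy_score
)
print(sorted(best_subset), round(best_score, 2))
```

In a real pipeline `score_fn` would refit the model on the reduced feature set and evaluate it on held-out data, which is far more expensive than the toy scorer used here.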

Model performance

Table 6 presents the performance metrics of the ML models, including NRI, prior to calibration. Several models demonstrated comparable AUC values around 0.89, such as LR (0.8905; 95% CI: 0.8776–0.9029), SVM (0.8898; 95% CI: 0.8772–0.9022), XGB (0.8904; 95% CI: 0.8773–0.9016), GBM (0.8899; 95% CI: 0.8772–0.9016), Stacked All Base Model (0.8886; 95% CI: 0.8758–0.9002), and Stacked Ensemble Base Model (0.8851; 95% CI: 0.8719–0.8966). Notably, LR and GBM achieved the highest NRI, with approximately 21% improvement over the TIMI risk score. The traditional TIMI score demonstrated limited discriminative ability for predicting in-hospital mortality among STEMI patients in the test set, with an AUC of 0.746. In contrast, all ML models evaluated prior to calibration outperformed the TIMI score, with AUC values ranging from approximately 0.87 to 0.89.

Table 6.

Performance metrics prior to calibration.

Name	Threshold	AUC	Accuracy	Recall	Specificity	Brier score	NRI vs. TIMI >= 4
TIMI Score	-	0.746	-	-	-	-	-
LR	0.51	0.8905 (95% CI: 0.8776–0.9029)	0.8059	0.8425	0.8023	0.1501	0.21
RF	0.47	0.8763 (95% CI: 0.8627–0.8882)	0.8029	0.8035	0.8029	0.1453	0.1612
SVM	0.05	0.8898 (95% CI: 0.8772–0.9022)	0.8055	0.8324	0.8029	0.0683	0.1944
XGB	0.47	0.8904 (95% CI: 0.8773–0.9016)	0.8142	0.8338	0.8122	0.1304	0.2059
GBM	0.04	0.8899 (95% CI: 0.8772–0.9016)	0.8009	0.8483	0.7962	0.0741	0.2124
Stacked All Base Model (LR Meta Learner)	0.05	0.8886 (95% CI: 0.8758–0.9002)	0.8293	0.8049	0.8317	0.0718	0.1863
Stacked Ensemble Base Model (LR Meta Learner)	0.05	0.8851 (95% CI: 0.8719–0.8966)	0.8337	0.7905	0.838	0.073	0.1808
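
The Threshold column determines how predicted probabilities are turned into class labels before accuracy, recall, and specificity are computed. The sketch below shows that step on made-up data; the function name `threshold_metrics` and the toy values are illustrative assumptions, and how each model's threshold was chosen is not restated here.

```python
def threshold_metrics(y_true, y_prob, threshold):
    """Accuracy, recall (sensitivity), and specificity at a probability cut-off."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy data: 3 deaths (1) and 7 survivors (0); not registry data.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.2, 0.7, 0.4, 0.3, 0.2, 0.1, 0.1, 0.05]

metrics_high = threshold_metrics(y_true, y_prob, 0.5)
metrics_low = threshold_metrics(y_true, y_prob, 0.15)
print(metrics_high)
print(metrics_low)
```

Lowering the threshold trades specificity for recall, which is why models with very different thresholds in the table can still show similar recall and specificity.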

Logistic regression model coefficients and odds ratios

Table 7 presents the coefficients and odds ratios (OR) from the LR model. Disease severity indicators demonstrated increased risk, with Killip class (OR = 1.76, β = 0.56) and a history of chronic renal disease (OR = 1.71, β = 0.54) showing the strongest associations. Patient characteristics including age at notification (OR = 1.34), sex (OR = 1.32), and comorbid diabetes mellitus (OR = 1.30) were associated with increased risk. Elevated fasting blood glucose (OR = 1.26) and heart rate (OR = 1.23) also contributed to higher risk.

Table 7.

Logistic regression model coefficients and odds ratios.

Features	Coefficient	Odds ratio
Killip Class (killipclass)	0.564	1.7577
History of Renal Disease (crenal)	0.5374	1.7115
Age (ptageatnotification)	0.2903	1.3369
Sex (ptsex)	0.2797	1.3227
Diabetes (cdm)	0.264	1.3021
Fasting Blood Glucose (fbg)	0.2339	1.2635
Heart Rate (heartrate)	0.2039	1.2262
Low-Density Lipoprotein Cholesterol (ldlc)	−0.0826	0.9207
Cardiac Catheterization (cardiaccath)	−0.2259	0.7978
Percutaneous Coronary Intervention (pci)	−0.2325	0.7925
Beta Blocker (bb)	−0.4884	0.6136
Angiotensin-Converting Enzyme inhibitor (acei)	−0.5239	0.5922
Statin	−0.5617	0.5703
Oral Hypoglycemic Agent (oralhypogly)	−1.4545	0.2335

Pharmacological interventions showed protective effects. Oral hypoglycemic agents demonstrated the strongest protective association (OR = 0.23, β = −1.45), followed by statin therapy (OR = 0.57), ACEI (OR = 0.59), and beta-blockers (OR = 0.61). Cardiac interventions including PCI (OR = 0.79) and cardiac catheterization (OR = 0.80) were associated with reduced risk. LDL cholesterol showed a near-neutral effect (OR = 0.92). These findings align with established clinical evidence supporting the beneficial effects of guideline-directed medical therapy and revascularization in post-myocardial infarction patients.
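
Each odds ratio follows from its logistic regression coefficient through OR = exp(β). The short sketch below recomputes this for four of the reported coefficients; the dictionary layout is illustrative, and the interpretation "per one-unit increase in the feature" assumes whatever feature scaling was applied before model fitting, which is not restated here.

```python
import math

# Coefficients (beta) as reported for the LR model.
coefficients = {
    "killipclass": 0.564,
    "crenal": 0.5374,
    "bb": -0.4884,
    "oralhypogly": -1.4545,
}

# Odds ratio = exp(beta): OR > 1 raises the odds of death, OR < 1 lowers them.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, value in odds_ratios.items():
    direction = "higher" if value > 1 else "lower"
    print(f"{name}: OR = {value:.2f} ({direction} odds of in-hospital death)")
```

This gives values of about 1.76, 1.71, 0.61, and 0.23, matching the odds ratios quoted in the text for Killip class, chronic renal disease, beta blockers, and oral hypoglycemic agents.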

SHAP

The global SHAP summary plot (Figure 4) illustrates the relative contribution of each feature to model predictions across the cohort. The most influential variables were Killip class, beta-blocker use, angiotensin-converting enzyme inhibitor use, oral hypoglycemic agent use, and age. In the plot, higher feature values are indicated in red, while lower values are in blue. SHAP values represent the impact of each feature on the model's output. In this study, positive SHAP values indicate higher predicted probability of in-hospital death, and negative values correspond to lower risk.

Figure 4.

Global SHAP summary plot.

The directionality of feature effects is consistent with clinical understanding: higher Killip class, absence of beta-blocker or ACEI therapy, and older age are associated with increased mortality risk. This alignment between model-derived importance and known clinical predictors enhances the model's face validity and interpretability.

Figure 5 displays a local SHAP force plot for a single patient, illustrating how each feature contributes to the predicted probability of in-hospital death (class = 1). The model output for this patient is f(x) = −0.533, which is higher (i.e., less negative) than the cohort mean prediction E[f(x)] = −0.804. This indicates that the patient's estimated risk is higher than the cohort average.

Figure 5.

Local SHAP explanation.

In the plot, bars extending to the right (red) increase the predicted risk of death, while those extending to the left (blue) reduce it. For this patient, several features shifted the prediction leftward (lower risk), including low Killip class (killipclass), heart rate, fasting blood glucose (fbg), absence of diabetes (cdm), cardiac catheterization (cardiaccath), sex (ptsex), use of statin, and absence of chronic renal disease (crenal).

Conversely, the absence of beta blocker (bb), angiotensin-converting enzyme inhibitor (acei), and oral hypoglycemic therapy (oralhypogly), together with age (ptageatnotification), percutaneous coronary intervention (pci), and low-density lipoprotein cholesterol (ldlc), increased the estimated risk. Because f(x) exceeds E[f(x)], these rightward (risk-increasing) contributions slightly outweighed the leftward (protective) ones; the net score nevertheless remained negative, consistent with predicted survival.
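
The relationship between f(x), E[f(x)], and the individual bars follows from SHAP's additivity property: the per-feature contributions sum to f(x) − E[f(x)]. The sketch below works through the two reported values. It assumes the force plot is expressed in log-odds units, which the text does not state, so the probabilities it prints are illustrative only.

```python
import math


def sigmoid(z):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


base_value = -0.804    # E[f(x)], cohort mean model output reported for Figure 5
model_output = -0.533  # f(x), output for the example patient

# Additivity: the sum of all per-feature SHAP values equals f(x) - E[f(x)].
net_shap = model_output - base_value

# ASSUMPTION: outputs are log-odds, so the sigmoid maps them to probabilities.
p_patient = sigmoid(model_output)
p_cohort = sigmoid(base_value)

print(f"net SHAP contribution: {net_shap:+.3f}")
print(f"patient probability (if log-odds): {p_patient:.3f}")
print(f"cohort-mean probability (if log-odds): {p_cohort:.3f}")
```

A positive net contribution means the risk-increasing bars together are longer than the protective ones, while a negative f(x) under the log-odds assumption still corresponds to a probability below 0.5.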

Model calibration

The models’ performance following calibration is summarized in Table 8. Notably, the Brier score decreased for all models, indicating improved agreement between predicted and observed risks and underscoring the importance of calibration for translating raw model outputs into clinically meaningful probabilities. The most substantial reduction in Brier score was observed in the calibrated LR model, which decreased from 0.1501 to 0.0598, the lowest among all models.

Table 8.

Model performance metrics after calibration.

Name	Threshold	AUC	Accuracy	Recall	Specificity	Brier score	NRI vs. TIMI >= 4
Calibrated LR	0.14	0.8884 (95% CI: 0.8756–0.9011)	0.8538	0.7746	0.8617	0.0598	0.5828
Calibrated RF	0.11	0.8753 (95% CI: 0.8614–0.8876)	0.8205	0.7818	0.8243	0.0639	0.5526
Calibrated SVM	0.12	0.8875 (95% CI: 0.8749–0.9001)	0.8472	0.763	0.8556	0.0599	0.5651
Calibrated XGB	0.14	0.8871 (95% CI: 0.8744–0.8987)	0.7927	0.854	0.7866	0.0615	0.5871
Calibrated GBM	0.13	0.8885 (95% CI: 0.8759–0.9000)	0.8088	0.8425	0.8055	0.0621	0.2131
Calibrated Stacked All Base Model (LR Meta Learner)	0.11	0.8877 (95% CI: 0.8748–0.8992)	0.8012	0.8425	0.7971	0.0614	0.2035
Calibrated Stacked Ensemble Base Model (LR Meta Learner)	0.13	0.8842 (95% CI: 0.8705–0.8957)	0.8281	0.7991	0.831	0.0621	0.1783

Additionally, NRI relative to the TIMI risk score improved markedly after calibration for several models: calibrated LR, RF, SVM, and XGB reached NRI values of 0.55 to 0.59, whereas the calibrated GBM and stacked models remained at approximately 0.18 to 0.21. The calibrated LR model achieved the best overall balance of performance in this study, with an AUC of 0.8884 (95% CI: 0.8756–0.9011), overall accuracy of 0.8538, recall of 0.7746, specificity of 0.8617, Brier score of 0.0598 (the lowest of all models), and NRI of 0.5828.

As shown in Figure 6, the calibration curve for the LR model demonstrates the impact of isotonic calibration on model probability estimates. The original (uncalibrated) model exhibited notable deviation from the ideal diagonal, indicating poor agreement between predicted probabilities and actual outcomes. Following isotonic calibration, the curve aligns much more closely with the ideal diagonal, reflecting substantial improvement in the agreement between predicted risks and observed event rates across the full range of probabilities. This result underscores the effectiveness of calibration in generating clinically meaningful and reliable probability estimates, which is essential for risk stratification in practice.

Figure 6.

Logistic regression calibration curve.
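
Isotonic calibration fits a non-decreasing step function that maps raw model scores to observed event rates; in this study it was fitted on the 2019 calibration set. The sketch below is a pure-Python pool-adjacent-violators routine on toy data, not the study's code. The names `isotonic_fit` and `isotonic_predict` are hypothetical, and library implementations typically interpolate between steps rather than using the plain step lookup shown here.

```python
import bisect


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: returns sorted scores and non-decreasing fitted rates."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [x for x, _ in pairs], fitted


def isotonic_predict(score, xs, fitted):
    """Step-function lookup: use the fitted rate of the nearest score at or below."""
    i = bisect.bisect_right(xs, score) - 1
    return fitted[max(0, min(i, len(xs) - 1))]


# Toy calibration set (not registry data): raw scores and observed outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 0, 1, 0, 1, 1]

xs, fitted = isotonic_fit(raw_scores, outcomes)
print(fitted)
print(isotonic_predict(0.35, xs, fitted))
```

Because the mapping is monotone, it changes the probability values without reordering patients apart from ties, which is why AUC values before and after calibration stay close while the Brier score changes.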

Prototype

As a proof of concept for real-world application, the best-performing calibrated logistic regression model was deployed in a prototype web-based interface named MyHeart STEMI Risk Calculator, publicly accessible at https://myheart-rho.vercel.app/calculators/myheart_stemi. This website is a research prototype, designed to demonstrate feasibility and gather usability feedback from the research community. It is not intended for clinical use, because the rigorous external and prospective validation required for clinical deployment has not yet been performed; these are the next steps in our roadmap.

The web application accepts the 14 validated predictors as input: age, heart rate, low-density lipoprotein cholesterol, fasting blood glucose, sex, Killip class, diabetes, chronic renal disease, cardiac catheterization status, PCI status, and use of beta blockers, ACE inhibitors, statins, and oral hypoglycemic agents. Upon data entry, the system generates a real-time probability estimate of in-hospital mortality along with SHAP-based explanations showing which factors contribute most to the individual patient's risk.

The interface features an intuitive design suitable for both clinicians and researchers, displaying the predicted mortality risk as a percentage along with visual risk stratification (low, moderate and high). The SHAP values accompany each prediction, enhancing transparency by showing how each clinical variable influenced the risk estimate. This prototype demonstrates the feasibility of translating our validated ML model into a practical clinical decision-support tool and provides a platform for prospective validation and user feedback collection. Figure 7 shows a screenshot of the prototype interface with example patient data.

Figure 7.

MyHeart STEMI prototype.

Discussion

Robust feature selection and transparent model explainability are critical to ensuring clinical interpretability and building physician confidence in AI-driven prediction systems. Our final model identified 14 predictors of in-hospital mortality in STEMI: demographics and comorbidities (age, sex, diabetes, and chronic kidney disease), vital signs and laboratory values (heart rate, fasting blood glucose, and LDLC), clinical presentation (Killip class), in-hospital management (cardiac catheterization and PCI), and medications at admission or during hospitalization (beta blockers, ACEI, statins, and oral hypoglycemic agents).

All variables were incorporated into the ML models and were significantly associated with mortality on univariate analysis (each p < 0.001), confirming clinical relevance. Many overlap with established scores: age, heart rate, Killip class, and diabetes appear in TIMI, while age and Killip class also feature in GRACE.⁶ A 2023 systematic review of 50 ML models for acute coronary syndromes reported that age, sex, heart rate, Killip class, renal function, blood glucose, and hemoglobin were the most frequently selected predictors, six of which (all except hemoglobin) are included in our feature set.⁶⁴ This concordance with existing literature supports the face validity of our model.

In this study, ML models, including individual algorithms and stacked ensembles, demonstrated substantially better performance than the conventional TIMI risk score for predicting in-hospital mortality in a Southeast Asian STEMI cohort. The calibrated LR model emerged as the best-balanced predictor across the performance metrics, achieving an AUC of 0.8884 (95% CI: 0.8756–0.9011), overall accuracy of 0.8538, recall (sensitivity) of 0.7746, and specificity of 0.8617. After probability calibration, this model attained the lowest Brier score (0.0598), and it yielded one of the highest net reclassification improvements relative to TIMI (NRI = 0.5828), second only to the calibrated XGB model (NRI = 0.5871). These results underscore the promise of ML-based approaches in advancing risk stratification for diverse patient populations.

Notably, the relatively simple calibrated logistic regression model slightly outperformed more complex models, including our stacked ensemble models, suggesting that added model complexity did not necessarily translate to better discrimination in this dataset. This finding reinforces the importance of careful model selection and calibration in achieving optimal performance.

Our results also represent a methodological advancement over prior ML models developed on this registry.²⁰,³² While earlier work demonstrated high discrimination (AUC), this study builds on those findings by: (a) using a more contemporary patient cohort (up to 2021) that reflects modern treatment patterns; (b) assessing model robustness with a strict temporal validation rather than a conventional random split; and (c) systematically addressing probability calibration as a primary endpoint. By demonstrating that an uncalibrated LR model (Brier score: 0.1501) can be substantially improved through calibration (Brier score: 0.0598), we answer the call from recent JACC perspectives to move beyond simple performance metrics and focus on developing tools that are clinically reliable and trustworthy.⁶⁵

Our findings align with a growing body of literature showing that ML algorithms can capture complex, non-linear relationships between clinical features and outcomes that traditional risk scores often miss. Conventional risk scores such as TIMI were developed over two decades ago using predominantly Western cohorts and a limited set of variables. As a result, their performance tends to be suboptimal in contemporary Asian populations. They often underperform in high-risk patients and fail to reflect the impact of modern interventions and treatment strategies.⁶⁶ The performance of our ML models, developed using a large Asian registry, highlights the need for updated risk stratification tools that reflect current populations and treatment practices.

When compared to the TIMI score, all ML models in this study demonstrated improvements in risk reclassification. This was quantified using the NRI, which evaluates how effectively a new model reassigns patients into more appropriate risk categories relative to an existing model. In our analysis, the ML models yielded substantial NRI gains over TIMI, indicating that a meaningful proportion of patients were reclassified into more accurate risk strata. These findings are consistent with previous studies on ACS that have shown improved reclassification performance with ML-based models compared to traditional risk scores.⁶⁷–⁶⁹ Collectively, these comparisons reinforce that data-driven ML models can enhance both the discrimination and clinical usefulness of risk predictions relative to conventional risk scores.
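
For two binary risk classifications (for example, TIMI ≥ 4 versus an ML model above its threshold), a categorical NRI can be written as the net proportion of events moved up plus the net proportion of non-events moved down. The sketch below implements that standard two-category form on toy data; it is an assumption that this matches the exact NRI definition used in the study, and the function name `categorical_nri` is hypothetical.

```python
def categorical_nri(y_true, old_high, new_high):
    """Two-category NRI of a new high-risk flag against an old one."""
    events = [i for i, y in enumerate(y_true) if y == 1]
    nonevents = [i for i, y in enumerate(y_true) if y == 0]

    def moved_up(i):
        return new_high[i] and not old_high[i]

    def moved_down(i):
        return old_high[i] and not new_high[i]

    # Events should move up (into high risk); non-events should move down.
    nri_events = (
        sum(moved_up(i) for i in events) - sum(moved_down(i) for i in events)
    ) / len(events)
    nri_nonevents = (
        sum(moved_down(i) for i in nonevents) - sum(moved_up(i) for i in nonevents)
    ) / len(nonevents)
    return nri_events + nri_nonevents


# Toy data (not registry data): 4 deaths followed by 4 survivors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
old_high = [True, False, False, True, True, True, False, False]
new_high = [True, True, False, True, False, True, False, False]

nri_value = categorical_nri(y_true, old_high, new_high)
print(nri_value)
```

In this two-category form the NRI equals the gain in sensitivity plus the gain in specificity of the new classification over the old one, so its value depends on the thresholds chosen for both scores.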

In this study, ensemble learning methods, specifically RF, XGB, and stacking, did not improve AUC over the calibrated LR model, which demonstrated the best overall balance of performance. This observation is consistent with recent reviews showing that complex ML algorithms do not consistently outperform well-specified logistic regression models in clinical prediction contexts.⁷⁰ Our findings contribute to this area by evaluating ensemble methods and underscore that model interpretability, appropriate feature selection, and probability calibration remain essential for clinical implementation.

An often-underreported component of predictive modeling is calibration, which assesses how closely a model's predicted probabilities align with actual event rates. Calibration is critical for real-world clinical application, as well-calibrated risk estimates support reliable decision-making. Van Calster et al. reported that fewer than 10% of ML-based clinical prediction studies include calibration metrics such as Brier scores or calibration plots,³⁰ underscoring a gap in reporting practices. In this study, calibration was addressed by applying isotonic regression to each model's predicted probabilities. This step improved probabilistic accuracy and consistency. For the calibrated LR model, which performed best overall, the Brier score was 0.0598, indicating strong alignment between predicted and observed mortality rates. Accurate calibration reduces the risk of both overtreatment and undertreatment, which is particularly relevant in settings with limited resources. These findings highlight the importance of incorporating calibration into future cardiovascular risk models developed for Asian populations.
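
The Brier score referred to above is the mean squared difference between predicted probabilities and observed binary outcomes. The sketch below uses toy numbers, not registry data, to show how two sets of predictions with identical ranking (and therefore identical AUC) can have very different Brier scores; the variable names are illustrative.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example: one death among four patients.
y_true = [0, 0, 0, 1]
inflated = [0.30, 0.40, 0.50, 0.90]    # same ordering, probabilities too high
recalibrated = [0.05, 0.10, 0.20, 0.80]  # same ordering, closer to outcomes

brier_inflated = brier_score(y_true, inflated)
brier_recalibrated = brier_score(y_true, recalibrated)
print(round(brier_inflated, 4), round(brier_recalibrated, 4))
```

Lower is better, and a perfect forecast scores 0. Because the score depends on event prevalence, Brier scores are most informative when compared between models evaluated on the same test set.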

Another key methodological consideration in this study was the use of cost-sensitive learning to address the substantial class imbalance between survivors and non-survivors. In-hospital mortality occurred in approximately 5.7% of cases, creating a ratio of about 1:16.5, which could bias standard ML models toward the majority class. Unlike oversampling or undersampling techniques that risk data duplication or loss, cost-sensitive learning embeds class weights directly into the loss function, enabling the algorithm to prioritize correct classification of the minority class without altering the data distribution. This approach has been shown to improve model sensitivity to rare but clinically important events and is particularly well suited to healthcare applications where misclassifying a high-risk patient may have serious consequences. By incorporating this strategy during model training, we aimed to enhance the model's ability to detect high-risk patients while preserving computational efficiency and generalizability.
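
The class ratio and one common way of turning it into loss weights can be sketched as follows. The counts are inferred from the column totals of the baseline table (full cohort, not the training split), and the "balanced" heuristic n / (2 × n_class), the same formula scikit-learn uses for `class_weight="balanced"`, is an assumption for illustration; the exact weights used in the study are not restated here.

```python
import math

n_died = 2834       # non-survivors, inferred from the baseline table totals
n_survived = 46740  # survivors, inferred from the baseline table totals
n_total = n_died + n_survived

mortality_rate = n_died / n_total
imbalance_ratio = n_survived / n_died

# ASSUMPTION: "balanced" weights, inversely proportional to class frequency.
class_weights = {0: n_total / (2 * n_survived), 1: n_total / (2 * n_died)}


def weighted_log_loss(y_true, y_prob, weights):
    """Class-weighted binary cross-entropy, normalised by the total weight."""
    eps = 1e-15
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += weights[y]
    return total / weight_sum


print(f"mortality {mortality_rate:.1%}, ratio 1:{imbalance_ratio:.1f}")
print({k: round(v, 2) for k, v in class_weights.items()})
```

With these weights a misclassified death contributes roughly 16 times as much to the loss as a misclassified survivor, which pushes raw predicted probabilities upward and is one reason a separate calibration step is useful.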

In addition, by training and testing on data spanning more than a decade (patients from 2006 through 2021), the model implicitly learned from temporal trends in patient characteristics and treatments. We further performed a temporal validation, using the most recent years of data as a hold-out test set to mimic prospective performance. This form of validation is stronger than random-split internal validation because it assesses the model's robustness to changes over time (such as evolving clinical practices or population demographics). Our model's stable performance on the 2020–2021 temporal test set indicates resilience to such shifts, which is encouraging for its reliability in future use. This approach aligns with emerging best-practice recommendations that emphasize temporal and geographic validation to ensure generalizability.

Clinically, the improved performance of our calibrated ML model over the TIMI score has significant implications. In practice, more high-risk patients could be correctly identified early (and potentially receive aggressive therapy or closer monitoring), while low-risk patients could be spared unnecessary interventions, ultimately leading to more efficient and effective care. The ease of interpretability through SHAP and the inclusion of readily available clinical variables also facilitate integration into routine workflows. In sum, the combination of a diverse development cohort, rigorous validation, and model explainability provides confidence that our ML model could be safely and effectively deployed to aid risk stratification in STEMI patients across similar healthcare settings.
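
The temporal validation scheme discussed above (2006–2018 for training, 2019 for calibration, 2020–2021 for testing) amounts to assigning each record to a partition by its year. A minimal sketch, with a hypothetical `temporal_split` function and made-up records in which `year` stands in for whatever date field the registry provides:

```python
def temporal_split(year):
    """Assign a record to a partition based on its year."""
    if 2006 <= year <= 2018:
        return "train"
    if year == 2019:
        return "calibration"
    if 2020 <= year <= 2021:
        return "test"
    raise ValueError(f"year {year} is outside the 2006-2021 study window")


# Made-up records for illustration only.
records = [
    {"id": 1, "year": 2007},
    {"id": 2, "year": 2018},
    {"id": 3, "year": 2019},
    {"id": 4, "year": 2021},
]

partitions = {"train": [], "calibration": [], "test": []}
for record in records:
    partitions[temporal_split(record["year"])].append(record["id"])

print(partitions)
```

Because every test record is later than every training and calibration record, no information from the evaluation period can leak into model fitting or calibration.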

Globally, SHAP values confirmed that Killip class, beta blocker use, fasting blood glucose, and age exerted the greatest influence on predicted risk, consistent with their known impact on STEMI outcomes.⁷¹–⁷⁴ Locally, SHAP explanations clarified how individual features increased or decreased each patient's predicted risk, enabling clinicians to explore counterfactual scenarios (e.g., lowering heart rate or initiating specific therapies) and identify modifiable factors. Such explainability is essential for clinician trust and for integrating AI tools into care, where SHAP's additive attributions align well with human reasoning.²⁶ By delivering both cohort-level and patient-specific explanations, our approach provides transparent, clinically interpretable predictions, an indispensable requirement for AI deployment in high-stakes medical decision-making.⁷⁵

Despite the promising results, this study has limitations. The Malaysian NCVD registry is comprehensive and nationally representative, but healthcare delivery and population characteristics can vary across environments. Therefore, the generalizability of our model to settings outside Malaysia, for example other Asian nations or lower-tier hospitals, remains to be proven, and external validation in such settings is required to confirm it. Furthermore, while the GRACE risk score is a well-validated tool for ACS risk stratification, we chose the TIMI score as our primary comparator. This decision was based on several factors, including its availability within our dataset. This availability is a key methodological advantage, as it ensures our models are benchmarked against the actual score used in routine clinical practice, thus avoiding the potential biases of retrospective calculation and providing a more faithful comparison against the real-world standard of care. The TIMI score is a simpler bedside tool that has been locally validated in Malaysia's multi-ethnic STEMI patients, making it more routinely adopted in our hospitals.⁷⁶

By contrast, the GRACE score, though often more predictive in Western cohorts, has notable limitations in Asian settings. It was derived largely from Western populations with minimal Asian representation, and applying it without adjustment can be problematic.⁷⁷ For instance, a Singaporean study found that the original GRACE model significantly underestimated in-hospital mortality for Chinese, Malay, and Indian patients, necessitating recalibration to improve its accuracy.⁷⁸ GRACE also requires laboratory variables (e.g., serum creatinine), which can delay risk assessment at presentation.⁷⁷ These factors have limited GRACE's uptake in our setting, where TIMI's ease of use and proven utility make it the preferred score. Nevertheless, emerging evidence indicates that modern ML models can outperform both TIMI and GRACE. Recent meta-analyses and regional studies consistently report that ML algorithms achieve higher discrimination (AUC/C-statistic) for ACS mortality prediction compared to traditional scores.⁷⁸ Similarly, large Asian studies show that AI-based models markedly improve classification of high-risk patients over the TIMI score. Collectively, these findings underscore why data-driven ML approaches are poised to surpass conventional risk scores like TIMI and GRACE in diverse patient populations.

A key strength of this study is the use of a large, nationally representative registry of STEMI patients in Malaysia, encompassing a truly multi-ethnic Asian population. Our cohort includes patient demographics and risk factor profiles that are reflective of the broader Southeast Asian region (e.g., Malay, Chinese, Indian, and other ethnic groups common to Malaysia, Singapore, Indonesia, and beyond). This diversity in the development dataset enhances the external validity of our findings and suggests that the resulting model may generalize well to other Asian populations. To support real-world applicability, we have deployed a prototype web-based tool, the MyHeart STEMI Risk Calculator, which is accessible at https://myheart-rho.vercel.app/calculators/myheart_stemi. This interface enables risk estimation based on the final model and will be part of an ongoing effort to collect user feedback and perform continuous external validation in diverse clinical settings.

Future work

This work opens several avenues for future research and development. Rigorous external validation of the model is a priority. We plan to test the model in external cohorts, such as registry data from neighboring countries or distinct healthcare systems, to evaluate its calibration and discrimination in new populations. This will help identify any overfitting to the derivation cohort and ensure that the model's performance holds when confronted with different case-mix and practice patterns. Equally important is the need for comprehensive fairness and equity analyses. Future work will include stratified performance evaluations across key subgroups including hospital type (urban tertiary vs. suburban vs. rural), geographic regions, sex/gender, and ethnic groups to identify any disparities in model accuracy. We will also conduct formal fairness audits to ensure the model performs equitably across these diverse populations and assess whether systematic differences in data quality across sites affect predictions. These equity analyses are essential to ensure that the model provides reliable risk stratification for all patient populations before clinical deployment.

Furthermore, while our current model focuses on the critical endpoint of in-hospital mortality to guide acute care decisions, a key priority for future work is to expand this model to predict longer-term outcomes. This will involve moving beyond binary classification to include survival analysis, allowing for the prediction of 30-day, 180-day or 1-year mortality and the generation of survival curves that can estimate patient prognosis over several years. Such models are essential for informing post-discharge planning and secondary prevention strategies.

Another important direction is the integration of the model into electronic health record (EHR) systems for prospective evaluation. Embedding the risk calculator into hospital EHRs would allow automated, real-time risk scoring for incoming STEMI patients, enabling clinicians to act on the predictions within the clinical workflow. Such integration should be accompanied by user interface design that leverages the model's explainability (e.g., showing which factors contributed most to a patient's risk) to support clinician understanding and trust.

Limitations

A further methodological limitation is the lack of a formal sensitivity analysis for our data imputation. We employed median and mode imputation due to its simplicity and computational efficiency. However, we did not test whether more complex methods, such as multivariate imputation by chained equations (MICE) or k-nearest neighbors (k-NN) imputation, would have significantly altered the model's performance. This represents an avenue for future methodological validation.

Furthermore, while we noted that GRACE's reliance on laboratory values can delay immediate risk assessment, our model includes FBG and LDL-cholesterol with the understanding that it is intended for in-hospital risk stratification rather than hyperacute triage. These laboratory values are routinely obtained within the first 24 hours as part of standard STEMI care in our setting. For settings where immediate bedside risk assessment is needed, a simplified version of the model using only clinical variables (age, heart rate, Killip class, and vital signs) could be developed and validated in future work. This would create a tiered approach: a rapid triage score at presentation and an enhanced prediction model once laboratory results are available.

Finally, we advocate for prospective clinical trials or implementation studies to assess the impact of using ML-driven risk scores on patient outcomes. For example, a stepped-wedge trial could determine whether management guided by our ML model (versus usual care or TIMI score guidance) leads to differences in treatment decisions, resource allocation, or clinical outcomes. These future studies will be crucial to move from retrospective model development to real-world clinical adoption. By addressing these next steps on expanded outcomes, external validation, EHR integration, and prospective evaluation, we aim to advance the translation of our calibrated ML risk model into a reliable decision-support tool for improving STEMI care.
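To make the stepped-wedge design concrete, the sketch below generates a rollout schedule in which every site starts under usual care and crosses over to ML-guided management at a randomly ordered, staggered step. The site names, the one-site-per-step layout, and the function name are illustrative assumptions; an actual trial would set the number of sites per step and the period length from a power calculation.

```python
import random


def stepped_wedge_schedule(clusters, seed=0):
    """Return {cluster: [arm per period]} with one cluster crossing over per step.

    Period 0 is an all-control baseline; by the final period every cluster
    has switched to the intervention and none ever switches back.
    """
    order = list(clusters)
    random.Random(seed).shuffle(order)  # randomise the crossover order
    n_periods = len(order) + 1
    schedule = {}
    for position, cluster in enumerate(order):
        crossover = position + 1
        schedule[cluster] = [
            "ML" if period >= crossover else "usual" for period in range(n_periods)
        ]
    return schedule


# Hypothetical site names for illustration.
sites = ["hospital_A", "hospital_B", "hospital_C", "hospital_D"]
schedule = stepped_wedge_schedule(sites, seed=1)
for site in sites:
    print(site, schedule[site])
```

Because every site eventually receives the intervention and contributes both control and intervention periods, the design supports within-site comparisons while remaining practical for a phased EHR rollout.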

Conclusion

This study introduces a clinically applicable and well-calibrated ML model for predicting in-hospital mortality among STEMI patients in an Asian population, leveraging a large, nationally representative registry from Malaysia. The model outperformed conventional risk stratification tools, including the TIMI score, in terms of discrimination, calibration, and interpretability. A temporal split of the dataset was employed to simulate prospective deployment and assess model generalizability over time. By integrating contemporary ML techniques with explainability frameworks, the model provides a transparent and actionable decision-support tool tailored to regional clinical profiles. These findings underscore the importance of periodic model updates and local validation to enhance the clinical utility of risk prediction in acute cardiovascular care.

Footnotes

Acknowledgments

This work was supported by the Higher Institution Centre of Excellence (HICoE) research grant 600-RMC/MOHE HICoE CARE-I 5/3 (01/2025) awarded to the Cardiovascular Advancement and Research Excellence Institute (CARE Institute), Universiti Teknologi MARA.

ORCID iDs

Khairul Shafiq Ibrahim

Wan Azman Wan Ahmad

Sorayya Malek

Ethical approval

The Medical Review & Ethics Committee (MREC) of the Ministry of Health (MOH) of Malaysia approved the NCVD registry study (NMRR-07-38-164), and further authorization was granted by the UiTM ethics committee (600-TNCPI (5/1/6)).

Contributorship

Sazzli Kasim was involved in conceptualization, methodology, resources, funding acquisition, writing–review and editing, supervision, and project administration. Lim Bing Feng contributed to conceptualization, methodology, software, data curation, writing–original draft preparation, and writing–review and editing. Sorayya Malek was responsible for conceptualization, methodology, software, data curation, resources, funding acquisition, writing–original draft preparation, writing–review and editing, supervision, and project administration. Putri Nur Fatin Amir Rudin contributed to conceptualization, methodology, software, and data curation. Khairul Shafiq Ibrahim and Wan Azman Wan Ahmad both contributed to writing–review and editing, supervision, and project administration. Alan Yean Yip Fong contributed to software and data curation. All authors have read and approved the final version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Higher Institution Centre of Excellence (HICoE) (grant number: 600-RMC/MOHE HICoE CARE-I 5/3 (01/2025)).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The data that support the findings of this study are held by the National Heart Association of Malaysia (NHAM), but access is restricted and they are not publicly available. The data belong to the individual Ministry of Health hospitals, university hospitals, and private hospitals, and release to third parties requires multiple institutional agreements; ethical approval is therefore required for analysis. Data are, however, available from NHAM upon request by email at secretariat@malaysianheart.org. Any findings from the data must be reported to the NHAM committee, and its permission obtained, before publication.
