Sage Journals: Discover world-class research

Abstract

Machine learning techniques offer significant potential for improving the diagnosis of coronary heart disease by enabling earlier detection and timely intervention. This study presents a machine learning-based method utilizing clinical records to evaluate the impact of different data preprocessing sequences on predictive accuracy. Two clinical datasets were examined: one comprising heart failure patient data with 14 clinical features, and the Cleveland Heart Disease Dataset. The investigation compared two preprocessing strategies: standardisation prior to balancing, and balancing prior to scaling. Six machine learning models (XGBoost, GBDT, AdaBoost, Random Forest, KNN, and RaSE) were trained on an 80:20 data split and assessed using accuracy, precision, recall, and F1-score. Hyperparameters were optimized with Bayesian Optimisation. Results showed that both preprocessing designs achieved perfect accuracy on the Cleveland dataset. For the heart failure dataset, balancing before scaling led to improved accuracy (95%) compared with standardising before balancing (93.33%), and yielded higher macro-average and weighted-average F1-scores, signifying better overall classification performance. Among the evaluated models, XGBoost consistently provided the most robust predictions across conditions. These findings highlight the critical influence of preprocessing sequence on model effectiveness in imbalanced clinical data and suggest that balancing before scaling significantly enhances classification accuracy. XGBoost stands out as a reliable model for potential implementation in clinical decision support systems. Overall, this study advances the development of AI-driven tools for digital health applications, contributing meaningful insights to the field of health informatics.

Keywords

machine learning, coronary heart disease, balancing, Bayesian optimisation, health informatics

Introduction

Introduction of heart disease

Heart disease causes approximately one-third of deaths worldwide each year.¹ Most forms of heart disease require immediate treatment, making them medical emergencies for physicians.^2,3 Despite this urgency, there are several predictable signs that allow physicians to judge the likelihood that an individual has heart disease and, furthermore, the severity of the disease.^4,5 These signs, or risk factors, are valuable material for predicting heart disease events. Researchers combine these risk factors with machine learning techniques to develop predictive models, aiming to build an auxiliary diagnostic system that assists physicians and speeds up the evaluation process.

Data-preprocessing approaches

Data preprocessing is crucial for improving the performance of trained models because it reduces potentially misleading data.⁶ Common data-preprocessing approaches include data scaling, feature extraction, and data balancing.⁷ Data scaling includes methods such as Standardisation, MinMaxScaler, MaxAbsScaler, and RobustScaler.^8–10 Another important approach is data balancing, which is particularly significant in disease prediction research because the disease and healthy classes often differ substantially in size.^11,12 In 2020, Ambesange, Sateesh, et al. applied several data-balancing techniques and achieved 100% accuracy on the Indian Liver Patient Dataset (ILPD).¹³ Data-balancing approaches comprise up-sampling (also called over-sampling) and down-sampling. Up-sampling increases the number of minority-class samples, with methods such as Random Over-Sampling (ROS),¹⁴ Synthetic Minority Over-sampling Technique (SMOTE),¹⁵ Borderline-SMOTE (BLSMOTE)^16,17 and Adaptive Synthetic Sampling Approach (ADASYN).¹⁸ Conversely, down-sampling decreases the number of majority-class samples, with methods such as Random Under-Sampling (RUS),^19,20 Cluster Under-Sampling (CUS), Cluster Centroid Under-Sampling (CCS),²¹ and NearMiss-1, -2, and -3.²²

Ensemble learning

Ensemble learning is a powerful machine learning technique that combines the predictions of multiple models to improve overall performance and accuracy.²³ This approach is widely used in various applications, including classification, regression, and anomaly detection. In 2022, Budholiya et al. used XGBoost to detect heart disease and achieved 91.8% accuracy.²⁴ In 2021, Pal et al. applied Random Forest to heart disease detection and achieved 93.3% accuracy.²⁵ The primary reason for using ensemble learning is its ability to reduce errors and improve predictive performance. By aggregating the outputs of several models, ensemble methods can mitigate the risk of overfitting and enhance generalisation to new data.²⁶ Common ensemble learning techniques currently include XGBoost,^27,28 Gradient Boosting Decision Trees (GBDT),²⁹ Random Forest,^30,31 Adaptive Boosting (AdaBoost),³² K-Nearest Neighbours (KNN),³³ and Random Subspace Ensemble (RaSE).³⁴ These techniques collectively enhance the robustness, versatility, and predictive power of machine learning models in various fields.

Model enhancement through data balancing and technique optimisation

To refine the accuracy of heart disease classification, we employed the data-scaling, data-balancing, and ensemble learning techniques introduced above. We begin by demonstrating the significance of data balancing by comparing the training outcomes of the heart failure dataset with and without data-balancing techniques. Following this, we assess the impact of different sequences of the applied techniques to determine their influence on model performance. Our objective is to propose a methodology that enhances the predictive accuracy of heart disease classification, thereby contributing to more reliable diagnostic support systems.

Contribution to digital health and health informatics

This study offers a novel contribution to the field of health informatics by integrating advanced machine learning algorithms with real-world clinical datasets to enhance the prediction of coronary heart disease. Unlike previous research that predominantly employed traditional statistical methods or relied on single-center data, our approach uses a dual-dataset comparative framework and systematically examines how the sequence of preprocessing steps, specifically the order of data balancing and scaling, affects model performance. This methodological aspect has received limited attention in cardiovascular prediction research, highlighting the originality and practical value of our design.

By incorporating automated preprocessing pipelines, ensemble learning models, and interpretable artificial intelligence techniques such as SHAP analysis, we propose a scalable and transparent diagnostic framework. These innovations directly address the gap between advancements in machine learning and their integration into clinical workflows. The superior performance of Design II, which applies data balancing before scaling, provides a validated framework for developing accurate and clinically applicable diagnostic tools.

From the perspective of digital health, our findings support the development of artificial intelligence–based clinical decision support systems that emphasize both accuracy and interpretability, which are essential for clinical adoption. The results have practical implications for integration into electronic health records, telemedicine services, and mobile health applications, where personalized and real-time cardiovascular risk assessment is increasingly vital. As healthcare continues to advance toward digitalization, this study provides essential insights into how predictive analytics can be translated into routine clinical practice to support data-informed and patient-centered care.

Material

Dataset and ethical considerations

This study utilised two distinct medical records databases: the Heart Failure Dataset³⁵ and the Cleveland Heart Disease Dataset.³⁶ The datasets were selected based on their relevance to cardiovascular diseases and their availability for research purposes. The study focused on heart failure and heart disease, which are common cardiovascular conditions. The following ICD-10 codes were used to define the diseases: Heart Failure: I50 (Heart Failure), coronary artery disease (CAD): I25 (Chronic Ischemic Heart Disease). These codes were used to classify patients into disease and healthy categories within the datasets.

Heart Failure Dataset: This dataset contains clinical records of 299 patients with heart failure, including 13 features such as age, ejection fraction, serum sodium, and creatinine levels, among others. These records were collected during the patients’ follow-up period. The dataset is imbalanced, with 96 disease samples and 203 healthy samples. The descriptive statistics are shown in Table 1.

Table 1.

Descriptive statistics of heart failure dataset.

Characteristic	Mean	std dev	min	max
Age	60.83	11.89	40.0	95.0
Anaemia	0.43	0.50	0.0	1.0
Creatinine phosphokinase	581.84	970.29	23.0	7861.0
Diabetes	0.42	0.49	0.0	1.0
Ejection fraction	38.08	11.83	14.0	80.0
High blood pressure	0.35	0.48	0.0	1.0
Platelets	263358.03	97804.24	25100.0	850000.0
Serum creatinine	1.39	1.03	0.5	9.4
Serum sodium	136.63	4.41	113.0	148.0
Sex	0.65	0.48	0.0	1.0
Smoking	0.32	0.47	0.0	1.0
Time	130.26	77.61	4.0	285.0
Death event	0.32	0.47	0.0	1.0

Cleveland Heart Disease Dataset: This dataset comprises 1025 patient records with 14 clinical features, including cholesterol levels, blood pressure, age, and maximum heart rate achieved, among others. It is a balanced dataset, consisting of 526 disease samples and 499 healthy samples. The descriptive statistics are shown in Table 2.

Table 2.

Descriptive statistics of Cleveland heart disease dataset.

Characteristic	Mean	std dev	min	max
Age	54.43	9.07	29.0	77.0
Sex	0.70	0.46	0.0	1.0
Chest pain type (4 values)	0.94	1.03	0.0	3.0
Resting blood pressure	131.61	17.52	94.0	200.0
Serum cholesterol in mg/dl	246.00	51.59	126.0	564.0
Fasting blood sugar > 120 mg/dl	0.15	0.36	0.0	1.0
Resting electrocardiographic results (values 0,1,2)	0.53	0.53	0.0	2.0
Maximum heart rate achieved	149.11	23.01	71.0	202.0
Exercise induced angina	0.34	0.47	0.0	1.0
Oldpeak = ST depression induced by exercise relative to rest	1.07	1.18	0.0	6.2
The slope of the peak exercise ST segment	1.39	0.62	0.0	2.0
-Value 0: upsloping
-Value 1: flat
-Value 2: downsloping
Number of major vessels (0-4) coloured by fluoroscopy	0.75	1.03	0.0	4.0
Thal: 1 = normal; 2 = fixed defect; 3 = reversable defect	2.32	0.62	0.0	3.0
Diagnosis of heart disease (angiographic disease status)	0.51	0.50	0.0	1.0
- Value 0: < 50% diameter narrowing
- Value 1: > 50% diameter narrowing

Ethical approval for the use of these datasets was obtained from the respective governing bodies, ensuring that patient data was anonymized to protect privacy. Inclusion criteria for both datasets included adult patients with complete clinical records. Exclusion criteria included patients with incomplete or missing data. Both datasets were processed using a series of scaling and balancing techniques to prepare them for modelling.

Data scaling

The scaling process includes methods such as Standardisation (StandardScaler), MinMaxScaler, MaxAbsScaler, and RobustScaler. StandardScaler uses mean-based scaling by subtracting the mean and dividing by the standard deviation. MinMaxScaler scales each feature to a specified range, typically between zero and one. MaxAbsScaler rescales each feature so that its maximum absolute value is 1.0. Lastly, RobustScaler removes the median and scales by the interquartile range (the range between the 25th and 75th percentiles), reducing the impact of outliers. The formula for each approach is shown in Table 3.

Table 3.

Formulas of the four data scaling approaches.

Approach	Algorithm
StandardScaler	$X_{std} = \frac{X_{i} - \mu_{X}}{\sigma_{X}}$
MinMaxScaler	$X_{std} = \frac{X_{i} - X_{\min}}{X_{\max} - X_{\min}}$
MaxAbsScaler	$X_{std} = \frac{X_{i}}{\|X\|_{\max}}$
RobustScaler	$X_{std} = \frac{X_{i} - X_{med}}{X_{75} - X_{25}}$
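As an illustrative sketch of the four formulas in Table 3, the following standard-library Python applies each one to a single feature column. The function names and the sample column are hypothetical, and the use of the population standard deviation and of inclusive quartiles are assumptions made for this example; they are not settings reported in the study.

```python
import statistics


def standard_scale(xs):
    """(x - mean) / standard deviation; population std is an assumption here."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max_scale(xs):
    """(x - min) / (max - min), mapping the column onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def max_abs_scale(xs):
    """x / max(|x|), so the largest absolute value becomes 1.0."""
    biggest = max(abs(x) for x in xs)
    return [x / biggest for x in xs]


def robust_scale(xs):
    """(x - median) / (75th percentile - 25th percentile)."""
    q25, med, q75 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q75 - q25) for x in xs]


# Hypothetical feature column containing one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

print(min_max_scale(values))
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

In this toy column the outlier compresses the min-max output of the first four values towards zero, whereas the median-based scaling keeps them evenly spaced, which is the behaviour the text attributes to RobustScaler.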

Up-sampling processing

Random over-sampling (ROS)

Random over-sampling is the simplest technique to increase the quantity of samples through randomly duplicating samples in the minority class.
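A minimal sketch of random over-sampling, assuming a list-of-rows representation, is given below. The function name and the placeholder feature rows are hypothetical; only the class counts (203 healthy, 96 disease) are taken from the heart failure dataset description.

```python
import random


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Placeholder rows; only the class sizes mirror the heart failure dataset.
X = [[float(i)] for i in range(299)]
y = [0] * 203 + [1] * 96
X_bal, y_bal = random_over_sample(X, y)
print(y_bal.count(0), y_bal.count(1))  # 203 203
```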

Synthetic minority over-sampling technique (SMOTE)

SMOTE is a widely recognized oversampling method. Instead of oversampling the minority class by simple duplication, it generates “synthetic” samples. This is achieved by taking each minority class sample and creating synthetic examples along the line segments connecting the sample to its nearest minority class neighbours. The neighbours used are selected at random from the k nearest neighbours, with the number chosen depending on the required level of oversampling.
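The interpolation step can be sketched as follows. This is a simplified illustration rather than the reference SMOTE implementation: the base sample is drawn at random instead of iterating over every minority sample, and the function name, the parameter defaults, and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.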

Border line SMOTE (BLSMOTE)

Borderline SMOTE enhances the original SMOTE method by addressing the issue of noise. SMOTE randomly selects minority class samples to generate new ones without checking for outliers (noise), which can place new noisy samples within the majority class and complicate classification. Borderline SMOTE instead categorizes minority samples into three types: (1) Safe samples, where the majority of neighbours are also minority samples; (2) Danger samples, where most neighbours belong to other classes; and (3) Noise samples, where all neighbours are from other classes. By generating new samples only near Danger samples and avoiding Safe and Noise samples, Borderline SMOTE aims to balance the dataset and sharpen the decision boundary.
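The three categories can be illustrated with a short sketch that counts, for each minority sample, how many of its m nearest neighbours belong to another class. The thresholds follow the description above (all neighbours foreign: noise; at least half foreign: danger; otherwise safe). The function name, the value of m, and the one-dimensional toy data are assumptions for illustration.

```python
import math


def categorise_minority(X, y, minority_label=1, m=3):
    """Label each minority sample as 'safe', 'danger' or 'noise' from the
    class make-up of its m nearest neighbours in the whole dataset."""
    categories = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(xi, X[j]))[:m]
        n_other = sum(1 for j in nearest if y[j] != minority_label)
        if n_other == m:
            categories[i] = "noise"
        elif n_other >= m / 2:
            categories[i] = "danger"
        else:
            categories[i] = "safe"
    return categories


# Hypothetical 1-D data: a minority cluster, two borderline minority points,
# and one isolated minority point inside the majority region.
X = [[0], [1], [2], [3], [9], [10], [11], [12], [13], [49], [50], [51], [52]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
print(categorise_minority(X, y))
```

Only the samples marked as danger (indices 4 and 5 in this toy set) would be used as seeds for synthetic samples.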

Adaptive synthetic sampling approach (ADASYN)

The core concept of ADASYN is to generate synthetic data based on a weighted distribution that prioritizes minority class examples according to their learning difficulty. More synthetic data is created for the harder-to-learn minority class examples than for the easier ones. This method enhances learning by: (1) minimizing the bias caused by class imbalance, and (2) adaptively moving the classification decision boundary closer to the challenging examples.
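The weighting idea can be sketched as follows: each minority sample receives a difficulty ratio equal to the share of other-class points among its k nearest neighbours, and the requested number of synthetic samples is divided in proportion to those ratios. The function name, the even-split fallback, and the toy data are assumptions for illustration; the generation step itself (interpolation, as in SMOTE) is omitted.

```python
import math


def adasyn_allocation(X, y, n_new, minority_label=1, k=3):
    """Return how many synthetic samples to create around each minority
    sample, proportional to the share of other-class neighbours."""
    ratios = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
        ratios[i] = sum(1 for j in nearest if y[j] != minority_label) / k
    total = sum(ratios.values())
    if total == 0:
        # No minority sample has other-class neighbours; splitting evenly
        # is a fallback chosen for this sketch only.
        return {i: n_new // len(ratios) for i in ratios}
    return {i: round(n_new * r / total) for i, r in ratios.items()}


# Hypothetical 1-D data (same layout as a cluster, a border, and an outlier).
X = [[0], [1], [2], [3], [9], [10], [11], [12], [13], [49], [50], [51], [52]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
allocation = adasyn_allocation(X, y, n_new=7)
print(allocation)
```

In this toy example the minority samples deep inside their own cluster receive no synthetic neighbours, while the samples surrounded by the other class receive the most.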

Under-sampling processing

Random under-sampling (RUS)

Random under-sampling is a technique used to address class imbalance by randomly selecting instances from the majority class and removing them. This reduces the number of instances in the majority class, making it more balanced with the minority class. While this method is simple and effective, it can potentially remove informative instances, which may negatively impact the performance of the model.

Cluster under-sampling (CUS)

Cluster under-sampling uses clustering algorithms, such as k-means, to group instances from the majority class into clusters. Once the clusters are formed, a subset of these clusters is selected, and all instances within the selected clusters are used for training. This method can help preserve the structure of the data and reduce the risk of losing important information.

Cluster centroid under-sampling (CCS)

Cluster centroid under-sampling also uses clustering algorithms, like k-means, to group instances from the majority class into clusters. However, instead of using all instances within the selected clusters, this method calculates the centroid (mean or median) of each cluster. The centroids represent the central point of each cluster and are used as the training instances. This approach reduces the data size while retaining representative information from the original data.

NearMiss (1,2,3)

NearMiss is a family of methods designed to address class imbalance by selecting instances from the majority class based on their proximity to instances in the minority class. Specifically, NearMiss-1 focuses on selecting majority class instances that have the smallest average distance to the three closest minority class instances. In contrast, NearMiss-2 targets majority class instances that have the smallest average distance to the three farthest minority class instances. Lastly, NearMiss-3 selects majority class instances that are closest to each minority class instance until all minority class instances are surrounded.
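A minimal sketch of the NearMiss-1 selection rule described above is shown below. The function name, the number of retained samples, and the one-dimensional toy points are hypothetical.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples whose average distance to their
    k closest minority samples is smallest."""
    def avg_dist_to_closest(point):
        dists = sorted(math.dist(point, q) for q in minority)[:k]
        return sum(dists) / len(dists)

    return sorted(majority, key=avg_dist_to_closest)[:n_keep]


minority = [[0.0], [1.0], [2.0]]
majority = [[20.0], [3.0], [10.0], [4.0]]
print(near_miss_1(majority, minority, n_keep=2))  # [[3.0], [4.0]]
```

NearMiss-2 would differ only in taking the k farthest minority samples inside `avg_dist_to_closest`.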

Ensemble learning

Extreme gradient boosting (XGBoost)

XGBoost, short for Extreme Gradient Boosting, is a powerful and efficient implementation of the gradient boosting framework. It is widely used for supervised learning problems, particularly in regression and classification tasks. XGBoost improves upon the traditional gradient boosting approach by implementing optimizations such as regularization, parallel processing, and tree pruning, which enhance its performance and speed. Its ability to handle missing values and prevent overfitting makes it a popular choice in machine learning competitions and real-world applications.

Gradient boosting decision trees (GBDT)

Gradient Boosting Decision Trees (GBDT) is an ensemble learning technique that builds a model from weak learners, typically decision trees. By sequentially adding trees, each new tree corrects the errors made by the previous ones. The key idea is to optimize a loss function by adding weak learners that reduce the overall error. GBDT is known for its robustness and accuracy, making it suitable for a variety of predictive modelling tasks.

Random forest (RF)

Random Forest is an ensemble learning method that constructs multiple decision trees during training and merges their results to improve accuracy and control overfitting. Each tree in the forest is built from a random subset of the training data and features, ensuring diversity among the trees. The final prediction is made by averaging the outputs (regression) or taking the majority vote (classification) of the individual trees. Random Forest is known for its high accuracy, scalability, and ability to handle a large number of input variables.

Adaptive boosting (AdaBoost)

AdaBoost, short for Adaptive Boosting, is an ensemble learning method that combines multiple weak classifiers to form a strong classifier. It works by iteratively adjusting the weights of misclassified instances, allowing subsequent classifiers to focus on the more difficult cases. The final model is a weighted sum of the individual classifiers. AdaBoost is effective in improving the accuracy of algorithms but can be sensitive to noisy data and outliers.
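One round of the reweighting step can be sketched as below, using the classic discrete AdaBoost update as an assumption; the study does not state which AdaBoost variant or library implementation was used. The function name and the five-sample example are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One discrete-AdaBoost round: compute the classifier weight alpha from
    the weighted error, then raise the weights of misclassified samples.
    Assumes the weighted error lies strictly between 0 and 1."""
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    alpha = 0.5 * math.log((1 - err) / err)
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five equally weighted samples; the weak classifier gets the first one wrong.
alpha, new_weights = adaboost_reweight([0.2] * 5, [True, False, False, False, False])
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to focus on it.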

K-nearest neighbours (KNN)

K-Nearest Neighbours (KNN) is a simple, non-parametric classification and regression algorithm. It works by identifying the ‘k' closest training instances to a given query point and assigning the majority label (classification) or averaging the values (regression) of these neighbours. The algorithm’s simplicity and effectiveness make it a popular choice for various applications, although it can be computationally expensive and sensitive to the choice of ‘k' and the distance metric.
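The classification rule can be sketched in a few lines, assuming Euclidean distance and an unweighted majority vote. The function name and the toy training points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, query, k=3):
    """Return the most common label among the k training points closest
    to the query (Euclidean distance, unweighted vote)."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(query, X_train[i]))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.5, 0.5]))  # 0
print(knn_predict(X_train, y_train, [5.5, 5.5]))  # 1
```

Because the prediction depends directly on distances, features with large numeric ranges dominate the vote unless the data are scaled first.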

Random subspace ensemble (RaSE)

Random Subspace Ensemble (RaSE) is an ensemble learning technique that creates diverse classifiers by training each classifier on a random subset of the feature space. This method aims to improve the robustness and generalization of the model by reducing the correlation between individual classifiers. By combining the predictions from multiple classifiers, RaSE enhances the overall performance, particularly in high-dimensional data scenarios.

Methodology

The preprocessing phase aimed to investigate the impact of different orders of preprocessing techniques on model performance. As described in Figure 1, two preprocessing designs were considered:

Figure 1.

Study design.

Design I: In this design, the original datasets were first scaled using one of the following scaling techniques: Standardisation, MinMaxScaler, MaxAbsScaler, or RobustScaler. After scaling, data balancing was performed using one of the following techniques: up-sampling methods (ROS, SMOTE, Borderline-SMOTE, ADASYN) or down-sampling methods (CCS, RUS, CUS, NearMiss).

Design II: In this design, data balancing was performed first, followed by scaling. This order allowed for exploring the impact of balancing before scaling on model performance.
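The difference between the two designs can be made concrete with a small sketch. For brevity it swaps in random duplication for the balancing step and standardisation for the scaling step on a single feature; these choices, the function names, and the toy values are assumptions for illustration and do not reproduce the study's pipeline.

```python
import random
import statistics


def standardise(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def oversample(xs, ys, seed=0):
    """Randomly duplicate minority-class values until both classes match."""
    rng = random.Random(seed)
    counts = {label: ys.count(label) for label in set(ys)}
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    pool = [x for x, label in zip(xs, ys) if label == minority]
    extra = [rng.choice(pool) for _ in range(deficit)]
    return xs + extra, ys + [minority] * deficit


def design_one(xs, ys):
    """Design I: scale the original data, then balance."""
    return oversample(standardise(xs), ys)


def design_two(xs, ys):
    """Design II: balance first, then scale the balanced data."""
    xs_bal, ys_bal = oversample(xs, ys)
    return standardise(xs_bal), ys_bal


# Hypothetical single feature: six majority values and two minority values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 12.0]
ys = [0, 0, 0, 0, 0, 0, 1, 1]
x1, _ = design_one(xs, ys)
x2, _ = design_two(xs, ys)
print(round(statistics.fmean(x1), 3), round(statistics.fmean(x2), 3))
```

In Design I the scaling statistics come from the imbalanced data, so the balanced feature is no longer centred at zero; in Design II the mean and standard deviation are computed after balancing, so the data passed to the model are centred and have unit variance.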

The clinical features used for scaling and balancing included variables such as age, blood pressure, cholesterol levels, and ejection fraction. These features were selected based on their clinical relevance to cardiovascular health.

Model evaluation

For model evaluation, the data were randomly split into an 80% training set and a 20% validation set. The validation sets were used to assess the performance of various preprocessing and machine learning combinations. Evaluation metrics including accuracy, precision, recall, and F1-score were calculated, as defined in Table 4. A 5-fold cross-validation (5-fold CV) procedure was conducted, where in each fold, 80% of the training data was used for model training, and the remaining 20% was used for validation.^37,38 Bayesian Optimisation was employed to efficiently tune the hyperparameters for XGBoost, GBDT, AdaBoost, Random Forest, KNN, and RaSE. After identifying the optimal hyperparameters, the final model was retrained using the entire training dataset (the original 80%) and used for final predictions.
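The 5-fold procedure can be sketched as an index generator. The function name and the shuffling seed are hypothetical; the sample count of 239 is an assumption derived from 299 heart failure records minus the 60 validation records reported in the classification reports.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds, so that
    every sample is used for validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train_idx, val_idx in k_fold_indices(239, k=5):
    print(len(train_idx), len(val_idx))
```

Each candidate hyperparameter setting would be scored by averaging the validation metric over the five folds before the final model is refit on all training indices.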

To assess each model’s performance, we plotted ROC curves, which represent the True Positive Rate against the False Positive Rate, with the Area Under the Curve (AUC) used to measure discriminative ability. Additionally, Precision-Recall (PR) curves were analysed to examine the trade-off between Precision and Recall, which is especially useful for imbalanced datasets. The formulas of the four evaluation metrics are given in Table 4.

Table 4.

Formulas of the four evaluation metrics.

Approach	Algorithm
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$
Precision	$\frac{TP}{TP + FP}$
Recall	$\frac{TP}{TP + FN}$
F1-score	$\frac{2TP}{2TP + FP + FN}$

Result

Study design

In this study, we developed two distinct data-preprocessing workflows to evaluate the impact of preprocessing sequences on model performance. Table 5 displays the optimal results obtained from training models on the Cleveland heart disease dataset (balanced dataset), while Table 6 highlights the performance metrics on the heart failure dataset (imbalanced dataset). It is apparent that the accuracy is consistent across both designs for the balanced dataset, with each achieving a 100% accuracy rate. However, when applied to the imbalanced dataset, Design II, which incorporates balancing prior to scaling, demonstrates superior accuracy (95%).

Table 5.

Cleveland heart disease dataset (Balanced dataset).

Design I				Design II
Scale	Balance	ML	Acc.	Balance	Scale	ML	Acc.
StandardScaler	SMOTE	XGBoost	100%	SMOTE	RobustScaler	XGBoost	100%
StandardScaler	ADASYN	XGBoost	100%	SMOTE	MaxAbsScaler	XGBoost	100%
				SMOTE	MinMaxScaler	XGBoost	100%
				SMOTE	StandardScaler	XGBoost	100%

Table 6.

Heart failure dataset (Imbalanced dataset).

Design I				Design II
Scale	Balance	ML	Acc	Balance	Scale	ML	Acc.
StandardScaler	SMOTE	XGBoost	93.33%	SMOTE	RobustScaler	XGBoost	95%
StandardScaler	ADASYN	XGBoost	93.33%	SMOTE	MaxAbsScaler	XGBoost	95%
				SMOTE	MinMaxScaler	XGBoost	95%
				SMOTE	StandardScaler	XGBoost	95%

The importance of the order of data-preprocessing approaches

Further evaluation of additional performance metrics on the imbalanced heart failure dataset reveals that the combination of StandardScaler with SMOTE is optimal. Table 7 indicates that Design II outperforms Design I across several key metrics. Design II achieves a higher accuracy of 95% compared to Design I’s 93.33%. Moreover, Design II exhibits a superior F1-score for the healthy class, with a score of 96.55%, whereas Design I’s score is 94.59%. Although Design II’s F1-score for the sick class is slightly lower at 90.91% compared to Design I’s 91.30%, it remains competitive. The macro-average F1-score for Design II is 93.73%, which is higher than Design I’s 92.95%. Additionally, the weighted-average F1-score for Design II is 94.95%, surpassing Design I’s 93.44%. Overall, Design II demonstrates enhanced performance across critical metrics, establishing it as the preferred approach.

Table 7.

Comparison of highest result (Balance: SMOTE, Scaler: StandardScaler).

Design I on heart failure dataset					Design II on heart failure dataset
Classification report:					Classification report:
	Precision	Recall	F1-score	support	Precision	Recall	F1-score	support
0 (Healthy)	1.0000	0.8974	0.9459	39	0.9545	0.9767	0.9655	43
1 (Sick)	0.8400	1.0000	0.9130	21	0.9375	0.8824	0.9091	17
accuracy			0.9333	60			0.9500	60
Macro avg	0.9200	0.9487	0.9295	60	0.9460	0.9295	0.9373	60
Weighted avg	0.9440	0.9333	0.9344	60	0.9497	0.9500	0.9495	60

Macro average: the unweighted mean of the per-label scores. Weighted average: the support-weighted mean of the per-label scores.
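As a worked check of the Table 4 formulas and of the two averages, the sketch below recomputes the Design II column of Table 7. The confusion-matrix counts (15 true positives, 1 false positive, 2 false negatives, 42 true negatives, with "sick" as the positive class) are not stated in the study; they are inferred here from the reported precision, recall, and support values and should be read as an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from counts, as defined in Table 4."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Inferred counts for Design II (assumption, see lead-in).
tp, fp, fn, tn = 15, 1, 2, 42

sick = precision_recall_f1(tp, fp, fn)
# For the healthy class the roles of the two error types are swapped.
healthy = precision_recall_f1(tn, fn, fp)

support_healthy, support_sick = tn + fp, tp + fn  # 43 and 17
total = support_healthy + support_sick

accuracy = (tp + tn) / total
macro_f1 = (healthy[2] + sick[2]) / 2
weighted_f1 = (healthy[2] * support_healthy + sick[2] * support_sick) / total

print(round(accuracy, 4), round(macro_f1, 4), round(weighted_f1, 4))
# 0.95 0.9373 0.9495
```

The recomputed values agree with the Design II column of Table 7 to four decimal places, which supports the inferred counts.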

Evaluation of machine learning process

Figures 2 and 3 present the analysis results from the balanced and imbalanced datasets, respectively. The data clearly indicate that XGBoost consistently outperforms other machine learning models, including Random Forest, RaSE, and GBDT. In the balanced dataset, although RaSE exhibits strong performance, XGBoost demonstrates robustness across all performance metrics, underscoring its adaptability. In the context of the imbalanced dataset, XGBoost’s superiority is even more pronounced, particularly in addressing the challenges associated with predicting minority classes. This evidence supports XGBoost as the most effective and reliable machine learning method for both balanced and imbalanced datasets, establishing it as the preferred choice for predictive modelling across varied data scenarios.

Figure 2.

ML models applied in the top 3 best results. XGBoost and RaSE have a significant presence among the highest-performing models.

Figure 3.

ML models applied in the top 3 best results. XGBoost and Random Forest have a substantial presence among the highest-performing models.

ROC & PR curves on highest result of heart failure dataset

Figures 4 and 5 provide a detailed evaluation of six machine learning models (AdaBoost, XGBoost, GBDT, RaSE, KNN, and Random Forest) using ROC and PR curves. In the ROC analysis, XGBoost and Random Forest demonstrate superior performance with AUC values of 0.9535 and 0.9590, respectively, indicating strong discriminative power, while KNN, with the lowest AUC of 0.7770, shows the weakest performance. In the PR analysis, XGBoost again emerges as the top performer with an average precision (AP) of 0.9429, followed by Random Forest at 0.9059, while KNN has the lowest AP of 0.5775, indicating difficulty in maintaining a favorable precision-recall trade-off. This analysis highlights XGBoost and Random Forest as the most effective models.

Figure 4.

ROC curve of the highest result. XGBoost and RF achieve the highest AUC values, signifying better predictive accuracy.

Figure 5.

PR curve of the highest result. XGBoost and RF exhibit the highest AP, indicating superior performance in distinguishing between the classes.
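For readers unfamiliar with the AUC statistic, the sketch below computes it through its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the scores are hypothetical and are not taken from the study's models.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (ties count as half a win)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for three sick and three healthy cases.
scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 0]
print(round(roc_auc(scores, labels), 4))  # 0.8889
```

Here eight of the nine positive-negative pairs are ranked correctly, giving an AUC of 8/9.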

Predictive device development

Figure 6 illustrates the design and functionality of our detection device. Leveraging the results of our training, we developed this device to enable users to input their basic information and personal health data. The device then provides a preliminary diagnosis based on the entered details.

Figure 6.

Content of the predictive device.

Discussion

Key findings and interpretation

Our study compares the performance of balanced and imbalanced datasets across different sequences of data preprocessing and concludes that Design II—applying data balancing before data scaling—is the preferable approach, particularly for imbalanced datasets. Table 8 provides a comparative analysis of scaling-only, balancing-only, and combined approaches, highlighting the significant role of dataset balancing in processing imbalanced data. Notably, dataset balancing alone can sometimes yield performance equivalent to that achieved through both preprocessing techniques.

Table 8.

Comparison of using pre-processing approaches on imbalanced data.

Model	Preprocessing	Accuracy	Precision	Recall	F1-score
A	Only scaling	93.33%	93.33%	93.33%	93.33%
	Only balancing	95.00%	94.97%	95.00%	94.95%
	Both	95.00%	94.97%	95.00%	94.95%
B	Only scaling	35.00%	12.25%	35.00%	18.15%
	Only balancing	95.00%	94.97%	95.00%	94.95%
	Both	95.00%	94.97%	95.00%	94.95%

Model A: Utilizes SMOTE for oversampling, StandardScaler for feature scaling, and XGBoost as the classifier. Model B: Employs ADASYN for oversampling, StandardScaler for feature scaling, and XGBoost as the classifier.

Feature importance and clinical relevance

To assess the contribution of individual features to model predictions and their impact on heart disease and heart failure outcomes, we conducted a SHAP (SHapley Additive exPlanations) analysis. Figures 7 and 8 illustrate how various features influence model predictions. In both plots, the “sex” feature indicates that males are generally associated with a higher predicted risk, whereas females are linked to lower risk. This effect is more pronounced in our heart disease models than in our heart failure models, suggesting that while gender consistently plays a role in heart-related conditions, its influence varies in magnitude depending on the specific disease context.

Figure 7.

Results of SHAP analysis for heart disease risk prediction (a) summary plot of SHAP analysis, displaying the distribution of how the value of each risk factor influences the model output across various test instances (b) force plot illustrating the explanation for a low-risk prediction (c) force plot illustrating the explanation for a high-risk prediction.

Figure 8.

Results of SHAP analysis for heart failure risk prediction (a) summary plot of SHAP analysis, displaying the distribution of how the value of each risk factor influences the model output across various test instances (b) force plot illustrating the explanation for a low-risk prediction (c) force plot illustrating the explanation for a high-risk prediction.

Feature importance analysis: Impact of sex variable removal

Table 9 examines the impact of removing the “sex” feature from the heart disease model, demonstrating that while accuracy remains at 100%, the number of models achieving perfect accuracy declines. In contrast, removing “sex” from the heart failure dataset reduces accuracy from 95% to 93%, indicating a lesser but still notable influence. These findings underscore the importance of incorporating sex as a predictive variable, as its removal leads to either reduced accuracy or a decrease in high-performing models.

Table 9.

Comparison of results after removing a specific feature from each dataset.

Feature selection	Heart disease dataset		Heart failure dataset
Feature selection	Highest accuracy	Quantity	Highest accuracy	Quantity
Original	100%	38	95%	10
Remove “sex”	100%	12	93%	17

Comparison with existing literature

A broader comparison with existing literature further contextualises our findings. Table 10 summarises recent studies on the Cleveland heart disease and heart failure datasets. Our approach—integrating SMOTE for class balancing, StandardScaler for feature normalisation, and XGBoost for classification—achieves superior predictive performance. Specifically, our model attains 100% accuracy in the Cleveland heart disease dataset, outperforming previous methods such as the Support Vector Machine (SVM) approach by Ahmad et al.³⁹ (98.47%) and the multilayer perceptron (MLP) method by Veisi et al. (94.60%).⁴⁰ Similarly, for the heart failure dataset, our method achieves 95% accuracy, surpassing high-performing alternatives such as the Extra Tree Classifier by Ishaq et al. (92.62%)⁴¹ and the Rotation Forest combined with Logistic Model Tree (LMT) by Plati et al. (91.23%).⁴²

Table 10.

Recent progress on Cleveland heart disease and heart failure dataset.

Cleveland heart disease dataset			Heart failure dataset
Author	Method	Accuracy	Author	Method	Accuracy
Jindal, et al.⁴³ (2021)	KNN, Logistic Regression	88.52%	Muntasir, et al.⁴⁵ (2022)	SMOTE-ENN	90.00%
Sahoo, et al.⁴⁴ (2022)	Random Forest	90.16%	Plati, et al.⁴² (2021)	Rotation Forest, Logistic Model Tree	91.23%
Veisi, et al.⁴⁰ (2021)	Multilayer perceptron (MLP)	94.60%	Ishaq, et al.⁴¹ (2021)	Extra Tree Classifier	92.62%
Ahmad, et al.³⁹ (2023)	Support Vector Machine (SVM)	98.47%	Sutradhar, et al.⁴⁶ (2023)	BOOST, CBCEC	93.67%
Our study	SMOTE, StandardScaler, XGBoost	100%	Our study	SMOTE, StandardScaler, XGBoost	95.00%

These results highlight the robustness of our approach and raise important considerations for real-world clinical applications. The superior performance of our method is likely attributable to the synergy of SMOTE’s ability to mitigate class imbalance, StandardScaler’s role in feature normalisation, and the predictive power of XGBoost. Notably, while SVM and MLP models have been shown to perform well, they may be more sensitive to data imbalance and lack the adaptability of XGBoost’s boosting mechanism. The clinical implications of these findings are substantial, as enhanced predictive accuracy in heart disease and heart failure models could facilitate earlier diagnosis, more targeted interventions, and improved patient outcomes.

Furthermore, our results underscore the importance of model interpretability in clinical decision-making. The SHAP analysis not only validates the significance of key features such as sex but also offers insights into how predictive models align with existing medical knowledge. Given that sex differences in cardiovascular disease are well-documented in clinical research, our findings reinforce the necessity of including sex as a feature to improve the reliability of heart disease and heart failure prediction models.

Implications for digital health innovation and clinical decision support

Our research shows that improved preprocessing techniques enhance the reliability of AI-based cardiovascular risk assessment tools, advancing digital healthcare delivery. The superior performance achieved through our Design II approach (balancing before scaling) provides a practical framework for developing more accurate clinical decision support systems that can assist healthcare providers in early diagnosis and risk stratification. The integration of machine learning models, particularly XGBoost, with optimized preprocessing strategies offers a pathway for implementing real-time cardiovascular risk assessment tools in clinical settings. This aligns with the contemporary shift toward patient-centered digital health solutions that leverage AI to provide personalized care recommendations based on individual risk profiles.

Furthermore, our methodology addresses the critical need for interpretable AI in healthcare by incorporating SHAP analysis, ensuring that clinical decisions supported by our models remain transparent and clinically meaningful. The practical implications extend to the development of mobile health applications and telemedicine platforms, where accurate and rapid cardiovascular risk assessment is essential. Our research provides the methodological foundation for deploying robust prediction models in diverse digital health environments, from hospital-based clinical decision support systems to consumer-facing health monitoring applications.
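
The Design II ordering discussed in this section (balancing before scaling) can be sketched in a few lines. This is a minimal illustration under stated assumptions: `smote_like` is a simplified SMOTE-style interpolation written for this example, not the imbalanced-learn implementation, and `standardise` is a plain z-score that mirrors what StandardScaler computes. The function names, parameters, and toy values are hypothetical, and no classifier is trained here.

```python
import random
from math import dist
from statistics import mean, pstdev

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def standardise(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical imbalanced toy data: two numeric features per sample.
majority = [(60.0, 1.1), (65.0, 1.3), (70.0, 0.9),
            (55.0, 1.0), (62.0, 1.2), (68.0, 1.4)]
minority = [(80.0, 2.5), (85.0, 3.0), (78.0, 2.2)]

# Design II: balance first, then scale the balanced data.
balanced_minority = minority + smote_like(minority, len(majority) - len(minority))
X = majority + balanced_minority
y = [0] * len(majority) + [1] * len(balanced_minority)
X_scaled = standardise(X)

print(len(X), y.count(0), y.count(1))
```

Design I would reverse the two calls, applying `standardise` to the original data and then `smote_like` to the scaled minority class. The two orderings generally yield different feature distributions because the scaling statistics are computed on different samples.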

Conclusion

This study contributes to health informatics by demonstrating that the preprocessing sequence can critically affect cardiovascular disease prediction accuracy. On the heart failure dataset, Design II (balancing before scaling) achieved higher accuracy than standardising before balancing (95% vs 93.33%), establishing a methodological framework applicable to clinical decision support systems. The consistent reliability of XGBoost across both balanced and imbalanced datasets, combined with SHAP-based interpretability, provides healthcare informatics professionals with a validated approach for developing reliable AI-driven diagnostic tools.

Future research should explore the generalisability of these findings across other datasets and machine learning models. Additionally, the development of more sophisticated preprocessing techniques tailored to specific data characteristics may further enhance predictive performance in complex data environments.

Footnotes

Acknowledgements

We thank all individuals who participated in this study.

ORCID iD

Ping-Nan Chen

Author contributions

PNC, CWT and LCS performed the experiments and wrote the manuscript. KFL and PNC provided the concept and experimental design of the study and reviewed the paper prior to submission. All authors discussed the results, analyzed the data, and commented on the manuscript. All authors have read and approved the submitted version.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study design and the costs of data collection, analysis, interpretation, and writing were funded by the Ministry of National Defense-Medical Affairs Bureau (MND-MAB-D-113143).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

All data generated or analyzed during this study are included in this published article. They are also available at https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data and .

References

1. Soni, Ansari, Sharma, et al. Predictive data mining for medical diagnosis: an overview of heart disease prediction. Int J Comput Appl 2011; 17: 43–48.
2. Brickner, Hillis, Lange. Congenital heart disease in adults. First of two parts. N Engl J Med 2000; 342: 256–263.
3. Hoffman, Kaplan. The incidence of congenital heart disease. J Am Coll Cardiol 2002; 39: 1890–1900.
4. Palaniappan, Awang. Intelligent heart disease prediction system using data mining techniques. In: Proceedings ACS/IEEE international conference on computer systems and applications 2008, pp. 108–115.
5. Shah, Patel, Bharti. Heart disease prediction using machine learning techniques. SN Comput Sci 2020; 1: 345.
6. Mennel, Symonowicz, Wachter, et al. Ultrafast machine vision with 2D material neural network image sensors. Nature 2020; 579: 62–66.
7. Gerretzen, Szymańska, Jansen, et al. Simple and effective way for data preprocessing selection based on design of experiments. Anal Chem 2015; 87: 12096–12103.
8. Almadhor, Sattar, Al Hejaili, et al. An efficient computer vision-based approach for acute lymphoblastic leukemia prediction. Front Comput Neurosci 2022; 16: 1083649.
9. Pires, Hussain, M Garcia, et al. Homogeneous data normalization and deep learning: a case study in human activity classification. Future Internet 2020; 12: 194.
10. Song, Zhang, et al. Research on time series characteristics of the gas drainage evaluation index based on lasso regression. Sci Rep 2021; 11: 20593.
11. Yang, Liang, et al. Self-paced balance learning for clinical skin disease recognition. IEEE Trans Neural Netw Learn Syst 2019; 31: 2832–2846.
12. Trabassi, Castiglia, Bini, et al. Optimizing rare disease gait classification through data balancing and generative AI: insights from hereditary cerebellar ataxia. Sensors 2024; 24: 3613.
13. Ambesange, Vijayalaxmi, Uppin, et al. Optimizing liver disease prediction with random forest by various data balancing techniques. In: Proceedings of IEEE international conference on cloud computing in emerging markets (CCEM), Bengaluru, India: IEEE, 2020, pp. 98–102.
14. Horn, Perona. The devil is in the tails: fine-grained classification in the wild. arXiv 2017:1709.01450.
15. Chawla, Bowyer, Hall, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–357.
16. Han, Wang, Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of advances in intelligent computing ICIC. Berlin, Heidelberg: Springer, 2005, pp. 878–887.
17. Nguyen, Cooper, Kamei. Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradig 2011; 3: 4–21.
18. Bai, Garcia, et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of international joint conference on neural networks. Hong Kong: IEEE, 2008, pp. 1322–1328.
19. Jindaluang, Chouvatut, Kantabutra. Under-sampling by algorithm with performance guaranteed for class-imbalance problem. In: Proceedings of international computer science and engineering conference (ICSEC). Khon Kaen, Thailand: IEEE, 2014, pp. 215–221.
20. Tahir, Kittler, Yan. Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognit 2012; 45: 3738–3750.
21. Lin, Tsai, et al. Clustering-based undersampling in class-imbalanced data. Inf Sci 2017; 409–410: 17–26.
22. Yen, Lee. Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. Lect Notes Control Inf Sci 2006; 344: 731.
23. Polikar. Ensemble learning. In: Ensemble Machine Learning: Methods and Applications. New York, NY, USA: Springer US, 2012, pp. 1–34.
24. Budholiya, Shrivastava, Sharma. An optimized XGBoost based diagnostic system for effective prediction of heart disease. J King Saud Univ Comput Inf Sci 2022; 34: 4514–4523.
25. Pal, Parija. Prediction of heart diseases using random forest. J Phys Conf Ser 2021; 1817: 012009.
26. Dong, Cao, et al. A survey on ensemble learning. Front Comput Sci 2020; 14: 241–258.
27. Dong, Huang, Lehane, et al. XGBoost algorithm-based prediction of concrete electrical resistivity for structural health monitoring. Autom Constr 2020; 114: 103155.
28. Ramraj, Uzir, Sunil, et al. Experimenting XGBoost algorithm for prediction and classification of different datasets. Int J Control Theory Appl 2016; 9: 651–662.
29. Hyperparameter tuning of GDBT models for prediction of heart disease. In: Proceedings of international conference on electronic information engineering and computer science (EIECS). Changchun, China: SPIE, 2022, pp. 686–691.
30. Biau, Scornet. A random forest guided tour. Test 2016; 25: 197–227.
31. Rigatti. Random forest. J Insur Med 2017; 47: 31–39.
32. Cao, Miao, Liu, et al. Advance and prospects of AdaBoost algorithm. Acta Autom Sin 2013; 39: 745–758.
33. Zhang, Zhou. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognit 2007; 40: 2038–2048.
34. Tian, Feng. RaSE: random subspace ensemble classification. J Mach Learn Res 2021; 22: 1–93.
35. Ahmad, Munir, Bhatti, et al. Survival analysis of heart failure patients: a case study. PLoS One 2017; 12: e0181001.
36. Detrano, Janosi, Steinbrunn, et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol 1989; 64: 304–310.
37. Wang, Yan, Zhang, et al. Machine learning-based prediction of gastroparesis risk following complete mesocolic excision. Discov Oncol 2024; 15: 483.
38. Song, Lin, et al. Developing a prognostic model for primary biliary cholangitis based on a random survival forest model. Int J Med Sci 2024; 21: 61–69.
39. Ahmad, Polat. Prediction of heart disease based on machine learning using jellyfish optimization algorithm. Diagnostics 2023; 13: 2392.
40. Veisi, Ghaedsharaf, Ebrahimi. Improving the performance of machine learning algorithms for heart disease diagnosis by optimizing data and features. Soft Comput J 2021; 8: 70–85.
41. Ishaq, Sadiq, Umer, et al. Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques. IEEE Access 2021; 9: 39707–39716.
42. Plati, Tripoliti, Bechlioulis, et al. A machine learning approach for chronic heart failure diagnosis. Diagnostics 2021; 11: 1863.
43. Jindal, Agrawal, Khera, et al. Heart disease prediction using machine learning algorithms. IOP Conf Ser Mater Sci Eng 2021; 1022: 012072.
44. Sahoo, Kanike, Das, et al. Machine learning-based heart disease prediction: a study for home personalized care. In: Proceedings of international workshop on machine learning for signal processing (MLSP). Xi'an, China: IEEE, 2022, pp. 01–06.
45. Nishat, Faisal, Ratyl, et al. A comprehensive investigation of machine learning classifiers with SMOTE-ENN. Sci Program 2022; 2022: 3649406.
46. Sutradhar, Al Rafi, Shamrat FMJM, et al. BOO-ST and CBCEC: two novel hybrid machine learning methods aim to reduce the mortality of heart failure patients. Sci Rep 2023; 13: 22874.

Enhancing coronary heart disease diagnosis: Comparative analysis of data pre-processing techniques and machine learning models using clinical medical records
