Sage Journals: Discover world-class research

Abstract

Objective

Hypothyroidism, hyperthyroidism, thyroid nodules, and other thyroid disorders affect millions of people worldwide and, if left untreated, may lead to serious health issues. An accurate and timely diagnosis is crucial for proper management and medication. This study uses a dataset from the UCI machine-learning repository to put forward a comprehensive machine-learning approach for diagnosing thyroid disorders.

Methods

The proposed methodology involved exploratory data analysis and preparation, which included handling missing values, encoding categorical values, and selecting features. The synthetic minority over-sampling technique is used to overcome the problem of class imbalance. Five advanced machine learning (ML) algorithms (logistic regression, support vector machine, decision tree, random forest, and gradient boosting) are employed to develop predictive models. Further, an innovative stacking ensemble method is proposed with the help of four applied models. The results from these models are aggregated, and logistic regression serves as a meta-learner.

Results

A 10-fold cross-validation technique is used to ensure robust model evaluation and reduce the risk of overfitting: each subset serves once as the test set while the models are trained on the remaining subsets. The ensemble model attained an accuracy of 99.86%, outperforming individual models.

Conclusion

These results reveal the capability of ML, especially ensemble approaches, to enhance accurate and timely diagnosis of thyroid disorders.

Keywords

Machine learning, thyroid disorders, predictive modeling, ensemble method, cross-validation, synthetic minority over-sampling technique

Introduction

Thyroid disorders are a major health problem worldwide, and their symptoms often go unsuspected.¹ These disorders, which result from malfunctioning of the thyroid gland, disturb the metabolism of the body and cause a number of health problems. The thyroid gland is a small, butterfly-shaped gland located at the base of the throat. It produces hormones that regulate metabolic, growth, and development processes. Thyroid disorders primarily manifest in three forms: hypothyroidism, hyperthyroidism, and thyroid nodules.² Hypothyroidism, associated with low hormone production, manifests as fatigue, depression, weight gain, and poor memory. Hyperthyroidism, characterized by overproduction of hormones, is associated with nervousness, weight loss, and an increased heart rate. Thyroid nodules are lumps that develop in the thyroid gland and may be malignant or benign. Thyroid cancer has the fastest-rising incidence rate of any cancer among women. Thyroid disorders are relatively prevalent among women: it is estimated that one in eight women will develop a thyroid disorder at some point in their lifetime. This higher prevalence has several causes, such as hormonal changes during pregnancy and the postpartum period, which make women more prone to thyroid disorders. Despite the significant impact of thyroid diseases on health, diagnosis is made in only half of the cases, as symptoms are often confused with other conditions.

Timely and accurate thyroid diagnosis is crucial for effective management and treatment. However, the symptoms of thyroid disorders are numerous, varied, and nonspecific, so misdiagnosis or long delays in detection can occur, which underlines the importance of better diagnostics. Thyroid disorders are challenging to diagnose and involve various clinical evaluations, biochemical tests, and imaging techniques.³ Conventional diagnostic procedures are time-consuming and are normally performed manually, which makes them prone to error. Over the last few years, researchers have investigated whether machine learning (ML) algorithms could enhance the accuracy and efficiency of thyroid condition diagnosis. Artificial intelligence uses algorithms to analyze vast datasets and uncover patterns and correlations, offering a promising alternative to traditional diagnostic methods.

Earlier work has addressed how different ML algorithms may be applicable when diagnosing thyroid disorder. For instance, Gyanendra Chaubey (2021) utilized predictive analytics with ML models such as k-nearest neighbors (KNNs), decision trees (DTs), and logistic regression (LR) to diagnose thyroid disorders, highlighting the potential of these algorithms in medical diagnostics.⁴ In the same way, Ankita Tyagi (2018) focused on predicting thyroid disease using KNNs, support vector machine (SVM), and DTs, demonstrating the value of ML in improving diagnostic accuracy.⁵

Previously, Yadav & Pal et al.⁶ explored the use of ensemble data mining approaches, including Boosting, Bagging, Stacking, and Voting, achieving a high accuracy of 98.80%. Mehk et al.⁷ investigated the use of RF, SVM, and LR. Dhamodaran et al.⁸ investigated the application of ML algorithms like KNN, SVM, and Naive Bayes for the diagnosis of hyper- and hypothyroidism. These studies highlight the potential of ML in improving diagnostic accuracy and efficiency.

Chaganti et al.¹⁰ improved thyroid disease prediction by using advanced feature engineering techniques, achieving an accuracy of 0.99 with an RF classifier. Ahmad et al.⁹ developed a hybrid decision support system for the diagnosis of thyroid conditions. With a 99.1% classification accuracy, their method proved to be efficient in both feature reduction and diagnostic precision. The study by George Obaido et al.¹¹ proposes a method for the identification of thyroid disease that combines a stacking ensemble of many machine-learning models with filter-based feature selection. By mitigating data imbalance and reducing model biases, this ensemble approach boosts predictive accuracy and resilience, demonstrating the potential of ensemble methods over single-model approaches. Wu S. et al.¹² proposed a DT ensemble method for detecting thyroid disease, leveraging the UCI dataset.

The study by Haneet Kour et al.¹³ introduces a bagged ensemble model using linear discriminant analysis (LDA) combined with the synthetic minority over-sampling technique (SMOTE) for thyroid disorder prediction. The model employs a majority voting approach, integrating five LDA models trained on bootstrap samples augmented with SMOTE. Evaluated on primary (1092 records) and secondary (7200 records) datasets, the model achieved accuracies of 85.45% and 82.71%, respectively, improving upon the classic LDA accuracies of 69.55% and 75.28%. The approach demonstrates enhanced efficiency in diagnosing thyroid disorders compared to conventional ML classifiers.

The study by Ritesh Jha et al.¹⁴ focuses on enhancing thyroid disease prediction accuracy by applying dimension reduction techniques and data augmentation. By classifying the reduced-dimension data with a deep neural network model, the study achieved a high accuracy of 99.95%. The proposed two-stage approach is reported to outperform existing techniques, demonstrating the effectiveness of these methods in improving disease prediction accuracy. Stacking uses a meta-learner to integrate predictions from several base learners, which frequently leads to better performance. The study by Abbad et al.¹⁵ evaluates various ML classifiers for diagnosing thyroid disease, including KNN, DT, Naive Bayes, SVM, and LR. It uses a unique dataset with additional features like pulse rate, BMI, and blood pressure.

Research gap

Despite these advancements, several challenges remain. Imbalanced datasets, where the proportion of patients with thyroid disease is significantly lower than that of patients without, can lead to biased models. Techniques such as resampling and synthetic data generation, like SMOTE, have been employed to address this issue.

Missing values in medical datasets pose another significant challenge. Researchers have used various imputation techniques to handle incomplete data, such as mode, median, mean, and model-based imputation.¹⁶ Feature engineering, the process of deriving new features from existing ones, has also been crucial in improving model performance.

Finally, the interpretability of complex models, such as neural networks and ensemble methods, remains a concern: while these models provide high accuracy, their decision-making processes are often difficult to understand.¹⁵

Contributions

This study employs five ML algorithms and implements a Stacking Ensemble method to address these challenges. The ensemble approach combines the strengths of individual models with LR as a meta-learner to enhance predictive accuracy. Our methodology includes data preprocessing, feature engineering, and SMOTE to handle class imbalance, ensuring unbiased predictions. We use k-fold cross-validation to improve model robustness and generalizability. This study highlights the effectiveness of ML in thyroid disorder detection and underscores the benefits of ensemble methods and rigorous validation in medical diagnostics.

Among this research paper’s primary contributions are:

The implementation and comparative analysis of several ML models.

The application of an innovative Stacking ensemble method with LR as a meta-learner, enhancing diagnostic accuracy.

The application of SMOTE for handling class imbalance, improving the model’s capability to predict minority classes.

A robust validation framework using 10-fold cross-validation, ensuring the generalizability of the results.

Methods

This study used a clinical dataset comprising 3772 patient records to develop a model for thyroid disorder detection, as shown in Figure 1. The dataset includes various attributes, such as demographic information and laboratory test results. To prepare the data for analysis, the preprocessing steps include encoding categorical variables, normalizing numerical features, and addressing missing values. SMOTE is used to address class imbalance, ensuring a fair representation of classes and improving the ability of the model to predict minority class outcomes. Feature selection is conducted using recursive feature elimination (RFE), which identifies the most relevant features by iteratively eliminating less significant ones. This approach optimizes model performance and reduces complexity.

Figure 1.

Methodology process diagram for the study.

In total, 80% of the dataset is allocated for training, while 20% is allocated for testing. K-fold cross-validation, specifically a 10-fold strategy, is employed during model training to ensure robust evaluation and minimize overfitting. This approach splits the data into ten subsets, training the model on nine and validating it on the one remaining subset, repeating this process ten times. Five advanced ML models (LR, gradient boosting (GB), DT, RF, and SVM) are developed and fine-tuned using hyperparameter optimization techniques like Grid Search. These models are subsequently compared according to their prediction accuracy and efficiency. A stacking ensemble method is used to integrate the strengths of multiple models, combining their predictions into a final, more accurate model. This ensemble approach leverages diverse model capabilities, enhancing overall prediction performance.

Dataset description

The dataset used in this study comes from the UCI ML Repository, a widely recognized and publicly available repository for ML datasets.¹⁷ The study data were collected at the Garavan Institute and first published in 2019. The dataset comprises 3772 instances and 30 attributes, including demographic information (e.g., age, sex), laboratory test results (e.g., TSH, T3, TT4), thyroid medication status, and diagnostic classes (e.g., negative, primary hypothyroid). The target attribute (Class) signifies the presence or absence of thyroid disorders, with subcategories like “compensated hypothyroid” and “secondary hypothyroid.”

Ethical considerations

The UCI dataset is publicly available and anonymized, ensuring no personally identifiable information is included. As such, ethical approval and consent are not required for its use in this study. However, all experiments were conducted in compliance with ethical guidelines for the use of publicly available datasets.

Data preprocessing

The study’s dataset is obtained from the UCI ML Repository and contains various attributes relevant to thyroid function. Data preprocessing is an important step before the dataset is used in the ML pipeline. First, missing values are handled. Missing numeric values are replaced with the median of their column, which preserves the central tendency. Missing categorical values are replaced with the most frequently occurring value (mode) of their column. Next, label encoding is applied to convert the categorical variables into numerical values that ML algorithms can process. Finally, the numeric features are standardized so that each has a mean of zero and a standard deviation of one. This step keeps features with wider scales from disproportionately influencing the model’s outcomes, as shown in Figure 2.

Figure 2.

The steps involved in data preprocessing.
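
The imputation, encoding, and standardization steps described above can be sketched in plain Python. This is a minimal illustration rather than the study's code: the helper names and the toy column values are assumptions, and dividing by the population standard deviation is a choice made for this sketch.

```python
from statistics import mean, median, mode, pstdev

def impute_numeric(values):
    """Replace missing numeric entries (None) with the column median."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_categorical(values):
    """Replace missing categorical entries (None) with the column mode."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def label_encode(values):
    """Map each distinct label to an integer code (sorted label order)."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def standardize(values):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns, not values from the dataset.
tsh = [1.3, None, 4.1, 0.98, None, 2.0]
sex = ["F", "M", None, "F", "F", "M"]

tsh_filled = impute_numeric(tsh)
sex_filled = impute_categorical(sex)
sex_codes, sex_mapping = label_encode(sex_filled)
tsh_scaled = standardize(tsh_filled)

print(tsh_filled)
print(sex_codes, sex_mapping)
```

In a real pipeline the median, mode, mean, and standard deviation would normally be estimated on the training split only and then reused for the test split.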

Feature engineering

Feature engineering is undertaken to create new features or modify existing ones, enhancing the efficiency of ML models.

Addressing class imbalance

In order to balance the distribution of classes within the dataset, SMOTE is used to create synthetic samples for the minority class. This technique ensured that the classifier was not biased towards the majority class, facilitating a more accurate prediction of the minority class, as shown in Figure 3.

Figure 3.

The class distribution histograms before and after applying synthetic minority over-sampling technique (SMOTE).
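
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `smote_sketch`, the value of `k`, and the toy points are all hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority samples.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.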

Feature selection

RFE is employed for feature selection, using LR as the base model.¹⁸ RFE iteratively removes the least important features and selects the most relevant ones, optimizing the model’s predictive power by focusing on key attributes, as shown in Table 1.

Table 1.

The selected features description analysis.

Feature	Description
Age	The patient’s age
on_thyroxine	If thyroxine is being used by the patient
thyroidSurgery	History of thyroid surgery
TSH	Thyroid stimulating hormone
T3	Triiodothyronine
TT4measured	Total thyroxine measured indicator
TT4	Total thyroxine
T4Umeasured	Thyroxine uptake measured indicator
T4U	Thyroxine uptake
FTImeasured	Free thyroxine index measured indicator
FTI	Free thyroxine index
referral_source	Source of referral
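
A minimal sketch of RFE with an LR base model, using scikit-learn, is given here for illustration. The data are synthetic stand-ins, and the number of candidate features (29) and the target of 12 retained features are assumptions chosen to mirror the attribute count and the number of rows in Table 1; the study's exact settings are not stated.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 29 candidate features, 3 classes.
X, y = make_classification(
    n_samples=500, n_features=29, n_informative=8, n_redundant=4,
    n_classes=3, random_state=0,
)
X = StandardScaler().fit_transform(X)

# Drop one feature per iteration until 12 remain (assumed target count).
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=12,
    step=1,
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("elimination ranking (1 = kept):", selector.ranking_.tolist())
```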

Applied machine learning models

This section examines different ML techniques¹⁹ used for the prediction of thyroid disorders. In our study, we assess several advanced machine-learning models for diagnosing thyroid disorders. Each model is selected based on its suitability for handling imbalanced medical datasets, interpretability, generalization ability, and performance in previous studies on similar classification tasks.

LR, a linear model,²⁰ is used for its simplicity and interpretability. It predicts the probability of a binary outcome, making it suitable for classification problems like thyroid disorder detection. This model estimates the probability that a particular input belongs to a particular class (e.g., the presence or absence of thyroid disorder) using a logistic function. We selected LR because it provides a strong baseline for performance evaluation and offers insights into feature importance through model coefficients. It is especially useful when the relationship between attributes and the target variable is approximately linear.

To understand how different features influence model predictions, we examined the learned coefficients of LR, as shown in Figure 4. The feature query_hyperthyroid has the highest positive coefficient, indicating a strong influence in classifying cases. This aligns with clinical expectations, as query_hyperthyroid status is a critical indicator of thyroid disorders.

Figure 4.

Bar plot showing the coefficients of each feature in the logistic regression (LR).

GB is an ensemble technique that sequentially builds several weak learners, typically DTs.²¹ Each new model is trained to correct errors made by the previous models, resulting in a strong predictive model that handles complex data with non-linear relationships. We chose GB because it is known for high accuracy, robustness, and effectiveness in medical data classification. The ability to tune parameters such as tree depth, learning rate, and number of estimators allows further optimization of performance.

To gain insight into the predictive power of different features, we analyzed feature importance scores obtained from the GB model. As shown in Figure 5, the most influential features in predicting thyroid disorders are free thyroxine index (FTI), query hyperthyroid, and FTI measured. These features contribute the most to the model’s decisions. Additionally, thyroid-stimulating hormone (TSH) also plays a significant role, reinforcing its clinical relevance in thyroid disorder diagnosis. Other features such as thyroxine, TSH measured, and referral source have a lower but non-negligible impact on model predictions.

Figure 5.

Bar plot showing the feature importance in the gradient boosting (GB).

DT provides an interpretable and non-parametric approach to classification by formulating decisions based on input attribute values.²² It is particularly useful when transparency is required in model predictions, making it a valuable tool in medical diagnostics. The model can handle both numerical and categorical data while capturing non-linear relationships between features. However, DTs are prone to overfitting, which is managed using pruning techniques to enhance generalization.

Figure 6 presents the DT structure generated in our study. The root node (top-most decision point) is based on the on_thyroxine feature, indicating its importance in classifying thyroid disorders. This suggests that whether a patient is on thyroxine medication is a key determinant in the decision-making process.

Figure 6.

The structure of decision tree (DT) model.

RF is an ensemble method that combines multiple DTs to improve accuracy and robustness.²³ Each tree is trained on a random subset of the data and features, which helps reduce overfitting and enhances the model’s ability to generalize well on unseen data. We selected RF because of its capability to handle imbalanced data, provide feature importance scores, and deliver high accuracy in classification tasks. It is widely used in medical applications due to its stability and interpretability.

Figure 7 illustrates the feature importance scores derived from the RF model. The query_hyperthyroid feature is the most influential predictor, followed by FTI and TSH. These findings align with medical knowledge, as these biomarkers play a crucial role in diagnosing thyroid disorders.

Figure 7.

Bar plot showing feature importance in random forest (RF).

SVM is a robust classifier that determines the optimum hyperplane for classifying the data. It works especially well in situations when the decision boundary is nonlinear and in high-dimensional spaces. SVMs utilize kernel functions (e.g., linear, polynomial, and radial basis functions) to transform the input space into a higher-dimensional space where a linear separation is possible.²⁴ We included SVM in our study because it has been successfully applied in medical classification tasks and offers strong generalization capabilities.

Figure 8 illustrates the decision boundaries of the SVM model using 2D PCA-transformed data. The visualization demonstrates how SVM separates different classes by defining distinct regions in the feature space. The support vectors, which influence the decision boundary, are shown at the class margins.

Figure 8.

The decision boundary analysis of support vector machine (SVM).

Proposed stacking ensemble method

The stacking ensemble method is a powerful meta-learning technique that combines multiple base models to enhance predictive accuracy. Unlike bagging or boosting, which focus on aggregating similar models, stacking employs heterogeneous models and uses a meta-learner to make the final prediction. In this study, we employed five diverse ML algorithms (LR, GB, DT, RF, and SVM) as base models. Each base model is trained on the training dataset and generates predictions, which are then used as input features for the meta-learner. The meta-learner, implemented using LR, learns from these predictions and produces the final classification output.

The key advantage of stacking is that it leverages the strengths of multiple models while minimizing their weaknesses. For instance, tree-based models like RF and DTs capture complex patterns in the data, while LR provides robustness against overfitting. The combination of these models ensures a well-balanced predictive capability. The diversity of base models ensures that the ensemble captures a wide range of data patterns. For example, LR excels in modeling linear relationships, while GB and RF effectively handle non-linear interactions. This diversity allows the ensemble to generalize better across various data patterns inherent in thyroid datasets.

Additionally, the Stacking Ensemble, combined with SMOTE, effectively addresses class imbalance, a common issue in thyroid datasets where minority classes (e.g., rare thyroid conditions) are underrepresented. By leveraging the strengths of individual models, the ensemble improves predictions for minority classes, ensuring balanced and accurate results. The meta-learner (LR) plays a crucial role in optimizing the ensemble’s performance. It learns to combine the predictions of the base models, focusing on the most reliable outputs and reducing errors. This optimization process enhances the overall accuracy and robustness of the model.

Furthermore, thyroid datasets often contain noise due to measurement errors or variability in clinical data. The ensemble approach mitigates the impact of noise by averaging out errors from individual models, leading to more robust predictions. By combining these factors—diversity of base models, handling of class imbalance, meta-learner optimization, and robustness to noise—the Stacking Ensemble achieves state-of-the-art performance in thyroid disease prediction, outperforming individual models and other ensemble techniques.

A visual representation of our stacking ensemble architecture is provided in Figure 9. The base models generate intermediate predictions, which are fed into the meta-learner to refine the final classification. This hierarchical approach enables better generalization, particularly in imbalanced datasets such as those found in medical diagnostics.

Figure 9.

The introduced stacking ensemble workflow analysis.
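
The stacking workflow can be sketched with scikit-learn's `StackingClassifier`. This is an illustrative sketch, not the study's code: the data are synthetic, the internal `cv=5` setting is an assumption, and the base-model hyperparameters are borrowed from the grid-search results in Table 2 purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (assumption), split 80:20.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Five heterogeneous base models.
base_models = [
    ("lr", LogisticRegression(C=10, max_iter=1000)),
    ("gb", GradientBoostingClassifier(learning_rate=0.1, n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(max_depth=7, random_state=0)),
    ("rf", RandomForestClassifier(max_depth=7, n_estimators=50, random_state=0)),
    ("svm", SVC(C=10, kernel="rbf")),
]

# The LR meta-learner is fitted on out-of-fold predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for samples the base models have already seen, is what keeps the meta-learner from simply trusting the most overfitted base model.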

Models training and validation

To evaluate the predictive capabilities of different ML techniques, several algorithms are trained and assessed using K-fold cross-validation. The models included LR, GB, DT, RF, and SVM. The data is split into ten subsets as part of a 10-fold cross-validation technique. Each subset is used as a test set once, with the remaining subsets serving as the training set. This approach ensured that the model evaluation was robust and less prone to overfitting.
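
The fold construction can be sketched in plain Python as follows. The function name, the shuffling step, and the seed are assumptions made for this sketch; only the sample count (3772) and the number of folds (10) come from the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(3772, k=10))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each record appears in exactly one test fold, so every model is evaluated once on every record without ever being trained on it in that round.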

Hyperparameter tuning

The grid search approach is utilized for hyperparameter tuning to determine the optimal parameters for each model, improving their performance. This process tests various hyperparameter combinations, selecting those that yield the best results. The training set (80%) is used to train the models, and the testing set (20%) is used for evaluation. Table 2 contains the best hyperparameters and scores from grid search.

Table 2.

Best hyperparameters and scores from grid search.

Model	Best score	Best parameters
LR	0.98498	{'C': 10}
GB	0.998599	{'learning_rate': 0.1, 'n_estimators': 50}
DT	0.998727	{'max_depth': 7}
RF	0.997453	{'max_depth': 7, 'n_estimators': 50}
SVM	0.992097	{'C': 10, 'kernel': 'rbf'}

SVM: support vector machine; RF: random forest; GB: gradient boosting; LR: logistic regression.
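
Grid search itself is an exhaustive loop over every combination of candidate values. The sketch below shows that loop in plain Python. The candidate grid is hypothetical, and `toy_score` is a made-up stand-in for a mean cross-validated accuracy; it does not come from any fitted model.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return the parameter combination with the highest score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical score surface; a real run would return mean CV accuracy."""
    return -abs(params["max_depth"] - 7) - abs(params["n_estimators"] - 50) / 100

# Hypothetical candidate grid for a tree ensemble.
grid = {"max_depth": [3, 5, 7, 9], "n_estimators": [50, 100, 200]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (here 4 x 3 = 12 evaluations), and each evaluation in a real run involves a full cross-validation.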

Statistical analysis

To evaluate the performance of the ML models and ensure the reliability of the results, the following statistical analyses are conducted:

Descriptive statistics: Descriptive statistics, including mean, median, and standard deviation, are calculated for all numerical features in the dataset. This provided insights into the central tendency and variability of the data.

Class distribution analysis: The distribution of the target classes (thyroid disorder categories) is analyzed before and after applying SMOTE to quantify the effectiveness of the balancing technique.

Performance metrics: The performance of each model is evaluated using standard metrics, including accuracy, precision, recall, and F1 score. These metrics are calculated for both the training and testing datasets to assess the model’s generalization ability.

Confidence intervals: Confidence intervals (95%) are calculated for the performance metrics (accuracy, precision, recall, F1 score) to assess the stability and reliability of the results across different folds of cross-validation.

Receiver operating characteristic (ROC) curve and area under the ROC curve (AUC) analysis: The AUC is calculated for each model to evaluate its ability to distinguish between classes. The ROC curves are plotted to visualize the trade-off between the true positive rate (TPR) and false positive rate (FPR) at various threshold settings.
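
One common way to obtain a 95% confidence interval from cross-validation folds is a Student's t interval around the mean fold score. The text does not state which method was used, so the sketch below is an assumption, and the per-fold accuracies are made-up numbers for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-fold accuracies from a 10-fold run (made-up numbers).
fold_accuracy = [0.997, 0.999, 0.998, 1.000, 0.997, 0.999, 0.998, 0.999, 0.996, 0.999]

# Two-sided 95% critical value of Student's t with 9 degrees of freedom.
T_CRIT_DF9 = 2.262

def mean_confidence_interval(scores, t_crit):
    """Return (lower, mean, upper) for a t-based interval around the mean."""
    centre = mean(scores)
    half_width = t_crit * stdev(scores) / sqrt(len(scores))
    return centre - half_width, centre, centre + half_width

low, centre, high = mean_confidence_interval(fold_accuracy, T_CRIT_DF9)
print(f"accuracy = {centre:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```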

Evaluation metrics

Accuracy: The ratio of correctly predicted instances to all instances.

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Precision: The ratio of correctly predicted positive observations to all predicted positive observations.

$\text{Precision} = \frac{TP}{TP + FP}$

Recall: The ratio of correctly predicted positive observations to all observations in the actual positive class.

$\text{Recall} = \frac{TP}{TP + FN}$

F1 Score: The harmonic mean of recall and precision. It takes into account false negatives as well as false positives.

$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

Where FP stands for false positive, TN for true negative, FN for false negative, and TP for true positive.
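
The four metrics can be computed directly from confusion-matrix counts, as in the sketch below. The counts are illustrative assumptions: they were chosen here to be consistent with the precision, recall, and support reported for LR on class 0 in Table 4, and are not taken from a published confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed one-vs-rest counts for a class with 37 true members among 755 samples.
accuracy, precision, recall, f1 = classification_metrics(tp=18, fp=6, fn=19, tn=712)
print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f} f1={f1:.6f}")
```

The example shows why accuracy alone can mislead on imbalanced data: the one-vs-rest accuracy stays high even though fewer than half of the class members are recovered.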

Results

This section presents the experiments and findings of the studies aimed at predicting thyroid disorders. The results, which include all relevant attributes, focus on a multiclass classification task that uses the target attribute to determine a patient’s thyroid status. To balance the dataset, SMOTE is applied, and hyperparameter tuning is performed to improve the ML classifiers’ performance metrics. The ML algorithms are then trained using the balanced dataset.

Experiment design

The ML classifiers are developed and evaluated in Python using Scikit-Learn. An 80:20 ratio is used to split the data into training and testing sets. The efficacy of the supervised ML algorithms is evaluated using several performance metrics: recall, precision, accuracy, and F1 score.

Results without using SMOTE

This section examines the performance of various models trained and verified on the original dataset without using SMOTE. The analysis focuses on the metrics of recall, precision, accuracy, and F1 score to determine how well each model predicts the target classes in an imbalanced dataset. Figure 10 shows the accuracy plot without SMOTE.

Figure 10.

The accuracy of all implemented models without the synthetic minority over-sampling technique (SMOTE).

Results validation using K-fold cross validation without SMOTE

Table 3 shows metrics for the performance of models validated by K-fold cross-validation without SMOTE. These metrics show how models perform in a situation of class imbalance, where some classes are underrepresented in the dataset.

Table 3.

Performance metrics without SMOTE.

Model	Accuracy	Precision	Recall	F1 score
LR	96.29%	95.89%	96.29%	95.90%
GB	99.74%	99.74%	99.74%	99.73%
DT	99.60%	99.74%	99.60%	99.66%
RF	99.60%	99.63%	99.60%	99.61%
SVM	97.75%	97.69%	97.75%	97.64%
Stacking ensemble	99.74%	99.74%	99.74%	99.73%

SVM: support vector machine; RF: random forest; SMOTE: synthetic minority over-sampling technique; GB: gradient boosting; LR: logistic regression.

The LR model obtained an accuracy of 96.29%, with recall, precision, and F1 score all close to the same value. While these results are reasonable, the model’s performance suggests that it may struggle to predict the minority class correctly because of the inherent imbalance in the data. The GB model demonstrated remarkable performance, attaining an accuracy of 99.74%. Its precision, recall, and F1 scores (99.73% to 99.74%) are consistent with the accuracy, indicating that GB is highly effective even without addressing the class imbalance, though it might still benefit from techniques like SMOTE.

The DT model also demonstrated strong performance with an accuracy of 99.60%. The high precision and recall values indicate that the model is fairly robust, but as with GB, there may be room for improvement by addressing the class imbalance. Similar to the DT, the RF model achieved an accuracy of 99.60%, with precision and recall values closely aligned. The performance of the model indicates that it can handle imbalanced data relatively well, but like the other models, it might benefit from further enhancements.

The SVM model reached an accuracy of 97.75%, which is lower than that of the tree-based and ensemble models but higher than that of LR. Its precision and recall are also lower than those of the tree-based models, suggesting that the model may be more sensitive to class imbalance. The proposed Stacking Ensemble model matched the performance of GB with an accuracy of 99.74%, indicating that combining multiple models can lead to robust predictions even without balancing the dataset.

Classification report outcomes of implemented models without SMOTE

Table 4 provides the detailed classification report for each of the employed models without SMOTE, illustrating the precision, recall, F1 score, and support for each class. The data highlights how well each model performs in predicting both the majority and minority classes. The report reveals that models like Stacking Ensemble and GB perform well across all classes. Others, like LR and SVM, struggle with the minority classes.

Table 4.

Class-wise summary report for each model based on specific target class without SMOTE.

Model	Class	Precision	Recall	F1 score	Support
LR	0	0.750000	0.486486	0.590164	37
	1	0.970629	0.995696	0.983003	697
	2	0.937500	0.714286	0.810811	21
GB	0	0.973684	1.000000	0.986667	37
	1	0.998567	1.000000	0.999283	697
	2	1.000000	0.904762	0.950000	21
DTC	0	0.948718	1.000000	0.973684	37
	1	1.000000	0.998565	0.999282	697
	2	1.000000	0.904762	0.950000	21
RF	0	0.973684	1.000000	0.986667	37
	1	1.000000	1.000000	1.000000	697
	2	1.000000	0.952381	0.975610	21
SVM	0	0.848485	0.756757	0.800000	37
	1	0.983027	0.997131	0.990028	697
	2	1.000000	0.714286	0.833333	21
Stacking ensemble	0	0.973684	1.000000	0.986667	37
	1	0.997139	1.000000	0.998567	697
	2	1.000000	0.857143	0.923077	21

SVM: support vector machine; RF: random forest; SMOTE: synthetic minority over-sampling technique; GB: gradient boosting; LR: logistic regression.
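The per-class figures in Table 4 follow from the standard definitions of precision, recall, and F1 score. As a minimal sketch, the Python snippet below recomputes the LR row for class 0. The counts used (18 true positives, 6 false positives, 19 false negatives) are not reported in the article; they are inferred here as the integers consistent with the published precision, recall, and support of 37. The function name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Assumed counts for LR, class 0 (inferred from Table 4, not reported directly).
p, r, f1 = precision_recall_f1(tp=18, fp=6, fn=19)
print(f"precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
```

Under these assumed counts the snippet gives 0.750000, 0.486486, and 0.590164, the values listed for LR class 0. Low recall with moderate precision is the usual pattern for an under-represented class.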

Study results using SMOTE

SMOTE is applied to address the issue of class imbalance, which can greatly affect the performance of classification models. By generating synthetic examples for the minority classes, SMOTE yields a more balanced training dataset. This section discusses the performance of the models when validated using K-fold cross-validation after applying SMOTE, providing a more robust picture of the models' predictive capabilities. Figure 11 shows the accuracy plot with SMOTE.

Figure 11.

The accuracy result of all implemented algorithms with the use of synthetic minority over-sampling technique (SMOTE).
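To make the idea of generating synthetic minority examples concrete, the following is a simplified, standard-library sketch of the interpolation step at the core of SMOTE. It is not the implementation used in the study. The function name `smote_sketch`, the neighbour count `k`, and the toy points are assumptions for illustration.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Toy minority class: four points on the unit square (not study data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new examples stay inside the region already occupied by that class.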

Table 5 presents the performance metrics for each model after applying SMOTE and validating with K-fold cross-validation. The application of SMOTE resulted in a significant improvement in the models' ability to predict the minority class, as reflected in the increased precision, recall, and F1 scores.

Table 5.

Performance metrics with SMOTE.

Model	Accuracy	Precision	Recall	F1 Score
LR	98.99%	99.01%	98.99%	98.99%
GB	99.86%	99.86%	99.86%	99.86%
DTC	99.93%	99.93%	99.93%	99.93%
RF	99.82%	99.82%	99.82%	99.82%
SVM	99.03%	99.05%	99.03%	99.03%
Stacking Ensemble	99.86%	99.86%	99.86%	99.86%

SVM: support vector machine; RF: random forest; SMOTE: synthetic minority over-sampling technique; GB: gradient boosting; LR: logistic regression.
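In Table 5 the accuracy and recall columns coincide in every row. This is what support-weighted averaging of the per-class recalls would produce, although the averaging mode is an assumption here. Writing $n_c$ for the support of class $c$, $\mathrm{TP}_c$ for its correctly predicted instances, and $N = \sum_c n_c$ for the total number of instances (notation introduced for this note), the identity is:

```latex
\[
\mathrm{Recall}_{\mathrm{weighted}}
  = \sum_{c} \frac{n_c}{N}\cdot\frac{\mathrm{TP}_c}{n_c}
  = \frac{1}{N}\sum_{c} \mathrm{TP}_c
  = \mathrm{Accuracy}
\]
```

As a check on the results without SMOTE, multiplying each LR recall in Table 4 by its support gives approximately 18, 694, and 15 correct predictions, and (18 + 694 + 15) / 755 ≈ 96.29%, which agrees with the reported LR accuracy.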

Classification report results of implemented models with SMOTE

Table 6 provides a detailed classification report for the employed models after applying SMOTE, showing the precision, recall, F1 score, and support for each class. The results indicate that SMOTE significantly enhanced the models' ability to predict accurately across all classes, particularly the minority classes, which were previously underrepresented in the dataset.

These results highlight the challenges that arise when models are trained on imbalanced datasets. Certain models, such as the Stacking Ensemble and GB, perform well even without SMOTE, whereas others, particularly LR and SVM, need techniques that address class imbalance to improve their predictive accuracy across all classes.

Table 6.

Class-wise summary report for each model based on specific target class with SMOTE.

Model	Class	Precision	Recall	F1 score	Support
LR	0	0.974255	1.000000	0.986960	719
	1	0.997019	0.970972	0.983824	689
	2	0.991241	0.988355	0.989796	687
	3	0.998553	1.000000	0.999276	690
GB	0	0.998611	1.000000	0.999305	719
	1	0.998544	0.995646	0.997093	689
	2	0.998544	0.998544	0.998544	687
	3	0.998553	1.000000	0.999276	690
DTC	0	0.997222	0.998553	0.997886	719
	1	0.998547	0.995646	0.997093	689
	2	0.998544	1.000000	0.999273	687
	3	0.998553	1.000000	0.999276	690
RF	0	0.995845	1.000000	0.997918	719
	1	1.000000	0.992743	0.996358	689
	2	0.998547	1.000000	0.999273	687
	3	0.998553	1.000000	0.999276	690
SVM	0	0.975610	0.998553	0.986960	719
	1	0.987013	0.998558	0.992753	689
	2	0.993486	0.998558	0.996017	687
	3	0.998553	0.998553	0.998553	690
Stacking ensemble	0	0.998611	1.000000	0.999305	719
	1	0.998544	0.998553	0.998549	689
	2	0.998544	0.998553	0.998549	687
	3	0.998553	1.000000	0.999276	690

SVM: support vector machine; RF: random forest; SMOTE: synthetic minority over-sampling technique; GB: gradient boosting; LR: logistic regression.

Comparative analysis of the current study with and without SMOTE

The comparative analysis reveals significant performance improvements after balancing the dataset with SMOTE. Particularly, SMOTE enhanced the recall and F1 scores, underscoring its crucial role in predictive modeling for imbalanced medical datasets. The application of SMOTE is especially beneficial in improving the model’s capability to predict minority classes accurately, which is often a challenge in real-world medical scenarios.

Figure 12 illustrates the comparison of accuracy plots for models with and without SMOTE, highlighting the effectiveness of SMOTE in generating more reliable predictions. This graphical representation clearly shows that models trained on a balanced dataset perform significantly better, particularly in terms of recall and F1-score, which are critical metrics in medical diagnostics.

Figure 12.

The accuracy result of all implemented algorithms with and without the synthetic minority over-sampling technique (SMOTE).

Confusion matrix analysis

A confusion matrix provides a thorough summary of the model’s predictions, displaying True Positives, True Negatives, False Positives, and False Negatives for each class. In this study, confusion matrices are generated for each ML model after applying the SMOTE to address class imbalance.
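As an illustration of how such a matrix is assembled for a multi-class problem, the sketch below builds a confusion matrix from label lists and derives the one-vs-rest counts for a single class. The labels are toy values rather than data from the study, and the helper names `confusion_matrix` and `one_vs_rest_counts` are chosen for this example.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class index k, treating the rest as negative."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # true class k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp   # predicted k, true class differs
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

# Toy labels for three classes (illustrative only).
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 1, 2]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
counts_class_1 = one_vs_rest_counts(cm, 1)
```

In a multi-class setting, "true negatives" for a class are all instances that neither belong to that class nor were predicted as it, which is why the counts are computed per class.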

The confusion matrix for LR (Figure 13(a)) shows the model's predictions across all classes. Although LR is generally efficient, it showed some difficulty in accurately predicting the minority class, even after SMOTE is applied. The matrix indicates a tendency to misclassify the minority class as the majority class, which highlights the model's limitations in handling complex, imbalanced datasets.

Figure 13.

The confusion matrix analysis.

Figure 13(b) shows that GB performs well across all classes. The matrix reflects high accuracy with balanced predictions, suggesting that the model clearly distinguishes between classes. This outcome aligns with the expectations for GB, known for its robustness and precision in classification tasks.

Figure 13(c) and (d) present the confusion matrices for the DTC and RF models, respectively. While both models performed reasonably well, the RF model has a higher TPR and a lower false negative rate than the DTC. This suggests that RF, with its ensemble of decision trees, provides better generalization and accuracy, particularly in predicting the minority class.

The SVM model's confusion matrix (Figure 13(e)) highlights its capability in class distinction. The matrix shows that SVM is effective in separating the classes, even in the presence of class imbalance. However, there is a slight inclination towards misclassifying instances of the minority class, although SMOTE helped mitigate this issue.

The Stacking Ensemble model, as depicted in Figure 13(f), produced balanced predictions across all classes. The confusion matrix underscores the strength of ensemble methods, which improve predictive performance by integrating several models. The matrix shows that the Stacking Ensemble model minimized both false positives and false negatives, indicating a well-rounded performance.

The confusion matrix analysis demonstrates that the Stacking Ensemble model outperformed the other models, providing balanced predictions across all classes. The GB and RF models also exhibited strong performance, particularly in accurately predicting the minority class. LR and SVM, while effective, showed some limitations in classifying minority instances, even with SMOTE applied. Overall, the application of SMOTE significantly improved the models' ability to handle imbalanced data, with ensemble methods proving to be the most robust approach.

AUC and ROC curve analysis

The ROC curve and the AUC are essential tools for evaluating the performance of classification models, particularly in imbalanced datasets. The ROC curve plots the TPR against the FPR at various threshold settings, while the AUC provides a single metric to quantify the model’s ability to distinguish between classes. In this study, ROC curves and AUC values are generated for each ML model after applying SMOTE to address class imbalance.
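The threshold sweep behind an ROC curve can be sketched in a few lines of Python. The example below handles a single binary (one-vs-rest) problem; how the multi-class curves in this study were aggregated is not stated, so that part is left out. The function names and the toy scores are assumptions for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold
    over every distinct score. y_true holds 1 (positive) or 0 (negative)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: two negatives and two positives with predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(y_true, scores))
print(area)  # 0.75
```

An AUC of 1.0 means every positive is scored above every negative, while 0.5 corresponds to random ranking.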

The ROC curve for the LR model, as shown in Figure 14(a), demonstrates its ability to distinguish between classes. The AUC value of 0.98 indicates strong performance, with the model achieving a high TPR while maintaining a low FPR. However, the curve reveals a slight decline in performance for the minority class, suggesting that while LR is effective, it struggles slightly with complex, imbalanced datasets even after SMOTE is applied.

Figure 14.

The ROC AUC curves analysis. ROC: receiver operating characteristic; AUC: area under the ROC curve.

Figure 14(b) highlights the exceptional performance of the GB model. With an AUC value of 0.99, the model achieves near-perfect class separation, reflecting its robustness and precision. The curve remains close to the top-left corner, indicating a high TPR and a low FPR across all thresholds. This outcome aligns with the expectations for GB, which is known for its ability to handle imbalanced data and complex relationships.

The ROC curve for the DT model, depicted in Figure 14(c), shows good performance with an AUC value of 0.97. However, the curve reveals a slight drop in TPR for the minority class, suggesting that the model may overfit to the majority class. While the DT performs well, it is less robust compared to ensemble methods like GB and RF.

Figure 14(d) demonstrates the superior performance of the RF model. With an AUC value of 0.99, the model achieves excellent class separation, outperforming the DT. The curve remains consistently high, indicating that the RF model generalizes well and maintains a high TPR even for the minority class. This result underscores the strength of ensemble methods in handling imbalanced datasets.

The ROC curve for the SVM model, as shown in Figure 14(e), highlights its capability in class distinction. With an AUC value of 0.98, the model performs well, particularly in separating the majority classes. However, the curve reveals a slight decline in TPR for the minority class, indicating that while SVM is effective, it still faces challenges in accurately predicting minority instances, even with SMOTE applied.

The ROC curve for the Stacking Ensemble model, depicted in Figure 14(f), showcases its balanced and robust performance. With an AUC value of 0.99, the model achieves near-perfect class separation, outperforming all individual models. The curve remains consistently close to the top-left corner, indicating a high TPR and a low FPR across all thresholds. This result underscores the strength of the Stacking Ensemble in integrating the predictions of multiple models to improve overall performance. The model successfully minimizes both false positives and false negatives, demonstrating its ability to handle imbalanced data effectively.

The AUC and ROC curve analysis shows that the Stacking Ensemble model is among the strongest performers, reaching the highest reported AUC value of 0.99 (shared with GB and RF) and maintaining a consistently high TPR across all classes. The GB and RF models likewise exhibit strong performance, reflecting their robustness in handling imbalanced data. While LR and SVM perform well, they show slight limitations in accurately predicting minority classes, even with SMOTE applied. Overall, the application of SMOTE significantly improves the models' ability to handle imbalanced data, with ensemble methods proving to be the most effective approach.

Comparison with other related studies

Several studies have explored the application of ML algorithms for predicting thyroid disease, each employing different methodologies and achieving varying levels of accuracy. A comparison with these studies underscores the effectiveness of the model proposed in this research, as shown in Table 7.

Table 7.

Comparison of machine learning approaches for thyroid disease prediction.

Study	Techniques	Accuracy
Yadav and Pal⁶ (2019)	Ensemble data mining techniques	98.8%
Blagus and Lusa²⁵ (2013)	Feature selection, SMOTE	98.5%
Ahmad et al.²⁶ (2018)	Hybrid decision support system	98.5%
Chai²⁷ (2020)	Knowledge graph technology, BiLSTM	88.7%
Proposed study	Stacking ensemble with synthetic minority over-sampling technique (SMOTE), 10-fold CV	99.8%

Yadav and Pal (2019)⁶ utilized ensemble data mining techniques such as Boosting, Bagging, Stacking, and Voting, achieving an accuracy of 98.80%. Blagus and Lusa (2013)²⁵ concentrated on effective feature selection and handling class imbalance, reaching an accuracy of 98.5%. Similarly, Ahmad et al. (2018)²⁶ developed a hybrid decision support system that achieved 98.5% accuracy and 99.7% specificity. Chai (2020)²⁷ integrated knowledge graph technology with a BiLSTM network, achieving an accuracy of 88.76%.

The proposed study outperforms these earlier studies, reaching an accuracy of 99.86% with the Stacking Ensemble method. This performance can be attributed to the methodology, which includes extensive data preprocessing, feature engineering, and the application of SMOTE to address class imbalance. The use of 10-fold cross-validation further supports the robustness and reliability of the models.
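For readers unfamiliar with the procedure, the splitting logic of 10-fold cross-validation can be sketched as follows: the data are divided into ten folds, each fold serves once as the test set, and the remaining nine are used for training. This is a plain, unshuffled splitter written for illustration; whether the study shuffled or stratified its folds is not stated, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Toy run with 25 samples (illustrative size, not the study's dataset).
folds = list(k_fold_indices(25, k=10))
fold_sizes = [len(test) for _, test in folds]
```

Averaging a metric over the ten test folds gives an estimate that depends less on any single train/test split.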

Multi-dataset validation

To evaluate the generalization capabilities of the proposed model, we validated its performance on two distinct thyroid disease datasets: the original UCI ML Repository dataset and an additional thyroid dataset publicly available on Kaggle.²⁸ This multi-dataset validation approach tests whether the model is robust and performs well across different data sources and clinical settings.

The additional dataset, comprising 3772 records with a different class distribution, is used to test the model’s generalizability. Despite the differences in data characteristics, the Stacking Ensemble model achieved an accuracy of 97.40%, with precision, recall, and F1 scores consistently above 97%. This performance is comparable to the results on the original dataset, indicating that the model is not overfitting to the training data and can generalize well to new, unseen data.

The performance metrics for both datasets are summarized in Table 8. The Stacking Ensemble model consistently outperformed individual models, achieving the highest accuracy and F1 scores across both datasets. This highlights the effectiveness of combining multiple base models to improve predictive performance.

Table 8.

Performance metrics on original and additional datasets.

Model	Dataset	Accuracy	Precision	Recall	F1 score
Logistic regression	Original	96.95%	96.73%	96.95%	96.64%
	Additional	94.99%	97.31%	94.99%	95.28%
Gradient boosting	Original	99.97%	99.97%	99.97%	99.97%
	Additional	97.43%	99.97%	97.43%	98.44%
Decision tree	Original	99.89%	99.90%	99.89%	99.89%
	Additional	97.32%	99.92%	97.32%	98.37%
Random forest	Original	99.87%	99.87%	99.87%	99.87%
	Additional	97.45%	99.92%	97.45%	98.46%
Support vector machine (SVM)	Original	98.59%	98.52%	98.59%	98.50%
	Additional	96.31%	98.65%	96.31%	96.97%
Stacking ensemble	Original	99.89%	99.84%	99.89%	99.87%
	Additional	97.40%	99.88%	97.40%	98.39%

The results confirm that the proposed Stacking Ensemble model is not only accurate but also robust and generalizable, making it a promising tool for thyroid disease prediction in clinical practice.

Discussion

The results of this study demonstrate the effectiveness of ML algorithms, particularly ensemble methods, in improving the accuracy of thyroid disorder diagnosis. Without the application of SMOTE, models such as GB and the Stacking Ensemble achieved high accuracy (99.74%), precision, recall, and F1 scores, indicating their robustness even in the presence of imbalanced data. However, models like LR and SVM showed limitations in predicting the minority class, as reflected in their lower recall and F1 scores for underrepresented classes.

The application of SMOTE significantly improved the performance of all models, particularly for the minority class. For instance, the accuracy of LR increased from 96.29% to 98.99%, and its recall and F1 scores for the minority class improved substantially. This highlights the importance of addressing class imbalance in medical datasets, where minority classes often represent critical cases that require accurate prediction.

The Stacking Ensemble model consistently outperformed other models, achieving an accuracy of 99.86% with SMOTE. This superior performance can be attributed to the ensemble’s ability to combine the strengths of multiple base models, resulting in more robust and generalizable predictions. Similarly, GB and RF demonstrated strong performance, further validating the effectiveness of ensemble methods in handling imbalanced datasets.

In conclusion, this study contributes to the growing body of research on ML applications in healthcare by proposing a robust framework for thyroid disorder prediction. The results underscore the potential of ensemble methods and data-balancing techniques in improving diagnostic accuracy, while also highlighting the need for further research to address existing limitations and explore new avenues for innovation.

Limitations

While the results are promising, the study acknowledges that further advancements are needed. Future work could explore deep learning techniques, which have shown significant potential in other areas of medical diagnostics because they can extract features automatically from raw data. Additionally, applying the proposed model to larger and more diverse datasets would be essential to improve its generalizability and reliability across different populations. This could involve including data from various geographic regions, different demographic groups, and diverse clinical settings.

Conclusion

This research presents a comprehensive ML approach for predicting thyroid disorders, utilizing the well-known dataset from the UCI ML Repository. The research addresses challenges such as class imbalance and model complexity through data preprocessing techniques and SMOTE. These methods, combined with the Stacking Ensemble technique, resulted in a substantial improvement in predictive performance. Specifically, the ensemble model achieved an accuracy of 99.86%, underscoring its effectiveness in identifying thyroid disorders. This high accuracy demonstrates the model's potential to support clinicians in the early diagnosis and treatment of thyroid-related conditions, which is critical given the complexities of thyroid disorders and their impact on patients' health.

Future work

Future studies could investigate the integration of this model into clinical workflows, evaluating its real-world performance and its impact on patient outcomes. Ultimately, the goal would be to refine the model further so that it can serve as a practical, reliable tool in the clinical environment, assisting healthcare professionals in making informed decisions regarding thyroid disease diagnosis and treatment.

Footnotes

ORCID iDs

Ali Raza

Norma Latif Fitriyani

Muhammad Syafrudin

Seung Won Lee

Author contributions

Conceptualization, AH, SR, AR, MMI, AS, NLF, MS, and SWL; methodology, AH, SR, AR, NLF, MS, and SWL; validation, MMI and AS; formal analysis, AH, SR, AR, and NLF; investigation, MMI and AS; data curation, AH, SR, AR, and MMI; writing—original draft preparation, AH, SR, AR, MMI, AS, and NLF; writing—review and editing, MS and SWL; visualization, AH, SR, AR, and NLF; supervision, MS and SWL; funding acquisition, MS and SWL. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Bio&Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT): NRF[2021-R1-I1A2(059735)]; RS[2024-0040(5650)]; RS[2024-0044(0881)]; RS[2019-II19(0421)].

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Guarantor

SWL.

References

1. Burguera, Gharib. Thyroid incidentalomas: prevalence, diagnosis, significance, and management. Endocrinol Metab Clin North Am 2000; 29: 187–203.

2. Niedziela. Thyroid nodules. Best Pract Res Clin Endocrinol Metab 2014; 28: 245–277.

3. Giovanella, Campennì, Tuncel, et al. Integrated diagnostics of thyroid nodules. Cancers 2024; 16: 311.

4. Chaubey, Bisen, Arjaria, et al. Thyroid disease prediction using machine learning approaches. Natl Acad Sci Lett 2021; 44: 233–238.

5. Tyagi, Mehra, Saxena. Interactive thyroid disease prediction system using machine learning technique. In: 2018 Fifth international conference on parallel, distributed and grid computing (PDGC), 2018, pp.689–693. IEEE.

6. Yadav, Pal. Thyroid prediction using ensemble data mining techniques. Int J Inform Technol 2022; 14: 1273–1283.

7. Mehak, Rasheed, Ibupoto, et al. Machine learning algorithms for prediction of thyroid syndrome at initial stages in females. Kurdish Stud 2024; 12: 466–470.

8. Dhamodaran, Shankar, Balachander, et al. Estimation of thyroid by means of machine learning and feature selection methods. Cham: Springer International Publishing, 2023. ISBN 978-3-031-23602-0, pp.327–344. DOI: https://doi.org/10.1007/978-3-031-23602-0_19.

9. Ahmad, Huang, Ahmad, et al. Thyroid diseases forecasting using a hybrid decision support system based on ANFIS, k-NN and information gain method. J Appl Environ Biol Sci 2017; 7: 78–85.

10. Chaganti, Rustam, De La, et al. Thyroid disease prediction using selective features and machine learning techniques. Cancers 2022; 14: 3914.

11. Obaido, Achilonu, Ogbuokiri, et al. An improved framework for detecting thyroid disease using filter-based feature selection and stacking ensemble. IEEE Access 2024; 12: 89098–89112.

12. Yadav, Pal. Prediction of thyroid disease using decision tree ensemble method. Hum-Intell Syst Integ 2020; 2: 89–95.

13. Chen, Yang, Wang, et al. A three-stage expert system based on support vector machines for thyroid disease diagnosis. J Med Syst 2012; 36: 1953–1963.

14. Jha, Bhattacharjee, Mustafi. Increasing the prediction accuracy for thyroid disease: a step towards better health for society. Wirel Person Commun 2022; 122: 1921–1938.

15. Abbad Ur Rehman, Lin, Mushtaq, et al. Performance analysis of machine learning algorithms for thyroid disease. Arab J Sci Eng 2021; 46: 1–13.

16. Prasad, Rao, Babu MSP. Thyroid disease diagnosis via hybrid architecture composing rough data sets theory and machine learning algorithms. Soft Comput 2016; 20: 1179–1189.

17. Quinlan. Thyroid disease dataset. UCI Machine Learning Repository, 1986. https://doi.org/10.24432/C5D010. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

18. Yin, Jang-Jaccard, et al. IGRF-RFE: a hybrid feature selection method for MLP-based network intrusion detection on UNSW-NB15 dataset. J Big Data 2023; 10: 15.

19. Akhtar, Atiq, Shahid, et al. Novel glassbox based explainable boosting machine for fault detection in electrical power transmission system. PLoS ONE 2024; 19: e0309459.

20. Raza, Younas, Siddiqui HUR, et al. An improved deep convolutional neural network-based YouTube video classification using textual features. Heliyon 2024; 10.

21. Raza, Rehman, Sehar, et al. Optimized virtual reality design through user immersion level detection with novel feature fusion and explainable artificial intelligence. PeerJ Comput Sci 2024; 10: 1–25.

22. Younas, Raza, Thalji, et al. An efficient artificial intelligence approach for early detection of cross-site scripting attacks. Decis Anal J 2024; 11: 100466.

23. Rustam, Raza, Qasim, et al. A novel approach for real-time server-based attack detection using meta-learning. IEEE Access 2024; 12: 39614–39627.

24. Shankar, Lakshmanaprabu, Gupta, et al. Optimal feature-based multi-kernel SVM approach for thyroid disease classification. J Supercomput 2020; 76: 1128–1143.

25. Blagus, Lusa. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform 2013; 14: 1–16.

26. Ahmad, et al. A novel hybrid decision support system for thyroid disease forecasting. Soft Comput 2018; 22: 5377–5383.

27. Chai. Diagnosis method of thyroid disease combining knowledge graph and deep learning. IEEE Access 2020; 8: 149787.

28. Shakir. Thyroid disease data set. Available at: https://www.kaggle.com/datasets/yasserhessein/thyroid-disease-data-set?resource=download (accessed on 30 June 2024).

Improving thyroid disorder diagnosis via innovative stacking ensemble learning model
