Abstract
Imbalanced classification remains a critical challenge in decision-sensitive domains such as healthcare, finance, and cybersecurity, where minority-class recognition is often paramount. This paper introduces FRSTU-Forest, a novel hybrid framework that integrates k-Nearest Neighbor (k-NN) imputation, Fixed Random State Undersampling (FRSTU), and Random Forest to enhance both minority-class detection and model reproducibility. Unlike conventional undersampling, FRSTU applies deterministic sampling with a fixed random seed, ensuring consistent training subsets across runs and significantly reducing performance variance. The framework was comprehensively evaluated on seven benchmark datasets with moderate imbalance ratios (1.25–3.36) and rigorously tested on synthetic datasets with extreme imbalance ratios up to 1:100, high dimensionality (100 features), and substantial label noise (20%). FRSTU-Forest consistently outperformed baseline models (RF, k-NNimp+RF, RSTU+RF), achieving an average accuracy of 87.88%, a minority-class F1-score of up to 99.78%, and a Cohen's Kappa of 0.86 on benchmark datasets. More importantly, under extreme imbalance conditions (1:100 ratio), it maintained a balanced accuracy of 0.807 with 100 features, demonstrating remarkable robustness. Statistical significance was confirmed via the Bonferroni-Dunn test.
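The three-stage pipeline the abstract describes (k-NN imputation, then fixed-seed undersampling, then Random Forest) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the helper `frstu_undersample` and all parameter values (seed, neighbor count, data shapes) are assumptions; only the overall structure follows the abstract.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier

SEED = 42  # fixed random state: the source of FRSTU's run-to-run reproducibility

def frstu_undersample(X, y, seed=SEED):
    """Deterministically undersample every class down to the minority-class count.

    Hypothetical helper illustrating the FRSTU idea: with the same seed and
    data, the selected subset is identical on every run.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    idx.sort()
    return X[idx], y[idx]

# Synthetic imbalanced data with missing values (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.array([0] * 270 + [1] * 30)          # 9:1 class imbalance
X[rng.random(X.shape) < 0.05] = np.nan      # inject ~5% missingness

X_imp = KNNImputer(n_neighbors=5).fit_transform(X)       # stage 1: k-NN imputation
X_bal, y_bal = frstu_undersample(X_imp, y)               # stage 2: FRSTU
clf = RandomForestClassifier(random_state=SEED).fit(X_bal, y_bal)  # stage 3: RF

# Determinism check: a second call with the same seed yields the same subset.
X_bal2, y_bal2 = frstu_undersample(X_imp, y)
```

With a fixed seed, rerunning the sampler reproduces the training subset exactly, which is the property the abstract credits for reduced performance variance across runs.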
