Solving class imbalance problem using bagging, boosting techniques, with and without using noise filtering method

Abstract

The class imbalance problem is prevalent in numerous real-world application domains and remains an active research topic. In an imbalanced dataset, most of the data is labeled as one class, called the majority class, while far fewer samples are labeled as the other class, called the minority class, which is usually the more important one. However, none of the existing approaches has performed efficiently in terms of accuracy. This work presents a comparison of the performance of several boosting and bagging techniques on imbalanced datasets. A wide range of data mining and machine learning applications encounter the class imbalance problem. An imbalanced dataset consists of samples with a skewed class distribution, and traditional methods are biased towards the negative (majority) samples. A popular pre-processing technique for handling class imbalance is over-sampling, which balances the dataset to achieve a high classification rate and avoids bias towards majority-class samples. Over-sampling takes all minority samples in the training data into consideration when performing classification. However, the presence of noise in the minority and majority samples may degrade classification performance. Hence, this work presents a performance comparison of boosting and bagging, each with and without noise filtering. The evaluation covers state-of-the-art ensemble learning methods, namely AdaBoost, RUSBoost, SMOTEBoost, Bagging, OverBagging, and SMOTEBagging, on 25 imbalanced binary-class datasets with various Imbalance Ratios (IR). The experimental results, measured with F-Measure and AUC, show that our approach is promising and effective for dealing with imbalanced datasets.

Keywords

Class imbalance problem, ensemble learning method, noise filter, boosting, bagging
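
As an illustration of three terms used in the abstract, the following minimal Python sketch computes an Imbalance Ratio, balances a binary dataset by plain random over-sampling, and evaluates the F-Measure for the minority class. It is not the method evaluated in this work: the function names are hypothetical, the over-sampling shown simply duplicates minority samples (it is neither SMOTE nor any of the ensemble methods listed above), and IR is assumed here to mean the majority count divided by the minority count.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Assumed definition: majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size.

    Assumes a binary problem; this is plain random over-sampling, not SMOTE.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    # Number of extra minority copies needed to match the majority class.
    n_extra = max(0, n_majority - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [samples[i] for i in keep], [labels[i] for i in keep]


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: nine majority samples (label 0) and one minority sample (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 9 + [1]
X_bal, y_bal = random_oversample(X, y, minority=1)
```

On this toy data the Imbalance Ratio is 9 before over-sampling and 1 afterwards. Because the minority class is treated as the positive class, the F-Measure reflects how well the rarer class is recognized, which overall accuracy can hide on skewed data.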

