wCM based hybrid pre-processing algorithm for class imbalanced dataset

Abstract

Classifying imbalanced datasets is challenging because of their severely skewed class distribution, and traditional machine learning algorithms show degraded performance on such data. Other characteristics of a classification dataset are also hard for traditional algorithms and make it more difficult still to build a model for imbalanced data. Data complexity metrics identify these intrinsic characteristics, which substantially degrade the performance of learning algorithms. Although many research efforts have addressed class noise, none has focused on imbalanced datasets coupled with other intrinsic factors. This paper presents a novel hybrid pre-processing algorithm that treats class-label noise in imbalanced datasets that also suffer from other intrinsic factors such as class overlapping, non-linear class boundaries, small disjuncts, and borderline examples. The algorithm uses the wCM complexity metric (proposed for imbalanced datasets) to identify noisy, borderline, and other difficult instances and then handles each kind of instance accordingly. Experiments on synthetic and real-world datasets with different levels of imbalance, noise, small disjuncts, class overlapping, and borderline examples are conducted to check the effectiveness of the proposed algorithm. The experimental results show that the proposed algorithm offers an interesting alternative to popular state-of-the-art pre-processing algorithms for handling imbalanced datasets that also contain noise and other difficulties.
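The idea of flagging noisy and borderline instances before resampling can be illustrated with a minimal sketch. This is not the wCM metric, whose definition is not given in this abstract. It substitutes a generic distance-weighted k-nearest-neighbour disagreement score, and the function names `knn_disagreement` and `categorize`, the thresholds `noisy_at` and `borderline_at`, and the toy data are all assumptions made for illustration.

```python
import math


def knn_disagreement(X, y, k=3):
    """Return, for each instance, the distance-weighted share of its k
    nearest neighbours that carry a different class label.
    NOTE: a generic stand-in score, not the paper's wCM metric."""
    scores = []
    for i, xi in enumerate(X):
        # Brute-force distances to every other instance (fine for a sketch).
        nearest = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        # Closer neighbours get larger weights; the epsilon avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        disagree = sum(w for w, (_, j) in zip(weights, nearest) if y[j] != y[i])
        scores.append(disagree / sum(weights))
    return scores


def categorize(scores, noisy_at=0.9, borderline_at=0.3):
    """Map each score to a coarse difficulty label (thresholds are hypothetical)."""
    labels = []
    for s in scores:
        if s >= noisy_at:
            labels.append("noisy")
        elif s >= borderline_at:
            labels.append("borderline")
        else:
            labels.append("safe")
    return labels


# Toy data: a class-0 cluster, a class-1 cluster, and one class-1 point
# (index 4) placed in the middle of the class-0 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

scores = knn_disagreement(X, y, k=3)
labels = categorize(scores)
for point, label in zip(X, labels):
    print(point, label)
```

A pre-processing step could then, for example, remove or relabel instances flagged as noisy and oversample around borderline minority instances. How the proposed algorithm actually treats each category is not stated in the abstract.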

Keywords

Classification, class imbalance, data complexity, overlapping, Bayes error, pre-processing, learning algorithms

