Critical Instances Removal based Under-Sampling (CIRUS): A solution for class imbalance problem

Abstract

Class imbalance is one of the most critical issues in real-world applications. Imbalanced data sets are common across domains including banking, health care and finance. When a typical classification algorithm is trained on such data sets, it tends to be biased towards the majority class. The learning task becomes more challenging when instances from different classes also overlap. In this paper, we propose Critical Instances Removal based Under-Sampling (CIRUS), an undersampling framework for binary classification datasets that removes overlapped data points. Our method is designed to identify and eliminate majority class instances from the overlapping region. Accurate identification and elimination of these instances maximises the visibility of the minority class instances while minimising excessive elimination of data, which reduces loss of information. Extensive experiments using simulated and real-world datasets were carried out, and the results show performance comparable with state-of-the-art methods across common metrics, with exceptional and statistically significant improvements in sensitivity.
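The general idea of removing majority class instances from an overlapping region can be illustrated with a minimal sketch. This is not the CIRUS procedure itself, which the abstract does not specify. The function name `remove_overlapping_majority`, the parameter `k`, and the rule "a majority instance is overlapping if any of its k nearest neighbours is a minority instance" are all assumptions made for illustration, as is the toy data.

```python
import math


def remove_overlapping_majority(X, y, k=3, minority_label=1):
    """Drop majority points whose k nearest neighbours include a minority point.

    Hypothetical helper: the overlap criterion here is an assumed k-NN rule,
    not the criterion used by CIRUS.
    """
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority_label:
            keep.append(i)  # minority instances are always retained
            continue
        # Euclidean distances from this majority point to every other point.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Assumed rule: any minority neighbour marks the point as overlapping.
        if not any(y[j] == minority_label for j in neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (illustrative only): four majority points near the origin, three
# minority points near (5, 5), and one majority point inside the minority region.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (5.2, 5.2)]
y = [0, 0, 0, 0, 1, 1, 1, 0]

X_res, y_res = remove_overlapping_majority(X, y, k=3)
```

In this toy case only the majority point at (5.2, 5.2) is removed, so the minority instances are no longer surrounded by majority data while the rest of the majority class is left intact.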

Keywords

Imbalanced dataset, undersampling, k-NN, class overlap, classification

