Abstract
The classification of imbalanced datasets has received much attention in recent years. The imbalance problem usually occurs when the ratio between classes is high, and many techniques have been developed to tackle it in supervised learning. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most effective over-sampling methods for this problem; it changes the distribution of the training set to balance the number of examples in each class. However, SMOTE randomly synthesizes minority instances along the line joining a minority instance and one of its selected nearest neighbors, ignoring nearby majority instances and isolated points, which can degrade the final classification result. In this paper, we propose two improved techniques that extend SMOTE with sparse representation theory: Sparse-SMOTE and SROT (Sparse Representation based Over-sampling Technique). Sparse-SMOTE replaces the k-nearest-neighbor selection of SMOTE with sparse representation, while SROT uses a sparse dictionary to create synthetic samples directly. Experiments are performed on 10 UCI datasets using C4.5 as the learning algorithm. The results show that both proposed methods achieve better TP-Rate, F-Measure, G-Mean, and AUC values, and that they are more effective than SMOTE and several other approaches.
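As a rough illustration of the interpolation step that the abstract attributes to SMOTE (not the authors' code, and not the proposed Sparse-SMOTE or SROT), the following minimal NumPy sketch generates synthetic minority samples along the line joining a seed point and one of its k nearest minority neighbors; the function name `smote_like_oversample` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=None):
    """Create n_new synthetic samples by SMOTE-style interpolation:
    pick a minority seed, pick one of its k nearest minority neighbors,
    and return a random point on the segment joining them."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                          # random minority seed
        dists = np.linalg.norm(X - X[i], axis=1)     # distances to all minority points
        neighbors = np.argsort(dists)[1:k + 1]       # k nearest neighbors (skip the seed itself)
        j = rng.choice(neighbors)                    # one neighbor chosen at random
        gap = rng.random()                           # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i])) # point on the joining line
    return np.array(synthetic)

# Example: augment a toy minority class of 20 four-dimensional points with 40 synthetic samples.
X_min = np.random.default_rng(0).normal(size=(20, 4))
X_new = smote_like_oversample(X_min, n_new=40, k=5, seed=0)
print(X_new.shape)  # (40, 4)
```

This sketch makes the limitation noted above concrete: the neighbor search and interpolation use only minority points, so nearby majority instances and isolated minority points are never taken into account.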
