Sampling based hybrid algorithms for imbalanced data classification

Abstract

Microarray technology measures the expression levels of tens of thousands of genes simultaneously, which helps in diagnosing diseases, particularly cancer, at the molecular level. One of the most challenging issues associated with this technology is the skewed class distribution of the resulting datasets, which makes traditional classifiers inefficient at producing accurate classification results. A considerable body of existing work addresses this issue for binary-class problems. This paper combines three sampling techniques, namely oversampling, undersampling, and SMOTE, with the meta-learning algorithm DECORATE to deal with a highly imbalanced multi-class microarray cancer dataset. For imbalanced problems, the classification accuracy of a predictive model cannot be considered an appropriate measure of effectiveness, so different metrics are applied here to measure the performance of the proposed hybrid classification methods. The experimental results show that, unlike traditional classification algorithms, the proposed hybrid methods are not sensitive to a highly skewed multi-class microarray dataset.
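As a rough illustration of the three sampling strategies named in the abstract, and of why plain accuracy can mislead on skewed classes, the following Python sketch may help. It is not the authors' implementation: the function names, the toy data, the neighbour count `k`, and the choice of macro-averaged recall as the alternative metric are all assumptions made here for illustration, and the DECORATE ensemble step that would consume the resampled data is not shown.

```python
import random
from collections import Counter, defaultdict

def group_by_class(X, y):
    """Collect feature rows under their class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return groups

def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

def random_undersample(X, y, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out

def smote_like(X, y, rng, k=3):
    """SMOTE-style synthesis: interpolate between a row and one of its
    k nearest same-class neighbours. Classes with one row are left as-is."""
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        synthetic = []
        while len(rows) > 1 and len(rows) + len(synthetic) < target:
            base = rng.choice(rows)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
        X_out.extend(rows + synthetic)
        y_out.extend([label] * (len(rows) + len(synthetic)))
    return X_out, y_out

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits, totals = Counter(), Counter(y_true)
    for t, p in zip(y_true, y_pred):
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical toy data: three classes with 12, 4, and 2 samples.
rng = random.Random(0)
y = ["A"] * 12 + ["B"] * 4 + ["C"] * 2
X = [[rng.random(), rng.random()] for _ in y]

print(Counter(random_oversample(X, y, rng)[1]))
print(Counter(random_undersample(X, y, rng)[1]))
print(Counter(smote_like(X, y, rng)[1]))

# A classifier that always predicts the majority class looks acceptable
# on accuracy but poor on macro-averaged recall.
majority_pred = ["A"] * len(y)
print(accuracy(y, majority_pred), macro_recall(y, majority_pred))
```

In this sketch each resampler returns a class-balanced training set that could then be passed to any base learner or ensemble. The final two lines show the gap between accuracy and a class-averaged metric for a trivial majority-class predictor.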

Keywords

Imbalanced problem, SMOTE, under sampling, over sampling, resampling

