Data reduction and stacking for imbalanced data classification

Abstract

Class imbalance arises when the number of examples belonging to one class is much greater than the number belonging to another. The approach discussed here combines several techniques, including data reduction and stacking, with the aim of improving classification performance on imbalanced data. The paper proposes a cluster-based data reduction approach in which instances are selected from clusters, the reduction is carried out on instances belonging to the majority classes, and the aim of instance selection is to reduce the imbalance ratio between the majority and minority classes. Instance selection is carried out using an agent-based population learning algorithm. Stacking is used to increase the performance and generalization ability of the prototype-based classifier. The proposed approach is validated experimentally on several benchmark datasets from the KEEL repository. The advantages and main features of the approach are discussed in light of the results of the computational experiment.
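
The cluster-based reduction idea can be illustrated with a minimal sketch. It is not the authors' method. It assumes a plain k-means clustering of the majority class and keeps, from each cluster, the single instance nearest to the centroid, whereas the paper selects instances with an agent-based population learning algorithm. The function names (`imbalance_ratio`, `kmeans`, `reduce_majority`), the choice of `k`, and the toy data are all hypothetical, and the stacking step is not covered.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumption; the paper's clustering may differ)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters, centroids


def reduce_majority(X, y, majority_label, k, seed=0):
    """Keep one majority instance per cluster; leave other classes untouched."""
    majority = [x for x, lab in zip(X, y) if lab == majority_label]
    rest = [(x, lab) for x, lab in zip(X, y) if lab != majority_label]
    clusters, centroids = kmeans(majority, k, seed=seed)
    # Stand-in selection rule: the member closest to its cluster centroid.
    prototypes = [
        min(members, key=lambda p: sq_dist(p, c))
        for members, c in zip(clusters, centroids)
        if members
    ]
    X_red = [x for x, _ in rest] + prototypes
    y_red = [lab for _, lab in rest] + [majority_label] * len(prototypes)
    return X_red, y_red


# Toy data: 40 majority points (label 0) and 5 minority points (label 1).
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(5)]
y = [0] * 40 + [1] * 5

X_red, y_red = reduce_majority(X, y, majority_label=0, k=5)
print("imbalance ratio before:", imbalance_ratio(y))
print("imbalance ratio after: ", imbalance_ratio(y_red))
```

Setting `k` to the minority class size pushes the imbalance ratio towards 1. Whether that is the right target is a design choice this sketch does not settle.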

Keywords

Instance selection, clustering, stacking, imbalanced data, team of agents
