A combined approach to tackle imbalanced data sets

Abstract

Learning from imbalanced data often leads to high error rates, and several approaches have been developed to address this problem. This paper introduces a new learning model that integrates the C4.5 classifier with evolutionary algorithms. To strengthen the model, two separate partitions are drawn from each original data set using two distinct partitioning schemes proposed in this investigation, and the learning model uses them in sequence. More specifically, the hybrid system first applies the base method (C4.5) to produce a set of rules (R) from a training set (T₁) constructed by the first partitioning scheme. R is then passed to a genetic algorithm, which discovers another set of rules (R_GA) from a second, disjoint training set (T₂) chosen by the second partitioning scheme. Finally, some informative rules of R_GA are added to R. The system is tested on several real data sets from the UCI Machine Learning Repository and compared with standard C4.5. Experimental results show that the system is well suited to imbalanced data sets and that it does not degrade performance on balanced data sets.
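The control flow described above (base rules R from T₁, a genetic search for R_GA on a disjoint T₂, then merging informative rules into R) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify the two partitioning schemes, the rule encoding, the genetic operators, or the criterion for an "informative" rule. Here the split is a placeholder even/odd split, the base learner is a simple mean-threshold stand-in rather than C4.5, labels are assumed binary (0/1), and "informative" is assumed to mean a per-rule F1 score on T₂ above a hypothetical cutoff `min_f1`. All function names are hypothetical.

```python
import random
from typing import List, Tuple

Row = Tuple[List[float], int]   # (feature vector, class label in {0, 1})
Rule = Tuple[int, float, int]   # "if x[feature] > threshold then predict label"

def f1_on(rule: Rule, data: List[Row]) -> float:
    """F1 score of a single rule for the class it predicts (assumed fitness)."""
    f, thr, label = rule
    covered = [y for x, y in data if x[f] > thr]
    tp = sum(1 for y in covered if y == label)
    if tp == 0:
        return 0.0
    positives = sum(1 for _, y in data if y == label)
    p, r = tp / len(covered), tp / positives
    return 2 * p * r / (p + r)

def base_rules(t1: List[Row]) -> List[Rule]:
    """Stand-in for C4.5 (NOT C4.5 itself): one mean-threshold rule per feature."""
    rules = []
    for f in range(len(t1[0][0])):
        thr = sum(x[f] for x, _ in t1) / len(t1)
        above = [y for x, y in t1 if x[f] > thr]
        label = max(sorted(set(above)), key=above.count) if above else 0
        rules.append((f, thr, label))
    return rules

def evolve_rules(seed_rules: List[Rule], t2: List[Row], generations: int = 15,
                 pop_size: int = 12, seed: int = 0) -> List[Rule]:
    """Toy genetic algorithm seeded with R and evaluated on T2."""
    rng = random.Random(seed)
    pop = list(seed_rules)
    while len(pop) < pop_size:                      # fill with perturbed seeds
        f, thr, label = rng.choice(seed_rules)
        pop.append((f, thr + rng.gauss(0.0, 1.0), label))
    for _ in range(generations):
        pop.sort(key=lambda r: f1_on(r, t2), reverse=True)
        parents = pop[: max(2, pop_size // 2)]      # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            thr = (a[1] + b[1]) / 2 if a[0] == b[0] else a[1]   # crossover
            thr += rng.gauss(0.0, 0.5)                           # mutation
            label = a[2] if rng.random() > 0.1 else 1 - a[2]     # rare label flip
            children.append((a[0], thr, label))
        pop = parents + children
    return pop

def merge_informative(r: List[Rule], r_ga: List[Rule], t2: List[Row],
                      min_f1: float = 0.6) -> List[Rule]:
    """Append to R the rules of R_GA that pass the assumed F1 cutoff."""
    merged = list(r)
    for rule in r_ga:
        if rule not in merged and f1_on(rule, t2) >= min_f1:
            merged.append(rule)
    return merged

def run_pipeline(data: List[Row], min_f1: float = 0.6):
    # Placeholder disjoint split; the paper's partitioning schemes are not given here.
    t1, t2 = data[0::2], data[1::2]
    r = base_rules(t1)
    r_ga = evolve_rules(r, t2)
    return r, merge_informative(r, r_ga, t2, min_f1)
```

Calling `run_pipeline(data)` on a list of `(features, label)` rows returns the base rule set and the merged rule set. Scoring each rule by F1 for its own class, rather than by overall accuracy, is one plausible choice for imbalanced data, since accuracy alone rewards rules that ignore the minority class.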

Keywords

Hybrid; imbalanced; prediction accuracy; improvement
