Abstract
Learning from imbalanced data tends to produce high error rates, and several approaches have been developed to address this problem. In this paper, a new learning model that integrates the C4.5 classifier with evolutionary algorithms is introduced. To strengthen the model, two separate partitions are drawn from each original data set by applying two distinct partitioning schemes proposed in this investigation, and these partitions are used in sequence by the learning model. More specifically, the hybrid system first applies the base method (C4.5) to produce a set of rules (R) from a training set (T1) constructed by the first partitioning scheme. R is then passed to a genetic algorithm, which discovers another set of rules (RGA) from a second, disjoint training set (T2) determined by the second partitioning scheme. Finally, selected informative rules from RGA are added to R. The system is tested on several real data sets from the UCI Machine Learning Repository and compared with standard C4.5. Experimental results show that the system is well suited to imbalanced data sets and that it does not degrade performance on balanced data sets either.
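The data flow described above (R from T1, RGA from T2, then a merge) can be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the abstract does not specify the partitioning schemes, the genetic operators, the fitness function, or the criterion for an "informative" rule. The names `hybrid_rules`, `toy_induce`, `toy_evolve`, `toy_informative`, the rule encoding `(feature_index, threshold, label)`, and the 0.9 precision cut-off are all hypothetical stand-ins chosen to keep the example runnable.

```python
from typing import Callable, List, Tuple

Example = Tuple[Tuple[float, ...], int]   # (feature vector, class label)
Rule = Tuple[int, float, int]             # hypothetical: "if x[i] > t then label"


def hybrid_rules(
    t1: List[Example],
    t2: List[Example],
    induce: Callable[[List[Example]], List[Rule]],
    evolve: Callable[[List[Rule], List[Example]], List[Rule]],
    informative: Callable[[Rule, List[Example]], bool],
) -> List[Rule]:
    """Data flow from the abstract: R from T1, RGA from T2, then merge."""
    r = induce(t1)            # base learner (C4.5 in the paper)
    r_ga = evolve(r, t2)      # search seeded with R on the disjoint set T2
    extra = [rule for rule in r_ga if rule not in r and informative(rule, t2)]
    return r + extra


def toy_induce(data: List[Example]) -> List[Rule]:
    # Stand-in for C4.5 (assumption): one threshold rule at the mean of feature 0.
    mean = sum(x[0] for x, _ in data) / len(data)
    return [(0, mean, 1)]


def toy_evolve(seed: List[Rule], data: List[Example]) -> List[Rule]:
    # Stand-in for the genetic algorithm (assumption): shift each seed
    # threshold by a fixed step instead of running crossover and mutation.
    return [(i, t + step, y) for (i, t, y) in seed for step in (-1.0, 1.0)]


def toy_informative(rule: Rule, data: List[Example]) -> bool:
    # Assumed criterion: precision of at least 0.9 on the examples the rule covers.
    i, t, y = rule
    covered = [label for x, label in data if x[i] > t]
    return bool(covered) and sum(label == y for label in covered) / len(covered) >= 0.9


# Tiny made-up data sets, used only to exercise the control flow.
T1 = [((1.0,), 0), ((2.0,), 0), ((3.0,), 0), ((7.0,), 1)]
T2 = [((2.0,), 0), ((4.0,), 0), ((5.0,), 1), ((6.0,), 1)]

final_rules = hybrid_rules(T1, T2, toy_induce, toy_evolve, toy_informative)
print(final_rules)
```

Passing the learner, the search procedure, and the selection criterion as functions keeps the sketch agnostic about the details the abstract leaves open; a real C4.5 implementation and a real genetic algorithm would replace the toy stand-ins.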
