Abstract
We studied three methods for improving the identification of small, difficult-to-classify classes by balancing an imbalanced class distribution through data reduction. In experiments on ten real-world data sets, the new method, the neighborhood cleaning (NCL) rule, outperformed both simple random sampling within classes and the one-sided selection method. All three reduction methods clearly improved the identification of small classes, raising the true-positive rates of the three-nearest-neighbor method and the C4.5 decision tree generator by 20--30%, although the differences between the methods were insignificant. However, significant differences in the accuracies, true-positive rates, and true-negative rates obtained from the reduced data favored our method. The results suggest that the NCL rule is useful both for improving the modeling of difficult small classes and for building classifiers that identify these classes in real-world data, which frequently have an imbalanced class distribution.
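The abstract does not spell out the NCL algorithm itself. Based on how neighborhood cleaning is commonly described in the literature (an edited-nearest-neighbor step on the majority class, extended to also remove majority-class neighbors of misclassified minority examples), a minimal sketch might look as follows. The function name `ncl`, the toy coordinates, and the use of Euclidean distance with k=3 are illustrative assumptions, not the authors' exact procedure.

```python
from collections import Counter
import math

def ncl(points, labels, k=3):
    """Sketch of a neighborhood cleaning rule (assumed ENN-style variant).

    Step 1: remove a majority-class point whose k nearest neighbors
            would misclassify it.
    Step 2: if a minority-class point is misclassified by its neighbors,
            remove the majority-class points among those neighbors.
    """
    majority = Counter(labels).most_common(1)[0][0]
    n = len(points)

    def neighbors(i):
        # Indices of the k nearest other points by Euclidean distance.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        return [j for _, j in dists[:k]]

    to_remove = set()
    for i in range(n):
        nbrs = neighbors(i)
        predicted = Counter(labels[j] for j in nbrs).most_common(1)[0][0]
        if predicted != labels[i]:
            if labels[i] == majority:
                # Step 1: edit out the misclassified majority example.
                to_remove.add(i)
            else:
                # Step 2: clean the majority examples in its neighborhood.
                to_remove.update(j for j in nbrs if labels[j] == majority)

    keep = [i for i in range(n) if i not in to_remove]
    return [points[i] for i in keep], [labels[i] for i in keep]
```

On a toy set with a majority cluster, a minority cluster, and one majority outlier sitting inside the minority cluster, the sketch removes only the outlier, leaving the small class intact while reducing the large one.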
