Abstract
In machine learning, classification involves identifying the categories or classes to which a new observation belongs based on a training set. The performance of a classification model is generally measured by its accuracy on a test set. The first step in developing a classification model is to divide an acquired dataset into training and test sets through random sampling. In general, random sampling does not guarantee that test accuracy reflects the performance of the developed classification model: if random sampling produces biased training/test sets, the resulting classification model may also be biased. In this study, we show the problems of random sampling and propose balanced sampling as an alternative. We also propose a measure for evaluating sampling methods. We perform empirical experiments using benchmark datasets to verify that our sampling algorithm produces proper training and test sets. The results confirm that our method produces better training and test sets than random sampling and several non-random sampling methods.
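The abstract does not specify the paper's balanced sampling algorithm, but the core idea it contrasts with random sampling can be sketched with a class-stratified split, in which each class contributes the same fraction of examples to the test set. The function names and the toy dataset below are illustrative assumptions, not the paper's method:

```python
# Sketch: plain random sampling vs. a class-stratified ("balanced") split.
# This is an illustration of the general idea, not the paper's algorithm.
import random
from collections import Counter, defaultdict

def random_split(labels, test_frac=0.3, seed=0):
    """Plain random sampling: class proportions are not guaranteed."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    cut = int(len(idx) * test_frac)
    return idx[cut:], idx[:cut]          # train indices, test indices

def balanced_split(labels, test_frac=0.3, seed=0):
    """Stratified sampling: each class contributes test_frac of its items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_frac))
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return train, test

# Toy imbalanced dataset: 90 examples of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
tr, te = balanced_split(labels)
print(Counter(labels[i] for i in te))    # → Counter({0: 27, 1: 3})
```

With a 9:1 class imbalance, the stratified split always keeps that ratio in the test set, whereas plain random sampling may over- or under-represent the minority class, which is the kind of bias the abstract argues against.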
