Abstract
Much of the research literature in data mining and machine learning has focused on developing classification models for various application-specific learning tasks. In contrast, the characteristics of the underlying data, and their impact on learning, have received far less attention. While it is generally understood that imbalanced, noisy, and relatively small datasets make classification tasks more difficult, there has been, to our knowledge, no comprehensive examination of how these important and commonly encountered dataset characteristics affect the learning process. In this work, we present a comprehensive empirical analysis of learning from imbalanced, limited, and noisy data. We report the performance of 11 commonly used learning algorithms and the effects of dataset size, class distribution, noise level, and noise distribution on each learner. Across this study, in which over one million classification models were built, we identify which learners are most robust to changes in each of these experimental factors, using two different performance metrics. Our results show that each of these factors plays a critical role in learner performance, with some learners exhibiting much greater stability than others.