Abstract
Data gathered from the real world often contains label noise, which degrades data quality, and any data mining process deteriorates when it is applied to noisy data. In this paper, a new approach is proposed to improve data quality by correcting mislabeled data. The proposed method first estimates the level of noise in the data and then combines this estimate with a correction process, which applies a clustering method and a k-nearest-neighbors approach. Extensive experimental results on real-world data sets from the UCI Machine Learning Repository are provided. The empirical study shows that our approach improves data quality in many cases and outperforms several existing correction methods.
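To make the general idea concrete, the following is a minimal sketch of neighbor-based label correction. It is not the algorithm proposed in this paper: the clustering step is omitted, and the function names (`knn_majority`, `estimate_noise`, `correct_labels`), the Euclidean distance, and the `k` and `min_agreement` parameters are all assumptions chosen for illustration. The sketch treats an instance as suspicious when its label disagrees with the majority label of its k nearest neighbors, uses the fraction of such instances as a rough noise estimate, and relabels only when the neighbors agree strongly enough.

```python
from collections import Counter
import math


def knn_majority(points, labels, i, k):
    """Return (majority_label, vote_fraction) over the k nearest neighbors of point i.

    Assumes len(points) > k so that k neighbors are always available.
    """
    # Sort all other points by Euclidean distance to point i.
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points))
        if j != i
    )
    votes = Counter(labels[j] for _, j in dists[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


def estimate_noise(points, labels, k=3):
    """Illustrative noise estimate: fraction of points whose label
    disagrees with the majority label of their k nearest neighbors."""
    disagreements = sum(
        1
        for i in range(len(points))
        if knn_majority(points, labels, i, k)[0] != labels[i]
    )
    return disagreements / len(points)


def correct_labels(points, labels, k=3, min_agreement=1.0):
    """Relabel a point only when at least `min_agreement` of its
    k nearest neighbors share a label different from its own."""
    corrected = list(labels)
    for i in range(len(points)):
        # Votes are taken from the original labels, not the partially corrected ones.
        majority, fraction = knn_majority(points, labels, i, k)
        if majority != labels[i] and fraction >= min_agreement:
            corrected[i] = majority
    return corrected


# Toy data (hypothetical): two well-separated groups, with (1, 1) mislabeled as "b".
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

noise = estimate_noise(points, labels, k=3)
fixed = correct_labels(points, labels, k=3)
```

On this toy data, only the point at (1, 1) disagrees with its neighbors, so it is the single instance that is counted as noisy and relabeled. A stricter `min_agreement` makes the correction more conservative, which matters because an overly eager corrector can introduce new label errors near class boundaries.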
