Improving data quality with label noise correction

Abstract

Data gathered from the real world often contain label noise, which degrades data quality, and any data mining process deteriorates when it is applied to noisy data. This paper proposes a new approach that improves data quality by correcting mislabeled instances. The proposed method estimates the level of noise in the data and combines this estimate with a correction process that applies a clustering method and a k-nearest-neighbors approach. Extensive experiments on real-world data sets from the UCI Machine Learning Repository show that the approach improves data quality in many cases and outperforms several existing correction methods.
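To make the idea of pairing a noise-rate estimate with a correction step concrete, the following is a minimal illustrative sketch. It is not the authors' algorithm: the abstract does not specify the estimator or the correction rule, so the functions `knn_majority`, `estimate_noise_rate`, and `correct_labels` are hypothetical, the estimator (share of instances whose label disagrees with their neighbors' majority) is an assumption, and the clustering step is omitted for brevity.

```python
import math
from collections import Counter


def knn_majority(points, labels, i, k):
    """Return (majority label, agreement) over the k nearest neighbours of point i."""
    dists = sorted(
        (math.dist(points[i], points[j]), j) for j in range(len(points)) if j != i
    )
    votes = Counter(labels[j] for _, j in dists[:k])
    majority, _ = votes.most_common(1)[0]
    # Share of neighbours that carry the same label as point i.
    agreement = votes[labels[i]] / k
    return majority, agreement


def estimate_noise_rate(points, labels, k=3):
    """Assumed estimator: fraction of points whose label differs from the kNN majority."""
    disagree = sum(
        1
        for i in range(len(points))
        if knn_majority(points, labels, i, k)[0] != labels[i]
    )
    return disagree / len(points)


def correct_labels(points, labels, k=3):
    """Relabel the most suspicious points, capped by the estimated noise rate."""
    rate = estimate_noise_rate(points, labels, k)
    budget = round(rate * len(points))  # at most this many labels are changed
    suspects = []
    for i in range(len(points)):
        majority, agreement = knn_majority(points, labels, i, k)
        if majority != labels[i]:
            suspects.append((agreement, i, majority))
    suspects.sort()  # lowest neighbour agreement first
    corrected = list(labels)
    for _, i, majority in suspects[:budget]:
        corrected[i] = majority
    return corrected, rate


# Toy data: two well-separated groups with one mislabeled point at index 3.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]
corrected, rate = correct_labels(points, labels, k=3)
print(rate, corrected)
```

In this sketch the estimated noise rate acts as a budget on how many labels may be changed, which is one plausible way an estimate and a correction step could be combined; the paper itself may couple them differently.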

Keywords

Label noise, noise correction, noise rate estimation, classification
