SMOTEFRIS-INFFC: Handling the challenge of borderline and noisy examples in imbalanced learning for software defect prediction

Abstract

The objective of Software Defect Prediction (SDP) is to identify modules that are prone to defects. This is achieved by training prediction models on datasets obtained by mining historical software repositories. Data acquired in this way often exhibit class imbalance, that is, an unequal representation of the classes among the examples. We hypothesize that class imbalance is not a problem in itself and that the decrease in performance is also influenced by other factors related to the class distribution of the data. One of these is the presence of noisy and borderline examples. Thus, the objective of our research is to propose a novel preprocessing method that combines the Synthetic Minority Over-Sampling Technique (SMOTE), Fuzzy-rough Instance Selection type II (FRIS-II), and the Iterative Noise Filter based on the Fusion of Classifiers (INFFC) to overcome these problems. The experimental results show that the new proposal significantly outperformed all the methods compared in this study.

Keywords

Software defect prediction, data sampling, fuzzy rough set, noise filtering
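
To illustrate the oversampling step named in the abstract, the following is a minimal sketch of SMOTE-style interpolation, assuming numeric feature vectors and Euclidean distance. The function name `smote`, its parameters, and the random choice of base examples are illustrative assumptions and are not taken from the article; the FRIS-II and INFFC filtering stages are not sketched because the abstract gives no details about them.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority examples by interpolating between
    a minority example and one of its k nearest minority neighbours.

    Simplified sketch: base examples are drawn at random rather than
    visited in order, and only numeric features are supported.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority examples, nearest first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    defective = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(defective, n_synthetic=3, k=2):
        print(point)
```

Every synthetic example lies on a line segment between two existing minority examples, which is why interpolating near noisy or borderline examples can place new points inside the majority region. That effect is the motivation for following the oversampling step with instance selection and noise filtering.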

