Logistic discrimination based on G-mean and F-measure for imbalanced problem

Abstract

As a well-known statistical method, logistic discrimination has been used successfully in many practical applications, including medical diagnosis and personal credit assessment. In this paper, we apply this model to the imbalanced problem, also referred to as the skewed or rare class problem, which is characterized by having many more instances of one class (the negative or majority class) than of the other (the positive or minority class). Traditional logistic discrimination, however, pursues high accuracy under the assumption that all classes are of similar size, so positive instances are often overlooked and misclassified as negative. To account for class imbalance, we revisit the two basic measures for the imbalanced problem, g-mean and f-measure, and design two new cost functions, a g-mean based metric (GM) and an f-measure based metric (FM), to supervise logistic discrimination in learning the corresponding parameters. Like g-mean, GM is a geometric-mean estimate of the recall of the positive and negative classes; like f-measure, FM is a harmonic mean of the recall and precision of the positive class. Experiments on UCI data sets show that the proposed method has a significant advantage over state-of-the-art classification methods on all metrics used in this paper, including accuracy, recall, f-measure and g-mean.
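To make the two underlying measures concrete, the following minimal Python sketch computes the standard g-mean and f-measure from predicted labels. It illustrates only the evaluation measures named in the abstract, not the GM and FM cost functions themselves, whose exact form is not given here. The function names and the toy data are assumptions chosen for illustration.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def g_mean(tp, fp, tn, fn):
    """Geometric mean of the recall of the positive and negative classes."""
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(recall_pos * recall_neg)

def f_measure(tp, fp, tn, fn):
    """Harmonic mean of the precision and recall of the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy imbalanced data (illustrative only): 10 positives, 90 negatives.
y_true = [1] * 10 + [0] * 90

# A classifier that always predicts the majority class.
all_negative = [0] * 100
counts = confusion_counts(y_true, all_negative)
accuracy = (counts[0] + counts[2]) / len(y_true)
print(accuracy, g_mean(*counts), f_measure(*counts))
```

On this toy data the majority-class predictor reaches an accuracy of 0.9 while its g-mean and f-measure are both 0, which is the failure mode of accuracy-driven training that the abstract describes.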

Keywords

Imbalanced problem, g-mean, f-measure, logistic discrimination

