RBSP-Boosting: A Shapley value-based resampling approach for imbalanced data classification

Abstract

Class distributions are often imbalanced in real applications, and traditional classifiers tend to secure accuracy on the majority class while neglecting the minority class. To address these problems, this paper proposes RBSP-Boosting, a method for imbalanced data classification. First, RBSP-Boosting calculates a Shapley value for each sample in the dataset using the truncated Monte Carlo method. It then removes noisy data according to the Shapley values and undersamples the majority-class samples whose Shapley values are less than zero. Next, it takes the Shapley value as the weight of each sample and oversamples the minority class according to these weights. Finally, an AdaBoost classifier is trained on the resampled dataset. Experiments are conducted on nine groups of UCI and KEEL datasets, and RBSP-Boosting is compared with four sampling algorithms: Random-OverSampler, SMOTE, Borderline-SMOTE and SVM-SMOTE. Relative to the best result among the four comparison algorithms, RBSP-Boosting improves AUC, F-score and G-mean by 4.69%, 10.3% and 7.86%, respectively. These results indicate that the proposed method can significantly improve classification performance on imbalanced data.
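The resampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the utility function, the truncation tolerance, the noise criterion, or how oversampling is performed, so everything below is an assumption. In particular, the nearest-centroid scorer `centroid_accuracy`, the helper names `tmc_shapley` and `shapley_resample`, dropping every negative-valued majority sample, and oversampling by weighted duplication are all hypothetical choices made for illustration.

```python
import random


def centroid_accuracy(train, valid):
    """Utility: accuracy of a nearest-centroid classifier on `valid`.

    Assumed stand-in for the model used to score subsets. Samples are
    (features, label) pairs; an empty training set scores 0.0.
    """
    if not train:
        return 0.0
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return sum(predict(x) == y for x, y in valid) / len(valid)


def tmc_shapley(train, valid, n_perms=100, tol=0.01, seed=0):
    """Truncated Monte Carlo estimate of each training sample's Shapley value."""
    rng = random.Random(seed)
    full = centroid_accuracy(train, valid)
    values = [0.0] * len(train)
    for t in range(1, n_perms + 1):
        perm = list(range(len(train)))
        rng.shuffle(perm)
        subset, prev = [], centroid_accuracy([], valid)
        for idx in perm:
            if abs(full - prev) < tol:
                marginal = 0.0  # truncation: remaining samples get zero
            else:
                subset.append(train[idx])
                new = centroid_accuracy(subset, valid)
                marginal, prev = new - prev, new
            # running mean of marginal contributions over permutations
            values[idx] += (marginal - values[idx]) / t
    return values


def shapley_resample(train, values, minority, seed=0):
    """Assumed rule: drop majority samples with negative value, then duplicate
    minority samples with probability proportional to their (clipped) value."""
    rng = random.Random(seed)
    majority = [s for s, v in zip(train, values) if s[1] != minority and v >= 0]
    minority_pairs = [(s, v) for s, v in zip(train, values) if s[1] == minority]
    samples = [s for s, _ in minority_pairs]
    weights = [max(v, 0.0) + 1e-9 for _, v in minority_pairs]
    n_extra = max(0, len(majority) - len(samples))
    return majority + samples + rng.choices(samples, weights=weights, k=n_extra)


# Toy data (hypothetical): 20 majority points, 5 minority points, and one
# mislabeled point placed inside the minority cluster.
_rng = random.Random(42)


def cloud(cx, cy, label, n):
    return [((_rng.gauss(cx, 0.5), _rng.gauss(cy, 0.5)), label) for _ in range(n)]


train = cloud(0, 0, 0, 20) + cloud(3, 3, 1, 5) + [((3.0, 3.0), 0)]
valid = cloud(0, 0, 0, 10) + cloud(3, 3, 1, 10)
values = tmc_shapley(train, valid)
balanced = shapley_resample(train, values, minority=1)
```

In this sketch the final step would be to fit a boosting classifier on `balanced`, for example scikit-learn's `AdaBoostClassifier`; that step is omitted to keep the example dependency-free.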

Keywords

Shapley value, resampling, imbalanced data, Monte Carlo
