Imbalanced data prediction model based on self-attention mechanism and generative adversarial network

Abstract

Imbalanced data distributions bias traditional machine learning classification algorithms toward the characteristics of the majority class, resulting in poor classification performance on minority-class data. To improve the classification accuracy of minority classes in imbalanced data, this study proposed a novel model: a generative adversarial network with self-attention mechanism oversampling based on a convolutional neural network (GAN-SAMO-CNN). The self-attention mechanism (SAM) of this model focused on the correlations among data elements of the minority class. The degree of correlation was first obtained by calculating attention scores, which enabled effective extraction of the distribution characteristics of the data. Subsequently, a generative adversarial network (GAN) was used to generate samples with high similarity to the minority-class data, reducing the data imbalance. Finally, a CNN classification model was constructed to train on and predict the samples. The experimental results showed that the F1-score, G-mean, and area under the precision-recall curve (AUPRC) of the model were considerably better than those of other imbalanced data classification methods. The proposed method was then validated on multiple independent test datasets to demonstrate the model's generalizability and robustness.
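As an illustration of why the F1-score and G-mean are reported rather than plain accuracy, the following minimal sketch computes both from their standard definitions on a toy imbalanced label set. The function names (`confusion_counts`, `f1_and_gmean`) and the toy labels are hypothetical and are not taken from the study; AUPRC is omitted because it requires predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1-score, G-mean), treating `positive` as the minority class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Toy data (an assumption for illustration): 95 majority (0), 5 minority (1).
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class looks accurate...
always_majority = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)

# ...but both minority-aware metrics are 0.0 because no minority sample is found.
f1_naive, gmean_naive = f1_and_gmean(y_true, always_majority)

# A classifier that recovers 4 of 5 minority samples at the cost of 5 false alarms.
y_pred = [0] * 90 + [1] * 5 + [1] * 4 + [0]
f1_better, gmean_better = f1_and_gmean(y_true, y_pred)

print(accuracy, f1_naive, gmean_naive)
print(f1_better, gmean_better)
```

The first classifier shows the failure mode the abstract describes: high accuracy driven entirely by the majority class, with no minority-class performance at all.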

Keywords

imbalance; generative adversarial network; self-attention mechanism; convolutional neural network
