Abstract
Class imbalance is a persistent challenge in deep learning, often leading to suboptimal performance for underrepresented classes. This challenge is particularly pronounced in natural language processing (NLP) tasks, such as named entity recognition (NER), where traditional oversampling methods may overlook important linguistic nuances. In this study, we introduce a novel synonym-based oversampling technique that employs pre-trained Word2Vec embeddings to generate semantically coherent examples. This approach augments minority classes using contextually appropriate synonyms. Experiments on an imbalanced social media NER dataset demonstrate enhanced model performance, with improved recognition of named entities across diverse categories. By generating synthetic samples that closely mirror the original data’s semantic characteristics, our method offers a compelling solution to data imbalance in semantically driven NLP tasks. This research highlights the potential of semantic-based oversampling in enhancing the generalization capabilities of deep learning models for NER challenges.
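The core idea described above, replacing non-entity tokens in minority-class sentences with their nearest neighbors in embedding space, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embedding table, the capitalization-based entity check, and the `p_replace` parameter are all assumptions standing in for a real pre-trained Word2Vec model and NER label scheme.

```python
import random

# Toy embedding table standing in for pre-trained Word2Vec vectors
# (hypothetical values for illustration; a real pipeline would load
# e.g. gensim KeyedVectors trained on a large corpus).
EMBEDDINGS = {
    "city":    [0.9, 0.1, 0.0],
    "town":    [0.85, 0.15, 0.05],
    "village": [0.8, 0.2, 0.1],
    "visited": [0.1, 0.9, 0.2],
    "toured":  [0.15, 0.85, 0.25],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def nearest_synonym(word):
    """Return the most similar other word in the embedding table, if any."""
    if word not in EMBEDDINGS:
        return None
    vec = EMBEDDINGS[word]
    candidates = [(cosine(vec, v), w) for w, v in EMBEDDINGS.items() if w != word]
    return max(candidates)[1]

def oversample(tokens, p_replace=0.5, rng=None):
    """Create one synthetic sample by swapping tokens for their embedding
    neighbors with probability p_replace. Entity tokens (approximated here
    as capitalized words) are left untouched so the NER labels stay valid."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        syn = nearest_synonym(tok)
        if syn and not tok[0].isupper() and rng.random() < p_replace:
            out.append(syn)
        else:
            out.append(tok)
    return out

# Example: augment a minority-class sentence.
print(oversample(["Alice", "visited", "the", "city"], p_replace=1.0))
# → ['Alice', 'toured', 'the', 'town']
```

Because replacements are drawn from the embedding neighborhood of each word, the synthetic sentences stay semantically close to the originals, which is the property the abstract credits for the improved minority-class recognition.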
