Abstract
Class imbalance is a persistent challenge in deep learning, often leading to suboptimal performance for underrepresented classes. This challenge is particularly pronounced in natural language processing (NLP) tasks, such as named entity recognition (NER), where traditional oversampling methods may overlook important linguistic nuances. In this study, we introduce a novel synonym-based oversampling technique that employs pre-trained Word2Vec embeddings to generate semantically coherent examples. This approach augments minority classes using contextually appropriate synonyms. Experiments on an imbalanced social media NER dataset demonstrate enhanced model performance, with improved recognition of named entities across diverse categories. By generating synthetic samples that closely mirror the original data’s semantic characteristics, our method offers a compelling solution to data imbalance in semantically driven NLP tasks. This research highlights the potential of semantic-based oversampling in enhancing the generalization capabilities of deep learning models for NER challenges.
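The core idea described above, replacing non-entity tokens in minority-class sentences with their nearest neighbors in embedding space, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy embedding table, the capitalization-based entity check, and the `p_replace` parameter are all assumptions standing in for a real pre-trained Word2Vec model and NER label scheme.

```python
import random

# Toy embedding table standing in for pre-trained Word2Vec vectors
# (hypothetical values for illustration; a real pipeline would load
# e.g. gensim KeyedVectors trained on a large corpus).
EMBEDDINGS = {
    "city":    [0.9, 0.1, 0.0],
    "town":    [0.85, 0.15, 0.05],
    "village": [0.8, 0.2, 0.1],
    "visited": [0.1, 0.9, 0.2],
    "toured":  [0.15, 0.85, 0.25],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def nearest_synonym(word):
    """Return the most similar other word in the embedding table, if any."""
    if word not in EMBEDDINGS:
        return None
    vec = EMBEDDINGS[word]
    candidates = [(cosine(vec, v), w) for w, v in EMBEDDINGS.items() if w != word]
    return max(candidates)[1]

def oversample(tokens, p_replace=0.5, rng=None):
    """Create one synthetic sample by swapping tokens for their embedding
    neighbors with probability p_replace. Entity tokens (approximated here
    as capitalized words) are left untouched so the NER labels stay valid."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        syn = nearest_synonym(tok)
        if syn and not tok[0].isupper() and rng.random() < p_replace:
            out.append(syn)
        else:
            out.append(tok)
    return out

# Example: augment a minority-class sentence.
print(oversample(["Alice", "visited", "the", "city"], p_replace=1.0))
# → ['Alice', 'toured', 'the', 'town']
```

Because replacements are drawn from the embedding neighborhood of each word, the synthetic sentences stay semantically close to the originals, which is the property the abstract credits for the improved minority-class recognition.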
