Abstract
The advent of the Generative Pre-trained Transformer (GPT) has significantly impacted many downstream natural language processing (NLP) tasks, showcasing remarkable capabilities in language understanding and generation. It has demonstrated state-of-the-art performance in areas such as machine translation, text summarisation and question answering. In this study, we address the problem of skewed data sets, a challenge common to many NLP tasks. Conventional balancing techniques, including undersampling and oversampling, have well-known limitations and may result in biased predictions and insufficient representation of minority classes. In response, we propose an approach that harnesses GPT’s generative capabilities to synthesise samples for minority classes. We evaluate our approach on a downstream multi-class text classification task and demonstrate significant improvements over conventional techniques and state-of-the-art methods, improving accuracy by more than 10%. These findings underscore the potential of GPT to revolutionise data set balancing and thereby augment the performance of downstream NLP tasks.
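The abstract does not specify the prompting strategy used to generate synthetic samples. The following is a minimal sketch of the general idea, assuming an OpenAI-style chat-completion API; the function name, prompt wording, model choice, and balancing target (matching the majority class size) are illustrative assumptions, not the paper's exact method.

```python
# Sketch: GPT-based augmentation of minority classes in a labelled text data set.
# Assumes the openai Python client (v1.x) and an OPENAI_API_KEY in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def synthesise_minority_samples(texts, labels, model="gpt-3.5-turbo"):
    """Generate synthetic texts for under-represented classes until every
    class reaches the size of the largest one (an illustrative target)."""
    counts = Counter(labels)
    target = max(counts.values())
    augmented_texts, augmented_labels = list(texts), list(labels)
    for label, count in counts.items():
        # A few real examples of this class serve as style/content seeds.
        seeds = [t for t, l in zip(texts, labels) if l == label][:5]
        for _ in range(target - count):
            response = client.chat.completions.create(
                model=model,
                messages=[{
                    "role": "user",
                    "content": (
                        f"Write one new example belonging to class '{label}', "
                        "in the style of these samples:\n" + "\n".join(seeds)
                    ),
                }],
                temperature=0.9,  # higher temperature for diverse synthetic samples
            )
            augmented_texts.append(response.choices[0].message.content)
            augmented_labels.append(label)
    return augmented_texts, augmented_labels
```

In this sketch the augmented data set would then be passed to any standard multi-class classifier; unlike oversampling by duplication, each synthetic sample is a new text rather than a repeated one.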
