Abstract
The advent of the Generative Pre-trained Transformer (GPT) has significantly impacted many downstream natural language processing (NLP) tasks, showcasing remarkable capabilities in language understanding and generation. It has demonstrated state-of-the-art performance in areas such as machine translation, text summarisation and question answering. In this study, we address the problem of skewed data sets, a challenge common to many NLP tasks. Conventional balancing techniques, including undersampling and oversampling, have well-known limitations and may result in biased predictions and insufficient representation of minority classes. In response, we propose an approach that harnesses GPT’s generative capabilities to synthesise samples for minority classes. We evaluate our approach on a downstream multi-class text classification task and demonstrate significant improvements over conventional techniques and state-of-the-art methods, improving accuracy by more than 10%. These findings underscore the potential of GPT to revolutionise data set balancing and thereby augment the performance of downstream NLP tasks.
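The abstract does not specify the prompting strategy used to generate synthetic samples. The following is a minimal sketch of the general idea, assuming an OpenAI-style chat-completion API; the function name, prompt wording, model choice, and balancing target (matching the majority class size) are illustrative assumptions, not the paper's exact method.

```python
# Sketch: GPT-based augmentation of minority classes in a labelled text data set.
# Assumes the openai Python client (v1.x) and an OPENAI_API_KEY in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def synthesise_minority_samples(texts, labels, model="gpt-3.5-turbo"):
    """Generate synthetic texts for under-represented classes until every
    class reaches the size of the largest one (an illustrative target)."""
    counts = Counter(labels)
    target = max(counts.values())
    augmented_texts, augmented_labels = list(texts), list(labels)
    for label, count in counts.items():
        # A few real examples of this class serve as style/content seeds.
        seeds = [t for t, l in zip(texts, labels) if l == label][:5]
        for _ in range(target - count):
            response = client.chat.completions.create(
                model=model,
                messages=[{
                    "role": "user",
                    "content": (
                        f"Write one new example belonging to class '{label}', "
                        "in the style of these samples:\n" + "\n".join(seeds)
                    ),
                }],
                temperature=0.9,  # higher temperature for diverse synthetic samples
            )
            augmented_texts.append(response.choices[0].message.content)
            augmented_labels.append(label)
    return augmented_texts, augmented_labels
```

In this sketch the augmented data set would then be passed to any standard multi-class classifier; unlike oversampling by duplication, each synthetic sample is a new text rather than a repeated one.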
