Abstract
Social media platforms allow people across the globe to share their thoughts and opinions and to communicate conveniently with each other. Despite the many advantages of social media, it is also misused by some users for hate-mongering through toxic and offensive comments. Most earlier toxicity detection methods focus primarily on the English language, and there is a lack of research on low-resource languages and multilingual text data. We propose the XRBi-GAC framework, comprising XLM-RoBERTa, a Bi-GRU with self-attention, and capsule networks, for multilingual toxic text detection. A loss function is also presented that fuses binary cross-entropy loss and focal loss to address the class imbalance problem. We evaluated the proposed framework on two datasets, namely the Jigsaw Multilingual Toxic Comment dataset and the HASOC 2019 dataset, and achieved F1-scores of 0.865 and 0.829, respectively. The experimental results show that the proposed framework outperforms the state-of-the-art multilingual models XLM-RoBERTa and mBERT on both datasets, demonstrating the versatility and robustness of the proposed XRBi-GAC framework.
