Intelligent detection of hate speech in Arabic social network: A machine learning approach

Abstract

Cyber hate speech is growing rapidly and has become a serious problem worldwide, threatening the cohesion of civil societies. Hate speech is the use of expressions or phrases that are violent, offensive or insulting towards a person or a minority group. In the Arab region in particular, the number of social media users is growing rapidly, accompanied by a rapid increase in cyber hate speech. This motivated us to work towards healthy online environments that are free of hatred and discrimination. This article therefore aims to detect cyber hate speech in the Arabic context on the Twitter platform by applying Natural Language Processing (NLP) techniques and machine learning methods. The article considers a set of tweets related to racism, journalism, sports orientation, terrorism and Islam. Several types of features and emotions are extracted and arranged in 15 different combinations of data. The processed dataset is evaluated using Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT) and Random Forest (RF) classifiers, of which RF with the feature set of Term Frequency-Inverse Document Frequency (TF-IDF) and profile-related features achieves the best results. Furthermore, a feature importance analysis is conducted based on the RF classifier to quantify the predictive ability of the features with regard to the hate class.
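As an illustration of the best-performing feature set named in the abstract (TF-IDF combined with profile-related features), the following is a minimal sketch in plain Python. It is not the authors' implementation. The weighting `tf * log(N / df)` is one common textbook variant and is assumed here, since the abstract does not state the exact formula. The toy English tweets, the two profile features (follower count and account age) and the function names `tfidf_vectors` and `add_profile_features` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using an assumed tf * log(N / df) weighting."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


def add_profile_features(vectors, profiles):
    """Append numeric profile features (hypothetical) to each TF-IDF vector."""
    return [vec + list(extra) for vec, extra in zip(vectors, profiles)]


# Toy English placeholders standing in for preprocessed tweets.
tweets = ["team won match", "team lost match badly", "weather nice today"]
# Hypothetical profile features per tweet: (followers, account age in days).
profiles = [(120, 400), (15, 30), (980, 2100)]

vocab, tfidf = tfidf_vectors(tweets)
features = add_profile_features(tfidf, profiles)
```

Each row of `features` could then be passed to any of the classifiers listed in the abstract. In practice the profile columns would probably need scaling so that they do not dominate the much smaller TF-IDF weights.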

Keywords

Hate speech, machine learning, text vectorization, Twitter

