Sage Journals: Discover world-class research

Abstract

In recent years, the increasing propagation of hate speech on social media and the urgent need for effective counter-measures have drawn significant investment from governments, companies, and researchers. A large number of methods have been developed for automated hate speech detection online. This aims to classify textual content into non-hate or hate speech, in which case the method may also identify the targeting characteristics (i.e., types of hate, such as race, and religion) in the hate speech. However, we notice significant difference between the performance of the two (i.e., non-hate vs. hate). In this work, we argue for a focus on the latter problem for practical reasons. We show that it is a much more challenging task, as our analysis of the language in the typical datasets shows that hate speech lacks unique, discriminative features and therefore is found in the ‘long tail’ in a dataset that is difficult to discover. We then propose Deep Neural Network structures serving as feature extractors that are particularly effective for capturing the semantics of hate speech. Our methods are evaluated on the largest collection of hate speech datasets based on Twitter, and are shown to be able to outperform the best performing method by up to 5 percentage points in macro-average F1, or 8 percentage points in the more challenging case of identifying hateful content.

Keywords

Hate speech classification neural network CNN GRU skipped CNN deep learning natural language processing

1. Introduction

The exponential growth of social media such as Twitter and community forums has revolutionised communication and content publishing, but is also increasingly exploited for the propagation of hate speech and the organisation of hate-based activities [1,2]. The anonymity and mobility afforded by such media has made the breeding and spread of hate speech – eventually leading to hate crime – effortless in a virtual landscape beyond the realms of traditional law enforcement.

The term ‘hate speech’ was formally defined as ‘any communication that disparages a person or a group on the basis of some characteristics (to be referred to as types of hate or hate classes) such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics’ [29]. In the UK, there has been significant increase of hate speech towards the migrant and Muslim communities following recent events including leaving the EU, the Manchester and the London attacks [18]. In the EU, surveys and reports focusing on young people in the EEA (European Economic Area) region show rising hate speech and related crimes based on religious beliefs, ethnicity, sexual orientation or gender, as 80% of respondents have encountered hate speech online and 40% felt attacked or threatened [13]. Statistics also show that in the US, hate speech and crime is on the rise since the Trump election [30]. The urgency of this matter has been increasingly recognised, as a range of international initiatives have been launched towards the qualification of the problems and the development of counter-measures [14].

Building effective counter measures for online hate speech requires as the first step, identifying and tracking hate speech online. For years, social media companies such as Twitter, Facebook, and YouTube have been investing hundreds of millions of euros every year on this task [15,19,23], but are still being criticised for not doing enough. This is largely because such efforts are primarily based on manual moderation to identify and delete offensive materials. The process is labour intensive, time consuming, and not sustainable or scalable in reality [5,15,40].

A large number of research has been conducted in recent years to develop automatic methods for hate speech detection in the social media domain. These typically employ semantic content analysis techniques built on Natural Language Processing (NLP) and Machine Learning (ML) methods, both of which are core pillars of the Semantic Web research. The task typically involves classifying textual content into non-hate or hateful, in which case it may also identify the types of the hate speech. Although current methods have reported promising results, we notice that their evaluations are largely biased towards detecting content that is non-hate, as opposed to detecting and classifying real hateful content. A limited number of studies [2,32] have shown that, for example, state of the art methods that detect sexism messages can only obtain an F1 of between 15 and 60 percentage points lower than detecting non-hate messages. These results suggest that it is much harder to detect hateful content and their types than non-hate.1

¹
Even in a binary setting of the task (i.e., either a message is hate or not), the high accuracy obtainable on detecting non-hate does not automatically translate to high accuracy of the other task due to the highly imbalanced nature in such datasets, as we shall show later.
However, from a practical point of view, we argue that the ability to correctly (Precision) and thoroughly (Recall) detect and identify specific types of hate speech is more desirable. For example, social media companies need to flag up hateful content for moderation, while law enforcement need to identify hateful messages and their nature as forensic evidence.

Motivated by these observations, our work makes two major contributions to the research of online hate speech detection. First, we conduct a data analysis to quantify and qualify the linguistic characteristics of such content on the social media, in order to understand the challenging case of detecting hateful content compared to non-hate. By comparison, we show that hateful content exhibits a ‘long tail’ pattern compared to non-hate due to their lack of unique, discriminative linguistic features, and this makes them very difficult to identify using conventional features widely adopted in many language-based tasks. Second, we propose Deep Neural Network (DNN) structures that are empirically shown to be very effective feature extractors for identifying specific types of hate speech. These include two DNN models inspired and adapted from literature of other Machine Learning tasks: one that simulates skip-gram like feature extraction based on modified Convolutional Neural Networks (CNN), and another that extracts orderly information between features using Gated Recurrent Unit networks (GRU).

Evaluated on the largest collection of English Twitter datasets, we show that our proposed methods can outperform state of the art methods by up to 5 percentage points in macro-average F1, or 8 percentage points in the more challenging task of detecting and classifying hateful content. Our thorough evaluation on all currently available public Twitter datasets sets a new benchmark for future research in this area. And our findings encourage future work to take a renewed perspective, i.e., to consider the challenging case of long tail.

The remainder of this paper is structured as follows. Section 2 reviews related work on hate speech detection and other relevant fields; Section 3 describes our data analysis to understand the challenges of hate speech detection on Twitter; Section 4 introduces our methods; Section 5 presents experiments and results; and finally Section 6 concludes this work and discusses future work.
2. Related work

2.1. Terminology and scope

Recent years have seen an increasing number of research on hate speech detection as well as other related areas. As a result, the term ‘hate speech’ is often seen to co-exist or become mixed with other terms such as ‘offensive’, ‘profane’, and ‘abusive languages’, and ‘cyberbullying’. To distinguish them, we identify that hate speech

targets individual or groups on the basis of their characteristics;

demonstrates a clear intention to incite harm, or to promote hatred;

may or may not use offensive or profane words. For example: ‘Assimilate? No theyall need to go back to theirown countries. #BanMuslimsSorry if someone disagrees too bad.’

In contrast, ‘All you perverts (other than me) who posted today, needs to leave the O Board. Dfasdfdasfadfs’ is an example of abusive language, which often bears the purpose of insulting individuals or groups, and can include hate speech, derogatory and offensive language [28]. ‘i spend my money how i want bitch its my business’ is an example of offensive or profane language, which is typically characterised by the use of swearing or curse words. ‘Our class prom night just got ruined because u showed up. Who invited u anyway?’ is an example of bullying, which has the purpose to harass, threaten or intimidate typically individuals rather than groups.

In the following, we cover state of the art in all these areas with a focus on hate speech.2

²
We will indicate explicitly where works address a related problem rather than hate speech.
Our methods and experiments will only address hate speech, due to both dataset availability and the goal of this work.
2.2. Methods of hate speech detection and related problems

Existing methods primarily cast the problem as a supervised document classification task [37]. These can be divided into two categories: one relies on manual feature engineering that are then consumed by algorithms such as SVM, Naive Bayes, and Logistic Regression [2,9,12,17,21,25,38–42] (classic methods); the other represents the more recent deep learning paradigm that employs neural networks to automatically learn multi-layers of abstract features from raw data [10,15,28,32] (deep learning methods).

Classic methods require manually designing and encoding features of data instances into feature vectors, which are then directly used by classifiers.

Schmidt et al. [37] summarised several types of features used in the state of the art. Simple surface features such as bag of words, word and character n-grams have been used as fundamental features in hate speech detection [2,3,9,10,17,21,38–40], as well as other related tasks such as the detection of offensive and abusive content [5,25,28], discrimination [42], and cyberbullying [44]. Other surface features can include URL mentions, hashtags, punctuations, word and document lengths, capitalisation, etc [5,9,28]. Word generalisation includes the use of low-dimensional, dense vectorial word representations usually learned by clustering [38], topic modelling [41,44], and word embeddings [10,12,28,42] from unlabelled corpora. Such word representations are then used to construct feature vectors of messages. Sentiment analysis makes use of the degree of polarity expressed in a message [2,9,10,16]. Lexical resources are often used to look up specific negative words (such as slurs, insults, etc.) in messages [2,16,28,41]. Linguistic features utilise syntactic information such as Part of Speech (PoS) and certain dependency relations as features [2,5,9,16,44]. Meta-information refers to data about messages, such as gender identity of a user associated with a message [39,40], or high frequency of profane words in a user’s post history [8,41]. In addition, Knowledge-Based features such as messages mapped to stereotypical concepts in a knowledge base [11] and multimodal information such as image captions and pixel features [44] were used in cyberbullying detection but only in very confined context [37].

In terms of classifiers, existing methods are predominantly supervised. Among these, Support Vector Machines (SVM) is the most popular algorithm [2,5,9,17,25,38,41,42], while other algorithms such as Naive Bayes [5,9,21,25,42], Logistic Regression [9,12,25,39,40], and Random Forest [9,41] are also used.

Deep learning based methods employ deep artificial neural networks to learn abstract feature representations from input data through its multiple stacked layers for the classification of hate speech. The input can be simply the raw text data, or take various forms of feature encoding, including any of those used in the classic methods. However, the key difference is that in such a model the input features may not be directly used for classification. Instead, the multi-layer structure can be used to learn from the input, new abstract feature representations that prove to be more effective for learning. For this reason, deep learning based methods typically shift its focus from manual feature engineering to the network structure, which is carefully designed to automatically extract useful features from a simple input feature representation. Indeed we notice a clear trend in the literature that shifts towards the adoption of deep learning based methods and studies have also shown them to perform better than classic methods on this task [15,32]. Note that this categorisation excludes those methods [12,25,42] that used DNN to learn word or text embeddings and subsequently apply another classifier (e.g., SVM, logistic regression) to use such embeddings as features for classification. Instead, we focus on DNN methods that perform the classification task itself.

To the best of our knowledge, methods of this category include [1,10,15,32,43], all of which used simple word and/or character based one-hot encoding as input features to their models, while Vigna et al. [10] also used word polarity. The most popular network architectures are Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN), typically Long Short-Term Memory network (LSTM). In the literature, CNN is well known as an effective network to act as ‘feature extractors’, whereas RNN is good for modelling orderly sequence learning problems [31]. In the context of hate speech classification, intuitively, CNN extracts word or character combinations [1,15,32] (e.g., phrases, n-grams), RNN learns word or character dependencies (orderly information) in tweets [1,10].

In our previous work [43], we showed benefits of combining both structures in such tasks by using a hybrid CNN and GRU (Gated Recurrent Unit) structure. This work largely extends it in several ways. First, we adapt the model to multiple CNN layers; second, we propose a new DNN architecture based on the idea of extracting skip-gram like features for this task; third, we conduct data analysis to understand the challenges in hate speech detection due to the linguistic characteristics in the data; and finally, we perform an extended evaluation of our methods, particularly their capability on addressing these challenges.

2.3. Evaluation of hate speech detection methods

Evaluation of the performance of hate speech (and also other related content) detection typically adopts the classic Precision, Recall and F1 metrics. Precision measures the percentage of true positives among the set of hate speech messages identified by a system; Recall measures the percentage of true positives among the set of real hate speech messages we expect the system to capture (also called ‘ground truth’ or ‘gold standard’), and F1 calculates the harmonic mean of the two. The three metrics are usually applied to each class in a dataset, and often an aggregated figure is computed either using micro-average or macro-average. The first sums up the individual true positives, false positives, and false negatives identified by a system regardless of different classes to calculate overall Precision, Recall and F1 scores. The second takes the average of the Precision, Recall and F1 on different classes.

Existing studies on hate speech detection have primarily reported their results using micro-average Precision, Recall and F1 [1,15,32,39,40,43]. The problem with this is that in an unbalanced dataset where instances of one class (to be called the ‘dominant class’) significantly out-number others (to be called ‘minority classes’), micro-averaging can mask the real performance on minority classes. Thus a significantly lower or higher F1 score on a minority class (when compared to the majority class) is unlikely to cause significant change in micro-F1 on the entire dataset. As we will show in Section 3, hate speech detection is a typical task dealing with extremely unbalanced datasets, where real hateful content only accounts for a very small percentage of the entire dataset, while the large majority is non-hate but exhibits similar linguistic characteristics to hateful content. As argued before, practical applications often need to focus on detecting hateful content and identifying their types. In this case, reporting micro F1 on the entire dataset will not properly reflect a system’s ability to deal with hateful content as opposed to non-hate. Unfortunately, only a very limited number of work has reported performance on a per-class basis [3,32]. As an example, when compared to the micro F1 scores obtained on the entire dataset, the highest F1 score reported for detecting sexism messages is 47 percentage points lower in [3] while 11 points lower in [32]. This has largely motivated our study to understand what causes hate speech to be so difficult to classify from a linguistic point of view, and to evaluate hate speech detection methods by giving more focus on their capability of classifying real hateful content.

3. Dataset analysis – the case of long tail

We first start with an analysis of typical datasets used in the studies of hate speech detection on Twitter. From this we show the very unbalanced nature of such data, and compare the linguistic characteristics of hate speech against non-hate to discuss the challenge of detecting and classifying hateful content.

3.1. Public Twitter datasets

We use the collection of publicly available English Twitter datasets previously compiled in our work [43]. To our knowledge, this is the largest set (in terms of tweets) of Twitter based dataset used in hate speech detection. This includes seven datasets published in previous research. And all of these were collected based on the principle of keyword or hashtag filtering from the public Twitter stream. DT consolidates the dataset by [9] into two types, ‘hate’ and ‘non-hate’. The tweets do not have a focused topic but were collected using a controlled vocabulary of abusive words. RM contains ‘hate’ and ‘non-hate’ tweets focused on refugee and muslim discussions. WZ is initially published by [39] and contains ‘sexism’, ‘racism’, and ‘non-hate’; the same authors created another smaller dataset annotated by domain experts and amateurs separately. The authors showed that the two sets of annotations were different as the supervised classifiers obtained different results on them. We will use WZ-S.amt to denote the dataset annotated by amateurs, and WZ-S.exp to denote the dataset annotated by experts. WZ-S.gb merges the WZ-S.amt and WZ-S.exp datasets by taking the majority vote from both amateur and expert annotations where the expert was given double weights [15]. WZ.pj combines the WZ and the WZ-S.exp datasets [32]. All of the WZ-S.amt, WZ-S.exp, WZ-S.gb, and WZ.pj datasets contain ‘sexism’, ‘racism’, and ‘non-hate’ tweets, but also added a ‘both’ class that includes tweets considered to be both ‘sexism’ and ‘racism’. However, there are only several handful of instances of this class and they were found to be insufficient for model learning. Therefore, following [32] we exclude this class from these datasets. Table 1 summarises the statistics of these datasets.

Table 1
Statistics of datasets used in the experiment

Dataset #Tweets Classes (%)

WZ 16,093 racism (12%) sexism (19.6%) neither (68.4%)

WZ-S.amt 6,579 racism (1.9%) sexism (16.3%) neither (81.8%)

WZ-S.exp 6,559 racism (1.3%) sexism (11.8%) neither (86.9%)

WZ-S.gb 6,567 racism (1.4%) sexism (13.9%) neither (84.7%)

WZ.pj 18,593 racism (10.8%) sexism (20.3%) neither (68.9%)

DT 24,783 hate (5.8%) non-hate (94.2%)

RM 2,435 hate (17%) non-hate (83%)

Dataset	#Tweets	Classes (%)
WZ	16,093	racism (12%) sexism (19.6%) neither (68.4%)
WZ-S.amt	6,579	racism (1.9%) sexism (16.3%) neither (81.8%)
WZ-S.exp	6,559	racism (1.3%) sexism (11.8%) neither (86.9%)
WZ-S.gb	6,567	racism (1.4%) sexism (13.9%) neither (84.7%)
WZ.pj	18,593	racism (10.8%) sexism (20.3%) neither (68.9%)
DT	24,783	hate (5.8%) non-hate (94.2%)
RM	2,435	hate (17%) non-hate (83%)

3.2. Dataset analysis

As shown in Table 1, all datasets are significantly biased towards non-hate, as hate tweets account between only 5.8% (DT) and 31.6% (WZ). When we inspect specific types of hate, some can be even more scarce, such as ‘racism’ and as mentioned before, the extreme case of ‘both’. This has two implications. First, an evaluation measure such as the micro F1 that looks at a system’s performance on the entire dataset regardless of class difference can be biased to the system’s ability of detecting ‘non-hate’. In other words, a hypothetical system that achieves almost perfect F1 in identifying ‘racism’ tweets can still be overshadowed by its poor F1 in identifying ‘non-hate’, and vice versa. Second, compared to non-hate, the training data for hate tweets are very scarce. This may not be an issue that is easy to address as it seems, since the datasets are collected from Twitter and reflect the real nature of data imbalance in this domain. Thus to annotate more training data for hateful content we will almost certainly have to spend significantly more effort annotating non-hate. Also, as we shall show in the following, this problem may not be easily mitigated by conventional methods of over- or under-sampling. Because the real challenge is the lack of unique, discriminative linguistic characteristics in hate tweets compared to non-hate.

As a proxy to quantify and compare the linguistic characteristics of hate and non-hate tweets, we propose to study the ‘uniqueness’ of the vocabulary for each class. We argue that this can be a reasonable reflection of the features used for classifying each class. On the one hand, most types of features are derived from words; on the other hand, our previous work already showed that the most effective features in such tasks are based on words [36].

Specifically, we start with applying a state of the art tweet normalisation tool3

³
https://github.com/cbaziotis/ekphrasis
to tokenise and transform each tweet into a sequence of words. This is done to mitigate the noise due to the colloquial nature of the data. The process involves, for example, spelling correction, elongated word normalisation (‘yaaaay’ becomes ‘yay’), word segmentation on hashtags (‘#banrefugees’ becomes ‘ban refugees’), and unpacking contractions (e.g., ‘can’t’ becomes ‘can not’). Then we lemmatise each word to return its dictionary form. We refer to this process as ‘pre-processing’ and the output as pre-processed tweets.

Next, given a tweet $t_{i}$ , let $c_{n}$ be the class label of $t_{i}$ , $words (t_{i})$ returns the set of different words from $t_{i}$ , and $uwords (c_{n})$ returns the set of class-unique words that are found only for $c_{n}$ (i.e., they do not appear in any other classes), then for each dataset, we measure for each tweet a ‘uniqueness’ score u as: $\begin{matrix} (1) & u (t_{i}) = \frac{| words (t_{i}) \cap uwords (c_{n}) |}{| words (t_{i}) |} . \end{matrix}$ This measures the fraction of class-unique words in a tweet, depending on the class of this tweet. Intuitively, the score can be considered as an indication of ‘uniqueness’ of the features found in a tweet. A high value indicates that the tweet can potentially contain more features that are unique to its class, and as a result, we can expect the tweet to be relatively easy to classify. On the other hand, a low value indicates that many features of this tweet are potentially non-discriminative as they may also be found across multiple classes, and therefore, we can expect the tweet to be relatively difficult to classify.

We then compute this score for every tweet in a dataset, and compare the number of tweets with different uniqueness scores within each class. To better visualise this distribution, we bin the scores into 11 ranges as $u \in [0, 0]$ , $u \in (0, 0.1]$ , $u \in (0.1, 0.2], \dots, u \in (0.9, 1.0]$ . In other words, the first range includes only tweets with a uniqueness score of 0, then the remaining 10 ranges are defined with a 0.1 increment in the score. In Fig. 1 we plot for each dataset,

the distribution of tweets over these ranges regardless of their class (as indicated by the length of the dark horizontal bar, measured against the x axis that shows accumulative percentage of the dataset); and

the distribution of tweets belong to each class (as indicated by the call-out boxes). For simplicity, we label each range using its higher bound on the y axis. As an example, $u \in [0, 0]$ is labelled as 0, and $u \in (0, 0.1]$ as 0.1.

Fig. 1.
Distribution of tweets in each dataset over the 11 ranges of the uniqueness scores. The x-axis shows accumulative percentage of the dataset; the y-axis shows the labels of these ranges. The call-out boxes show for each class, the fraction of tweets fall under that range (in case a class is not present it has a fraction of 0%).

Using the WZ-S.amt dataset as an example, the figure shows that almost 30% of tweets (the bottom horizontal bars in the figure) in this dataset have a uniqueness score of 0. In other words, these tweets contain no class-unique words. This can cause difficulty in extracting class-unique features from these tweets, making them very difficult to classify. The call-out box for this part of data shows that it contains 52% of sexism and 48% of racism tweets. In fact, on this dataset, 76% of sexism and 81% of racism tweets (adding up figures from the call-out boxes for the bottom three horizontal bars) only have a uniqueness score of 0.2 or lower. On those tweets that have a uniqueness score of 0.4 or higher (the top six horizontal bars), i.e., those that may be deemed as relatively ‘easier’ to classify, we find only 2% of sexism and 3% of racism tweets. In contrast, it is 17% for non-hate tweets.

We can notice very similar patterns on all the datasets in this analysis. Overall, it shows that the majority of hate tweets potentially lack discriminative features and as a result, they ‘sit in the long tail’ of the dataset as ranked by the uniqueness of tweets. Note also that comparing the larger datasets WZ.pj and WZ against the smaller ones (i.e., the WZ-S ones), although both the absolute number and the percentage of the racism and sexism tweets are increased significantly in the two larger datasets (see Table 1), this does not improve the long tail situation. Indeed, one can hate or not using the same words. And as a result, increasing the dataset size and improving class balance may not always guarantee a solution.
4. Methodology

In this section, we describe our DNN based methods that implement the intuition of extracting dependency between words or phrases as features from tweets. To illustrate this idea, consider the example tweet ‘These muslim refugees are troublemakers and parasites, they should be deported from my country’. Each of the words such as ‘muslim’, ‘refugee’, ‘troublemakers’, ‘parasites’, and ‘deported’ alone are not always indicative features of hate speech, as they can be used in any context. However, combinations such as ‘muslim refugees, troublemakers’, ‘refugees, troublemakers’, ‘refugees, parasites’, ‘refugees, deported’, and ‘they, deported’ can be more indicative features. Clearly, in these examples, the pair of words or phrases form certain dependence on each other, and such sequences cannot be captured by n-gram like features.

We propose two DNN structures that may capture such features. Our previous work [43] combines traditional a CNN with a GRU layer [43] and for the sake of completeness, we also include its details below. Our other method combines traditional CNN with some modified CNN layers serving as skip-gram extractors – to be called ‘skipped CNN’. Both structures modify a common, base CNN model (Section 4.1) that acts as the n-gram feature extractor, while the added GRU (Section 4.1.1) and the skipped CNN (Section 4.1.2) components are expected to extract the dependent sequences of such n-grams, as illustrated above.

4.1. The base CNN model

The Base CNN model is illustrated in Fig. 2. Given a tweet, we firstly apply the pre-processing described in Section 3.2 to normalise and transform the tweet into a sequence of words. This sequence is then passed to a word embedding layer, which maps the sequence into a real vector domain (word embeddings). Specifically, each word is mapped onto a fixed dimensional real valued vector, where each element is the weight for that dimension for that word. Word embeddings are often trained on very large unlabelled corpus, and comparatively, the datasets used in this study are much smaller. Therefore in this work, we use pre-trained word embeddings that are publicly available (to be detailed in Section x). One potential issue with pre-trained embeddings is Out-Of-Vocabulary (OOV) words, particularly on Twitter data due to its colloquial nature. Thus the pre-processing also helps to reduce the noise in the language and hence the scale of OOV. For example, by hashtag segmentation we transform an OOV ‘#BanIslam’ into ‘Ban’ and ‘Islam’ that are more likely to be included in the pre-trained embedding models.

Fig. 2.

The base CNN model uses three different window sizes to extract features. This diagram is best viewed in colour.

The embedding layer passes an input feature space with a shape of $100 \times 300$ to three 1D convolutional layers, each uses 100 filters and a stride of 1, but different window sizes of 2, 3, and 4 respectively. Intuitively, each CNN layer can be considered as extractors of bi-gram, tri-gram and quad-gram features. The rectified linear unit function is used for activation in these CNNs. The output of each CNN is then further down-sampled by a 1D max pooling layer with a pool size of 4 and a stride of 4 for further feature selection. Outputs from the pooling layers are then concatenated, to which we add another 1D max pooling layer with the same configuration before (thus ‘max pooling × 2’ in the figure). This is because we empirically found that this further pooling layer can lead to an improvement in F1 in most cases (with as much as 5 percentage points). The output is then fed into the final softmax layer to predict probability distribution over all possible classes (n), which will depend on individual datasets.

One of the recent trends in text processing tasks on Twitter is the use of character based n-grams and embeddings instead of word based, such as in [25,32]. The main reason for this is to cope with the noisy and informal nature of the language in tweets. We do not use character based models, mainly because the literatures that compared word based and character based models are rather inconclusive. Although Mehdad et al. [25] obtained better results using character based models, Park et al. [32] and Gamback et al. [15] reported the opposite. Further, our pre-processing already reduces the noise in the language to some extent. Although the state of the art tool we used is non-perfect and still made mistakes such as parsing ‘#YouTube’ to ‘You’ and ‘Tube’, overall it significantly reduced OOVs by the embedding models. Using the DT dataset for example, this improved hashtag coverage from as low as less than 1% to up to 80% depending on the embedding models used (see the Appendix for details). Also word-based models also better fit our intuitions explained before.

4.1.1. CNN + GRU

With this model, we extend the Base CNN model by adding a GRU layer that takes input from the max pooling layer. This treats the features as timesteps and outputs 100 hidden units per timestep. Compared to LSTM, which is a popular type of RNN, the key difference in a GRU is that it has two gates (reset and update gates) whereas an LSTM has three gates (namely input, output and forget gates). Thus GRU is a simpler structure with fewer parameters to train. In theory, this makes it faster to train and generalise better on small data; while empirically it is shown to achieve comparable results to LSTM [7]. Next, a global max pooling layer ‘flattens’ the output space by taking the highest value in each timestep dimension, producing a feature vector that is finally fed into the softmax layer. The intuition is to pick the highest scoring features to represent a tweet, which empirically works better than the normal configuration. The structure of this model is shown in Fig. 3.

Fig. 3.

The CNN + GRU architecture. This diagram is best viewed in colour.

The GRU layer captures sequence orders that can be useful for this task. In an analogy, it learns dependency relationships between n-grams extracted by the CNN layer before. And as a result, it may capture co-occurring word n-grams as useful patterns for classification, such as the pairs of words and phrases illustrated before.

4.1.2. CNN + skipped CNN (sCNN)

With this model, we propose to extend the Base CNN model by adding CNNs that use ‘gapped window’ to extract features from its input, and we call these CNN layers ‘skipped CNNs’. A gapped window is one where inputs at certain (consecutive) positions of the window are ignored, such as those shown in Fig. 4. We say that these positions within the window are ‘deactivated’ while other positions are ‘activated’. Specifically, given a window of size j, applying a gap of i consecutive positions will produce multiple shapes of size j windows, as illustrated in Algorithm 1.

Fig. 4.

Example of a 2 gapped size 4 window and a one gapped size 3 window. The ‘X’ indicates that input for the corresponding position in the window is ignored.

Algorithm 1

Creation of i gapped size j windows. A sequence [O, X, O, O] represents one possible shape of a 1 gapped size 4 window, where the first and the last two positions are activated (‘O’) and the second position is deactivated (‘X’)

As an example, applying a 1-gap to a size 4 window will produce two shapes: [O, X, O, O], [O, O, X, O], where ‘O’ indicates an activated position and ‘X’ indicates a deactivated position in the window; while applying a 2-gap to a size 4 window will produce a single shape of [O, X, X, O].

To extend the Base CNN model, we add CNNs using 1-gapped size 3 windows, 1-gapped size 4 windows and 2-gapped size 4 windows. Then each added CNN is followed by a max pooling layer of the same configuration as described before. The remaining parts of the structure remain the same. This results in a model illustrated in Fig. 5.

Fig. 5.

The CNN + sCNN model concatenates features extracted by the normal CNN layers with window sizes of 2, 3, and 4, with features extracted by the four skipped CNN layers. This diagram is best viewed in colour.

Intuitively, the skipped CNNs can be considered as extractors of ‘skip-gram’ like features. In an analogy, we expect it to extract useful features such as ‘muslim refugees ? troublemakers’, ‘muslim ? ? troublemakers’, ‘refugees ? troublemakers’, and ‘they ? ? deported’ from the example sentence before, where ‘?’ is a wildcard representing any word token in a sequence.

To the best of our knowledge, the work by Nguyen et al. [27] is the only one that uses DNN models to extract skip-gram features that are used directly in NLP tasks. However, our method is different in two ways. First, Nguyen et al. addressed a task of mention detection from sentences, i.e., classifying word tokens in a sentence into sequences of particular entities or not. Our work deals with sentence classification. This means that our modelling of the task input and their features are essentially different. Second, the authors used skip-gram features only, while our method adds skip-grams to conventional n-grams, as we concatenate the output from the skipped CNNs and the conventional CNNs. The concept of skip-grams has been widely quoted in training word embeddings with neural network models since Mikolov et al. [26]. This is however, different from directly using skip-gram as features for NLP. Work such as [34] used skip-grams in detecting irony in language. But these are extracted as features in a separate process, while our method relies on the DNN structure to learn such complex features. A similar concept of atrous (or ‘dilated’) convolution has been used in image processing [4]. In comparison, given a window size of n this effectively places an equal number of gaps between every element in the window. For example, a window of size 3 with a dilation rate of 2 would effectively create a window of the shape [X, O, X, O, X, O, X].

For both CNN + GRU and CNN + sCNN, the input to the each convolutional layer is also regularised by a dropout layer with a ratio of 0.2.

4.1.3. Model parameters

We use the categorical cross entropy loss function and the Adam optimiser to train the models, as the first is empirically found to be more effective on classification tasks than other commonly used loss functions including classification error and mean squared error [24], and the second is designed to improve the classic stochastic gradient descent (SGD) optimiser and in theory combines the advantages of two other common extensions of SGD (AdaGrad and RMSProp) [20].

Our choice of parameters described above are largely based on empirical findings reported previously, default values or anecdotal evidence. Arguably, these may not be the best settings for optimal results, which are always data-dependent. However, we show later in experiments that the models already obtain promising results even without extensive data-driven parameter tuning.

5. Experiment

In this section, we present our experiments for evaluation and discuss the results. We compare our CNN + GRU and CNN + sCNN methods against three re-implemented state of the art methods (Section 5.1), and discuss the results in Section 5.2. This is followed by an analysis to show how our methods have managed to effectively capture hate tweets in the long tail (Section 5.3), and to discover the typical errors made by all methods compared (Section 5.4).

Word embeddings. We experimented with three different choices of pre-trained word embeddings: the Word2Vec embeddings trained on the 3-billion-word Google News corpus with a skip-gram model,4

⁴
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
the ‘GloVe’ embeddings trained on a corpus of 840 billion tokens using a Web crawler [33], and the Twitter embeddings trained on 200 million tweets with spam removed [22].5 ⁵
‘Set1’ in [22].
However, our results did not find any embeddings that can consistently outperform others on all tasks and datasets. In fact, this is rather unsurprising, as previous studies [6] suggested similar patterns: the superiority of one word embeddings model on intrinsic tasks (e.g., measuring similarity) is generally non-transferable to downstream applications, across tasks, domains, or even datasets. Below we choose to only focus on results obtained using the Word2Vec embeddings, for two reasons. On the one hand, this is consistent with previous work such as [32]. On the other hand, the two state of the art methods also perform best overall with Word2Vec.6 ⁶
Based on results in Table 7 in the Appendix, on a per-class basis, Gamback et al. method obtained the highest F1 with Word2Vec embeddings on 11 out of 19 cases, with the other 8 cases obtained with either GloVe or the Twitter embeddings. For Park et al. the figure is 12 out of 19.
Our full results obtained with the different embeddings are available in the Appendix.

Performance evaluation metrics. We use the standard Precision (P), Recall (R) and F1 measures for evaluation. For the sake of readability, we only present F1 scores in the following sections unless otherwise stated. Again full results can be found in the Appendix. Due to the significant class imbalance in the data, we show F1 obtained on both hate and non-hate tweets separately.

Implementation. For all methods discussed in this work, we used the Python Keras7 ⁷
https://keras.io/, version 2.0.2.
with Theano backend8 ⁸
http://deeplearning.net/software/theano/, version 0.9.0.
and the scikit-learn9 ⁹
http://scikit-learn.org/, version 0.19.1.
library for implementation.10 ¹⁰
Code available at https://github.com/ziqizhang/chase.
For DNN based methods, we fix the epochs to 10 and use a mini-batch of 100 on all datasets. These parameters are rather arbitrary and fixed for consistency. We run all experiments in a 5-fold cross validation setting and report the average across all folds.
5.1. State of the art

We re-implemented three state of the art methods covering both the classic and deep learning based methods. First, we use the SVM based method described in Davidson et al. [9]. A number of different types of features are used as below. Unless otherwise stated, these features are extracted from the pre-processed tweets:

Surface features: word unigram, bigram and trigram each weighted by Term Frequency Inverse Document Frequency (TF-IDF); number of mentions, and hashtags;11

¹¹
Extracted from the original tweet before pre-processing.
number of characters, and words;

Linguistic features: Part-of-Speech (PoS)12 ¹²
The NLTK library is used.
tag unigrams, bigrams, and trigrams, weighted by their TF-IDF and removing any candidates with a document frequency lower than 5; number of syllables; Flesch-Kincaid Grade Level and Flesch Reading Ease scores that to measure the ‘readability’ of a document

Sentiment features: sentiment polarity scores of the tweet, calculated using a public API.13 ¹³
https://github.com/cjhutto/vaderSentiment

Our second and third methods are re-implementations of Gamback et al. [15] (GB) and Park et al. [32] (PK) using word embeddings, which obtained better results than their character based counterparts in their original work. Both are based on concatenation of multiple CNNs and in fact, have the same structure as the base CNN model described in Section 4.1 but use different CNN window sizes. They are therefore good reference for comparison to analyse the benefits of our added GRU and skipped CNN layers. We kept the same hyper-parameters for these as well as our proposed methods for comparative analysis.14 ¹⁴
In fact both papers did not detail their hyper-parameter settings, which is another reason for us to use consistent configurations as our methods.

5.2. Overall results

Our experiments could not identify a best performing candidate among the three state of the art methods on all datasets, by all measures. Therefore, in the following discussion, unless otherwise stated, we compare our methods against the best results achieved by any of the three state of the art methods

Overall micro and macro F1. Firstly, we compare micro and macro F1 obtained by each method in Table 2. In terms of micro F1, both our CNN + sCNN and CNN + GRU methods obtained consistently the best results on all datasets. Nevertheless, the improvement over the best results from any of the three state of the art methods is rather incremental. The situation with macro F1 is however, quite different. Again compared against the state of the art, both our methods obtained consistently better results. The improvements in many cases are much higher: 1∼5 and 1∼4 percentage points (or ‘percent’ in short) for CNN + sCNN and CNN + GRU respectively. When only the categories of hate tweets are considered (i.e., excluding non-hate, see ‘macro F1, hate’), the improvements are significant in some cases: a maximum of 8 and 6 percent for CNN + sCNN and CNN + GRU respectively.

Table 2
Micro vs. macro F1 results of different methods (using the Word2Vec embeddings). The best results on each row are highlighted in bold. Numbers within (brackets) indicate the improvement in F1 compared to the best result by any of the three state of the art methods

Dataset Measure SVM GB PK CNN + sCNN CNN + GRU

WZ-S.amt micro F1 0.81 0.92 0.92 0.92 (-) 0.92 (-)

macro F1 0.59 0.66 0.64 0.68 (+0.02) 0.67 (+0.02)

macro F1, hate 0.45 0.51 0.48 0.55 (+0.04) 0.53 (+0.02)

WZ-S.exp micro F1 0.82 0.91 0.91 0.92 (+0.01) 0.92 (+0.01)

macro F1 0.58 0.70 0.69 0.74 (+0.04) 0.71 (+0.01)

macro F1, hate 0.42 0.58 0.56 0.63 (+0.05) 0.59 (+0.01)

WZ-S.gb micro F1 0.83 0.92 0.92 0.93 (+0.01) 0.93 (+0.01)

macro F1 0.64 0.71 0.70 0.76 (+0.05) 0.74 (+0.04)

macro F1, hate 0.51 0.58 0.57 0.66 (+0.08) 0.64 (+0.06)

WZ.pj micro F1 0.72 0.82 0.82 0.83 (+0.01) 0.82 (-)

macro F1 0.66 0.75 0.75 0.77 (+0.02) 0.76 (+0.01)

macro F1, hate 0.59 0.68 0.68 0.71 (+0.03) 0.70 (+0.02)

WZ micro F1 0.72 0.82 0.82 0.83 (+0.01) 0.82 (-)

macro F1 0.65 0.75 0.76 0.77 (+0.01) 0.76 (-)

macro F1, hate 0.58 0.69 0.70 0.71 (+0.01) 0.70 (-)

DT micro F1 0.79 0.94 0.94 0.94 (-) 0.94 (-)

macro F1 0.56 0.69 0.63 0.64 (+0.01) 0.63 (-)

macro F1, hate 0.23 0.28 0.28 0.30 (+0.02) 0.30 (+0.02)

RM micro F1 0.79 0.90 0.89 0.91 (+0.01) 0.90 (-)

macro F1 0.72 0.80 0.80 0.83 (+0.03) 0.81 (+0.01)

macro F1, hate 0.58 0.67 0.66 0.71 (+0.04) 0.68 (+0.02)

Dataset	Measure	SVM	GB	PK	CNN + sCNN	CNN + GRU
WZ-S.amt	micro F1	0.81	0.92	0.92	0.92	(-)	0.92	(-)
macro F1	0.59	0.66	0.64	0.68	(+0.02)	0.67	(+0.02)
macro F1, hate	0.45	0.51	0.48	0.55	(+0.04)	0.53	(+0.02)
WZ-S.exp	micro F1	0.82	0.91	0.91	0.92	(+0.01)	0.92	(+0.01)
macro F1	0.58	0.70	0.69	0.74	(+0.04)	0.71	(+0.01)
macro F1, hate	0.42	0.58	0.56	0.63	(+0.05)	0.59	(+0.01)
WZ-S.gb	micro F1	0.83	0.92	0.92	0.93	(+0.01)	0.93	(+0.01)
macro F1	0.64	0.71	0.70	0.76	(+0.05)	0.74	(+0.04)
macro F1, hate	0.51	0.58	0.57	0.66	(+0.08)	0.64	(+0.06)
WZ.pj	micro F1	0.72	0.82	0.82	0.83	(+0.01)	0.82	(-)
macro F1	0.66	0.75	0.75	0.77	(+0.02)	0.76	(+0.01)
macro F1, hate	0.59	0.68	0.68	0.71	(+0.03)	0.70	(+0.02)
WZ	micro F1	0.72	0.82	0.82	0.83	(+0.01)	0.82	(-)
macro F1	0.65	0.75	0.76	0.77	(+0.01)	0.76	(-)
macro F1, hate	0.58	0.69	0.70	0.71	(+0.01)	0.70	(-)
DT	micro F1	0.79	0.94	0.94	0.94	(-)	0.94	(-)
macro F1	0.56	0.69	0.63	0.64	(+0.01)	0.63	(-)
macro F1, hate	0.23	0.28	0.28	0.30	(+0.02)	0.30	(+0.02)
RM	micro F1	0.79	0.90	0.89	0.91	(+0.01)	0.90	(-)
macro F1	0.72	0.80	0.80	0.83	(+0.03)	0.81	(+0.01)
macro F1, hate	0.58	0.67	0.66	0.71	(+0.04)	0.68	(+0.02)

However, notice that both the overall and hate speech-only macro F1 scores are significantly lower than micro F1, for any methods, on any datasets. This further supports our earlier data analysis findings that classifying hate tweets is a much harder task than non-hate, and micro F1 scores will overshadow a method’s true performance on a per-class basis, due to the imbalanced nature of such data.

F1 per-class. To clearly compare each method’s performance on classifying hate speech, we show the per-class results in Table 3. This demonstrates the benefits of our methods compared to the state of the art when focusing on any categories of hate tweets rather than non-hate: the CNN + sCNN model was able to outperform the best results by any of the three comparison methods, achieving a maximum of 13 percent increase in F1; the CNN + GRU model was less good, but still obtained better results on four datasets, achieving a maximum of 8 percent increase in F1.

Table 3

F1 results of different models for each class (using the Word2Vec embeddings). The best results on each row are highlighted in bold. Numbers within (brackets) indicate the improvement in F1 compared to the best results from any of the three state of the art methods

Dataset	Class	SVM	GB	PK	CNN + sCNN		CNN + GRU
WZ-S.amt	racism	0.22	0.22	0.17	0.29	(+0.07)	0.28	(+0.06)
	sexism	0.68	0.80	0.80	0.81	(+0.01)	0.80	(-)
	non-hate	0.88	0.95	0.95	0.95	(-)	0.95	(-)
WZ-S.exp	racism	0.26	0.51	0.45	0.58	(+0.07)	0.51	(-)
	sexism	0.58	0.66	0.67	0.68	(+0.01)	0.66	(-)
	non-hate	0.89	0.95	0.95	0.96	(+0.01)	0.95	(-)
WZ-S.gb	racism	0.38	0.41	0.38	0.54	(+0.13)	0.49	(+0.08)
	sexism	0.64	0.76	0.76	0.77	(+0.02)	0.78	(+0.02)
	non-hate	0.90	0.96	0.96	0.96	(-)	0.96	(-)
WZ.pj	racism	0.60	0.70	0.69	0.73	(+0.03)	0.73	(+0.03)
	sexism	0.58	0.66	0.67	0.69	(+0.02)	0.68	(+0.01)
	non-hate	0.80	0.88	0.88	0.88	(-)	0.87	(−0.01)
WZ	racism	0.59	0.70	0.72	0.74	(+0.02)	0.73	(+0.01)
	sexism	0.57	0.66	0.66	0.68	(+0.02)	0.67	(+0.01)
	non-hate	0.79	0.87	0.88	0.88	(+0.01)	0.87	(−0.01)
DT	hate	0.23	0.28	0.28	0.30	(+0.02)	0.29	(+0.01)
DT	non-hate	0.88	0.97	0.97	0.97	(-)	0.97	(-)
RM	hate	0.58	0.67	0.66	0.71	(+0.04)	0.68	(+0.01)
RM	non-hate	0.86	0.94	0.94	0.94	(-)	0.94	(-)

Note that the best improvement obtained by our methods were found on the racism class in the three WZ-S datasets. As shown in Table 1, these are minority classes representing a very small population in the dataset (between 1 and 6%). This suggests that our proposed methods can be potentially very effective when there is a lack of training data.

It is also worth to highlight that here we discuss only results based on the Word2Vec embeddings. In fact, our methods obtained even more significant improvement when using the Twitter or GloVe embeddings. We discuss this further in the following sections.

sCNN vs. GRU. Comparing CNN + sCNN with CNN + GRU using Table 3, we see that the first performed much better when only hate tweets are considered, suggesting that the skipped CNNs may be more effective feature extractors than GRU for hate speech detection in very short texts such as tweets. CNN + sCNN consistently outperformed the highest F1 by any of the three state of the art methods on any dataset, and it also achieved higher improvement. CNN + GRU on the other hand, obtained better F1 on four datasets and the same best F1 (as the state of the art) on two datasets. This also translates to overall better macro F1 by CNN + sCNN compared to CNN + GRU (Table 2).

Patterns observed with other word embeddings. While we show detailed results with different word embeddings in the Appendix, it is worth to mention here that the same patterns were noticed with these different embeddings, i.e., both CNN + sCNN and CNN + GRU have outperformed the state of the art methods in the most cases but the first model obtained much better results. The best results however, were not always obtained with the Word2Vec embeddings on all datasets. On an individual class basis, among all the 19 classes from the seven datasets, the CNN + sCNN model has scored 10 best F1 with the Word2Vec embeddings, 11 best F1 with the Twitter embeddings, and 10 best F1 with the GloVe embeddings. For the CNN + GRU model, the figures are 14, 12, and 9 for the Word2Vec, Twitter, and GloVe embeddings respectively.

Both models however, obtained much more significant improvement over the three state of the art methods when using the Twitter or GloVe embeddings instead of Word2Vec. For example, on the three WZ-S datasets with the smallest class ‘racism’, CNN + sCNN outperformed the best results by the three state of the art by between 20 and 22 percentage points in F1 using the Twitter embeddings, or between 21 and 33 percent using the GloVe embeddings. For CNN + GRU, the situation is similar: between 9 and 18 percent using the Twitter embeddings, or between 11 and 20 percent using the GloVe embeddings. However, this is largely because the GB and PK models under-performed significantly when using these embeddings instead of Word2Vec. In contrast, the CNN + sCNN and CNN + GRU models can be seen to be less sensitive to the choice of embeddings, which can be a desirable feature.

Against previously reported results. For many reasons such as the difference in the re-generated datasets,15 ¹⁵

As noted in [43], we had to re-download tweets using previously published tweet IDs in the shared datasets. But some tweets have become no longer available.

possibly different pre-processing methods, and unknown hyper-parameter settings from the previous work,16 ¹⁶

The details of the pre-processing, network structures and many hyper-parameter settings are not reported in nearly all of the previous work. For comparability, as mentioned before, we kept the same configurations for our methods as well as the re-implemented state of the art methods.

we cannot guarantee an identical replication of the previous methods in our re-implementation. Therefore in Table 4, we compare results by our methods against previously reported results (micro F1 is used as this is the case in all the previous work) on each dataset on an ‘as-is’ basis.

Table 4

Comparing micro F1 on each dataset (using the Word2Vec embeddings) against previously reported results on an ‘as-is’ basis. The best performing result on each dataset is highlighted in bold. For [40] and [39], we used the result reported under their ‘Best Feature’ setting

Dataset	State of the art		CNN + sCNN	CNN + GRU
WZ-S.amt	0.84	[39]	0.92	0.92
WZ-S.exp	0.91	[39]	0.92	0.92
WZ-S.gb	0.78	[15]	0.93	0.93
WZ.pj	0.83	[32]	0.83	0.83
WZ	0.74	[40]	0.83	0.83
DT	0.90	[9]	0.94	0.94

Table 4 shows that both our methods have achieved the best results on all datasets, outperforming state of the art on six and in some cases, quite significantly. Note that on the WZ.pj dataset where our methods did not obtain further improvement, the best reported state of the art result was obtained using a hybrid character-and-word embeddings CNN model [32]. Our methods in fact, outperformed both the word-only and character-only embeddings models in that same work.

5.3. Effectiveness on identifying the long tail

While the results so far have shown that our proposed methods can obtain better performance in the task, it is unclear whether they are particularly effective on classifying tweets in the long tail of such datasets as we have shown before in Fig. 1. To understand this, we undertake a further analysis below.

On each dataset, we compare the output from our proposed methods against that from Gamback et al. as a reference. We identify the additional tweets that were correctly classified by either of our CNN + sCNN or CNN + GRU methods. We refer to these tweets as additional true positives. Next, following the same process as that for Fig. 1, we

compute the uniqueness score of each tweet (Equation (1)) as indicator of the fraction of class-unique words in each of them;

bin the scores into 11 ranges; and

show the distribution of additional true positives found on each dataset by our methods over these ranges in Fig. 6 (i.e., each column in the figure corresponds to a method-dataset pair). Again for simplicity, we label each range using its higher bound on the y axis (e.g., $u \in [0, 0]$ is labelled as 0, and $u \in (0, 0.1]$ as 0.1). As an example, the leftmost column shows that comparing the output from our CNN + GRU model against Gamback et al. on the WZ-S.amt dataset, 38% of the additional true positives have a uniqueness score of 0.

Fig. 6.

(Best viewed in colour) Distribution of additional true positives (compared against Gamback et al.) identified by CNN + sCNN (sCNN for shorthand) and CNN + GRU (GRU) over different ranges of uniqueness scores (Equation (1)) when using the Word2Vec embeddings. Each row in the heatmap corresponds to a uniqueness score range. Each column corresponds to a method-dataset pair. The numbers in each column sum up to 100% while the colour scale within each cell is determined by the number in that cell.

The figure shows that in general, the vast majority of additional true positives have low uniqueness scores. The pattern is particularly strong on WZ and WZ.pj datasets, where the majority of additional true positives have very low uniqueness scores and in a very large number of cases, a substantial fraction of them (between 50 and 60%) have $u (t_{i}) = 0$ , suggesting that these tweets have no class-unique words at all and therefore, we expect them to potentially have fewer class-unique features. However, our methods still managed to classify them correctly, while the method by Gamback et al. could not. We believe these results are convincing evidence that our methods of using skipped CNN or GRU structures on such tasks can significantly improve our capability of classifying tweets that lack discriminative features. Again the results shown in Fig. 6 are based on the Word2Vec embeddings but we noticed the same patterns with other embeddings, as shown in Fig. 7 in the Appendix.

5.4. Error analysis

To understand further the challenges of the task, we manually analysed a sample of 200 tweets covering all classes to identify ones that are incorrectly classified by all methods. We generally split these errors into four categories.

Implicitness (46%) represents the largest group of errors, where the tweet does not contain explicit lexical or syntactic patterns as useful classification features. Interpretation of such tweets almost certainly requires complicated reasoning and cultural and social background knowledge. For example, subtle metaphors such as ‘expecting gender equality is the same as genocide’, stereotypical views such as in ‘... these same girls ... didn’t cook that well and aren’t very nice’ are often found in false negative hate tweets.

Non-discriminative features (24%) is the second majority case, where the classifiers were confused by certain features that are frequent, seemingly indicative of hate speech but in fact, can be found in both hate and non-hate tweets. For example, one would assume that the presence of the phrase ‘white trash’ or pattern ‘* trash’ is more likely to be a strong indicator of hate speech than not, such as in ‘White bus drivers are all white trash...’. However, our analysis shows that such seemingly ‘obvious’ features are also prevalent in non-hate tweets such as ‘... I’m a piece of white trash I say it proudly’. The second example does not qualify as hate speech since it does not ‘target individual or groups’ or ‘has the intention to incite harm’.

There is also a large group of tweets that require interpretation of contextual information (18%) such as the threads of conversation that they are part of, or from the included URLs in the tweets to fully understand their nature. In these cases, the language alone often has no connotation of hate. For example, in ‘what they tell you is their intention is not their intention. https://t.co/8cmfoOZwxz ’, the language itself does not imply hate. However, when it is combined with the video content referenced by the link, the tweet incites hatred towards particular religious groups. The content referenced by URLs can be videos, images, websites, or even other tweets. Another example is ‘@XXX Doing nothing does require an inordinate amount of skill’ (where ‘XXX’ is anonymised Twitter user handle) that is part of a conversation that makes derogatory remarks towards feminists.

Finally, we also identify a fair amount of disputable annotations (12%) that we could not completely agree with, even if the context as discussed above has been taken into account. For example, the tweet ‘@XXX @XXX He got one serve, not two. Had to defend the doubles lines also’ is part of a conversation discussing a sports event and is annotated as sexism in the WS.pj dataset. However, we did not find anything of a hateful nature in the conversation. Another example in the same dataset is ‘@XXX Picwhatting? And you have quoted none of the tweets. What are you trying to say ...?’ is questioning a point raised in another tweet which we consider as sexism, but this tweet itself has been annotated as sexism.

Taking into all such examples into consideration, we see that completely detecting hateful tweets purely based on their linguistic content still remains extremely challenging, if not impossible.

6. Conclusion and future work

The propagation of hate speech on social media has been increasing significantly in recent years, both due to the anonymity and mobility of such platforms, as well as the changing political climate from many places in the world. Despite substantial effort from law enforcement departments, legislative bodies as well as millions of investment from social media companies, it is widely recognised that effective counter-measures rely on automated semantic analysis of such content. A crucial task in this direction is the detection and classification of hate speech based on its targeting characteristics.

This work makes several contributions to state of the art in this research area. Firstly, we undertook a thorough data analysis to understand the extremely unbalanced nature and the lack of discriminative features of hateful content in the typical datasets one has to deal with in such tasks. Secondly, we proposed new DNN based methods for such tasks, particularly designed to capture implicit features that are potentially useful for classification. Finally, our methods were thoroughly evaluated on the largest collection of Twitter datasets for hate speech, to show that they can be particularly effective on detecting and classifying hateful content (as opposed to non-hate), which we have shown to be more challenging and arguably more important in practice. Our results set a new benchmarking reference in this area of research.

Lessons learned.First, we showed that the very challenging nature of identifying hate speech from short text such as tweets is due to the fact that hate tweets are found in the long tail of a dataset due to their lack of unique, discriminative features. We further showed in experiments that for this very reason, the practice of ‘micro-averaging’ over both hate and non-hate classes in a dataset adopted for reporting results by most previous work can be questionable. It can significantly bias the evaluation towards the dominant non-hate class in a dataset, overshadowing a method’s ability to identify real hate speech.

Second, our proposed ‘skipped’ CNN or GRU structures are able to discover implicit features that can be potentially useful for identifying hate tweets in the long tail. Interestingly, this may suggest that both structures can be potentially effective in the case where there is a lack of training data, and we plan to further evaluate this in the future. Among the two, the skipped CNNs performs much better.

Future work. We aim to explore the following directions of research in the future.

First, we will explore the options to apply our concept of skipped CNNs to character embeddings, which can further address the problem of OOVs in word embeddings. A related limitation of our work is the lack of understanding of the effect of tweet normalisation on the accuracy of the classifiers. This can be a rather complex problem as our preliminary analysis showed no correlation between the size of OOVs and classifier performance. We will investigate into this further.

Second, we will explore other branches of methods that aim at compensating the lack of training data in supervised learning tasks. Methods such as transfer learning could be potentially promising, as they study the problem of adapting supervised models trained in a resource-rich context to a resource-scare context. We will investigate, for example, whether features discovered from one hate class can be transferred to another, thus enhancing the training of each other.

Third, as shown in our data analysis as well as error analysis, the presence of abstract concepts such as ‘sexism’, ‘racism’ or even ‘hate’ in general is very difficult to detect if solely based on linguistic content. Therefore, we see the need to go beyond pure text classification and explore possibilities to model and integrate features about users, social groups, mutual communication and even background knowledge (e.g., concepts expressed from tweets) encoded in existing semantic knowledge bases. For example, recent work [35] has shown that user characteristics such as their vocabulary, overall sentiment in their tweets, and network status (e.g., centrality, influence) can be useful predictors for abusive content.

Finally, our methods prove to be effective for classifying tweets, a type of short texts. We aim to investigate whether the benefits of such DNN structures can generalise to other short text classification tasks, such as in the context of sentences.

Appendix: Full results

Our full results obtained with different word embeddings are shown in Table 7 for the three re-implemented state of the art methods, and Table 8 for the proposed CNN + sCNNN and CNN + GRU methods. e.w2v, e.twt, and e.glv each denotes the Word2Vec, Twitter, and GloVe embeddings respectively. We also analyse the percentage of OOV in each pre-trained word embeddings models, and show the statistics in Table 6. Note the figures are based on the datasets after applying the Twitter normaliser. Table 5 shows the effect of normalisation, in terms of the coverage of hashtags in the embeddings on different datasets.

Table 5

Percentage of hashtags covered in embedding models before and after applying the Twitter normalisation tool. B – before applying normalisation; A – after applying normalisation

Dataset	e.twt		e.w2v		e.glv

	B	A	B	A	B	A
WZ	37%	91%	<1%	44%	7%	89%
WZ-S.amt	39%	89%	<1%	73%	18%	87%
WZ-S.exp	39%	89%	<1%	73%	18%	87%
WZ-S.gb	39%	89%	<1%	73%	18%	87%
WZ.pj	38%	90%	<1%	52%	10%	89%
DT	35%	78%	<1%	71%	25%	79%
RM	15%	90%	<1%	83%	13%	89%

Table 6

Percentage of OOV in each pre-trained embedding model across all datasets

Dataset	e.twt	e.w2v	e.glv
WZ	4%	13%	6%
WZ-S.amt	3%	10%	5%
WZ-S.exp	3%	10%	5%
WZ-S.amt	3%	10%	5%
WZ.pj	4%	14%	7%
DT	6%	25%	11%
RM	4%	11%	6%

From the results, we cannot identify any word embeddings that consistently outperform others on all tasks and datasets, and there is also no strong correlation between the percentage of OOV in each word embeddings model and the obtained F1 with that model. Using the Gamback et al. baseline as an example (Table 7), despite being the least complete embeddings model, e.w2v still obtained the best F1 when classifying racism tweets on 5 datasets. On the contrary, despite being the most complete embedding model, e.twt only obtained the best F1 when classifying sexism tweets on 3 datasets.

Although counter-intuitive, this may not be very much a surprise, considering previous findings from [6]. The authors showed that the performance of word embeddings on intrinsic evaluation tasks (e.g., word similarity) does not always correlate to that on extrinsic, or downstream tasks. In details, they showed that the context window size used to train word embeddings can affect the trade-off between capturing the domain relevance and functions of words. A large context window not only reduces sparsity by introducing more contexts for each word, it also better captures the topics of words. A small window on the other hand, captures word function. In sequence labelling tasks, they showed that word embeddings trained using a context window of just 1 performed the best.

Arguably in our task, the topical relevance of words may be more important for the classification of hate speech. Although the Twitter based embeddings model has the best coverage, it may have suffered from insufficient context during training, since tweets are often very short compared to other corpora used to train Word2Vec and GloVe embeddings. As a result, the topical relevance of words captured by these embeddings may have lower quality than we expect, and therefore empirically, they do not lead to consistently better performance than Word2Vec or GloVe embeddings that are trained on very large, context rich, general purpose corpus.

Table 7

Full results obtained by all baseline models

Dataset and classes		SVM			Gamback et al. [15]									Park et al. [32]

					e.w2v			e.twt			e.glv			e.w2v			e.twt			e.glv

		P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
WZ-S.amt	racism	0.17	0.44	0.22	0.56	0.14	0.22	0.41	0.06	0.10	0.44	0.03	0.03	0.49	0.10	0.17	0.36	0.04	0.08	0.56	0.03	0.05
	sexism	0.59	0.79	0.68	0.84	0.76	0.80	0.86	0.73	0.79	0.86	0.76	0.81	0.84	0.77	0.80	0.86	0.75	0.80	0.87	0.76	0.81
	non-hate	0.95	0.82	0.88	0.93	0.97	0.95	0.93	0.98	0.95	0.93	0.98	0.95	0.94	0.97	0.95	0.93	0.98	0.95	0.93	0.98	0.96
WZ-S.exp	racism	0.21	0.51	0.26	0.62	0.43	0.51	0.63	0.33	0.43	0.54	0.15	0.24	0.58	0.37	0.45	0.64	0.30	0.39	0.54	0.10	0.17
	sexism	0.46	0.76	0.58	0.74	0.60	0.66	0.75	0.61	0.67	0.73	0.63	0.68	0.74	0.61	0.67	0.75	0.61	0.67	0.74	0.62	0.67
	non-hate	0.96	0.83	0.89	0.94	0.97	0.95	0.94	0.97	0.95	0.94	0.97	0.95	0.94	0.97	0.95	0.94	0.97	0.95	0.94	0.97	0.95
WZ-S.gb	racism	0.28	0.70	0.38	0.56	0.32	0.41	0.65	0.27	0.38	0.71	0.17	0.27	0.52	0.30	0.38	0.66	0.21	0.32	0.60	0.10	0.17
	sexism	0.54	0.79	0.64	0.81	0.72	0.76	0.84	0.71	0.77	0.81	0.72	0.76	0.81	0.72	0.76	0.81	0.72	0.76	0.81	0.73	0.76
	non-hate	0.96	0.85	0.90	0.94	0.97	0.96	0.94	0.98	0.96	0.94	0.97	0.96	0.94	0.97	0.96	0.94	0.98	0.96	0.94	0.97	0.96
WZ.pj	racism	0.54	0.68	0.60	0.75	0.66	0.70	0.74	0.67	0.70	0.76	0.65	0.70	0.75	0.64	0.69	0.74	0.65	0.70	0.75	0.68	0.72
	sexism	0.51	0.66	0.58	0.76	0.60	0.66	0.77	0.58	0.66	0.78	0.60	0.67	0.76	0.60	0.67	0.78	0.57	0.66	0.78	0.61	0.69
	non-hate	0.86	0.75	0.80	0.84	0.91	0.88	0.84	0.92	0.88	0.84	0.92	0.88	0.84	0.91	0.88	0.84	0.92	0.88	0.85	0.91	0.88
WZ	racism	0.54	0.65	0.59	0.74	0.67	0.70	0.72	0.73	0.72	0.75	0.72	0.73	0.74	0.70	0.72	0.74	0.72	0.73	0.75	0.72	0.74
	sexism	0.51	0.66	0.57	0.76	0.61	0.66	0.73	0.62	0.67	0.77	0.59	0.67	0.76	0.61	0.66	0.74	0.60	0.66	0.78	0.60	0.68
	non-hate	0.85	0.75	0.79	0.84	0.90	0.87	0.85	0.89	0.87	0.85	0.91	0.88	0.85	0.91	0.88	0.85	0.90	0.87	0.85	0.91	0.88
DT	hate	0.15	0.52	0.23	0.45	0.20	0.28	0.49	0.21	0.30	0.54	0.25	0.34	0.48	0.20	0.28	0.51	0.22	0.31	0.55	0.24	0.33
DT	non-hate	0.97	0.81	0.88	0.95	0.99	0.97	0.95	0.99	0.97	0.96	0.99	0.97	0.95	0.99	0.97	0.95	0.99	0.97	0.95	0.99	0.97
RM	hate	0.45	0.81	0.58	0.73	0.62	0.67	0.71	0.57	0.63	0.73	0.61	0.66	0.72	0.61	0.66	0.70	0.60	0.65	0.73	0.59	0.65
RM	non-hate	0.96	0.78	0.86	0.92	0.95	0.94	0.92	0.95	0.93	0.92	0.95	0.94	0.92	0.95	0.94	0.92	0.95	0.93	0.92	0.95	0.94

Table 8

Full results obtained by the CNN + GRU and CNN + sCNN models

Dataset and classes		CNN + sCNN									CNN + GRU

		e.w2v			e.twt			e.glv			e.w2v			e.twt			e.glv

		P	R	F	P	R	F	P	R	F	P	R	F	P	R	F	P	R	F
WZ-S.amt	racism	0.45	0.22	0.29	0.43	0.26	0.32	0.36	0.21	0.26	0.47	0.20	0.28	0.46	0.14	0.21	0.45	0.10	0.16
	sexism	0.84	0.78	0.81	0.80	0.79	0.79	0.81	0.81	0.81	0.80	0.79	0.80	0.85	0.73	0.83	0.79	0.81	0.80
	non-hate	0.94	0.97	0.95	0.94	0.96	0.95	0.95	0.96	0.95	0.94	0.96	0.95	0.93	0.97	0.95	0.94	0.97	0.95
WZ-S.exp	racism	0.60	0.57	0.58	0.59	0.67	0.63	0.52	0.64	0.57	0.52	0.48	0.51	0.57	0.48	0.52	0.54	0.38	0.44
	sexism	0.72	0.65	0.68	0.68	0.76	0.72	0.67	0.74	0.70	0.71	0.65	0.66	0.71	0.68	0.69	0.71	0.65	0.68
	non-hate	0.93	0.95	0.96	0.96	0.95	0.95	0.96	0.95	0.95	0.94	0.96	0.95	0.95	0.96	0.95	0.94	0.96	0.95
WZ-S.gb	racism	0.55	0.53	0.54	0.53	0.63	0.58	0.50	0.52	0.51	0.53	0.46	0.49	0.60	0.53	0.56	0.59	0.40	0.48
	sexism	0.82	0.73	0.77	0.78	0.79	0.79	0.78	0.80	0.79	0.76	0.80	0.78	0.80	0.75	0.77	0.78	0.78	0.78
	non-hate	0.94	0.97	0.96	0.96	0.96	0.96	0.96	0.96	0.96	0.96	0.96	0.96	0.95	0.97	0.96	0.95	0.96	0.96
WZ.pj	racism	0.72	0.73	0.73	0.64	0.86	0.74	0.69	0.80	0.74	0.74	0.72	0.73	0.73	0.70	0.72	0.71	0.72	0.71
	sexism	0.75	0.64	0.69	0.69	0.71	0.70	0.70	0.67	0.68	0.66	0.69	0.68	0.73	0.63	0.68	0.71	0.64	0.67
	non-hate	0.86	0.90	0.88	0.89	0.84	0.86	0.87	0.87	0.87	0.87	0.86	0.87	0.85	0.89	0.87	0.86	0.88	0.87
WZ	racism	0.73	0.74	0.74	0.64	0.87	0.74	0.69	0.84	0.76	0.71	0.74	0.73	0.74	0.70	0.72	0.71	0.75	0.73
	sexism	0.79	0.59	0.68	0.73	0.63	0.68	0.73	0.63	0.68	0.72	0.63	0.68	0.74	0.59	0.66	0.75	0.60	0.66
	non-hate	0.85	0.91	0.88	0.87	0.86	0.87	0.87	0.87	0.87	0.86	0.88	0.87	0.84	0.90	0.87	0.85	0.89	0.87
DT	hate	0.51	0.21	0.30	0.49	0.35	0.41	0.50	0.38	0.43	0.37	0.25	0.29	0.44	0.22	0.30	0.41	0.31	0.35
DT	non-hate	0.95	0.99	0.97	0.96	0.98	0.97	0.96	0.98	0.97	0.96	0.97	0.97	0.95	0.98	0.97	0.96	0.97	0.97
RM	hate	0.73	0.69	0.71	0.64	0.75	0.69	0.74	0.65	0.69	0.71	0.65	0.68	0.65	0.66	0.66	0.66	0.66	0.66
RM	non-hate	0.94	0.95	0.94	0.95	0.91	0.93	0.93	0.95	0.94	0.93	0.95	0.94	0.94	0.93	0.94	0.93	0.93	0.93

Fig. 7.

(Best viewed in colour) Distribution of additional true positives (compared against Gamback et al.) identified by CNN + sCNN (sCNN for shorthand) and CNN + GRU (GRU) over different ranges of uniqueness scores (see Equation (1)) on each dataset. Each row in a heatmap corresponds to a uniqueness score range. The numbers in each column sum up to 100% while the colour scale within each cell is determined by the number label for that cell.

References

Badjatiya,

Gupta,

Gupta and

Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion (WWW’17 Companion), International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 759–760. doi:10.1145/3041021.3054223.

Burnap and

M.L.

Williams, Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making, Policy and Internet7(2) (2015), 223–242. doi:10.1002/poi3.85.

Burnap and

M.L.

Williams, Us and them: Identifying cyber hate on Twitter across multiple protected characteristics, EPJ Data Science5(11) (2016), 1. doi:10.1140/epjds/s13688-016-0072-6.

L.-C.

Chen,

Papandreou,

Kokkinos,

Murphy and

A.L.

Yuille, DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Transactions on Pattern Analysis and Machine Intelligence40(4) (2018), 834–848. doi:10.1109/TPAMI.2017.2699184.

Chen,

Zhou,

Zhu and

Xu, Detecting offensive language in social media to protect adolescent online safety, in: Proceedings of the 2012 ASE/IEEE International Conference on Social Computing and 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust (SOCIALCOM-PASSAT’12), IEEE Computer Society, Washington, DC, USA, 2012, pp. 71–80. doi:10.1109/SocialCom-PASSAT.2012.55.

Chiu,

Korhonen and

Pyysalo, Intrinsic evaluation of word vectors fails to predict extrinsic performance, in: Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Association for Computational Linguistics, 2016, pp. 1–6. doi:10.18653/v1/W16-2501.

Chung,

Gulcehre,

Cho and

Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, in: Deep Learning and Representation Learning Workshop at the 28th Conference on Neural Information Processing Systems, Curran Associates, New York, USA, 2014.

Dadvar,

Trieschnigg,

Ordelman and

de Jong, Improving cyberbullying detection with user context, in: Proceedings of the 35th European Conference on Advances in Information Retrieval (ECIR’13), Springer, Berlin/Heidelberg, 2013, pp. 693–696. doi:10.1007/978-3-642-36973-5_62.

Davidson,

Warmsley,

Macy and

Weber, Automated hate speech detection and the problem of offensive language, in: Proceedings of the 11th Conference on Web and Social Media, Association for the Advancement of Artificial Intelligence, Menlo Park, CA, 2017.

10.

Del Vigna,

Cimino,

Dell’Orletta,

Petrocchi and

Tesconi, Hate me, hate me not: Hate speech detection on Facebook, in: Proceedings of the First Italian Conference on Cybersecurity, CEUR Workshop Proceedings, 2017, pp. 86–95.

11.

Dinakar,

Jones,

Havasi,

Lieberman and

Picard, Common sense reasoning for detection, prevention, and mitigation of cyberbullying, ACM Transactions on Interactive Intelligent Systems2(3) (2012), 18:1–18:30. doi:10.1145/2362394.2362400.

12.

Djuric,

Zhou,

Morris,

Grbovic,

Radosavljevic and

Bhamidipati, Hate speech detection with comment embeddings, in: Proceedings of the 24th International Conference on World Wide Web (WWW’15 Companion), ACM, New York, 2015, pp. 29–30. doi:10.1145/2740908.2742760.

13.

EEANews, Countering hate speech online, available at http://eeagrants.org/News/2012/. Last accessed: July 2017.

14.

Galiardone,

Gal,

Alves and

Martinez, Countering online hate speech, UNESCO Series on Internet Freedom, 2015, pp. 1–73.

15.

Gambäck and

Kumar Sikdar, Using convolutional neural networks to classify hate speech, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, 2017, pp. 85–90. doi:10.18653/v1/W17-3013.

16.

N.D.

Gitari,

Zuping,

Damien and

Long, A lexicon-based approach for hate speech detection, International Journal of Multimedia and Ubiquitous Engineering10(10) (2015), 215–230. doi:10.14257/ijmue.2015.10.4.21.

17.

Greevy and

A.F.

Smeaton, Classifying racist texts using a support vector machine, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’04), ACM, New York, 2004, pp. 468–469. doi:10.1145/1008992.1009074.

18.

Guardian , Anti-muslim hate crime surges after Manchester and London bridge attacks, available at https://www.theguardian.com. Last accessed: July 2017.

19.

Guardian, Zuckerberg on refugee crisis: ‘Hate speech has no place on Facebook’, available at https://www.theguardian.com. Last accessed: July 2017.

20.

D.P.

Kingma and

Ba, Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference for Learning Representations, 2015.

21.

Kwok and

Wang, Locate the hate: Detecting tweets against blacks, in: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI’13), Association for the Advancement of Artificial Intelligence, Menlo Park, CA, 2013, pp. 1621–1622.

22.

Li,

Shah,

Liu and

Nourbakhsh, Data sets: Word embeddings learned from tweets and general data, in: Proceedings of the Eleventh International Conference on Web and Social Media, Association for the Advancement of Artificial Intelligence, Menlo Park, CA, 2017, pp. 428–436.

23.

Lomas, Facebook, Google, Twitter commit to hate speech action in Germany. Last accessed: July 2017.

24.

J.D.

McCaffrey, Why you should use cross-entropy error instead of classification error or mean squared error for neural network classifier training, available at https://jamesmccaffrey.wordpress.com. Last accessed: Jan 2018.

25.

Mehdad and

Tetreault, Do characters abuse more than words? in: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, 2016, pp. 299–303. doi:10.18653/v1/W16-3638.

26.

Mikolov,

Chen,

Corrado and

Dean, Efficient estimation of word representations in vector space, CoRRarXiv:1301.3781, 2013.

27.

T.H.

Nguyen and

Grishman, Modeling skip-grams for event detection with convolutional neural networks, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2016, pp. 886–891. doi:10.18653/v1/D16-1085.

28.

Nobata,

Tetreault,

Thomas,

Mehdad and

Chang, Abusive language detection in online user content, in: Proceedings of the 25th International Conference on World Wide Web (WWW’16), International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2016, pp. 145–153. doi:10.1145/2872427.2883062.

29.

J.T.

Nockleby, Hate Speech, Macmillan, New York, 2000, pp. 1277–1279.

30.

Okeowo, Hate on the rise after Trump’s election, available at http://www.newyorker.com/. Last accessed: July 2017.

31.

F.J.

Ordóñez and

Roggen, Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition, Sensors16(1) (2016), 115. doi:10.3390/s16010115.

32.

J.H.

Park and

Fung, One-step and two-step classification for abusive language detection on Twitter, in: ALW1: 1st Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 41–45. doi:10.18653/v1/W17-3006.

33.

Pennington,

Socher and

C.D.

Manning. GloVe, Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1532–1543. doi:10.3115/v1/D14-1162.

34.

Reyes,

Rosso and

Veale, A multidimensional approach for detecting irony in Twitter, Language Resources and Evaluation47(1) (2013), 239–268. doi:10.1007/s10579-012-9196-x.

35.

M.H.

Ribeiro,

Calais,

Santos,

Almeida and

Meira, ‘Like sheep among wolves’: Characterizing hateful users on Twitter, in: Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web, at the 11th ACM International Conference on Web Search and Data Mining, ACM, New York, 2017.

36.

Robinson,

Zhang and

Tepper, Hate speech detection on Twitter: Feature engineering vs. feature selection, in: Proceedings of the 15th Extended Semantic Web Conference, Poster Volume (ESWC’18), Springer, Berlin, 2018, pp. 46–49. doi:10.1007/978-3-319-98192-5_9.

37.

Schmidt and

Wiegand, A survey on hate speech detection using natural language processing, in: International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics, 2017, pp. 1–10. doi:10.18653/v1/W17-1101.

38.

Warner and

Hirschberg, Detecting hate speech on the World Wide Web, in: Proceedings of the Second Workshop on Language in Social Media (LSM’12), Association for Computational Linguistics, 2012, pp. 19–26.

39.

Waseem, Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter, in: Proceedings of the Workshop on NLP and Computational Social Science, Association for Computational Linguistics, 2016, pp. 138–142. doi:10.18653/v1/W16-5618.

40.

Waseem and

Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, 2016, pp. 88–93. doi:10.18653/v1/N16-2013.

41.

Xiang,

Fan,

Wang,

Hong and

Rose, Detecting offensive tweets via topical feature discovery over a large scale Twitter corpus, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’12), ACM, New York, 2012, pp. 1980–1984. doi:10.1145/2396761.2398556.

42.

Yuan,

Wu and

Xiang, A two phase deep learning model for identifying discrimination from tweets, in: Proceedings of 19th International Conference on Extending Database Technology, 2016, pp. 696–697.

43.

Zhang,

Robinson and

Tepper, Detecting hate speech on Twitter using a convolution-GRU based deep neural network, in: Proceedings of the 15th Extended Semantic Web Conference (ESWC’18), Springer, Berlin, 2018, pp. 745–760. doi:10.1007/978-3-319-93417-4_48.

44.

Zhong,

Li,

Squicciarini,

Rajtmajer,

Griffin,

Miller and

Caragea, Content-driven detection of cyberbullying on the Instagram social network, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI’16), AAAI Press, Menlo Park, CA, 2016, pp. 3952–3958.

Hate speech detection: A solved problem? The challenging case of long tail on Twitter

Abstract

Keywords

1. Introduction

2.1. Terminology and scope

2 We will indicate explicitly where works address a related problem rather than hate speech. Our methods and experiments will only address hate speech, due to both dataset availability and the goal of this work. 2.2. Methods of hate speech detection and related problems

2.3. Evaluation of hate speech detection methods

3. Dataset analysis – the case of long tail

3.1. Public Twitter datasets

4.1. The base CNN model

5. Experiment

6. Conclusion and future work

Appendix: Full results

References

²
We will indicate explicitly where works address a related problem rather than hate speech.
Our methods and experiments will only address hate speech, due to both dataset availability and the goal of this work.
2.2. Methods of hate speech detection and related problems