Abstract
Sarcasm is a linguistic phenomenon indicating a disparity between the literal and the intended meaning of an utterance. Owing to this complexity, sarcasm is typically difficult to discern within an online text message. Consequently, sarcasm detection has received considerable attention from both academia and industry in recent years. Nevertheless, the majority of current approaches simply model low-level indicators of sarcasm in various machine learning algorithms. This paper aims to present sarcasm in a new light by utilizing novel indicators in a deep weighted average ensemble-based framework (DWAEF). The novel indicators exploit the presence of simile and metaphor in text and detect the subtle shift in tone at a sentence’s structural level. A graph neural network (GNN) structure is implemented to detect the presence of simile, bidirectional encoder representations from transformers (BERT) embeddings are exploited to detect metaphorical instances, and fuzzy logic is employed to account for the shift in tone. To determine the existence of sarcasm, the DWAEF integrates the inputs from these indicators. The performance of the framework is evaluated on a self-curated dataset of online text messages. A comparative report between the results acquired using primitive features alone and those obtained using a combination of primitive features and the proposed indicators is provided. The highest accuracy of 92% was achieved by DWAEF, the proposed framework that combines the primitive features with the novel indicators, compared with 78.58% obtained using a Support Vector Machine (SVM), the lowest among all classifiers.
Introduction
Natural languages have evolved gracefully over time all around the globe. Various nuances of a language allow humans to put forth their views on myriad topics with ease and creativity. The use of figurative language by native speakers is one such medium of expressing opinions [36]. Sarcasm, interlaced with irony and wit, affords both sharpness and subtlety to convey contempt. Automatic detection of sarcasm in text is one of the critical challenges faced by researchers in the field of sentiment analysis. Sensing the negative connotation in a sentence containing positive words is required to detect sarcasm in an effective manner. Moreover, sarcasm is closely tied to linguistic and cultural norms and practices, and its use can vary significantly between different languages and cultures. While sarcasm has clear definitions in the literature, its interpretation may be greatly affected by the cultural background and contextual knowledge of the reader [7,14]. Sarcasm can be considered rude or disrespectful in some cultures while being accepted as a form of humour in others. Early computational models developed for sarcasm detection made use of primitive features such as n-grams, punctuation and intensifiers, and exploited machine learning algorithms for classification purposes.
To identify sarcasm in text in an automated manner, the present study proposes a deep weighted average ensemble-based framework (DWAEF). The proposed framework makes use of three indicators to produce competent results. These indicators exploit the presence of simile and metaphor in text and identify subtle shifts in tone between the constituent clauses of a sentence. The framework leverages deep learning components, namely graph neural network (GNN) [16] and bidirectional encoder representations from transformers (BERT) [11] based embeddings to detect simile and metaphor respectively, and fuzzy logic [48] to apprehend polarity shifts between the constituent clauses of a sentence. Finally, the outputs of the three components are provided to the DWAEF, an ensemble structure comprising attentive interpretable tabular learning (TabNet) [2], one-dimensional convolutional neural network (1-D CNN) and multilayer perceptron (MLP) based learners. The results obtained using the ensemble method are thoroughly compared with those obtained using only the base learners. With the accuracy of 92.01% achieved by DWAEF, the proposed ensemble-based approach surpasses the results obtained using primitive features only. The main contributions of this study are summarized below:
- Leveraging key linguistic features, namely simile, metaphor and the constituent clauses of a sentence, for sarcasm detection; to the best of the authors’ knowledge, these have not yet been used together for this purpose
- Implementing a GNN in the framework to detect the presence of simile in a text on the basis of a sentence’s dependency tree
- Exploiting BERT embeddings to detect the presence of metaphor in a text
- Capturing the shift in polarity of a sentence’s constituent clauses using fuzzy logic
- Harnessing an ensemble structure of various deep learning algorithms to facilitate sarcasm classification tasks.
The rest of the paper is organized as follows. Section 2 discusses the earlier research done in the field of sarcasm detection. Section 3 puts forth the motivation behind the proposed study. Section 4 presents and describes the proposed methodology. Section 5 describes the experiments and gives a detailed analysis of the results obtained. Section 6 concludes the paper.
Related work
Detecting sarcasm is challenging for humans, and even more so for machines. As a result, it has gained popularity in many natural language processing (NLP) applications such as social media analysis, sentiment analysis and customer service [42]. The study in [3] contended that sarcasm is distinct from other forms of figurative language and contributed to the understanding of the linguistic and cognitive processes underlying sarcasm. An extensive survey of the literature brought to light that researchers have mainly applied machine learning techniques to a variety of features, including lexical, syntactic and semantic features, for detecting sarcasm. The common form of sarcasm consists of a positive sentiment situation followed by a negative sentiment situation. The study in [43] discussed an algorithm that automatically learns positive and negative sentiment phrases from sarcastic tweets. While the approach was innovative and showed promising results, the method had several potential drawbacks related to limited applicability, lack of context sensitivity and limited evaluation. In [44], the authors surveyed several machine learning algorithms for classifying sarcastic tweets and found that a combination of Support Vector Machine (SVM) and convolutional neural network (CNN) resulted in higher prediction accuracy. The review also covered various aspects of sarcasm detection, including the datasets used, machine learning algorithms employed, feature selection and evaluation metrics. The authors found that while there was growing interest in using machine learning algorithms for sarcasm detection on Twitter, the field still faced several challenges, such as the lack of annotated data and the difficulty of detecting sarcasm in short, informal messages.
Researchers in [9] applied K-Nearest Neighbours (KNN), Random Forest (RF), SVM and maximum entropy techniques to the following features: sentiment-related, syntactic and semantic, punctuation-related and pattern-related. Their approach relied heavily on pre-defined patterns, which failed to cover all instances of sarcasm on Twitter. New patterns might need to be added, or existing patterns modified, as language and culture evolve over time.
In [17], the researchers harvested sarcastic tweets with the help of hashtags such as #
In recent years there has also been a shift towards using neural networks due to their superior performance in a variety of machine learning tasks. Other machine learning techniques, such as Decision Trees or SVM, have limitations in their ability to handle complex and high-dimensional data, such as language data and may require extensive feature engineering. Neural networks excel at recognizing and identifying patterns in large and complex datasets, including language data. By training on a sizeable dataset of sarcastic and non-sarcastic language, neural networks can detect the distinct features of sarcasm, such as negation, exaggeration and tone of voice. Therefore, they are an ideal tool for sarcasm detection, as they can learn from vast amounts of data and identify subtle patterns that other machine learning algorithms may struggle to recognize. Researchers in recent studies have employed various neural network techniques such as CNN and long short-term memory (LSTM) along with different word embeddings namely, Word2Vec, FastText and GloVe on the Reddit Corpus. They accounted for the impact of varying epochs, training size and dropout on the performance [6,30,41]. These studies only used word embeddings and hyperparameter tuning as features to identify sarcasm, which might not be sufficient to capture all the nuances of sarcasm in language. In [1], the study implemented BERT, robustly optimized BERT pretraining approach (ROBERTa), LSTM, bi-directional long short-term memory (Bi-LSTM) and bi-directional gated recurrent unit (Bi-GRU) models for detecting sarcasm in text. They concluded that the transformer-based ensemble performed better than the baseline models. In [22], the researchers used four component methods namely LSTM, CNN-LSTM, SVM and MLP on the Reddit and Twitter datasets. 
On similar lines, the authors in [15] used an ensemble of LSTM, gated recurrent unit (GRU) and baseline CNNs to detect sarcasm in online text and concluded that using a weighted average ensemble led to better results. However, their approach failed to detect sarcastic tweets written in a very polite way.
Recent studies have made extensive use of word embeddings in deep neural networks for various NLP tasks. However, there is a growing demand for modelling text data as graphs. In comparison to conventional neural networks such as recurrent neural networks (RNN), LSTM, CNN and BERT, the graphical representation of text allows for more efficient extraction of semantic and structural information. Therefore, numerous researchers have investigated graph-based methods and their application to NLP problems. The first graph attention-based model to identify sarcasm on social media was proposed in [38]. The graph model captured complex relationships between a sarcastic tweet and its conversational context by modelling a user’s social and historical context together. GNN, a modern class of networks applied to graph-structured data [47], has found application in the field of sarcasm detection. In [19], a graph convolutional network (GCN) was used to capture global information features in a satirical context and a Bi-LSTM was implemented to capture the sequence features of the comments. The two sets of results were combined and evaluated using a conventional classifier. Later, the authors in [26] proposed an affective dependency graph convolutional network framework to detect messages with implied contradictions and incongruity. Apart from the state-of-the-art technology, many researchers have investigated the use of different features in text data and different methodologies for detecting sarcasm. The article [10] provides a detailed literature survey on sarcasm detection. Additionally, it provides a detailed analysis of the set of features used for sarcasm detection.
Section 2.1 discusses the types of feature sets used in sarcasm detection and how researchers have employed them in past studies.
Types of primitive feature sets used in sarcasm detection
Previous works on sarcasm detection made use of low-level features such as n-grams, punctuation, intensifiers and so forth. Some of the primitive features such as punctuation count, count of mixed-case words, count of repeated words and letters, presence of intensifiers and presence of interjections are later used in this research as part of feature set preparation. The primitive features used in sarcasm detection can be broadly classified into:
Lexical features: This feature set includes text properties such as unigrams, bigrams, n-grams, skip-gram, hashtags, etc. The study in [27] used unigrams, bigrams and trigrams to create more general sarcasm indicators thereby improving the precision and recall of their bootstrapping classifier. Researchers in [4] created binary indicators of lower-cased word unigrams and bigrams along with brown cluster unigrams and bigrams which grouped words used in similar contexts into the same cluster. The authors of [43] extracted every unigram, bigram and trigram that occurred immediately right after a positive sentiment phrase in a sarcastic tweet.
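The n-gram features described above can be extracted with a few lines of code. This is a minimal sketch, not the cited authors' implementation; the tokenizer (whitespace split) and the example sentence are illustrative assumptions.

```python
# Minimal n-gram extraction over a whitespace-tokenized sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sarcastic-sounding sentence (not from the paper's dataset).
tokens = "oh great another monday morning".split()
unigrams = ngrams(tokens, 1)   # 5 unigrams
bigrams = ngrams(tokens, 2)    # 4 bigrams
trigrams = ngrams(tokens, 3)   # 3 trigrams
```

In practice these n-grams would be turned into binary or count-based indicators, as in the bootstrapping classifier of [27].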
Paralinguistics features: These are some of the main features used for sarcasm detection in text. They include emoticons, smileys, number of hashtags, replies and so forth. The study in [9] included the count of positive, negative and sarcastic emoticons. The authors of [29] considered the effect of sentiment contained in hashtags by developing a set of rules around the number of hashtags and their polarity. Researchers in [32] took into account the sentiment of replies to the user.
Hyperbole features: These features include intensifiers, interjections, quotes, punctuation and so forth. Researchers in [4] created a binary indicator for the presence of 50 intensifiers retrieved from Wikipedia. The authors of the study in [40] opined that writers often use sarcasm-based writing styles to compensate for the lack of visual or verbal cues. The authors in [5] accounted for uppercase and lowercase characters along with the repetition of punctuation marks.
Contextual features: These features comprise extra components, outside the realm of formal linguistics, used frequently in a sentence, especially in online messages. The researchers in [17] harvested a large number of sarcastic tweets with the help of hashtags such as #not#sarcasm and divided them into tweets containing user mention and tweets that do not. In [4], the researchers emphasized extra-linguistic information from the context of a tweet in the form of ‘author features’, ‘audience features’, ‘environment features’ and ‘tweet features’ and achieved gains in accuracy compared to purely linguistic features in sarcasm detection.
The previous research and development in the field of sarcasm detection prompted the authors of this paper to take on this problem and address its concerns at a new level. The following section describes the impetus behind the present study.
Motivation
This section explains the rationale for delving into the complexities of sarcasm detection using similes, metaphors and the clausal structure of a sentence. Section 3.1 discusses similes in literature and forms the base for the proposed methodology for its computational detection. Section 3.2 introduces and explores metaphors in literature thereby building the foundation for its computational detection. Section 3.3 deliberates upon a sentence’s clausal structure as well as the polarity change from one clause to another. Section 3.4 lays the motivation for using deep learning methods in an ensemble structure.
Simile
A simile is a figurative device used to draw comparisons between two unlike things. Its presence is always explicitly indicated with the usage of “
Examples and constituent components of a simile
A metaphor is a figure of speech that compares two unrelated ideas. At the basic linguistic level, both metaphor and simile involve the juxtaposition of two concepts. However, metaphors lack the usage of “
A metaphor also arises when seemingly unrelated properties of one concept are seen in terms of the properties of some other concept. Metaphorical utterances in sarcastic remarks in certain situations are common. For example, “
Types of metaphors addressed in this study and their examples
Clauses are groups of related words which, unlike phrases, have a subject and a verb. A clause can be a part of a sentence or be a complete sentence in itself. All sentences have at least one main clause. The main clause is a clause that can stand alone as an independent complete sentence. On the other hand, a subordinate clause is a clause that cannot stand as an independent complete sentence by itself. It is typically introduced with a subordinating conjunction and is dependent on the main clause. Consider the examples taken from an online article
- A shift from positive polarity to negative polarity: In this type, sarcastic sentences contain positive expressions followed by negative expressions. Consider the sarcastic sentence, “
- A shift from negative polarity to positive polarity: In this type, sarcastic sentences contain negative expressions followed by positive expressions. For instance, “
To cater to such situations, this research proposes to measure the polarity shift from the main clause of a sentence to its subordinate clause, at various degrees, as a potential indicator of sarcasm. The significance of fuzziness comes into play while dealing with ambiguities surrounding how positive or negative a stand-alone clause can be. Combining fuzziness with the computational detection of polarity shift may yield more effective results.
Distinctive subtleties between main and subordinate clauses
Current cutting-edge research studies utilise geometric deep learning, BERT and fuzzy logic. This study combines these techniques into a single framework in order to produce competent results. Furthermore, most authors have employed conventional machine learning classifiers for evaluation purposes, whereas ensembles of deep learning algorithms (TabNet, CNN and MLP) are employed in this research. One of the most significant issues with conventional machine learning techniques is that they frequently fail to capture the underlying characteristics and structure of the data. Consequently, poor performance is observed when these algorithms are applied to datasets that are highly imbalanced, high-dimensional and noisy [12]. It is therefore essential to construct an efficient model, particularly for complex tasks such as sarcasm detection. Ensemble learning is one such approach. Ensemble learning strategies enhance the performance and accuracy of a predictive model by merging multiple models. This approach involves training several models, each with unique algorithms or hyperparameters, and then fusing their predictions in a manner that maximizes the final outcome. Any ensemble framework comprises a collection of base learners and a meta-learner. Base learners, also known as weak learners, are machine learning classifiers whose predictions are combined with those of other weak learners to compensate for their individual weaknesses. The meta-learner, or strong learner, is the combined learnt model. The promising results obtained by past researchers with different ensemble structures for sarcasm detection motivated the authors of this work to implement the deep ensemble framework DWAEF. The framework is comprehensively described in the forthcoming section.
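The fusion step of a weighted-average ensemble can be illustrated with a minimal sketch. The base-learner probabilities and weights below are invented for illustration and do not reflect the DWAEF's learned values.

```python
# Sketch of weighted-average fusion, assuming each base learner outputs a
# probability that a sentence is sarcastic.
def weighted_average_ensemble(probs, weights):
    """Combine base-learner probabilities using normalized weights."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total

# Hypothetical outputs of TabNet, 1-D CNN and MLP base learners for one sentence.
base_probs = [0.80, 0.65, 0.90]
weights = [0.4, 0.25, 0.35]   # e.g. proportional to validation performance
score = weighted_average_ensemble(base_probs, weights)
label = "sarcastic" if score >= 0.5 else "non-sarcastic"
```

Weighting lets a stronger base learner contribute more to the final decision than an unweighted average would allow.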
Methodology
The methodology followed by this research is elaborated in Fig. 1. A detailed description of dataset preparation is provided in Section 4.1. Once the data has been collected, it goes through several stages of preprocessing, as elaborated in Section 4.2. The pre-processed data is then annotated by four expert linguists, who agreed unanimously that the dataset consists of both sarcastic and non-sarcastic sentences. Following this, the pre-processed and correctly labelled data goes through the feature extraction process, wherein primitive features are extracted along with the indicators proposed by this research. After feature set preparation, the results are obtained using the ensemble framework. Each module portrayed in Fig. 1 is explicated in the forthcoming subsections. Section 4.3 discusses the primitive feature set preparation. Section 4.4 discusses the detection of the novel features, namely simile, metaphor and the change in polarity of a sentence’s constituent clauses.

The methodology of the proposed study.
Description of the dataset
The authors of this work prepared a dataset of 2,891 sentences written in English. The dataset utilized in this study was sourced from various platforms, including Twitter, News Headlines and the SARC datasets. TextBlob [25] was used to check whether the text was written only in English. Out of these, 1,538 were sarcastic and were compiled from various sources: i) 520 sentences were extracted from Twitter with the hashtags #sarcasm, #not, #sarcastic, #irony and #satire between June 2022 and October 2022; ii) 520 were taken from the News Headlines dataset curated by [31]; iii) the remaining 498 were taken from the SARC dataset curated by [20]. The 1,353 non-sarcastic sentences were compiled from Twitter and the News Headlines dataset. These sources do not belong to the same domain or topic. Twitter data, for instance, encompasses a broad range of subjects, whereas News Headlines may concentrate on specific areas, such as politics, sports or entertainment. The domain can significantly affect the outcomes of sarcasm detection because the use of sarcasm can vary depending on the context and subject matter. For example, the prevalence of sarcasm in political discussions may differ from its occurrence in conversations about sports or entertainment. The motivation for combining data from various online sources such as Twitter, Reddit and News Headlines is to expose the model to different writing styles as well as the myriad domains prevalent online. Moreover, the features proposed in this study operate independently of the domain, so no visible domain effects were observed in the results. This approach is therefore expected to help the model make accurate predictions for a wide range of text messages.
Annotation: The dataset was annotated independently by four expert linguists. Once all four annotators had completed their work, the annotations were compared to identify any discrepancies or disagreements. These were resolved through a consensus-based approach, in which the annotators discussed each disputed case and agreed on the correct annotation together. The agreed-upon annotations from all four annotators were used to create a final, consensus annotation for the entire dataset. A series of preprocessing measures was then taken; after preprocessing, the dataset was reduced to 2,889 sentences. The composition of the dataset and the distribution of sarcastic and non-sarcastic sentences are illustrated in Table 4.
The use of slang, hashtags, emoticons, alterations in spelling and loose punctuation is not inherently redundant; these devices serve various purposes, such as expressing emotions or emphasizing a point. However, they can pose challenges for NLP and text-analysis tasks, as they often deviate from standard grammar and syntax. The study in [13] provides a more nuanced understanding of how these features of informal text can be modelled and analyzed to improve the quality of NLP frameworks and crowdsourced annotation tasks. The authors have accordingly taken these features into account before removing them. The following data pre-processing measures were carried out:
- Duplicate tweets and re-tweets were dropped.
- Hashtags were completely removed.
- Tweets containing URLs were dropped.
- Emojis were counted and then removed from the text.
- The occurrences of the punctuation marks ‘.’, ‘?’, ‘*’, ‘!’ and ‘,’ were first counted, after which the data was freed of irrelevant punctuation marks.
Primitive features
The primitive features used by this study include various features explained earlier in Section 2. The said feature set consists of punctuation count, count of mixed-case words, count of repeated words and letters, presence of intensifiers and presence of interjections. Each one of the preceding features is comprehensively explained below.
Punctuation count: Punctuation marks are sometimes overused to indicate sarcasm. For example, to emphasise a point, users use an asterisk (‘*’). To represent a pause, an ellipsis (‘…’) is used, and a string of exclamation marks (‘!!!’) indicates exclamatory utterances [40]. Thus, each of the previously described punctuation marks, along with ‘.’, ‘?’, ‘*’, ‘!’ and ‘,’, was counted as a feature.

Count of mixed-case words: This feature counts the occurrences of mixed-case words in the text.

Count of repeated words and letters: Users also tend to repeat letters in words to over-emphasize parts of the text. A similar pattern can be observed in the case of words. As a result, the numbers of repeated letters and repeated words were counted and used as two individual features.

Presence of intensifiers: Intensifiers or hyperbolic words are generally adverbs or adjectives which strengthen the evaluative utterance of a sarcastic remark. Consider the utterances taken from [23], “
Presence of interjections: Interjections are words or phrases primarily used in a sentence to convey emotions. For instance, “
- The number of times words having opposite polarities come together: this feature captures the contrast between two words having opposite polarities.
- Length of the largest sequence of words with polarities unchanged.
- Count of positive and negative words.
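Several of the surface features above can be computed with simple string operations. This is a hedged sketch; the exact regexes, tokenization and feature names are illustrative assumptions rather than the authors' implementation.

```python
import re

def primitive_features(text):
    """Count a few of the primitive surface features described in the text."""
    tokens = text.split()
    return {
        # Occurrences of the punctuation marks counted in this study.
        "punct_count": sum(text.count(p) for p in ".?*!,"),
        # Words mixing cases beyond a simple initial capital, e.g. "GreaT".
        "mixed_case_words": sum(
            1 for t in tokens
            if any(c.islower() for c in t) and any(c.isupper() for c in t[1:])
        ),
        # Letters repeated three or more times in a row, e.g. "Soooo".
        "repeated_letters": len(re.findall(r"(\w)\1{2,}", text)),
        # Tokens that occur more than once (case-insensitive).
        "repeated_words": len(tokens) - len(set(t.lower() for t in tokens)),
    }

feats = primitive_features("Soooo GreaT... really really great!!!")
```

A lexicon lookup (e.g. a list of intensifiers and interjections) would extend this dictionary with the remaining presence-based features.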
GNN framework for simile detection

The dependency trees depicting the syntactic dependency between various components of a simile.

The GNN framework for simile classification.

Simile dataset preprocessing workflow.
For the purpose of this research, simile is detected on the basis of its syntactic pattern using a GNN. The process typically involves first parsing the sentence using a syntactic parser to identify the syntactic relationships between the words. In this case, this was done using Stanford NLP Group’s CoreNLP server [28]. Figure 2 presents dependency trees of two sentences containing similes created using Stanford NLP Group’s CoreNLP server. The resulting parse tree is then converted into a graph representation. To capture more complex syntactic relationships between the words in a sentence, the authors of this research implemented Bi-Fuse GraphSAGE [18]. To create a representation of each node, GraphSAGE aggregates information from its neighbouring nodes in the graph. This representation is then refined by Bi-Fuse GraphSAGE, which encodes the graph structure in a bi-directional manner by passing messages between nodes in both forward and backward directions, enabling more complex relationships between nodes to be captured. Subgraphs corresponding to specific syntactic structures of a simile are identified by analyzing these refined node representations. Additionally, a graph-level representation is created by combining these refined node embeddings, summarizing the properties of the entire graph. The class label of a sentence is predicted using these graph-level representations. A GNN-based text classification model is used to learn the dependency structure of similes. The entire set-up for the simile classification model consists of a graph construction module, a graph embedding module and a prediction module. Each of the modules is implemented using Graph4NLP library [46]. The modules are elaborated thoroughly below and the entire framework is summarized in Fig. 3.
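The parse-to-graph step can be illustrated with a small hand-rolled sketch. The dependency triples below are hand-written stand-ins for parser output on a sentence such as "Her smile is like the sun" (not actual CoreNLP output), and the rule-based pattern check is a simplified proxy for what the trained GNN learns, not the authors' implementation.

```python
def build_graph(triples):
    """Adjacency list keyed by head word, holding (relation, child) pairs."""
    graph = {}
    for head, rel, child in triples:
        graph.setdefault(head, []).append((rel, child))
    return graph

def has_simile_pattern(graph):
    """Flag a head node that has both a 'like'/'as' marker child and a
    nominal subject, i.e. the explicit-comparison structure of a simile."""
    for head, edges in graph.items():
        rels = {r: c for r, c in edges}
        if rels.get("case") in ("like", "as") and "nsubj" in rels:
            return True
    return False

# Hand-written stand-in triples for "Her smile is like the sun".
triples = [
    ("sun", "nsubj", "smile"),
    ("sun", "cop", "is"),
    ("sun", "case", "like"),
    ("sun", "det", "the"),
    ("smile", "nmod:poss", "Her"),
]
graph = build_graph(triples)
```

In the actual framework, this graph (nodes, edges and relation labels) would be fed to the Graph4NLP embedding module rather than matched against a fixed rule.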
Word2Vec and GloVe are two architectures commonly used for word embedding, but they have limitations in certain tasks, such as representing terms that are not in their vocabulary and distinguishing between opposites. For example, the words “good” and “bad” are often very close to each other in the vector space created by these models, which can be problematic for NLP applications like sentiment analysis. BERT, on the other hand, is a pretraining method that uses a self-supervised approach to learn from masked text sections. Developed by a team at Google Research, BERT is based on the transformer architecture and is designed to learn deep bidirectional representations from unlabeled text by conditioning on both left and right contexts. While Word2Vec and GloVe assign each word a single static vector irrespective of its surrounding context, BERT attends to both the left and right context of a sentence to fully comprehend the meaning of the target word or group of words. As a result, the detection of metaphors is achieved by generating embeddings using a BERT-based network. The basic BERT model, without any fine-tuning, can generate embeddings for the phrases that can be used to calculate cosine similarity. Figure 5 illustrates the framework used for detecting the presence of a metaphor and Fig. 6 illustrates the BERT-based network used for generating the embeddings, with the hidden layer representations in red. For the BERT base model, each encoder layer outputs a set of dense vectors.

The framework for metaphor detection.
Each vector contains 768 values, which together constitute a contextual word embedding. Initially, each sentence is split into two halves. For sentences of type I, phrase A consists of the subject and phrase B consists of the object. On the other hand, for sentences of type II, phrase A consists of a verb which acts on a noun phrase represented by phrase B. BERT-based embeddings are generated for both phrases A and B and their cosine similarity is calculated. The entire process can be summarized as follows:

BERT-based network used for generating embeddings.
- All the sentences in the dataset are first tokenized using spaCy.
- The sentences are then split into two phrases: one containing the subject of comparison and the other containing the object of comparison.
- Dense contextual embeddings are constructed for each of the phrases; the last_hidden_state tensor from the BERT model is extracted to quantify textual similarity.
- A pooling operation is performed that takes the mean of all token embeddings and compresses them into a single vector representing the phrase.
- The cosine similarity between the vectors of the two phrases is calculated.

Following multiple trials, a threshold value of 0.7 was chosen to determine the presence or absence of a metaphorical instance in a sentence. A cosine similarity larger than 0.7 accurately indicated the lack of a metaphorical statement, whereas one less than 0.7 suggested its presence.
All the sentences in the dataset are first tokenized using spaCy.
All the sentences are split into two phrases: one containing the personified verb and the other containing the object of comparison.
For each of the phrases, dense contextual embeddings are generated and textual similarity is measured in terms of cosine similarity.
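The mean-pooling and cosine-similarity steps above can be sketched independently of the model itself. In this illustrative sketch, tiny three-dimensional toy vectors stand in for the 768-dimensional token embeddings taken from BERT's last_hidden_state; the threshold is the 0.7 value chosen in this study.

```python
import math

def mean_pool(token_embeddings):
    """Average per-token vectors into one vector representing the phrase."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "token embeddings" for phrase A (two tokens) and phrase B (one token).
phrase_a = mean_pool([[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]])
phrase_b = mean_pool([[1.0, 1.0, 1.0]])
sim = cosine_similarity(phrase_a, phrase_b)
is_metaphor = sim < 0.7   # low similarity between phrases suggests a metaphor
```

With real BERT embeddings, a subject and object that are semantically distant (a hallmark of metaphor) would yield a similarity below the threshold.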
Fuzzy logic is a form of multi-valued logic that deals with reasoning that is approximate rather than precise. In fuzzy logic, a concept can possess a degree of truth (degree of membership) denoted by ‘µ’ anywhere between 0.0 and 1.0, unlike in standard logic where a concept can only be completely true (1.0) or false (0.0). Fuzzy logic is useful for reasoning about inherently vague concepts, such as “tallness”. In traditional set theory, an element either belongs to a set or not, but in fuzzy logic, an element can partially belong to a set, with a degree of membership ranging between 0 and 1. In general, the degree of membership is used to represent the uncertainty or fuzziness in a concept or variable and is a key component in fuzzy logic’s ability to reason with imprecise or uncertain information. In fuzzy logic systems, a membership function is used to map this degree of membership of an element to a set, based on a specified set of fuzzy rules. Trapezoidal membership functions are one way to define fuzzy sets. They are a type of membership function that allows for more flexibility in defining the shape of the fuzzy set.
A trapezoidal membership function is a piecewise linear function that has four parameters: a, b, c and d, where a ⩽ b ⩽ c ⩽ d. The function starts at 0 for x ⩽ a, increases linearly from 0 to 1 from a to b, stays at 1 from b to c, decreases linearly from 1 to 0 from c to d, and stays at 0 for x ⩾ d. By using trapezoidal membership functions, fuzzy sets can be defined that have gradual transitions between membership degrees. The shape of the membership function can be controlled by adjusting its parameters, allowing for the creation of fuzzy sets that match the intended intuition. For instance, a trapezoidal membership function can be used to define the fuzzy set “tall person” with parameters such as a = 170 cm, b = 175 cm, c = 185 cm and d = 190 cm. The resulting function would have a membership degree of 0 for heights below 170 cm and above 190 cm, a membership degree of 1 for heights between 175 cm and 185 cm, and gradually increasing and decreasing membership degrees for heights between 170 cm and 175 cm and between 185 cm and 190 cm. Trapezoidal membership functions thus enable precise and flexible modelling of linguistic variables in fuzzy logic.
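The piecewise definition above translates directly into code. The following sketch implements the trapezoidal membership function and instantiates the "tall person" example with the parameters just given:

```python
def trapezoidal_mu(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c],
    linear ramps on [a, b] and [c, d]. Requires a <= b <= c <= d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge

# The "tall person" fuzzy set from the example above.
def tall(height_cm):
    return trapezoidal_mu(height_cm, 170, 175, 185, 190)
```

A height of 180 cm is fully "tall" (µ = 1.0), 172.5 cm is half-way up the rising edge (µ = 0.5), and anything below 170 cm or above 190 cm has µ = 0.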
The aim of this study is to assess the change in polarity from the main clause to the subordinate clause of a sentence, at different levels, in order to identify possible instances of sarcasm. Fuzziness is relevant in addressing the uncertainties related to the degree of positivity or negativity of an independent clause. By combining this with the computational identification of polarity shift, more effective outcomes can be achieved. The following steps were performed to detect sarcasm in the form of disparity of sentiments in clauses:
Tokenization: All the pre-processed textual data is first tokenised using spaCy’s NLP object.
Separation of clauses: To find the polarity of a sentence’s constituent clauses, the sentence is first separated into clauses. This is done in two ways. Each sentence is checked for the presence of the subordinating conjunctions given in appendix A.3 and, if any is found, split at that conjunction. If a sentence does not contain any of these conjunctions, the split is performed on the basis of two markers. The first marker is the subject of the sentence, carrying the syntactic dependency “nominal subject (nsubj)”. The second marker is the last preposition occurring in the sentence, carrying the syntactic dependency “preposition (prep)”. These markers divide the sentence into three parts: the first spans from the beginning of the sentence to the first marker (“nsubj”), the second from the first marker (“nsubj”) to the second marker (“prep”) and the third from the second marker (“prep”) to the end of the sentence.
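The conjunction-based branch of this step can be sketched as follows. This is a simplified, string-based illustration: the paper uses spaCy's dependency labels, and the conjunction list below is a small assumed sample, not the full list from appendix A.3.

```python
# A few subordinating conjunctions for illustration only;
# the full list used by the framework is given in appendix A.3.
SUBORDINATING_CONJUNCTIONS = {"because", "although", "while", "since", "unless"}

def split_into_clauses(sentence):
    """Split at the first subordinating conjunction found.
    If none is present, return the sentence unchanged (the
    dependency-based nsubj/prep split would apply instead)."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.lower() in SUBORDINATING_CONJUNCTIONS and i > 0:
            return [" ".join(words[:i]), " ".join(words[i:])]
    return [sentence]
```

For example, "I love waiting because it is so fun" splits into the main clause "I love waiting" and the subordinate clause "because it is so fun".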
Computing the polarity of clauses: To find the polarity of a sentence’s constituent clauses, pysentimiento [37] is used. pysentimiento is an open-source, transformer-based Python toolkit for sentiment analysis and text classification that uses BERTweet [33] as its base model for English. For each sentence, the list of constituent clauses is taken and a polarity score for each clause, in the form of positive, negative and neutral proportions, is obtained.
Applying fuzzy logic to eliminate overlapping sentiment classes: In each sentence, every clause has three sentiment proportions, i.e., positive, negative and neutral. To determine the overall polarity of a clause, fuzzy logic has been implemented using Simpful [45]. Although pysentimiento provides the proportions of positivity, negativity and neutrality of a clause, it does not indicate the degree (weak, moderate, strong) of each sentiment. This study uses fuzzy logic to determine the overall polarity of the clauses with the help of a set of rules based on the projected degree of each sentiment. The degrees of each sentiment, namely positive, negative and neutral, have been devised using the assumed ranges given in Table 5. The trapezoidal membership function is used to define a fuzzy set for each sentiment. Figure 7 (a, b, c) illustrates the inputs for each sentiment.
The degree of each sentiment along with the corresponding polarity percentage associated with it
The set of fuzzy rules used in this study can be found in appendix B.2. The fuzzified output is presented in (d) part of Fig. 7. Defuzzification is then applied to get the final polarity output.
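A heavily simplified, rule-based stand-in for this step is sketched below. The actual framework uses Simpful with the full rule base from appendix B.2 and the ranges from Table 5; here the degree thresholds are assumed values and the "rule base" is reduced to a single winning-sentiment rule, purely to illustrate the flow from proportions to degrees to an overall polarity.

```python
def degree(proportion):
    """Map a sentiment proportion to a qualitative degree.
    The cut-off values here are assumed; the real ranges are in Table 5."""
    if proportion < 0.33:
        return "weak"
    if proportion < 0.66:
        return "moderate"
    return "strong"

def overall_polarity(pos, neg, neu):
    """Toy stand-in for the Simpful rule base: the sentiment with the
    largest proportion determines the clause's overall polarity."""
    scores = {"positive": pos, "negative": neg, "neutral": neu}
    return max(scores, key=scores.get)
```

For a clause scored 0.70 positive, 0.20 negative, 0.10 neutral, the positive sentiment has degree "strong" and the clause's overall polarity is positive.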
Checking for the disparity of sentiments: The number of clauses with positive, neutral or negative sentiment labels is counted and used to detect polarity shifts from negative to positive or positive to negative. If such a shift occurs, the function returns ‘1’; otherwise, it returns ‘0’.
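The shift check itself reduces to a few lines. The sketch below (with an assumed function name) returns 1 when a sentence's clause labels contain both positive and negative sentiment, i.e., a shift in tone, and 0 otherwise:

```python
def has_polarity_shift(clause_polarities):
    """Return 1 when the clause-level labels contain both positive and
    negative sentiment (a shift in tone), else 0. Neutral clauses
    neither create nor block a shift."""
    labels = set(clause_polarities)
    return 1 if {"positive", "negative"} <= labels else 0
```

So ["positive", "neutral", "negative"] yields 1, while ["positive", "positive", "neutral"] yields 0.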

Fuzzy inputs and output for neutral, negative and positive sentiments.
DWAEF, the proposed deep weighted average ensemble-based framework, is a three-tiered structure. It comprises three base learners: a TabNet, a 1-D CNN and an MLP. The framework is implemented with the help of the DeepStack library [8]. Initially, the curated dataset is used to pretrain the three models. During training, each of the base learners receives the outputs of the GNN-based simile detection framework, the BERT-based metaphor detection framework and the fuzzy-based polarity shift detection framework, as well as the primitive features, as its inputs. Next, the predictions produced by each module of the ensemble are weighed based on their performance on a hold-out validation dataset, represented by a Dirichlet ensemble object. To achieve a more accurate solution, a weight optimization search is used in conjunction with a randomized search based on the Dirichlet distribution on the validation dataset. Randomized search based on the Dirichlet distribution involves randomly sampling the weights from a Dirichlet distribution, which generates a probability distribution over the weights that is continuous and flexible. The weight optimization search iteratively adjusts the ensemble model weights to maximize accuracy on the validation dataset. Through this process, the base learners (a TabNet, a 1-D CNN and an MLP) are assigned optimal weights, which determine the contribution of each model to the final prediction. Subsequently, the predictions of each model are combined through a weighted average. No meta-learner is used in this ensemble method. Lastly, the ensemble model is assessed on a separate test set to determine how well it can perform on new data.
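The randomized Dirichlet weight search can be sketched as follows. This is an illustrative reimplementation, not DeepStack's code: each "model" is represented only by its vector of positive-class probabilities on the validation set, weights are sampled from a symmetric Dirichlet (via normalized gamma draws), and the best-scoring weight vector is kept.

```python
import random

def sample_dirichlet(k, alpha=1.0, rng=random):
    """Sample a k-dimensional weight vector from a symmetric
    Dirichlet(alpha) distribution via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def weighted_vote(weights, probas):
    """Weighted average of per-model positive-class probabilities."""
    return [sum(w * p[i] for w, p in zip(weights, probas))
            for i in range(len(probas[0]))]

def dirichlet_weight_search(probas, y_true, n_iter=500, seed=0):
    """Randomized search: keep the weight vector with the highest
    validation accuracy under a 0.5 decision threshold."""
    rng = random.Random(seed)
    best_w, best_acc = None, -1.0
    for _ in range(n_iter):
        w = sample_dirichlet(len(probas), rng=rng)
        preds = [1 if p >= 0.5 else 0 for p in weighted_vote(w, probas)]
        acc = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

With one strong base learner and two weak ones, the search concentrates weight on the strong learner, mirroring how DWAEF rewards individually better-performing base members.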
The findings obtained using the proposed methodology are reported in Section 5.
Evaluation, results and analysis
This section examines the results acquired by employing different techniques on the dataset. The purpose of the proposed methodology is to efficiently detect sarcasm in online text using the presence of figurative comparisons, i.e., similes and metaphors, and shifts in polarity of the text’s constituent clauses. This section is organised as follows: Section 5.1 discusses the accuracy vs. epochs and loss vs. epochs curves for the GNN framework. The accuracy scores obtained during experimentation with different threshold values by the BERT-based metaphor detection framework are discussed in Section 5.2. The confusion matrix for the fuzzy-based approach is presented in Section 5.3. Section 5.4 analyses the results obtained using DWAEF, the deep weighted average ensemble model.
Results obtained by the GNN-based simile detection framework
The GNN framework described in Section 4.4.1 was pretrained on a dataset of roughly 3,000 sentences, of which roughly 50% were similes and the rest were non-similes. The dataset was split into three groups: 60% of the data was used for training, 20% for testing and the remaining 20% for validation. This pretrained framework was then deployed on the main collated sarcasm dataset to extract the presence of a simile as one of the features. With a batch size of 32, the model was executed for 100 epochs. The rest of the hyperparameters are given in Table 6. The training and validation curves of the proposed framework are illustrated in Fig. 8. It is evident from both curves that the framework is free from both overfitting and underfitting.

Training and validation curves of the GNN framework for simile detection.
Hyperparameter settings for the GNN framework
The testing accuracy obtained using the proposed GNN framework for simile detection was 99.22% using GloVe word embeddings. The state-of-the-art results ensured accurate detection of the simile in the main sarcasm dataset.
Before settling on the best threshold value to assess the existence or absence of a metaphorical instance in a sentence, several values were tested. The various values tested and the accompanying accuracy values are listed in Table 7. It is clear that at a threshold value of 0.7, the most accurate predictions were achieved for both type 1 and type 2 metaphors. Thus, a cosine similarity of more than 0.7 accurately indicates the absence of a metaphorical remark, whereas a cosine similarity of less than 0.7 indicates its presence.
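The threshold sweep summarised in Table 7 amounts to evaluating the same decision rule at each candidate value and keeping the most accurate one. A minimal sketch, using assumed similarity scores and gold metaphor labels (1 = metaphor) rather than the paper's data:

```python
def threshold_accuracy(similarities, labels, threshold):
    """Accuracy of the rule 'cosine similarity < threshold => metaphor'."""
    preds = [1 if s < threshold else 0 for s in similarities]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def best_threshold(similarities, labels,
                   candidates=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the candidate threshold with the highest accuracy."""
    return max(candidates,
               key=lambda t: threshold_accuracy(similarities, labels, t))
```

On toy data where metaphorical pairs score below 0.7 and literal pairs above it, the sweep recovers 0.7 as the best cut-off, matching the value selected in the paper.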
Various threshold values and their corresponding accuracy score
The confusion matrix shown in Fig. 9 depicts the performance of the fuzzy-based framework. The matrix contains the four distinct combinations of predicted and actual values. It shows that the framework correctly identified the polarity variations indicating sarcasm at the clausal level in 1,309 sentences. Additionally, 125 sentences were incorrectly identified as true. On the other hand, the framework also correctly identified the lack of a tone change in 1,228 sentences but failed to recognise a polarity shift in 127 sentences. The framework made accurate predictions for 87.81% of the whole dataset. While 91.28% of all predicted true classes were actually true, 85.22% of all actual true classes were predicted true by the framework. One of the reasons for the state-of-the-art outcome is the use of fuzzy rules to cope with uncertainty over whether a solitary sentence is positive or negative.

Confusion matrix for the fuzzy-based polarity shift detection framework.
The study performs and compares the results of three experiments – 1: sarcasm detection using individual base learners, namely TabNet, 1-D CNN and MLP, with primitive features; 2: sarcasm detection using the same base learners with a combination of both primitive and proposed features; and 3: sarcasm detection using DWAEF with a combination of primitive and proposed features. The hyperparameter settings of the base learners were kept the same in all three experiments to ensure that any observed differences in performance were due to changes in the feature set rather than differences in hyperparameters. Table 8 gives the corresponding hyperparameter settings for each of the base learners.
The results of experiments 1 and 2 are displayed in Table 9. The comparison shows that the outcome of experiment 2 is better than that of experiment 1. Hence, the accuracy values of experiment 2 are then compared with the accuracy value of the third experiment (the overall accuracy of DWAEF) and with three of the most powerful and widely used traditional machine learning classifiers: RF, SVM and AdaBoost (AB). The corresponding detailed comparison report is summarised in Table 10.
A confusion matrix depicting the performance of DWAEF is presented in Fig. 10. The model achieved a high accuracy of 92.01%, correctly predicting the majority of cases. Its precision was 92.98%, meaning that when it predicted the positive class it was correct 92.98% of the time, and its recall was 89.59%, meaning it correctly identified 89.59% of actual positive cases. The F1 score was also high at 91.62%, indicating that the model performed well in terms of both precision and recall. However, the model exhibited some limitations: it produced 91 false positive (FP) predictions, which could lead to unnecessary actions or decisions, and 140 false negative (FN) predictions, which could result in missed opportunities or errors in decision-making. In conclusion, while the model performed well on accuracy, precision, recall and F1 score, there is still room to reduce the number of false positives and false negatives to further enhance its performance. These findings may inform decision-making in applications that utilize this model and guide future improvements in its development.
Hyperparameter settings for TabNet, 1-D CNN and MLP used in DWAEF
Accuracy scores for DWAEF base learners – TabNet, 1-D CNN and MLP
It can be inferred that the DWAEF outperforms all the models used in this study in terms of accuracy. Moreover, the proposed features of this research, namely the presence of figurative comparisons (simile and metaphor) and the shift in polarity of a sentence’s constituent clauses, successfully aid in better detection of sarcasm in online text messages. The detection of sarcasm has also been boosted by switching to a deep weighted average ensemble framework, because the framework assigns each base member’s share of the prediction weight based on how well it performed individually during training. Furthermore, the researchers in [15] failed to detect sarcasm in sentences written in a formal and polite tone, whereas the inclusion of the proposed novel indicators enabled the successful detection of sarcasm in such sentences. Table 11 presents some such examples.
Summary of accuracy scores of all classifiers

Confusion matrix for DWAEF.
Examples of correctly classified sentences written in a polite/formal tone
Detection of sarcasm poses one of the leading challenges in sentiment analysis, as even a single sarcastic remark can influence sentiment analyzers to produce undesirable results. Primitive techniques for sarcasm detection used low-level features and traditional machine learning algorithms. This study looked into sarcasm detection from a new perspective. It proposed DWAEF, a deep weighted ensemble-based framework for sarcasm detection. The framework utilized figurative speech components, mainly the presence of simile, the presence of metaphor and the change in polarity of a sentence’s constituent clauses, using deep learning techniques. The predictions made by these modules were then fed into DWAEF, which comprised a 1-D CNN, a TabNet and an MLP as its base learners. Based on the results, it can be concluded that combining the proposed indicators with the primitive features achieved better results across all classifiers. The proposed ensemble framework also performed better than traditional machine learning classifiers. The proposed technique achieved the highest accuracy of 92.01% when the proposed indicators were combined with primitive features and evaluated using a weighted average ensemble of deep learning algorithms. The study employed various state-of-the-art tools and techniques; still, the proposed framework may be further refined to improve its characteristics and efficiency.
In future, the researchers plan to investigate the use of larger datasets for sarcasm detection in a deep ensemble framework, while considering potential drawbacks such as increased time and complexity for training and a lack of annotation resources. Efficient resource management strategies, such as unsupervised learning techniques, pre-trained models and crowdsourcing models for annotation, will be explored to mitigate these challenges. The optimal dataset size will depend on the specific application and available resources, and further research should identify the trade-offs and limitations associated with using larger datasets for sarcasm detection.
While working in the domain of sarcasm detection, the authors also came across other crucial perspectives regarding sarcasm. It was sensed that the interpretation of sarcasm can also be influenced by cultural and linguistic factors, and people from different cultures may interpret sarcasm differently. Different languages may have their own unique styles and patterns of sarcasm, which may not be easily translatable or understandable to those unfamiliar with the language. Therefore, cultural and linguistic contexts are crucial when studying or detecting sarcasm. Sarcasm detection models that work well in one language or culture may not perform well in another, highlighting the importance of considering these factors in model design and evaluation. Hence, in future, the authors plan to incorporate more advanced tools in the DWAEF framework and fine-tune it to perform cross-lingual, cross-cultural and multimodal predictions.
