Sage Journals: Discover world-class research

Abstract

Recently, emotion recognition in conversation (ERC) has become more crucial in the development of diverse Internet of Things devices, especially closely connected with users. The majority of deep learning-based methods for ERC combine the multilayer, bidirectional, recurrent feature extractor and the attention module to extract sequential features. In addition to this, the latest model utilizes speaker information and the relationship between utterances through the graph network. However, before the input is fed into the bidirectional recurrent module, detailed intrautterance features should be obtained without variation of characteristics. In this article, we propose a residual-based graph convolution network (RGCN) and a new loss function. Our RGCN contains the residual network (ResNet)-based, intrautterance feature extractor and the GCN-based, interutterance feature extractor to fully exploit the intra–inter informative features. In the intrautterance feature extractor based on ResNet, the elaborate context feature for each independent utterance can be produced. Then, the condensed feature can be obtained through an additional GCN-based, interutterance feature extractor with the neighboring associated features for a conversation. The proposed loss function reflects the edge weight to improve effectiveness. Experimental results demonstrate that the proposed method achieves superior performance compared with state-of-the-art methods.

Introduction

Emotion representing an individual's mental state, which is a combination of feelings and thoughts, is a key semantic component for communication. In C. Darwin's third major work of evolutionary theory,¹ he has hypothesized that emotions are linked to their origins in animal behavior. His biological emphasis leads to a focus on six emotional states: happiness, sadness, fear, anger, surprise, and disgust. Plutchik² identified eight primary emotions, visualized by the wheel of emotions, as illustrated in Figure 1. These eight emotions can be grouped into polar opposites: joy and sadness, acceptance and disgust, fear and anger, and surprise and anticipation.

FIG. 1.

Plutchik's wheel of emotion.² Color images are available online.

The Internet of Things (IoT) technology is the network of physical objects that are embedded with sensors, software, and communication functions for connecting each other. The internet-connected objects exchange and analyze data and provide learned information to users. In our daily life, we generate vast amounts of data through various activities. These data consist of important information related to our lifestyle, daily routine, various activities, personality, thoughts, and emotions.

As the IoT environment with connections increases, the big data-driven approach becomes the key connecting point to provide intelligent and very personalized services. The representative technology is deep learning based on the deep neural network (DNN).^3–7 By analyzing the enormous amount of information (data) from diverse internet-connected objects, we can make a more effective learning system that can provide efficient smart services such as a smart home, smart city, and smart health care.

For the interactive smart home, smart city, and smart health care, emotion recognition is indispensable in all areas that require an automated assistant service. Especially, artificial intelligence (AI) speakers are trying to understand the emotional status of users from their conversations. In this field, the deep learning scheme is very popular with a large amount of conversational data as the big data-driven approach.

Recently, with the growth of user-friendly IoT devices such as AI speaker and ChatBot, the IoT technology has further improved the conveniences of life. From these assistant devices, users expect empathy in addition to accurate information. Emotions play an important role in providing services with empathy. By enhancing the performance of emotion recognition in the IoT environment, we can expect better IoT services.

Nowadays, emotion recognition has started receiving more attention in the field of natural language processing (NLP)^8–10 due to its increased demand in recommendation systems, health care, and so on. Emotion recognition in conversation (ERC) is obviously important in a dialog system with an automatic assistant. Establishing an AI assistant with strong emotion recognition ability is a step toward the construction of an artificial secretary very close to a real person.

In ERC, context modeling of both the individual utterances and the relationship between them is required. Recent works on ERC utilized a DNN to extract meaningful context information and recognize the emotions of utterances. The bidirectional encoder representations from transformers (BERT)¹¹ is a pretrained language representation model based on the transformer structure¹² to obtain effective context representations. The transformer model can quickly and accurately process sequential data based on a self-attention mechanism without using a recurrent structure. The BERT representations can be fine-tuned for application in various NLP tasks.

Most of the state-of-the-art methods^13–15 in ERC have structures in which a multilayer, bidirectional, recurrent feature extractor such as long short-term memory (LSTM),¹⁶ the gated recurrent unit¹⁷ and self-attention.¹² Since conversation flows in order, a bidirectional recurrent structure is suitable for extracting temporal information using previous and subsequent sentences. In various fields of researches,^12,18–20 the attention weighting mechanism that captures strong features by fusion of meaningful features has been used. Especially, the self-attention mechanism has been proven by the BERT model to be efficient in many NLP classification tasks. Therefore, it is possible to emphasize the context of each sentence through the self-attention process in ERC.

However, this mechanism-based existing ERC model does not consider speaker information associated with the utterances and the relative positions of other utterances from the target utterance. Recognizing the emotion of a sentence is strongly associated with recognizing the emotion of the speaker, therefore speaker information is the essential key.

In particular, when a speaker's emotion changes, considering the speaker information is crucial since it is mostly caused by the utterances of the other speaker. In addition, the relative positions of target and other utterances are necessary to decide how past utterances influence future utterances, and vice versa. The dialog graph convolution network (DialogueGCN)²¹ alleviated this problem by modeling the conversation using a directed graph. To fit the conversation into the GCN^22–24 structure, the nodes represent individual utterances and edges represent the dependency between a pair of utterances.

Although there is structural improvement by using the GCN, there are still limitations. Feeding the input dialog with the pretrained embedding technique directly into the bidirectional recurrent module has the potential to lose the unique characteristics of individual utterances. To minimize distortion of attributes, sufficient intrautterance features should be extracted before executing the temporal dependence module.

Moreover, the existing method uses the edge weight of the graph when calculating the node features, but does not use it for classification as an independent component. The edge weight in the GCN is crucial to represent a correlation between utterances. For this reason, if the edge weight is predicted incorrectly, the node feature is likely to be represented improperly. Although node features are eventually used as the final probability of classification, the prediction accuracy of edge weights is also an important factor that is worth considering.

In this article, we propose a residual GCN (RGCN) for ERC. For extracting an independent intrautterance feature, we utilize the residual network (ResNet),²⁵ which is widely used as a feature extractor in the computer vision field. ResNet effectively extracts features, even if the set of layers is deeply constructed, by adding the input feature of the layer to the output feature. Along with this, we design a new loss function consisting of node loss and edge loss. By adding a loss term for the edge weight in addition to the loss function for the predicted final node feature, we enhance the sophistication of the loss function.

This article is organized as follows. In the Related Work section, we introduce the related works. In the Methodology section, we present our methodology. The experimental results are shown in the Experiments section. Finally, the Conclusions section makes the concluding remarks for this article.

Related Work

A prior study on emotion recognition used the sentiment analysis. Sentiment analysis is one of the methods used for opinion mining, which is one of the most active research areas in NLP. The main use of sentiment analysis is to examine texts such as posts and reviews uploaded by users regarding the opinions about a product, service, media, and so on. For that reason, numerous researches of sentiment analysis have explored the general texts related to specific domains such as social media analytics,^26,27 marketing,^28,29 and finance.³⁰ In general, sentiment analysis aims to classify three classes: positive, negative, and neutral classes.

As an extension of sentiment analysis, emotion recognition predicts the emotions of texts such as happiness, sadness, fear, anger, surprise, and disgust. Recently, ERC has become a new trend in NLP. As one of the early ERC methods, Poria et al.¹³ proposed bidirectional contextual LSTM (bc-LSTM) to capture contextual information from their surroundings. Gupta et al.³¹ proposed a method to detect and enhance the emotion of the user on social media by analyzing the electroencephalogram signals from the brain. Hazarika et al.³² proposed a conversational memory network for dyadic dialogs utilizing speaker-specific context modeling.

Later, Hazarika et al.³³ proposed an interactive, conversational memory network, which is an improved approach of their previous research.³² Majumder et al.¹⁴ proposed a dialog recurrent neural network (DialogueRNN) that keeps the order of the individual party states in the sequential utterances. Zhong et al.¹⁵ proposed a knowledge-enriched transformer (KET), which applies dynamic knowledge from external knowledge bases and emotion lexicons. The current state-of-the-art model in ERC is the DialogueGCN proposed by Ghosal et al.²¹ DialogueGCN solved the problem related to context propagation issues in DialogueRNN through the graph network.

The graph neural network (GNN) has recently received a lot of attention because it can represent data as a graph structure. The earliest study on GNNs was proposed by Scarselli et al.²² Later, Kipf and Welling²³ proposed GCNs, which generalized the feature of the convolution filter to the graph. Schlichtkrull et al.²⁴ proposed a model of relational data with the GCN.

The existing architecture of the GCN is inappropriate for ERC. Each utterance in a dialog to be used as a node requires additional information as well as the text itself such as speaker information and the order of sentences. Ghosal et al.²¹ organized the structure of the GCN to fit the dialog data.

Methodology

In this section, we introduce the proposed RGCN for ERC and loss function in detail. We first define the ERC task problem. After that our model and loss function are specifically explained.

Task problem definition

Given a conversation C, the ERC task aims to predict the emotions of constituent utterances $u_{1}, u_{2}, \dots, u_{N}$ , where each utterance $u_{i} = [u_{i, 1}, u_{i, 2}, \dots, u_{i, T}]$ consists of T words $u_{i, t}$ and is spoken by a speaker/party p_k. Figure 2 illustrates an example of a conversation between two people from the multimodal EmotionLines dataset (MELD).³⁴ In general, the ERC task is formulated as a multiclass classification problem under a given dialog.

FIG. 2.

An example of an ERC task from MELD.³⁴ ERC, emotion recognition in conversation; MELD, multimodal EmotionLines dataset. Color images are available online.

Overview

Our RGCN for ERC is illustrated in Figure 3, which consists of three modules, intrautterance feature extraction, interutterance feature extraction, and classification. In intrautterance feature extraction, the context features for each utterance can be obtained. For this stage, the input utterances become the robust features that represent the meaningful characteristics of sentences. In interutterance feature extraction, using the intra features, we can obtain more advanced features based on the relationships between utterances and between speakers. As a final stage, the classification module outputs the probabilities of each label.

FIG. 3.

The network architecture of the proposed RGCN for ERC. RGCN, residual-based graph convolution network. Color images are available online.

The input utterances $u_{1}, u_{2}, \dots, u_{N}$ are fed into the intrautterance feature extractor, and then the output features and input utterances are concatenated. The concatenated feature becomes the input for the interutterance feature extractor. The output features, after the process of feature extraction, pass through a classification module. Finally, the output probabilities of each utterance $y_{1}, y_{2}, \dots, y_{N}$ for emotions are obtained.

ResNet-based, intrautterance feature extraction

ResNet is extensively utilized for feature extraction in the DNN in the computer vision field. In the general CNN, when the network is deeper, the problem of vanishing/exploding gradients occurs. ResNet introduces a skip connection that adds the input x to the output after few weight layers: $H (x) = F (x) + x$ , where $F (x)$ and $H (x)$ denote the output after weight layers and final output of a residual block, respectively. Even if there is a vanishing gradient for the weight layers, the identity x can prevent the problem. With this, the skip connection enables a deeper network.

Since recent datasets on ERC include long sentences, the task becomes more challenging. In addition, the output vector size of most embedding mechanisms^35,36 is not small, with 300, 600, and 1024 dimensions. Such being the case, the feature extractor of each utterance in conversation should be performed in a sufficiently deep network. For this reason, the ResNet-based structure is more suitable.

We design the intrautterance feature extractor based on ResNet, as illustrated in Figure 4. For the input of our network, a pretrained GloVe vector representation³⁵ is applied for the embedding mechanism. GloVe is widely used because it can capture fine-grained syntactic and semantic regularities. After obtaining the embedded utterances $G (u_{1}), G (u_{2}), \dots, G (u_{N})$ , we first feed each of them to one convolutional layer with an output feature size equal to 64 and kernel size equal to 3. Then, 10 convolutional blocks with an output feature size equal to 64 and kernel size equal to 3, one convolutional block for upsampling, and 10 convolutional blocks with an output feature size equal to 128 and kernel size equal to 3 are performed in order.

FIG. 4.

Structural details of the ResNet-based, intrautterance feature extractor. ResNet, residual network. Color images are available online.

Each convolutional block consists of a structure in which the convolutional layer, batch normalization layer, and rectified linear unit activation are repeated twice. The kernel size of all the convolutional layers is 3.

By exploiting the ResNet for the intrautterance feature extractor, we can generate a fine sentence-level feature. In other words, it can avoid the misrepresentation of its own context of utterance.

GCN-based, interutterance feature extraction

The filter of convolution has a limitation, in that it is effective for fixed grid-type data such as an image. The GCN appeared to enable effective feature extraction even from nongrid data through the structure using a graph that gives the same effect as a convolution filter. The GCN is suitable for extracting an interutterance feature in the ERC task where the relationship between utterances and the relationship between speakers can be important factors.

The architecture of the GCN-based, interutterance feature extractor is shown in Figure 5. First, we feed the intra feature into bidirectional LSTM to derive the sequential feature, $x_{1}, x_{2}, \dots, x_{N}$ . To capture separate features for each party, we adopt a structure using the GCN. Our GCN-based, interutterance feature extractor is based on the network by Ghosal et al.²¹

FIG. 5.

Structural details of the GCN-based, interutterance feature extractor. Color images are available online.

The representation of feature vectors after intrautterance feature extraction as a directed graph is constructed with vertices $x_{i} \in V$ , edges $e_{i j} \in E$ between x_i and x_j, and edge weights $w_{i j} \in W$ of the edge $e_{i j}$ , where $0 \leq w_{i j} \leq 1$ and $i, j \in [1, 2, \dots, N]$ . Through this definition, a fully connected graph can be derived. In other words, it is connected by edges between all vertices, including itself. However, calculating features for all relationships requires high computational complexity. As an efficient solution, we define a window size for previous and posterior vertices. Because we define the window size as 10, as in the study by Ghosal et al.,²¹ we only consider the front 10 features and the back 10 features.

The attention mechanism is used to calculate the edge weights $w_{i j}$ between two vertices x_i and x_j. Each vertex is multiplied by all attended features of vertices within the window size. Using the softmax function, we can derive the edge weights between 0 and 1. Since vertices and edge weights are defined, the feature vector h_i is computed using the relation-specific transformation in the study by Schlichtkrull et al.²⁴ Utilizing local neighboring features in window size, a hidden feature h_i is computed by calculating the weighted sum with vertices, edge weights, and learnable parameters.

Emotion classifier

After all the feature extraction processes are completed, the sequential feature x_i and final hidden feature h_i are concatenated. The final utterance feature f_i is derived using the softmax function with the concatenated feature as follows:

where $A_{α}$ is an attention module.

Eventually, the probabilities of classification for emotions of each utterance can be derived as follows: $p_{i} = s o f t m a x (R e l u (f_{i})),$ (3)

Loss function

We design a new loss function consisting of node loss L_n and edge loss L_e terms. For the node loss L_n, we use the categorical cross-entropy loss function:

where i indexes samples and j indexes classes, $t_{i, j}$ is the target, and $y_{i, j}$ is the prediction.

To refine the loss function for the relationship of vertices, we propose an edge loss term. Because the desired edge weight for itself should be 1, we design edge loss using this property. The proposed edge loss term is defined based on the sum of absolute deviations as

Finally, the total loss is defined as $L = L_{n} + λ L_{e},$ (7)

where $λ$ is empirically set to 0.001 to balance these two loss terms.

Experiments

Datasets

In this article, we compare the proposed RGCN with edge loss with four methods using the interactive emotional dyadic motion capture (IEMOCAP),³⁷ MELD,³⁴ and EmoContext (EC)³⁸ datasets.

Interactive emotional dyadic motion capture

The IEMOCAP dataset is a multimodal dataset containing textual, visual, and acoustic information. In this article, we only target text. Each conversation consists of two speakers, namely a dyadic dialog. The emotion labels include happy, sad, neutral, angry, excited, and frustrated. Table 1 shows the data distribution of the IEMOCAP dataset.

Table 1.

Data distribution in the interactive emotional dyadic motion capture dataset^*

	No. of utterances
	Train	Val	Test
Happy	351		144
Sad	937		245
Neutral	191		384
Angry	1059		170
Excited	2206		299
Frustrated	3449		381
Overall	8193		1623

Busso et al.³⁷

Multimodal EmotionLines dataset

The MELD is also a multimodal dataset containing textual, visual, and acoustic information generated from the Friends TV series. As the MELD consists of multiparty conversations, it is more challenging than other datasets. It is an extended version of the EmotionLines³⁹ dataset. It contains seven classes, neutral, surprise, fear, sadness, joy, disgust, and anger. Table 2 presents the data distribution of the MELD.

Table 2.

Data distribution in the multimodal EmotionLines dataset^*

	No. of utterances
	Train	Val	Test
Neutral	4710	470	1256
Surprise	1205	150	281
Fear	268	40	50
Sadness	683	111	208
Joy	1743	163	402
Disgust	271	22	68
Anger	1109	153	345
Overall	9989	1109	2610

Poria et al.³⁴

EmoContext

A dialog of the EC dataset contains three utterances with two speakers. In the EC dataset, an emotion label is assigned to only the last utterance of each dialog. Examples of the EC dataset are shown in Figure 6. The emotion labels include others, angry, sad, and happy. Table 3 presents the data distribution of the EC dataset.

FIG. 6.

Examples of the EC³⁸ dataset. EC, EmoContext.

Table 3.

Data distribution in the EmoContext dataset^*

	No. of utterances
	Train	Val	Test
Others	14,948	2338	4677
Angry	5506	150	298
Sad	5463	125	250
Happy	4243	142	284
Overall	30,160	2755	5509

Chatterjee et al.³⁸

Implementation details

In our experiments, the embedding mechanism of the 300-dimensional pretrained 840B GloVe was applied to the input of the network. We trained with setting the size of a minibatch to 32. We used the Adam optimizer⁴⁰ and initially set the learning rate to $1 \times 1 0^{- 4}$ . To alleviate overfitting, we employed dropout,⁴¹ setting the dropout rate as $0.5$ . Our model was implemented using PyTorch.⁴²

Performance comparisons

We compared our RGCN with edge loss with several state-of-the-art methods, including bc-LSTM+Att,¹³ DialogueRNN,¹⁴ KET,¹⁵ and DialogueGCN,²¹ on three datasets. For the comparison models, we trained the models using each of the three datasets. We used the weighted average F1-scores for measuring the overall performance.

The quantitative results in terms of weighted average F1-scores on the IEMOCAP dataset are shown in Table 4. On the IEMOCAP dataset, our RGCN with edge loss achieves a new state-of-the-art F1-score of 65.08%, which is 1.15% better than the second-best model, DialogueGCN. The IEMOCAP dataset has many conversations with over 70 utterances. For the difference of performance, we can explain that extracting refined the intra features and considering edge weight for the loss function is critical in the dataset containing numerous utterances in a dialog.

Table 4.

Quantitative comparison with weighted average F1-scores on the interactive emotional dyadic motion capture dataset

Model	Happy	Sad	Neutral	Angry	Excited	Frustrated	Average
bc-LSTM+Att¹³	32.65	65.14	48.53	63.55	69.63	62.68	58.41
DialogueRNN¹⁴	36.71	83.3	58.99	58.51	76.82	58.79	63.87
KET¹⁵	42.86	69.92	49.47	59.48	66.77	58.86	58.41
DialogueGCN²¹	54.04	79.21	59.12	60.77	72.73	57.22	63.93
RGCN+EdgeLoss	62.69	80.15	59.72	63.69	74.06	55.27	65.08

bc-LSTM, bidirectional contextual long short-term memory; DialogueGCN, dialog graph convolution network; DialogueRNN, dialog recurrent neural network; KET, knowledge-enriched transformer; RGCN, residual-based graph convolution network.

Comparing the results for each emotion class, our RGCN does not show the best result for all emotions in the IEMOCAP dataset. Analyzing this result together with Table 1, it can be seen that other methods also show good results in the case of the emotion class with a large number of training datasets such as sad, excited, and frustrated. On the other hand, for the emotion class with a small number of training datasets, we can see that our RGCN produces the best performance. In other words, our RGCN can perform well even when the dataset is not huge. For the sad class, DialogueRNN¹⁴ achieved the best result due to the RNN model. This means that temporal information was more effective to verify this emotion.

In Table 5, we show the quantitative results in terms of weighted average F1-scores on MELD and EC datasets. The MELD comprises multiparty dialogs with relatively short utterances. In addition, utterances rarely contain emotion-specific expressions. For that reason, emotion modeling is very difficult. Our proposed model exceeds the second-best model, DialogueRNN, by 0.41% on the MELD. Because it is a hard problem, this experimental result is good.

Table 5.

Quantitative comparison with weighted average F1-scores on multimodal EmotionLines and EmoContext datasets

Model	EC					MELD
Model	Others	Angry	Sad	Happy	Average	Average
bc-LSTM+Att¹³	94.37	68.08	76.15	65.91	90.65	54.57
DialogueRNN¹⁴	94.15	69.4	74.05	62.61	90.27	55.57
KET¹⁵	94.72	67.55	74.09	66.92	90.88	54.35
DialogueGCN²¹	95.14	72.1	74.95	69.24	91.63	55.54
RGCN+EdgeLoss	95.21	69.75	76.17	69.24	91.63	55.98

EC, EmoContext; MELD, multimodal EmotionLines dataset.

On the EC dataset, the proposed model and DialogueGCN give the same result. The EC dataset consists of three short utterances in a dialog. It means that the intra feature extractor comprising deep layers is not suitable. Furthermore, because the dataset comprises a few edges, the proposed loss function did not play a significant role.

Conclusions

In various IoT-based systems such as smart home, smart city, and smart health care, technologies related to automated assistants that recognize the emotions of users lead to a high level of services. Particularly, the proposed algorithm can be employed in AI speakers and ChatBot to infer the emotional status of the user.

In this article, we have proposed the RGCN for ERC. Furthermore, we designed a new loss function comprising node and edge loss terms. In our network, the feature extraction module consists of intrautterance and interutterance feature extractors. The proposed intrautterance feature extractor derives its elaborate features using a ResNet-based structure with deep layers. The effective interutterance features can be extracted using the GCN structure that can utilize the relationship. Our RGCN outperforms the two datasets and gives the same result on a dataset compared with the existing state of the art. From experimental results, we verified that the proposed RGCN was suitable for datasets comprising numerous long utterances in a dialog.

Footnotes

Acknowledgments

This research project was supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Copyright Commission in 2020.

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received.

Abbreviations Used

References

Darwinand

, Prodger

. The expression of the emotions in man and animals. USA: Oxford University Press, 1998.

Plutchik

A psychoevolutionary theory of emotions. Soc Sci Inform. 1982; 21:529—553.

Krizhevsky

, Sutskever

, Hinton

. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017; 60:84–90.

Simonyan

, Zisserman

. Very deep convolutional networks for large-scale image recognition. arXiv Preprint arXiv: 1409.1556, 2014.

Mukherjee

, Saini

, Kumar

, et al. Fight detection in hockey videos using deep network. J Multimed Inform Syst. 2017; 4:225–232.

Kim

Convolutional neural networks for sentence classification. arXiv Preprint arXiv: 1408.5882, 2014.

Zhang

, Chu

, Leng

, Miao

. Mask-refined r-cnn: A network for refining object details in instance segmentation. Sensors. 2020; 20:1010.

Kratzwald

, Ilic

, Kraus

, et al. Decision support with text-based emotion recognition: Deep learning for affective computing. arXiv Preprint arXiv: 1803.06397, 2018.

Colneriĉ

, Demsar

. Emotion recognition on twitter: Comparative study and training a unison model. IEEE Trans Affective Comput. 2018; DOI: 10.1109/TAFFC.2018.2807817.

10.

Poria

, Majumder

, Mihalcea

, Hovy

. Emotion recognition in conversation: Research challenges, datasets, and recent advances. IEEE Access, 2019; 7:100.943–100.953.

11.

Devlin

, Chang

M-W

, Lee

, Toutanova

. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv: 1810.04805, 2018.

12.

Vaswani

, Shazeer

, Parmar

, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017; 5998–6008.

13.

Poria

, Cambria

, Hazarika

, et al. Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), 2017. pp. 873–883.

14.

Majumder

, Poria

, Hazarika

, et al. Dialoguernn: An attentive rnn for emotion detection in conversations. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33. pp. 6818–6825.

15.

Zhong

, Wang

, Miao

. Knowledge-enriched transformer for emotion detection in textual conversations. arXiv Preprint arXiv: 1909.10681, 2019.

16.

Hochreiter

, Schmidhuber

. Long short-term memory. Neural Comput. 1997; 9:1735–1780.

17.

Cho

, Van Merriënboer

, Gulcehre

, et al. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv Preprint arXiv: 1406.1078, 2014.

18.

Leng

, Li

, Kim

, Bi

. Dual-source discrimination power analysis for multi-instance contactless palmprint recognition. Multimed Tools Appl. 2017; 76:333–354.

19.

Wang

, Girshick

, Gupta

, He

Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018. pp. 7794–7803.

20.

Wang

, Chan

, Yu

, et al. Edvr: Video restoration with enhanced deformable convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

21.

Ghosal

, Majumder

, Poria

, et al. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv Preprint arXiv: 1908.11540, 2019.

22.

Scarselli

, Gori

, Tsoi

, et al. The graph neural network model. IEEE Trans Neural Netw. 2008; 20:61–80.

23.

Kipf

, Welling

. Semi-supervised classification with graph convolutional networks. arXiv Preprint arXiv: 1609.02907, 2016.

24.

Schlichtkrull

, Kipf

, Bloem

, et al. Modeling relational data with graph convolutional networks. In: European Semantic Web Conference. Springer, 2018. pp. 593–607.

25.

, Zhang

, Ren

, Sun

Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. pp. 770–778.

26.

Kouloumpis

, Wilson

, Moore

. Twitter sentiment analysis: The good the bad and the omg!. In: Fifth International AAAI conference on weblogs and social media. Citeseer, 2011.

27.

Severyn

, Moschitti

Twitter sentiment analysis with deep convolutional neural networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015. pp. 959–962.

28.

Cambria

, Grassi

, Hussain

, Havasi

. Sentic computing for social media marketing. Multimed Tools Appl. 2012; 59:557–577.

29.

Khedr

, Salama

, Yaseen

. Predicting stock market behavior using data mining technique and news sentiment analysis. Int J Intell Syst Appl. 2017; 9:22.

30.

Xing

, Cambria

, Welsch

. Natural language based financial forecasting: A survey. Artif Intell Rev. 2018; 50:49–73.

31.

Gupta

, Sahu

, Nanecha

, et al. Enhancing text using emotion detected from eeg signals. J Grid Comput. 2019; 17:325–340.

32.

Hazarika

, Poria

, Zadeh

, et al. Conversational memory network for emotion recognition in dyadic dialogue videos. In: Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting vol. 2018. NIH Public Access, 2018. p. 2122.

33.

Hazarika

, Poria

, Mihalcea

, et al. Icon: Interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. pp. 2594–2604.

34.

Poria

, Hazarika

, Majumder

, et al. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv Preprint arXiv: 1810.02508, 2018.

35.

Pennington

, Socher

, Manning

. Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014. pp. 1532–1543.

36.

Peters

, Neumann

, Iyyer

, et al. Deep contextualized word representations. arXiv Preprint arXiv: 1802.05365, 2018.

37.

Busso

, Bulut

, Lee

C-C

, et al. Iemocap: Interactive emotional dyadic motion capture database. Lang Resour Eval. 2008; 42:335.

38.

Chatterjee

, Narahari

, Joshi

, Agrawal

Semeval-2019 task 3: Emocontext contextual emotion detection in text. In: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019. pp. 39–48.

39.

Chen

S-Y

, Hsu

C-C

, Kuo

C-C

, et al. Emotionlines: An emotion corpus of multi-party conversations. arXiv Preprint arXiv: 1802.08379, 2018.

40.

Kingma

, Ba

. Adam: A method for stochastic optimization. arXiv Preprint arXiv: 1412.6980, 2014.

41.

Srivastava

, Hinton

, Krizhevsky

, et al. Dropout: A simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014; 151929–1958.

42.

Paszke

, Gross

, Massa

, et al. Pytorch: An imperative style, high-performance deep learning library. Adv Neural Inform Process Syst. 2019; 32:8026–8037.

Residual-Based Graph Convolutional Network for Emotion Recognition in Conversation for Smart Internet of Things

Abstract

Introduction

Related Work

Methodology

Task problem definition

Overview

ResNet-based, intrautterance feature extraction

GCN-based, interutterance feature extraction

Emotion classifier

Loss function

Experiments

Datasets

Interactive emotional dyadic motion capture

Multimodal EmotionLines dataset

EmoContext

Implementation details

Performance comparisons

Conclusions

Footnotes

Acknowledgments

Author Disclosure Statement

Funding Information

Abbreviations Used

References