Sage Journals: Discover world-class research

Abstract

Emotion recognition is one of the most important research directions in the field of brain–computer interface (BCI). However, to conduct electroencephalogram (EEG)-based emotion recognition, there exist difficulties regarding EEG signal processing; moreover, the performance of classification models in this regard is restricted. To counter these issues, the 2022 World Robot Contest successfully held an affective BCI competition, thus promoting the innovation of EEG-based emotion recognition. In this paper, we propose the Transformer-based ensemble (TBEM) deep learning model. TBEM comprises two models: a pure convolutional neural network (CNN) model and a cascaded CNN-Transformer hybrid model. The proposed model won the abovementioned affective BCI competition’s final championship in the 2022 World Robot Contest, demonstrating the effectiveness of the proposed TBEM deep learning model for EEG-based emotion recognition.

Keywords

affective brain–computer interface electroen-cephalogram Transformer deep learning ensemble learning

1 Introduction

Emotion recognition, as an emerging research field, has been attracting increasing attention from researchers of various fields in recent years. It has immense potential for applications in the diagnosis and treatment of mental illnesses [1] and is one of the most important research topics in the field of brain–computer interaction (BCI) [2, 3]. Traditional emotion recognition uses human voice [4, 5], and facial expression [6 –8] to identify human emotions; however, these “signals” cannot detect human emotions efficiently, especially when subjects intentionally disguise their voice and facial expressions. Currently, physiological signals have received increasing attention from researchers in the field of emotion recognition [9, 10]. Many research teams have used physiological signals such as breathing modes, body temperature, electromyography (EMG), electrocardiogram (ECG), and electroencephalogram (EEG) to investigate emotion recognition. Among these, EEG is physiological signal produced directly by the body’s central nervous system or brain [11]. Certain brain regions have a high correlation with emotions; for example, amygdala [12 –14] is widely believed to be related to human emotions; further, the prefrontal lobe [15, 16], which is the advanced information processing and processing node of the human brain, is also believed to be related to emotions. Therefore, EEG can reflect human emotions more truly and reliably as compared to other physiological signals.

In recent years, EEG-based emotion recognition has attracted increasing attention from domestic and international research teams; subsequently, a new direction of BCI, namely, affective BCI (aBCI), has been developed [17]. EEG-based aBCI systems mainly include preprocessing, feature extraction, and classification; among these, preprocessing can significantly improve the signal-to-noise ratio of EEG. EEG, as an unsteady signal, is easily affected by noise such as motion artifact, respiration, electrocardiogram, and electroophthalmogram [18]; therefore, appropriate preprocessing is essential. As a physiological signal with high temporal resolution [19], EEG has a relatively wide frequency range [20]. Therefore, extracting information from different frequency bands through filtering are an important part of preprocessing. Because nearby channels receive information from the same region of the brain, they collect similar EEG signals and are affected by the setting of the reference electrodes, resulting in some redundancy of EEG signals. Therefore, it is necessary to remove redundancy through repeated references, such as common mean reference (CAR), etc. In recent years, intensive studies of EEG features used for emotion recognition, including time domain features, frequency domain features, time-frequency domain features, nonlinear features, asymmetric features, brain connectivity features, etc. have been conducted [21, 22]. Time domain features include the Hjorth parameter [23], high-order crossover (HOC) [24], etc. Time-frequency feature extraction methods mainly include short-time Fourier transform (STFT) [25, 26], wavelet analysis [27], etc. The characteristics of the frequency domain include five classical frequency bands: power spectral density of δ (1–3 Hz), θ (4–7 Hz), α (8–13 Hz), β (14–30 Hz), and γ (31-50 Hz), etc. Differential entropy (DE) [28] is a well-known characteristic describing nonlinear dynamical systems. Asymmetric features generally refer to specific phenomena found in cognitive studies [29, 30]. For example, cognitive studies based on brain imaging technologies have found the lateralization of different regions in the brain during emotional processing [31, 32]. Characteristics of connections of adult functional brain networks refer to the analysis of connection density, clustering coefficient, and other indicators through EEG-derived brain network, and considering them as features for emotion recognition [33 –35]. In addition to the above features, pre-processed EEG signals have standalone characteristics. Ding et al. [36] used them to obtain high emotion recognition performance on DEAP dataset.

Classification models, including convolutional neural network (CNN) [37], recurrent neural network (RNN) [38], long short-term memory (LSTM) [39], restricted Boltzmann machine (RBM) [40], etc. have increasingly attracted researchers’ attention in recent years. Deep learning is favored in many tasks due to its self-adaptive task-related features without manual feature extraction. Transformer was introduced by Google in 2017 [41] and has since been used in natural language processing for various tasks, such as machine translation, and text generation, showing high performance [42, 43]. The advent of DETR (DEtection TRansformer) has led to researchers using Transformer on other modes of data. Although Transformer has been successful in many areas, only a few studies have employed it in the area of EEG-based emotion recognition. Wang et al. [44] used a cascade Transformer combined with prior knowledge to determine spatial dependence between different brain regions. Abdallah Tubaishat et al. [45] used a novel architecture based on Transformer to process multimodal neural signals and identify emotions. Arjun et al. [46] divided EEG signals or time-frequency images generated by wavelet transform into patches by using a Vision Transformer, and subsequently using a Transformer encoder for emotion recognition. Although ensemble learning methods have been proposed in the field of EEG-based emotion recognition, most of the ensemble models employed in previous studies are machine learning models.

Currently, many researchers are increasingly focusing on the application of deep learning in the field of emotion recognition. However, because different models have different preferences for parameter adjustment in the training process of deep learning, ensemble learning is an efficient method to set multiple models and make unified decisions. In cases where there are significant differences between different models, ensemble learning is best suited for EEG-based emotion recognition. Li et al. [47] proposed an ensemble learning method based on multi-objective particle swarm optimization, that includes an ensemble of particle swarm including several traditional machine learning classifiers, such as support vector machine (SVM), decision tree (DT), k-nearest neighbor, etc. Awan et al. [48] obtained the most accurate results on the AMIGOS dataset by using a model integrating SVM, random forest, and LSTM. Although ensemble learning methods have been proposed in the field of EEG-based emotion recognition, most of the ensemble models are machine learning models.

As part of the World Robot Contest, the BCI Controlled Robot Contest aims to promote innovative breakthroughs in BCI technology. The affective BCI technology competition is part of the BCI Controlled Robot Contest. The technical competition focuses on the participating team’s ability to create and improve algorithms. The competition is divided into preliminary and final stages. The preliminary rounds are held in the cloud and the finals are held live, with real-time rankings based on the accuracy and efficiency of the calculated results.

This paper outlines the method used by us in the emotion recognition contest of BCI Controlled Robot Contest, and presents a systematic analysis of this method. In the competition, we proposed a deep learning method based on ensemble learning, terming it as Transformer-based ensemble model (TBEM), as shown in Fig. 1. This method integrates two structures: a hybrid model of CNN and Transformer and a pure CNN model. The chapters of the article are planned as follows. Section 2 introduces the models, datasets, preprocessing methods, experimental setup, and training strategies used by us in the competition in detail. Section 3 presents the experimental results and discussion. Finally, Section 4 presents a summary of TBEM.

Fig. 1

Transformer-based ensemble deep learning model (TBEM) structure. (a) TBEM is composed of two models: a cascaded CNN-Transformer hybrid model (HybridNet) and a pure convolutional neural network model (PureConvNet). (b) HybridNet model architecture: the upper branch of TBEM architecture is a hybrid model of convolutional neural network and Transformer. (c) PureConvNet model architecture: the lower branch of TBEM architecture is a pure convolutional neural network model.

2 Method

2.1 Data

In the preliminary phase of the competition, the organizers provided data containing 80 subjects (50 women and 30 men, with a mean age of 20.16 years, ranging from 17 to 24 years), while the organizers provided data containing 6 subjects (4 women and 2 men) in the final phase. The collection of preliminary and final competition data adopted a similar experimental paradigm, wherein 28 video clips were used as stimuli. The emotions of stimuli included anger, disgust, fear, sadness, amusement, joy, inspiration, tenderness, and neutral emotions. Roughly the same number of stimuli was used for each emotion (anger = 3, disgust = 3, fear = 3, sadness = 3, amusement = 3, joy = 3, inspiration = 3, tenderness = 3, neutral = 4). Each subject was asked to participate in seven experimental blocks, each block contained four trials (one trial containing one stimulus); additionally, the subject was asked to solve 20 arithmetic problems between the two blocks [49]. The experimental paradigm difference between preliminary and final competition data collection is that the experimental paradigm used in the final competition did not include partitioning blocks and did not ask participants to calculate arithmetic problems. Additionally, a 32-channel NeuSen.W32 wireless EEG system was used in the preliminary emotional induction experiment, while a 64-channel wireless EEG equipment was used in the final emotional induction experiment.

2.2 Model

In many competitions on Kaggle, the world’s famous data mining competition platform, participants have adopted the ensemble learning method and achieved more accurate results as compared to those obtained via other methods. Therefore, this paper also adopted the idea of ensemble learning. By combining multiple weak supervised models, ensemble learning aims to obtain a strong supervised model with good generalization. The underlying idea of ensemble learning is that even if a weak classifier makes a wrong prediction, other weak classifiers can correct the error.

This paper outlines a deep learning method based on ensemble learning called TBEM [Fig. 1(a)] that comprises two models: a HybridNet model, where a CNN is integrated with Transformer [Fig. 1(b)] and PureConvNet, that only comprises CNN [Fig. 1(c)]. Additionally, we show the specific parameter settings for both models in Table 1.

Table 1

Specific parameter settings of HybridNet and PureConvNet.

Model	Step No.	Step Name	Parameters	Output shape
TBEM- (HybridNet)	1	Input	Input shape = (Chans, 1750)	(1, Chans, 1750)
	2	Conv2D	Size = (1, 16), Num = 32, Pad = same, BN, ELU, Dropout = 0.5, Pool size = (1, 2)	(32, Chans, 875)
	3	Conv2D	Size=(1, 16), Num = 32, Pad = same, BN, ELU, Dropout = 0.5, Pool size = (1, 2)	(32, Chans, 437)
	4	Conv2D	Size = (Chans, 1), Num = 64, BN, ELU, Dropout = 0.5, Pool size = (1, 2)	(64, 1, 218)
	5	Reshape & Extra Class Token	Input shape = (64, 1)	(64, 219)
	6–8	Transformer Encoder	Head = 4, Head dim = 64, MLP dim = 64	(64, 219)
	9	Extract Class Token	–	(64, 1)
	10	FC	Linear units = 9	9
TBEM- （PureConvNet）	1	Input	Input shape = (Chans, 1750)	(1, Chans, 1750)
	2	Conv2D	Size = (1, 16), Num = 32, Pad = same, BN, ELU, Dropout = 0.5, Pool size = (1, 2)	(32, Chans, 875)
	3	Conv2D	Size = (1, 16), Num = 32, Pad = same, BN, ELU, Dropout = 0.5, Pool size = (1, 2)	(32, Chans, 437)
	4	Conv2D	Size = (Chans, 1), Num = 64, BN, ELU, Dropout = 0.5, Pool size = (1, 2)	(64, 1, 218)
	5	Conv2D	Size = (1, 16), Num = 128, Pad = same, BN, ELU, Dropout = 0.5, Pool size = (1, 6)	(32, 1, 13)
	6	FC	Linear units = 9	9

HybridNet is a model with a large number of parameters; it is designed with reference to Vision Transformer (ViT) [50]. Similar to the original design intention of ViT, Transformer has achieved the performance of SOTA in many pattern recognition tasks, such as BERT [42], GPT3 [43], SwinTransfomer [51], etc. However, because Transformer is not sensitive to the task of processing long sequences, it cannot efficiently process long time series EEG signals in the EEG-based recognition task. Referring to the design of ViT, HybridNet introduces several convolutional modules to compress the sequence and encode the features of EEG signals; thus, the final Transformer module can participate in the processing of data. HybridNet is mainly composed of two sub-blocks: one is a convolutional subblock composed of three convolutional modules, and the other is a Transformer subblock composed mainly of three Transformer encoders (TE).

Each convolutional module contains the same five layers. These include: the convolutional layer, the batch normalization layer (BN), the activation function layer (specifically implemented through the ELU activation function), the dropout layer, and the average pooling layer. While the convolutional layer and pooling layer in the convolutional subblock are both 2-dimensional modules, they realize the functions of one dimension, including time convolution, and space convolution, which is similar to EEGNet [52]. In the convolutional subblock, the first two convolutional modules realize the time perception (or time dimension feature extraction) function, and the third convolutional module realizes the space perception (or space dimension feature extraction) function. The kernel size of each temporal convolution is 1 × 16, while the kernel size of each spatial convolution is chans ×1, which is set to 30 × 1. All the average pooling layers have a pooling size of 1 × 2. The inactivation rate is set to 0.5 for all dropout layers. In addition to TE, the Transformer subblock [Fig. 2(b)] includes the learnable classification marker class token and position-coding, wherein the position-coding adopts the learnable position coding of BERT type instead of the position-coding represented by the sine and cosine functions in ViT. In TE, for MutLI-heads of all multi-head self-attention mechanisms, the number of heads is set to 4; each head processes a sequence dimension of 64, while the fully connected network processes a sequence dimension of 64.

PureConvNet is a lightweight model with fewer parameters as compared to that of HybridNet, designed by referring to EEGNet. The model consists of four convolutional modules, which are similar to the convolutional modules in HybridNet. Each convolutional module consists of a convolutional layer, a BN layer, an ELU activation function layer, a dropout layer, and an average pooling layer. Similarly, the corresponding 1-dimensional feature extraction operation is also implemented through a 2-dimensional convolutional layer. Specifically, PureConvNet adds a convolutional module for time-aware functionality. The kernel size is 1 × 16 for all temporal convolutions; it is chans × 1 for spatial convolutions, and 1 × 2 for the average pooling layer; additionally, the remaining layers remain consistent with HybridNet’s settings.

2.3 Preprocessing

For the offline data (preliminary round data), the abovementioned preprocessing steps were applied to each trial; these included five steps: re-referencing, segmentation, filtering, down-sampling, and normalization. First, re-referencing can be conducted via two methods: bipolar re-reference and common average re-reference (CAR) methods. The bipolar re-reference method subtracts the potential of two adjacent electrodes (according to the location relationship of the 10–20 system). It is important to note that the experiment in the preliminary round employed all 32 electrodes of the NeuSen.W32 wireless EEG system; further, while the device used in the final round does not contain A1 and A2 electrodes, we excluded two channels (N = 30) from the re-reference data in the final round. Additionally, in the preliminary round, we employed a third re-referencing method, namely, mastoid reference (MPR) method, that subtracts the potential of the left mastoid (A1) from the channels located in the left brain, and the potential of the right mastoid (A2) from the channels located on the right side. In the second step, a sliding window method with a window length of 14 s and a step size of 4 s was used for segmentation. In the third step, two filtering operations were performed, namely, a 50 Hz notch filter and a 0.5–45 Hz band-pass filter, both of which were implemented through a 6th order Butterworth filter. In the fourth step, to reduce subsequent computational complexity, the EEG signal was down-sampled to 125 Hz; this reduced the volume of data by half, given that the original sampling rate was relatively high. In the fifth step, Z-score normalization was applied to each down-sampled EEG segment. After implementing the abovementioned five preprocessing steps, due to different trial lengths, the average number of EEG segments obtained was 13.5 (min = 6, max = 29).

These procedures were applied to the online data (finals data), with the exception of differences in data partitioning. Due to the competition’s requirement to report a result every 1 second, the window length for data partitioning remained unchanged, and the step size was updated to 1 second.

2.4 Postprocessing

Similar to many international competitions, in the prediction phase in the affective BCI competition, postprocessing methods were used to integrate the results of multiple classifiers to obtain the final result [53]. The postprocessing method described in this paper includes adding the prediction vectors (i.e., the prediction probabilities for each category) of m HybridNet models to the prediction vectors of n PureConvNet models to obtain the prediction vector of the ensemble model. Finally, the category with the maximum prediction vector is the prediction result of the entire ensemble model. It is important to note that the m HybridNet models are trained by using different subsets of the training set, and the n PureConvNet models are also obtained through different subsets of the training set. However, there may be overlapping training set subsets between the two types of models. In the experiment, both m and n are 3, indicating that the contribution of the two types of models to the results is the same (i.e., the two types of models have the same weight).

2.4 Experimental settings

In the experiment, we used the Adam optimizer with the initial learning rate set to 0.001. The learning rate decay strategy of ReduceOnPlateau was employed [54]; in other words, if the loss did not decrease after seven epochs, then the learning rate decayed to 10% of the original. The batch size was set to 64, the number of epochs was set to 200, and the early stopping strategy was applied. For the accuracy of the results to remain unchanged, the maximum tolerance was 10 epochs. Our model was implemented in the Tensorflow 2.11 framework by using Python 3.10 and run on a GeForce RTX 3090 GPU.

3 Results

In this section, we briefly describe the experimental setup and results.

For the offline analysis in the preliminary round of the affective BCI competition, to evaluate the performance of each algorithm, we used 5-fold cross-validation (CV), with 4-fold CV for training and 1-fold CV for testing. Each fold contained 16 subjects for a total of five cycles. In the online analysis in the final round of the affective BCI competition, when a new continuous EEG signal was obtained, each team needed to provide real-time feedback with an interval of 1 s. Under this requirement, all contestants needed to consider two scenarios of the model. One required adjusting the model according to the existing test data, requiring the real-time data, and light weight of the model. The other required training based on the existing training data and feedback on the results of the new test data. Given that we could fully utilize the existing training data to obtain a model with high performance, and the scenario did not consider model size, we considered using a completely offline approach in the end.

We compared accuracy differences between average and bipolar re-references based on preliminary data (THUR-EP dataset, excluding several subjects online). Among them, PureConvNet obtained the highest accuracy under average re-reference method (31.9 ± 6.9) and HybridNet obtained the highest accuracy under bipolar re-reference method (42.5 ± 6.8). Therefore, in the final round of the competition, we set the following: PureConvNet used an average re-reference method to re-reference, while HybridNet used a bipolar re-reference method to re-reference.

Figure 2 shows the results of the top six teams in the finals. Among them, our team was TJU_TPU. The recognition framework provided by our team won the first place in the final round, and our score was significantly higher than the other teams, showing the effectiveness of the emotion recognition framework proposed by our team.

Fig. 2

Bar chart of the top six scores in the final phrase of the competition. Note: TJU indicates team from Tianjin University. THU indicates team from Tsinghua University. XUT indicates team from Xi’an University of Technology. ZJU indicates team from Zhejiang University. WYU indicates team from Wuyi University. UM indicates team from University of Macau.

Specifically, Fig. 3 describes the accuracy of the results of the final cross-subject classification of different models (online analysis). As can be seen from Fig. 3(b), the integration model proposed in this paper achieved the highest accuracy regarding three subjects (S1–S3), and the second highest accuracy regarding three subjects S4–S6. Of the two sub-models of the ensemble model, HybridNet exceeded the baseline model EEGNet by a significant extent in the data of multiple subjects (S4–S6); further, even the average accuracy of HybridNet in S3 exceeded that of EEGNet by 23.2%, being twice as high. PureConvNet had a 10% average accuracy advantage regarding S3 and S5. HybridNet and PureConvNet had their own advantages in the test data of different subjects. Regarding S3 and S6, HybridNet had 10% average accuracy advantage, while regarding S1, and S5, PureConvNet had 3% accuracy advantage. It is important to note that the integration model proposed by us achieved the highest average performance among the six participants in the final round [Fig. 3(a)], further proving the efficiency of the proposed integration model.

Fig. 3

Cross-subject classification accuracy of different models on final data. (a) Average classification accuracy of different models on six subjects. (b) Classification accuracy of different models on six subjects respectively.

Compared with offline analysis, the collected EEG signals for the online emotion recognition task had obvious environmental noise because the environment of the subjects was not sufficiently quiet, and the subjects were easily distracted by the surrounding environment; therefore, it was difficult to induce strong emotions in the subjects. The average accuracy of HybridNet in the online results was lower than that of the offline results except for S3 and S4, and the online performance of PureConvNet was also slightly lower than the offline performance.

The computational complexity of the algorithm is one of the most important indexes of an online emotion recognition task. The time required for each method to calculate a single sample (14 s channel EEG signal with a matrix size of 30 × 1750) is shown in Fig. 4. The baseline model EEGNet had the lowest computational complexity (0.217 GFLOPs) due to the low number of parameters. However, the computational complexity of the two models proposed by us is similar. It is important to note that the integration model is derived from the integration of HybridNet and PureConvNet, and that the proportion of the two models is the same; therefore, the computational complexity of the integration model is between these sub-models.

Fig. 4

Computational complexity of different models

4 Discussion

Ensemble learning is a highly effective method that can significantly improve a model’s performance in competitions. Thus, we employed ensemble learning to design TBEM. In TBEM, HybridNet generally outperformed PureConvNet in terms of performance, as shown in S2, S3, S4, and S6 [Fig. 3(b)] and for offline data. Moreover, HybridNet’s computational complexity was found to be slightly lower than that of PureConvNet.

The preprocessing method also has a significant impact on the model’s performance, and re-referencing is one aspect of it. Studies have found that re-referencing methods can affect the accuracy of EEG electrode placement [55 –58]. By comparing multiple re-referencing methods for EEG, we found that each method has its advantages. Using TBEM, we combined different re-referencing methods with their corresponding models, it resulted in high scores in the EEG emotion recognition task in the affective BCI competition.

It is challenging to build accurate cross-subject emotion recognition models by using traditional machine learning methods because the neural representation of emotion is highly complex and has individual differences [32]. The existing transfer learning method based on domain adaptation provides a possible solution to the problem of individual differences in emotion recognition models [59]. Thus, using transfer learning is an effective choice to obtain a high classification accuracy in future online emotion recognition tasks.

The online task had a high requirement regarding the computational complexity of the model; based on the data in the previous section, it can be seen that since EEGNet had a low computational complexity, its performance is low in the experiment. In general, the performance of a model increases with its computational complexity. However, ensemble learning provides a solution for online tasks with high real-time performance by integrating multiple models with lower computational complexity.

The results of offline and online experiments prove that the performance of the offline experiment is slightly better than that of the online experiment, which may also be related to the lack of inducement of emotions in subjects in online scenes [60]. This finding also highlights the challenge of inducing high-intensity emotions in real situations.

The postprocessing method includes averaging the predictions of each model. Each type of model has the same weight in the final result, rather than focusing on a particular type of model. In some specific test data, poorly performing models may affect the overall performance of the ensemble model [S4 and S6 in Fig. 3(b)], that results in the accuracy of the ensemble model being lower than that of a single branch model. In future, researchers can assign different weights to multiple models to improve the performance of the ensemble model. Furthermore, other postprocessing methods, such as voting, can be attempted.

5 Conclusion

This paper provides a systematic introduction to the approach employed by us, the winning team of the WRC2022 EEG-based emotion recognition competition, to design TBEM, an ensemble deep learning method. We found that ensemble learning is a highly effective method that can significantly improve a model’s performance in EEG-based emotion recognition competitions. TBEM comprises two models: HybridNet, based on a CNN, and Transformer, and the PureConvNet model, which is based solely on a convolutional neural network; it was observed that HybridNet outperformed the PureConvNet model in terms of the accuracy of EEG-based emotion recognition. Moreover, HybridNet’s computational complexity was slightly lower than that of PureConvNet. Thus, the results of the experiments showed that the model combining CNN and Transformer has great potential in the EEG-based emotion recognition task. The proposed method is meaningful for developing effective BCI systems.

Footnotes

Ethical approval

None.

Consent

All the subjects were informed and have signed a written consent to collect their EEG.

Conflict of interests

All contributing authors report no conflict of interests in this work.

Funding

This work was granted by National Key Research and Development Program of China “Biology and Information Fusion” Key Project (Grant No. 2021YFF1200600), National Natural Science Foundation of China (Grant Nos. 61906132 and 81925020), and Key Project & Team Program of Tianjin City (Grant No. XC202020).

Acknowledgements

All authors thank Haolin Wu (Tsinghua University), Dan Zhang (Tsinghua University), and Chengcheng Hong (Beijing University of Posts and Telecommunications) for their support in providing the data for the final competition.

Authors’ contribution

Dong Ming and Xiaopeng Si provided support for this work; Dong Huang and Yulin Sun provided model and analysis; Dong Huang, Shudi Huang and He Huang completed the original draft writing; and Xiaopeng Si reviewed and revised the manuscript. All the authors approved the final manuscript.

Data availability

None.

References

Kuusikko

Haapsamo

Jansson-Verkasalo

, et al. Emotion recognition in children and adolescents with autism spectrum disorders. J Autism Dev Disord 2009, 39(6): 938–945.

Shanechi

. Brain-machine interfaces from motor to mood. Nat Neurosci 2019, 22(10): 1554–1564.

Gao

Wang

Chen

, et al. Interface, interaction, and intelligence in generalized brain-computer interfaces. Trends Cogn Sci 2021, 25(8): 671–684.

Ramakrishnan

El Emary

IMM

. Speech emotion recognition approaches in human computer interaction. Telecommun Syst 2013, 52(3): 1467–1478.

Zhang

Zhao

Tian

. Spontaneous speech emotion recognition using multiscale deep convolutional LSTM. IEEE Trans Affect Comput 2022, 13(2): 680–688.

Valstar

Pantic

Patras

. Motion history for facial action detection in video. In 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), The Hague, Netherlands, 2004, pp 635–640.

Zeng

Pantic

Roisman

, et al. A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 2009, 31(1): 39–58.

Toisoul

Kossaifi

Bulat

, et al. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nat Mach Intell 2021, 3(1): 42–50.

Soleymani

Lichtenauer

Pun

, et al. A multimodal database for affect recognition and implicit tagging. IEEE Trans Affect Comput 2012, 3(1): 42–55.

10.

Koelstra

Muhl

Soleymani

, et al. DEAP: a database for emotion analysis using physiological signals. IEEE Trans Affect Comput 2012, 3(1): 18–31.

11.

Alarcão

Fonseca

. Emotions recognition using EEG signals: a survey. IEEE Trans Affect Comput 2019, 10(3): 374–393.

12.

LeDoux

. The amygdala. Curr Biol 2007, 17(20): R868–R874.

13.

Gallagher

Chiba

. The amygdala and emotion. Curr Opin Neurobiol 1996, 6(2): 221–227.

14.

Adolphs

. The unsolved problems of neuroscience. Trends Cogn Sci 2015, 19(4): 173–175.

15.

Dixon

Thiruchselvam

Todd

, et al. Emotion and the prefrontal cortex: an integrative review. Psychol Bull 2017, 143(10): 1033–1081.

16.

Salzman

Fusi

. Emotion, cognition, and mental state representation in amygdala and prefrontal cortex. Annu Rev Neurosci 2010, 33: 173–202.

17.

Chen

Wang

, et al. Ten challenges for EEG-based affective computing. Brain Sci Adv 2019, 5(1): 1–20.

18.

Luck

. An introduction to the event-related potential technique. Cambridge: MIT Press, 2014.

19.

Niedermeyer

da Silva

. Electroencephalography: basic principles, clinical applications, and related fields. Philadelphia, USA: Lippincott Williams & Wilkins, 2005.

20.

Gonzalez

George

Muzaffar

, et al. Hardware acceleration of EEG-based emotion classification systems: a comprehensive survey. IEEE Trans Biomed Circuits Syst 2021, 15(3): 412–442.

21.

Zhang

Tiwari

, et al. EEG based emotion recognition: A tutorial and review. ACM Computing Surveys 2022, 55(4): 1–57.

22.

Huang

, et al. Video-triggered EEG-emotion public databases and current methods: a survey. Brain Science Advances 2020, 6(3): 255–287.

23.

Hjorth

. EEG analysis based on time domain properties. Electroencephalogr Clin Neurophysiol 1970, 29(3): 306–310.

24.

Petrantonakis

Hadjileontiadis

. Emotion recognition from EEG using higher order crossings. IEEE Trans Inf Technol Biomed 2010, 14(2): 186–197.

25.

Kiymik

Güler

Dizibüyük

, et al. Comparison of STFT and wavelet transform methods in determining epileptic seizure activity in EEG signals for real-time application. Comput Biol Med 2005, 35(7): 603–616.

26.

Delorme

Makeig

. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods 2004, 134(1): 9–21.

27.

Akin

. Comparison of wavelet transform and FFT methods in the analysis of EEG signals. J Med Syst 2002, 26(3): 241–247.

28.

Zheng

. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Trans Auton Ment Dev 2015, 7(3): 162–175.

29.

Schwartz

Davidson

Maer

. Right hemisphere lateralization for emotion in the human brain: interactions with cognition. Science 1975, 190(4211): 286–288.

30.

Güntürkün

Ströckens

Ocklenburg

. Brain lateralization: a comparative perspective. Physiol Rev 2020, 100(3): 1019–1063.

31.

Peng

Luo

, et al. Facial expression recognition: a meta-analytic review of theoretical models and neuroimaging evidence. Neurosci Biobehav Rev 2021, 127: 820–836.

32.

Wang

Zhang

. Similar brains blend emotion in similar ways: neural representations of individual difference in emotion profiles. NeuroImage 2022, 247: 118819.

33.

Ismail

Karwowski

. A graph theory-based modeling of functional brain connectivity based on EEG: a systematic review in the context of neuroergonomics. IEEE Access 2020, 8: 155103–155135.

34.

Liu

, et al. EEG based emotion recognition by combining functional connectivity network and local activations. IEEE Trans Biomed Eng 2019, 66(10): 2869–2881.

35.

Zhang

, et al. Effective emotion recognition by learning discriminative graph topologies in EEG brain networks. IEEE Trans Neural Netw Learn Syst 2023, PP: 10.1109/TNNLS.2023.3238519.

36.

Liu

Ding

, et al. Multi-channel EEG-based emotion recognition via a multi-level features guided capsule network. Comput Biol Med 2020, 123: 103927.

37.

Krizhevsky

Sutskever

Hinton

. ImageNet classification with deep convolutional neural networks. Commun ACM 2017, 60(6): 84–90.

38.

Sutskever

Vinyals

. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Montreal , Canada, 2014, pp 3104–3112.

39.

Hochreiter

Schmidhuber

. Long short-term memory. Neural Comput 1997, 9(8): 1735–1780.

40.

Ackley

Hinton

Sejnowski

. A learning algorithm for Boltzmann machines. Cogn Sci 1985, 9(1): 147–169.

41.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems(NIPS 2017), Long Beach, USA, 2017, pp 30.

42.

Devlin

Chang

Lee

, et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv, 2018: 1810.04805. https://arxiv.org/abs/1810.04805

43.

Brown

Mann

Ryder

, et al. Language models are few-shot learners. In NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp 1877–1901.

44.

Wang

, et al. Transformers for EEG-based emotion recognition: a hierarchical spatial information learning model. IEEE Sens J 2022, 22(5): 4359–4368.

45.

Aadam

Tubaishat A

Al-Obeidat

, et al. EmoPercept: EEG-based emotion classification through perceiver. Soft Comput 2022, 26(20): 10563–10570.

46.

Arjun

Rajpoot

Raveendranatha Panicker

. Introducing attention mechanism for EEG signals: emotion recognition with vision transformers. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) . Mexico, 2021, pp 5723–5726.

47.

Ren

Zhang

, et al. A novel ensemble learning method using multiple objective particle swarm optimization for subject-independent EEG-based emotion recognition. Comput Biol Med 2021, 140: 105080.

48.

Awan

Usman

Khalid

, et al. An ensemble learning method for emotion charting using multimodal physiological signals. Sensors 2022, 22(23): 9480.

49.

Zhuang

, Wang F, et al. fNIRS evidence for recognizably different positive emotions. Front Hum Neurosci 2019, 13: 120.

50.

Dosovitskiy

Beyer

Kolesnikov

, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv, 2020: 2010.11929. https://arxiv.org/abs/2010.11929

51.

Liu

Lin

Cao

, et al. Swin transformer: hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) , Montreal, QC, Canada, 2022, pp 9992–10002.

52.

Lawhern

Solon

Waytowich

, et al. EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. J Neural Eng 2018, 15(5): 056013.

53.

Dong

Cao

, et al. A survey on ensemble learning. Front Comput Sci 2020, 14(2): 241–258.

54.

Zhang

Zhao

Guan

. Multimodal continuous emotion recognition: a technical report for ABAW5. arXiv, 2023: 2303.10335. https://arxiv.org/abs/2303.10335

55.

Lai

Valdes-Sosa

, et al.

How do reference montage and electrodes setup affect the measured scalp EEG potentials?

J Neural Eng 2018, 15(2): 026013.

56.

Yao

Bringas-Vega

, et al. The statistics of EEG unipolar references: derivations and properties. Brain Topogr 2019, 32(4): 696–703.

57.

Yao

Valdes-Sosa

. Unified Bayesian estimator of EEG reference at infinity: rREST (regularized reference electrode standardization technique). Front Neurosci 2018, 12: 297.

58.

Yao

Qin

, et al.

Which reference should we use for EEG and ERP practice?

Brain Topogr 2019, 32(4): 530–549.

59.

Liang

, et al. Multi-source joint domain adaptation for cross-subject and cross-session emotion recognition from electroencephalography. Front Hum Neurosci 2022, 16: 921346.

60.

Tang

Chen

. Comparison of cross-subject EEG emotion recognition algorithms in the BCI Controlled Robot Contest in World Robot Contest 2021. Brain Sci Adv 2022, 8(2): 142–152.

Transformer-based ensemble deep learning model for EEG-based emotion recognition

Abstract

Keywords

1 Introduction

2 Method

2.1 Data

2.2 Model

2.3 Preprocessing

2.4 Postprocessing

2.4 Experimental settings

3 Results

4 Discussion

5 Conclusion

Footnotes

Ethical approval

Consent

Conflict of interests

Funding

Acknowledgements

Authors’ contribution

Data availability

References