Short video emotion transmission based on 3D convolutional neural network and multi-head attention mechanism

Abstract

In short videos, emotion is generally conveyed through the characters' behaviour and the text content of the video. Mainstream recognition models are based on convolutional neural networks, but these models have limited capacity for video data, insufficient recognition accuracy, and difficulty with complex text. To address these problems, a new text-recognition emotion classification model is designed by improving the text feature fusion module of the convolutional neural network model. In addition, a new behaviour-recognition emotion classification model is designed by introducing a multi-head attention mechanism into a 3D convolutional neural network model. The results show that the improved text recognition model reached an accuracy of around 75%, whereas the original convolutional model averaged only about 67%. The 3D convolutional model with the multi-head attention mechanism achieved the highest recognition accuracy, 94.5% on the UCF101 database, which was 12 to 39 percentage points higher than the same model under other attention mechanisms. The improved convolutional network models for short video text classification and behaviour recognition therefore offer advantages over traditional models, and the results can serve as a technical reference for the design of classification models.

Keywords

three-dimensional convolutional neural network; multi-head self-attention; emotional communication; behavior recognition; text classification
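
As a rough illustration of the multi-head attention idea named in the abstract, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the 3D convolutional backbone has already produced a sequence of spatio-temporal feature vectors (here called `tokens`); the function names, dimensions, and random weights are all hypothetical and chosen only to keep the example self-contained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(tokens, wq, wk, wv, wo, num_heads):
    """Scaled dot-product self-attention split across several heads.

    tokens: one feature vector per spatio-temporal position (assumed to
            come from a 3D convolutional backbone, which is not shown).
    wq, wk, wv, wo: square projection matrices of size d_model x d_model.
    """
    d_model = len(tokens[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)

    heads = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [r[lo:hi] for r in q]
        kh = [r[lo:hi] for r in k]
        vh = [r[lo:hi] for r in v]
        # Attention weights: softmax(Q K^T / sqrt(d_head)), one row per query.
        scores = [
            [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_head) for kj in kh]
            for qi in qh
        ]
        weights = [softmax(r) for r in scores]
        heads.append(matmul(weights, vh))

    # Concatenate the heads per token, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(tokens))]
    return matmul(concat, wo)


rng = random.Random(0)
d_model, num_heads, num_tokens = 8, 2, 4
tokens = random_matrix(num_tokens, d_model, rng)
weights = [random_matrix(d_model, d_model, rng) for _ in range(4)]
out = multi_head_self_attention(tokens, *weights, num_heads=num_heads)
print(len(out), len(out[0]))  # 4 8
```

Each head attends over the same tokens using a different slice of the projected features, so different heads can weight different frames or regions. In a real model the projections would be learned and the attended features would feed a classifier.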
