Emotion recognition from speech using deep learning on spectrograms

Abstract

Speech emotion corpora commonly suffer from inconsistent sample lengths and imbalanced emotion categories. To address these problems, this paper proposes a variable-length-input CRNN deep learning model trained with focal loss for recognizing anger, happiness, neutrality and sadness in the IEMOCAP emotional corpus. First, a variable-length strategy feeds the spectrograms of the padded speech samples into a CNN. Second, a masking matrix applied together with the convolution layers preserves and outputs only the valid part of each input sequence. Third, this valid output is passed to a BiGRU network for learning. Finally, focal loss is used during training to control and adjust the contribution of different samples to the total loss. Simulations show that, compared with traditional speech emotion recognition models, the proposed method improves the accuracy and overall performance of emotion recognition.
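The two ideas the abstract relies on, padding variable-length inputs with a validity mask and weighting samples with focal loss, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (pad_with_mask, focal_loss), the zero padding value, and the default gamma = 2.0 are assumptions chosen for illustration, and the loss follows the commonly cited form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t).

```python
import math


def pad_with_mask(sequences, pad_value=0.0):
    """Pad variable-length sequences to the longest one in the batch.

    Returns the padded batch and a 0/1 mask marking valid (unpadded) steps.
    Hypothetical helper: the paper's exact masking scheme is not given here.
    """
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma=2.0 and the optional per-class weights `alpha` are assumed values,
    not settings reported in the abstract.
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# Two "utterances" of different length (one feature value per frame).
padded, mask = pad_with_mask([[1.0, 2.0], [3.0]])

# Four classes, e.g. anger, happiness, neutrality, sadness; true class is 0.
easy = focal_loss([4.0, 0.0, 0.0, 0.0], target=0)  # confidently correct
hard = focal_loss([0.0, 4.0, 0.0, 0.0], target=0)  # confidently wrong
```

With gamma = 0 and no alpha, the function reduces to ordinary cross-entropy. Larger gamma shrinks the loss of well-classified samples, so hard or minority-class samples contribute relatively more to the total loss.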

Keywords

Speech emotion recognition, spectrograms, CRNN, focal loss

