Abstract
Audio sentiment analysis is pivotal for discerning nuances in spoken communication, with applications in fields such as customer support and healthcare. This paper investigates recent methodologies and technological advancements in audio sentiment analysis, assessing the primary techniques used to recognize and interpret emotions within audio signals. By drawing on diverse multilingual datasets, the study demonstrates versatility and applicability across languages. The proposed emotion classification model, which combines an LSTM and a CNN with logistic regression, achieves an accuracy of 93.33%. By leveraging LSTM networks, VGGish features processed through a convolutional neural network, and logistic regression as a stacking meta-learner, the model offers a rich framework for analyzing emotional content in audio recordings. Future work will focus on applying the model in practice, for example to enhance user experience in virtual assistants, improve mental health monitoring systems, and integrate emotion recognition into everyday communication tools. Key challenges, including data diversity and model robustness, are discussed, along with emerging trends and directions for future research. The study thus provides a comprehensive view of the current field and identifies promising avenues for further development in audio sentiment analysis.
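To make the described architecture concrete, the following is a minimal sketch of one plausible realization of the stacked ensemble: an LSTM branch over frame-level features, a CNN branch over VGGish embeddings, and a logistic regression meta-learner fit on the concatenated branch probabilities. The input shapes, layer sizes, and the six-class emotion label space are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of an (LSTM + CNN) -> logistic regression stack.
# All shapes and hyperparameters below are assumptions for illustration.
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

NUM_CLASSES = 6  # assumed number of emotion categories

def build_lstm_branch(time_steps=100, n_mfcc=40):
    # LSTM over frame-level acoustic features (e.g., MFCC sequences).
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(time_steps, n_mfcc)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_cnn_branch(n_patches=10, vggish_dim=128):
    # CNN over a sequence of 128-dimensional VGGish embeddings,
    # treated as a single-channel 2-D map.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_patches, vggish_dim, 1)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def fit_stacking_classifier(lstm_model, cnn_model, X_seq, X_vggish, y):
    # Concatenate each branch's predicted class probabilities and
    # train the logistic regression meta-learner on them.
    features = np.hstack([lstm_model.predict(X_seq),
                          cnn_model.predict(X_vggish)])
    meta = LogisticRegression(max_iter=1000)
    meta.fit(features, y)
    return meta
```

In a stack of this kind, both branches would first be trained (or fine-tuned) on the emotion labels, and the meta-learner would then be fit on held-out branch predictions to avoid leaking training data into the stacking stage.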
