Foreign accent classification using deep neural nets

Abstract

Speech analysis for extracting attributes such as speaker identity, gender, and accent has long been a field of great interest and has been widely studied. This paper presents a novel architecture for accent identification built as a cascade of two deep-learning architectures: a Convolutional Neural Network (CNN) and a Convolutional Recurrent Neural Network (CRNN). The model is trained on Mel-spectrograms of the audio recordings, and we design and test it on the Common Voice dataset. We consider five of the most popular English accent groups in this study: India, Australia, US, England, and Canada. The proposed model achieves an accuracy of 78.48% using the CNN and 83.21% using the CRNN.
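As a rough illustration of the Mel-spectrogram input mentioned above, the following minimal sketch builds a triangular Mel filterbank and projects one frame of an FFT power spectrum onto it. It is not the authors' preprocessing pipeline: the HTK-style Mel formula, the function names, and the parameters used in the usage note (40 bands, 512-point FFT, 16 kHz sampling rate) are all assumptions chosen for the example.

```python
import math

def hz_to_mel(hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula (an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> list:
    """Return n_mels triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    # Centre frequency of every FFT bin, in Hz.
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)    # ramps 0 -> 1 between lo and mid
            falling = (hi - f) / (hi - mid)   # ramps 1 -> 0 between mid and hi
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def mel_spectrum(power_frame: list, bank: list) -> list:
    """Project one frame of an FFT power spectrum onto the mel bands, in dB."""
    out = []
    for row in bank:
        energy = sum(w * p for w, p in zip(row, power_frame))
        # Floor the energy so that empty bands do not produce log(0).
        out.append(10.0 * math.log10(max(energy, 1e-10)))
    return out
```

Calling `mel_filterbank(40, 512, 16000)` yields 40 filters over 257 FFT bins. Applying `mel_spectrum` to each successive frame of a recording and stacking the results gives the time-by-frequency image that a CNN or CRNN can consume.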

Keywords

Mel-spectrogram, deep neural networks, foreign accent classification, recurrent neural network
