Deep residual networks for pre-classification based Indian language identification

Abstract

This paper proposes a pre-classification-based language identification (LID) system for Indian languages. The system first pre-classifies languages into tonal and non-tonal categories and then identifies the individual language from among the languages of the respective category. The language-discriminating ability of various acoustic features, such as pitch chroma, mel-frequency cepstral coefficients (MFCCs) and their combination, is investigated. System performance is analyzed for features extracted using different analysis units, such as syllables and utterances. The effectiveness of the deep residual networks (ResNets) model in identifying Indian languages is studied, and its performance is compared with that of other deep neural network architectures, such as a convolutional neural network (CNN) model and a cascade CNN-long short-term memory (LSTM) model, and of a shallow architecture, an artificial neural network (ANN). Experiments are carried out on the NIT Silchar language database (NITS-LD) and the OGI-Multilingual database (OGI-MLTS). The experimental analysis suggests that the proposed ResNets model, based on syllable-level features, outperforms the other models. The pre-classification module provides accuracies of 96.6%, 93.2% and 90.6% for NITS-LD, and 92.1%, 89.3% and 85.4% for OGI-MLTS, with 30 s, 10 s and 3 s test data respectively. The pre-classification module improves system performance by 3.8%, 4.1% and 4.3% for 30 s, 10 s and 3 s test data respectively. For the OGI-MLTS database, the respective improvements are 6.8%, 6.5% and 5.4%.
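The two-stage decision flow described in the abstract can be illustrated with a minimal Python sketch. It shows only the routing logic: a pre-classifier picks the tonal or non-tonal category, and a category-specific classifier then picks the language. Everything named here is a hypothetical stand-in, including `make_two_stage_lid`, the toy threshold rules, and the placeholder language labels. The paper's actual models are ResNets trained on chroma and MFCC features, which this sketch does not reproduce.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], str]


def make_two_stage_lid(
    pre_classifier: Classifier,
    language_classifiers: Dict[str, Classifier],
) -> Callable[[Features], Tuple[str, str]]:
    """Build a two-stage identifier: category first, then language."""

    def identify(features: Features) -> Tuple[str, str]:
        # Stage 1: decide the broad category (e.g. "tonal" or "non-tonal").
        category = pre_classifier(features)
        # Stage 2: identify the language only among that category's languages.
        language = language_classifiers[category](features)
        return category, language

    return identify


# Toy stand-ins (assumptions for illustration, not the paper's models).
def toy_pre_classifier(features: Features) -> str:
    # Hypothetical rule: a large first value is treated as a "tonal" cue.
    return "tonal" if features[0] > 0.5 else "non-tonal"


def toy_tonal_classifier(features: Features) -> str:
    return "tonal_lang_A" if features[1] > 0.0 else "tonal_lang_B"


def toy_non_tonal_classifier(features: Features) -> str:
    return "non_tonal_lang_A" if features[1] > 0.0 else "non_tonal_lang_B"


identify = make_two_stage_lid(
    toy_pre_classifier,
    {"tonal": toy_tonal_classifier, "non-tonal": toy_non_tonal_classifier},
)

print(identify([0.9, -1.0]))
print(identify([0.1, 2.0]))
```

One consequence of this design is that a stage-1 error cannot be recovered in stage 2, which is why the pre-classification accuracies are reported separately.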

Keywords

Language identification, tonal and non-tonal languages, ResNets, chroma and MFCC, NITS-LD
