SVM-based Voice Activity Detection by fusing a new acoustic feature, PLMS, with existing acoustic features of speech

Abstract

Voice activity detection (VAD) identifies the presence or absence of human speech in each frame of a given speech signal. Speech is easy to detect in a clean signal, but detection accuracy falls as the signal-to-noise ratio (SNR) decreases. Robust VAD improves the efficiency of automated speech-based applications such as speech enhancement, speaker identification, and hearing aid devices. In this paper, a new speech feature, the "Peak of Log Magnitude Spectrum" (PLMS), is introduced and used for VAD. PLMS is combined with three existing acoustic features (MFCC, RASTA-PLP, and formant frequency) to train an SVM classifier for VAD. Experiments show that the PLMS coefficients play the most prominent role among the fused features. Experiments also show that the trained SVM classifier achieves higher VAD accuracy than the state-of-the-art methods it was compared against (Sohn VAD and the G.729 VAD).
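The abstract names PLMS but does not define it, so the following is only a minimal sketch under one plausible reading: for each frame, take the maximum of the log magnitude spectrum. The function names (`log_magnitude_spectrum`, `plms`), the natural logarithm, the `eps` floor, and the absence of a window function are all assumptions made for illustration and are not taken from the paper. The SVM training step and the MFCC, RASTA-PLP, and formant features are not shown.

```python
import cmath
import math


def log_magnitude_spectrum(frame, eps=1e-12):
    """Natural-log magnitude of the one-sided DFT of a real-valued frame.

    A naive O(n^2) DFT keeps the sketch dependency-free; an FFT routine
    would normally be used instead. `eps` (an assumption) avoids log(0).
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(math.log(abs(coeff) + eps))
    return spectrum


def plms(frame):
    """Hypothetical PLMS value: the peak of the frame's log magnitude spectrum."""
    return max(log_magnitude_spectrum(frame))


# Two synthetic 64-sample frames: a full-scale tone and a much quieter one.
N = 64
tone = [math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
quiet = [0.01 * x for x in tone]

tone_peak = plms(tone)
quiet_peak = plms(quiet)
print(tone_peak > quiet_peak)
```

Under this reading, a frame with strong tonal or voiced energy yields a larger peak than a low-energy frame, which is the kind of per-frame value that could be stacked with other features as classifier input.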

Keywords

VAD, PLMS, SVM, MFCC, RASTA-PLP, Formant Frequency

