Abstract
In real-world situations, speech signals reaching our ears are usually degraded by background noise. These distortions are detrimental to speech quality and intelligibility and also pose a serious problem for many speech-related applications, such as automatic speech recognition and speaker identification. To deal with background noise distortions, we propose in this paper a strategy to enhance degraded speech, where enhancement is conducted using supervised deep neural network models. The models are trained to learn a mapping from the features of noisy speech to an estimate of the ideal ratio mask (IRM). The estimated IRM is then applied to the noisy speech to obtain an enhanced version of the degraded signal. The mean square error (MSE) is used as the objective cost function. Additionally, global variance equalization is performed as a post-processing step to equalize the variances of the features. Systematic evaluations and comparisons show that the proposed supervised method substantially improves objective metrics of speech quality and intelligibility and significantly outperforms competing and baseline speech enhancement methods. Finally, the proposed method is examined in a speaker identification task in noisy conditions, where it yields the highest speaker identification rates compared to the competing and baseline speech enhancement methods.
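The masking pipeline summarized above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the spectrogram shapes, the `beta` exponent, and the synthetic data are assumptions, and a trained DNN would produce the mask estimate from noisy features rather than the oracle computation shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for STFT magnitude spectrograms (frames x frequency bins);
# real systems would compute these from clean speech and noise recordings.
S = np.abs(rng.normal(size=(100, 257)))  # clean-speech magnitudes
N = np.abs(rng.normal(size=(100, 257)))  # noise magnitudes
Y = S + N                                # noisy magnitudes (additive approximation)

# Ideal ratio mask per time-frequency bin, bounded in [0, 1].
# beta = 0.5 is a commonly used exponent; epsilon avoids division by zero.
beta = 0.5
irm = (S**2 / (S**2 + N**2 + 1e-12)) ** beta

# Enhancement: apply the mask to the noisy spectrogram
# (a trained network would supply an estimated mask instead).
enhanced = irm * Y

# MSE training objective between an estimated mask and the IRM target;
# est_irm here is a perturbed stand-in for a network output.
est_irm = np.clip(irm + 0.05 * rng.normal(size=irm.shape), 0.0, 1.0)
mse = np.mean((est_irm - irm) ** 2)
```

Because the IRM is bounded in [0, 1], the masked spectrogram attenuates each time-frequency bin toward the clean-speech energy, which is what makes the MSE between estimated and ideal masks a natural training target.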
