Speech enhancement based on perceptually motivated guided spectrogram filtering

Abstract

This paper proposes a single-channel speech enhancement algorithm that treats the speech spectrogram as an image and applies guided spectrogram filtering based on the masking properties of the human auditory system. Guided filtering can sharpen details and estimate unwanted textures or background noise in the noisy speech spectrogram. Treating the noisy spectrogram as a degraded image, the spectrogram of the clean speech signal is estimated by guided filtering after the noise components have been subtracted. Using the masking properties of the human auditory system, the proposed algorithm adaptively adjusts and reduces the residual noise in the enhanced speech spectrogram according to the corresponding masking threshold. Because the filter output is a local linear transform of the guidance spectrogram, the sliding local window can be implemented efficiently with a box filter at O(N) computational complexity. Experimental results show that the proposed algorithm effectively suppresses noise in different noisy environments and thereby greatly improves speech quality and speech intelligibility.
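To illustrate the box-filter implementation mentioned above, the following is a minimal Python sketch of the standard guided filter from the image-filtering literature, applied to a 2-D magnitude spectrogram. It is not the authors' implementation: the function names `box_filter` and `guided_filter`, the window radius `r`, and the regularization constant `eps` are assumptions chosen for illustration, and the noise-subtraction and masking-threshold stages described in the abstract are omitted.

```python
import numpy as np


def box_filter(x, r):
    """Mean of x over a (2r+1) x (2r+1) window, truncated at the borders.

    Uses an integral image, so the cost is O(N) in the number of
    time-frequency bins and does not depend on the radius r.
    """
    h, w = x.shape
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((h + 1, w + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(x, axis=0), axis=1)
    # Window limits for every row and column, clipped to the array.
    top = np.clip(np.arange(h) - r, 0, h)
    bot = np.clip(np.arange(h) + r + 1, 0, h)
    left = np.clip(np.arange(w) - r, 0, w)
    right = np.clip(np.arange(w) + r + 1, 0, w)
    window_sum = (ii[bot][:, right] - ii[top][:, right]
                  - ii[bot][:, left] + ii[top][:, left])
    area = (bot - top)[:, None] * (right - left)[None, :]
    return window_sum / area


def guided_filter(guide, src, r=4, eps=1e-2):
    """Standard guided filter: the output is a local linear transform of `guide`.

    guide : guidance spectrogram (2-D array), hypothetical input name
    src   : spectrogram to be filtered (same shape as guide)
    r     : window radius in bins (assumed value)
    eps   : regularization controlling the amount of smoothing (assumed value)
    """
    guide = np.asarray(guide, dtype=np.float64)
    src = np.asarray(src, dtype=np.float64)
    mean_g = box_filter(guide, r)
    mean_s = box_filter(src, r)
    var_g = box_filter(guide * guide, r) - mean_g * mean_g
    cov_gs = box_filter(guide * src, r) - mean_g * mean_s
    # Per-window linear coefficients: src is approximated by a * guide + b.
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    # Average the coefficients of all windows that cover each bin.
    return box_filter(a, r) * guide + box_filter(b, r)
```

In this sketch, a larger `eps` smooths more strongly, while a very small `eps` with `guide` equal to `src` leaves the spectrogram almost unchanged. How the guidance spectrogram is chosen and how the masking threshold enters the filter are specific to the paper and are not reproduced here.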

Keywords

Auditory masking properties; guided filtering; guided spectrogram filtering; spectrogram; speech enhancement
