Study on the enhancement effect of the attention mechanism for noisy English speech

Abstract

Because English is used so widely in daily life, improving the quality of noisy English speech is an important research problem. This paper proposes an improved U-Net model for enhancing noisy English speech that incorporates an attention mechanism and an optimized loss function. The enhancement effect of the method was evaluated on the VoiceBank-DEMAND dataset. The optimized loss function yielded better enhancement results for noisy English speech. The method achieved a perceptual evaluation of speech quality (PESQ) score of 3.18, a short-time objective intelligibility (STOI) score of 0.95, and scores of 4.41, 3.65, and 3.83 on the composite measures of signal distortion (CSIG), background noise intrusiveness (CBAK), and overall quality (COVL), respectively. These results surpassed those of other improved U-Net models and of existing models. The findings support the efficacy of the proposed approach for enhancing noisy English speech and indicate its practical applicability.

Keywords

Attention mechanism; noisy English speech; U-Net; loss function; applied linguistics

