
Abstract

Background noise often distorts speech signals captured in real-world environments. This degradation affects applications such as speech recognition and hearing aids. The aim of speech enhancement (SE) is to suppress unwanted background noise in the captured speech signal. Existing SE approaches face challenges such as a low source-to-distortion ratio and high memory requirements. In this manuscript, speech enhancement for hearing aids based on a Recalling-Enhanced Recurrent Neural Network (R-ERNN) optimized with the Chimp Optimization Algorithm is proposed (R-ERNN-COA-SE-HA). Initially, clean and noisy speech are collected from the MS-SNSD dataset. The input speech signals are encoded using vocoder analysis, and Sample RNN then decodes the bitstream into samples. In the training phase, features of the input speech signals are extracted using ternary pattern and discrete wavelet transforms (TP-DWT). In the enhancement stage, R-ERNN predicts the clean speech spectra from the noisy speech spectra and then reconstructs a clean speech waveform. The Chimp Optimization Algorithm (COA) is used to optimize the R-ERNN. The proposed method is implemented in MATLAB, and its efficiency is evaluated using several metrics. The R-ERNN-COA-SE-HA method provides 23.74%, 24.81%, and 19.33% higher PESQ than the existing methods RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA, respectively.

Keywords

Speech enhancement, hearing aids, MS-SNSD dataset, ternary pattern and discrete wavelet transforms, Recalling-Enhanced Recurrent Neural Network, chimp optimization algorithm

1. Introduction

Hearing aids are small electronic devices designed to improve hearing for people with hearing impairment, using sophisticated audio signal processing methods [1, 2, 3]. SE approaches for hearing aids separate environmental noise and amplify speech while taking into account the listener's hearing characteristics and the acoustic environment [4, 5, 6, 7, 8].

An SE algorithm was suggested to improve speech quality in the hearing aid environment by applying noise reduction with deep neural network (DNN) learning based on noise classification [9, 10]. To assess speech enhancement in real hearing aid environments, ten types of noise were considered using convolutional neural networks (CNN) [11, 12, 13]. Noise reduction with speech enhancement was carried out by a DNN built on the noise classification [14, 15]. Speech enhanced using the DNN together with the associated environmental noise shows an improvement over conventional hearing aid approaches [16, 17, 18]. Speech quality is measured with the perceptual evaluation of speech quality score, the short-time objective intelligibility score, the log-likelihood ratio score, and the overall quality composite measure [19, 20].

Speech enhancement (SE) is an important and difficult task in many applications, and several SE approaches have been developed so far. The existing approaches handle background noise well, but they are limited under non-stationary noise and are complicated. They are not appropriate for real-time applications because of their poor source-to-distortion ratio, lengthy training time, and memory requirements. They also do not use the information contained in the phase spectrum. To overcome these issues, SE for hearing aids based on R-ERNN optimized with the Chimp Optimization Algorithm is proposed. The novelty of the proposed approach is that R-ERNN reduces training time and improves noise suppression. The proposed technique is introduced to address the limitations of existing SE methods, including high speech distortion and a low source-to-distortion ratio.

The major contributions of this manuscript are summarized below:

• R-ERNN optimized with the Chimp Optimization Algorithm is proposed for speech enhancement in hearing aids.
• R-ERNN is introduced to reduce the training time of the deep network and to improve speech enhancement by separating the clean speech spectra from the noisy speech spectra.
• COA is employed to improve the handling of non-stationary noise without any pre-training of noise models.
• The performance metrics are examined to assess the robustness of the proposed speech enhancement system.
• This paper attempts to directly incorporate the short-time objective intelligibility (STOI) measure in an R-ERNN speech enhancement approach to optimize for speech intelligibility.

The remainder of this manuscript is organized as follows: Section 2 surveys the literature, Section 3 describes the proposed approach, Section 4 presents the results, and Section 5 gives the conclusion.

2. Related works

Many studies related to speech enhancement for hearing aids have been reported in the literature; a few recent works are reviewed here.

Saleem et al. [21] presented residual gated RNN-augmented Kalman filtering for SE with identification. The clean speech and noise signals were modelled as autoregressive processes whose parameters consist of linear prediction coefficients and driving noise variances. An RNN was trained to estimate the line spectrum frequencies, while an optimization problem was solved to obtain the noise variances by minimizing the difference between the modelled and estimated autoregressive spectra of the noise-contaminated speech. It provides higher SSR and lower PESQ.

Hasannezhad et al. [22] presented a phase-aware composite deep neural network for SE to address the challenges of high computational complexity and memory requirements. More precisely, the network has two sub-tasks for improving the phase and magnitude spectra: phase reconstruction using the phase derivative and magnitude processing using a spectral mask. It provides higher SSR and lower PESQ.

Pandey and Wang [23] presented a self-attending RNN for speech enhancement to improve cross-corpus generalization. To promote cross-corpus generalization, a self-attending recurrent neural network, also called an attentive recurrent network (ARN), was presented for time-domain speech enhancement. ARN consists of recurrent neural networks augmented with feed-forward and self-attention blocks. ARN was assessed in low-SNR scenarios using several corpora containing non-stationary noise. The experimental outcomes show that ARN performs significantly better at time-domain speech enhancement than competing methods. It provides higher PESQ than SSR.

Lei et al. [24] suggested a low-latency hybrid multi-channel SE scheme for hearing aids. The system is made up of three modules: rule-based dereverberation, multi-channel enhancement, and post-processing. The system achieves an average hearing aid speech perception index score of 0.696 and a hearing aid speech quality index score of 0.320 without using head rotation information or enrollment speech. It provides higher SSR than PESQ.

Cantu and Hohmann [25] suggested spectro-temporal post-filtering under Short-Time Target Cancellation (STTC) for directional SE in a dual-microphone hearing aid. STTC processing takes advantage of the computational power of the Short-Time Fourier Transform (STFT) for post-filtering, while the hearing aid technique was employed for adaptive beamforming. STTC processing is an effective and simple STFT-based approach that attains real-time, low-latency ( $\leqslant$ 20 ms) spatial spectro-temporal filtering. It provides higher PESQ and lower SSR.

Wang et al. [26] suggested neural SE with low algorithmic latency and complexity using integrated full- and sub-band modeling. An FSB-LSTM based architecture was used to enhance speech in the short-time Fourier transform (STFT) domain for both single- and multi-channel applications. Through several FSB-LSTM modules, the method maintains an information highway along which an over-complete input representation flows. An FSB-LSTM module has a full-band block that models spectro-temporal patterns across all frequencies and a sub-band block that models patterns within each sub-band. It provides higher SSR and lower CSII.

3. Proposed methodology

Figure 1.

Proposed R-ERNN-COA-SE-HA methodology.

This section describes the proposed R-ERNN optimized with COA for speech enhancement in hearing aids (R-ERNN-COA-SE-HA). Figure 1 displays the proposed methodology, and each of its stages is discussed in detail in the following subsections.

3.1 Data acquisition

Initially, the clean speech and noisy speech data are taken from the Microsoft Scalable Noisy Speech Dataset (MS-SNSD) [27]. The input speech signals are then passed to the encoding process.

3.2 Input signal encoding using vocoder analysis

Here, the input speech signals are first encoded using vocoder analysis, which compresses them into a compact bitstream (encoder), and Sample RNN then decodes the bitstream into samples [28]. The encoder scheme is based on a wideband variant of the linear prediction coding (LPC) vocoder. Per-frame analysis of the input signal produces four parameters: an $N^{\text{th}}$ order LPC filter, an LPC residual RMS level $s$ , a pitch value $f_{0}$ , and a $k$ -band voicing vector. The voicing component $v\left(i\right)$ , where $i=1,2,\ldots,k$ , gives the proportion of periodic energy within a band. The operating bitrate determines the LPC order $N$ in the suggested encoder design. To attain encoding efficiency with proper perceptual consideration, standard combinations of source coding methods are used, such as vector quantization (VQ), predictive coding, and entropy coding. The residual level $s$ is quantized in the $dB$ domain. Small inter-frame variations of the level are detected, signalled with one bit, and coded by a predictive scheme using fine uniform quantization. Otherwise, the coding is memoryless and uses a larger but still uniform step size that spans a wide range of levels.

Pitch is quantized using a combination of predictive and memoryless coding. Uniform quantization is used, but it is carried out in a warped pitch domain. The pitch is warped using Eq. (1) as follows,

$\displaystyle f_{W}={df_{0}}/({d+f_{0}})$ (1)

where $d=$ 500 Hz and $f_{W}$ is quantized and coded using 10 bit/frame. Vocoder analysis thus compresses the input speech signals into a compact bitstream.

Sample RNN is a deep neural generative model used here to decode the bitstream into samples. It is built from several multi-rate recurrent layers that can represent the dynamics of a sequence at various time scales. Sample RNN models the likelihood of a sequence of samples by factorizing the joint distribution into the product of the individual sample distributions, each conditioned on all previous samples. The joint probability distribution of a sequence of encoded samples $Y=\left\{{y_{1},\ldots,y_{T}}\right\}$ is given in Eq. (2),

$\displaystyle P(Y)=\prod\limits_{i=1}^{T}P\left(y_{i}\mid y_{1},\ldots,y_{i-1}\right)$ (2)

where $y_{i}$ is the $i^{\text{th}}$ encoded sample and $T$ represents time. At inference time, the Sample RNN model predicts one sample at a time by randomly sampling from $P\left(y_{i}\mid y_{1},\ldots,y_{i-1}\right)$ . The previously reconstructed samples are then used for recursive conditioning. Without conditioning input, Sample RNN is only able to "babble". Therefore, the decoded vocoder parameters $h_{f}$ are supplied to the model as conditioning information, and Eq. (2) is modified into Eq. (3),

$\displaystyle P\left(Y\mid H\right)=\prod\limits_{i=1}^{T}P\left(y_{i}\mid y_{1},\ldots,y_{i-1},h_{f}\right)$ (3)

where $h_{f}$ denotes the vocoder parameters corresponding to the input speech at time $T$ . The samples decoded from the bitstream are then passed to feature extraction.
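The pitch warping of Eq. (1) and its 10-bit uniform quantization can be illustrated with a short Python sketch. This is not the authors' MATLAB implementation. The function names and the pitch range of 50 to 400 Hz are assumptions made only for illustration; only $d=500$ Hz and the 10 bit/frame budget come from the text.

```python
import numpy as np

D_HZ = 500.0   # warping constant d from Eq. (1)
N_BITS = 10    # bits per frame for the warped pitch, as stated in the text


def warp_pitch(f0_hz, d=D_HZ):
    """Eq. (1): f_W = d * f0 / (d + f0)."""
    f0 = np.asarray(f0_hz, dtype=float)
    return d * f0 / (d + f0)


def unwarp_pitch(f_w, d=D_HZ):
    """Inverse of Eq. (1): f0 = d * f_W / (d - f_W), valid for f_W < d."""
    f_w = np.asarray(f_w, dtype=float)
    return d * f_w / (d - f_w)


def quantize_pitch(f0_hz, f0_min=50.0, f0_max=400.0, n_bits=N_BITS):
    """Uniformly quantize the warped pitch; returns (index, decoded f0 in Hz).

    The pitch range [f0_min, f0_max] is a hypothetical choice for this sketch.
    """
    lo, hi = warp_pitch(f0_min), warp_pitch(f0_max)
    step = (hi - lo) / (2 ** n_bits - 1)          # uniform step in the warped domain
    f_w = np.clip(warp_pitch(f0_hz), lo, hi)
    index = np.round((f_w - lo) / step).astype(int)
    return index, unwarp_pitch(lo + index * step)
```

Because the warping compresses high pitch values, a uniform step in $f_{W}$ corresponds to a finer step in Hz at low pitch than at high pitch.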

3.3 Feature extraction using ternary pattern and discrete wavelet transform

During the training stage, features of the speech signals are extracted using TP-DWT [29]. Here, TP-DWT is used to extract spectral features from the acoustic signals for speech enhancement. The TP-DWT based feature extraction obtains local features from non-overlapping neighbouring blocks of size 3 $\times$ 3. The ternary pattern extracts the upper and lower local features from the signal and is expressed in Eq. (4),

$\displaystyle TP_{\textit{features}}\left(\textit{first},\textit{second}\right)=\left\{\begin{array}{ll}-1,&\textit{first}-\textit{second}<-\textit{thres}\\ 0,&-\textit{thres}\leqslant\textit{first}-\textit{second}\leqslant\textit{thres}\\ 1,&\textit{first}-\textit{second}>\textit{thres}\end{array}\right.$ (4)

where $TP_{\textit{features}}$ represents the ternary function (TF) used for extracting features, $\textit{first}$ and $\textit{second}$ denote the input parameters of the TF, and $\textit{thres}$ is the threshold value. The TF produces $-$ 1, 0, or 1. Here, TP-DWT uses a non-overlapping block in lieu of the 3 $\times$ 3 non-overlapping matrices. Using the centre value and the ternary function, the eight upper and eight lower bits are extracted from a block. From these values, the upper and lower feature signals are obtained. The extracted feature signals are then concatenated to form the feature vector. TP-DWT is mainly used to extract spectral features such as Mel Frequency Differential Power Cepstral Coefficients (MFDPCC), root mean square energy, spectral centroid, and spectral subband centroid, which are described in the remainder of this section.
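The ternary function of Eq. (4) and the derivation of upper and lower bits from one block can be sketched in Python as follows. Several details are assumptions made for illustration, since the text does not fix them: the block is taken to be a one-dimensional run of nine samples, the neighbour is passed as $\textit{first}$ and the centre as $\textit{second}$, and the eight bits are packed into an integer code with binary weights.

```python
import numpy as np


def ternary(first, second, thres):
    """Eq. (4): -1, 0 or 1 depending on first - second relative to thres."""
    diff = np.asarray(first, dtype=float) - second
    return np.where(diff > thres, 1, np.where(diff < -thres, -1, 0))


def block_codes(block, thres):
    """Upper and lower 8-bit codes of one 9-sample block (hypothetical layout).

    The centre sample is compared with its eight neighbours; a bit is set in
    the upper code where the ternary value is 1 and in the lower code where
    it is -1.
    """
    block = np.asarray(block, dtype=float)
    centre_idx = len(block) // 2
    centre = block[centre_idx]
    neighbours = np.delete(block, centre_idx)
    t = ternary(neighbours, centre, thres)
    upper_bits = (t == 1).astype(int)
    lower_bits = (t == -1).astype(int)
    weights = 2 ** np.arange(len(neighbours))   # binary weights 1, 2, 4, ...
    return int(upper_bits @ weights), int(lower_bits @ weights)
```

Applying `block_codes` to each non-overlapping block of a signal gives one upper and one lower code sequence, which corresponds to the upper and lower feature signals that are concatenated into the feature vector.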

MFDPCC is the set of coefficients used as features; it is built from the frequency information of the vocal tract. It provides information on the spectral structure of the speech signal. Twenty coefficients in total were used.

The global energy of the speech signal is measured by its root mean square energy, which is computed using Eq. (5),

$\displaystyle y_{\textit{rms}}=\sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}j_{i}^{2}}$ (5)

where $j_{i}$ represents the signal amplitude of the $i^{\text{th}}$ frame and $n$ represents the number of frames in the sample length. The spectral centroid is the geometric centre of the spectrum. It is determined as the power-weighted average of the signal's frequencies and is computed using Eq. (6) as follows,

$\displaystyle\textit{SC}_{i,b}=\frac{\sum\nolimits_{f=l_{b}}^{u_{b}}f\left|S_{i}\left[f\right]\right|^{2}}{\sum\nolimits_{f=l_{b}}^{u_{b}}\left|S_{i}\left[f\right]\right|^{2}}$ (6)

where $S_{i}\left[n\right]$ represents the $i^{\text{th}}$ speech signal, $S_{i}\left[f\right]$ indicates its spectrum, $b$ denotes the subband, $l_{b}$ indicates the lower frequency edge and $u_{b}$ the upper frequency edge. For the Spectral Subband Centroid (SSC) features, the frequency band is divided into a predetermined number of subbands, and the power spectrum is used to calculate the centroid of each subband of the speech signal. After the spectral features are extracted, the training data are passed to the enhancement stage to estimate the corresponding clean speech spectra.
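The RMS energy of Eq. (5) and the subband centroid of Eq. (6) can be computed as in the following Python sketch. It is an illustration under stated assumptions rather than the authors' code: the spectrum $S_{i}[f]$ is taken to be the FFT of one frame, and the subband edges and sampling rate are supplied by the caller.

```python
import numpy as np


def rms_energy(frame):
    """Eq. (5): square root of the mean squared amplitude."""
    j = np.asarray(frame, dtype=float)
    return np.sqrt(np.mean(j ** 2))


def subband_centroids(frame, sample_rate, band_edges_hz):
    """Eq. (6): power-weighted mean frequency inside each subband.

    band_edges_hz is a list of (l_b, u_b) pairs in Hz, chosen by the caller.
    """
    frame = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroids = []
    for l_b, u_b in band_edges_hz:
        mask = (freqs >= l_b) & (freqs <= u_b)
        total = power[mask].sum()
        # An empty or silent subband has no defined centroid; 0.0 is returned.
        centroids.append((freqs[mask] * power[mask]).sum() / total if total > 0 else 0.0)
    return np.array(centroids)
```

For a pure tone, the centroid of the subband containing the tone lies at the tone frequency, which gives a quick sanity check of the implementation.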

3.4 Enhancement stage

In the enhancement stage, the noisy speech features are processed by the Recalling-Enhanced Recurrent Neural Network (R-ERNN) based method to estimate the clean speech features, and the estimated log power spectra features are used to obtain the clean speech waveform.

3.4.1 Recalling-Enhanced Recurrent Neural Network (R-ERNN)

RNN-based networks can achieve speech enhancement, but their computational load remains high and training takes a long time. To address these issues, the Recalling-Enhanced Recurrent Neural Network (R-ERNN) [30], a condensed version of the recurrent architecture, is used in this study. This part describes the R-ERNN based speech enhancement procedure. First, the spectral features of speech are used as the network input and output. The R-ERNN architecture and its underlying concept are then explained. Finally, the full model structure, including the general learning framework and the R-ERNN training process, is described in detail.

The selected features are given to R-ERNN, which addresses the speech enhancement problem as a neural network task based on the optimal features. The R-ERNN has 7 layers: input, output, state, memory, sum, hidden, and delay. Let $\left\{{A_{g},Z_{g}}\right\}_{g=1}^{G}\subset\Re^{k}\times\Re^{l}$ be the dataset delivered to R-ERNN, where $A_{g}$ is the $g^{\text{th}}$ input sampled speech signal and $Z_{g}$ denotes the corresponding target output signal. The input layer contains $M+N$ linear nodes. $M$ of these nodes receive the feature-extracted samples, as represented in Eq. (7),

$\displaystyle A_{g}=\left({A_{g1},A_{g2},\ldots,A_{gM}}\right)$ (7)

The remaining $N$ nodes receive the previous output of the hidden layer, which is expressed in Eq. (8) as follows,

$\displaystyle H_{g-1}=\left(H_{\left(g-1\right)1},H_{\left(g-1\right)2},\ldots,H_{\left(g-1\right)N}\right)$ (8)

In Eq. (8), $H_{g-1}$ is the hidden-layer output for the previous sample, which is supplied to the input layer together with the current speech signal. The state layer produces a value in the 0/1 range for each element of the memory layer, deciding which stored data are kept and which are dismissed, as in Eq. (9),

$\displaystyle f(\cdot)=\left\{\begin{array}{ll}0,&\text{if }B_{\left(g-1\right)i}\text{ is unimportant}\\ 1,&\text{if }B_{\left(g-1\right)i}\text{ is important}\end{array}\right.$ (9)

where $B_{(g-1)i}$ is the $i^{\text{th}}$ memory value, that is, the sum-layer output retained from the previous speech sample. To realise this decision in the state layer, a $\log\textit{sig}$ function is used. The state layer therefore has $N$ nodes equipped with $\log\textit{sig}$ functions. The output of a state-layer node is given in Eq. (10),

$\displaystyle f_{\textit{gi}}=\frac{1}{1+e^{-{v}^{\prime}_{i}{X}^{\prime}_{g}}}$ (10)

where $g=1,2,\ldots,G$ indexes the samples and $i=1,2,\ldots,N$ indexes the state-layer nodes. Let $V^{*}=\left[\begin{array}{c}v_{1}^{*}\\ \vdots\\ v_{N}^{*}\end{array}\right]_{N\times\left(M+N\right)}$ be the weight matrix connecting the input and state layers, and let $f_{g}=\left(f_{g1},\ldots,f_{gN}\right)$ . The output of the $i^{\text{th}}$ memory node is then given in Eq. (11),

$\displaystyle B_{\left({g-1}\right)i}^{*}=B_{\left({g-1}\right)i}f_{\textit{gi}}$ (11)

where $g=1,2,\ldots,G$ and $i=1,2,\ldots,N$ . In the sum layer, each node receives the current input, the previous hidden output and the memory layer output. The $i^{\text{th}}$ node is given in Eq. (12),

$\displaystyle B_{\textit{gi}}=V_{i}^{X}X_{g}^{t}+V_{i}^{H}H_{g-1}^{t}+B_{\left(g-1\right)i}f_{\textit{gi}}$ (12)

where $V_{i}=\left(v_{i1}^{X},v_{i2}^{X},\ldots,v_{iM}^{X},v_{i\left(M+1\right)}^{H},\ldots,v_{i\left(M+N\right)}^{H}\right)$ denotes the weight vector linking the input layer and the $i^{\text{th}}$ sum node, with $V_{i}^{X}$ and $V_{i}^{H}$ its input and hidden parts. Let $V=\left[\begin{array}{c}V_{1}\\ \vdots\\ V_{N}\end{array}\right]_{N\times\left(M+N\right)}$ be the weight matrix joining the input layer and the sum layer.

The hidden layer contains $N$ nodes. Unlike the conventional network, in R-ERNN the output of the hidden layer is given in Eq. (13),

$\displaystyle H_{\textit{gi}}=\tanh\left(B_{\textit{gi}}\right)=\tanh\left(\hat{B}_{\textit{gi}}+B_{\left(g-1\right)i}f_{\textit{gi}}\right)$ (13)

In Eq. (13), $\hat{B}_{\textit{gi}}=V_{i}^{X}X_{g}^{t}+V_{i}^{H}H_{g-1}^{t}$ collects the first two terms of Eq. (12), and $H_{\textit{gi}}$ is the $i^{\text{th}}$ output of the hidden layer, with $H_{g}=\left(H_{g1},H_{g2},\ldots,H_{\textit{gN}}\right)$ . The sum layer output vector $B_{g}=\left(B_{g1},B_{g2},\ldots,B_{\textit{gN}}\right)$ is fed back to the memory layer.

Finally, the output layer contains $y$ nodes, and every node represents a particular component of the speech enhancement output. Let $V_{s}=\left(V_{1s},V_{2s},\ldots,V_{\textit{Ns}}\right)^{t}\in\Re^{N}$ denote the weight vector joining the hidden layer and the $s^{\text{th}}$ output node; together, these vectors give the whole weight matrix between the hidden and output layers. The $s^{\text{th}}$ $\left(s=1,2,\ldots,y\right)$ actual speech enhancement output of R-ERNN is then given in Eq. (14),

$\displaystyle Z_{\textit{gs}}=F\left({H_{g}V_{s}}\right)=H_{g}V_{s}$ (14)

where $F$ is the output activation function, which is linear in Eq. (14), $V_{s}$ is a weight vector, $g=1,2,\ldots,G$ and $s=1,2,\ldots,y$ . In this way, the proposed R-ERNN predicts the clean speech spectra from the noisy speech spectra and then reconstructs a clean speech waveform. The recovered clean speech waveform is saved in the log file. COA is then applied to further improve the speech enhancement performance. Of the parameters $Z_{\textit{gs}}$ and $H_{g}V$ , $Z_{\textit{gs}}$ is used to increase the speech quality and $H_{g}V$ is used to clean the waveform. The results obtained indicate that the proposed R-ERNN model effectively predicts the clean speech spectra corresponding to the noisy speech spectra. To obtain more accurate predictions, the weight parameters $Z_{\textit{gs}}$ and $H_{g}V$ of the R-ERNN method are optimized using COA.
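The recurrence defined by Eqs. (10) to (14) can be summarized in a minimal NumPy sketch of the forward pass. It is an illustration, not the authors' MATLAB implementation. The following points are assumptions: the input $X_{g}$ of Eq. (12) is taken to be the feature vector $A_{g}$ of Eq. (7), the state layer sees the concatenation of $A_{g}$ and $H_{g-1}$, the hidden and memory states start at zero, and the names `V_state`, `V_sum` and `V_out` are hypothetical labels for $V^{*}$, $V$ and the stacked $V_{s}$.

```python
import numpy as np


def logsig(x):
    """Logistic sigmoid used by the state layer."""
    return 1.0 / (1.0 + np.exp(-x))


def rernn_forward(A, V_state, V_sum, V_out):
    """Forward pass over G samples.

    A       : (G, M) input features, one row per sample
    V_state : (N, M + N) input-to-state weights
    V_sum   : (N, M + N) input-to-sum weights
    V_out   : (N, y) hidden-to-output weights
    Returns the (G, y) outputs Z.
    """
    G = A.shape[0]
    N = V_sum.shape[0]
    H_prev = np.zeros(N)   # delayed hidden output H_{g-1} (assumed zero at start)
    B_prev = np.zeros(N)   # memory content B_{g-1} (assumed zero at start)
    Z = np.zeros((G, V_out.shape[1]))
    for g in range(G):
        X = np.concatenate([A[g], H_prev])   # the M + N input nodes
        f = logsig(V_state @ X)              # Eq. (10): state-layer gate
        B = V_sum @ X + B_prev * f           # Eq. (12), gated memory as in Eq. (11)
        H = np.tanh(B)                       # Eq. (13)
        Z[g] = H @ V_out                     # Eq. (14), linear output
        H_prev, B_prev = H, B                # feed back to delay and memory layers
    return Z
```

The gate `f` scales how much of the previous sum-layer output is recalled at each step, which is the only difference from a plain recurrent update in this sketch.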

In this manuscript, COA is used to tune the hyperparameters of the R-ERNN network and find their optimum values. Several techniques exist for parameter selection, for example grid search, manual search, and random search. However, these share weaknesses in iteration time and do not perform a strategy-guided, informed search. To overcome these problems, COA is used. COA is a meta-heuristic algorithm. It predicts the speech signal and the clean waveform using a random exploration whose objective is a minimum error rate. COA is selected because it needs less iteration time than other tuning methods to determine the optimal hyperparameter values.

3.5 Step by step procedure of COA for optimizing the weight parameter of R-ERNN

The Chimp Optimization Algorithm (COA) [31] is used to improve the performance of the R-ERNN method for speech enhancement, so that R-ERNN produces an appropriate speech waveform with minimum error. COA is founded on the intelligence displayed by chimps during group hunting. Although no two chimps in a group are exactly alike in intellect or behaviour, they work well together to complete tasks. Chimps also hunt for meat in an erratic manner in order to gain social benefits such as grooming and sex. These two qualities play a crucial role in the design of COA. The fast convergence of COA yields an efficient optimal solution. The step-by-step procedure of COA is as follows.

Step 1: Initialization

The chimp population is initially dispersed at random throughout the search area, as represented in Eq. (15),

$\displaystyle Y^{0}\left(a,b\right)=Y^{L}\left(a\right)+\textit{rand}\cdot\left(Y^{U}\left(a\right)-Y^{L}\left(a\right)\right)$ (15)

where $a=1,2,\ldots,T_{V}$ , $b=1,2,\ldots,T_{C}$ , $Y^{0}\left({a,b}\right)$ indicates the initial value of the $a^{\text{th}}$ variable for the $b^{\text{th}}$ chimp, $Y^{L}\left(a\right)$ and $Y^{U}\left(a\right)$ indicate the lower and upper bounds of the $a^{\text{th}}$ variable, rand denotes a uniformly distributed number in the interval $\left[{0,1}\right]$ , and $T_{V}$ and $T_{C}$ indicate the total number of variables and chimps, respectively.
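The initialization of Eq. (15) can be written in a few lines of Python. This is a minimal sketch. The function name and the use of NumPy's random generator are choices made for illustration, and a separate random number is drawn for every variable and chimp.

```python
import numpy as np


def init_population(lower, upper, n_chimps, rng=None):
    """Eq. (15): Y0[a, b] = Y_L[a] + rand * (Y_U[a] - Y_L[a]).

    lower, upper : length-T_V sequences of per-variable bounds
    n_chimps     : T_C, the number of chimps
    Returns an array of shape (T_V, T_C).
    """
    rng = np.random.default_rng() if rng is None else rng
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    rand = rng.random((lower.size, n_chimps))   # uniform in [0, 1)
    return lower[:, None] + rand * (upper - lower)[:, None]
```

Passing a seeded generator, for example `np.random.default_rng(0)`, makes the initial population reproducible between runs.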

Step 2: Random Generation

After initialization, the input parameters of COA are generated at random to search for the best solution.

Step 3: Fitness function estimation

The fitness function defines the objective of optimizing the $Z_{\textit{gs}}$ and $H_{g}V$ weight parameters of R-ERNN. It is given in Eq. (16),

$\displaystyle\textit{Fitness function}=\textit{Optimize}\left(Z_{\textit{gs}}\ \text{and}\ H_{g}V\right)$ (16)

Step 4: Chasing and driving the prey for optimizing $Z_{\textit{gs}}$

The prey is hunted during the exploration and exploitation phases. Equation (17) expresses the driving and chasing of the prey mathematically,

$\displaystyle D=\left|\textit{co}_{\textit{vec}}\cdot Y_{\textit{prey}}\left(T\right)-\textit{Mo}_{\textit{vec}}\cdot Y_{\textit{chimp}}\left(T\right)\right|$ (17)

where $T$ is the current iteration number, $\textit{co}_{\textit{vec}}$ and $\textit{Mo}_{\textit{vec}}$ are the coefficient vectors, $Y_{\textit{prey}}$ is the position vector of the prey, and $Y_{\textit{chimp}}$ is the position vector of the chimp.
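Eq. (17) is an element-wise operation on position vectors, as the following Python sketch shows. The function name and the numeric values are hypothetical and serve only to make the arithmetic concrete. How the coefficient vectors are generated is not specified in this section, so they are passed in as arguments.

```python
import numpy as np


def chase_distance(co_vec, y_prey, mo_vec, y_chimp):
    """Eq. (17): D = |co_vec * Y_prey(T) - Mo_vec * Y_chimp(T)|, element-wise."""
    co_vec, mo_vec = np.asarray(co_vec, dtype=float), np.asarray(mo_vec, dtype=float)
    y_prey, y_chimp = np.asarray(y_prey, dtype=float), np.asarray(y_chimp, dtype=float)
    return np.abs(co_vec * y_prey - mo_vec * y_chimp)


# Illustrative values only: a two-dimensional search space.
D = chase_distance(co_vec=1.2, y_prey=[0.5, -1.0], mo_vec=0.8, y_chimp=[1.0, 0.5])
```

For the first coordinate the computation is $|1.2\cdot 0.5-0.8\cdot 1.0|=0.2$, and for the second $|1.2\cdot(-1.0)-0.8\cdot 0.5|=1.6$.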

Step 5: Attacking method for optimizing the $H_{g}V$

The attacker chimpanzee leads the hunting procedure. The chimpanzees that act as driver, chaser and barrier occasionally take part in the hunt. However, the ideal position in an abstract search area is not known in advance. To emulate the behaviour of chimpanzees numerically, the first attacker, barrier, driver and chaser are assumed to have superior knowledge of the location of potential prey. The four best solutions found so far are saved, and the remaining chimps are compelled to update their positions according to the positions of these four. Eq. (18) shows the relationship between the attacker, barrier, chaser and driver,

$\displaystyle\begin{array}{ll}D_{\textit{attacker}}=\left|\textit{Co}_{1}Y_{\textit{attacker}}-\textit{Mo}_{1}Y\right|,&D_{\textit{barrier}}=\left|\textit{Co}_{2}Y_{\textit{barrier}}-\textit{Mo}_{2}Y\right|,\\ D_{\textit{chaser}}=\left|\textit{Co}_{3}Y_{\textit{chaser}}-\textit{Mo}_{3}Y\right|,&D_{\textit{driver}}=\left|\textit{Co}_{4}Y_{\textit{driver}}-\textit{Mo}_{4}Y\right|\end{array}$ (18)
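The four distances of Eq. (18) and one way of using them to move a chimp can be sketched in Python as follows. The text gives only the distances. The position update used here, which steps from each leader by a coefficient `a` times the corresponding distance and then averages the four candidates, is an assumption borrowed from the commonly published form of the chimp optimization algorithm and is not stated in this manuscript. All function and argument names are hypothetical.

```python
import numpy as np

LEADERS = ("attacker", "barrier", "chaser", "driver")


def leader_distances(y, leaders, co, mo):
    """Eq. (18): one distance vector per leader, computed element-wise."""
    y = np.asarray(y, dtype=float)
    return {
        name: np.abs(co[name] * np.asarray(leaders[name], dtype=float) - mo[name] * y)
        for name in LEADERS
    }


def update_position(y, leaders, co, mo, a):
    """Assumed update: average of leader - a * D over the four leaders."""
    d = leader_distances(y, leaders, co, mo)
    candidates = [
        np.asarray(leaders[name], dtype=float) - a[name] * d[name] for name in LEADERS
    ]
    return np.mean(candidates, axis=0)
```

With this form, a chimp that already coincides with all four leaders stays where it is when the coefficients equal one, because every distance is then zero.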

Step 6: Termination

Finally, the proposed R-ERNN with COA estimates the clean speech spectra corresponding to the noisy speech spectra. If the optimum solution has been reached, the iterations stop; otherwise, Steps 1, 2 and 3 are repeated, with the counter updated as $y=y+1$ , until the halting criterion is met. Figure 2 shows the flowchart of COA.

4. Result and discussion

This section presents the results of R-ERNN optimized with COA for speech enhancement in hearing aids. The suggested method is implemented in MATLAB on a PC with an Intel Core i5 2.50 GHz central processing unit and 8 GB of random access memory. The performance metrics PESQ, STOI, CSII, SD, and SDR are analyzed. The efficacy of the proposed technique is compared with existing models: residual gated RNN-augmented Kalman filtering for SE with identification (RGRNN-SE-HA), a phase-aware composite deep neural network for speech enhancement (PACDNN-SE-HA), and a self-attending recurrent neural network for SE to enhance cross-corpus generalization (ARN-SE-HA).

Table 1
Perceptual evaluation of speech quality (PESQ)
Techniques	SNR (dB)
	2	4	6	8	10
RGRNN-SE-HA	1.56	1.65	1.69	1.72	1.67
PACDNN-SE-HA	1.65	1.58	1.63	1.72	1.78
ARN-SE-HA	1.73	1.64	1.52	1.65	1.74
R-ERNN-COA-SE-HA (proposed)	1.78	1.84	1.88	1.94	1.96

Figure 2.

Flowchart of COA.

4.1 Performance metrics

The proposed technique is assessed using the metrics PESQ, STOI, CSII, SD, and source-to-distortion ratio (SDR).

Table 1 presents the PESQ analysis. The proposed R-ERNN-COA-SE-HA method attains 23.74%, 24.81%, and 19.33% higher PESQ at an SNR of 2 dB; 32.16%, 26.19%, and 34.80% higher PESQ at 4 dB; 38.71%, 20.90%, and 27.38% higher PESQ at 6 dB; 32.48%, 25.87%, and 36.10% higher PESQ at 8 dB; and 20.27%, 28.04%, and 40.77% higher PESQ at 10 dB, compared with the existing RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA models, respectively.
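One common way to express a statement of the form "X% higher" is the relative change of the proposed score over a baseline score. The short Python sketch below applies that definition to the 2 dB column of Table 1. The manuscript does not state how its quoted percentages were derived, so the relative-change formula is an assumption, and values computed this way from the tabulated scores need not coincide with the percentages quoted in the text.

```python
def relative_change_percent(proposed, baseline):
    """Relative change of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# PESQ at an SNR of 2 dB, copied from Table 1.
pesq_2db = {"RGRNN-SE-HA": 1.56, "PACDNN-SE-HA": 1.65, "ARN-SE-HA": 1.73}
proposed_2db = 1.78

gains = {
    name: relative_change_percent(proposed_2db, value)
    for name, value in pesq_2db.items()
}
for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```

The same function can be applied to the other columns and to Tables 2 to 5; for speech distortion, where lower is better, the sign of the result is reversed.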

Table 2
Short-time objective intelligibility (STOI)
Techniques	SNR (dB)
	0.6	1	1.4	1.8	2.2
RGRNN-SE-HA	0.48	0.61	0.53	0.73	0.60
PACDNN-SE-HA	0.50	0.53	0.58	0.64	0.68
ARN-SE-HA	0.71	0.74	0.79	0.82	0.86
R-ERNN-COA-SE-HA (proposed)	0.87	0.89	0.92	0.93	0.95

Table 2 presents the short-time objective intelligibility (STOI) analysis. The proposed R-ERNN-COA-SE-HA method attains 34.19%, 24.38%, and 32.38% higher STOI at an SNR of 0.6 dB; 19.78%, 30.33%, and 27.18% higher STOI at 1 dB; 28.37%, 34.92%, and 26.15% higher STOI at 1.4 dB; and 36.21%, 24.78%, and 20.24% higher STOI at 1.8 dB, compared with the existing RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA models, respectively.

Table 3

Coherence speech intelligibility index (CSII)
Techniques	SNR (dB)
	0.5	1	1.5	2	2.5
RGRNN-SE-HA	0.32	0.38	0.42	0.48	0.52
PACDNN-SE-HA	0.42	0.46	0.50	0.55	0.51
ARN-SE-HA	0.51	0.63	0.68	0.72	0.76
R-ERNN-COA-SE-HA (proposed)	0.80	0.84	0.88	0.90	0.87

Table 3 presents the coherence speech intelligibility index (CSII) analysis. The proposed R-ERNN-COA-SE-HA method attains 32.41%, 19.65%, and 28.26% higher CSII at an SNR of 0.5 dB; 16.15%, 40.32%, and 51.21% higher CSII at 1 dB; 17.95%, 18.34%, and 31.74% higher CSII at 1.5 dB; and 30.04%, 24.76%, and 27.84% higher CSII at 2.5 dB, compared with the existing RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA models, respectively.

Table 4

Speech distortion (SD)
Techniques	SNR (dB)
	0.5	1	1.5	2	2.5
RGRNN-SE-HA	6.61	6.90	7.43	7.89	8.35
PACDNN-SE-HA	7.47	7.94	8.38	7.42	7.85
ARN-SE-HA	5.81	6.38	6.71	7.24	8.06
R-ERNN-COA-SE-HA (proposed)	2.25	3.54	3.98	4.24	4.67

Table 4 presents the speech distortion (SD) analysis. The proposed R-ERNN-COA-SE-HA method attains 28.78%, 26.79%, and 31.38% lower SD at an SNR of 0.5 dB; 18.54%, 26.61%, and 17.78% lower SD at 1 dB; 26.84%, 21.28%, and 34.32% lower SD at 1.5 dB; and 24.95%, 24.17%, and 26.71% lower SD at 2.5 dB, compared with the existing RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA models, respectively.

Table 5

Source-to-distortion ratio (SDR)
Techniques	SNR (dB)
	0.5	1	1.5	2	2.5
RGRNN-SE-HA	65.3	69.2	72.8	76.1	78.9
PACDNN-SE-HA	74.7	77.5	78.1	79.5	80.5
ARN-SE-HA	70.7	71.9	72.3	74.8	75.1
R-ERNN-COA-SE-HA (proposed)	80.8	84.3	86.2	88.9	0.9

Table 5 presents the source-to-distortion ratio (SDR) analysis. The proposed R-ERNN-COA-SE-HA method attains 24.85%, 21.34%, and 35.30% higher SDR at an SNR of 0.5 dB; 27.45%, 26.74%, and 26.96% higher SDR at 1 dB; 17.75%, 19.40%, and 26.84% higher SDR at 1.5 dB; and 26.21%, 28.20%, and 19.81% higher SDR at 2.5 dB, compared with the existing RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA models, respectively.

Figure 3.

Training and validation label distribution.

Figure 3 shows the training and validation label distribution of the proposed method.

5. Conclusion

In this manuscript, R-ERNN optimized with COA for speech enhancement in hearing aids is implemented. The proposed R-ERNN-COA-SE-HA method is implemented in MATLAB, and its efficiency is evaluated using several performance metrics: PESQ, STOI, CSII, SD, and SDR. The proposed R-ERNN-COA-SE-HA method attains 34.19%, 24.38%, and 32.38% higher STOI and 28.78%, 26.79%, and 31.38% lower SD compared with the existing RGRNN-SE-HA, PACDNN-SE-HA, and ARN-SE-HA models, respectively. In future work, adaptive beamforming techniques could be introduced to further improve speech enhancement in hearing aids. Such improvements may also help speech recognition systems to work properly on challenging and complex problems. In the future, a new task for speech recognition systems will be to use all available incentives to project the ideals of society and business into human-machine conversations.

References

1. Park, Cho, Kim, Lee. Speech enhancement for hearing aids with deep learning on environmental noises. Applied Sciences. 2020; 10(17): 6077.

2. Green, Hilkhuysen, Huckvale, Rosen, Brookes, Moore, Naylor, Lightburn, Xue. Speech recognition with a hearing-aid processing scheme combining beamforming with mask-informed speech enhancement. Trends in Hearing. 2022; 26: 23312165211068629.

3. Shankar, Bhat, Reddy, Panahi. Noise dependent super gaussian-coherence based dual microphone speech enhancement for hearing aid application using smartphone. arXiv preprint arXiv:2001.09571. 2020.

4. Chen, Shi, Xiao, Wang, Shang, Meng, Zheng. A cascaded speech enhancement for hearing aids in noisy-reverberant conditions. In: Proc. Clarity Workshop on Machine Learning Challenges for Hearing Aids. 2021.

5. Sun, Jiang, Chen, Xie, Wang. A supervised speech enhancement method for smartphone-based binaural hearing aids. IEEE Transactions on Biomedical Circuits and Systems. 2020; 14(5): 951-60.

6. Shajin, Rajesh, Thilaha. Bald eagle search optimization algorithm for cluster head selection with prolong lifetime in wireless sensor network. Journal of Soft Computing and Engineering Applications. 2020; 1(1): 7.

7. Shajin, Aruna Devi, Prakash, Sreekanth, Rajesh. Sailfish optimizer with Levy flight, chaotic and opposition-based multi-level thresholding for medical image segmentation. Soft Computing. 2023; 1-26.

8. Gogate, Dashtipour, Adeel, Hussain. CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement. Information Fusion. 2020; 63: 273-85.

9. Schröter, Rosenkranz, Escalante-B, Maier. LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement. In: Interspeech. 2021; 656-660.

10. Michelsanti, Tan, Zhang, Jensen. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021; 29: 1368-96.

11. Schröter, Rosenkranz, Escalante-B, Maier. Low latency speech enhancement for hearing aids using deep filtering. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2022; 30: 2716-28.

12. Passos, Papa, Hussain, Adeel. Canonical cortical graph neural networks and its application for speech enhancement in audio-visual hearing aids. Neurocomputing. 2023; 527: 196-203.

13. Hoang, Tan, De Haan, Jensen. The minimum overlap-gap algorithm for speech enhancement. IEEE Access. 2022; 10: 14698-716.

14. Hoang, De Haan, Tan, Jensen. Multichannel speech enhancement with own voice-based interfering speech suppression for hearing assistive devices. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2022; 30: 706-20.

15. Garg. Speech enhancement using long short term memory with trained speech features and adaptive wiener filter. Multimedia Tools and Applications. 2023; 82(3): 3647-75.

16. Girirajan, Pandian. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network. Intelligent Automation & Soft Computing. 2023; 35(2).

17. Patil, Jaware, Patil, Badgujar, Albu, Mahariq, Al-Sheikh, Nayak. Marathi Speech Intelligibility Enhancement Using I-AMS Based Neuro-Fuzzy Classifier Approach for Hearing Aid Users. IEEE Access. 2022; 10: 123028-42.

18. Passos, Papa, Del Ser, Hussain, Adeel. Multimodal audio-visual information fusion using canonical-correlated graph neural network for energy-efficient speech enhancement. Information Fusion. 2023; 90: 1-1.

19. Lin, van Wijngaarden, Wang, Smith. Speech enhancement using multi-stage self-attentive temporal convolutional networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021; 29: 3440-50.

20. Kim, Shin. Improved speech enhancement considering speech PSD uncertainty. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2022; 30: 1939-51.

21. Saleem, Gao, Khattak, Rauf, Kadry, Shafi. DeepResGRU: residual gated recurrent neural network-augmented Kalman filtering for speech enhancement and recognition. Knowledge-Based Systems. 2022; 238: 107914.

22. Hasannezhad, Zhu, Champagne. PACDNN: A phase-aware composite deep neural network for speech enhancement. Speech Communication. 2022; 136: 1-3.

23. Pandey, Wang. Self-attending RNN for speech enhancement to improve cross-corpus generalization. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2022; 30: 1374-85.

24. Lei, Hou, Yang, Sun, Rong, Wang, Chen. A low-latency hybrid multi-channel speech enhancement system for hearing aids. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023; 1-2. IEEE.

25. Cantu, Hohmann. Spectro-Temporal Post-Filtering Via Short-Time Target Cancellation for Directional Speech Enhancement in a Dual-Microphone Hearing Aid. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023; 1-5. IEEE.

26. Wang, Cornell, Choi, Lee, Kim, Watanabe. Neural speech enhancement with very low algorithmic latency and complexity via integrated full- and sub-band modeling. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2023; 1-5. IEEE.

27. https://dagshub.com/hazalkl/MS-SNSD/src/master/Data.

28. Klejsa, Hedelin, Zhou, Fejgin, Villemoes. High-quality speech coding with sample RNN. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2019; 7155-7159. IEEE.

29. Tuncer, Dogan, Subasi. Surface EMG signal classification using ternary pattern and discrete wavelet transform based feature extraction for hand movement recognition. Biomedical Signal Processing and Control. 2020; 58: 101872.

30. Gao, Gong, Zhang, Lin, Wang, Huang, Zurada. A Recalling-Enhanced Recurrent Neural Network: conjugate gradient learning algorithm and its convergence analysis. Information Sciences. 2020; 519: 273-88.

31. Khishe, Mosavi. Chimp optimization algorithm. Expert Systems with Applications. 2020; 149: 113338.

Recalling-Enhanced Recurrent Neural Network optimized with Chimp Optimization Algorithm based speech enhancement for hearing aids
