Nonnegative Matrix Factorization Based Adaptive Noise Sensing over Wireless Sensor Networks

Abstract

An adaptive noise sensing method is proposed to improve the speech sensing performance of speech-based applications operated over wireless sensor networks. The proposed method is based on nonnegative matrix factorization (NMF) and consists of adaptive noise sensing and noise reduction. Specifically, adaptive noise sensing is performed by adapting an a priori noise basis matrix of the NMF, which is estimated from the noise signal, resulting in an adapted noise basis matrix. Subsequently, the adapted noise basis matrix is used for the NMF decomposition of noisy speech into clean speech and background noise. The estimated clean speech signal is then applied to a front-end of the speech-based applications. The performance of the proposed NMF-based noise sensing and reduction method is first evaluated by measuring the source to distortion ratio (SDR), the source to interferences ratio (SIR), and the source to artifacts ratio (SAR). In addition, the proposed method is applied to an automatic speech recognition (ASR) system, which is a typical speech-based application, and the average word error rate (WER) of the ASR system is compared with that of ASR systems employing either a Wiener filter or a conventional NMF-based noise reduction method that uses only an a priori noise basis matrix.

1. Introduction

Speech-based user applications operated over wireless sensor networks are increasingly being utilized in various environments, for example, smart home, smart TV, and cars, since they have become a key feature of smart user interfaces [1–4]. However, as the number of the speech-based application fields has increased, the various types of background noise could negatively affect the speech sensing performance of the applications deployed over wireless sensor networks. These background noises can be classified into two types—stationary and nonstationary—depending on the variability of their characteristics over time. Many conventional methods, including spectral subtraction [5], minimum mean square error log-spectral amplitude (MMSE-LSA) [6, 7], and Wiener filtering [8], have been reported to effectively reduce stationary noise that was recorded with speech signals. Consequently, they were successfully applied to a front-end of an automatic speech recognition (ASR) system, that is, a typical speech-based application over wireless sensor networks [9]. However, since these conventional methods were developed based on the stationary noise assumption, their performance could degrade under nonstationary noise conditions [10, 11]. Thus, the reduction of nonstationary noise is important for reliable noise-robust speech-based applications over wireless sensor networks under various kinds of environmental noise conditions.

As an alternative, nonnegative matrix factorization (NMF)-based noise reduction methods have been proposed to estimate the noise spectrum effectively under nonstationary noise conditions [12–16]. In particular, recent research works have reported that NMF-based noise reduction methods have been successfully applied to a front-end of an ASR system under various nonstationary noise environments [14–16]. However, the performance of NMF-based noise reduction methods degraded substantially when there was a mismatch in noise type between noise basis training and estimation using NMF [17, 18]. Several approaches have been proposed to improve the noise reduction performance of NMF when there was a mismatch between the training and estimation of the speech and/or noise basis [19–21]. In particular, the real-time semisupervised source separation methods in [19, 20] assumed that the noise basis was prepared at the training stage, while the speech basis was learned online under nonstationary noisy conditions. These methods dealt with the mismatch in speech basis training and estimation, but they did not consider the mismatch between noise basis training and estimation. In [21], a universal model was introduced to overcome the mismatch between noise basis training and estimation, where each noise basis was trained to represent a certain type of noise source. However, the performance of the universal model would be limited because noise in a real-world environment could be composed of multiple overlapping sources [18].

In this paper, an NMF-based adaptive noise sensing and reduction method is proposed to improve the performance of an ASR system when the noise type is mismatched between noise basis training and estimation. The proposed method adaptively updates an a priori noise basis matrix of the NMF on the fly by estimating the noise signal prior to the actual speech signal. Next, NMF decomposition is carried out with the adapted noise basis matrix in order to estimate clean speech and background noise from noisy speech. Finally, the estimated clean speech signal is applied to a front-end of an ASR system in order to improve the performance of ASR under various noise types.

The rest of this paper is organized as follows. Following this introduction, Section 2 briefly reviews a conventional NMF-based noise reduction method. Section 3 proposes an NMF-based noise sensing and reduction method. Section 4 evaluates the performance of the proposed method and compares it with those of conventional methods. Finally, the paper is concluded in Section 5.

2. Conventional NMF-Based Noise Reduction Method

Figure 1 shows the procedure of a conventional NMF-based noise reduction method applied for an ASR. As shown in the figure, noisy speech is captured by a microphone; then a block of speech signal, called a speech frame, is transformed into a frequency domain by applying a short-time Fourier transform (STFT). Next, an NMF technique is applied to estimate the clean speech spectrum by reducing the noise spectrum. Consequently, the estimated clean spectrum is transformed back into the time domain by applying an inverse STFT. Finally, an ASR system, which is typically based on hidden Markov models [22], is constructed from the feature parameters extracted from this estimated clean speech.

Figure 1

Procedure of a conventional NMF-based noise reduction method.

As mentioned in Figure 1, an NMF-based noise reduction method attempts to decompose a noisy speech signal into separate speech and noise signals by exploring the sparseness of the noisy speech [18]. To explain how to estimate speech and noise with NMF, noisy speech at the ith speech frame, $y_{i} (n)$ , is represented as

\begin{matrix} y_{i} (n) = s_{i} (n) + d_{i} (n), \end{matrix}

(1)

where $s_{i} (n)$ and $d_{i} (n)$ are clean speech and additive noise at the ith speech frame, respectively, and $d_{i} (n)$ is assumed to be uncorrelated with $s_{i} (n)$. By applying an STFT to (1), $y_{i} (n)$ can be represented as the spectral components, as

\begin{matrix} Y_{i} (k) = S_{i} (k) + D_{i} (k) \quad \text{for } k = 0, 1, \dots, K - 1, \end{matrix}

(2)

where $Y_{i} (k)$, $S_{i} (k)$, and $D_{i} (k)$ denote the kth spectral components of $y_{i} (n)$, $s_{i} (n)$, and $d_{i} (n)$, respectively. In order to estimate $S_{i} (k)$ from $Y_{i} (k)$, the spectral magnitudes of several speech frames are concatenated together so that $Y = S + D$ is obtained. Note that it is assumed here that $| Y_{i} (k) | \approx | S_{i} (k) | + | D_{i} (k) |$, because this assumption has provided satisfactory results for NMF-based noise reduction [23, 24]. Thus, the matrices $Y$, $S$, and $D$ are all $K \times N$ matrices, where K and N are the number of frequency bins and the number of concatenated frames, respectively.
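The construction of the $K \times N$ magnitude matrix can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame length of 512 samples, the hop of 256 samples, the Hann window, and the function name `magnitude_spectrogram` are all assumptions. A 512-point transform is chosen only because it yields $K = 257$ bins, which matches the value of K reported in Section 3.1.

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=512, hop=256):
    """Return the K x N matrix of STFT magnitudes, with K = frame_len // 2 + 1.

    frame_len, hop, and the Hann window are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Each column holds one windowed speech frame.
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # One-sided FFT along the time axis of each frame, then magnitudes.
    return np.abs(np.fft.rfft(frames, axis=0))

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)  # stand-in for one second of audio at 16 kHz
Y = magnitude_spectrogram(y)
print(Y.shape)
```

Each column of `Y` is the magnitude spectrum of one frame, so the column index plays the role of i and the row index the role of k in (2).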

In the NMF framework, Y, S, and D are represented as $Y = B_{Y} A_{Y}$, $S = B_{S} A_{S}$, and $D = B_{D} A_{D}$, respectively, where $B_{Y}$, $B_{S}$, and $B_{D}$ are the basis matrices of Y, S, and D, respectively, and $A_{Y}$, $A_{S}$, and $A_{D}$ are the activation matrices corresponding to $B_{Y}$, $B_{S}$, and $B_{D}$, respectively. If it is assumed that S and D are fully separable from $Y$, $Y$ can be rewritten as [23]

\begin{array}{l} Y = B_{Y} A_{Y} \\ = [B_{S} B_{D}] \begin{bmatrix} A_{S} \\ A_{D} \end{bmatrix} = B_{S} A_{S} + B_{D} A_{D} = S + D, \end{array}

(3)

where $B_{Y} = [B_{S} B_{D}]$ and $A_{Y} = {[A_{S} A_{D}]}^{T}$. Here, T is a transpose operator. Since $R_{S}$ and $R_{D}$ ($R_{Y} = R_{S} + R_{D}$) are the ranks of the basis matrices for S and D, the dimensions of $B_{Y}$, $B_{S}$, and $B_{D}$ are $K \times R_{Y}$, $K \times R_{S}$, and $K \times R_{D}$, respectively, and the dimensions of $A_{Y}$, $A_{S}$, and $A_{D}$ are $R_{Y} \times N$, $R_{S} \times N$, and $R_{D} \times N$, respectively. As described in (3), S and D can be obtained if $B_{S}$, $B_{D}$, $A_{S}$, and $A_{D}$ are known a priori. Therefore, most conventional NMF-based noise reduction methods estimate $B_{\tilde{S}}$ and $B_{\tilde{D}}$ from previously stored speech and noise databases, $\tilde{S}$ and $\tilde{D}$ [14, 15]. For given $\tilde{S}$ and $\tilde{D}$, $B_{\tilde{S}}$ and $B_{\tilde{D}}$ can be obtained through a multiplicative update rule [25] as

\begin{matrix} B_{\tilde{S}}^{t} = B_{\tilde{S}}^{t - 1} \otimes \frac{(\tilde{S} / B_{\tilde{S}}^{t - 1} A_{\tilde{S}}^{t - 1}) {(A_{\tilde{S}}^{t - 1})}^{T}}{1 {(A_{\tilde{S}}^{t - 1})}^{T}}, \end{matrix}

(4)

\begin{matrix} A_{\tilde{S}}^{t} = A_{\tilde{S}}^{t - 1} \otimes \frac{{(B_{\tilde{S}}^{t})}^{T} (\tilde{S} / B_{\tilde{S}}^{t} A_{\tilde{S}}^{t - 1})}{{(B_{\tilde{S}}^{t})}^{T} 1}, \end{matrix}

(5)

\begin{matrix} B_{\tilde{D}}^{t} = B_{\tilde{D}}^{t - 1} \otimes \frac{(\tilde{D} / B_{\tilde{D}}^{t - 1} A_{\tilde{D}}^{t - 1}) {(A_{\tilde{D}}^{t - 1})}^{T}}{1 {(A_{\tilde{D}}^{t - 1})}^{T}}, \end{matrix}

(6)

\begin{matrix} A_{\tilde{D}}^{t} = A_{\tilde{D}}^{t - 1} \otimes \frac{{(B_{\tilde{D}}^{t})}^{T} (\tilde{D} / B_{\tilde{D}}^{t} A_{\tilde{D}}^{t - 1})}{{(B_{\tilde{D}}^{t})}^{T} 1}, \end{matrix}

(7)

where t is an iteration index and both multiplication, ⊗, and division are applied on an element-by-element basis. In addition, $1$ is a $K \times N$ matrix in which all elements are equal to unity. Note that all elements of $B_{\tilde{S}}^{0}$, $A_{\tilde{S}}^{0}$, $B_{\tilde{D}}^{0}$, and $A_{\tilde{D}}^{0}$ can be set initially at random values between 0 and 1.

In the NMF training, $B_{\tilde{S}}$ is obtained by repeating (4) and (5) until the relative reduction of the NMF objective function between successive iterations falls below a predefined threshold. In this paper, the Kullback-Leibler (KL) divergence is employed as the NMF objective function [25]. Similarly, $B_{\tilde{D}}$ is obtained by repeating (6) and (7), and the estimation process is terminated based on the KL divergence.
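The multiplicative updates (4)–(7) can be sketched in NumPy as below. This is a hedged illustration rather than the authors' code: the function name `nmf_kl_train`, the small constant `eps` that guards against division by zero, and the fixed iteration count used in place of the threshold-based stopping rule are all assumptions made for brevity.

```python
import numpy as np

def nmf_kl_train(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Factor a nonnegative K x N matrix V into B (K x rank) and A (rank x N).

    The same routine serves for speech, as in (4)-(5), and noise, as in (6)-(7).
    n_iter and eps are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    K, N = V.shape
    B = rng.random((K, rank)) + eps   # random initial values in (0, 1)
    A = rng.random((rank, N)) + eps
    ones = np.ones((K, N))            # the all-ones K x N matrix
    for _ in range(n_iter):
        # Basis update, equations (4) and (6).
        B *= ((V / (B @ A + eps)) @ A.T) / (ones @ A.T + eps)
        # Activation update with the new basis, equations (5) and (7).
        A *= (B.T @ (V / (B @ A + eps))) / (B.T @ ones + eps)
    return B, A

V = np.random.default_rng(1).random((20, 30)) + 0.1
B, A = nmf_kl_train(V, rank=4)
print(B.shape, A.shape)
```

Because every factor in the updates is nonnegative, `B` and `A` remain nonnegative throughout, which is what lets the columns of `B` be read as spectral building blocks.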

As described above, the conventional noise reduction methods are performed using $B_{S} = B_{\tilde{S}}^{T_{S}}$ and $B_{D} = B_{\tilde{D}}^{T_{D}}$ , where $T_{S}$ and $T_{D}$ are the final iterations of the NMF training. Then, the activation matrices, $A_{S}$ and $A_{D}$ , are calculated by the multiplicative NMF update rule as

\begin{array}{l} \begin{bmatrix} A_{S}^{t} \\ A_{D}^{t} \end{bmatrix} \\ = \begin{bmatrix} A_{S}^{t - 1} \\ A_{D}^{t - 1} \end{bmatrix} \otimes \frac{{[B_{S} B_{D}]}^{T} (Y / [B_{S} B_{D}] {[A_{S}^{t - 1} A_{D}^{t - 1}]}^{T})}{{[B_{S} B_{D}]}^{T} 1}, \end{array}

(8)

where all elements for $A_{S}^{0}$ and $A_{D}^{0}$ are also set as random values between 0 and 1. Similar to the NMF training, (8) is repeated until an NMF objective function converges. Subsequently, $S^{'}$ is set to $B_{S} A_{S}^{T_{A}}$, where $T_{A}$ is the last iteration of (8). Finally, an estimate of clean speech, $s_{i}^{'} (n)$, is obtained by applying an inverse STFT to $S^{'}$, and it is fed to the front-end of an ASR system.
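The decomposition step in (8), where both bases are held fixed and only the activations are updated, might look like the following sketch. The function name `decompose_fixed_bases`, the iteration count, and `eps` are assumptions. The inverse STFT is omitted, so the function returns magnitude estimates only.

```python
import numpy as np

def decompose_fixed_bases(Y, B_S, B_D, n_iter=100, eps=1e-12, seed=0):
    """Estimate speech and noise magnitudes from Y with fixed bases, as in (8).

    n_iter and eps are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    B = np.hstack([B_S, B_D])                       # [B_S B_D], K x (R_S + R_D)
    A = rng.random((B.shape[1], Y.shape[1])) + eps  # stacked [A_S; A_D]
    denom = B.T @ np.ones_like(Y) + eps             # constant, since B is fixed
    for _ in range(n_iter):
        A *= (B.T @ (Y / (B @ A + eps))) / denom
    R_S = B_S.shape[1]
    A_S, A_D = A[:R_S], A[R_S:]
    return B_S @ A_S, B_D @ A_D                     # estimates of S and D

rng = np.random.default_rng(1)
B_S, B_D = rng.random((16, 3)), rng.random((16, 2))
Y = B_S @ rng.random((3, 10)) + B_D @ rng.random((2, 10))
S_est, D_est = decompose_fixed_bases(Y, B_S, B_D)
print(S_est.shape, D_est.shape)
```

A useful property of this update is that, after each iteration, every column of `S_est + D_est` sums to the same value as the corresponding column of `Y`, so the energy of each frame is split between the speech and noise estimates rather than lost.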

However, the main drawback of the conventional NMF-based noise reduction methods is that the noise reduction performance is not reliable when there is a mismatch in noise types between the noise basis training and estimation using NMF. In other words, the noise basis matrix obtained at the training stage, $B_{\tilde{D}}^{T_{D}}$, may be inappropriate for the noise encountered during the ASR test. To overcome this problem, $B_{\tilde{D}}^{T_{D}}$ should be updated during the ASR test.

3. Proposed NMF-Based Adaptive Noise Sensing and Reduction

In this section, an NMF-based adaptive noise sensing and reduction method is proposed to mitigate the degradation of noise reduction when there is a mismatch in noise types between noise basis training and estimation using NMF. Figure 2 shows the procedure of the proposed NMF-based adaptive noise sensing and reduction method. As shown in the figure, the procedure is divided into three different processing stages: a priori NMF basis modeling, NMF-based adaptive noise sensing, and noise reduction. The first processing stage of the proposed method is the same as that of the conventional method described in Section 2. In other words, clean speech signals and noise signals are separately applied to the NMF training in order to obtain the a priori basis matrices. In the second processing stage, the adaptive noise sensing is performed to decompose the noisy input spectrum into speech and noise spectrum using a priori speech basis matrix estimated by the first processing stage. That is, the noise basis and activation matrices are obtained by adapting a priori noise basis from the instantaneous noise frames of the noisy input signal. Finally, the third processing stage of the proposed method estimates the noise-reduced speech signal by constructing a Wiener filter [8] using the adaptively estimated noise spectrum. The following subsections describe a priori NMF basis acquisition, NMF-based adaptive noise sensing, and noise reduction in detail.

Figure 2

Procedure of the proposed NMF-based adaptive noise sensing and reduction method.

3.1. Modeling of A Priori NMF Basis Matrix

This subsection describes how to obtain the NMF basis matrices of speech and noise signals. As mentioned in Section 2, for given speech and noise databases, $\tilde{S}$ and $\tilde{D}$, the basis and activation matrices for speech and noise, $B_{\tilde{S}}$, $A_{\tilde{S}}$, $B_{\tilde{D}}$, and $A_{\tilde{D}}$, are obtained by iterating equations (4)–(7). Notice that the dimensions of $B_{\tilde{S}}$, $A_{\tilde{S}}$, $B_{\tilde{D}}$, and $A_{\tilde{D}}$ are $K \times R_{S}$, $R_{S} \times N$, $K \times R_{D}$, and $R_{D} \times N$, respectively, where $R_{S}$ and $R_{D}$ are the numbers of bases for $\tilde{S}$ and $\tilde{D}$. In this paper, K, N, $R_{S}$, and $R_{D}$ are set to 257, 2000, 80, and 40, respectively. In particular, as a termination condition for the NMF iteration, the divergence cost function [25] for (4) and (5) is defined as

\begin{array}{l} Div (\tilde{S}; A_{\tilde{S}}^{t}, B_{\tilde{S}}^{t}) \\ = \sum_{K, N} | (\tilde{S} \otimes \log (\frac{\tilde{S}}{(B_{\tilde{S}}^{t} A_{\tilde{S}}^{t})})) - (\tilde{S} - B_{\tilde{S}}^{t} A_{\tilde{S}}^{t}) |, \end{array}

(9)

where the summation notation means adding all the elements of a matrix together. Accordingly, if $| Div (\tilde{S}; A_{\tilde{S}}^{t}, B_{\tilde{S}}^{t}) - Div (\tilde{S}; A_{\tilde{S}}^{t - 1}, B_{\tilde{S}}^{t - 1}) | / Div (\tilde{S}; A_{\tilde{S}}^{t - 1}, B_{\tilde{S}}^{t - 1}) < θ$, then $B_{\tilde{S}} = B_{\tilde{S}}^{t}$ and $A_{\tilde{S}} = A_{\tilde{S}}^{t}$, where, in this paper, θ is set to 0.001 from the preliminary experiments at the training stage. Similarly, $B_{\tilde{D}} = B_{\tilde{D}}^{t}$ and $A_{\tilde{D}} = A_{\tilde{D}}^{t}$ are obtained from (6) and (7) if the relative change in divergence falls below the predefined threshold, that is, $| Div (\tilde{D}; A_{\tilde{D}}^{t}, B_{\tilde{D}}^{t}) - Div (\tilde{D}; A_{\tilde{D}}^{t - 1}, B_{\tilde{D}}^{t - 1}) | / Div (\tilde{D}; A_{\tilde{D}}^{t - 1}, B_{\tilde{D}}^{t - 1}) < θ$, where θ is also set to 0.001. Finally, $B_{\tilde{S}}$ and $B_{\tilde{D}}$ are applied to the NMF-based adaptive noise sensing stage, which will be explained in the next subsection.
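The divergence in (9) and the relative-change stopping rule can be sketched as follows. The names `kl_divergence` and `train_until_converged`, the `eps` guard, and the `max_iter` safety cap are assumptions and do not come from the paper. Only the threshold θ = 0.001 is taken from the text.

```python
import numpy as np

def kl_divergence(V, B, A, eps=1e-12):
    """Divergence of (9): sum over all elements of |V*log(V/BA) - (V - BA)|."""
    W = B @ A + eps
    return float(np.sum(np.abs(V * np.log((V + eps) / W) - (V - W))))

def train_until_converged(V, rank, theta=1e-3, max_iter=1000, seed=0):
    """Iterate the multiplicative updates until the relative change of the
    divergence drops below theta. max_iter is an assumed safety cap."""
    rng = np.random.default_rng(seed)
    K, N = V.shape
    B = rng.random((K, rank)) + 1e-3
    A = rng.random((rank, N)) + 1e-3
    ones = np.ones((K, N))
    prev = kl_divergence(V, B, A)
    for t in range(1, max_iter + 1):
        B *= ((V / (B @ A + 1e-12)) @ A.T) / (ones @ A.T + 1e-12)
        A *= (B.T @ (V / (B @ A + 1e-12))) / (B.T @ ones + 1e-12)
        cur = kl_divergence(V, B, A)
        if abs(cur - prev) / prev < theta:
            break
        prev = cur
    return B, A, t

V = np.random.default_rng(2).random((20, 40)) + 0.1
B, A, n_used = train_until_converged(V, rank=5)
print(n_used)
```

Since the threshold is relative, a larger θ can only stop the same run at an equal or earlier iteration.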

3.2. NMF-Based Adaptive Noise Sensing

This subsection describes how the noise spectrum is adapted within the NMF framework. First, noise frames are detected from noisy input speech. Then, the detected noise frames are concatenated to construct a noise matrix, $D^{'}$, which is used to adapt the noise basis matrix $B_{\tilde{D}}$ estimated a priori as described in Section 3.1. Specifically, the ratio of speech and noise magnitudes for the kth frequency bin at the ith frame, $R_{i} (k)$, is first calculated as

\begin{matrix} R_{i} (k) = \frac{| {\tilde{S}}_{i} (k) |}{| {\tilde{D}}_{i} (k) |}, \end{matrix}

(10)

where $| {\tilde{S}}_{i} (k) |$ and $| {\tilde{D}}_{i} (k) |$ are the clean and noise spectral components estimated from $\tilde{S} = B_{\tilde{S}} A_{S}$ and $\tilde{D} = B_{\tilde{D}} A_{D}$, respectively. Then, a speech mask at the ith frame, $m_{s} (i)$, is calculated as

\begin{matrix} m_{s} (i) = \frac{1}{K} \sum_{k = 0}^{K - 1} R_{i} (k) . \end{matrix}

(11)

Using (11), a set of noise frames, $I_{D}$ , is selected which satisfies the equation of

\begin{matrix} I_{D} = \{ i \mid m_{s} (i) \leq η \}, \end{matrix}

(12)

where η is a threshold for detecting noise frames. In this paper, η is determined by considering the mean and variance of $m_{s} (i)$ such that

\begin{matrix} η = μ_{m} + κ σ_{m}, \end{matrix}

(13)

where $μ_{m} = (1 / I) \sum_{i = 1}^{I} m_{s} (i)$ and $σ_{m}^{2} = (1 / I) \sum_{i = 1}^{I} (m_{s} (i) - μ_{m})^{2}$. For the mean and variance calculation, the first I frames of noisy speech are assumed to be noise frames and $I = 20$ in this paper. In addition, κ in (13) is set so that approximately 80% of the initial I frames are included in $I_{D}$. As a result, a noise binary mask for the kth frequency bin at the ith frame, $M_{i} (k)$, can be defined as

\begin{matrix} M_{i} (k) = \begin{cases} 1, & \text{if } i \in I_{D} \\ 0, & \text{otherwise} \end{cases} \end{matrix}

(14)

and a noise matrix is estimated as $D^{'} = M \otimes Y$, where M is a $(K \times N)$ noise mask matrix constructed by (14).
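Equations (10)–(14) can be put together as in the sketch below. The function name `noise_frame_matrix`, the `eps` guard in the ratio, and in particular the default `kappa=1.0` are assumptions: the text does not state a numeric value for κ, only that it is chosen so that roughly 80% of the first I frames fall in $I_{D}$.

```python
import numpy as np

def noise_frame_matrix(Y, S_est, D_est, n_init=20, kappa=1.0, eps=1e-12):
    """Build D' = M (x) Y from first-pass speech and noise magnitude estimates.

    kappa=1.0 and eps are illustrative assumptions; n_init plays the role of I.
    """
    R = S_est / (D_est + eps)              # (10) speech-to-noise magnitude ratio
    m_s = R.mean(axis=0)                   # (11) average over the K bins
    mu = m_s[:n_init].mean()               # statistics of the first I frames,
    sigma = m_s[:n_init].std()             # using the 1/I normalisation
    eta = mu + kappa * sigma               # (13) detection threshold
    is_noise = m_s <= eta                  # (12) membership in I_D
    M = np.tile(is_noise.astype(float), (Y.shape[0], 1))  # (14), K x N
    return M * Y, is_noise

rng = np.random.default_rng(0)
K, N = 8, 50
S_est = np.empty((K, N))
S_est[:, :30] = 0.1 * rng.random((K, 30))  # weak speech estimate: noise-like
S_est[:, 30:] = 5.0                        # strong speech estimate
D_est = np.ones((K, N))
Y = S_est + D_est
D_prime, is_noise = noise_frame_matrix(Y, S_est, D_est)
print(int(is_noise.sum()), "frames kept as noise")
```

Because the mask in (14) depends only on the frame index, whole columns of `Y` are either kept or zeroed in `D_prime`.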

Next, $B_{\hat{D}}$ is adapted using $D^{'}$ by the following iterative equations of

\begin{matrix} B_{\hat{D}}^{t} = B_{\hat{D}}^{t - 1} \otimes \frac{(D^{'} / B_{\hat{D}}^{t - 1} A_{\hat{D}}^{t - 1}) {(A_{\hat{D}}^{t - 1})}^{T}}{1 {(A_{\hat{D}}^{t - 1})}^{T}}, \\ A_{\hat{D}}^{t} = A_{\hat{D}}^{t - 1} \otimes \frac{{(B_{\hat{D}}^{t})}^{T} (D^{'} / B_{\hat{D}}^{t} A_{\hat{D}}^{t - 1})}{{(B_{\hat{D}}^{t})}^{T} 1}, \end{matrix}

(15)

where $B_{\hat{D}}^{t}$ and $A_{\hat{D}}^{t}$ are the adapted basis and activation matrices of noise at the tth iteration. As an initial condition for (15), all elements of $A_{\hat{D}}^{0}$ can be set as random values between 0 and 1, whereas $B_{\hat{D}}^{0} = B_{\tilde{D}}$. Similar to the NMF training described in (4)–(7), the procedure of (15) is terminated if the condition $| Div (\hat{D}; A_{\hat{D}}^{t}, B_{\hat{D}}^{t}) - Div (\hat{D}; A_{\hat{D}}^{t - 1}, B_{\hat{D}}^{t - 1}) | / Div (\hat{D}; A_{\hat{D}}^{t - 1}, B_{\hat{D}}^{t - 1}) < θ_{Adapt}$ is satisfied. It should be noted that the number of iterations for the noise basis adaptation should be smaller than that of the NMF training to prevent $B_{\hat{D}}$ from representing only the basis of $D^{'}$. For this reason, $θ_{Adapt}$ is set to 0.01, which is 10 times greater than θ used in Section 3.1. Consequently, $B_{\hat{D}} = B_{\hat{D}}^{t}$ and $A_{\hat{D}} = A_{\hat{D}}^{t}$ are obtained when the procedure terminates at the tth iteration.
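The adaptation in (15) differs from training in only two respects: the basis starts from the a priori matrix rather than from random values, and the looser threshold stops the iteration early. A hedged sketch follows, in which the names `adapt_noise_basis` and `kl_div`, the `eps` guard, and the `max_iter` cap are assumptions, and the divergence is evaluated on the noise matrix $D^{'}$ that appears in (15).

```python
import numpy as np

def kl_div(V, W, eps=1e-12):
    """Generalised KL divergence between V and its model W."""
    return float(np.sum(V * np.log((V + eps) / (W + eps)) - V + W))

def adapt_noise_basis(D_prime, B_prior, theta_adapt=1e-2, max_iter=200,
                      eps=1e-12, seed=0):
    """Adapt an a priori noise basis to the detected noise frames, as in (15).

    max_iter and eps are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    B = B_prior.copy()                     # B^0 is the a priori noise basis
    A = rng.random((B.shape[1], D_prime.shape[1])) + eps
    ones = np.ones_like(D_prime)
    prev = kl_div(D_prime, B @ A)
    for t in range(1, max_iter + 1):
        B *= ((D_prime / (B @ A + eps)) @ A.T) / (ones @ A.T + eps)
        A *= (B.T @ (D_prime / (B @ A + eps))) / (B.T @ ones + eps)
        cur = kl_div(D_prime, B @ A)
        if abs(cur - prev) / prev < theta_adapt:   # looser than in training
            break
        prev = cur
    return B, A, t

rng = np.random.default_rng(3)
B_prior = rng.random((16, 4)) + 0.1
D_prime = rng.random((16, 25)) + 0.1
D_prime[:, 10:15] = 0.0                    # frames masked out as speech
B_hat, A_hat, n_used = adapt_noise_basis(D_prime, B_prior)
print(n_used)
```

Stopping early keeps `B_hat` close to the a priori basis, which is the stated reason for choosing a threshold ten times larger than in training.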

As a final processing step for the adaptation, NMF decomposition is performed in order to calculate $A_{S}$ and $A_{D}$ . To this end, a multiplicative update rule with $B_{\tilde{S}}$ and $B_{\hat{D}}$ is applied, where $B_{\tilde{S}}$ is the basis matrix obtained in Section 3.1. That is, the NMF decomposition also iterates the following equation

\begin{array}{l} \begin{bmatrix} A_{S}^{t} \\ A_{D}^{t} \end{bmatrix} \\ = \begin{bmatrix} A_{S}^{t - 1} \\ A_{D}^{t - 1} \end{bmatrix} \otimes \frac{{[B_{\tilde{S}} B_{\hat{D}}]}^{T} (Y / [B_{\tilde{S}} B_{\hat{D}}] {[A_{S}^{t - 1} A_{D}^{t - 1}]}^{T})}{{[B_{\tilde{S}} B_{\hat{D}}]}^{T} 1}, \end{array}

(16)

where all elements in $A_{S}^{0}$ and $A_{D}^{0}$ are set as random values between 0 and 1. The termination condition is also defined using the divergence $Div (Y; [A_{S}^{t - 1} A_{D}^{t - 1}]^{T}, [B_{\tilde{S}} B_{\hat{D}}])$. Thus, $A_{S}$ and $A_{D}$ are set to $A_{S}^{t}$ and $A_{D}^{t}$, respectively, when the procedure of (16) is terminated at the tth iteration. Finally, $\hat{S}$ and $\hat{D}$ are obtained by $\hat{S} = B_{\tilde{S}} A_{S}$ and $\hat{D} = B_{\hat{D}} A_{D}$, respectively, and they are used for noise reduction, which will be explained in the next subsection.

3.3. Noise Reduction

This subsection describes how to reduce noise from noisy input speech using the adapted noise basis of NMF, which is the third processing stage of Figure 2. First, a $(K \times N)$ noise attenuation gain, G, is calculated as

\begin{matrix} G = \frac{\hat{S}}{\hat{S} + Δ \otimes \hat{D}}, \end{matrix}

(17)

where $Δ$ is a $(K \times N)$ noise reduction control matrix with all elements equal to a constant, δ, in order to control the degree of noise reduction by scaling $\hat{D}$. In this paper, δ is set to 3 since this value of δ provides the best noise reduction performance. Next, each column of G is applied as a transfer function of the Wiener filter to each ith frame of the noisy input speech, $y_{i} (n)$, resulting in an estimation of clean speech $s_{i}^{'} (n)$ [8].
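The gain in (17) is an element-wise ratio, so a sketch is short. The function names, the `eps` guard, and the choice to apply the gain to the complex STFT of the noisy input (keeping the noisy phase) are assumptions. The text only states that each column of G acts as the transfer function for the corresponding frame.

```python
import numpy as np

def wiener_gain(S_hat, D_hat, delta=3.0, eps=1e-12):
    """Noise attenuation gain G of (17); delta scales the noise estimate."""
    return S_hat / (S_hat + delta * D_hat + eps)

def apply_gain(Y_complex, G):
    """Scale each time-frequency bin of the noisy STFT by the gain.

    Applying G to the complex spectrum is an assumption of this sketch.
    """
    return G * Y_complex

rng = np.random.default_rng(0)
S_hat = rng.random((5, 4)) + 0.1
D_hat = rng.random((5, 4)) + 0.1
Y_c = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
G = wiener_gain(S_hat, D_hat)
S_clean = apply_gain(Y_c, G)
print(G.min() >= 0.0, G.max() <= 1.0)
```

With δ = 1 the expression reduces to the usual ratio of speech to speech plus noise. Values above 1 weight the noise estimate more heavily and therefore attenuate more aggressively.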

4. Performance Evaluation

The performance of the proposed method was first evaluated by measuring the source to distortion ratio (SDR), source to interferences ratio (SIR), and source to artifacts ratio (SAR) [26]. Next, the average word error rate (WER) of an ASR system employing the proposed method was measured. Finally, the performance of the proposed method was compared with those of the two-stage mel-warped Wiener filter method (Mel-WF) [8] and the NMF-based noise reduction method without noise basis adaptation (NMF-Conv) [14].

For the evaluation, 10 males and 10 females spoke 20 sentences each, resulting in 400 sentences. This recording was performed in a quiet room without any reverberation. Next, each sentence was mixed with four different kinds of background noise recorded at bus stops, restaurants, subways, and a living room with a TV on, where the signal-to-noise ratio (SNR) was varied from 0 to 20 dB in steps of 5 dB. The bus stop, restaurant, and subway noises were used to simulate highly stationary noise environments, while the living room noise was used to simulate a highly nonstationary noise environment in which a person was speaking while watching different genres of TV programs such as drama, news, sports, and movies. It should be noted that the restaurant and living room noise signals were recorded in a nonreverberant room. The speech and noise signals used in the evaluation were sampled at 16 kHz with a 16-bit resolution. The a priori basis matrices for the evaluation were prepared as follows. First, an a priori basis matrix for speech, $B_{\tilde{S}}$, was trained for each individual speaker with a 20-second long clean sentence. Next, an a priori basis matrix for noise, $B_{\tilde{D}}$, was trained with 60 seconds of cafeteria noise, where the cafeteria noise was different from the other four types of background noise used for the performance evaluation.
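Mixing a clean sentence with noise at a prescribed SNR can be sketched as follows. How the SNR was measured in the evaluation is not stated, so defining it from the average power over the whole utterance is an assumption, as is the function name `mix_at_snr`.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB.

    The SNR is defined here from whole-utterance average power (an assumption).
    """
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a clean sentence
noise = 0.3 * rng.standard_normal(20000)   # stand-in for a noise recording
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (0, 5, 10, 15, 20)}
print(sorted(mixtures))
```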

4.1. Noise Reduction Performance

In this subsection, the noise reduction performance of the proposed method was evaluated under both nonstationary and stationary noise conditions by measuring the SDR, SIR, and SAR. As shown in (1), a noisy speech signal was composed of clean speech and noise as $y (n) = s (n) + d (n)$, and the estimate of $s (n)$, denoted $s^{'} (n)$, was obtained by using the proposed method described in Section 3. Then, the true clean signal and its estimate were related by $s^{'} (n) = s (n) + e_{interf} (n) + e_{noise} (n) + e_{artif} (n)$, where $e_{interf} (n)$, $e_{noise} (n)$, and $e_{artif} (n)$ were the errors associated with the interference, noise, and artifacts, respectively, and they were obtained through least-square projection [26]. By using those errors, SDR, SIR, and SAR were defined as [26]

\begin{matrix} SDR = 10 \log_{10} \frac{{∥ s (n) ∥}^{2}}{{∥ e_{interf} (n) + e_{noise} (n) + e_{artif} (n) ∥}^{2}}, \\ SIR = 10 \log_{10} \frac{{∥ s (n) ∥}^{2}}{{∥ e_{interf} (n) ∥}^{2}}, \\ SAR = 10 \log_{10} \frac{{∥ s (n) + e_{interf} (n) + e_{noise} (n) ∥}^{2}}{{∥ e_{artif} (n) ∥}^{2}}, \end{matrix}

(18)

where $∥ \cdot ∥$ is the norm operator.
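Given the error components, the three ratios in (18) reduce to a few lines. The sketch below takes the components as inputs and does not perform the least-square projection of [26] that produces them. The function name `bss_ratios` and the toy signals are assumptions.

```python
import numpy as np

def bss_ratios(s, e_interf, e_noise, e_artif):
    """Return (SDR, SIR, SAR) in dB following (18)."""
    def db(num, den):
        # 10 log10 of the ratio of squared norms.
        return 10.0 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))
    sdr = db(s, e_interf + e_noise + e_artif)
    sir = db(s, e_interf)
    sar = db(s + e_interf + e_noise, e_artif)
    return sdr, sir, sar

# Toy components chosen only so the ratios are easy to check by hand.
s = np.ones(100)
e_interf = 0.1 * np.ones(100)
e_noise = np.zeros(100)
e_artif = 0.1 * np.tile([1.0, -1.0], 50)
sdr, sir, sar = bss_ratios(s, e_interf, e_noise, e_artif)
print(round(sdr, 2), round(sir, 2), round(sar, 2))
```

In this toy case the interference energy is one hundredth of the target energy, so the SIR comes out at 20 dB.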

First, Table 1 compares the SDRs, SIRs, and SARs of the proposed method and those of the conventional methods under a nonstationary noise condition such as the living room condition. As shown in the table, the proposed method significantly increased the average SDR, SIR, and SAR values, compared to both the Mel-WF and the NMF-Conv. In particular, the proposed method achieved a dramatically higher average SIR than Mel-WF and NMF-Conv, by 15.01 dB and 8.60 dB, respectively, under the living room noise condition. This implies that the proposed method could provide a speech signal with significantly lower interference than the conventional methods under the nonstationary noise condition.

Table 1

Comparison of the average SDRs, SIRs, and SARs (in dB) of the proposed method and the conventional methods under a living room noise condition.

SNR (dB)	Living Room
	Mel-WF			NMF-Conv			Proposed
	SDR	SIR	SAR	SDR	SIR	SAR	SDR	SIR	SAR
20	19.76	23.30	23.48	21.87	26.74	23.78	24.50	32.22	26.18
15	16.14	18.43	21.53	19.66	23.96	21.94	22.91	30.83	25.13
10	11.99	13.41	19.25	16.29	19.76	19.27	20.67	28.80	21.80
5	7.32	8.22	16.67	13.04	16.13	16.54	17.46	26.29	18.83
0	2.24	2.86	14.12	9.04	11.63	13.39	14.22	23.10	15.15

Average	11.49	13.24	19.01	15.98	19.64	18.98	19.83	28.25	21.15

The performance evaluation was then repeated under three different stationary noise conditions such as bus stop, restaurant, and subway noises. Table 2 shows the SDRs, SIRs, and SARs of the noise-reduced signals processed by the proposed and conventional methods under the stationary noise conditions. Similar to the results under the living room noise condition, the proposed method achieved a substantially higher average of SDR, SIR, and SAR than either the Mel-WF or the NMF-Conv under all stationary noise conditions. It could be concluded that the NMF method employing the proposed noise basis adaptation method performed noise reduction more effectively than the conventional methods under both the stationary and nonstationary noise conditions.

Table 2

Comparison of the average SDRs, SIRs, and SARs (in dB) between the proposed method and the conventional methods under stationary noise conditions such as (a) bus stop, (b) restaurant, and (c) subway noise condition.

SNR (dB)	Mel-WF			NMF-Conv			Proposed
	SDR	SIR	SAR	SDR	SIR	SAR	SDR	SIR	SAR
	(a) Bus Stop
20	18.20	24.34	19.56	19.01	22.41	22.81	20.36	25.36	25.52
15	15.04	19.80	17.04	15.85	17.85	20.36	16.41	21.93	20.77
10	11.54	15.01	14.54	11.46	12.76	17.69	13.86	18.11	19.10
5	7.52	9.84	12.14	6.78	7.81	14.39	9.90	13.77	16.36
0	2.74	4.14	10.18	1.66	2.55	11.09	5.56	9.15	11.18

Average	11.01	14.63	14.69	10.95	12.68	17.27	13.66	17.66	19.84

	(b) Restaurant
20	18.77	24.67	20.28	19.55	23.08	22.23	20.82	27.42	24.27
15	15.52	20.05	17.74	16.40	18.88	20.24	17.87	24.14	20.89
10	11.89	15.06	15.24	12.72	14.50	17.79	14.61	20.39	18.67
5	7.56	9.59	12.67	8.11	9.54	14.38	11.38	16.12	16.28
0	2.27	3.56	10.09	3.32	4.49	11.22	6.56	11.45	11.50

Average	11.21	14.59	15.21	12.02	14.10	17.17	14.62	19.90	19.22

	(c) Subway
20	18.55	24.20	20.06	19.81	22.47	23.42	20.41	25.73	25.25
15	15.24	19.65	17.39	15.96	17.98	20.51	17.33	22.88	22.01
10	11.64	14.91	14.76	11.92	13.37	17.80	14.97	19.35	19.86
5	7.61	9.86	12.27	7.26	8.40	14.44	10.78	15.41	16.36
0	2.94	4.37	10.23	2.18	3.08	11.50	7.05	11.22	12.34

Average	11.20	14.60	14.94	11.42	13.06	17.53	14.49	18.92	19.28

Next, the spectrograms obtained by the proposed method were compared with those by the conventional methods. Figure 3 shows the spectrograms of the noisy speech signals at 5 dB SNR and of the noise-reduced speech signals obtained by the different noise reduction methods under four different background noise conditions. It was shown by pairwise comparison between Figures 3(e)–3(h) and Figures 3(m)–3(p) that the noise reduction performance of the proposed method was comparable to that of the Mel-WF under stationary noise conditions including bus stop and restaurant noise. On the other hand, the proposed method successfully reduced nonstationary noise under the living room noise condition, whereas the Mel-WF failed to handle the nonstationary noise. Furthermore, it was demonstrated by comparing Figures 3(i)–3(l) and Figures 3(m)–3(p) that the proposed method provided more distinctive speech signals than the NMF-Conv under all the noise conditions.

Figure 3

Spectrograms of signals: (a)–(d) noisy speech signals at 5 dB SNR, (e)–(h) noise-reduced speech signals from Mel-WF, (i)–(l) those of NMF-Conv, and (m)–(p) those by the proposed method under four different noise conditions.

4.2. ASR Performance

To evaluate the recognition performance of the proposed noise reduction method in an ASR system, a hidden Markov model (HMM)-based speech recognition system was constructed. To this end, acoustic models based on three-state left-to-right HMMs were first built from 170,000 phonetically balanced words, which were recorded in quiet rooms by 1,800 speakers. Every recorded speech signal was also sampled at 16 kHz at a 16-bit resolution. As a speech recognition feature, 12 mel-frequency cepstral coefficients (MFCCs) with logarithmic energy were extracted and their delta and acceleration coefficients were concatenated, resulting in a 39-dimensional feature vector [27].

Table 3 compares the average WERs of an ASR system employing the proposed method as a front-end with those of ASR systems employing the conventional methods under the nonstationary noise condition. As shown in the table, the proposed method achieved a lower average WER than the conventional methods. Specifically, the proposed method reduced the average WER by a relative 65.22% and 24.21% compared to the Mel-WF and the NMF-Conv, respectively.

Table 3

Comparison of average word error rates (WERs) (%) of an ASR system employing the proposed method and the conventional methods under a living room noise condition.

SNR (dB)	Living Room
	Noisy	Mel-WF	NMF-Conv	Proposed
20	24.21	26.19	12.04	10.98
15	33.33	34.54	14.81	11.90
10	49.87	51.07	21.03	17.33
5	74.34	70.76	29.26	21.83
0	94.18	93.21	49.40	33.86

Average	55.19	55.15	25.31	19.18

Second, Table 4 compares the average WERs of an ASR system employing the proposed method as a front-end with those of ASR systems employing the conventional methods under stationary noise conditions. As shown in the table, under the bus stop, restaurant, and subway noise conditions, the proposed method reduced the average WER by a relative 0.93%, 11.34%, and 6.56% compared to the Mel-WF and by a relative 12.13%, 13.10%, and 11.50% compared to the NMF-Conv, respectively. Consequently, it was concluded that the proposed method provided a better ASR performance than the conventional methods under the stationary and nonstationary noise conditions.

Table 4

Comparison of average word error rates (WERs) (%) of an ASR system employing the proposed method and the conventional methods under stationary noise conditions such as (a) bus stop, (b) restaurant, and (c) subway noise condition.

SNR (dB)	(a) Bus Stop				(b) Restaurant				(c) Subway
	Noisy	Mel-WF	NMF-Conv	Proposed	Noisy	Mel-WF	NMF-Conv	Proposed	Noisy	Mel-WF	NMF-Conv	Proposed
20	16.67	10.19	13.62	10.85	17.99	12.30	12.30	12.30	14.81	11.64	13.89	11.77
15	25.66	14.02	16.53	15.34	34.79	17.72	16.27	13.76	20.77	12.43	14.95	13.36
10	60.32	23.02	24.34	21.96	64.02	26.32	25.66	23.54	51.98	19.58	21.16	21.56
5	90.87	46.03	58.86	45.90	94.18	56.35	57.40	44.05	87.83	46.96	47.49	41.53
0	99.74	90.20	93.52	87.70	99.87	90.08	95.24	86.11	99.21	84.59	87.70	75.66

Average	58.65	36.69	41.37	36.35	62.17	40.55	41.37	35.95	54.92	35.08	37.04	32.78
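
The six relative reductions quoted for Table 4 follow from the "Average" row in the same way. The sketch below assumes the relative reduction is (baseline − proposed) / baseline and starts from the rounded averages as printed in the table; the dictionary layout and function name are illustrative only.

```python
# Average WERs (%) copied from the "Average" row of Table 4.
reported_avg = {
    "bus stop":   {"Mel-WF": 36.69, "NMF-Conv": 41.37, "Proposed": 36.35},
    "restaurant": {"Mel-WF": 40.55, "NMF-Conv": 41.37, "Proposed": 35.95},
    "subway":     {"Mel-WF": 35.08, "NMF-Conv": 37.04, "Proposed": 32.78},
}


def relative_reduction(baseline, proposed):
    """Relative WER reduction (%), assumed to be 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline


# reductions[baseline_method][noise_condition] -> relative reduction in percent
reductions = {
    baseline: {
        noise: relative_reduction(row[baseline], row["Proposed"])
        for noise, row in reported_avg.items()
    }
    for baseline in ("Mel-WF", "NMF-Conv")
}

for baseline, per_noise in reductions.items():
    for noise, value in per_noise.items():
        print(f"Proposed vs. {baseline} ({noise}): {value:.2f}%")
```

The gain over the Mel-WF under bus stop noise is small (below 1% relative), whereas the gain over the NMF-Conv is above 11% relative in all three conditions.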

5. Conclusion

In this paper, an NMF-based noise sensing method has been proposed to reduce stationary and nonstationary noise for speech-based applications over wireless sensor networks. The proposed method adapted the initially estimated noise basis matrix on the fly as the noisy input spectrum was applied to the front-end of a speech-based application. A Wiener filter was then constructed from the clean speech and noise spectra estimated within the NMF framework, and the resulting clean speech estimate was used for speech recognition. The performance of the proposed method was evaluated by measuring the SDR, SIR, and SAR. In addition, the proposed method was applied to an ASR system, and the average WER of the ASR system was evaluated. The performance of the proposed method was also compared with that of conventional methods, namely the two-stage mel-warped Wiener filter method and the NMF-based noise reduction method without noise basis adaptation. The results showed that the proposed method provided better performance in terms of the SDR, SIR, SAR, and WER than the conventional methods under both nonstationary and stationary noise conditions.
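
The Wiener filtering step summarized above can be illustrated for a single frame. This is only a sketch of the textbook per-bin gain S / (S + N), not the authors' implementation: the speech and noise estimates are hypothetical numbers standing in for the NMF reconstructions, and whether the paper uses magnitude or power spectra is not assumed here.

```python
def wiener_gain(speech_est, noise_est, floor=1e-12):
    """Per-bin Wiener-style gain S / (S + N).

    `speech_est` and `noise_est` are nonnegative spectral estimates for one
    frame (in an NMF setting, the speech and noise reconstructions).
    `floor` guards against division by zero when both estimates are zero.
    """
    return [s / max(s + n, floor) for s, n in zip(speech_est, noise_est)]


def apply_gain(noisy_spectrum, gain):
    """Scale each bin of the noisy spectrum by its gain."""
    return [g * y for g, y in zip(gain, noisy_spectrum)]


# Hypothetical values for one frame with four frequency bins.
speech_est = [4.0, 1.0, 0.0, 3.0]
noise_est = [1.0, 1.0, 2.0, 0.0]
noisy = [5.0, 2.0, 2.0, 3.0]

gain = wiener_gain(speech_est, noise_est)
enhanced = apply_gain(noisy, gain)

print("gain:    ", gain)
print("enhanced:", enhanced)
```

Bins where the noise estimate dominates are attenuated toward zero, while bins where the speech estimate dominates pass almost unchanged, which is why the quality of the adapted noise basis directly affects the enhanced spectrum.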

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the IT R&D Program of MSIP/KEIT [10035252, development of dialog-based spontaneous speech interface technology on mobile platform] and the National Research Foundation of Korea (NRF) Grant funded by the Government of Korea (MSIP) (no. 2012-010636).
